Advanced Mode
Dry-run & A/B testing
Two Advanced Mode tools for iterating on your prompt without paying the cost (latency + tokens) of a full deploy → environment → session → run cycle.
The dry-run ("Test the agent")
A collapsible section at the bottom of the right-hand panel. You type a question, we call Anthropic's Messages API directly with the resolved system prompt — no tools, no agent created. The response arrives in 2-3 seconds.
What it's for
- Checking that the prompt's tone and intent are correct
- Seeing how the model interprets an ambiguous instruction
- Spotting a forgotten
{{x}}variable (it shows up as-is in the response) - Testing an alternative wording in 30 seconds
Without tools, it isn't a full testThe dry-run doesn't run
bash, web_fetch, or the other tools. If your agent depends on them to do anything at all, the dry-run only shows what the model would say without acting. For an end-to-end test, deploy and launch a real task.V1 limitations
- Single-turn — no history between calls
- 1024 tokens max in output (adjustable later)
- No streaming — the response arrives all at once
A/B testing
Even more powerful: comparing two versions of the system prompt on the same question, side by side. Lets you decide which wording works better before baking it into the agent.
How it works
- The "Prompt A/B testing" section, at the bottom of the right-hand panel, just under the dry-run
- Variant B = a textarea where you write the alternative version of the system prompt
- Shared question = the same message sent to both variants
- We run in parallel two
/v1/messagescalls, one with A (the current prompt), one with B - Both responses show up in a 2-column grid
- If you like B, an Adopt variant B button overwrites your current system prompt
Empty variant B = comparing A to AIf you leave field B empty, we compare A to itself. Useful for observing the model's variance between two runs on the same question. Spoiler: it's rarely identical.
Typical use cases
- Torn between "Be concise" vs "Be brief, max 3 sentences" — which one holds the format better?
- Testing a firmer vs a warmer tone for a support agent
- Comparing two breakdowns of a complex task into sub-steps
- Before touching a prompt in production, validating that the change doesn't degrade the response on your typical cases
Costs
Both tools are free — the platform funds them. A dry-run = 1 call. An A/B = 2 calls in parallel. There is nothing to pay and nothing to set up.