Advanced Mode

Dry-run & A/B testing

Two Advanced Mode tools for iterating on your prompt without paying the cost (latency + tokens) of a full deploy → environment → session → run cycle.

The dry-run ("Test the agent")

A collapsible section at the bottom of the right-hand panel. You type a question, we call Anthropic's Messages API directly with the resolved system prompt — no tools, no agent created. The response arrives in 2-3 seconds.

What it's for

Checking that the prompt's tone and intent are correct
Seeing how the model interprets an ambiguous instruction
Spotting a forgotten {{x}} variable (it shows up as-is in the response)
Testing an alternative wording in 30 seconds

Without tools, it isn't a full testThe dry-run doesn't run bash, web_fetch, or the other tools. If your agent depends on them to do anything at all, the dry-run only shows what the model would say without acting. For an end-to-end test, deploy and launch a real task.

V1 limitations

Single-turn — no history between calls
1024 tokens max in output (adjustable later)
No streaming — the response arrives all at once

A/B testing

Even more powerful: comparing two versions of the system prompt on the same question, side by side. Lets you decide which wording works better before baking it into the agent.

How it works

The "Prompt A/B testing" section, at the bottom of the right-hand panel, just under the dry-run
Variant B = a textarea where you write the alternative version of the system prompt
Shared question = the same message sent to both variants
We run in parallel two /v1/messages calls, one with A (the current prompt), one with B
Both responses show up in a 2-column grid
If you like B, an Adopt variant B button overwrites your current system prompt

Empty variant B = comparing A to AIf you leave field B empty, we compare A to itself. Useful for observing the model's variance between two runs on the same question. Spoiler: it's rarely identical.

Typical use cases

Torn between "Be concise" vs "Be brief, max 3 sentences" — which one holds the format better?
Testing a firmer vs a warmer tone for a support agent
Comparing two breakdowns of a complex task into sub-steps
Before touching a prompt in production, validating that the change doesn't degrade the response on your typical cases

Costs

Both tools are free — the platform funds them. A dry-run = 1 call. An A/B = 2 calls in parallel. There is nothing to pay and nothing to set up.