Advanced Mode

Dry-run & A/B testing

Two Advanced Mode tools for iterating on your prompt without paying the cost (latency + tokens) of a full deploy → environment → session → run cycle.

The dry-run ("Test the agent")

A collapsible section at the bottom of the right-hand panel. You type a question, we call Anthropic's Messages API directly with the resolved system prompt — no tools, no agent created. The response arrives in 2-3 seconds.

What it's for

  • Checking that the prompt's tone and intent are correct
  • Seeing how the model interprets an ambiguous instruction
  • Spotting a forgotten {{x}} variable (it shows up as-is in the response)
  • Testing an alternative wording in 30 seconds
Without tools, it isn't a full testThe dry-run doesn't run bash, web_fetch, or the other tools. If your agent depends on them to do anything at all, the dry-run only shows what the model would say without acting. For an end-to-end test, deploy and launch a real task.

V1 limitations

  • Single-turn — no history between calls
  • 1024 tokens max in output (adjustable later)
  • No streaming — the response arrives all at once

A/B testing

Even more powerful: comparing two versions of the system prompt on the same question, side by side. Lets you decide which wording works better before baking it into the agent.

How it works

  1. The "Prompt A/B testing" section, at the bottom of the right-hand panel, just under the dry-run
  2. Variant B = a textarea where you write the alternative version of the system prompt
  3. Shared question = the same message sent to both variants
  4. We run in parallel two /v1/messages calls, one with A (the current prompt), one with B
  5. Both responses show up in a 2-column grid
  6. If you like B, an Adopt variant B button overwrites your current system prompt
Empty variant B = comparing A to AIf you leave field B empty, we compare A to itself. Useful for observing the model's variance between two runs on the same question. Spoiler: it's rarely identical.

Typical use cases

  • Torn between "Be concise" vs "Be brief, max 3 sentences" — which one holds the format better?
  • Testing a firmer vs a warmer tone for a support agent
  • Comparing two breakdowns of a complex task into sub-steps
  • Before touching a prompt in production, validating that the change doesn't degrade the response on your typical cases

Costs

Both tools are free — the platform funds them. A dry-run = 1 call. An A/B = 2 calls in parallel. There is nothing to pay and nothing to set up.