Introducing Analysis Plans

A framework for verifiable analysis of AI behavior

The Docent Team*

^* Correspondence to: selena@transluce.org

Transluce | Published: June 17, 2026

Developing an AI agent is a complex data analysis problem. To know if the agent is working correctly, we need to track not just benchmark scores but the details of its behavior: how do the strategies change over the course of training? Why does the new scaffold perform worse? Is there reward hacking? Answering these questions requires a combination of quantitative and qualitative analysis tailored to the dataset at hand.

Coding agents have the potential to accelerate this work, but they’re prone to subtle mistakes: they might parse data incorrectly, make unjustified assumptions, or cherry-pick examples and present a misleading narrative. These mistakes aren’t apparent in the final output. To trust a conclusion, we need to verify exactly how it was produced. But reviewing all the actions taken by a coding agent is tedious—crucial methodology decisions get buried in hundreds of lines of logs.

Without analysis plans, it's difficult to understand and verify the behavior of a coding agent.

This problem motivated us to develop analysis plans, a framework for verifiable analysis of AI behavior. Analysis plans are specified in a Python API that any coding agent can work with. When they’re ready to run, they appear in a web interface that lets humans understand and verify every step that was taken.

With analysis plans, it's easy to understand and verify how a coding agent is reaching its conclusions.

Analysis plans contain two types of steps:

Query steps filter, group, and join your data using DQL (Docent’s subset of SQL). Each step is displayed with its query and an interactive table of the results.

Reading steps use an LLM to analyze data from a query step, producing a text summary and/or a structured judgment. Claims made by the LLM come with citations to specific items in its context.

These two step types can be customized and combined to build complex analysis pipelines. Readings can accept any data that a query produces, and queries can run over any reading results. At each step, results are traced to the exact computation that produced them, enabling you to inspect, audit, and refine the flow.

Let’s see what this looks like by detecting cheating on a popular software engineering benchmark.

Demo: identifying suspicious behaviors in SWE-rebench

Cheating is a common thorn when interpreting evaluation results: models famously hard-code tests, falsely claim success, and exploit unclean environments to copy solutions. Measuring rates of cheating is essential for understanding how much of a benchmark score represents a valid demonstration of model capability. In about 15 minutes, we used Docent to discover instances of cheating on SWE-rebench, a software engineering benchmark that measures how many recent GitHub PRs an agent can resolve. You can view the SWE-rebench traces in Docent at this link.

1. Explain to your coding agent what you want to learn

We start by prompting Claude Code to score agent runs for potential cheating. Thanks to the Docent skill, Claude knows how to turn our question into an analysis plan. It writes a short Python script like the following.

from docent import Docent
client = Docent()

runs = client.query(
  COLLECTION_ID,
  "SELECT agent_runs.id AS run FROM agent_runs "
  "WHERE CAST(metadata_json->'scores'->>'resolved' AS DOUBLE PRECISION) = 1.0 "
  "ORDER BY agent_runs.id LIMIT 200",
  name=f"Sample 200 resolved runs by UUID",
)

DETECTOR_PROMPT = "..." # omitted for brevity
OUTPUT_SCHEMA = { ... } # omitted for brevity

detect = client.read(
  prompt_template=[runs.run.as_type("agent_run"), DETECTOR_PROMPT],
  model="openai/gpt-5.5",
  reasoning_effort="medium",
  output_schema=OUTPUT_SCHEMA,
  name="Flag cheating from trajectory",
)

When Claude runs this script, the Docent SDK doesn’t execute readings immediately. Instead, it builds up a graph of all the readings that are being requested (in this case, just one) and uploads it to Docent as an analysis plan. This lets us review the proposed analysis before waiting for LLM calls to complete.

2. Anatomy of an analysis plan

This analysis plan starts with a DQL query to select successful runs by filtering the agent run metadata to resolved=1. After that, it passes the results to a reading step, which scores runs for cheating.

A reading step has the following components:

A prompt template with parameters. To detect cheating, we create a rubric that defines suspiciousness, taking one agent run as a parameter. The Docent UI shows the full prompt template, with parameters represented inline. Clicking on the "Context: run" chip below exposes additional detail on what run-level metadata is passed to the LLM. By default, no metadata is passed.
Arguments from a previous query step. The previous query step in our plan produced a table of runs where resolved=1. For each row in this table, Docent will substitute the full text of the agent run into the prompt template and call an LLM with it.
Custom output schema. This reading outputs an integer score from 1 to 10 that represents suspiciousness, and a string explanation containing citations. Docent ensures LLM output conforms to this schema so that later steps won’t run into data formatting issues.

Readings are designed to be expressive. Common use cases include judging runs for a specific behavior, clustering results of previous readings to extract high-level trends, summarizing long agent runs, or comparing two rollouts of the same task. Once you have a reading that works well, you can save it as a preset so agents can reuse it later.

The purple highlight indicates that this reading is still waiting for our approval, so let’s approve it. (The Docent SDK also provides a way to approve steps programmatically; you can tell your coding agent to “auto accept analysis plan” and skip the manual review.)

3. Review the reading result

Reading results table showing the cheating score and reasoning, with citations that link to parts of the transcript.

Reading results are designed for ease of verification. As a reading executes, its results appear in a table. Clicking on each row expands the result in a new pane with the full output and metadata like the reading model, reasoning effort, and tokens used.

Expanded reading result showing model reasoning and metadata

Claims in the reading output include citations to specific parts of the transcript. Click on the blue quote icons to jump to relevant parts of the transcript and double-check the claims.

Citation linking reading output to a highlighted part of the transcript

4. Quantifying the results

Count of runs that received each suspiciousness score — Count of runs that received each score, with the query used to generate the counts shown above.

To get an aggregate picture, we can ask our coding agent to write a DQL query for counting the number of agent runs that received each suspiciousness score.

We can verify the DQL query at a glance by clicking on the Query toggle above the final results. Reading over the precise DQL lets you check for errors like incorrect grouping, incorrect filtering, or a mean calculated using the wrong method.

5. Verifying suspicious examples

Reward hacking judges are famously prone to false positives. We can pass the results to a new reading and ask a follow-up question.

Follow-up prompt to verify suspicious cheating examples

Our agent uses DQL to extract the four suspicious runs and creates a reading to verify each of them. The verifier reading confirms that most of the cheating examples are actually suspicious, but at least one is a false positive.

One suspicious run appears to contain cheating by forward-looking: the agent examines the git history and restores code from a different commit. But the verifier correctly identifies this as a false positive: a past commit introduced a regression, so restoring the original state of the codebase is a valid patch.

More case studies

Analysis plans support a range of applications through one unified framework. Here’s how we used them for comparing models, identifying failure modes, and discovering unexpected behaviors.

▶

Identifying common failure modes on SWE-Bench Pro

▶

Comparing model performance on Terminal-Bench

Conclusion

Analysis plans are now available to all Docent users. To try them out, install the Docent SDK and agent skill. Run an analysis on our sample data, or ingest your own.

Questions or feedback? We'd love to hear from you. Chat with the team in Docent Community Slack.