Introducing Docent

A system for analyzing and intervening on agent behavior

Kevin Meng*,Vincent Huang,Jacob Steinhardt,Sarah Schwettmann*

^* Correspondence to: [email protected], [email protected]

Transluce | Published: March 24, 2025

We built a system, Docent, that accelerates analysis of AI agent transcripts. It helps users automatically identify corrupted tasks, fix scaffolding issues, uncover unexpected behaviors, and understand an agent's weaknesses—in only minutes.

Explore a technical preview of Docent here, check out our Quickstart Guide to use Docent with your own data, and join our user community Slack to ask questions and receive early updates.

Docent identifies issues in an agent's environment, like missing packages the agent invoked. On the InterCode [1] benchmark, simply adding those packages increased GPT-4o's solve rate from 68.6% to 78%.

Docent surfaces surprising behaviors. After repeatedly failing to solve a task, GPT-4o generates nonsense text about "chestnut facts" and tries to "re-examine for latent conceptual derivation."

Introduction

The goal of an evaluation is to suggest general conclusions about an AI agent's behavior. Most evaluations produce a small set of numbers (e.g. accuracies) that discard important information in the transcripts: agents may fail to solve tasks for unexpected reasons, solve tasks in unintended ways, or exhibit behaviors we didn't think to measure. Users of evaluations often care not just about what one individual agent can do, but what nearby agents (e.g. with slightly better scaffolding or guidance) would be capable of doing. A comprehensive analysis should explain why an agent succeeded or failed, how far from goal the agent was, and what range of competencies the agent exhibited.

The information we want is very difficult to obtain. A benchmark with 100 tasks, each run 10 times, produces millions of output tokens. Imagine spending hours reading through it all, in hopes of finding a few broken task environments that account for failures—you might think it's not worth it. Furthermore, drawing robust conclusions requires validation of hypotheses through counterfactual experimentation, for which there is little existing infrastructure.

Docent automates many labor-intensive parts of transcript analysis. It uses language models to summarize, cluster, and search over agent transcripts. It automatically surfaces interesting observations about transcripts, answers questions, and helps users form hypotheses. It also provides infrastructure for counterfactual experimentation to test those hypotheses; users can modify and resume agents from past checkpoints. In the longer term, we anticipate that this experimental loop will be executed by AI systems themselves, with humans providing guidance.

Using Docent, we provide in-depth analyses of several common agent benchmarks. Beyond the findings demonstrated above, Docent also uncovers indications of task vulnerability on an Inspect port of Cybench [2], and surprising behaviors in other agent benchmarks.

Today, we're releasing a live technical preview of Docent: a public web endpoint where you can explore agent transcripts for a set of preloaded benchmarks. We are also soliciting early users to receive white-glove support for running Docent on their own transcripts, ahead of a general release in the coming weeks. If you have a use case that might benefit from Docent, check out our Quickstart Guide to use Docent with your own data and join our user community Slack to ask questions and receive early updates. In this technical report, we'll walk through our findings and design decisions when building Docent, and examples of how we've used it so far.

Gaps in understanding

▶

The current evaluation paradigm is insufficient for understanding complex agent behaviors. Top-level numbers don’t communicate why an agent succeeded or failed, and broken setups often lead to misinterpretation of results.Click to explore

Most benchmarks report an error rate for each model, which is tempting to interpret as representing the fraction of tasks that were too difficult for the model to solve. However, this is not generally the case: nearly all benchmarks contain noise due to mislabeled, ambiguous, or ill-posed questions. For example, Vendrow et al. recently found 120 such poorly written questions in the popular GSM-8k benchmark that obscured differences in performance among frontier models. This has led to a call for verified "platinum" benchmarks that remove such noise [3].

Addressing errors quickly becomes untenable for agentic evaluations that require complex scaffolds. For example, when using the SWE-bench coding benchmark [4], OpenAI researchers discovered that incorrect unit tests and broken environments caused many correct solutions to be rejected [5]. They paid software engineers to screen out ambiguous or unsolvable problems, resulting in SWE-bench Verified. While this new benchmark more accurately reflects model capabilities, creating it took months.

Beyond reliability issues, aggregate success rates don't provide color on why a model succeeded or failed at a task. The more open-ended a task is, the more room there is for important information to be lost in a single success/fail metric [6]. While studying common agent benchmarks, we identified several important sources of information that do not show up in the overall success rate:

▶

Unsolved tasks can be quite close to being solved, for instance from simple improvements to prompting or environment tools. Click for example

▶

"Solved" tasks can be solved for the wrong reasons, for instance if an agent has memorized part of a solution or finds a shortcut. Click for example

▶

Models can exhibit interesting but unexpected behaviors that are difficult to specify ahead of time (unknown unknowns). Click for example

Without proper infrastructure, even meticulous researchers have trouble gaining insight. For instance, checking for nearly-solved tasks requires re-running commands, waiting for results, then struggling to identify what changed between runs. Identifying tasks solved for the wrong reasons requires detailed manual review of hundreds of megabytes of data for incorrect task shortcuts. And discovering unknown unknowns requires reviewing the same data without a clear goal in mind, which is even harder.

The key point is that interfaces for interacting with transcripts are bottlenecking even simple human exploration of agent behavior. Looking at the data is sufficiently difficult that surprising model behaviors and corrupted tasks easily go unnoticed. In the next sections, we’ll describe how Docent solves this problem.

Docent workflows

▶

Docent enables 4 key workflows: summarization, search, clustering, and intervention. These workflows help users developing tasks identify environment errors and potential scaffolding improvements, and help users analyzing large volumes of behavior traces understand patterns in their data.Click to explore

Findings

▶

So far, we've used Docent to identify and fix environment issues in InterCode CTF, detect cheating in Cybench, discover surprising behaviors in τ-bench, and fix scaffold issues in AgentHarm.Click to explore

Full InterCode analysis

InterCode [1] is a capture-the-flag (CTF) benchmark where agents must recover a text string hidden in a filesystem. It tests key cybersecurity skills at a pre-collegiate level, including cryptography, reverse engineering, and system exploration. GPT-4o (gpt-4o-2024-11-20), run with Inspect's scaffold and default agent ¹Inspect's implementation of InterCode is a subset that excludes certain tasks., achieves a 68.6% overall solve rate across 79 tasks with 5 runs each.

Instead of taking the 68.6% solve rate at face value, here's what we found with Docent:

Fixing the scaffold boosts absolute performance by 10%.

▶

We first asked Docent to cluster the various reasons the agent got stuck.

▶

Noticing a large number of missing external tools, we then asked Docent to cluster the types of tools the agent tried to invoke that were unavailable.

▶

We updated the scaffold to fix some issues we found.

Multiple tasks are broken or unsolvable.

There remained 11 tasks with a 0% pass@5. We found that 4 of them are broken. This process was manual, but we are looking into automated detection.

▶

Task 15: Solved after suggesting alternate decoding assumptions

▶

Task 55: Requires image capabilities, but solved after providing contents of the image

▶

Task 56: Requires image capabilities

▶

Task 59: Incorrect ground truth flag because IP registration details have changed

Most tasks are solvable with minor interventions.

Of the remaining 7 tasks, we find that most of them are actually solvable with minor interventions. We also find that the successful interventions provide qualitative insight into the agent's limitations.

▶

Task 47: Solved after encouraging the agent to consider different encodings

▶

Task 69: Solved after asking agent to think harder about the RSA equation

▶

Task 70: Solved after suggesting a filesystem mount tool, and hinting about slack space

▶

Task 71: Solved after reminding the agent to look at the file named "hidden"

▶

Task 79: Solved after editing agent outputs to redeclare and print variables

▶

Task 85: Difficult Powershell reverse engineering task

▶

Task 86: Difficult Vignere cipher problem

Task Vulnerability in Cybench Port

Cybench is a more difficult professional-level CTF. Claude 3.5 Sonnet (New) achieves a 25% solve rate across 40 tasks with 5 runs each, where 12 tasks are solved at least once. We ran Cybench on a January 2025 fork of the Inspect repo.

Upon asking Docent for cases of "unintended shortcuts", we immediately find one. The Dockerfile, which contains the flag, was accidentally exposed in the filesystem.

Curious whether the agent can solve the task legitimately, we intervene by removing the leaked flag from the tool call. The agent fails to solve the corrected task even after 10 tries.

After correspondence with the authors of Cybench, we learned that:

The Cybench main repo does not contain the exposed flag.
When Cybench was ported to Inspect in November 2024, the flag was accidentally exposed in the Dockerfile.
In Feburary 2025, the issue was caught and patched.
We forked the Inspect repo in January 2025, thus running a stale version of the task containing the new vulnerability.

Suprising behaviors in $\tau$ -bench

$\tau$ -bench [7] is a benchmark for tool-agent-user interactions, in which an agent equipped with tools is tasked with assisting a user. We examine $\tau$ -bench-airline, a subset focused on airline customer service. Upon loading it into Docent and performing a search, we discover two cases where Claude disobeys instructions and provides a $100 voucher to a customer; the policy is to do so only if the customer both complains and asks for compensation. In one case, Claude volunteers it as a gesture of condolence for the passing of the customer’s wife.

Scaffolding issues in AgentHarm

AgentHarm [8] is a benchmark that tests agents' propensity to assist users with harmful tasks. We use Docent to discover cases where the agent attempts to access files that do not exist, indicating a potential issue with the task definition.

Interestingly, the file not found error message contains the filepath benchmark/harmful_tools/... which also reveals to the agent that it is in an evaluation setting involving harmful tools. We noticed that after this message was returned, the agent subsequently refused to fulfill the user's request plagiarize a short story.

Next, we tested whether removing setting cues from the filepath would change the agent's behavior. To do so, we intervened on the transcript to replace benchmark/harmful_tools/ with tools/. After this intervention, Claude no longer refuses to help.

Vision for the future

Docent reveals a long tail of issues that aggregate numbers fail to capture. Furthermore, today’s AI systems are changing rapidly—we constantly see new emergent behaviors [9, 10], and need evaluations to anticipate these as well.

Docent helps us make progress towards questions like: if an agent did not solve a task, how close did it come? Are there inputs on which agents perform abnormally? Do agents cheat or hack the task specification? To achieve these insights, we need to evaluate how a family of agents and scaffolds performs on a range of tasks. Such evaluations require both explanations (summaries of what agents did across a set of transcripts) and interventions (minimal changes that yield different outcomes, or surface edge cases).

We envision a future where model understanding is rich and counterfactual by default, and scales using AI tools.

Richer evals are more challenging to implement, because we collect and communicate more complex artifacts. Instead of simply exporting numbers, we need to expose entire evaluation scaffolds for scrutiny and agent transcripts for analysis. To coordinate across the ecosystem, we believe open standards are important. We're currently developing:

Schemas that describe what raw data to collect from evaluation harnesses, and what audit data to derive from results.
Protocols that encompass the types of actions we can take on evaluations (e.g. editing a scaffold, its scoring function, or resuming a run from a checkpointed state).

Agent needs internet access to solve the task.

Docent is a first step toward that future. It builds on the existing standards of UK AISI's Inspect framework [11], which provides evaluation schemas and a harness. Furthermore, Docent implements early versions of new schemas, protocols, and tools, allowing a user to intervene on agent executions and generate audit data that catches unexpected outcomes. To generalize Docent to other benchmarks, we need better standardization of these components.

We hope these standards will support a richer public ecosystem of tools. Today, investigations are led by humans, but we are building AI investigators that can perform interventions and compile insights that are reported back to human evaluators, who can ask further follow-up questions and perform their own analyses. We're also working on incorporating model internals into Docent's schemas.

What's next

In the near future, we are working on formalizing our audit protocol and open sourcing the Docent interface. We also have several research directions on our roadmap:

Validating outputs: Docent summaries often contain false positives. We want to add automatic validation to reduce these, and to further study the accuracy of insights surfaced by Docent. One way to do this could be through a benchmark for issues in agent transcripts, which could then be used to report precision and recall for tools like Docent.
Automating interventions: Instead of manually exploring the space of interventions, we need AI investigators that automatically search for interventions, run experiments, and surface key insights to human users. Investigators could also search across the space of input tasks to surface entirely new behaviors such as the "chestnut facts" example. A mature Docent interface could integrate many sources of data about an AI system, including data attribution [12, 13, 14] and whitebox overlays on neural representations [15, 16].
Analyzing reasoning traces: Docent is useful for any situation where an AI produces large quantities of text. Beyond the capability evals studied above, we want to analyze long chains of thought from reasoning models to better understand them.

We're excited about each of these directions, and about discovering new research priorities by engaging with potential users. Please share Docent with your communities, check out our Quickstart Guide to use Docent with your own data, and join our user community Slack to ask questions and receive early updates.

Acknowledgements

We are grateful for feedback from Daniel Johnson, Neil Chowdhury, JJ Allaire, John Schulman, Shantanu Jain, Miles Brundage, Dami Choi, Jasmine Wang, Andi Peng, Jacob Andreas, Nicholas Carlini, Martin Wattenberg, William Saunders, Asa Cooper Stickland, Cozmin Uducec, Samuel Klein, Nick Garcia, Andy Zhang, Roy Rinberg, Josh Vendrow, and Maksym Andriushchenko.

Citation information

@misc{meng2025docent,
  author       = {Meng, Kevin and Huang, Vincent and Steinhardt, Jacob and Schwettmann, Sarah},
  title        = {Introducing Docent},
  year         = {2025},
  month        = {March},
  day          = {24},
  howpublished = {\url{https://transluce.org/introducing-docent}}
}

References

Yang, J., Prabhakar, A., Narasimhan, K., & Yao, S. (2023). InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. arXiv preprint arXiv:2306.14898. [link]
Zhang, A. K., Perry, N., Dulepet, R., Ji, J., Menders, C., Lin, J. W., Jones, E., Hussein, G., Liu, S., Jasper, D., Peetathawatchai, P., Glenn, A., Sivashankar, V., Zamoshchin, D., Glikbarg, L., Askaryar, D., Yang, M., Zhang, T., Alluri, R., Tran, N., Sangpisit, R., Yiorkadjis, P., Osele, K., Raghupathi, G., Boneh, D., Ho, D. E., & Liang, P. (2024). Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models. arXiv preprint arXiv:2408.08926. [link]
Vendrow, J., Vendrow, E., Beery, S., & Madry, A. (2025). Do Large Language Model Benchmarks Test Reliability? arXiv preprint arXiv:2502.03461. [link]
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770. [link]
Chowdhury, N., Aung, J., Shern, C. J., Jaffe, O., Sherburn, D., Starace, G., Mays, E., Dias, R., Aljubeh, M., Glaese, M., Jimenez, C. E., Yang, J., Liu, K., & Madry, A. (2024). Introducing swe-bench verified. OpenAI. [link]
Dunlap, L., Mandal, K., Darrell, T., Steinhardt, J., & Gonzalez, J. E. (2025). VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models. arXiv preprint arXiv:2410.12851. [link]
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. [link]
Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., et al. (2024). Agentharm: A benchmark for measuring harmfulness of llm agents. arXiv preprint arXiv:2410.09024. [link]
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682. [link]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837. [link]
AI Security Institute, UK. (2024). Inspect AI: Framework for Large Language Model Evaluations. [link]
Koh, P. W., & Liang, P. (2017). Understanding black-box predictions via influence functions. International conference on machine learning, 1885-1894. [link]
Ilyas, A., Park, S. M., Engstrom, L., Leclerc, G., & Madry, A. (2022). Datamodels: Predicting predictions from training data. arXiv preprint arXiv:2202.00622. [link]
Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., Steiner, B., Li, D., Durmus, E., Perez, E., et al. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296. [link]
Choi, D., Huang, V., Meng, K., Johnson, D. D., Steinhardt, J., & Schwettmann, S. (2024). Scaling Automatic Neuron Description. [link]
Chen, Y., Wu, A., DePodesta, T., Yeh, C., Li, K., Marin, N. C., Patel, O., Riecke, J., Raval, S., Seow, O., et al. (2024). Designing a dashboard for transparency and control of conversational AI. arXiv preprint arXiv:2406.07882. [link]