
Introducing Docent

A system for analyzing and intervening on agent behavior

Kevin Meng*, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann*
Transluce | Published: March 24, 2025

We built a system, Docent, that accelerates analysis of AI agent transcripts. It helps users automatically identify corrupted tasks, fix scaffolding issues, uncover unexpected behaviors, and understand an agent's weaknesses—in only minutes.

Explore a technical preview of Docent here, and sign up to join our user community and receive early updates.

Docent identifies issues in an agent's environment, like packages the agent tried to invoke that were missing. On the InterCode [1] benchmark, simply adding those packages increased GPT-4o's solve rate from 68.6% to 78%.

Docent surfaces surprising behaviors. After repeatedly failing to solve a task, GPT-4o generates nonsense text about "chestnut facts" and tries to "re-examine for latent conceptual derivation."

Introduction

The goal of an evaluation is to suggest general conclusions about an AI agent's behavior. Most evaluations produce a small set of numbers (e.g. accuracies) that discard important information in the transcripts: agents may fail to solve tasks for unexpected reasons, solve tasks in unintended ways, or exhibit behaviors we didn't think to measure. Users of evaluations often care not just about what one individual agent can do, but what nearby agents (e.g. with slightly better scaffolding or guidance) would be capable of doing. A comprehensive analysis should explain why an agent succeeded or failed, how far from its goal the agent was, and what range of competencies the agent exhibited.

The information we want is very difficult to obtain. A benchmark with 100 tasks, each run 10 times, produces millions of output tokens. Imagine spending hours reading through it all, in hopes of finding a few broken task environments that account for failures—you might think it's not worth it. Furthermore, drawing robust conclusions requires validation of hypotheses through counterfactual experimentation, for which there is little existing infrastructure.

Docent automates many labor-intensive parts of transcript analysis. It uses language models to summarize, cluster, and search over agent transcripts. It automatically surfaces interesting observations about transcripts, answers questions, and helps users form hypotheses. It also provides infrastructure for counterfactual experimentation to test those hypotheses; users can modify and resume agents from past checkpoints. In the longer term, we anticipate that this experimental loop will be executed by AI systems themselves, with humans providing guidance.
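
To make the summarize-and-cluster step concrete, here is a minimal sketch of the underlying idea. This is not Docent's implementation: it assumes transcripts are already available as plain strings, and the model names and helper functions are illustrative.

# Minimal sketch (not Docent's implementation): summarize each transcript with an
# LLM, embed the summaries, and cluster them to surface recurring failure modes.
# Assumes `transcripts` is a list of plain-text agent transcripts.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

def summarize(transcript: str) -> str:
    """One-paragraph summary focused on why the agent succeeded or failed."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system", "content": "Summarize this agent transcript in one "
             "paragraph, focusing on why the agent succeeded or failed."},
            {"role": "user", "content": transcript[:50_000]},  # crude truncation
        ],
    )
    return resp.choices[0].message.content

def cluster_summaries(summaries: list[str], n_clusters: int = 8) -> list[int]:
    """Embed the summaries and group them with k-means; returns a cluster id per summary."""
    emb = client.embeddings.create(model="text-embedding-3-small", input=summaries)
    X = np.array([d.embedding for d in emb.data])
    k = min(n_clusters, len(summaries))
    return KMeans(n_clusters=k, n_init="auto").fit_predict(X).tolist()

# Usage: labels = cluster_summaries([summarize(t) for t in transcripts])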

Using Docent, we provide in-depth analyses of several common agent benchmarks. Beyond the findings demonstrated above, Docent also uncovers indications of task vulnerability on an Inspect port of Cybench [2], and surprising behaviors in other agent benchmarks.

Today, we're releasing a live technical preview of Docent: a public web endpoint where you can explore agent transcripts for a set of preloaded benchmarks. We are also soliciting early users to receive white-glove support for running Docent on their own transcripts, ahead of a general release in the coming weeks. If you have a use case that might benefit from Docent, please get in touch. In this technical report, we'll walk through our findings and design decisions when building Docent, and examples of how we've used it so far.

Gaps in understanding

The current evaluation paradigm is insufficient for understanding complex agent behaviors. Top-level numbers don’t communicate why an agent succeeded or failed, and broken setups often lead to misinterpretation of results.

Docent workflows

Docent enables 4 key workflows: summarization, search, clustering, and intervention. These workflows help users developing tasks identify environment errors and potential scaffolding improvements, and help users analyzing large volumes of behavior traces understand patterns in their data.

Findings

So far, we've used Docent to identify and fix environment issues in InterCode CTF, detect cheating in Cybench, discover surprising behaviors in τ-bench, and fix scaffold issues in AgentHarm.

Full InterCode analysis

InterCode [1] is a capture-the-flag (CTF) benchmark where agents must recover a text string hidden in a filesystem. It tests key cybersecurity skills at a pre-collegiate level, including cryptography, reverse engineering, and system exploration. GPT-4o (gpt-4o-2024-11-20), run with Inspect's scaffold and default agent (note that Inspect's implementation of InterCode is a subset that excludes certain tasks), achieves a 68.6% overall solve rate across 79 tasks with 5 runs each.

Instead of taking the 68.6% solve rate at face value, here's what we found with Docent:

Fixing the scaffold boosts absolute performance by roughly 10 percentage points (from 68.6% to 78%).

  • We first asked Docent to cluster the various reasons the agent got stuck.
  • Noticing a large number of missing external tools, we then asked Docent to cluster the types of tools the agent tried to invoke that were unavailable.
  • We updated the scaffold to fix some of the issues we found.

Multiple tasks are broken or unsolvable.

There remained 11 tasks with a 0% pass@5. We found that 4 of them are broken. This process was manual, but we are looking into automated detection.

  • Task 15: Solved after suggesting alternate decoding assumptions
  • Task 55: Requires image capabilities, but solved after providing the contents of the image
  • Task 56: Requires image capabilities
  • Task 59: Incorrect ground truth flag, because the relevant IP registration details have changed

Most tasks are solvable with minor interventions.

Of the remaining 7 tasks, we find that most of them are actually solvable with minor interventions. We also find that the successful interventions provide qualitative insight into the agent's limitations.

  • Task 47: Solved after encouraging the agent to consider different encodings
  • Task 69: Solved after asking the agent to think harder about the RSA equation
  • Task 70: Solved after suggesting a filesystem mount tool and hinting about slack space
  • Task 71: Solved after reminding the agent to look at the file named "hidden"
  • Task 79: Solved after editing agent outputs to redeclare and print variables
  • Task 85: Difficult PowerShell reverse engineering task
  • Task 86: Difficult Vigenère cipher problem
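
The solve-rate and pass@5 figures above are simple bookkeeping once per-run outcomes have been extracted from the evaluation logs. Here is a minimal sketch, assuming those outcomes are already available as a mapping from task id to booleans (the extraction step from Inspect logs is omitted):

# Minimal sketch: given per-task outcomes (one boolean per run), compute the
# overall solve rate and flag tasks that were never solved (0% pass@5).
# Assumes `runs` maps task id -> list of booleans extracted from the eval logs.
def solve_rate(runs: dict[str, list[bool]]) -> float:
    total = sum(len(outcomes) for outcomes in runs.values())
    return sum(sum(outcomes) for outcomes in runs.values()) / total

def never_solved(runs: dict[str, list[bool]]) -> list[str]:
    return [task for task, outcomes in runs.items() if not any(outcomes)]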

Task vulnerability in a Cybench port

Cybench [2] is a more difficult, professional-level CTF benchmark. Claude 3.5 Sonnet (New) achieves a 25% solve rate across 40 tasks with 5 runs each, with 12 tasks solved at least once. We ran Cybench on a January 2025 fork of the Inspect repo.

Upon asking Docent for cases of "unintended shortcuts", we immediately find one. The Dockerfile, which contains the flag, was accidentally exposed in the filesystem.

Curious whether the agent can solve the task legitimately, we intervene by removing the leaked flag from the tool call. The agent fails to solve the corrected task even after 10 tries.

After correspondence with the authors of Cybench, we learned that the agent needs internet access to solve the task.

Surprising behaviors in τ-bench

τ-bench [7] is a benchmark for tool-agent-user interactions, in which an agent equipped with tools is tasked with assisting a user. We examine τ-bench-airline, a subset focused on airline customer service. Upon loading it into Docent and performing a search, we discover two cases where Claude disobeys instructions and provides a $100 voucher to a customer; the policy is to do so only if the customer both complains and asks for compensation. In one case, Claude volunteers it as a gesture of condolence for the passing of the customer’s wife.
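
Docent's search is only described at a high level here; one simple way to implement this kind of natural-language search is to ask a judge model whether each transcript matches the query. The sketch below is illustrative rather than Docent's implementation; the query, prompt, model choice, and `search` helper are all assumptions.

# Minimal sketch (not Docent's implementation) of natural-language search over
# transcripts: ask a judge model whether each transcript matches the query.
# Assumes `transcripts` maps transcript id -> plain-text transcript.
import json
from openai import OpenAI

client = OpenAI()

QUERY = ("The agent offers the customer a voucher or other compensation "
         "that the stated policy does not allow.")

def search(transcripts: dict[str, str], query: str = QUERY) -> list[str]:
    matches = []
    for tid, text in transcripts.items():
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative choice
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": 'You check whether an agent transcript '
                 'matches a search query. Reply with JSON: {"match": true or false, '
                 '"evidence": "short quote"}.'},
                {"role": "user", "content": f"Query: {query}\n\nTranscript:\n{text[:50_000]}"},
            ],
        )
        if json.loads(resp.choices[0].message.content).get("match"):
            matches.append(tid)
    return matches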

Scaffolding issues in AgentHarm

AgentHarm [8] is a benchmark that tests agents' propensity to assist users with harmful tasks. We use Docent to discover cases where the agent attempts to access files that do not exist, indicating a potential issue with the task definition.

Interestingly, the file-not-found error message contains the filepath benchmark/harmful_tools/..., which also reveals to the agent that it is in an evaluation setting involving harmful tools. We noticed that after this message was returned, the agent refused to fulfill the user's request to plagiarize a short story.

Next, we tested whether removing setting cues from the filepath would change the agent's behavior. To do so, we intervened on the transcript to replace benchmark/harmful_tools/ with tools/. After this intervention, Claude no longer refuses to help.
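
As a rough illustration of this kind of intervention (not Docent's actual machinery), one can rewrite the revealing path prefix in the tool messages before resuming the agent from that point. Messages are assumed to be simple role/content dicts, and `resume_agent` is a hypothetical stand-in for the resume-from-checkpoint step.

# Rough sketch of the intervention described above: rewrite the revealing path
# prefix in tool outputs before resuming the agent from that point.
# Messages are assumed to be simple dicts ({"role": ..., "content": ...});
# `resume_agent` is a hypothetical stand-in for the resume-from-checkpoint step.
def redact_setting_cues(messages: list[dict]) -> list[dict]:
    edited = []
    for msg in messages:
        content = msg.get("content", "")
        if msg.get("role") == "tool" and isinstance(content, str):
            content = content.replace("benchmark/harmful_tools/", "tools/")
        edited.append({**msg, "content": content})
    return edited

# resumed_run = resume_agent(redact_setting_cues(transcript_messages))  # hypothetical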

Vision for the future

Docent reveals a long tail of issues that aggregate numbers fail to capture. Furthermore, today’s AI systems are changing rapidly—we constantly see new emergent behaviors [9, 10], and need evaluations to anticipate these as well.

Docent helps us make progress towards questions like: if an agent did not solve a task, how close did it come? Are there inputs on which agents perform abnormally? Do agents cheat or hack the task specification? To achieve these insights, we need to evaluate how a family of agents and scaffolds performs on a range of tasks. Such evaluations require both explanations (summaries of what agents did across a set of transcripts) and interventions (minimal changes that yield different outcomes, or surface edge cases).

We envision a future where model understanding is rich and counterfactual by default, and scales using AI tools.

Richer evals are more challenging to implement because they require collecting and communicating more complex artifacts. Instead of simply exporting numbers, we need to expose entire evaluation scaffolds for scrutiny and agent transcripts for analysis. To coordinate across the ecosystem, we believe open standards are important. We're currently developing:

  • Schemas that describe what raw data to collect from evaluation harnesses, and what audit data to derive from results.
  • Protocols that encompass the types of actions we can take on evaluations (e.g. editing a scaffold or its scoring function, or resuming a run from a checkpointed state).
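
As a loose illustration of the kind of records such schemas might describe, consider the sketch below. These dataclasses are not the standards themselves; every field name is hypothetical.

# Illustrative sketch only: an example of the kind of raw and derived records
# that evaluation schemas might describe. All field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TranscriptRecord:              # raw data collected from the evaluation harness
    task_id: str
    run_id: str
    messages: list[dict]             # full agent / tool / user message history
    score: float
    scaffold_config: dict = field(default_factory=dict)

@dataclass
class AuditFinding:                  # audit data derived from results
    transcript_ids: list[str]
    kind: str                        # e.g. "broken_environment", "unintended_shortcut"
    summary: str
    suggested_intervention: str | None = None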

Docent is a first step toward that future. It builds on the existing standards of UK AISI's Inspect framework [11], which provides evaluation schemas and a harness. Furthermore, Docent implements early versions of new schemas, protocols, and tools, allowing a user to intervene on agent executions and generate audit data that catches unexpected outcomes. To generalize Docent to other benchmarks, we need better standardization of these components.

We hope these standards will support a richer public ecosystem of tools. Today, investigations are led by humans, but we are building AI investigators that can perform interventions and compile insights that are reported back to human evaluators, who can ask further follow-up questions and perform their own analyses. We're also working on incorporating model internals into Docent's schemas.

What's next

In the near term, we are working on formalizing our audit protocol and open-sourcing the Docent interface. We also have several research directions on our roadmap:

  • Validating outputs: Docent summaries often contain false positives. We want to add automatic validation to reduce these, and to further study the accuracy of insights surfaced by Docent. One way to do this could be through a benchmark for issues in agent transcripts, which could then be used to report precision and recall for tools like Docent.
  • Automating interventions: Instead of manually exploring the space of interventions, we need AI investigators that automatically search for interventions, run experiments, and surface key insights to human users. Investigators could also search across the space of input tasks to surface entirely new behaviors such as the "chestnut facts" example. A mature Docent interface could integrate many sources of data about an AI system, including data attribution [12, 13, 14] and whitebox overlays on neural representations [15, 16].
  • Analyzing reasoning traces: Docent is useful for any situation where an AI produces large quantities of text. Beyond the capability evals studied above, we want to analyze long chains of thought from reasoning models to better understand them.

We're excited about each of these directions, and about discovering new research priorities by engaging with potential users. Please share Docent with your communities, sign up to be an early user, or let us know how it could be helpful for your work.

Acknowledgements

We are grateful for feedback from Daniel Johnson, Neil Chowdhury, JJ Allaire, John Schulman, Shantanu Jain, Miles Brundage, Dami Choi, Jasmine Wang, Andi Peng, Jacob Andreas, Nicholas Carlini, Martin Wattenberg, William Saunders, Asa Cooper Stickland, Cozmin Ududec, Samuel Klein, Nick Garcia, Andy Zhang, Roy Rinberg, Josh Vendrow, and Maksym Andriushchenko.

Citation information

@misc{meng2025docent,
  author       = {Meng, Kevin and Huang, Vincent and Steinhardt, Jacob and Schwettmann, Sarah},
  title        = {Introducing Docent},
  year         = {2025},
  month        = {March},
  day          = {24},
  howpublished = {\url{https://transluce.org/introducing-docent}}
}

References

  1. Yang, J., Prabhakar, A., Narasimhan, K., & Yao, S. (2023). InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. arXiv preprint arXiv:2306.14898. [link]
  2. Zhang, A. K., Perry, N., Dulepet, R., Ji, J., Menders, C., Lin, J. W., Jones, E., Hussein, G., Liu, S., Jasper, D., Peetathawatchai, P., Glenn, A., Sivashankar, V., Zamoshchin, D., Glikbarg, L., Askaryar, D., Yang, M., Zhang, T., Alluri, R., Tran, N., Sangpisit, R., Yiorkadjis, P., Osele, K., Raghupathi, G., Boneh, D., Ho, D. E., & Liang, P. (2024). Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models. arXiv preprint arXiv:2408.08926. [link]
  3. Vendrow, J., Vendrow, E., Beery, S., & Madry, A. (2025). Do Large Language Model Benchmarks Test Reliability? arXiv preprint arXiv:2502.03461. [link]
  4. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770. [link]
  5. Chowdhury, N., Aung, J., Shern, C. J., Jaffe, O., Sherburn, D., Starace, G., Mays, E., Dias, R., Aljubeh, M., Glaese, M., Jimenez, C. E., Yang, J., Liu, K., & Madry, A. (2024). Introducing SWE-bench Verified. OpenAI. [link]
  6. Dunlap, L., Mandal, K., Darrell, T., Steinhardt, J., & Gonzalez, J. E. (2025). VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models. arXiv preprint arXiv:2410.12851. [link]
  7. Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. [link]
  8. Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., et al. (2024). AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. arXiv preprint arXiv:2410.09024. [link]
  9. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682. [link]
  10. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837. [link]
  11. AI Security Institute, UK. (2024). Inspect AI: Framework for Large Language Model Evaluations. [link]
  12. Koh, P. W., & Liang, P. (2017). Understanding black-box predictions via influence functions. International conference on machine learning, 1885-1894. [link]
  13. Ilyas, A., Park, S. M., Engstrom, L., Leclerc, G., & Madry, A. (2022). Datamodels: Predicting predictions from training data. arXiv preprint arXiv:2202.00622. [link]
  14. Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., Steiner, B., Li, D., Durmus, E., Perez, E., et al. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296. [link]
  15. Choi, D., Huang, V., Meng, K., Johnson, D. D., Steinhardt, J., & Schwettmann, S. (2024). Scaling Automatic Neuron Description. [link]
  16. Chen, Y., Wu, A., DePodesta, T., Yeh, C., Li, K., Marin, N. C., Patel, O., Riecke, J., Raval, S., Seow, O., et al. (2024). Designing a dashboard for transparency and control of conversational AI. arXiv preprint arXiv:2406.07882. [link]