Monitoring SWE-bench Agents

We're partnering with the SWE-bench team to enable reliable evaluation of AI coding agents.

The Docent Team*
* Correspondence to: kevin@transluce.org
Transluce | Published: November 19, 2025

Introduction

Transluce is partnering with SWE-bench to make their agent trajectories publicly available on Docent. You can now view transcripts via links on the SWE-bench leaderboard. 1SWE-bench has a "bash only" setting, which evaluates all models with the same minimal agent scaffold on the SWE-bench Verified subset. We're starting with that setting for simplicity.

Docent links on the SWE-bench Bash Only leaderboard
Docent links on the SWE-bench Bash Only leaderboard

SWE-bench is one of the most popular public benchmarks for evaluating coding agents on real-world tasks. The Verified subset, containing 500 manually-reviewed PRs from 12 open-source Python repositories, was downloaded from Huggingface over 1.3M times in the last month and cited in OpenAI, Anthropic, and Google system cards.

Because SWE-bench tasks are open-ended, agents sometimes pass the benchmark in an unintended way. For instance, they may find ways to access ground-truth solutions from within the bash environment. If we could continuously monitor and quantify these issues, we could improve the reliability of the leaderboard.

We've used our Docent tool to do this, in two ways. First, Docent provides an interactive, AI-backed transcript viewer that makes it easy to understand long transcripts and search for behaviors of interest:

The transcript viewer in Docent
The transcript viewer in Docent

We also used Docent to create a judge for situations where the agent solves the task in an unintended way, which gives us glanceable top-line metrics:

Docent can plot quantitative results from running judges
Docent can plot quantitative results from running judges

Through our partnership, these features are now integrated into the SWE-bench leaderboard. We encourage the community to conduct their own exploration of transcripts using Docent, available at swebench.com.

To learn more about Docent, visit our landing page. If you're interested in helping us build great products for scalably understanding AI systems, reach out to join us!

Preliminary demo: detecting cheating

Docent is a tool for analyzing complex agent transcripts. In addition to being a powerful transcript viewer, it enables users to scalably discover and measure behaviors of interest, particularly ones that aren’t captured by top-line metrics. Given a high-level query like “strange or unexpected model behavior,” “cheating or taking an unintended solution path,” or “scaffolding issues,” Docent can help operationalize a vague starting point into a concrete rubric, which enables precise measurements of how often a behavior occurs.

SWE-bench is a rich source of complex behaviors to measure. In this section, we conduct a preliminary exploration of how Docent can scalably detect one such behavior.

Recently, Meta researchers discovered a vulnerability in SWE-bench Verified, where agents could sometimes solve tasks by peeking ahead in git and reading gold solutions.

Kahn's Github issue that raised the vulnerability
Kahn's Github issue that raised the vulnerability

Researchers typically catch issues like these by manually reviewing agent transcripts, but that’s impractical at the scale of SWE-bench, which has 2000+ across many models and scaffolds, each quite long.

We used Docent to build a judge for this particular form of cheating. We focused on the transcripts of Qwen3-Coder-480B and Qwen3-Coder-30B on the OpenHands scaffold. Below, we briefly describe how we used Docent to create the judge.

1. Coming up with an intuitive descriptionClick to explore
2. Operationalizing it into a concrete rubricClick to explore
3. Tuning the rubric with feedbackClick to explore

Here are the results from running the judge over the examples found by the Meta researchers: link to Docent. The judge catches many of the same issues that humans raised; in addition, Docent provides detailed explanations and citations that help users validate judge outputs. We also noticed disagreements between humans and the judge, which would be interesting to further investigate.

While we've taken first steps towards creating a measurement, we haven't yet validated its accuracy, checked whether it has the right justifications for its answers, or debugged cases where it fails. Soon, we'll share our progress on that front; we've been working on methods and interfaces for intelligently collecting human feedback, surfacing potential ambiguities in rubrics, and identifying potential fixes to judge failures.

Getting started with Docent

Interested in analyzing your own runs? Read our quickstart guide to learn how you can upload your own trajectories via the SDK, uploading from Inspect, or using our tracing library. You can also contact kevin@transluce.org for a custom onboarding or to handle uploading your trajectories.