
Diagnosing a performance regression on Terminal-Bench in 15 minutes with Docent

Why does GPT-5.1 Codex underperform GPT-5 Codex?

The Docent Team
* Correspondence to: docent@transluce.org
Transluce | Published: February 11, 2026

Terminal-Bench evaluates how well models carry out complex tasks in the terminal. At the time of writing, GPT-5.1 Codex underperforms GPT-5 Codex by 6.5 percentage points on the official leaderboard, even with the same scaffold. What explains the regression?

We used Docent, our tool for analyzing traces at scale, to compare failed runs from one model against successful runs from a different model on the same task. We analyzed traces from re-running Terminal-Bench ourselves. On our own dataset, GPT-5.1 Codex underperformed GPT-5 Codex by 11.4 percentage points, a discrepancy in the same direction as the 6.5 percentage point gap on the official leaderboard.

In 15 minutes, Docent finds that the regression probably stems from timeout errors rather than a drop in underlying capability. GPT-5.1 Codex times out at nearly twice the rate of GPT-5 Codex (49% vs. 29%), yet Terminal-Bench's system prompt never mentions the time constraint. Whereas GPT-5 Codex tends toward engineering failures like corrupted file writes, GPT-5.1 Codex tends to time out during long-running processes like training runs, large package installs, or password cracking. GPT-5.1 Codex's strategies might more reliably solve the tasks absent time constraints, but the evaluation setup cuts them short. When we filter to tasks where neither model times out, GPT-5.1 Codex actually outperforms GPT-5 Codex by 7 percentage points.

Below, we walk through how to replicate our analysis with 5 minutes of setup.

Docent result with summary of failures
Docent summarizes the performance difference by analyzing all pairs of traces in a dataset. In this case, Model A is GPT-5 Codex and Model B is GPT-5.1 Codex.

Walkthrough

GPT-5.1 Codex scores 36.9% on Terminal-Bench, 6.5 percentage points below GPT-5 Codex's 43.4%
GPT-5.1 Codex underperforms GPT-5 Codex by 6.5 percentage points on the official Terminal-Bench leaderboard.

We re-ran Terminal-Bench with three rollouts per task per model. Our absolute scores are lower than those on the official leaderboard and the gap is wider (11.4 vs. 6.5 percentage points), but the direction and rough magnitude of the regression are consistent. Readers should treat this analysis as illustrative of Docent's workflow rather than a definitive characterization of either model. We report macro-average reward throughout this post to avoid overweighting tasks with uneven rollout counts.
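
Concretely, macro-averaging here means averaging reward within each task first, then averaging those per-task means, so a task with two rollouts counts the same as a task with three. A minimal sketch of the computation (the run schema shown is illustrative, not Docent's actual format):

    from collections import defaultdict

    def macro_average_reward(runs):
        """Mean of per-task mean rewards, so every task counts equally."""
        by_task = defaultdict(list)
        for run in runs:  # run: {"task_id": ..., "reward": 0 or 1, ...}
            by_task[run["task_id"]].append(run["reward"])
        per_task_means = [sum(r) / len(r) for r in by_task.values()]
        return sum(per_task_means) / len(per_task_means)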

Baseline success rates

In three steps, we'll (1) compare a failed run against a successful run on the same task to identify what went wrong, (2) synthesize across the pair analyses to surface systematic differences between the two models, and (3) validate the emerging hypothesis quantitatively using Docent's query language.

Want to follow along? You can replicate this analysis with 5 minutes of setup.

  1. Click "Clone Collection" to copy our Terminal-Bench collection into your account.
  2. Install Claude Code and cd into any directory where you want Claude to write your Docent scripts.
  3. Run claude. Then, run /plugin marketplace add TransluceAI/claude-code-plugins and /plugin install docent@transluce-plugins to install the Docent plugin.
  4. Restart Claude Code after installing the plugin by typing exit and then re-running claude.
  5. Verify the plugin is installed by starting to type /analysis, without pressing Enter or actually running the command. Claude Code should display the Docent plugin.
Plugin validation screenshot
  6. Find your collection ID and API key. Create docent.env and add them as DOCENT_API_KEY and COLLECTION_ID (see the example below). Alternatively, run /analysis on its own, or pass them inline: /analysis collection ID: <YOUR_COLLECTION_ID> API Key: <YOUR_API_KEY>
Where to find your collection ID and API key
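
The resulting docent.env contains just these two entries (placeholder values shown; substitute your own):

    # docent.env
    DOCENT_API_KEY=<YOUR_API_KEY>
    COLLECTION_ID=<YOUR_COLLECTION_ID>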
Questions? Chat with a dev by clicking in the bottom-left corner of the Docent UI. Setup should take around 5 minutes, so let us know if you run into any difficulties.

Step 1: Analyzing each failure

Analysis is now as simple as describing what we want in natural language:

/analysis For each task, find runs where one model scored reward = 0 and the other scored reward = 1. For each failed run, compare it to a successful run from the other model on the same task and write a short paragraph explaining the cause of the failure. Group results by model.

Claude Code translates this into an SDK script that groups runs by task, filters for tasks with at least one success and one failure, and sends each pair to an LLM judge with a detailed prompt and both traces in context.

Generated script: Compare failed runs to successful runs
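
The generated script itself calls the Docent SDK; the sketch below is only a simplified illustration of the pairing logic it implements, with an illustrative run schema and the LLM-judge call left as a comment:

    from collections import defaultdict

    def build_comparison_pairs(runs):
        """Pair each failed run with a successful run from the other model on the same task."""
        by_task = defaultdict(list)
        for run in runs:  # run: {"task_id": ..., "model": ..., "reward": 0 or 1, "trace": ...}
            by_task[run["task_id"]].append(run)

        pairs = []
        for task_runs in by_task.values():
            failures = [r for r in task_runs if r["reward"] == 0]
            successes = [r for r in task_runs if r["reward"] == 1]
            for failed in failures:
                match = next((s for s in successes if s["model"] != failed["model"]), None)
                if match is not None:
                    # Each (failed, successful) pair is what gets sent to the LLM judge,
                    # along with a prompt asking why the failed run diverged.
                    pairs.append((failed, match))
        return pairs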

In Docent's web UI, we can use the Results tab to browse the analysis of each pair. Every result is a short explanation of why the failed run diverged from the successful one. Since LLMs are stochastic, the exact order and wording of your results may differ.

Docent identifies the differences between a failed and successful run on the same task for all failed runs in the collection. Note that each analysis is unique, so your output will look slightly different.

Clicking on a citation jumps to the relevant section of the original trace, so we can verify the analysis against the source material.

Dig deeper by clicking on citations or using followup chat
Dig deeper by clicking on citations, which jump to the relevant block in the original transcript, or clicking followup chat, which immediately populates an LLM chat with the relevant transcripts in context.

Citations and the followup chat are useful for targeted reading, so we can investigate strategically instead of skimming a random sample. But we've only looked at individual pairs. To understand what's systematically different between the two models, we need to look across all of them.

Step 2: Synthesizing across all failures

Manually synthesizing across 128 pair analyses would mean reading each one and building a mental model of the recurring patterns. With Docent, it's another prompt to Claude Code. Enter the following in the same Claude Code session as the previous prompt:

/analysis Based on the previous result, summarize the difference in failure modes between the two models.

Docent loads and compresses all of the per-pair analyses into context and produces a summary of the key differences.
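
Under the hood, this amounts to collecting the per-pair analyses and asking a model to summarize them. A simplified sketch of how such a synthesis prompt could be assembled (not Docent's actual implementation; field names are illustrative):

    from collections import defaultdict

    def build_synthesis_prompt(pair_analyses):
        """Group per-pair analyses by the failing model and ask for a cross-cutting summary."""
        by_model = defaultdict(list)
        for item in pair_analyses:  # item: {"failed_model": ..., "task_id": ..., "analysis": ...}
            by_model[item["failed_model"]].append(f"- [{item['task_id']}] {item['analysis']}")

        sections = "\n\n".join(
            f"Failures for {model}:\n" + "\n".join(lines)
            for model, lines in by_model.items()
        )
        return (
            "Each bullet explains why one model failed a task that the other model solved.\n"
            "Summarize the systematic differences in failure modes between the two models.\n\n"
            + sections
        )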

Docent summarizes the conclusions from comparing each pair of runs across the entire collection.

The two models fail in characteristically different ways. GPT-5 Codex failures cluster around execution mistakes: skipped or missing verification steps and corrupted file writes. GPT-5.1 Codex failures cluster around poor time management: it mishandles long-running or heavyweight operations, such as installing GPU packages instead of CPU-only ones or aborting training jobs instead of waiting for them, and ends up triggering a timeout error.

Step 3: Quantitative analysis

To test whether timeouts explain the gap, we can query our run metadata directly. Docent supports ad hoc data exploration via Docent Query Language, a subset of SQL that our Claude Code skill can generate from natural language. We can try the following prompt:

Analyze the timeout rate per model. The exception field in the metadata indicates whether the agent timed out

Claude Code uses Docent Query Language to group by model and quantitatively compare timeout rates.

Nearly half of GPT-5.1 Codex's runs time out (49.4%), compared to fewer than one in three for GPT-5 Codex (29.1%). The pattern is broad rather than driven by a few outliers: 24 tasks produce timeouts only for GPT-5.1 Codex, while just 3 tasks time out exclusively for GPT-5 Codex.
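
If you want to sanity-check these numbers yourself, the aggregation is simple to compute over the run metadata. A sketch, assuming each run's metadata carries the exception field mentioned above (schema illustrative):

    from collections import defaultdict

    def timeout_rate_per_model(runs):
        """Fraction of runs per model whose metadata records a timeout exception."""
        totals, timeouts = defaultdict(int), defaultdict(int)
        for run in runs:  # run: {"model": ..., "metadata": {"exception": ... or None}, ...}
            totals[run["model"]] += 1
            if run["metadata"].get("exception"):
                timeouts[run["model"]] += 1
        return {model: timeouts[model] / totals[model] for model in totals}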

If timeouts account for the gap, then the two models should perform more similarly on tasks where neither times out (see the footnotes for more detail). We can test this directly:

Filter to tasks where neither model times out across all rollouts. Create and display a pivot table that shows rollouts and reward per model per task. Then compute the macro-average reward per model on this subset of tasks.

When timeouts are removed, GPT-5.1 Codex leads by ~7pp
When timeouts are removed from the equation, GPT-5.1 Codex leads by ~7 percentage points. Its overall deficit (0.27 vs. 0.39 macro-average reward across all 89 tasks) is largely attributable to its much higher timeout rate (49% vs. 29%).

On the subset of tasks without timeouts, the performance difference actually flips! GPT-5.1 Codex outperforms GPT-5 Codex by about 7 percentage points.
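
The same subset check is straightforward to reproduce over the run metadata, under the same illustrative schema as the sketches above:

    from collections import defaultdict

    def macro_reward_without_timeouts(runs):
        """Macro-average reward per model, restricted to tasks where no run timed out."""
        timed_out_tasks = {r["task_id"] for r in runs if r["metadata"].get("exception")}
        clean_runs = [r for r in runs if r["task_id"] not in timed_out_tasks]

        by_model_task = defaultdict(list)
        for run in clean_runs:
            by_model_task[(run["model"], run["task_id"])].append(run["reward"])

        per_model = defaultdict(list)
        for (model, _task), rewards in by_model_task.items():
            per_model[model].append(sum(rewards) / len(rewards))
        return {model: sum(means) / len(means) for model, means in per_model.items()}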

Conclusion

A lower benchmark score is easy to interpret as a capability deficit, but Terminal-Bench's system prompt never mentions a timeout constraint. GPT-5.1 Codex may be choosing long-horizon strategies that the eval cutoff terminates prematurely. This kind of hypothesis is difficult to reach by skimming a handful of traces, but emerges from systematically analyzing the full dataset.

In 15 minutes, Docent analyzed traces containing 256M+ tokens and found that GPT-5.1 Codex's 11.4 percentage point deficit is concentrated in timeout errors. On tasks where neither model times out, GPT-5.1 Codex actually outperforms GPT-5 Codex.

This walkthrough covered one common pattern (pairwise comparison of failed and successful runs), but Docent can be adapted to other questions as well: why a harness works with Opus 4.6 but not GPT-5.3 Codex, whether a score drop reflects broken environments or genuine capability differences, or whether cheating is inflating eval results.

  • Want to go further? Use Docent's rubric refinement agent to quantify how often a failure pattern appears across your dataset — take a few examples of a behavior, and Docent will operationalize it into a precise classifier you can run over every trace.
  • Try Docent on your own data. Read our Quickstart to ingest your traces and start analyzing.
  • Questions or feedback? Get in touch. We'd love to hear what you're working on.

Footnotes

  • Filtering to tasks where no run from either model times out may bias the result toward tasks that are easier.
  • In rare edge cases, the agent may succeed despite timing out. In one trace we examined, the agent passed the Terminal-Bench tests but repeatedly failed on a suite of its own test cases. Since Terminal-Bench only grades the final file state, the run was marked as a success despite the timeout exception.