
research report
Constructing datasets and training decoders to extract user models from language models
25 November 2025

research report
A new technique for tracing sparse and faithful circuits directly on a model's MLPs
20 November 2025

technical demonstration
Partnering with SWE-bench to enable reliable monitoring of AI coding agents
19 November 2025

research report
We trained explainer models to verbalize the content of AI systems' internal computations
11 November 2025

research report
Discovering cost-effective attacks with reinforcement learning
3 September 2025

research report
Improving our investigator agents with propensity bounds
5 June 2025

research report
o3 frequently fabricates actions it took to fulfill user requests, and elaborately justifies the fabrications when confronted
16 April 2025

technical demonstration
A system for analyzing and intervening on agent behavior
24 March 2025

research report
Open-source AI systems trained to describe components of other AI systems at the level of a human expert
23 October 2024

research report
Language models trained to automatically surface harmful behaviors in language models
23 October 2024

technical demonstration
An interface designed to help humans observe, understand, and steer computations inside models
23 October 2024