Using AI to understand AI

We develop scalable, AI-driven techniques for understanding all parts of the machine learning pipeline, from internal representations to model behaviors.

Research Updates

Scalably Extracting Latent Representations of Users

research report

Constructing datasets and training decoders to extract user models from language models

25 November 2025

Language Model Circuits Are Sparse in the Neuron Basis

research report

A new technique for tracing sparse and faithful circuits directly on a model's MLPs

20 November 2025

Monitoring SWE-bench Agents

technical demonstration

Partnering with SWE-bench to enable reliable monitoring of AI coding agents

19 November 2025

Training Language Models to Explain Their Own Computations

research report

Training explainer models to verbalize the content of AI systems' internal computations

11 November 2025

Automatically Jailbreaking Frontier Language Models with Investigator Agents

research report

Discovering cost-effective attacks with reinforcement learning

3 September 2025

Surfacing Pathological Behaviors in Language Models

research report

Improving our investigator agents with propensity bounds

5 June 2025

Investigating Truthfulness in a Pre-Release o3 Model

research report

o3 frequently fabricates actions it took to fulfill user requests and elaborately justifies the fabrications when confronted

16 April 2025

Introducing Docent

technical demonstration

A system for analyzing and intervening on agent behavior

24 March 2025

Scaling Automatic Neuron Explanation

research report

Open-source AI systems trained to describe components of other AI systems at the level of a human expert

23 October 2024

Eliciting Language Model Behaviors with Investigator Agents

research report

Language models trained to automatically surface harmful behaviors in language models

23 October 2024

Monitor: An AI-Driven Observability Interface

technical demonstration

An interface designed to help humans observe, understand, and steer computations inside models

23 October 2024