Using AI to understand AI

We develop scalable, AI-driven techniques for understanding all parts of the machine learning pipeline, from internal representations to model behaviors.

Research Updates

research report

Predictive Concept Decoders

Training scalable end-to-end interpretability assistants

18 December 2025

research report

Scalably Extracting Latent Representations of Users

Constructing datasets and training decoders to extract user models from language models

25 November 2025

research report

Language Model Circuits Are Sparse in the Neuron Basis

A new technique for tracing sparse and faithful circuits directly on a model's MLPs

20 November 2025

technical demonstration

Monitoring SWE-bench Agents

Partnering with SWE-bench to enable reliable monitoring of AI coding agents

19 November 2025

research report

Training Language Models to Explain Their Own Computations

We trained explainer models to verbalize the content of AI systems' internal computations

11 November 2025

research report

Automatically Jailbreaking Frontier Language Models with Investigator Agents

Discovering cost-effective attacks with reinforcement learning

3 September 2025

research report

Surfacing Pathological Behaviors in Language Models

Improving our investigator agents with propensity bounds

5 June 2025

research report

Investigating Truthfulness in a Pre-Release o3 Model

o3 frequently fabricates actions it took to fulfill user requests, and elaborately justifies the fabrications when confronted

16 April 2025

technical demonstration

Introducing Docent

A system for analyzing and intervening on agent behavior

24 March 2025

research report

Scaling Automatic Neuron Explanation

Open-source AI systems trained to describe components of other AI systems at the level of a human expert

23 October 2024

research report

Eliciting Language Model Behaviors with Investigator Agents

Language models trained to automatically surface harmful behaviors in language models

23 October 2024

technical demonstration

Monitor: An AI-Driven Observability Interface

An interface designed to help humans observe, understand, and steer computations inside models

23 October 2024
