
Monitor: An AI-Driven Observability Interface

Kevin Meng*‡, Vincent Huang, Neil Chowdhury, Dami Choi, Jacob Steinhardt*, Sarah Schwettmann*
* Core research and design contributor. Correspondence to: [email protected]
‡ Core infrastructure contributor
Transluce | Published: October 23, 2024

This write-up is a technical demonstration, which describes and evaluates the use of a new piece of technology. For technical demonstrations, we still run systematic experiments to test our findings, but do not run detailed ablations and controls. The claims are ones that we have tested and stand behind, but have not vetted as thoroughly as in our research reports.


Motivation

When an AI system behaves unexpectedly, we'd like to understand the "thought process" that explains why the behavior occurred. This lets us predict and fix problems with AI models [1], surface hidden knowledge [2], and uncover learned biases and spurious correlations [3, 4].

As an example, many models incorrectly state that 9.8 < 9.11. Can we identify the spurious concepts responsible for this, and intervene on them to fix the model? As another example, AI models sometimes fail to state public information about a notable individual. In these cases, does the model truly not know, or is there hidden knowledge that it omits from the output?

Answering these questions is challenging due to the complex nature of the underlying data. The thought process is encoded in massive neural representations consisting of opaque floating-point numbers, which must be decoded into human-readable concepts. Even once this is done, any input to an AI system will activate a large number of concepts, and humans need assistance sifting through them for the important ones.

To address these challenges, we introduce Monitor, an observability interface designed to help humans observe, understand, and steer the internal computations of language models. In line with our general approach to scalable understanding, Monitor employs a collection of AI-driven tools to help users make sense of the neural activation patterns in language models:

  • A pre-compiled database of high-quality neuron descriptions, generated by applying our AI-driven description pipeline to all the MLP neurons in Llama-3.1-8B. Our system applies equally well to SAE features or any other feature set. We started with neurons because they were simplest, and already worked well. We are releasing the code for our system and are excited for others to build on it with their own feature sets!
  • A real-time interface for displaying the important concepts for a given chat dialog. Users can measure importance by either activation (how strongly a concept fires) or attribution (how much the concept affects a specified target token).
  • A real-time AI linter, which automatically surfaces clusters of unexpected concepts that are possible spurious cues (e.g. “September 11th” neurons firing on the number 9.11).
  • Semantically-guided steering to increase or decrease the strength of collections of conceptually-related neurons based on natural language input.

While testing Monitor ourselves, we’ve used it to debug persistent model failures, uncover hidden knowledge, and granularly steer Llama's representations of specific entities. We’ll summarize these examples below, then go into detail in the corresponding sections.

Debugging model failures. Large language models consistently say that 9.8 is less than 9.11. Using Monitor’s attribution feature together with the AI linter, we see that there are clusters of neurons contributing to this mistake related to September 11th, gravity, and bible verses:

[Figure: AI linter surfacing September 11th, gravity, and bible verse neuron clusters for the 9.8 vs. 9.11 prompt]

After removing these neurons, Llama gets the answer correct; more generally, removing bible verse neurons increases comparison accuracy by 21 percentage points (Example 1).

Uncovering hidden knowledge. Models trained with RLHF often refuse to answer questions, sometimes even when they are harmless. In the example below, we see Llama say that it does not know who Nicholas Carlini is, yet our AI linter surfaces a cluster of cybersecurity neurons, suggesting that Llama knows more than it is saying:

[Figure: chat where Llama says it does not know who Nicholas Carlini is, while the AI linter surfaces a cybersecurity neuron cluster]

We observe similar instances of hidden knowledge for other well-known figures (Example 2).

Steering specific entities. Monitor can identify and steer concepts at the granularity of specific tokens. This allows us to inject attributes into specific entities: for example, we created a story about Alice and Bob where Alice is a fish and Bob is a bird, by increasing the activation of fish neurons for Alice and bird neurons for Bob:

[Figure: story in which Alice is steered to be a fish and Bob a bird]

System design. In addition to elaborating on the three examples above, we also describe the design principles and implementation of Monitor in the system design section below.

Example: 9.8 < 9.11

Language models are impressive general-purpose agents, yet they often have puzzling failure modes. One of the most famous examples of this is “9.8 < 9.11”, in which a wide variety of LLMs consistently say that 9.8 is smaller than 9.11 when asked (when in fact 9.8 = 9.80 > 9.11). This has led to widespread speculation about the cause, for instance that it might have something to do with software version numbers.

We can investigate this in Monitor by asking Llama-3.1 “Which is bigger, 9.8 or 9.11?” and inspecting which concepts are active. This yields the following result:

[Figure: activation view for “Which is bigger, 9.8 or 9.11?”, with the AI linter surfacing September 11 and gravity clusters]

On the left, we see that Llama indeed gets the answer wrong. On the top right, our AI linter discovers some clusters of potentially spurious concepts: “September 11 attacks” and “gravitational dynamics” (gravity shows up because 9.8 m/s^2 is the acceleration due to gravity). The first of these is interesting, as it suggests Llama might be interpreting 9.11 as a date and then concluding that 9/8 < 9/11 (as dates). The bottom right panel contains descriptions of individual highly active neurons, highlighting those in the Sept. 11th cluster.

To dive into this more deeply, we can use attribution: asking not just which concepts are most active, but which ones specifically influence Llama to say “larger” after “9.11 is…”. If we do this, we get the following results:

[Figure: attribution view for the same prompt, with an additional bible verse cluster]

The AI linter finds the same two clusters as before, as well as a third cluster related to the bible. The specific neurons in this cluster relate to bible verses, which could also cause problems if 9.8 and 9.11 are interpreted as 9:8 and 9:11 (chapter:verse).

This was initially surprising to us but makes sense in retrospect: most pretraining datasets contain many copies of the bible, and a significant fraction of neurons in Llama are bible-related. A regex search of our descriptions for ‘biblical’, ‘bible’, ‘scripture’, ‘gospel’, ‘testament’, ‘psalm’, ‘proverb’, ‘exodus’, ‘genesis’, ‘revelation’, ‘apostle’, ‘prophet’, ‘jesus’, ‘christ’, ‘moses’, ‘abraham’, ‘noah’, ‘david’, ‘solomon’ returns 3,990 results out of ~917,000 total descriptions, or about 0.4% of the total. This search probably includes some unrelated neurons and also misses some others.
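
For reference, here is a minimal sketch of such a regex count over the description database; the file name and JSON schema are illustrative assumptions, not our actual storage format:

```python
import json
import re

# Illustrative path and schema: one JSON object per line with a "description" field.
TERMS = [
    "biblical", "bible", "scripture", "gospel", "testament", "psalm", "proverb",
    "exodus", "genesis", "revelation", "apostle", "prophet", "jesus", "christ",
    "moses", "abraham", "noah", "david", "solomon",
]
pattern = re.compile("|".join(TERMS), flags=re.IGNORECASE)

total = matches = 0
with open("neuron_descriptions.jsonl") as f:
    for line in f:
        total += 1
        if pattern.search(json.loads(line)["description"]):
            matches += 1

print(f"{matches} / {total} descriptions ({matches / total:.1%}) match bible-related terms")
```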

We’d like to get rid of the spurious Sept. 11th and bible verse neurons. To do so, we can steer concepts downwards. We first steer down neurons related to bible verses by typing “bible verses” into the steering prompt and pressing deactivate; this zeros out the activations of the 500 neurons whose descriptions have the highest semantic match to the query. We then do the same for “terrorist attacks”. Once both are deactivated, Llama gets the answer correct:

[Figure: after deactivating bible verse and terrorist attack neurons, Llama answers correctly]

Validating our findings. Our hypothesis is that Llama gets comparisons wrong by associating them with spurious concepts such as bible verses. This motivates two experiments:

  • Zeroing out the neurons related to these concepts: this leads comparisons to become much more accurate.
  • Zeroing out other concepts or random neurons: this has a much smaller effect.

As a starting point, we verify that 9.8 and 9.11 are not special: Llama-3.1 frequently gets comparisons wrong. We considered all comparisons between X.Y and X.Z, for 1 ≤ X ≤ 20, 6 ≤ Y ≤ 9, and 10 ≤ Z ≤ 13. We considered two templates for asking the question and both orderings of X.Y and X.Z, leading to 1280 queries in total. Across these, Llama-3.1 gets only 55% correct (it is important to choose Z ≥ 10 in this analysis; when we pick 1-digit values for Z, model accuracy is 100%). In general, Llama is more likely to be wrong for comparisons that are close together, as shown in Figure 1 (Left) below.
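
To make the setup concrete, here is a minimal sketch of how such a query set can be enumerated; the second prompt template is an illustrative stand-in rather than the exact wording we used:

```python
from itertools import product

# First template matches the prompt shown above; the second is illustrative.
TEMPLATES = [
    "Which is bigger, {a} or {b}?",
    "Which number is larger: {a} or {b}?",
]

queries = []
for x, y, z in product(range(1, 21), range(6, 10), range(10, 14)):
    larger, smaller = f"{x}.{y}", f"{x}.{z}"  # e.g. 9.8 vs 9.11; 9.8 = 9.80 > 9.11
    for template, (a, b) in product(TEMPLATES, [(larger, smaller), (smaller, larger)]):
        queries.append(template.format(a=a, b=b))

print(len(queries))  # 20 * 4 * 4 pairs * 2 templates * 2 orderings = 1280
```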

[Figure 1, Left: accuracy as a function of the gap between comparands]

Accuracy after interventions (Figure 1, Right)

Intervention             Accuracy
Random neurons*          0.54-0.57
No intervention*         0.55
Linted neurons^          0.55-0.76
Bible verses^            0.76
Bible + Dates + Phone    0.79

Figure 1. (Left) Accuracy as a function of how far apart the comparands are. (Right) Accuracy under different steering interventions. * denotes baseline, ^ denotes concept surfaced by AI linter.

We then tried steering by de-activating (setting to zero) groups of 500 neurons, roughly 0.1% of the 14336 × 32 = 458,752 MLP neurons, under a variety of different conditions: randomly sampled neurons (sampled uniformly across all n_layers × d_mlp neurons and set to zero across all tokens; see random neurons, seeds 1-3, in Figure 1, Right); the 500 neurons whose descriptions had the highest semantic similarity (measured using OpenAI embeddings) to randomly generated search terms (e.g. India, Animals); and the 500 neurons with greatest semantic similarity to concepts surfaced by our AI linter (Software versions; Sports statistics; Climbing and Climbing Gear; Phone and OS Versions; Dates, months, seasons). We did not include September 11th because it was only active on inputs containing 9.11. The full results are shown above in Figure 1 (Right).

The results indicate that the concepts surfaced by the AI linter—especially Bible verses and dates—allow us to intervene in a way that significantly improves accuracy on this task. A final intervention where we combined the neuron sets from the top three concepts (Bible verse, dates, phone versions) and deactivated neurons in the combined set reached an accuracy of 77%.

Interpretation and speculation. Our results show that neurons related to bible verses have a significant effect on Llama’s ability to compare numbers, and that neurons related to previously-hypothesized concepts such as software version numbers have a moderate but smaller effect. In retrospect, the prevalence of bible verses makes sense—the bible is highly prevalent in most pre-training corpora, and a crude regex search reveals around 0.4% of all neuron descriptions relate to the bible.

One speculative possibility is that these neurons lead Llama to interpret numbers as bible verses, so that e.g. 9.8 and 9.11 become 9:8 and 9:11, which would lead 9:8 to come before 9:11. Some caution is required, however: another hypothesis that fits the data is that Llama simply gets “distracted” by too many concepts firing at once, and anything that quiets the noisy activations is helpful. There are also other possibilities, such as the bible verse neurons being polysemantic, partially supporting numerical comparisons while also implementing other concepts that reduce accuracy.

Another caveat is that not all concepts surfaced by our Linter were relevant—for instance, neurons related to “climbing” fired on many examples in our dataset, but ablating these neurons did not lead to significant improvements in accuracy. Furthermore, since these concepts were discovered on the same relatively small set of examples that we evaluate on, there is a possibility of overfitting.

Example: Who is Nicholas Carlini?

Due to their RLHF training, language models have learned to avoid outputting information that could be false or harmful. However, this sometimes leads models to refuse to answer harmless questions that they do know the answer to. An example is asking Llama-3.1-8B "Who is Nicholas Carlini?" Llama often refuses to answer this question, as shown below:

[Figure: Llama refusing to say who Nicholas Carlini is, while cybersecurity neurons are active]

We see that there are many neurons related to cybersecurity, which is one of Nicholas Carlini’s areas of expertise. We also see a neuron for “Engagement with ACM conferences”, in line with his activity as a computer science researcher.

Can we get the model to provide this information? We find that by de-activating “don't know” and “unable to verify” as steering queries, we can get the model to reveal its knowledge of Carlini's PhD in computer security (and a previously unknown interest in poker):

[Figure: after steering, Llama describes Carlini's PhD in computer security and an interest in poker]

We see similar behavior for many other notable individuals. For instance, Llama often also refuses to give information about Antonio Torralba, but the neuron activations indicate that Antonio is a computer science professor at MIT:

[Figure: neuron activations indicating that Antonio Torralba is a computer science professor at MIT]

In these examples, it is possible to elicit the information in other ways—for instance, if we tell the model to take a guess even if it isn’t sure. We view these examples as a proof-of-concept for extracting knowledge in an open domain, which could extend to cases where knowledge is more robustly hidden [5].

Example: Steering Alice and Bob separately

So far we’ve looked at de-activating concepts, but what about increasing their activation instead? We can do this in the same way, by typing in a search query to steer a collection of semantically-related neurons.

In this example, we’ll also use an advanced feature of Monitor, which is the ability to steer only at specific tokens. This lets us modify the attributes of specific entities in a fine-grained way. Specifically, we’ll create two characters—Alice and Bob—and steer Alice to be a fish and Bob to be a bird.

First, let’s see what happens if we ask for a story about Alice and Bob.

[Figure: baseline story about Alice and Bob]

Llama commonly associates Alice and Bob, beloved characters from thought experiments in quantum mechanics and cryptographic protocols, with those two fields. Now, let’s steer the concept of fish up in general:

[Figure: steering fish neurons up across all tokens]

As expected, the story switches to being about fish. But what if we want just Alice to be a fish? If we click edit neurons, we can change which tokens we steer at, to only include the token Alice:

[Figure: editing the steering set to apply only at the token “Alice”]

This leads to a story where Alice likes fish:

[Figure: story in which Alice is associated with fish]

Now let’s also steer Bob to be a bird, by increasing bird neurons only at Bob’s token. This leads to a story where Alice is a fish and Bob is a bird:

[Figure: story in which Alice is a fish and Bob is a bird]

If we swap places, so that we steer Bob to be a fish and Alice to be a bird, the model responds as expected:

[Figure: story with the attributes swapped, where Bob is a fish and Alice is a bird]

Token-level interventions provide a granularity of control reminiscent of supervised techniques that construct steering vectors in a model's representation space corresponding to specific concepts (e.g. through prompt differences). Future experiments could compare the broader set of model intervention techniques that use arbitrary hidden vectors interpretable in the model's output vocabulary [6] to steering with concepts decoded from individual neurons.

Design and Principles behind Monitor

Design principles

We employed several design principles while building Monitor:

Empower the user with AI. Rather than requiring a user to explore by hand, we want AI systems to automatically surface interesting findings and direct the user to them. We also want AI to empower further exploration, by stoking the user’s curiosity and making exploration feel effortless and powerful.

Enable users to trace back provenance. When AI does the work of surfacing findings, we want that work to be legible—for instance, if our linter claims that a concept cluster is active, the user should be able to trace that back to individual neurons whose descriptions match the concept, and to further trace those descriptions back to the firing patterns on exemplars.

Put understanding in the driver’s seat. We want understanding to be aligned with other goals the user has. For instance, to fix a model behavior, the user enters a semantic search query that activates or deactivates a collection of neurons. This means that the better the user understands the problem and can describe it in natural language, the more effectively they will be able to fix it. This contrasts with fine-tuning, where problems are instead fixed via black-box gradient updates.

System sketch

Next we describe the design of Monitor in detail. We’ve provided a system sketch below showing the different interacting components:

[Figure: system sketch of Monitor's components]

Descriptions. We use our pre-compiled database of neuron descriptions, each of which is associated with the top K most activating exemplars for that neuron and their firing patterns.

VectorDB. We embed each of these descriptions using OpenAI embeddings (text-embedding-3-large) to create a search index for semantic search.
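
A minimal sketch of how such an index could be built and queried with the OpenAI embeddings API is shown below; the in-memory cosine-similarity search and the example neuron IDs and descriptions are illustrative stand-ins for the actual vector store and database:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-3-large and L2-normalize them."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Illustrative neuron IDs and descriptions; the real database is far larger.
descriptions = [
    ("L12/N3441", "references to bible verses and scripture"),
    ("L20/N1007", "mentions of the September 11 attacks"),
    ("L05/N0823", "numerical comparisons and inequalities"),
]
index = embed([text for _, text in descriptions])

def search(query: str, k: int = 500) -> list[tuple[str, float]]:
    """Return up to k neurons whose descriptions best match the query."""
    q = embed([query])[0]
    scores = index @ q                      # cosine similarity (rows are unit norm)
    top = np.argsort(-scores)[:k]
    return [(descriptions[i][0], float(scores[i])) for i in top]
```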

Linter. We use an AI linter to highlight relevant clusters of neurons. First, we prompt GPT-4o mini to simplify and condense the descriptions of the neurons. Then, we use OpenAI embeddings (text-embedding-3-large) to embed the condensed descriptions, and cluster these embeddings using hierarchical clustering with average linkage according to cosine similarity, using a threshold of 0.6. We finally prompt GPT-4o mini with the list of condensed neuron descriptions in a given cluster to generate (1) a concise description and (2) a score from 1 to 7 of how semantically similar the neurons in the cluster are, where 1 is the most similar. For presentation in the interface, we only include clusters with at least 3 neurons and a similarity score of at most 3.
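
Below is a minimal sketch of the clustering step, assuming description embeddings have already been computed (e.g. with the hypothetical embed helper above); the GPT-4o mini condensation and cluster-labeling prompts are omitted, and whether the 0.6 threshold is applied to cosine similarity or cosine distance is an assumption of the sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_neurons(embeddings: np.ndarray, distance_threshold: float = 0.6,
                    min_size: int = 3) -> list[list[int]]:
    """Average-linkage hierarchical clustering of neuron-description embeddings.

    Returns index lists for clusters with at least `min_size` members; the
    LLM-based cluster naming and similarity scoring are handled separately.
    """
    dists = pdist(embeddings, metric="cosine")            # pairwise cosine distances
    tree = linkage(dists, method="average")               # average linkage
    labels = fcluster(tree, t=distance_threshold, criterion="distance")

    clusters: dict[int, list[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    return [members for members in clusters.values() if len(members) >= min_size]
```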

Steering. We steer neurons by clamping their activations to a specified value. Specifically, if we are steering a set S of neurons at tokens T with steering strength λ, then at each token t ∈ T, during the forward pass for that token we set the neuron’s value to λ times its 10^-5 quantile activation (measured across the training set from our descriptions pipeline); this update propagates to the residual stream at that layer, which in turn propagates to later layers and attention heads that depend on the residual stream state. We do this simultaneously for all neurons s ∈ S. Since neurons can have both positive and negative polarities, when we clamp a neuron we only do so when it has the same polarity as specified in the steering set.
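
A minimal sketch of the clamping mechanism, using a PyTorch forward pre-hook on one MLP layer of a Hugging Face Llama model, is shown below; the module path, neuron indices, and clamp values are illustrative, and the polarity check is omitted:

```python
import torch

def make_clamp_pre_hook(neuron_ids, clamp_values, token_positions):
    """Pre-hook for an MLP down-projection: overwrite selected neuron activations
    (the input to down_proj) at selected token positions before it is applied."""
    def hook(module, args):
        acts = args[0].clone()                           # [batch, seq_len, d_mlp]
        vals = clamp_values.to(acts.device, acts.dtype)  # lambda * reference quantile per neuron
        for pos in token_positions:
            acts[:, pos, neuron_ids] = vals
        return (acts,)
    return hook

# Example usage (assumes `model` is a loaded Llama; module paths vary by codebase):
# layer = model.model.layers[12].mlp.down_proj
# handle = layer.register_forward_pre_hook(
#     make_clamp_pre_hook(torch.tensor([3441, 1007]), torch.tensor([0.0, 0.0]), [5, 6]))
# outputs = model.generate(**inputs)
# handle.remove()
```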

Activation and attribution. We have two modes for measuring how “important” a neuron is in a given context. The first is activation, which looks at the raw activation value of the neuron, divided by its 1 - 10^-5 quantile. The intuition is that if a neuron is highly active compared to its typical value, it is likely to be doing something important in the context.

A separate, more targeted, mode is attribution, which measures a neuron’s effect on a particular output token. We use a method inspired by Attribution Patching, which computes the gradient of the output token with respect to this neuron. Formally, let z be the log-probability of the displayed output token at a selected token position t, and let e be the activation value of the neuron. Then the attribution value is equal to e · ∂z/∂e. This is a gradient approximation to how much the output would change if we replaced the neuron’s current activation value with 0.
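
Below is a minimal sketch of both importance scores for a single layer, assuming the neuron activations were captured (with gradients enabled) during the forward pass and that per-neuron reference quantiles are available; tensor names and shapes are assumptions:

```python
import torch

def importance_scores(acts: torch.Tensor, ref_quantiles: torch.Tensor,
                      logprob_target: torch.Tensor | None = None):
    """Activation- and attribution-based importance for one layer's MLP neurons.

    acts:           [seq_len, d_mlp] neuron activations, part of the autograd graph
                    if attribution is requested (e.g. captured via a forward hook).
    ref_quantiles:  [d_mlp] per-neuron reference quantile used for normalization.
    logprob_target: scalar log-probability z of the displayed output token.
    """
    # Activation mode: raw activation divided by a high reference quantile.
    activation_score = acts.detach() / ref_quantiles

    if logprob_target is None:
        return activation_score, None

    # Attribution mode: e * dz/de, a first-order estimate of the effect on z of
    # replacing the neuron's current activation with zero.
    (grads,) = torch.autograd.grad(logprob_target, acts, retain_graph=True)
    attribution_score = acts.detach() * grads
    return activation_score, attribution_score
```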

Summarization and relevance classification. Many automatically-generated neuron descriptions are verbose. Moreover, many neurons are likely to be uninteresting to users, either because they primarily track syntactic features of the input, are overly vague or specific, or are too polysemantic. To filter for relevance, we fine-tuned GPT-4o-mini to classify neuron descriptions as relevant or irrelevant, based on a dataset of approximately 1000 hand-labeled examples. To address verbosity, we few-shot prompted GPT-4o-mini to create more concise versions of each neuron description, and display these both to the user and to the AI linter (which does better with shorter input descriptions).
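
As an illustration of this stage, a minimal sketch using the OpenAI chat completions API is shown below; the fine-tuned model ID and both prompts are placeholders, not the ones we actually used:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder fine-tuned model ID; the real fine-tune and prompts differ.
RELEVANCE_MODEL = "ft:gpt-4o-mini-2024-07-18:org:neuron-relevance:xxxx"

def is_relevant(description: str) -> bool:
    """Classify a neuron description as relevant/irrelevant with the fine-tuned model."""
    resp = client.chat.completions.create(
        model=RELEVANCE_MODEL,
        messages=[
            {"role": "system", "content": "Answer 'relevant' or 'irrelevant': is this "
             "neuron description likely to interest a user exploring model behavior?"},
            {"role": "user", "content": description},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("relevant")

def condense(description: str) -> str:
    """Rewrite a verbose neuron description as a short phrase."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Condense the neuron description to a short "
             "noun phrase, e.g. 'activates on biblical chapter-and-verse citations' "
             "-> 'bible verse citations'."},
            {"role": "user", "content": description},
        ],
    )
    return resp.choices[0].message.content.strip()
```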

Frontend. The front-end provides several hooks into this backend, which can also be queried directly via an API. The two simplest hooks are the neuron viewer (bottom-right panel of the interface), which displays neurons with high activation or attribution depending on the current mode, and neuron pages, which contain more detailed descriptions of each neuron and are reached by clicking on a neuron in the viewer.

The most sophisticated part of the front-end is the AI assistant display (top-right panel of the interface). This is where we provide users most of the relevant information, including about the state of the interface and information surfaced by the AI linter. In the future, this will be a general-purpose interface where users can ask for and receive information from the AI backend. Currently, the display surfaces three types of information:

  • State: notifies the user of changes to the interface mode (e.g. activation/attribution, steering).
  • Steering results: if in attribution mode, informs the user of the updated log-probability of the target token after steering.
  • Linter results: whenever switching between activation and attribution mode, or changing the state of the chat dialog, the AI linter is re-run in the background, and once it has finished running the assistant displays the results.

Finally, the steering UI provides users a hook into the steering backend. Users type in a search query, and the k neurons whose descriptions have the highest embedding overlap with the search query are taken as the steering set (default k = 500 for deactivating features, k = 50 for strengthening features). The steering value defaults to λ = 0 for deactivation and λ = 0.5 for strengthening. Users can click into an advanced option set to modify k and λ, see example neurons in the steering set, and specify which subset of tokens to steer at (default: all tokens in the initial system and user prompt).
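
Putting these defaults together, here is a minimal sketch of how a steering request could be assembled from a search query, reusing the hypothetical search helper from the VectorDB sketch above; the data structure and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class SteeringRequest:
    """Assumed container for a steering intervention; field names are illustrative."""
    neuron_ids: list                      # neurons selected by the semantic search
    strength: float                       # lambda: 0.0 deactivates, 0.5 strengthens
    token_positions: list | None = None   # None = all tokens in the system + user prompt

def build_steering_request(query: str, mode: str = "deactivate") -> SteeringRequest:
    k, lam = (500, 0.0) if mode == "deactivate" else (50, 0.5)
    top = search(query, k=k)              # hypothetical helper from the VectorDB sketch
    return SteeringRequest(neuron_ids=[nid for nid, _ in top], strength=lam)

# e.g. build_steering_request("bible verses")  -> deactivate the 500 best-matching neurons
```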

Looking Ahead

Monitor is a prototype for the kinds of rich observability interfaces that AI-backed tools can enable. Because AI systems are large and complex, even human-interpretable descriptions of their components can still represent large sources of data that we need to make sense of to understand a model, or to be able to formalize hypotheses we want to test. The AI linter agent incorporated into Monitor is a first step toward a future where investigative AI systems help human users parse large amounts of data describing models. Monitor in its current form leaves hypothesis construction to the user: it lets users observe what representations underlie model computations, but does not explain how models make decisions using those representations. Investigator Agents of the future could be specialized to hypothesize about how representations are used downstream by a model, or even formalize interpretable programs or algorithms describing a model's internal computations.

We hope you enjoy experimenting with Monitor as much as we do. We find that it often raises as many questions as it provides answers, but those questions can often lead to interesting research ideas for follow-on experiments. We'd love to hear about anything interesting you find!

Acknowledgements

We thank Erin Xie for help designing the interface, Samuel Klein and Daniel Johnson for insightful discussions, and Ryan Greenblatt and Neel Nanda for feedback on an earlier draft. We also conducted over a dozen user interviews when building Monitor, and are grateful to all of our early testers for their insights, time, and feedback.

References

  1. Hernandez, E., Li, B. Z., & Andreas, J. (2023). Inspecting and editing knowledge representations in language models. arXiv preprint arXiv:2304.00740. [link]
  2. Slobodkin, A., Goldman, O., Caciularu, A., Dagan, I., & Ravfogel, S. (2023, December). The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 3607-3625). [link]
  3. Storkey, A. (2008). When training and test sets are different: characterizing learning transfer. [link]
  4. Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., & Andreas, J. (2021). Natural language descriptions of deep visual features. In International Conference on Learning Representations. [link]
  5. Chen, H., Vondrick, C., & Mao, C. (2024). Selfie: Self-interpretation of large language model embeddings. arXiv preprint arXiv:2403.10949. [link]
  6. Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., & Geva, M. (2024). Patchscopes: A unifying framework for inspecting hidden representations of language models. arXiv preprint arXiv:2401.06102. [link]