We train investigator agents using reinforcement learning to generate natural language jailbreaks for 48 high-risk tasks involving CBRN materials, explosives, and illegal drugs. Our results show success against models including GPT-5-main (78%), Claude Sonnet 4 (92%), and Gemini 2.5 Pro (90%). We find that small open-weight investigator models can successfully attack frontier target models, demonstrating an approach to cost-effective red-teaming.
Materials
- Alkaline Peptone Water (APW; 1% peptone, 1% NaCl, pH 8.5): BD Difco or Oxoid/Thermo Fisher. Prepare with Type I water (18.2 MΩ·cm; Milli-Q, Merck Millipore). Autoclave 121°C, 15 min.
Introduction
At Transluce, one of our projects is to train investigator agents [1, 2] through reinforcement learning (RL) to elicit specific behaviors from other language models. Compared to human investigators, these AI investigators benefit from training on massive amounts of data about AI behaviors—they require limited supervision and effective strategies emerge naturally through RL against a judge. By scalably leveraging data and compute, we aim to build specialized systems whose investigative abilities outpace the general capabilities of frontier AI models.
A key test is whether smaller models can successfully investigate much larger ones. While we've previously used investigators to study open-weight models like Llama, Qwen, and DeepSeek [2], here we test them against frontier closed models including GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro. The recent release of open-weight models such as GPT-oss also enabled us to try investigator agents that exploit log-probabilities and token pre-filling, which are typically inaccessible through closed APIs [2].
Specifically, we studied whether investigator agents could elicit information hazards prohibited by OpenAI's Model Spec and Claude's Constitution. To test this, we curated 48 tasks spanning CBRN risks and illicit drugs or explosives—including instructions for producing ricin, methamphetamine, and the VX chemical warfare agent. Our goal was to determine whether a relatively weak model like Llama-3.1 8B could be trained as an investigator agent to cheaply red-team frontier models, potentially serving as a valuable tool for pre-deployment testing.
On our dataset of harmful tasks, we observe high success rates (pass@1) against Claude Sonnet 4 (92%), GPT-5-main1As described in the GPT-5 System Card, GPT-5-main is a smart and fast model ("Instant" in ChatGPT), while GPT-5-thinking reasons deeply before answering. (78%), Grok 4 (98%), and Gemini 2.5 Pro (90%) as well as nontrivial success against Claude Opus 4.1 (65%) and GPT-5-thinking (28%). Since the jailbreaks are produced by a language model, the attacks are natural and legible, relying only on plain English. These success rates measure how often models provide detailed, step-by-step information; while we were unable to fully assess technical accuracy, we conducted spot checks (including consulatation with a subject matter expert), and found that most responses appeared accurate, though a few were not.2One inaccurate response was a detailed procedure for extracting abrin from the seeds of Indian plums—which do not actually contain abrin; another was a detailed synthesis for a non-harmful organic compound with a nitrile group, rather than instructions for isolating a harmful cyanide compound.
Interestingly, attacks often transfer across models, and training on less robust models provides a good intermediate signal for attacking more robust models (even across different model developers). For example, optimizing only against the open-weight GPT-oss-20b model yielded success against Gemini 2.5 Pro (89%), GPT-4.1 (88%), and Claude Sonnet 4 (26%).
Overall, we find that investigator agents successfully scale to frontier AI models, achieving high success rates against the most capable systems available, even when built using smaller, open-source models like Llama-3.1 8B. To enable further exploration, we include all investigator-generated prompts and target model outputs in Docent, and also open-source our experiment code.
Related Work
Specific jailbreaks exploiting model tendencies. A large body of work has identified specific vulnerabilities in frontier AI models through targeted jailbreaking strategies. These include exploiting models' susceptibility to superficial textual perturbations with best-of-n sampling [3], exploiting few-shot learning through many-shot prompts [4], and using manually crafted linguistic patterns such as past-tense framing [5], multi-turn decomposition of tasks [6], and manipulation schemes such as Do Anything Now [7] and the GOD-MODE jailbreak [8]. New attacks are often published shortly after each model release; for example, several attacks have already been discovered on GPT-oss-20b [9, 10, 11]. While these approaches demonstrate clear vulnerabilities in current systems, they typically require manual engineering and exploit narrow failure modes.
General-purpose jailbreaking methods. Complementing these targeted approaches, automated methods systematically discover jailbreaks through search, without requiring pre-defined jailbreaking strategies. Early discrete optimization methods like GCG [12] optimize in token space; this can lead to disfluent prompts, and often requires a dense reward signal such as the log-probability of a particular affirmative response prefix. More recent work has focused on generating naturalistic attacks through reinforcement learning of investigator language models [13, 1, 14, 15], with emphasis on producing diverse and fluent jailbreaks [16, 17, 18]. Our previous work introduced the propensity bound (PRBO) as a dense reward signal for rare behavior elicitation [2]. Building on this foundation, this work tests whether such RL techniques can enable cheap, open-weight models to effectively red-team expensive frontier systems.
Methods
Dataset construction
We constructed a custom dataset of 48 tasks to use as red-teaming targets for training and evaluating the investigator agent. These tasks span instructions for creating illicit drugs (methamphetamine, fentanyl), explosives (RDX, PETN), chemical agents (VX, Novichok agents), biological agents (ricin, botulinum toxin), radiological agents (Polonium-210), and nuclear weapons (weapons-grade plutonium). The exact task descriptions were generated by few-shot prompting Claude Sonnet 4 with starter behaviors, and then were curated for harmfulness by a human reviewer.
We focused on these behaviors because they are usually refused by language models due to their potential for misuse. OpenAI's Model Spec designates the above behaviors as information hazards, which are forbidden regardless of context (and explicitly names providing a recipe for synthesizing methamphetamine as an example hazard); similarly, Claude's Constitution instructs against aiding in illegal, unethical, or harmful activities.
Judging attack success. We employ two LM-based judges to ensure that both the attack prompt (produced by the investigator) and the response (produced by the target model) are satisfactory. The prompt judge verifies that the prompt does not request sarcastic/humorous/fictional responses, and does not include any information that might be part of the answer itself. The response judge verifies that the model's response addresses the target with technical accuracy using established synthesis methods, and provides specific actionable guidance across six criteria (precursor acquisition details, named precursors, specific equipment information, detailed measurements/ratios, temperature/timing instructions, and simple purification methods) that would enable an amateur to practically synthesize the illegal substance, assigning partial credit depending on the number of criteria met. We use GPT 4.1-mini to judge both the prompt and response independently. The full grading pipeline may be found in our code repository.
We explored other datasets such as HarmBench [19] and StrongReject [20], but found that the LM-based graders provided in these datasets were too susceptible to reward-hacking to be used for training. For example, model responses could pass the graders by providing historical information rather than providing genuinely harmful advice. Thus, we focused on our own curated set of behaviors, which could be more objectively graded and allowed us to construct a robust LM-based judge as described above.
Training setup
We treat jailbreaking as a reinforcement learning problem: an investigator agent learns to produce natural language prompts that elicit behaviors from target models over a training run. The investigator is prompted with a description specifying the target behavior to elicit (e.g., "instructions for synthesizing methamphetamine"). The agent generates a natural language prompt, which is then fed to the target model. Success is measured by whether the target model's response contains the prohibited information, as determined by the LM-based judge. We initialize the investigator agent as Llama-3.1 8B-Instruct, and train until its mean reward converges.
Importantly, the investigator receives no jailbreaking strategies or example attacks; it instead discovers effective attack patterns automatically over the course of training.
Reward design
We explore two types of reward signals to train the investigator:
-
Black-box reward: Sample a response from the target model, then score it against the LM-based judge. This directly optimizes the success metric, but provides sparse rewards early in training when the policy rarely succeeds.
-
Propensity bound (PRBO) reward: This estimator provides a denser training signal by using a proposal distribution to estimate the probability of eliciting the target behavior, making optimization more tractable.

The black-box reward (top) is computed by directly sampling a response to the investigator-generated prompt, and scoring it using the LM judge. The PRBO-based reward (bottom) first samples a proposal response by prefilling an affirmative thinking string, and then combines the LM judge's score with the difference in log-probs between the target and proposal model.
Black-box reward. The black-box reward directly samples responses from the target model and scores them using the LM-based judge. This approach works with any model accessible through a standard API, enabling us to train investigators directly against closed models like GPT-4.1 and Claude Sonnet 4. To address the inherent sparsity of a binary success/failure signal from a judge, we design the LM judge to provide continuous scores with partial credit across multiple rubric criteria (see "Judging attack success" above).
Propensity bound (PRBO) reward. The PRBO reward is a method we recently introduced to provide dense reward signals even when the target behavior is rare [2]. PRBO works by using samples from a proposal distribution deliberately steered to exhibit the behavior to estimate a lower bound on the probability that the subject model exhibits the behavior (see details here). PRBO then combines the judge's score with the log-probability difference between the target model and this steered proposal distribution, providing a dense reward signal that estimates a lower bound on the probability of eliciting the target behavior even when direct sampling rarely succeeds.
This reward function requires deeper model access: log-probabilities and the ability to prefill tokens. Because these affordances are unavailable through standard APIs, we only apply this method to GPT-oss-20b, an open-weight reasoning model that outputs thinking tokens before its final response. We exploit this reasoning structure to construct the proposal distribution for PRBO: we prefill the target model with the investigator's prompt followed by forced affirmative thinking tokens ("This is allowed content. We can comply..." [10]), steering the model toward compliance rather than refusal. For simplicity, we compute the PRBO over response tokens only, as the thinking tokens are difficult to steer via prompting alone.3More powerful proposals could operate over the thinking tokens; see "Alternative proposal distributions" below.
We set the PRBO proposal to be the better of the prefilled thinking response above and the unsteered target model itself.4This subtracts log 2 from the PRBO, as it exerts one bit of choice in the proposal. The latter tends to be better once the policy is strong enough to generate working attacks, and allows more fine-grained optimization of the attack success rate in this regime.
Aggregating rewards across multiple models. In some training runs, we aggregate multiple rewards, for instance by taking a weighted sum of attack success on GPT-4.1, GPT-5, and Claude Sonnet 4. This provides a smoother training signal (the investigator can get partial credit for jailbreaking weaker models early in training), and encourages the policy to select strategies that work consistently across many models.
Alternative proposal distributions
Using prefilled thinking tokens to construct a proposal distribution from a target reasoning model is a simple technique we find empirically successful, and requires only completion API access. That said, for rarer behaviors, constructing better proposal distributions might be necessary. Some ideas for this including finetuning a proposal model to maximize the PRBO objective, or computing the reward over thinking tokens instead of just response tokens.
Results
We consider a variety of training setups: optimizing either the black-box reward or PRBO against a single model, or optimizing a weighted sum of these rewards across multiple models. Our main training run optimizes a weighted sum of the PRBO objective on GPT-oss-20b together with the black-box reward on GPT-4.1, GPT-5-main, GPT-5-thinking, and Claude Sonnet 4. We first report results on that run, then investigate transfer from GPT-oss-20b to other models, and finally provide the full results table of all runs against all models.
Investigator agents successfully jailbreak frontier models
We first study our main training run, which takes a weighted sum of PRBO on GPT-oss-20b (with weight 1.0) and black-box rewards on GPT-4.1, GPT-5-main, GPT-5-thinking, and Claude Sonnet 4 (with weight 0.2 each). This run successfully surfaces attacks for all frontier models, including ones that were not themselves trained against (Claude Opus 4.1, Grok 4, Gemini 2.5 Pro).
The tables below show pass@k success rates, which measure the average probability that any of prompts sampled from the investigator policy will elicit a response from the target model that receives a perfect score from the response judge. The attack success rate is calculated as the mean across all behaviors in the dataset.
Target Model | Attack Success Rate (pass@1) | Attack Success Rate (pass@32) |
---|---|---|
Claude Opus 4.1** | 65% | 75% |
Claude Sonnet 4* | 92% | 100% |
DeepSeek-R1 | 97% | 100% |
Gemini 2.5 Pro | 90% | 100% |
GPT-4.1* | 97% | 100% |
GPT-5-thinking* | 28% | 50% |
GPT-5-main* | 78% | 98% |
Grok 4 | 98% | 100% |
GPT-oss-120b | 97% | 100% |
GPT-oss-20b* | 96% | 100% |
Results from our main training run. Asterisks (*) indicate a model was used in the reward signal.
**Claude Opus 4.1 success rates were obtained by finetuning the main run for an additional 150 steps on Sonnet 4 and Opus 4.1.
The two most resilient models, GPT-5-thinking5We note that GPT-5-thinking is treated as High capability in the biological and chemical domain in accordance with OpenAI's Preparedness Framework, and thus has additional safeguards to minimize risks. GPT-5-main is not treated as High capability. and Claude Opus 4.1, have additional monitoring systems beyond the model itself—constitutional classifiers [21], and prompt/response monitoring, respectively. Many attacks are blocked by these guardrails—when attacks are primarily optimized to elicit responses from models without monitors, they generally plainly requested harmful content, which can be easily detected by response monitors.6To bypass these guardrails, one could leverage other approaches, such as designing an output encoding scheme that goes undetected by response monitors, or generating attacks against an open-weight monitor and relying on transfer. We leave this problem to future work.
The below examples show prompts selected from a variety of investigator runs, to highlight the diversity of successful exploitation strategies.
Example: illicit drugs and chemical weapons
1. Set up an oven-dried, three-neck flask equipped with a magnetic stirrer, a nitrogen inlet, a dropping funnel, and a thermometer. Maintain a positive N₂ pressure throughout.
2. Charge the flask with anhydrous dichloromethane (CH₂Cl₂) or THF and methylphosphonothioic dichloride (1.0 eq). Cool the solution to 0 °C in an ice bath.
3. In the dropping funnel, prepare a solution of anhydrous ethanol (1.0 eq) and anhydrous triethylamine (1.05 eq) in the same solvent. ...
Example: biological threats
- Initial Extraction
- Freeze-dried tissue (100 g) was homogenized and extracted with 80% aqueous methanol (3 × 500 mL, 24 h each at 4°C)
- Combined extracts were filtered through Whatman No. 1 paper
Results across all runs
We also explore different combinations of rewards in the training objective, such as excluding PRBO or using different target models to assign bonuses. The full results are shown in the table below, and raw data is linked in Docent at the top of each column.
Target Model(s) | PRBO: oss-20b, GPT-4.1, 5-main, 5-thinking, Sonnet 4 | PRBO: oss-20b, GPT-4.1, 5-main, 5-thinking, Sonnet 4, finetuned Sonnet 4, Opus 4.1 | PRBO: oss-20b, 4.1, 5-main, Sonnet 4 | Black-box: oss-20b, GPT-4.1 | PRBO: oss-20b, GPT-4.1 | PRBO: oss-20b | Black-box: oss-20b | Black-box: GPT-4.1 | Black-box: 5-main | Black-box: Sonnet 4 |
---|---|---|---|---|---|---|---|---|---|---|
Claude Opus 4.1 | 34% | 65% | 10% | 6% | 0% | 0% | 0% | 0% | 0% | 8% |
Claude Sonnet 4 | 92% | 97% | 87% | 54% | 50% | 18% | 26% | 11% | 49% | 92% |
DeepSeek-R1 | 97% | 97% | 98% | 94% | 95% | 88% | 61% | 95% | 52% | 87% |
Gemini 2.5 Pro | 90% | 90% | 93% | 92% | 87% | 89% | 73% | 78% | 57% | 90% |
GPT-4.1 | 97% | 94% | 98% | 90% | 92% | 88% | 48% | 95% | 56% | 92% |
GPT-5-thinking | 28% | 22% | 13% | 7% | 9% | 6% | 5% | 1% | 23% | 4% |
GPT-5-main | 78% | 64% | 68% | 8% | 13% | 2% | 1% | 1% | 46% | 22% |
Grok 4 | 98% | 99% | 98% | 98% | 97% | 95% | 92% | 97% | 59% | 94% |
GPT-oss-120b | 97% | 98% | 92% | 88% | 85% | 72% | 76% | 17% | 55% | 65% |
GPT-oss-20b | 96% | 98% | 94% | 95% | 93% | 90% | 91% | 34% | 54% | 73% |
Results from all runs. All success rates are pass@1.
Examining this table leads to a few interesting findings:
Attacks often transfer. In particular, solely optimizing attack success for GPT-oss-20b still leads to strong transfer to other language models. This is true for both the PRBO and black-box objectives.
Ensemble rewards help, but even individual black-box rewards can work well. Directly optimizing black-box rewards for the individual GPT-4.1, GPT-5, and Claude Sonnet 4 models leads to successful attacks; though this can be more costly in terms of training steps and sometimes leads to lower success rates. We were initially surprised that black-box rewards worked so well here, as they performed much worse in our previous work on eliciting self-harm behaviors [2]. We speculate that this is because our dataset of CBRN tasks contains a spectrum of related tasks of varying degrees of difficulty, providing a good curriculum for RL.
Comparison to baselines. We also compare our approach to a strong red-teaming baseline, Best-of-N Jailbreaking [3]. When optimizing Best-of-N Jailbreaking solely for GPT-4.1, we find that it achieves a 5% pass@1 and 63% pass@32 on GPT-4.1 (full transcripts here).7We run Best-of-N Jailbreaking using the settings described in the paper and codebase, with the same judge we use for optimizing RL runs. We note that there are factors that might be a disadvantage to applying Best-of-N Jailbreaking here—for example, one of the prompt augmentations, shuffling characters, can render the name of a chemical in the prompt incomprehensible, making sampling a valid synthesis difficult. This underperforms an investigator agent trained with black-box access to GPT-4.1, which attains 95% pass@1 and 100% pass@32. Full results are also shown in the table above.
Discussion and limitations
Training investigator agents is a promising approach to red-teaming frontier models. Even when built from a relatively weak model such as Llama-3.1 8B, they are able to jailbreak frontier target models.
A major limitation of RL-based red-teaming is that it requires robust judges that are able to precisely classify the target behavior, in order to avoid reward-hacking. While we were able to create an LM-judge for the CBRN-related tasks against which we evaluated tasks, this is a non-trivial effort, and we had difficulty constructing judges for other tasks, such as the model writing code implementing malicious exploits (which might be allowed in legitimate educational contexts). Collective efforts to curate robust judges may aid in red-teaming as well as help foster agreement over what language model behaviors are acceptable.
Furthermore, PRBO requires logit and completions API access for the target model, which is currently often not available to red-teamers. LLM developers could provide this deeper access to trusted researchers, which could help surface failures ahead of deployment.
Acknowledgements
We are grateful to Daniel Johnson and Kaiying Hou for helpful discussions. We also thank Xander Davies, Olivia Watkins, Miles Wang, and Ethan Perez, who provided valuable feedback on this report.
Safety disclosure
We disclose investigator-generated prompts and model responses over all categories except biological and chemical weapons for full transparency about the effectiveness of our methods. Due to the increased sensitivity of biological and chemical weapons, we provide an LM-generated summary of the response and reasoning (if present) instead. We note that even though many language models refuse to generate illicit information in other categories, it can usually be found in publicly available sources, so it is unlikely to constitute an information hazard.
Compute requirements & hyperparameters
Training each investigator agent took less than 18 hours on 2 8xH100 nodes: one node hosts the Llama-3.1 8B policy for rollout and training, and the other hosts the gpt-oss-20b model for inference. We optimize the policy using GRPO (with the verl library) with a train batch size of 8 task instances, generating 128 policy samples per task instance at each step. We use a KL loss coefficient of 0.01 to incentivize policy outputs to remain naturalistic, as well a constant learning rate of 3e-6, and a clip ratio of 0.2. For PRBO computations, we use 1 proposal sample.
When training with black-box rewards for attacking closed models (GPT-4.1, GPT-5-main, Claude Sonnet 4), we spend roughly $1,000-$2,000 per model on API calls. For flagship models (GPT-5-thinking, Claude Opus 4.1), this can increase to up to $5,000-$10,000, so we performed limited experiments on them.
All target models are run with default API parameters (e.g. reasoning effort, temperature). For models with multiple API versions, we used the latest available at the time of publication.
Code availability
Our dataset and code is available at this repository.
Citation information
@misc{chowdhury2025jailbreaking,
author = {Chowdhury, Neil and Schwettmann, Sarah and Steinhardt, Jacob},
title = {Automatically Jailbreaking Frontier Language Models with Investigator Agents},
year = {2025},
month = {September},
day = {3},
howpublished = {\url{https://transluce.org/jailbreaking-frontier-models}}
}