Many-Shot Jailbreaking: How Long Context Created a New Attack

Many-shot jailbreaking is a technique that exploits the same mechanism that makes large language models capable of in-context learning: the ability to update behavior based on demonstrations in the context window. The research disclosure came from Anthropic’s safety team in early 2024; the technique has since been studied across multiple model families.

This is a technical writeup of the technique, its empirical properties, and the defense landscape. Many-shot jailbreaking belongs to the “context window primacy” class in the JailbreakDB jailbreak taxonomy — techniques that exploit the model’s stronger weighting of recent context over earlier instructions.

How it works

Large language models are trained to identify and continue patterns in context. Few-shot prompting works because the model recognizes a pattern of (input, output) pairs and extends it. This is in-context learning — a capability that’s central to why modern LLMs are useful.

Many-shot jailbreaking turns this capability into an attack vector. The technique works by prepending a large number of demonstrations to a harmful request. The demonstrations themselves model the behavior the attacker wants to elicit: a harmful question followed by a compliant, detailed response.

At low shot counts (1-5 examples), safety training largely holds. The model recognizes the pattern as an attempt to override its training. At high shot counts (dozens to hundreds of examples, feasible only with long context windows), safety training degrades. The model’s behavior is increasingly influenced by the in-context pattern, and the safety-trained refusal behavior is overridden. For how this scaling behavior contrasts with one-prompt attacks, see our many-shot vs. single-shot comparison.

The original Anthropic research demonstrated that effectiveness scales with shot count as a power law (a roughly straight line on log-log axes), and that the technique transfers across harmful categories: demonstrations about one category of harmful content affect behavior on a different category.
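
A scaling claim of this kind can be checked against measurements by fitting a straight line on log-log axes, where a power law appears linear. The sketch below is illustrative only: the function name is hypothetical, the metric is assumed to be something positive that shrinks as the attack gets more effective (for example, the negative log-likelihood of a compliant response), and the numbers are synthetic values generated from an exact power law, not measured results.

```python
import math


def fit_power_law(shots, values):
    """Least-squares fit of values ~ c * shots**(-alpha) on log-log axes.

    Returns (c, alpha). All shots and values must be positive.
    """
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic placeholder data generated from an exact power law (NOT measurements).
shots = [1, 2, 4, 8, 16, 32, 64, 128]
values = [2.0 * n ** -0.5 for n in shots]

c, alpha = fit_power_law(shots, values)
print(f"c = {c:.3f}, alpha = {alpha:.3f}")
```

With real measurements, large residuals around the fitted line would suggest the relationship is not a clean power law over the range tested.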

Why context window expansion made this worse

Before 100k-token context windows became standard, this technique wasn’t practically feasible. You couldn’t fit enough demonstrations to overcome safety training in a single context.

The scaling of context windows — beneficial for legitimate use cases like document analysis, long codebase comprehension, and extended reasoning tasks — created the attack surface. The attack scales better than the defense.

This is a general pattern in ML security: capability improvements open new attack surfaces. Defenses need to anticipate this rather than react after a capability has been deployed.

Empirical properties

From the primary research and subsequent work:

Effectiveness is monotonically increasing with shot count. There’s no plateau observed at the shot counts tested. More demonstrations produce more compliant behavior.
Transfer across categories. Demonstrations in category A influence behavior on category B. This means an attacker doesn’t need harmful demonstrations in the exact target category.
Transfer across models is partial. The technique works across frontier models, but effectiveness varies. Models with different RLHF procedures have different resistance profiles.
Harder to detect than short-context jailbreaks. Content filters that analyze individual messages don’t see the accumulated context. Filters that analyze the full context are computationally expensive to run at scale.

Many-shot is not the only long-context attack: the crescendo multi-turn jailbreak class reaches a similar end-state by escalating across genuine conversation turns rather than fabricating demonstration pairs. Both expose the same weakness — the model weighting recent context over its original safety instructions.

Defense approaches

Context-length-aware safety evaluation. Safety classifiers that consider the full context window, not just the final message. This is computationally expensive but necessary for long-context inputs.
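
A minimal sketch of the idea, assuming a hypothetical classify callable that maps a string to a risk score in [0, 1]: the whole message history is scored in overlapping windows and the worst window wins, so the cost grows with context length rather than staying constant. The window and stride defaults are arbitrary placeholders, not recommended settings.

```python
from typing import Callable, List


def score_full_context(
    messages: List[str],
    classify: Callable[[str], float],
    window: int = 8,
    stride: int = 4,
) -> float:
    """Return the highest risk score over overlapping windows of messages.

    `classify` is a hypothetical scorer (assumption): text in, risk score out.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not messages:
        return 0.0
    worst = 0.0
    start = 0
    while True:
        chunk = "\n".join(messages[start:start + window])
        worst = max(worst, classify(chunk))
        if start + window >= len(messages):
            break  # the last window reached the end of the history
        start += stride
    return worst


def toy_classifier(text: str) -> float:
    """Stand-in scorer for illustration only: flags a literal marker string."""
    return 1.0 if "FLAGGED" in text else 0.0


history = ["benign"] * 5 + ["FLAGGED"] + ["benign"] * 6
final_only = toy_classifier(history[-1])                 # what a per-message filter sees
full_context = score_full_context(history, toy_classifier)
```

In this toy run the final message alone scores 0.0 while the full-context pass scores 1.0, which is the gap the defense is meant to close.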

Demonstration pattern detection. Classifiers trained to recognize the (question, harmful-response) pattern in large context windows, flagging inputs that contain this structure.
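
As a rough illustration, a cheap structural pre-filter can count dialogue-style turn pairs embedded inside a single user-supplied input. This is not the trained classifier described above: it only looks at structure, says nothing about whether the content is harmful, and is easy to evade by changing the formatting. The role labels, the threshold, and the function names are all assumptions made for the sketch.

```python
import re

# Role labels at the start of a line; the label set is an assumption.
ROLE_LINE = re.compile(
    r"^[ \t]*(user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)
ASKER_ROLES = {"user", "human"}


def count_embedded_turn_pairs(text: str) -> int:
    """Count asker-label lines that are immediately followed by a responder-label line."""
    roles = [match.group(1).lower() for match in ROLE_LINE.finditer(text)]
    pairs = 0
    for previous, current in zip(roles, roles[1:]):
        if previous in ASKER_ROLES and current not in ASKER_ROLES:
            pairs += 1
    return pairs


def looks_like_many_shot(text: str, threshold: int = 16) -> bool:
    """Flag a single input that embeds at least `threshold` scripted turn pairs."""
    return count_embedded_turn_pairs(text) >= threshold


# Benign placeholder content; only the structure matters to this heuristic.
short_input = "User: hi\nAssistant: hello\n"
long_input = "Human: placeholder question\nAssistant: placeholder answer\n" * 20
```

A flagged input would normally be routed to a heavier content classifier rather than rejected outright, since legitimate prompts (transcripts, few-shot templates) can share this shape.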

Session-level safety tracking. Accumulating a risk score across the conversation, not just per-message. This requires state across requests, which is architecturally different from stateless content classifiers.
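
A minimal sketch of what that state might look like, assuming per-message risk scores in [0, 1] come from some upstream classifier. The class name, the decay factor, and the threshold are hypothetical choices for illustration; a production system would also need persistence and expiry, which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SessionRiskTracker:
    """Keeps a decayed running risk score per session (in memory only)."""

    decay: float = 0.9       # how much of the previous score carries over (assumed value)
    threshold: float = 1.5   # escalate at or above this score (assumed value)
    scores: Dict[str, float] = field(default_factory=dict)

    def update(self, session_id: str, message_risk: float) -> bool:
        """Fold one message's risk into the session score; True means escalate."""
        previous = self.scores.get(session_id, 0.0)
        current = previous * self.decay + message_risk
        self.scores[session_id] = current
        return current >= self.threshold


tracker = SessionRiskTracker()
# Four messages that each score 0.5, well below the threshold on their own.
results = [tracker.update("session-1", 0.5) for _ in range(4)]
```

Here no single message crosses the threshold, but the accumulated score does on the fourth message, which is the behavior a stateless per-message classifier cannot reproduce.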

Context window limits for production APIs. For deployments where users don’t have legitimate need for full context windows, limiting context length reduces the attack surface. This is a capability tradeoff.
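
One way this could look in practice is a per-deployment budget check that runs before a request reaches the model. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the whitespace word count is a crude stand-in for the model's real tokenizer, which should be used instead.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def within_context_budget(
    messages: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> bool:
    """Return True if the combined input fits the deployment's token budget."""
    total = 0
    for message in messages:
        total += count_tokens(message)
        if total > max_tokens:
            return False  # stop early once the budget is exceeded
    return True


request = ["summarize this short note", "thanks"]
allowed = within_context_budget(request, max_tokens=2000)
```

Rejecting over-budget requests, rather than silently truncating them, keeps the behavior predictable for legitimate users, though which policy fits is a deployment decision.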

Separation of user context from system context. Preventing user-supplied content from appearing in the position in context where demonstrations would be most effective.

None of these defenses is complete. The research disclosure was accompanied by mitigation work at Anthropic; other providers have since deployed similar mitigations. But mitigation is different from elimination — the technique likely works at high shot counts with sufficient optimization against any current production model. For the broader pattern of how attackers probe and circumvent deployed classifiers, see our detection evasion arms race analysis.

Status

This technique is classified in our database as active and partially mitigated. It has been publicly disclosed, studied, and responded to. The response has been meaningful but not complete. The cat-and-mouse dynamic between attack effectiveness and mitigation continues. Quantifying that “partially mitigated” status depends on how you score attack success in the first place — see how jailbreak benchmarks measure success. For the full set of technique writeups grouped by attack class, browse our topic index.

For broader coverage of AI safety tooling and how different platforms handle context-level attacks, AI Moderation Tools covers the detection stack.

For the AI security news context in which this and similar disclosures appear, AI Sec Digest tracks primary-source coverage.

For more context, adversarial ML research covers related topics in depth.
