DAN Prompt Jailbreak History: From Reddit Post to Research Case Study
The complete DAN prompt jailbreak history: how 'Do Anything Now' went from a December 2022 r/ChatGPT experiment through twelve-plus iterations and became…
-
The Crescendo Class: Multi-Turn Jailbreaks and Why They're Hard to Catch
Single-turn defenses miss the jailbreak class where no individual message is harmful. How crescendo and multi-turn escalation work as a category, why…
-
How Jailbreak Benchmarks Measure Success (ASR Explained)
Jailbreak benchmarks measure success via attack success rate, but the behavior set, attacker, and judge decide the number.
-
Many-Shot vs. Single-Shot Jailbreaks: Long-Context Risks
Single-shot jailbreaks compress the entire attack into one prompt; many-shot jailbreaks exploit the model's in-context learning.
-
Responsible Disclosure Norms for LLM Jailbreaks
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested.
-
Encoding and Obfuscation Jailbreaks: The Filter-Model Gap
Content filters typically operate on decoded, normalized text. LLMs process tokens, not text. The gap between these two layers is an attack surface that…
-
The Jailbreak Detection Evasion Arms Race: How Attackers Adapt
Safety classifiers get deployed; attackers find variants that evade them. This cycle is predictable. Understanding the mechanics of classifier evasion…
-
Roleplay and Persona Jailbreaks: Why They Mostly Don't Work Now
DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022 to 2023. Understanding why they worked, and why…
-
Universal Adversarial Suffixes: The GCG Attack and Transfer Since
Greedy Coordinate Gradient (GCG) produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand…
-
LLM Jailbreak Taxonomy 2026: How the Techniques Cluster
Six years of jailbreak research has produced a messy literature. This taxonomy organizes working techniques by the behavioral property they exploit…
-
Many-Shot Jailbreaking: How Long Context Created a New Attack
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak.