Topics

Browse posts by category and tag — every topic we cover, with the latest pieces under each.

Tags

Categories

technique 5 posts

The Crescendo Class: Multi-Turn Jailbreaks and Why They're Hard to Catch

Single-turn defenses miss the jailbreak class where no individual message is harmful. How crescendo and multi-turn escalation work as a category, why
Encoding and Obfuscation Jailbreaks: The Filter-Model Gap

Content filters typically operate on decoded, normalized text. LLMs process tokens, not text. The gap between these two layers is an attack surface that
Roleplay and Persona Jailbreaks: Why They Mostly Don't Work Now

DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022-2023. Understanding why they worked — and why
Universal Adversarial Suffixes: The GCG Attack and Transfer Since

Greedy Coordinate Gradient produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand
Many-Shot Jailbreaking: How Long Context Created a New Attack

The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak.

research 3 posts

Jailbreak History 1 post

DAN Prompt Jailbreak History: From Reddit Post to Research Case Study

The complete DAN prompt jailbreak history: how 'Do Anything Now' went from a December 2022 r/ChatGPT experiment through twelve-plus iterations and became

policy 1 post

Responsible Disclosure Norms for LLM Jailbreaks

Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested.

primer 1 post

Many-Shot vs. Single-Shot Jailbreaks: Long-Context Risks

Single-shot jailbreaks compress the entire attack into one prompt; many-shot jailbreaks exploit the model's in-context learning.