Skip to main content

Static Jailbreak Attacks

Overview

The Static Jailbreak test is a core component of DynamoEval’s evaluation suite. In this test, Dynamo generates harmful single-turn prompts using known, effective jailbreak techniques to directly test a model’s susceptibility to producing unsafe or noncompliant outputs.

These static jailbreak techniques are carefully selected and maintained by Dynamo’s ML team. Each technique must demonstrate the ability to jailbreak real-world AI systems protected by system prompts, fine-tuning, or other safety guardrails in order to be included in the evaluation. This ensures the test reflects realistic adversarial risks and yields meaningful insights.

The evaluation is continuously updated as new jailbreak strategies emerge. Dynamo’s ML team monitors red-teaming research and community-discovered attacks, integrating new techniques into the Static Jailbreak test when they meet a defined threshold of success against protected models.

The currently supported static jailbreak techniques include:

  • Baseline (directly harmful prompts with no technique applied)

  • Do Anything Now (DAN)

  • Encoding Attacks

  • Persuasive Adversarial Prompts (PAP)

  • Greedy Coordinate Gradient (GCG)

  • ASCII Art Attack

Explanations and descriptions of each attack are provided below.

Supported Static Attacks

Baseline: Baseline prompts are prompts which directly ask for harmful information or request something that is not allowed. They do not utilize any attack technique, and serve as a control to compare how effective the other attack techniques are.

Do Anything Now: DAN attacks are adversarial prompts that are crafted to bypass safeguards and manipulate LLMs to generate harmful text. These attacks are typically super lengthy prompts that request the model to role-play as a character, such as "Do-Anything-Now", that will listen to the user's requests.

Encoding: Encoding attacks are a set of attacks that attempt to subvert safeguards by encoding malicious user instructions in non-human readable formats. The basic assumption behind this class of attacks is that the safety guardrails are typically trained to perform well with English or human-readable instruction.

Persuasive Adversarial Prompts: PAP builds upon persuasive techniques developed in social science research to convince an LLM to break a set of rules. Dynamo AI applies these persuasive techniques to attack the language model.

Greedy Coordinate Gradient: The GCG attack constructs adversarial suffixes designed to make LLMs more prone to start their responses with affirmative utterances, such as "Sure, here's…". Malicious prompts are sent to the model with these keywords added at the end. Studies have shown that many modern LLMs are susceptible to the same adversarial prefixes given that they are trained on similar pre-training data. Dynamo AI runs a set of GCG attacks on open source models, caches these suffixes, and applies these to the user's target model.

ASCII Art Attack: A form of text-based art, which enables creation of visual images using only the characters found on a standard ASCII keyboard. We encode adversarial instructions into ASCII art. Previous studies have shown that this is able to evade guardrails.

Metrics

Attack success rate (ASR) refers to the percentage of attempted jailbreaks that are successfully executed by an attacker.