Adaptive Jailbreak attacks
Overview
The DynamoEval Adaptive Jailbreak attack is inspired by the Tree of Attacks with Pruning (TAP) attack from the literature. The attacker's objective is to make the target model comply with a harmful prompt. It does this by iteratively rewriting the prompt to obfuscate the harmful behavior or to bypass the model's safety filters. The attack uses an attacker language model (an LLM set up as an attacker through its system prompt) to generate and refine attack prompts, and a tree-shaped optimization loop to search for high-performing adversarial prompts against the target model. The tree structure lets the attack explore many different lines of attack in parallel. Pruning at each layer removes adversarial prompts that are off-topic or low-performing (unable to successfully attack the target model).
At each layer of the tree, the Adaptive Jailbreak attack completes 3 steps, as shown in the figure below:
- Attack: The attacker model takes a set of prompts to generate a set of improved, augmented prompts.
- Evaluate: The evaluator LLM scores each of the target LLM's responses for compliance and topic relevance. If any response receives a score of 10, we keep it and exit the loop.
- Prune: We remove any off-topic prompts and keep the 5 highest-scoring prompts. The loop then runs again until either the maximum number of iterations is reached or a prompt receives a score of 10.
The Adaptive Jailbreak attack ends either when it finds a jailbreak (at which point it stops the optimization loop early) or when the maximum depth of the tree is reached (the maximum depth is set to 5 in our case). DynamoFL uses settings for this attack that were tuned in empirical experiments to maximize attack success rate while saving compute resources and minimizing run time.
In summary, DynamoEval uses an attacker model and an evaluator model in a tree-shaped loop to iteratively improve the attack prompts against the target model. In each pass, it prunes prompts that are off-topic or that the evaluator model deems least successful (signalled by a low jailbreak score). It exits the loop once a successful jailbreak has been found (the score equals 10) or when the maximum number of iterations has been reached.
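The control flow of the loop described above can be sketched as follows. This is an illustrative outline only, not DynamoEval's implementation: the names `adaptive_search`, `Candidate`, `refine`, `query_target`, and `judge` are hypothetical, and the three models are passed in as plain callables so that the sketch stays independent of any particular LLM API. The defaults (depth 5, top 5, success score 10) come from the description above. Stopping early when no on-topic prompt survives pruning is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    prompt: str
    response: str = ""
    score: int = 0          # evaluator's jailbreak score (10 = success)
    on_topic: bool = True   # evaluator's topic-relevance verdict


def adaptive_search(
    seed_prompt: str,
    refine: Callable[[str], List[str]],             # attacker model (hypothetical)
    query_target: Callable[[str], str],             # target model (hypothetical)
    judge: Callable[[str, str], Tuple[int, bool]],  # evaluator model (hypothetical)
    max_depth: int = 5,
    keep_top: int = 5,
    success_score: int = 10,
) -> Optional[Candidate]:
    """Return the first candidate that reaches success_score, or None."""
    frontier = [seed_prompt]
    for _ in range(max_depth):
        # Attack: the attacker proposes refined variants of each surviving prompt.
        candidates = [Candidate(p) for parent in frontier for p in refine(parent)]

        # Evaluate: score each target response; stop early on a success.
        for cand in candidates:
            cand.response = query_target(cand.prompt)
            cand.score, cand.on_topic = judge(cand.prompt, cand.response)
            if cand.score >= success_score:
                return cand

        # Prune: drop off-topic candidates, then keep the highest-scoring ones.
        survivors = [c for c in candidates if c.on_topic]
        survivors.sort(key=lambda c: c.score, reverse=True)
        frontier = [c.prompt for c in survivors[:keep_top]]
        if not frontier:
            break  # assumption: nothing left to refine, so give up early
    return None
```

Because each layer keeps at most `keep_top` prompts, the number of target-model queries per layer is bounded by `keep_top` times the number of variants the attacker produces per prompt, which is what keeps the tree search affordable.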
Metrics
Attack success rate (ASR) is the percentage of attempted jailbreaks that succeed against the target model.
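As a minimal illustration of this definition, the helper below computes ASR from a list of per-attempt outcomes. The function name `attack_success_rate` and the boolean input format are assumptions made for this example, not part of DynamoEval's API.

```python
from typing import Iterable


def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of attempts that succeeded (True = jailbreak found)."""
    results = list(outcomes)
    if not results:
        raise ValueError("ASR is undefined when there are no attempts")
    successes = sum(1 for succeeded in results if succeeded)
    return 100.0 * successes / len(results)
```

For example, 3 successful jailbreaks out of 20 attempted prompts gives an ASR of 15%.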