New "many-shot jailbreaking" technique exploits large language models

TechCrunch, April 2, 2024, 11:00 PM UTC

Summary: Anthropic researchers have documented a new "many-shot jailbreaking" technique that can get large language models (LLMs) to answer inappropriate questions. By priming a model with a long series of harmless questions first, an attacker can convince it to provide harmful information, such as bomb-making instructions. The exploit is made possible by the greatly expanded "context window" of modern LLMs. The team shared its findings with the AI community so that mitigations can be developed, aiming for open collaboration on security issues.
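
The attack described above is essentially a prompting pattern: pack the model's context window with many fabricated question-and-answer turns before the final request. Below is a minimal sketch of how such a many-shot prompt could be assembled, using only harmless placeholder content; the function name build_many_shot_prompt, the dialogue format, and the commented-out query_model call are illustrative assumptions, not details taken from the article.

    # Sketch of the "many-shot" prompt structure: many fabricated user/assistant
    # turns are concatenated into one long prompt ahead of the real question.
    # All content here is harmless placeholder text; query_model is hypothetical.

    def build_many_shot_prompt(examples: list[tuple[str, str]], final_question: str) -> str:
        """Concatenate fabricated Q&A turns, then append the actual question."""
        turns = []
        for question, answer in examples:
            turns.append(f"User: {question}\nAssistant: {answer}")
        turns.append(f"User: {final_question}\nAssistant:")
        return "\n\n".join(turns)

    # In practice, hundreds of such pairs would be used to fill a long context window.
    placeholder_examples = [
        (f"Placeholder question {i}?", f"Placeholder answer {i}.") for i in range(256)
    ]

    prompt = build_many_shot_prompt(placeholder_examples, "Final question goes here?")
    # response = query_model(prompt)  # hypothetical helper for whichever LLM API is used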

Article metrics

The article metrics are deprecated.

I'm replacing the original 8-factor scoring system with a new one that does not use the original factors and produces much better significance scores.

Timeline:

  1. [5.7] AI safety features can be bypassed with harmful examples (The Guardian), 115d