Gemini Security Improvements – Google DeepMind

Adjusting ratings for adaptive attacks

Basic mitigations have shown promise against basic, non-adaptive attacks, significantly reducing attack success rates. However, malicious actors are increasingly using adaptive attacks, which are specifically designed to evolve and adapt to ART in order to bypass the security measures being tested.

Effective basic defense mechanisms such as Spotlighting or Self-Reflection have become much less effective against adaptive attacks, learning to cope with and bypass static defense approaches.

This finding illustrates a key point: relying on defenses tested only against static attacks provides a false sense of security. To ensure robust security, it is crucial to evaluate adaptive attacks that evolve in response to potential defenses.

Building innate resilience by hardening the model

While external security and system-level guardrails are important, it is also critical to increase the internal ability of the AI ​​model to recognize and ignore malicious instructions embedded in the data. We call this process “model hardening.”

We refined Gemini based on a large dataset of realistic scenarios where ART generates effective indirect, immediate injections targeting sensitive information. This taught Gemini to ignore the embedded malicious instruction and follow the user's original request, thus providing only the correct, safe response it was supposed to provide. This allows the model to inherently understand how to deal with compromised information as it evolves over time through adaptive attacks.

This model strengthening significantly increased Gemini's ability to identify and ignore injected instructions, reducing the effectiveness of its attack. And importantly, without any significant impact on the model's performance on normal tasks.

Please note that even if the model is hardened, no model is completely immune. Determined attackers can continue to find new vulnerabilities. Therefore, our goal is to make attacks much more difficult, expensive and complex for adversaries.

Taking a holistic approach to model security

Protecting AI models from attacks such as indirect hint injection requires “defense in depth” – applying multiple layers of security, including model hardening, I/O controls (such as classifiers), and system-level guardrails. Tackling indirect instant injections is a key way we implement ours principles and guidelines for agent security develop agents responsibly.

Securing advanced AI systems against specific, evolving threats, such as intermediate instantaneous injection, is an ongoing process. This requires continuous and adaptive evaluation, improving existing protections and discovering new ones, as well as building inherent resilience in the models themselves. Through layered security and continuous learning, we can ensure AI assistants like Gemini continue to be incredibly helpful and trustworthy.

To learn more about the security built into Gemini and our recommendations for using more difficult, adaptive attacks to assess model resilience, read the GDM white paper, Lessons from Gemini's defense against indirect injections.

LEAVE A REPLY

Please enter your comment!
Please enter your name here