Developing Gemini's safety security – Google Deepmind

We publish a new white book showing how we have made Bligini 2.5 our safest model family so far.

Imagine that you will ask your AI agent to summarize the latest e -Mail – a seemingly simple task. Gemini and other models of large languages ​​(LLM) consistently improve the performance of such tasks, gaining access to information such as our documents, calendars or external pages. But what if one of these E -Maili contains hidden, malicious instructions aimed at cheating on artificial intelligence to share private data or improper use of its rights?

An indirect quick injection is a real challenge of cybersecurity, in which AI models sometimes try to distinguish between real user manuals from manipulative commands embedded in the downloaded data. Our new white book, Twin defense lessons against indirect fast injectionsIt defines our strategic plan to cope with indirect injections, which produce Agentic AI tools, supported by advanced models of large languages, and goals of such attacks.

Our involvement in building not only talented but safe AI agents means that we are constantly working on understanding how Gemini can react to indirect quick injections and make them more resistant against them.

Assessment of basic defense strategies

Indirect fast injection attacks are complex and require constant vigilance and many layers of defense. The Google Deepmind Safety and Privacy Team specializes in protecting our AI models against deliberate, malicious attacks. The attempt to manually find these gaps is slow and inefficient, especially when the models evolve quickly. This is one of the reasons why we have built an automated system to constantly examine Gemini's defense.

Using an automated red team to make Gemini safer

The basic part of our security strategy is the automated Red Team (Art), in which our internal Gemini team constantly attacks Gemini in a realistic way to discover potential security weaknesses in the model. Using this technique, among others, described in detail in our white book, helped significantly increase the Gemini protection indicator against an indirect fast injection attack during the use of tools, thanks to which Gemini 2.5 has so far our safest family.

We tested several defense strategies suggested by the research community, as well as some of our own ideas:

Adjusting the assessment of adaptation attacks

Bass alleviation showed promise in relation to the basic, non -aptal attacks, significantly reducing the success rate. However, malicious actors are increasingly using adaptation attacks, which have been specially designed for evolution and adaptation to art to bypass the tested defense.

Successful output defense, such as headlights or self -reflection, have become much less effective in relation to adaptive attacks, learning how to cope and bypass static defense approaches.

This discovery illustrates the key point: relying on defending tested only against static attacks offers a false sense of security. In the case of solid safety, it is extremely important to assess adaptive attractions that evolve in response to potential defense.

Building inherent immunity by sclerosis of the model

Although external defense and handrails at the system level are important, it is also key to increasing the internal ability of the AI ​​model to recognize and disregard malignant instructions embedded in data. We call this process “hardening the model”.

We adapted Gemini on a large set of realistic scenarios data in which Art generates effective indirect, quick injections focused on confidential information. This taught Gemini to ignore malicious built -in instructions and track the user's original request, thereby providing only normalSafe reply should give. This enables the model by nature to understand how to support exposed information that evolves in time as part of adaptation attacks.

This hardening of the model has significantly increased Gemini's ability to identify and ignore injected instructions, reducing the attack success rate. And what is important, without a significant impact on the model's performance on normal tasks.

It should be noted that even when hardening the model, no model is completely resistant. Determined attackers can continue to find new loans in security. That is why our goal is to make the attacks much more difficult, expensive and more complex for opponents.

Accepting a comprehensive approach to security modeling

Protection of AI models against attacks, such as indirect, quick injections requires “detailed defense”-using many layers of protection, including model hardening, input/output controls (such as classifiers) and handrails at the system level. Fighting an indirect fast injections is a key way we implement ours Principles and guidelines for the safety of the Agency To develop agents responsibly.

Protecting advanced AI systems against specific, evolving threats, such as a quick injection, is a continuous process. It requires the implementation of continuous and adaptive assessment, improving existing defense and new exploration, and building inherent resistance to the models themselves. By constantly applying defense and learning, we can enable AI assistants, such as Gemini, still be extremely helpful AND trustworthy.

To learn more about defense, we built in Gemini and our recommendation regarding the use of more difficult, adaptive attacks to assess the resistance of the model, read the White Book of GDM, Twin defense lessons against indirect fast injections.

LEAVE A REPLY

Please enter your comment!
Please enter your name here