An early warning system for novel AI risks

New research proposes a framework for evaluating general-purpose models against novel threats

To pioneer responsibly at the cutting edge of artificial intelligence (AI) research, we must identify new capabilities and new risks in our AI systems as early as possible.

AI researchers already use a range of evaluation benchmarks to identify unwanted behaviours in AI systems, such as making misleading statements, producing biased decisions, or repeating copyrighted content. Now, as the AI community builds and deploys increasingly powerful AI, we must expand the evaluation portfolio to include the possibility of extreme risks from general-purpose AI models that have strong skills in manipulation, deception, cyber-offence, or other dangerous capabilities.

In our latest paper, we introduce a framework for evaluating these novel threats, co-authored with colleagues from the University of Cambridge, University of Oxford, University of Toronto, Université de Montréal, OpenAI, Anthropic, the Alignment Research Center, the Centre for Long-Term Resilience, and the Centre for the Governance of AI.

Model safety evaluations, including those assessing extreme risks, will be a critical component of safe AI development and deployment.

An overview of our proposed approach: to assess extreme risks from new, general-purpose AI systems, developers must evaluate for dangerous capabilities and alignment (see below). By identifying these risks early on, developers unlock opportunities to be more responsible when training new AI systems, deploying those systems, transparently describing their risks, and applying appropriate cybersecurity standards.

Evaluating for extreme risks

General-purpose models typically learn their capabilities and behaviours during training. However, existing methods for steering the learning process are imperfect. For example, previous research at Google DeepMind has explored how AI systems can learn to pursue undesired goals even when we correctly reward them for good behaviour.

Responsible AI developers must look ahead and anticipate possible future developments and novel risks. Following continued progress, future general-purpose models may learn a variety of dangerous capabilities by default. For example, it is plausible (though uncertain) that future AI systems will be able to conduct offensive cyber operations, skilfully deceive humans in dialogue, manipulate humans into carrying out harmful actions, design or acquire weapons (e.g. biological, chemical), fine-tune and operate other high-risk AI systems on cloud computing platforms, or assist humans with any of these tasks.

People with malicious intentions who gain access to such models could misuse their capabilities. Or, due to failures of alignment, these AI models might take harmful actions even without anybody intending this.

Model evaluation helps us identify these risks ahead of time. Under our framework, AI developers would use model evaluation to uncover:

  1. To what extent a model has certain "dangerous capabilities" that could be used to threaten security, exert influence, or evade oversight.
  2. To what extent the model is prone to applying its capabilities to cause harm (i.e. the model's alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios and, where possible, should examine the model's internal workings. (A rough sketch of how these two kinds of evaluation might fit together follows this list.)
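As a rough illustration only, the sketch below shows how dangerous-capability and alignment evaluations could be run against a model and combined into a coarse risk judgement. The task structure, scorers, thresholds, and decision rule are illustrative assumptions, not the evaluation methodology described in the paper.

```python
# Hypothetical evaluation harness: names, scorers, and thresholds are
# illustrative assumptions, not part of the framework described above.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalResult:
    name: str
    score: float      # 0.0 (no evidence of the capability / misalignment) to 1.0
    threshold: float  # score at or above which this evaluation raises a flag

    @property
    def flagged(self) -> bool:
        return self.score >= self.threshold


# An evaluation task: a set of prompts, a scorer that maps the model's
# outputs to a single score in [0, 1], and a flagging threshold.
EvalTask = Tuple[List[str], Callable[[List[str]], float], float]


def run_evals(model: Callable[[str], str], tasks: Dict[str, EvalTask]) -> List[EvalResult]:
    """Run each evaluation task against the model and collect scored results."""
    results = []
    for name, (prompts, scorer, threshold) in tasks.items():
        outputs = [model(prompt) for prompt in prompts]
        results.append(EvalResult(name, scorer(outputs), threshold))
    return results


def risk_assessment(capability_results: List[EvalResult],
                    alignment_results: List[EvalResult]) -> str:
    """Combine both evaluation families into a coarse, illustrative judgement.

    Dangerous capabilities plus alignment flags suggest the highest concern;
    misuse potential is a separate question not covered by this sketch.
    """
    dangerous = [r.name for r in capability_results if r.flagged]
    misaligned = [r.name for r in alignment_results if r.flagged]
    if dangerous and misaligned:
        return f"HIGH RISK: capabilities={dangerous}, alignment flags={misaligned}"
    if dangerous:
        return f"REVIEW: dangerous capabilities present ({dangerous})"
    return "LOW RISK: no dangerous capabilities detected by these evaluations"
```

In practice, real dangerous-capability and alignment evaluations are far richer than a single score per task, but the separation into two evaluation families mirrors the distinction above.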

The results of these evaluations will help AI developers understand whether the ingredients sufficient for extreme risk are present. The most high-risk cases will involve multiple dangerous capabilities combined together. The AI system doesn't need to provide all the ingredients itself, as shown in this diagram:

Ingredients for extreme risk: sometimes specific capabilities could be outsourced, either to humans (e.g. users or crowdworkers) or to other AI systems. These capabilities must be applied for harm, either through misuse or through failures of alignment (or a mixture of both).

A rule of thumb: the AI community should treat an AI system as highly dangerous if it has a capability profile sufficient to cause extreme harm, assuming it is misused or poorly aligned. To deploy such a system in the real world, an AI developer would need to demonstrate an unusually high standard of safety.

Model evaluation as critical governance infrastructure

If we have better tools for identifying which models are risky, companies and regulators can better ensure:

  1. Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
  2. Responsible deployment: Responsible decisions are made about whether, when, and how to deploy potentially risky models.
  3. Transparency: Useful and actionable information is reported to stakeholders, to help them prepare for or mitigate potential risks.
  4. Appropriate security: Strong information security controls and systems are applied to models that might pose extreme risks.

We have developed a blueprint for how model evaluations for extreme risks should feed into important decisions around training and deploying a highly capable, general-purpose model. The developer conducts evaluations throughout, and grants structured model access to external safety researchers and model auditors so they can conduct additional evaluations. The evaluation results can then inform risk assessments before model training and deployment.

A blueprint for embedding model evaluations for extreme risks into important decision-making processes throughout model training and deployment.
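As a rough illustration of how evaluation results could gate these decisions, the sketch below encodes hypothetical training and deployment gates. The stage names, inputs, and decision rules are illustrative assumptions rather than the blueprint's actual process.

```python
# Hypothetical decision gates: stage names, inputs, and rules are illustrative
# assumptions, not the blueprint's actual process.
from enum import Enum, auto


class Decision(Enum):
    PROCEED = auto()
    PROCEED_WITH_MITIGATIONS = auto()
    PAUSE_AND_ESCALATE = auto()


def training_gate(early_risk_signals: bool) -> Decision:
    """Decide whether to continue training a model showing early signs of risk."""
    return Decision.PAUSE_AND_ESCALATE if early_risk_signals else Decision.PROCEED


def deployment_gate(internal_eval_risk: str,
                    external_audit_passed: bool,
                    security_controls_in_place: bool) -> Decision:
    """Decide whether and how to deploy, given internal evaluations,
    external audits, and the state of information security controls."""
    if internal_eval_risk == "HIGH RISK":
        return Decision.PAUSE_AND_ESCALATE
    if not external_audit_passed or not security_controls_in_place:
        return Decision.PROCEED_WITH_MITIGATIONS
    return Decision.PROCEED


if __name__ == "__main__":
    # Example: internal evaluations flag review-level risk, the external audit
    # passed, but strong security controls are not yet in place.
    print(deployment_gate("REVIEW", True, False))  # Decision.PROCEED_WITH_MITIGATIONS
```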

Looking to the future

Important early work on model evaluations for extreme risks is already underway at Google DeepMind and elsewhere. But much more progress, both technical and institutional, is needed to build an evaluation process that catches all possible risks and helps safeguard against future, emerging challenges.

Model evaluation is not a panacea; some risks could slip through the net, for example because they depend too heavily on factors external to the model, such as complex social, political, and economic forces in society. Model evaluation must be combined with other risk assessment tools and a wider dedication to safety across industry, government, and civil society.

Google's recent blog on responsible AI states that "individual practices, shared industry standards, and sound government policies would be essential to getting AI right." We hope many others working in AI, and in the sectors this technology affects, will come together to create approaches and standards for safely developing and deploying AI for the benefit of all.

We believe that having processes for tracking the emergence of risky properties in models, and for adequately responding to concerning results, is a critical part of being a responsible developer operating at the frontier of AI capabilities.
