Gemma Scope: helping the safety community shed light on the inner workings of language models

Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.

To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of "microscope" that lets them look inside a language model and better understand how it works.

Today, we are announcing Gemma Scope, a new set of tools that helps researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We are also open-sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.

We hope today's release will enable more ambitious interpretability research. Further research can help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents, such as deception or manipulation.

Try our interactive Gemma Scope demo, courtesy of Neuronpedia.

Interpreting what happens inside a language model

When you ask a language model a question, it turns your text input into a series of "activations". These activations map the relationships between the words you've entered, helping the model make connections between different words, which it uses to write an answer.

As the model processes text input, activations at different layers in the model's neural network represent multiple, increasingly advanced concepts, known as "features".

For example, a model's early layers might learn to recall facts such as that Michael Jordan plays basketball, while later layers may recognize more complex concepts, such as the factuality of the text.

A stylized representation of using a sparse autoencoder to interpret a model's activations as it recalls the fact that the City of Light is Paris. We see that French-related concepts are present, while unrelated ones are not.
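As a rough illustration of what these activations are in practice, here is a minimal sketch that captures the output of one Gemma 2 layer with a PyTorch forward hook. The checkpoint name and module path are assumptions based on the usual Hugging Face decoder layout, not part of the Gemma Scope release itself.

```python
# Minimal sketch (not the Gemma Scope tooling): capturing the activations of one
# transformer layer with a PyTorch forward hook. The checkpoint name and module
# path below are assumptions and may differ in your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumed checkpoint name (access may be gated)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

captured = {}

def save_activation(module, inputs, output):
    # Decoder layers return a tuple; the first element is the hidden state.
    captured["layer_12"] = output[0].detach()

# model.model.layers[12] assumes the standard Hugging Face decoder layout.
handle = model.model.layers[12].register_forward_hook(save_activation)

tokens = tokenizer("The City of Light is Paris.", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

print(captured["layer_12"].shape)  # (batch, sequence_length, hidden_size)
```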

However, interpretability researchers face a key problem: a model's activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that the features in a neural network's activations would line up with individual neurons, i.e., nodes of information. Unfortunately, in practice, neurons are active for many unrelated features. This means there is no obvious way to tell which features are part of a given activation.

This is where sparse autoencoders come in.

A given activation will only be a mixture of a small number of features, even though the language model is likely capable of detecting millions or even billions of them, i.e., the model uses features sparsely. For example, a language model will consider relativity when answering a question about Einstein and consider eggs when writing about omelettes, but it probably won't consider relativity when writing about omelettes.

Sparse autoencoders exploit this fact to discover a set of possible features and break down each activation into a small number of them. Researchers hope that the best way for a sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
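The sketch below shows the basic shape of such a sparse autoencoder, assuming a plain ReLU encoder with an L1 sparsity penalty (Gemma Scope itself uses the JumpReLU variant described later). The layer widths and sparsity coefficient are illustrative, not the released configuration.

```python
# A minimal sketch of a sparse autoencoder: it expands an activation into a wide
# vector of feature strengths, most of which are pushed toward zero, and then
# reconstructs the original activation from them. Sizes are illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature strengths
        self.decoder = nn.Linear(d_features, d_model)  # feature strengths -> reconstruction

    def forward(self, activation: torch.Tensor):
        features = torch.relu(self.encoder(activation))  # most entries end up near zero
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(d_model=2304, d_features=16384)  # assumed, illustrative widths
activation = torch.randn(8, 2304)                        # stand-in for model activations
features, reconstruction = sae(activation)

# Training objective: reconstruct the activation while keeping the features sparse.
sparsity_coeff = 1e-3  # illustrative value
loss = ((reconstruction - activation) ** 2).mean() + sparsity_coeff * features.abs().mean()
```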

Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder which features to look for. As a result, we are able to discover rich structures that we did not predict. However, because we don't immediately know the meaning of the discovered features, we look for meaningful patterns in examples of text where the sparse autoencoder says a feature "fires".

Here's an example in which the tokens where a feature fires are highlighted in gradients of blue according to their strength:

Example activations for a feature found by our sparse autoencoders. Each bubble is a token (a word or word fragment), and the varying shades of blue illustrate how strongly the feature is present. In this case, the feature is apparently associated with idioms.
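A simple way to surface such patterns is to rank tokens by how strongly a chosen feature fires on them. The sketch below does this with dummy data; the function name, shapes, and feature index are illustrative assumptions, not part of the released tooling.

```python
# Sketch of the "look for meaningful patterns" step: rank the tokens in a batch of
# example texts by how strongly one SAE feature fires on them.
import torch

def top_firing_tokens(features: torch.Tensor, token_strings: list[str],
                      feature_idx: int, k: int = 10):
    """features: (num_tokens, num_features) SAE feature strengths for a flat token list."""
    strengths = features[:, feature_idx]
    values, indices = torch.topk(strengths, k=min(k, strengths.numel()))
    return [(token_strings[i], float(v)) for i, v in zip(indices.tolist(), values.tolist())]

# Example with dummy data: 100 tokens, 16384 features.
dummy_features = torch.rand(100, 16384)
dummy_tokens = [f"tok{i}" for i in range(100)]
for token, strength in top_firing_tokens(dummy_features, dummy_tokens, feature_idx=42):
    print(f"{token}: {strength:.3f}")
```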

What makes Gemma Scope unique

Prior research with sparse autoencoders has mainly focused on investigating the inner workings of tiny models or a single layer in larger models. But more ambitious interpretability research involves decoding layered, complex algorithms in larger models.

We trained sparse autoencoders on every layer and sublayer output of Gemma 2 2B and 9B to build Gemma Scope, producing more than 400 sparse autoencoders with more than 30 million learned features in total (though many features likely overlap). This tool will enable researchers to study how features evolve throughout the model, and how they interact and compose to form more complex features.
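For instance, one could compare how strongly a single feature index fires at each layer for the same prompt. The sketch below only illustrates the idea, using random stand-ins for the trained per-layer encoders and the captured activations; the layer count, widths, and names are assumptions, not the release API.

```python
# Illustrative sketch: checking how strongly one learned feature fires at each
# layer for the same prompt, given per-layer encoder weights. All sizes and the
# random stand-in data are assumptions.
import torch

num_layers, d_model, d_features = 26, 2304, 16384   # Gemma 2 2B-ish sizes (assumed)
feature_idx = 42

for layer in range(num_layers):
    # Stand-ins for the trained per-layer SAE encoder and the captured activations.
    W_enc = torch.randn(d_model, d_features) / d_model ** 0.5
    b_enc = torch.zeros(d_features)
    acts = torch.randn(10, d_model)                  # (tokens, d_model) at this layer

    feature_strengths = torch.relu(acts @ W_enc + b_enc)
    print(f"layer {layer:2d}: mean strength of feature {feature_idx}: "
          f"{feature_strengths[:, feature_idx].mean():.4f}")
```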

Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance appropriately, significantly reducing error.
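Concretely, a JumpReLU applies a learned per-feature threshold: feature strengths below the threshold are zeroed out, while strengths above it pass through unchanged, separating "is this feature present?" from "how strong is it?". The sketch below is a minimal illustration of that activation function only, not the full training setup (which also needs techniques such as straight-through gradient estimators for the thresholds); the values shown are arbitrary.

```python
# Minimal sketch of a JumpReLU activation with a learned per-feature threshold.
import torch

def jump_relu(pre_activations: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    # Keep the raw strength where it clears the per-feature threshold, else zero it.
    return pre_activations * (pre_activations > threshold).to(pre_activations.dtype)

pre_acts = torch.tensor([[0.05, 0.8, -0.3, 1.2]])   # arbitrary example strengths
threshold = torch.tensor([0.1, 0.1, 0.1, 1.0])      # one learned threshold per feature
print(jump_relu(pre_acts, threshold))               # tensor([[0.0000, 0.8000, 0.0000, 1.2000]])
```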

Training so many sparse autoencoders was a significant engineering challenge that required a lot of computing power. We used about 15% of the training compute of Gemma 2 9B (excluding the compute for generating distillation labels), saved about 20 pebibytes (PiB) of activations to disk (roughly as much as a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
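As a rough back-of-the-envelope check of that comparison (assuming, approximately, 20 GB for the text of English Wikipedia, which is an assumed ballpark figure):

```python
# Rough sanity check of the "million copies of English Wikipedia" comparison.
stored_bytes = 20 * 2**50              # 20 pebibytes of saved activations
wikipedia_bytes = 20e9                 # assumed ~20 GB of text per copy
print(stored_bytes / wikipedia_bytes)  # ~1.1 million copies
```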

Pushing the field forward

By releasing Gemma Scope, we hope to make Gemma 2 the best model family for open mechanistic interpretability research and to accelerate the community's work in this field.

So far, the interpretability community has made great progress in understanding small models with sparse autoencoders and in developing relevant techniques, such as causal interventions, automatic circuit analysis, feature interpretation, and evaluating sparse autoencoders. With Gemma Scope, we hope the community will scale these techniques to modern models, analyze more complex capabilities such as chain of thought, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise with larger models.

Acknowledgements

Gemma Scope was a collective effort by Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar and Neel Nanda, advised by Rohin Shah and Anca Dragan. We would especially like to thank Johnny Lin, Joseph Bloom and Curt Tigges at Neuronpedia for their assistance with the interactive demo. We are grateful for the help and contributions of Phoebe Kirk, Andrew Forbes, Arielle Bier, Yotam Doron, Tris Warkentin, Ludovic Peran, Kat Black, Anand Rao, Meg Risdal, Samuel Albanie, Dave Orr, Matt Miller, Alex Turner, Tobi Ijitoye, Shruti Sheth, Jeremy Sie, Alex Tomala, Javier Ferrando, Oscar Obeso, Kathleen Kenealy, Joe Fernandez, Omar Sanseviero and Glenn Cameron.
