Gemma Scope: helping the safety community shed light on the inner workings of language models

We announce a comprehensive, open set of sparse autoencoders for language model interpretability.

To create an artificial intelligence (AI) language model, researchers build a system that learns from huge amounts of data without human intervention. As a result, the inner workings of language models often remain a mystery even to the researchers who train them. Mechanistic interpretability is a field of study focused on deciphering these internal mechanisms. Researchers in this field use sparse autoencoders as a kind of “microscope” that allows them to look inside the language model and better understand how it works.
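To make the "microscope" analogy concrete, here is a minimal sketch of what a sparse autoencoder does mechanically: it encodes a single model activation vector into a much larger, mostly zero set of feature activations, then decodes those features back into an approximation of the original activation. The dimensions and randomly initialised weights below are purely illustrative stand-ins, not the trained Gemma Scope parameters.

```python
# Illustrative sketch of a sparse autoencoder (SAE) acting as a "microscope"
# on one model activation vector. All shapes and weights here are hypothetical
# stand-ins; the released Gemma Scope SAEs use their own trained parameters.
import numpy as np

rng = np.random.default_rng(0)

d_model = 2304    # width of a model activation vector (illustrative)
d_sae = 16384     # number of learned features, much larger than d_model

# Randomly initialised stand-ins for trained SAE weights.
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(activation: np.ndarray):
    """Encode an activation into feature activations, then reconstruct it."""
    pre = activation @ W_enc + b_enc
    features = np.maximum(pre, 0.0)           # ReLU keeps only positive features
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

activation = rng.normal(size=d_model)          # one activation vector from the model
features, reconstruction = sae_forward(activation)

# With trained weights, only a handful of features would be active for any
# given input, and each active feature would correspond to an interpretable
# concept the researcher can inspect.
print("active features:", int((features > 0).sum()), "of", d_sae)
```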

Today, we announce Gemma Scope, a new set of tools to help researchers understand the inner workings of Gemma 2, our family of lightweight open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We are also open sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.

We hope that today's release will enable more ambitious interpretability research. Further research could help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents such as deception or manipulation.

Try our interactive Gemma Scope demo, courtesy of Neuronpedia.

Interpreting what happens inside a language model

When you ask the language model a question, it turns your text input into a series of “activations.” These activations map the relationships between the words you input, helping the model make connections between the different words it uses to write the response.

As the model processes the text input, activations at different layers of the model's neural network represent many increasingly sophisticated concepts, called “features.”
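As a hedged illustration of what "activations at different layers" means in practice, the sketch below uses forward hooks to capture the hidden states produced by each decoder layer of Gemma 2 2B through Hugging Face transformers. The checkpoint name and the model.model.layers attribute path are assumptions about the public implementation; the captured tensors are the kind of per-layer activations that a Gemma Scope SAE trained on that layer would decompose into features.

```python
# A sketch of capturing per-layer activations from Gemma 2 with forward hooks.
# Assumes access to the google/gemma-2-2b checkpoint and that the transformers
# implementation exposes its decoder layers under model.model.layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

captured = {}  # layer index -> hidden states for the prompt

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Decoder layers return a tuple; the first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        captured[layer_idx] = hidden.detach()
    return hook

handles = [
    layer.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

inputs = tokenizer("Michael Jordan plays basketball.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()

# Each captured tensor has shape (batch, sequence_length, hidden_size); an SAE
# trained on a given layer decomposes these vectors into interpretable features.
for i in sorted(captured):
    print(f"layer {i}: {tuple(captured[i].shape)}")
```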

For example, earlier layers of the model might learn to recall facts, such as that Michael Jordan plays basketball, while later layers may recognize more complex concepts, such as the factuality of the text.
