From Spatial Navigation to Spectral Filtering

Author(s): Erez Azaria

Originally published on Towards AI.

Image generated by Author using AI

In the world of machine learning, one of the most enigmatic and elusive concepts is “latent space”, the semantic hyperspace in which a large language model operates.

Most of the time, the concept is either ignored or only loosely defined when model operations are discussed. What is more surprising is that while the standard “spatial” analogy works well for static embeddings, it falls short once we dive into the dynamics of inference in transformer models.

The common model: A map of meaning

The most abstract definition of a latent space is that it is a compressed representation of data where semantic concepts are located near each other, creating areas of semantic meaning.

In this space, the location coordinates are the encoded meaning of the concept. During training, the model moves these points until similar concepts are clustered together, effectively creating a map where the distance between two points represents how related they are.

The Fracture

When we deal with static raw embeddings, the spatial map is very intuitive.

The famous arithmetic done with Word2Vec embeddings, where the vector operation “King” - “Man” + “Woman” lands very close to “Queen”, is appealing.

The first question mark, though, arises even in the realm of static raw embeddings.

Every embedding vector is essentially an arrow shooting out from the origin. It has a direction (where it points) and a magnitude (how long the arrow is). In a standard spatial map, one might assume that calculating “King” — “Man” + “Woman” moves us to the precise coordinates of “Queen”. However, in high-dimensional space, this calculation rarely lands on the exact spot.

Raw embeddings come in different magnitudes. If we compare them using standard distance, a “loud” vector will seem far away from a “quiet” one, even if they mean the same thing. But if we force all arrows to have the same length, projecting every token onto the surface of a sphere (normalization), the spatial analogy suddenly clicks. On this spherical surface, “King” - “Man” + “Woman” and “Queen” are neighbors.

Figure 1: The role of normalization in static embeddings. Raw embeddings (Left) have varying magnitudes (lengths). Only when projected onto a hypersphere (Right) do spatial geometric relationships like “King - Man + Woman = Queen” become intuitive. (Image generated by Author using AI)

So, to measure the similarity between two semantic concepts, we must ignore the magnitude and look only at the direction (cosine similarity). Why the “loudness” of a vector must be ignored to recover its meaning is something the spatial analogy never explains well, and it remains a point of confusion for many.
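To see the problem concretely, here is a minimal NumPy sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions): raw distance is fooled by loudness, while cosine similarity is not.

```python
import numpy as np

# Toy vectors: same "direction" (meaning) but different magnitudes (loudness).
# These are made-up numbers, not real Word2Vec embeddings.
quiet = np.array([0.2, 0.4, 0.1])
loud = 10 * quiet                    # same direction, 10x the magnitude
unrelated = np.array([0.4, -0.1, 0.9])

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(quiet, loud))       # large: raw distance is fooled by the volume gap
print(euclidean(quiet, unrelated))  # smaller than the above, despite a different meaning
print(cosine(quiet, loud))          # 1.0: identical direction, identical "meaning"
print(cosine(quiet, unrelated))     # noticeably lower
```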

The Dynamics of Inference

Diving into the dynamics of inference in a transformer model, two very odd phenomena emerge.

A transformer model is a stack of layers, ranging from roughly 32 in small models to more than a hundred in large foundation models. Each layer has an attention block and a feed-forward network (FFN) block. The raw token embeddings enter the first layer, and every subsequent layer produces a new embedding called a “hidden state.”

1. Growing Magnitude: If we measure the magnitude of the hidden-state vector after each transformer layer, we see that it keeps growing. (Xiong et al., 2020)

2. Directional Rigidity: If we check the cosine similarity (the direction) between each newly generated hidden state and the one before it, we find a very high similarity (often above 0.9!). This means the output of every layer rarely strays from the general direction of the original embedding vector. (Ethayarajh, 2019)

The bottom line: if you try to visualize this spatially, the representation starts at the token's location and, during inference, keeps moving outward in roughly the same direction, taking bigger and bigger steps.

Table 1: Internal states of Mistral-7B instruct during inference. Note how the Direction (Cosine Similarity) stabilizes quickly above 0.88, while the Magnitude (Volume) grows exponentially by nearly 90x.
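These per-layer measurements are straightforward to reproduce. Below is a sketch using the Hugging Face transformers library; the model name and prompt are placeholders (a small model such as GPT-2 keeps it runnable on modest hardware), and the exact numbers will differ from the Mistral-7B figures above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; "mistralai/Mistral-7B-Instruct-v0.2" is one option if you have the resources.
model_name = "gpt2"  # small placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output plus one tensor per layer.
# Track the last token's vector as it moves through the stack.
states = [h[0, -1] for h in out.hidden_states]

for i in range(1, len(states)):
    cos = F.cosine_similarity(states[i], states[i - 1], dim=0).item()
    norm = states[i].norm().item()
    print(f"layer {i:2d}  cos_to_prev={cos:.3f}  norm={norm:.1f}")
```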

The Technical Reality vs. The Intuition

The technical reason for the increasing magnitude and the small angle changes is well known. The input of every layer consists of the output of the previous layer's computation plus that layer's own input, added back in. This “bypass” is called the residual stream connection, formally expressed as:

h_{l+1} = h_l + F_l(h_l)

where h_l is the hidden state entering layer l and F_l is that layer's combined attention and FFN computation.

This mechanism was originally designed to preserve the gradient during training, ensuring that the original signal doesn’t vanish as it passes through deep layers. But the side effect is signal accumulation.
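Schematically, each transformer block follows this additive pattern. The PyTorch sketch below is heavily simplified (real blocks add layer normalization, masking, positional handling, and so on), but it shows the key point: both sub-blocks only add to the stream they receive.

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Schematic only: real blocks add layer norms, masking, multi-head detail, etc."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # The residual stream: each sub-block only ADDS its contribution to x,
        # so the original signal is never overwritten, only accumulated upon.
        x = x + self.attn(x, x, x, need_weights=False)[0]
        x = x + self.ffn(x)
        return x

# A hidden state passed through several such blocks keeps its original content
# plus everything each block has added on top of it.
x = torch.randn(1, 5, 64)  # (batch, sequence, d_model)
for block in [TransformerBlockSketch(64) for _ in range(4)]:
    x = block(x)
print(x.shape)             # torch.Size([1, 5, 64])
```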

However, the fact that we know the technical reason doesn’t help our intuition at all. There is nothing in this explanation that sits right with the standard spatial analogy.

Where is the compounded semantic movement? If each layer produces a vector that points in the same direction as the previous layer, where is the spatial map navigation? What did the model learn?

If it learned so little that the direction barely changed, how can the model infer a completely different next token?

It becomes clear that the geometric-navigation analogy is not the best fit for describing the internal mechanics of inference. If we want to understand the nature of model inference, we ought to adopt a new mental model.

Token embeddings as a spectral envelope

I want to offer a different mental model, one that better captures the full range of phenomena observed. This model borrows from sound signal processing.

The music spectrum display of a sound system has channels (frequency bars). Each channel gauges the level of signal in a narrow frequency band. In our new mental model, we describe the vector dimensions not as coordinates, but as channels.

Figure 2: Visualizing embeddings as a spectral envelope. In this model, dimensions are viewed as frequency channels on a spectrum analyzer rather than coordinates in space. (Image generated by Author using AI).

If you have a model with D dimensions, you can imagine a sound spectrum display of D bars. The raw embedding vector for a token is simply a specific preset of values in each of these bars (AKA a “spectral envelope”).
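If you want to see this picture for yourself, a quick Matplotlib sketch (using a random stand-in vector instead of a real embedding) renders a token as exactly such a display:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
d = 64                                # a small hypothetical dimension count
token_embedding = rng.normal(size=d)  # stand-in for a real raw embedding

# Draw the vector as a spectrum display: one bar per dimension ("channel").
plt.bar(range(d), token_embedding)
plt.xlabel("channel (dimension index)")
plt.ylabel("level (component value)")
plt.title("A token embedding viewed as a spectral envelope")
plt.show()
```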

Channel Selectors and Mixers

To understand how the signal evolves, we can look at the two main components of every layer through this new lens.


The Attention Mechanism acts essentially as a dynamic channel selector. It scans and samples the spectral envelopes of other relevant tokens in the context window and selects which specific channels (informational notes) should influence the current token. It effectively decides: “To understand this concept, I need the ‘timbre’ of that previous word.”

Once these new notes are collected, the Feed-Forward Network (FFN) steps in. If Attention is the selector, the FFN is the mixer. It takes these collected notes and mixes them together to generate a new harmonic chord. This new chord (the calculated hidden state) is then added back to the main carrier signal via the residual stream, enriching the original token with new context.

Aligning concepts to the new model

The first problem we set out to solve is: why, when we check whether two tokens are similar, do we use cosine similarity (ignoring the magnitude)?

The magnitude, as noted, is the length of the vector. In our new model, magnitude is volume. Turning the volume up or down scales the height of the bars on our spectrum display, but it doesn't change the song you hear.

The same goes for our model: if you need to compare two token signals, their volume doesn't change the nature of those signals. If you compare a quiet signal to a loud one, the loud one will overpower the comparison. If you level the two signals, you can better “hear” the similarities. This is exactly what cosine similarity does: it levels the two signals so you can compare the underlying “chord.”

The second problem is the increasing magnitude and the lack of direction change.

We start with the raw token embedding. That is our initial channel preset, or our “carrier signal”.

When the raw embeddings pass through the model layers, each layer adds its own small modulations over the different channels. The residual stream makes sure that the original Carrier Signal “harmonics” sustain, while each layer only ADDS capped modulations over the available channels.

This explains the observations perfectly:

  1. High Cosine Similarity: The original Carrier Signal (the raw token identity) resonates strongly through every layer because it is carried and amplified via the residual stream. The “direction” doesn’t change because the base chord still dominates.
  2. Growing Magnitude: As new modulations are added to the channels, layer by layer, the overall signal level keeps getting louder. (The toy simulation below illustrates both effects.)
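A toy simulation (random “carrier” vector, made-up coefficients) shows both effects at once: as long as each layer only adds a relatively small modulation on top of the stream, the direction barely moves between consecutive states while the overall level keeps climbing.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                                 # hypothetical hidden size

carrier = rng.normal(size=d)             # raw token embedding: the "carrier signal"
state = carrier.copy()

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for layer in range(1, 33):               # 32 layers, roughly 7B-class depth
    prev = state
    # Each layer ADDS a capped modulation: partly reinforcing the carrier,
    # partly injecting new content (the coefficients are made up for illustration).
    modulation = 0.2 * carrier + 0.5 * rng.normal(size=d)
    state = prev + modulation
    if layer % 8 == 0:
        print(f"layer {layer:2d}  "
              f"cos_to_prev={cos(state, prev):.3f}  "
              f"cos_to_carrier={cos(state, carrier):.3f}  "
              f"norm_ratio={np.linalg.norm(state) / np.linalg.norm(carrier):.1f}x")
```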

The next token prediction

The core function of model inference is predicting the next token. While the spatial-map analogy struggles to explain the dynamics of next-token prediction, the spectral model fits these observations intuitively.

The unembedding layer is essentially a bank of Matched Filters for every possible token. It takes the final hidden state with its loud Carrier Signal and added modulation signals, and measures how well it fits with each embedding filter preset.

The filtering is done by taking the value in each dimension of the hidden state (the “volume” in each channel), multiplying it by the corresponding weight in that token's unembedding row (the “volume” of the filter), and summing the results (a dot product). The token with the highest score wins. In signal processing terms, this operation is remarkably similar to how a radio tuner locks onto a frequency.

In other words, after the filter is applied, the token that “resonates” with the highest volume is selected as the prediction.

(Technically, scores are turned into probabilities via SoftMax, but this doesn’t change the description of the underlying filtering method).
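A minimal sketch of this matched-filter reading, with random stand-in weights and small made-up sizes instead of a real model's unembedding matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 512, 1000                        # small stand-in sizes, not real model dimensions

hidden_state = rng.normal(size=d)           # final hidden state: carrier + accumulated modulations
unembedding = rng.normal(size=(vocab, d))   # one "filter preset" (row) per token in the vocabulary

# Matched filtering: multiply channel by channel and sum -- a dot product per token.
scores = unembedding @ hidden_state         # shape: (vocab,)

# Softmax turns the resonance scores into probabilities; the loudest resonance wins.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
next_token_id = int(np.argmax(probs))
print(next_token_id, float(probs[next_token_id]))
```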

The “Phase” of the signal

Looking at this, you might think that there is something very odd here. If it’s just about volume, wouldn’t tokens with overall higher volume (higher norms) always win?

For example, common tokens (like “the”, “a”, “or”) have lower norms because, during training, they appear in varied contexts and are pulled in every direction (the “Tug-of-War” effect). If they have low norms, how are they ever selected?

Your home sound system displays positive-only bars. You can have zero volume, but “-5 volume” makes no sense physically. However, sound is a wave, and waves have phase.

If you emit a sound at a certain frequency and then emit the same frequency in the opposite phase, they cancel each other out. This is how noise-canceling headphones work.

Token embedding values, however, can be negative. Unlike a volume slider, which only goes up from zero, signal amplitude can be positive or negative.

When the model performs a dot product between the hidden state and the unembedding layer, it effectively balances these phases:

  • Aligned Phases: Positive × Positive (or Negative × Negative) results in a positive score. The signal resonates.
  • Misaligned Phases: Positive × Negative results in a negative score. The signal is canceled out.

This allows the model to filter out high-volume noise and precisely select the correct next token, regardless of its raw “loudness.”
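A four-channel toy example (made-up numbers) makes the phase argument concrete: a “louder” filter whose signs disagree with the hidden state scores worse than a “quieter” filter whose signs line up.

```python
import numpy as np

# A tiny 4-channel example of phase alignment vs. cancellation.
hidden = np.array([2.0, -1.0, 3.0, -2.0])

# "Filter presets" (unembedding rows) for two hypothetical tokens:
loud_but_misaligned = np.array([4.0, 4.0, -4.0, 4.0])    # big norm, wrong signs
quiet_but_aligned = np.array([1.0, -0.5, 1.5, -1.0])     # small norm, matching signs

print(np.linalg.norm(loud_but_misaligned))  # 8.0  -> the "louder" filter
print(np.linalg.norm(quiet_but_aligned))    # ~2.1 -> the "quieter" filter

print(hidden @ loud_but_misaligned)  # 8 - 4 - 12 - 8 = -16: phases cancel out
print(hidden @ quiet_but_aligned)    # 2 + 0.5 + 4.5 + 2 = 9: phases resonate
```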

Validation from the Field

This shift from navigating a spatial map to refining a spectral signal is not just a theoretical exercise. It suggests that the most effective toolkit for analyzing and improving Large Language Models comes not from cartography, but from signal processing.

Independent research has recently converged on this very path, applying core signal-processing principles to achieve state-of-the-art results. Two striking examples:

In an article called “Towards Signal Processing In Large Language Models”, a team successfully replaced standard attention mechanisms with learnable frequency decomposition. Their approach achieved faster convergence, suggesting that signal-processing operations are a natural fit for transformer internals (Verma & Pilanci, 2024).
The spectral framework offers an explanation for why this works: transformers are already acting as signal processors, and making that explicit simply makes them more efficient.

Simultaneously, Anthropic’s interpretability team documented “superposition”, where models pack more features than dimensions by encoding them in nearly orthogonal directions. They measured this geometrically, but the spectral framework reads it as wave interference: just as multiple radio frequencies can occupy the same airwaves without collision, multiple feature “frequencies” can occupy the same latent space. (Anthropic, 2023–2024: Sparse Autoencoders, Towards Monosemanticity, Scaling Monosemanticity)

While these research teams documented signal-processing evidence, neither proposed a representational framework. Their findings, however, suggest that the signal-processing framing is grounded in how the models actually compute. The spectral framework connects these isolated observations into a unified “why.”

Published via Towards AI
