Author(s): Sai Insights
Originally published on Towards AI.
Google’s Nested Learning: The Brain-Inspired AI That Never Forgets 🧠✨
Discover how Google’s Nested Learning framework, inspired by neuroscience, solves AI’s biggest problem — catastrophic forgetting. Learn about HOPE, multi-frequency memory, and the future of continual learning in deep learning models.
Have you ever wondered why your brain can learn something new without completely forgetting what you learned yesterday? Meanwhile, AI models struggle with this exact problem. 😅
Google just dropped something fascinating at NeurIPS 2025 that might change everything. It’s called Nested Learning, and it’s not just another incremental improvement — it’s a complete rethinking of how AI learns.
Let me walk you through what makes this so special.
The Problem: AI Has Amnesia 🤯
Imagine you trained an AI model on medical data for months. It works beautifully. Then you update it with new information about a recent drug discovery. Suddenly… it forgets half of what it knew before.
This is called catastrophic forgetting, and it’s the Achilles’ heel of modern AI.
Current large language models are essentially frozen after training. Sure, they can use information within their context window (like what you tell them in a conversation), but they can’t actually learn new things and retain them long-term. It’s like having anterograde amnesia — you can remember your past but can’t form new permanent memories.
The Google Research team draws exactly this parallel in the paper: today’s models process information within a limited context window, but once that window ends, no new knowledge actually sticks.
What Jensen Huang Knew All Along 💡
Remember when Nvidia’s CEO Jensen Huang said he’d study physics if he were 22 today? Not computer science — physics.
At first, that seemed weird. But here’s the thing: the next breakthrough in AI won’t come from software alone but from truly understanding the physical world, including how our brains actually work.
And Google’s Nested Learning proves he was onto something.

How Your Brain Actually Learns (And Why AI Doesn’t) 🔬
Let’s talk neuroscience for a second — but don’t worry, I’ll keep it simple.
Your brain doesn’t operate at one speed. Different parts update at different rates:
- Gamma waves (30–150 Hz): Handle rapid sensory processing — what you’re seeing right now
- Beta waves (13–30 Hz): Active thinking and problem-solving
- Theta/Delta waves (0.5–8 Hz): Memory consolidation during sleep
Think of it like this: Your brain has a multi-lane highway where some information zooms by at 100 mph (immediate reactions), while other information travels slowly at 10 mph (deep, long-term learning).
Current AI models? They’re stuck in one lane, one speed. Everything updates at the same rate during training, then nothing updates afterward.

What Is Nested Learning? The Big Idea 🎯
Here’s where it gets exciting.
Instead of treating a neural network as one giant block that learns all at once, Nested Learning views it as a collection of nested optimization problems, each running at its own speed.
Think of it like Russian nesting dolls, but each doll is learning independently:
- Level 1 (Fast): Learns from immediate context — like answering your current question
- Level 2 (Medium): Learns your conversational patterns and style
- Level 3 (Slow): Stores stable knowledge like grammar rules and facts
When you fine-tune the model on something new (say, finance), the fast inner level adapts quickly, but the slow outer levels stay stable. So it learns new things without forgetting old ones. 🎉
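To make the nesting concrete, here’s a tiny sketch in Python (my own toy illustration, not code from the paper): a frozen “slow” weight matrix stands in for stable knowledge, while a small “fast” memory gets a gradient step on every new example, so new information lands in the fast level without touching the slow one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Slow level: pretrained weights, frozen here (stable knowledge).
W_slow = rng.normal(size=(4, 4))

# Fast level: a small residual memory, updated on every new example.
W_fast = np.zeros((4, 4))
fast_lr = 0.1

def predict(x):
    # The output combines the frozen slow level with the adaptive fast level.
    return (W_slow + W_fast) @ x

def adapt_fast(x, y_target):
    """One gradient step on the fast level only (squared-error loss)."""
    global W_fast
    error = predict(x) - y_target            # prediction error ("surprise")
    W_fast -= fast_lr * np.outer(error, x)   # the slow weights never move

# New examples get absorbed by the fast level; the slow level stays intact.
for _ in range(20):
    x = rng.normal(size=4)
    y = np.tanh(W_slow @ x) + 0.5            # some "new" relationship to pick up
    adapt_fast(x, y)
```

The point isn’t the specific math, just the separation of timescales: one level adapts per example, the other stays put.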

☕ Hey! Are you finding this useful?

If this deep dive is making those neural pathways fire up and you’re finding value in the breakdown, I’d genuinely appreciate it if you could buy me a coffee. These research deep-dives take hours to craft, and your support helps me keep exploring and explaining the cutting edge of AI. No pressure though — bookmark this and come back anytime! 🙏
The Math Made Simple
Okay, I promise to keep this beginner-friendly. Here’s the core idea:
Traditional training: You have one objective function, one optimization process. Everyone marches to the same drum.
Nested Learning: You have multiple optimization problems, each with its own:
- Context flow: The data it’s learning from (could be tokens, gradients, or anything)
- Update frequency: How often it changes
- Learning objective: What it’s trying to optimize
Each “level” is basically an associative memory: it learns to map inputs to outputs. And here’s the magic of this framing: every existing deep learning component already learns from data by compressing its own context flow into its parameters.
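If you want those three ingredients spelled out, here’s a rough sketch (the field names below are mine, not the paper’s notation): a level bundles its context flow, its update frequency, and its objective, and its “memory” is just a matrix that learns to map keys to values by gradient descent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable
import numpy as np

@dataclass
class Level:
    """One nested optimization problem (illustrative fields, not the paper's API)."""
    context_flow: Iterable   # the stream this level compresses (tokens, gradients, ...)
    update_period: int       # how many steps pass between its parameter updates
    objective: Callable      # what this level is trying to minimize

def associative_loss(W, key, value):
    # How badly does the memory W map this key to its value?
    return 0.5 * np.sum((W @ key - value) ** 2)

def associative_update(W, key, value, lr=0.1):
    # One gradient step = compressing one (key, value) pair into the parameters.
    return W - lr * np.outer(W @ key - value, key)
```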

Three Game-Changing Innovations 🚀
Google’s paper introduces three major breakthroughs:
1. Deep Optimizers: Learning How to Learn
Remember Adam, SGD, those optimizer algorithms? Turns out they’re secretly associative memories that compress gradients.
Here’s what that means in plain English: When training a model, the optimizer doesn’t just mindlessly update weights. It’s actually learning patterns in how the model should change. But current optimizers (like Adam) are pretty simple — they just average recent changes.
Nested Learning shows we can make optimizers much smarter. Instead of a simple average, use a deep neural network as the optimizer itself. It can:
- Remember which changes worked long ago
- Adapt its learning strategy based on what it’s seen
- Handle complex scenarios like learning multiple unrelated tasks
Think of it like this: Instead of following a fixed recipe, the optimizer becomes a chef that improvises based on everything it’s cooked before.
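Here’s a toy version of that in code (a sketch in the spirit of deep optimizers, not Google’s implementation): a tiny two-layer network reads each parameter’s gradient and momentum and proposes the update, replacing Adam’s fixed averaging rule. In a real system those optimizer weights would themselves be meta-trained; here they’re random, purely to show the structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny MLP that maps (gradient, momentum) -> proposed update, per parameter.
U1 = rng.normal(scale=0.1, size=(8, 2))
U2 = rng.normal(scale=0.1, size=(1, 8))

def deep_optimizer_step(grad, momentum):
    feats = np.stack([grad, momentum], axis=-1)   # shape (n_params, 2)
    hidden = np.tanh(feats @ U1.T)                # learned features of the history
    return (hidden @ U2.T).squeeze(-1)            # proposed update per parameter

# Usage: replace `param -= lr * grad` with the learned rule.
param = rng.normal(size=5)
momentum = np.zeros(5)
for step in range(10):
    grad = 2 * param                  # gradient of the toy loss sum(param**2)
    momentum = 0.9 * momentum + grad  # a simple memory of past gradients
    param = param + deep_optimizer_step(grad, momentum)
```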
2. Self-Modifying Titans: Models That Improve Themselves
This is straight out of science fiction. 🎬
Current models are static. You train them, they’re done. Maybe you fine-tune them later.
Self-Modifying Titans can literally change how they change. Lower levels learn from data, while higher levels learn how the lower levels should learn.
It’s recursive learning — the model becomes its own teacher.
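A minimal way to picture this (again, my own toy sketch rather than the actual Titans mechanism): a slow outer level sets the learning rate that the fast inner level uses, and the outer level nudges itself based on whether the inner level’s last step actually helped.

```python
import numpy as np

rng = np.random.default_rng(2)

w = rng.normal(size=3)    # inner level: learns from the data
log_lr = np.log(0.05)     # outer level: learns how the inner level should learn

def loss_and_grad(w, x, y):
    err = w @ x - y
    return 0.5 * err**2, err * x

for step in range(200):
    x = rng.normal(size=3)
    y = np.array([1.0, -2.0, 0.5]) @ x    # target relationship to learn

    loss_before, grad = loss_and_grad(w, x, y)
    w = w - np.exp(log_lr) * grad         # inner update, rule set by the outer level
    loss_after, _ = loss_and_grad(w, x, y)

    # Outer level learns from the inner level's progress: nudge the learning
    # rate up if the step helped, down if it hurt.
    log_lr += 0.01 * np.sign(loss_before - loss_after)
```

The inner loop is ordinary gradient descent; the twist is that its update rule is itself a learned, slowly changing quantity.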

3. HOPE Architecture: Memory That Actually Works
The crown jewel of this research is HOPE (Higher-Order Processing Engine). It’s not just a theory — it’s a working model that proves Nested Learning works in practice.
HOPE introduces something called a Continuum Memory System (CMS). Instead of the old binary split (short-term vs. long-term memory), HOPE has a spectrum of memories, each refreshing at its own rate (there’s a toy sketch right after this list):
- Ultra-fast memory: Adapts instantly to new tokens
- Fast memory: Learns patterns across sentences
- Medium memory: Captures document-level themes
- Slow memory: Stores permanent knowledge
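To get a feel for that continuum, here’s a deliberately simple sketch (the real CMS blocks are learned neural memories, not running averages, and the periods below are made-up numbers): a chain of summaries over the token stream, each refreshed at its own rate.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 8

# One memory per timescale, refreshed every 1, 8, 64, and 512 tokens.
periods = {"ultra_fast": 1, "fast": 8, "medium": 64, "slow": 512}
memories = {name: np.zeros(dim) for name in periods}
buffers = {name: [] for name in periods}

def observe(token_vec, step):
    """Each memory compresses the tokens seen since its last refresh."""
    for name, period in periods.items():
        buffers[name].append(token_vec)
        if (step + 1) % period == 0:
            summary = np.mean(buffers[name], axis=0)
            memories[name] = 0.5 * memories[name] + 0.5 * summary
            buffers[name] = []

for step in range(2048):
    observe(rng.normal(size=dim), step)   # stand-in for a stream of token embeddings
```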

The Results: HOPE Crushes the Competition 📊
Okay, enough theory. Does this actually work?
Hell yes. 💪
HOPE demonstrates lower perplexity and higher accuracy compared to modern recurrent models and standard transformers across multiple benchmarks:
Language Modeling & Common Sense:
- Beats standard Transformers
- Outperforms modern recurrent models (Samba, Titans)
- Lower perplexity = better predictions
Long-Context Understanding:
- Crushes the Needle-in-a-Haystack test (finding specific info in massive contexts)
- Handles the BABILong benchmark with ease
- Can actually remember things from way earlier in the conversation
Continual Learning:
- Learns new languages without forgetting old ones
- Adds new knowledge without catastrophic forgetting
- Actually gets better over time instead of degrading
The experimental results span:
✅ Language modeling tasks
✅ Long-context reasoning (needle-in-a-haystack, BABILong)
✅ Continual learning scenarios
✅ Knowledge incorporation
✅ Few-shot generalization

What This Means for the Future of AI
This is bigger than just better models. Nested Learning enables models that do not just infer but acquire, consolidate, and retain knowledge over time, just as biological systems do.
We’re talking about AI that:
- Learns from experience like humans do
- Doesn’t need to be “retrained” from scratch
- Can actually accumulate wisdom over time
- Might eventually achieve genuine continual learning

The Technical Deep Dive (For The Curious) 🤓
Alright, for those who want to geek out a bit more, let’s talk about how this actually works under the hood.
Associative Memory: The Foundation
Every component in Nested Learning — including the optimizer — is an associative memory. What does that mean?
An associative memory maps keys to values. Your brain does this constantly:
- See a face (key) → recall a name (value)
- Smell coffee (key) → remember the café (value)
- Read “2 + 2” (key) → think “4” (value)
In Nested Learning, everything from the model itself to the training algorithm is framed as:
“Given input X, what’s the best output Y, and how do I compress this pattern into my parameters?”
The paper shows that even backpropagation — the standard way we train neural networks — can be viewed as an associative memory that maps data to “surprise” (how unexpected the prediction was).
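Here’s that reading in miniature (my own toy, with a linear model so the gradient is easy to write out): the “surprise” for a data point is just the prediction error, and one backprop step writes a compressed (input, surprise) association into the weights.

```python
import numpy as np

rng = np.random.default_rng(4)
W = np.zeros((2, 3))                    # the "memory" that backprop writes into
W_true = np.array([[1.0, 0.0, -1.0],
                   [0.5, 2.0,  0.0]])   # the mapping we want the memory to absorb

def backprop_step(W, x, y, lr=0.1):
    surprise = W @ x - y                # how unexpected the prediction was
    # The update is an outer product of surprise and input: backprop literally
    # stores a compressed (input -> surprise) association in the weights.
    return W - lr * np.outer(surprise, x)

for _ in range(200):
    x = rng.normal(size=3)
    W = backprop_step(W, x, W_true @ x)
```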

The Update Frequency Hierarchy
Components are organized by how often they update (there’s a toy sketch after this list):
- Frequency ∞ (attention mechanisms): Updated with every single token. Fastest adaptation, no persistent memory.
- Frequency 1 (standard layers during training): Updated during training, frozen afterward. This is where most neural network parameters live.
- Frequency 0 (frozen pre-trained weights): Never updated. Core knowledge that shouldn’t change.
- Frequency between 0 and 1 (the innovation!): Memory that updates sometimes. This is the sweet spot HOPE exploits to create the continuum.
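In code, this hierarchy just amounts to tagging each component with an update period, the inverse of the frequencies above (the component names and numbers here are made up for illustration): period 1 means update every step, a large finite period gives the in-between “sometimes” regime, and an infinite period means frozen.

```python
import math

# Update period per component: 1 = every step, math.inf = frozen,
# anything in between = the "frequency between 0 and 1" regime.
update_period = {
    "attention_state": 1,         # refreshed with every token / step
    "cms_fast_block": 8,          # the in-between continuum regime
    "cms_slow_block": 512,
    "pretrained_core": math.inf,  # never updated: stable core knowledge
}

def components_to_update(step):
    return [name for name, period in update_period.items()
            if period != math.inf and step % period == 0]

print(components_to_update(7))    # only the every-step component
print(components_to_update(512))  # every-step component plus both CMS blocks
```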
Why Optimizers Are Memories Too
Here’s a mind-bender: gradient-based optimizers are associative memory modules that aim to compress the gradients’ information.
When Adam updates your model, it’s not just following a formula. It’s:
- Remembering recent gradients (momentum)
- Remembering recent gradient magnitudes (adaptive learning rate)
- Using both to decide how to change weights
That’s memory! It’s compressing past gradient information into a few parameters.
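Concretely, here’s a bare-bones Adam step written out so its memory is visible: the two exponential moving averages m and v are all that Adam remembers about every gradient it has ever seen.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # memory of recent gradients (momentum)
    v = b2 * v + (1 - b2) * grad**2        # memory of recent gradient magnitudes
    m_hat = m / (1 - b1**t)                # bias correction for early steps
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage on a toy quadratic: minimize sum(param**2).
param = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(param)
v = np.zeros_like(param)
for t in range(1, 201):
    grad = 2 * param
    param, m, v = adam_step(param, grad, m, v, t)
```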
Nested Learning says: “Why stop at simple averaging? Let’s use deep neural networks as optimizers.” That’s what Deep Optimizers do — they’re like giving your optimizer a brain upgrade.
The Bottom Line 🎬
Google’s Nested Learning isn’t just another paper. It’s a fundamental rethinking of what learning means in artificial intelligence.
By drawing inspiration from neuroscience — specifically how the brain operates at multiple timescales — the team has cracked a problem that’s plagued AI for decades: catastrophic forgetting.
The key insights:
- 🧠 Multi-frequency updates mimic how brains actually work
- 🔄 Nested optimization allows learning at different abstraction levels
- 🎯 Associative memory framework unifies architectures and optimizers
- ⚡ HOPE architecture proves it works in practice
We’re moving from models that are trained once and frozen, to neural learning modules that truly learn over time. Models that don’t just process information but accumulate wisdom.
Jensen Huang was right. The future of AI isn’t just about better code — it’s about understanding the principles of learning itself, whether in silicon or in carbon.
And Nested Learning? It’s showing us the way. 🚀
References & Further Reading 📚
Primary Paper:
Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architectures. Advances in Neural Information Processing Systems (NeurIPS) 2025.
Related Work:
- Vaswani et al. (2017) — “Attention Is All You Need” (the original Transformer paper)
- Finn et al. (2017) — “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks” (MAML)
- Behrouz et al. (2025) — “Titans: Learning to Memorize at Test Time”
🚀 Want to Master More AI?
Subscribe to my YouTube channel for in-depth tutorials, hands-on coding sessions, and the latest AI insights! 📺✨
👆 Hit that subscribe button and ring the notification bell to never miss cutting-edge content!
🔗 Let’s Connect & Collaborate!
I’m passionate about sharing knowledge and building amazing AI solutions. Let’s connect:
🐙 GitHub: Link — Check out my latest projects and code repositories
📧 Email: (Sai Insights) — Reach out directly for inquiries or collaboration

☕ Support me: Buy Me a Coffee Link — Help me create more content
What do you think? Is Nested Learning the future of AI, or just another interesting experiment? Drop your thoughts in the comments below! 👇
And if you found this article helpful, don’t forget to share it with fellow AI enthusiasts. Let’s spread the knowledge! ✨
Published via Towards AI