Grokking | Shortening the Loss Plateau

AI models sometimes memorize training data perfectly but still can't answer new questions for a very long time. This phenomenon is called grokking. We tested three strategies to shorten this delay, achieving speedups ranging from 8x to 316x.

All experiments use a small model on controlled arithmetic tasks. We cannot guarantee these results transfer to large-scale language models; generalization to other domains is left for future work.

What we did

We trained a small AI model on simple math problems and tested three strategies to help it generalize faster: training on more varied problems, switching training algorithms, and changing how the model starts.

What we didn't do

We did not test large models, real-world tasks, or practical applications. Results are specific to small models on a controlled arithmetic setting used here as a research tool.

Our contributions

We built the full training pipeline from scratch and ran original experiments on task diversity. Other experiments extend prior research (Power et al., 2022; Lyu et al., 2024) with new results.

What is Grokking?

Imagine a student who studies for a test by memorizing every answer in a textbook. At first, they would fail any test that asks new questions, because they memorized answers, not concepts. But after reviewing the material long enough, something clicks: they truly understand the underlying ideas and can answer any version of the test.

Grokking is the AI equivalent of this experience. First described by Power et al. (2022), it refers to a two-phase learning pattern where a model nails the training examples almost immediately, then appears completely stuck before suddenly generalizing perfectly to new examples.

Phase 1: Memorization

The model quickly learns to answer every training question correctly, but only by remembering them like a lookup table. Ask it anything new and it fails. This happens very early in training.

Phase 2: Generalization

Much later, something shifts internally. The model stops relying on memorized answers and starts understanding the underlying pattern. Accuracy on new questions then jumps sharply.

The gap between Phase 1 and Phase 2 can be enormous, sometimes hundreds of thousands of additional training steps. Training AI is expensive, so shortening this gap has real practical value.

Our baseline: The model memorized the training data almost immediately, but needed 334,000 training steps (83.5% of the run) before it could correctly answer new questions.

Figure 1 - Baseline results (division only). The orange dashed line shows accuracy on training questions, which reaches near 100% almost instantly. The red line shows accuracy on new questions, which stays near 0% for the vast majority of training before suddenly jumping to 100% near the end. The x-axis is stretched so the enormous gap is visible; on a regular scale, the transition would appear as a tiny sliver at the far right.

Why Does Grokking Happen?

When an AI model trains, it is constantly trying to reduce its mistakes. Early on, the easiest way to do this is to memorize: remember exactly what answer goes with each training question. This works perfectly for the training set, but teaches the model nothing useful about the underlying pattern.

Over time, a training technique called weight decay, which gently discourages the model from growing overly complicated, slowly pushes it toward a simpler, more general solution. When the model finally finds that simpler solution, generalization happens suddenly and dramatically.
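To make the pressure from weight decay concrete, here is a minimal sketch of the decoupled form used by AdamW-style optimizers, applied to a single weight. The learning rate and decay coefficient are illustrative assumptions, not the values from our runs, and the function names are hypothetical.

```python
def decoupled_decay_step(weight: float, grad: float, lr: float,
                         weight_decay: float) -> float:
    """One update with decoupled weight decay (the AdamW-style form).

    The weight is first shrunk toward zero, then moved against the gradient.
    """
    shrunk = weight * (1.0 - lr * weight_decay)
    return shrunk - lr * grad


def weight_after_decay_only(weight: float, steps: int, lr: float = 1e-3,
                            weight_decay: float = 1.0) -> float:
    """Weight after `steps` updates when the loss gradient is exactly zero."""
    for _ in range(steps):
        weight = decoupled_decay_step(weight, 0.0, lr, weight_decay)
    return weight


# lr and weight_decay here are assumed values chosen only for illustration.
print(weight_after_decay_only(1.0, steps=1000))
```

With no gradient holding it in place, a weight shrinks geometrically, by a factor of (1 - lr * weight_decay) per step. Weights that the loss still depends on are pushed back by their gradients, which is one way to picture how large memorizing solutions get slowly eroded.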

In short: the model takes the easy route first (memorization), and only later, under pressure, finds the right route (understanding). Our goal was to find ways to make that transition happen faster.

Our Approach

We used clock (modular) arithmetic as our testing ground: a type of math where numbers wrap around after reaching a limit, just like a clock that resets after 12. For example, 10 + 5 on a 12-hour clock is 3, not 15. Our version wraps around at 97. The model's job is to predict the correct wrap-around result given two numbers and an operation.
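The wrap-around rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not our training code; the function name `mod_op` is hypothetical. Division is computed by multiplying by a modular inverse, which exists for every nonzero number because 97 is prime.

```python
P = 97  # the modulus used in this write-up


def mod_op(a: int, op: str, b: int, p: int = P) -> int:
    """Return the wrap-around result of `a op b` modulo p."""
    if op == "+":
        return (a + b) % p
    if op == "-":
        return (a - b) % p
    if op == "*":
        return (a * b) % p
    if op == "/":
        if b % p == 0:
            raise ZeroDivisionError("division by zero has no answer modulo p")
        # pow(b, -1, p) is the modular inverse of b (Python 3.8+).
        return (a * pow(b, -1, p)) % p
    raise ValueError(f"unknown operation: {op}")


print(mod_op(10, "+", 5, p=12))  # 3, the clock example above
print(mod_op(90, "+", 10))       # 3, because 100 wraps past 97
print(mod_op(5, "/", 3))         # 34, because 34 * 3 = 102 = 97 + 5
```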

Experiment 1 results are reported as training progress %: how far through training the model was when it generalized, rather than raw step counts. This makes it easy to compare experiments that ran for different lengths. Experiments 2 and 3 are reported in raw steps. We define "generalized" as the first point at which the model answers at least 95% of new questions correctly, following the standard used in prior work. Small differences below ~1% training progress should be interpreted cautiously.
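As a rough sketch of how these two definitions fit together, the helpers below find the first step at which validation accuracy crosses the threshold and convert a step count into training progress %. The function names and the toy history are hypothetical. The 400,000-step total is inferred from the baseline figures in this write-up (334,000 steps reported as 83.5%) rather than stated directly.

```python
from typing import Optional, Sequence, Tuple


def first_generalization_step(history: Sequence[Tuple[int, float]],
                              threshold: float = 0.95) -> Optional[int]:
    """Return the first logged step whose validation accuracy meets the threshold.

    `history` is a sequence of (step, validation_accuracy) pairs in step order.
    Returns None if the threshold is never reached.
    """
    for step, val_acc in history:
        if val_acc >= threshold:
            return step
    return None


def training_progress_pct(step: int, total_steps: int) -> float:
    """Express a step count as a percentage of the full training run."""
    return 100.0 * step / total_steps


# Toy history with made-up values, only to show the threshold rule.
toy_history = [(100, 0.02), (200, 0.40), (300, 0.97), (400, 0.99)]
print(first_generalization_step(toy_history))

# Baseline from this write-up: 334,000 steps out of an inferred 400,000.
print(training_progress_pct(334_000, 400_000))
```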

The Model

We used a small Transformer (the same family of architecture behind ChatGPT) with around 400,000 internal connections (parameters), compared to billions in production systems.

The Training Algorithm

By default we used AdamW, a popular and stable training algorithm. One experiment swapped this for SGD (Stochastic Gradient Descent), a simpler and noisier alternative.

The Tasks

Four arithmetic operations using wrap-around math:
Division (hardest to learn)
Multiplication
Addition
Subtraction
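For readers who want to reproduce the setup, a dataset for any one of these operations can be enumerated exhaustively. The sketch below is an assumption about how such a dataset might be built, not our exact pipeline: the 50% train fraction, the seed, and the function names are all hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[int, str, int, int]  # (a, operation, b, answer)


def build_examples(op: str, p: int = 97) -> List[Example]:
    """Enumerate every (a, b) pair for one operation modulo p."""
    examples: List[Example] = []
    for a in range(p):
        for b in range(p):
            if op == "+":
                answer = (a + b) % p
            elif op == "-":
                answer = (a - b) % p
            elif op == "*":
                answer = (a * b) % p
            elif op == "/":
                if b == 0:
                    continue  # division by zero is undefined, so skip it
                answer = (a * pow(b, -1, p)) % p
            else:
                raise ValueError(f"unknown operation: {op}")
            examples.append((a, op, b, answer))
    return examples


def split_examples(examples: List[Example], train_fraction: float = 0.5,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle and split into training questions and held-out new questions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, held_out = split_examples(build_examples("/"))
print(len(train), len(held_out))
```

Division has 97 x 96 = 9,312 possible questions (no division by zero), while the other three operations have 97 x 97 = 9,409 each.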

Experiment 1: Task Diversity

Our first strategy: train on multiple types of arithmetic at the same time rather than just one. A model trained on all operations at once cannot rely on memorizing shortcuts for any single task. It has to find the deeper pattern they all share.

Key observation: Not all tasks are equally hard to generalize. When trained alone, addition generalizes at ~14% training progress, multiplication at ~16%, and subtraction at ~37%, but division takes until 83.5%. Division is likely the hardest because it requires computing a modular inverse, a more complex operation than the others.

Training all four tasks together

We scaled total training time so each task received the same number of training examples as in a single-task run. Every task generalized dramatically faster:

1st: Multiplication

Generalized at 4.3% training progress (68,250 steps)

2nd: Division

Generalized at 4.7% training progress (75,750 steps)

3rd: Addition

Generalized at 6.5% training progress (104,150 steps)

4th: Subtraction

Generalized at 7.8% training progress (124,650 steps)

Figure 2 - All four tasks trained together. The black curve is the single-task baseline (division only), which does not reach the 95% threshold until 83.5% training progress. Every colored curve crosses the threshold well before 10% training progress. Training on variety pushed the model to understand rather than memorize.

Two-task combinations

Not every pairing helped equally. Pairing division with multiplication produced by far the best result: both tasks generalized at just ~0.7% training progress, a ~119x speedup over division alone (83.5%). Pairing division with addition or subtraction actually made things worse than training on division alone.

We think this is because division and multiplication are closely related: in wrap-around arithmetic, dividing by a number is the same as multiplying by its inverse, so the two tasks share structure. Addition and subtraction work differently. When two tasks are similar enough, the model finds shared patterns that help both. When they are too different, they may pull the model in conflicting directions.

Figure 3 - Division and multiplication trained together. Both reach 95% accuracy at just ~0.7% training progress, roughly 119x faster than the black division-only baseline (83.5%). This was the fastest result across all task combination experiments.

What we learned along the way

In early runs, randomly mixing tasks in each training batch caused one task to dominate and hurt the others. We fixed this by scaling total training time with the number of tasks, ensuring each task always received equal representation.
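One way to picture the fix is sketched below. Scaling total steps with the number of tasks follows the description above; drawing a fixed number of examples per task in every batch is our illustrative assumption about how equal representation could be enforced, and the function names are hypothetical. The 400,000-step single-task run length is inferred from the baseline figures (334,000 steps at 83.5%).

```python
import random
from typing import Dict, List, Sequence


def scaled_total_steps(single_task_steps: int, num_tasks: int) -> int:
    """Scale the run so each task sees as much training as in a single-task run."""
    return single_task_steps * num_tasks


def balanced_batch(pools: Dict[str, Sequence], per_task: int,
                   rng: random.Random) -> List:
    """Draw the same number of examples from every task, then shuffle them."""
    batch: List = []
    for task in sorted(pools):
        # Sample with replacement so small pools still fill their quota.
        batch.extend(rng.choices(pools[task], k=per_task))
    rng.shuffle(batch)
    return batch


# Toy pools with placeholder examples, only to show the batch composition.
toy_pools = {
    "div": [("div", i) for i in range(5)],
    "mul": [("mul", i) for i in range(5)],
}
toy_batch = balanced_batch(toy_pools, per_task=8, rng=random.Random(0))
print(len(toy_batch), scaled_total_steps(400_000, 4))
```

With four tasks and an inferred 400,000-step single-task run, the scaled run is 1,600,000 steps, which matches the reported figures (for example, 68,250 steps is about 4.3% of 1,600,000).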

Experiment 2: Introducing Noise

Our second strategy: swap the default training algorithm for a noisier one. AdamW carefully smooths each update to keep training stable. SGD is less careful: its updates are rougher and less predictable. That roughness can help shake the model out of the memorization trap, much like jostling a stuck drawer can suddenly free it.
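The difference between the two update rules can be sketched on a single weight. This is a simplified illustration of the textbook SGD and AdamW updates, not our training code; the hyperparameter defaults are assumed values and the names are hypothetical.

```python
import math
from dataclasses import dataclass


def sgd_step(w: float, grad: float, lr: float) -> float:
    """Plain SGD: move directly against the raw gradient."""
    return w - lr * grad


@dataclass
class AdamWState:
    m: float = 0.0  # running average of gradients
    v: float = 0.0  # running average of squared gradients
    t: int = 0      # number of updates so far


def adamw_step(w: float, grad: float, state: AdamWState, lr: float = 1e-3,
               beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
               weight_decay: float = 0.01) -> float:
    """AdamW: smooth the gradient, normalize its scale, and decay the weight."""
    state.t += 1
    state.m = beta1 * state.m + (1.0 - beta1) * grad
    state.v = beta2 * state.v + (1.0 - beta2) * grad * grad
    m_hat = state.m / (1.0 - beta1 ** state.t)  # bias correction
    v_hat = state.v / (1.0 - beta2 ** state.t)
    w = w * (1.0 - lr * weight_decay)           # decoupled weight decay
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


# The same unusually large gradient hits both optimizers.
spike = 100.0
print(sgd_step(0.0, spike, lr=1e-3))
print(adamw_step(0.0, spike, AdamWState(), lr=1e-3, weight_decay=0.0))
```

In this sketch, a gradient spike moves the SGD weight in proportion to the spike, while the AdamW step stays close to the learning rate because the update is divided by the running gradient scale. That is the sense in which SGD's updates are rougher.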

Too much noise: Unstable

The model generalized very quickly but then immediately fell apart. Too much noise is destabilizing and not usable in practice.

Moderate noise: Stable

Generalized at step 44,900, roughly 8x faster than the baseline. Final accuracy: 86%. A real speedup, but with a tradeoff.

The tradeoff: The stable run was 8x faster, but final accuracy capped at ~86% rather than ~100%. The same roughness that helped escape memorization also prevented the model from fully converging later on. Whether this tradeoff is worth it depends on whether speed or accuracy matters more for a given use case.

Figure 4 - Moderate noise (SGD, LR=0.005). Generalization happens roughly 8x faster than the baseline: the model reaches 95% accuracy at step 44,900, compared to step 334,000 with the default algorithm. However, the validation curve plateaus at ~86% final accuracy rather than continuing to climb toward ~100%.

Experiment 3: Starting Small

Our third strategy: limit the model's capacity right from the start. Normally, a model begins training with all its internal connections active, giving it plenty of room to build a memorization circuit. What if we dramatically reduced that starting capacity?

The idea: If the model starts with very few active connections, it cannot afford to memorize. It has to find the most efficient, general solution right away. Think of it like giving a student a tiny notecard instead of a full textbook: they are forced to write down only the key concepts, not every answer.

Sparse start (90% inactive)

90% of connections set to zero at the start.
Generalization delay: 1,050 steps
Final accuracy: 99.68%

Tiny-scale start

All connections initialized at a very small value.
Generalization delay: 1,050 steps
Final accuracy: 73.75%

Figure 5 - Sparse start (90% inactive). Generalizes at just 1,050 steps with 99.68% final accuracy. Training and validation curves are nearly identical, both reaching near 100% within 1,050 steps: the memorization phase has essentially vanished.

Figure 6 - Tiny-scale start. Accuracy on new questions also reaches 95% at 1,050 steps, but final accuracy levels off at 73.75%, suggesting that starting too small may limit how much the model can ultimately learn.

Both approaches reduced the generalization delay from ~332,000 steps to just 1,050 steps, a ~316x speedup. The sparse start also maintained near-perfect final accuracy, making it the strongest overall result we found.
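The two starting conditions can be sketched on a flat list of weights. This is a minimal illustration under stated assumptions, not our initialization code: the Gaussian draws, the scale values, the seed, and the function names are hypothetical, and only the 90% inactive fraction comes from the experiment above.

```python
import random
from typing import List


def sparse_init(n: int, inactive_fraction: float = 0.9, scale: float = 1.0,
                seed: int = 0) -> List[float]:
    """Draw n random weights, then set a fixed fraction of them to zero."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, scale) for _ in range(n)]
    n_zero = int(round(n * inactive_fraction))
    for i in rng.sample(range(n), n_zero):
        weights[i] = 0.0
    return weights


def tiny_scale_init(n: int, scale: float = 1e-3, seed: int = 0) -> List[float]:
    """Draw n weights that are all active but very small in magnitude."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(n)]


sparse = sparse_init(1000)
tiny = tiny_scale_init(1000)
print(sum(1 for w in sparse if w == 0.0))  # number of inactive connections
print(max(abs(w) for w in tiny))           # largest tiny-scale weight
```

The sparse start leaves a small set of normally sized weights, while the tiny-scale start keeps every weight active but shrinks them all.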

Open Question: It is unclear whether starting sparse would still work for larger, more powerful models. Those models may genuinely need their full capacity to learn at all. Testing this is an important direction for future work.

Summary & Takeaways

Across three experiments, we showed that the delay between memorization and generalization is not a fixed feature of how AI learns. It can be dramatically shortened. All three strategies share a common thread: they appear to work by preventing the model from over-committing to memorization in the first place.

Task Diversity

Train on all four operations at once.
83.5% to 4.7% progress for Division.
~18x speedup. Full accuracy maintained.

Noisier Training

Swap training algorithm to SGD.
332,000 to 41,500 step delay.
~8x speedup. Final accuracy capped at ~86%.

Sparse Start

Begin with 90% of connections inactive.
332,000 to 1,050 step delay.
~316x speedup. Full accuracy maintained.
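The speedup figures in these summaries are simple ratios of the baseline to the improved value, using the numbers reported above. The short check below recomputes them; the helper name `speedup` is just for illustration.

```python
def speedup(baseline: float, improved: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline / improved


results = {
    "task diversity, division (% progress)": speedup(83.5, 4.7),
    "division + multiplication (% progress)": speedup(83.5, 0.7),
    "noisier training (step delay)": speedup(332_000, 41_500),
    "sparse start (step delay)": speedup(332_000, 1_050),
}

for name, value in results.items():
    print(f"{name}: ~{value:.1f}x")
```

Task diversity is compared in training progress % because those runs differ in length, while the other two are compared in raw step delay.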

Why does this matter?

Training AI models is slow and expensive. Anything that helps models generalize faster without sacrificing accuracy has direct practical value. Our results point to two particularly promising levers: training on diverse, related tasks and starting with a constrained model. Both are simple to apply and produce large speedups in our setting. Whether these benefits carry over to larger, real-world models is the key open question.

Limitations

Our experiments used a small, controlled research setting with a fixed model size and a single type of wrap-around arithmetic. We cannot say whether the same strategies would work for large AI models used in real-world products. Our explanation for why task diversity helps is a hypothesis we did not directly verify by examining the model's internals. And our 95% accuracy threshold for defining "generalized" is a convention: small differences in timing between experiments should not be over-interpreted.

References

Power et al. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177.

Lyu, Jin, Li, Du, Lee & Hu (2024). Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. ICLR 2024.

Kim et al. (2025). Task Diversity Shortens the ICL Plateau. arXiv preprint.

Lee et al. (2024). Grokfast: Accelerated Grokking by Amplifying Slow Gradients. arXiv:2405.20233.
