- **The Problem.** LLMs are static after training. They cannot update their weights when encountering new information, forcing you either to retrain entirely (expensive) or to stuff everything into the context window (limited, slow).
- **The Solution.** SEAL teaches models to generate their own fine-tuning data ("self-edits"). When shown new information, the model transforms it into a format optimized for its own learning, then updates its weights via LoRA.
LoRA adds small trainable matrices to a frozen model, enabling fast weight updates without modifying the original parameters. It reduces compute and memory, making on-the-fly fine-tuning cheap enough for real-time adaptation.
- **The Results.** A 7B model with SEAL outperforms GPT-4.1 synthetic data on knowledge tasks (47.0% vs 46.3%). On abstract reasoning, SEAL achieves 72.5% success where in-context learning scores 0%.
- **The Business Case.** Smaller models that match larger-model performance mean lower inference costs. Internalized knowledge means no retrieval latency. Adaptable models mean no constant retraining cycles.
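LoRA's role can be made concrete with a small numerical sketch: the frozen weight `W` gains a low-rank additive correction `(alpha / r) * B @ A`, and only `A` and `B` receive gradients. Dimensions here are illustrative, not the paper's configuration.

```python
import numpy as np

# Hypothetical dimensions for illustration; real LLM layers are far larger.
d_out, d_in, rank, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection (zero init)

def lora_forward(x):
    # Base output plus the low-rank correction, scaled by alpha / rank
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x), W @ x)
```

Only `rank * (d_in + d_out)` parameters train, versus `d_in * d_out` for the full matrix, which is why applying one LoRA update per self-edit is cheap enough for SEAL's inner loop.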
Research overview
If you have built applications on top of LLMs, you have hit the knowledge cutoff problem. Your model does not know about events after its training date. It cannot learn from your proprietary documents. Every query requires stuffing relevant context into the prompt, burning tokens and hoping the model pays attention to the right parts.
The standard solutions have tradeoffs. Retrieval-Augmented Generation (RAG) helps but adds latency and retrieval errors. Fine-tuning works but requires curating datasets and significant compute. In-context learning hits context limits and struggles with complex adaptation.
RAG combines a language model with an external document store: the model first retrieves relevant passages, then conditions its generation on them. It improves access to up-to-date knowledge but introduces a separate retrieval step and possible mismatches between retrieved content and user intent.
RAG vs SEAL: When to use which
| Dimension | RAG | SEAL |
|---|---|---|
| Knowledge storage | External (database) | Internal (weights) |
| Query latency | Higher (retrieval step) | Lower (direct inference) |
| Context limits | Bounded by window | Not context-bound |
| Update speed | Instant (add to DB) | Minutes (fine-tune) |
| Accuracy on updates | Retrieval errors possible | Internalized knowledge |
| Best for | Frequently changing data | Stable domain knowledge |
RAG and SEAL are not mutually exclusive. RAG excels at rapidly changing information (news, prices, inventory). SEAL excels at stable domain knowledge that should be internalized (company policies, technical documentation, domain expertise). Production systems may use both.
Neural network weights are fixed after training. The model's knowledge is baked into billions of parameters that do not change when you send it new information. In-context learning is not really learning. The model is pattern-matching within your prompt, not updating its internal knowledge.
SEAL proposes something different: teach the model to teach itself. When given new information, the model generates its own training data (called "self-edits"), applies a lightweight fine-tuning step, and emerges with updated weights. The key insight is using reinforcement learning to optimize what kind of self-edits lead to better learning outcomes.
Key results
| Task | Before SEAL | After SEAL | Best Baseline |
|---|---|---|---|
| Knowledge QA | 32.7% | 47.0% | 46.3% (GPT-4.1) |
| Few-shot ARC | 0% (ICL) | 72.5% | 20% (no RL) |
The knowledge incorporation results are striking: a 7B parameter model generating its own training data outperforms synthetic data from GPT-4.1, a much larger model. The model has learned what format of information is easiest for itself to absorb.
The static model problem
Current LLMs face a fundamental limitation: they cannot update their own knowledge. This creates three practical problems for production systems.
Knowledge cutoff. Models freeze at their training date. A model trained in January 2025 knows nothing about February 2025. For rapidly evolving domains (security, finance, current events), this staleness is critical.
Proprietary knowledge. Your company's internal documents, customer data, and domain expertise are invisible to the model. RAG can surface relevant passages, but the model cannot truly internalize this knowledge.
Adaptation inefficiency. When models encounter new task patterns, they cannot learn from experience. Every similar query starts from scratch. A model that successfully handles a complex workflow cannot "remember" the solution for next time.
SEAL Framework: Self-Adapting LLMs
Model generates training data, updates weights, improves via RL
The diagram above shows SEAL's approach. Instead of treating the model as a static artifact, SEAL adds a self-modification loop. The model sees new input, generates training data for itself, updates its weights, and improves on the downstream task. Reinforcement learning optimizes this entire loop.
How SEAL works
SEAL operates through two nested loops: an outer RL loop that improves the self-edit generation, and an inner loop that applies each self-edit to update the model.
A self-edit is the model's generated output that becomes its own training data. For knowledge tasks, this might be implications and restatements of a passage. For few-shot learning, this includes data augmentation choices and hyperparameter settings. The model learns to generate self-edits that maximize its own improvement.
The training loop
1. **Generate candidates.** Given a new input (a passage, few-shot examples), the model generates multiple candidate self-edits.
2. **Apply updates.** Each self-edit is used to fine-tune the model via LoRA (lightweight adapter layers that do not require full-model updates).
3. **Evaluate performance.** The updated model is tested on a downstream task (answering questions about the passage, solving a held-out test case).
4. **Reward good edits.** Self-edits that led to correct answers receive positive reward. The model is trained via ReST-EM (a form of filtered behavior cloning) to generate more edits like the successful ones.
ReST-EM (Reinforced Self-Training with Expectation Maximization) is a simpler alternative to PPO or GRPO for reinforcement learning. It works like rejection sampling: generate many candidates, keep only the ones that worked, and fine-tune on those. The authors found this more stable than other RL methods for this task.
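As a sketch of that loop, with `generate`, `score`, and `sft` as hypothetical stand-ins for the model calls rather than the paper's API:

```python
def rest_em_iteration(model, contexts, generate, score, sft, n_candidates=5):
    """One ReST-EM iteration: sample self-edits, keep winners, imitate them."""
    kept = []
    for ctx in contexts:
        # E-step: sample several candidate self-edits per context
        candidates = [generate(model, ctx) for _ in range(n_candidates)]
        for edit in candidates:
            # Reward: did the LoRA-updated model improve on the downstream task?
            if score(model, ctx, edit) > 0:
                kept.append((ctx, edit))
    # M-step: supervised fine-tuning on the successful edits only
    return sft(model, kept)
```

Because the M-step is plain supervised fine-tuning on filtered samples, there is no critic or advantage estimator to tune, which is one reason this setup can be more stable than PPO-style training.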
Why not just use GPT-4 for synthetic data?
You could generate training data with a large model like GPT-4 and fine-tune on that. Many teams do exactly this. But SEAL demonstrates a counterintuitive result: a smaller model generating data for itself outperforms data from larger models.
The reason is optimization. GPT-4 generates generic "good" explanations. SEAL learns what specific format, phrasing, and structure works best for the target model's learning. It is personalized training data, optimized through trial and error.
The student analogy
The paper draws a compelling analogy to human learning. Consider a student preparing for an exam.
The raw material approach. Read the textbook and lecture notes verbatim. Hope the information sticks. This is equivalent to training directly on passage text.
The study notes approach. Rewrite the material in your own words. Create summaries, draw diagrams, make flashcards. Transform the information into a format that clicks for your brain. This is SEAL.
Different students learn differently. Some prefer visual diagrams. Others need concrete examples. Some condense information into bullet points, others expand it into narratives, and some convert it into Q&A flashcards. A tutor who knows your specific learning style will always outperform a generic "best practices" study guide.
This is exactly why SEAL beats GPT-4.1 synthetic data. GPT-4.1 generates high-quality, human-readable explanations optimized for general understanding. But SEAL discovers what specific format, structure, and phrasing works best for the target model's learning. It finds the model's "learning style" through experimentation.
GPT-4.1 optimizes for human readability. SEAL optimizes for model learnability. These are different objectives. A 7B model might learn better from atomic facts in Q&A format than from elegant prose explanations. SEAL discovers this through trial and error, while GPT-4.1 generates the same style regardless of the student.
Knowledge Incorporation: SQuAD QA Accuracy
SEAL (7B) outperforms GPT-4.1 synthetic data
The chart shows how self-edit quality improves over RL iterations. Early self-edits are generic (similar to what GPT-4 would produce). After training, the model learns to generate more specific, detailed transformations that improve its own learning.
Knowledge incorporation
The first application domain tests whether SEAL can help models retain new factual information without relying on context.
The setup
- Show the model a Wikipedia passage (from SQuAD dataset)
- Model generates "implications" of the passage (the self-edit)
- Fine-tune the model on these implications using LoRA
- Test: Can the model answer questions about the passage without seeing it again?
An implication restates information in new ways. "The Eiffel Tower is 330 meters tall" might become "The Eiffel Tower is taller than a 100-story building" or "The Eiffel Tower's height equals about 3.5 football fields." These restatements create multiple neural pathways to the same fact, improving retention.
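One way to picture the self-edit becoming training data, assuming the model emits one implication per line (the paper's exact formatting may differ):

```python
def implications_to_examples(passage_title, self_edit_text):
    """Split a generated self-edit into one training example per implication."""
    examples = []
    for line in self_edit_text.strip().splitlines():
        line = line.strip(" -*")  # drop common list markers
        if line:
            examples.append({"title": passage_title, "text": line})
    return examples

edit = """- The Eiffel Tower is taller than a 100-story building.
- The Eiffel Tower's height equals about 3.5 football fields."""
examples = implications_to_examples("Eiffel Tower", edit)
# Each restatement becomes its own fine-tuning example.
```

The per-implication split matters: fine-tuning on many small, self-contained statements is what gives the model the "multiple neural pathways" described above, rather than one long passage seen once.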
Results breakdown
| Method | Accuracy |
|---|---|
| Base model (no passage) | 32.7% |
| Fine-tune on passage only | 33.5% |
| Fine-tune + Qwen synthetic data | 39.7% |
| Fine-tune + GPT-4.1 synthetic data | 46.3% |
| SEAL | 47.0% |
Fine-tuning on the raw passage barely helps (+0.8 points). The information is there, but in a format the model struggles to absorb. Adding synthetic implications helps significantly. GPT-4.1 data provides a strong baseline. But SEAL, despite using a smaller model to generate data, achieves the best results.
Continued pretraining
SEAL also works at scale. When aggregating synthetic data from 200+ passages and doing continued pretraining (full fine-tuning, not LoRA), SEAL achieves 58.2% accuracy. The self-edit policy learned on single passages generalizes to batch scenarios.
Few-shot learning
The second application tests abstract reasoning on the ARC (Abstraction and Reasoning Corpus) benchmark.
A collection of visual-grid puzzles that require learning a transformation rule from only a few examples. Each task shows 2-3 input-output pairs demonstrating a pattern (rotation, color change, shape manipulation), then asks for the output given a new input. Success on ARC demonstrates true few-shot abstract reasoning, not memorization.
The challenge
ARC tasks show a few input-output grid transformations and ask the model to predict the output for a new input. These puzzles test pattern recognition and abstract reasoning. Standard language models struggle because each task requires learning a novel transformation rule from minimal examples.
Test-time training (TTT) adapts the model during inference, using the few-shot examples as training data. You augment the examples (rotations, flips, size changes), fine-tune the model on the augmented versions, then evaluate. The adaptation is temporary, lasting only for that specific task.
In SEAL, the inner loop performs temporary TTT adaptation to evaluate each self-edit. But the outer RL loop creates permanent improvement: the model learns which self-edits to generate, and this policy persists across all future tasks. The model is not just adapting to one task; it is learning how to adapt.
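A minimal sketch of the kind of grid augmentation TTT relies on: the same transform is applied to both sides of each demo pair so the underlying rule is preserved (function names are illustrative, not the paper's code):

```python
def rotate90(grid):
    """Rotate a 2D grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def augment_pairs(pairs):
    """Expand (input, output) demo pairs, applying the same transform to
    both sides so the transformation rule is unchanged."""
    out = list(pairs)
    for fn in (rotate90, flip_horizontal):
        out += [(fn(i), fn(o)) for i, o in pairs]
    return out

demo = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]
augmented = augment_pairs(demo)  # 1 original + 2 augmented = 3 pairs
```

With only 2-3 demos per ARC task, such rule-preserving augmentations are the main way to get enough fine-tuning signal for the temporary adaptation step.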
SEAL's approach
Instead of manually tuning augmentation strategies, SEAL learns to generate them. The self-edit is not just text; it is a structured specification of tools to invoke:
Data augmentation tools:
- `rotate(90)`, `rotate(180)`, `rotate(270)` - rotate grids
- `flip_horizontal()`, `flip_vertical()` - mirror transformations
- `transpose()` - swap rows and columns
- `resize(scale)` - change grid resolution
- `chain(aug1, aug2)` - combine multiple augmentations
Optimization parameters:
- `learning_rate` - typically 1e-4 to 1e-3
- `epochs` - number of training passes (1-10)
- `loss_tokens` - compute loss on "all" tokens or "output_only"
The model learns to output structured self-edits like `{"augmentations": ["rotate(90)", "flip_horizontal()"], "lr": 1e-4, "epochs": 5, "loss": "output_only"}`. This is not free-form text generation; it is tool use.
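Because the self-edit is structured output, it can be validated before anything is executed. A sketch, where the field names follow the example above but the allowed sets and ranges are assumptions, not the paper's:

```python
import json

# Illustrative whitelist of augmentation tool calls
ALLOWED_AUGS = {"rotate(90)", "rotate(180)", "rotate(270)",
                "flip_horizontal()", "flip_vertical()", "transpose()"}

def parse_self_edit(raw):
    """Parse and sanity-check a structured self-edit; raise on invalid specs."""
    spec = json.loads(raw)
    assert set(spec["augmentations"]) <= ALLOWED_AUGS, "unknown augmentation"
    assert 1e-5 <= spec["lr"] <= 1e-2, "learning rate out of range"
    assert 1 <= spec["epochs"] <= 10, "epoch count out of range"
    assert spec["loss"] in {"all", "output_only"}, "unknown loss mode"
    return spec

raw = ('{"augmentations": ["rotate(90)", "flip_horizontal()"], '
       '"lr": 1e-4, "epochs": 5, "loss": "output_only"}')
spec = parse_self_edit(raw)  # a malformed spec would raise instead
```

Constraining generation to a validated schema is what keeps the inner fine-tuning loop from wasting compute on unusable candidates.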
Few-Shot Learning: ARC Benchmark
SEAL achieves 72.5% where in-context learning fails completely
Results
| Method | Success Rate |
|---|---|
| In-Context Learning | 0% |
| TTT + Self-Edit (no RL) | 20% |
| SEAL | 72.5% |
| Oracle TTT (human-tuned) | 100% |
In-context learning completely fails on these tasks for the 1B model tested. Test-time training with randomly generated self-edits achieves 20%. SEAL improves this to 72.5% by learning which augmentations and settings work.
The gap to oracle (human-tuned optimal settings) shows room for improvement, but SEAL already provides massive gains over baselines without any manual tuning.
Benchmark results
Scaling behavior
SEAL's benefits appear across model sizes, though the gap narrows at larger scales where base models are already stronger.
What the model learns
Examining self-edits across RL iterations reveals how the model's strategy evolves:
Early (before RL): Generic implications, similar to what any LLM would produce. "The passage discusses X. One implication is Y."
Late (after RL): More specific, structured outputs. Atomic facts broken out individually. Question-answer format. Multiple phrasings of key information.
The model learns that breaking information into smaller, more explicit pieces improves its own retention. It discovers effective learning strategies through trial and error.
Implementation blueprint
Recommended stack
| Component | Option | Notes |
|---|---|---|
| Base model | Qwen 2.5 (7B+) | Paper uses Qwen for knowledge tasks |
| Adapter | LoRA | Fast updates, low memory |
| RL method | ReST-EM | Simpler than PPO, more stable |
| Evaluation | Task-specific | QA for knowledge, accuracy for ARC |
Core workflow
Step 1: Define the self-edit format
For knowledge incorporation:

```
Prompt: "List implications from this passage: {passage}"
Output: Multiple restatements and inferences
```

For few-shot learning:

```
Prompt: "Given these examples: {demos}, specify augmentations and hyperparameters"
Output: Tool calls for rotations, learning rate, etc.
```
Step 2: Generate candidate self-edits
Sample multiple self-edits per input (paper uses 5-15 depending on task). More candidates increase the chance of finding good ones but add compute cost.
Step 3: Apply each self-edit
```python
rewards = []
for edit in candidate_edits:
    # Fine-tune a copy of the base model with LoRA on this self-edit
    updated_model = apply_lora_sft(
        base_model,
        training_data=edit,
        lr=2e-4,
        epochs=3,
    )
    # Evaluate the updated model on the held-out task
    score = evaluate(updated_model, test_task)
    rewards.append((edit, score))
```

Step 4: Filter and train
Keep self-edits that improved performance. Fine-tune the base model to generate more edits like those.
```python
good_edits = [e for e, s in rewards if s > baseline]
model = sft(model, good_edits)
```

Key hyperparameters
| Parameter | Knowledge | Few-shot |
|---|---|---|
| LoRA rank | 8 | 8 |
| Learning rate | 2e-4 | 1e-4 |
| Epochs per edit | 3 | 5 |
| Candidates per input | 5 | 15 |
| RL iterations | 2 | 3 |
Where teams get stuck
Problem 1: Reward computation cost. Each self-edit requires a full fine-tuning and evaluation cycle. Budget 30-45 seconds per candidate. Parallelization across GPUs helps.
Problem 2: Evaluation task design. You need a measurable downstream task for each input. For passages, this means QA pairs. For few-shot, held-out test cases. Without evaluation, there is no reward signal.
Problem 3: Catastrophic forgetting. Sequential self-edits can interfere. The model learns new information but may forget old. The paper shows degradation after 5+ sequential updates.
Limitations
Catastrophic forgetting
SEAL does not solve the fundamental challenge of continual learning. When applying multiple sequential self-edits, performance on earlier tasks degrades.
The paper's experiments show the pattern clearly:
- 1-2 sequential updates: Minimal degradation, both old and new tasks work well
- 3-5 sequential updates: Noticeable decline on earliest tasks (10-15% accuracy drop)
- 5+ sequential updates: Significant forgetting, with oldest tasks dropping below baseline
Practical implication: SEAL works well for single updates or small batches of related knowledge. It is not yet suitable for continuous, long-running adaptation where the model must retain hundreds of sequential updates.
Potential mitigations include reward shaping to penalize forgetting, null-space constrained updates, or using RL instead of SFT for the inner loop.
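A harness for measuring this degradation can be sketched as follows, with `apply_self_edit` and `evaluate` as hypothetical stand-ins for the update and scoring steps:

```python
def forgetting_matrix(model, tasks, apply_self_edit, evaluate):
    """Apply self-edits sequentially; after each, re-score all tasks so far.

    rows[t][j] is accuracy on task j after the t-th update; reading down a
    column shows how a task decays as later updates accumulate.
    """
    rows = []
    for t, task in enumerate(tasks):
        model = apply_self_edit(model, task)
        rows.append([evaluate(model, tasks[j]) for j in range(t + 1)])
    return rows
```

The lower-triangular matrix this produces is the standard way to visualize catastrophic forgetting: the paper's reported 10-15% drops after 3-5 updates would show up as decay down the early columns.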
Computational overhead
SEAL's training cost is significantly higher than standard RL. Each self-edit candidate requires a full fine-tuning cycle (30-45 seconds on a single GPU). With 5 candidates per input and 50 inputs per batch, that is 250 fine-tuning runs per RL iteration. Budget accordingly.
The reward loop is expensive. Standard RL reward computation (preference models, regex matching) takes milliseconds. SEAL requires fine-tuning and evaluating a full model for each candidate, taking 30-45 seconds each.
Cost breakdown (per RL iteration):
- 50 contexts x 5 candidates = 250 fine-tuning runs
- 250 runs x 40 seconds = ~2.8 hours on single GPU
- 2-3 RL iterations = 6-9 hours total training time
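The arithmetic above as a budgeting sketch, where the 40-second figure is an assumed average within the 30-45 s range quoted earlier:

```python
contexts, candidates_per_context = 50, 5
seconds_per_run = 40  # assumed average fine-tune + eval time (30-45 s range)

runs_per_iteration = contexts * candidates_per_context
hours_per_iteration = runs_per_iteration * seconds_per_run / 3600

print(runs_per_iteration, round(hours_per_iteration, 1))
```

Scaling any of the three inputs (contexts per batch, candidates per context, seconds per run) scales total cost linearly, which is why parallelizing runs across GPUs is the first lever to pull.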
The tradeoff: high upfront training cost for potentially lower inference cost. A SEAL-trained model that internalizes knowledge does not need expensive context stuffing or retrieval at inference time. For high-volume production workloads, the training investment may pay off quickly.
This limits the number of RL iterations practical in a given compute budget. The paper uses only 2-3 iterations, suggesting that more could help but is costly.
Evaluation task dependency
Current SEAL requires paired evaluation tasks: passages with QA, few-shot with test cases. This prevents scaling to unlabeled corpora where no ground truth exists.
One potential solution: have the model generate its own evaluation questions while the context is fresh, then use those for reward computation.
Future directions
Synthetic data generation
As high-quality training data becomes increasingly scarce, SEAL points toward a solution: models that generate their own training data, optimized for their own learning. Unlike generic synthetic data from larger models, SEAL-generated data is personalized for the target model's learning patterns.
Agentic self-improvement
For agents operating over extended interactions, SEAL enables learning from experience. After completing a task, the agent could synthesize a self-edit capturing what it learned, updating weights for future tasks.
This moves toward systems that improve through use, developing expertise in domains they operate in frequently rather than relying solely on context retrieval or prompt engineering.
Reasoning integration
Modern reasoning models use RL to generate chain-of-thought traces. SEAL could complement this: the model chooses when to update weights mid-reasoning (to guide the current problem) or after (to retain insights for future problems).
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal (Massachusetts Institute of Technology)
Cite this paper
Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal (2025). Self-Adapting Language Models. arXiv preprint, 2025.