- **The Problem.** LLMs are static after training. They cannot update their weights when encountering new information, forcing you either to retrain entirely (expensive) or to stuff everything into the context window (limited, slow).
- **The Solution.** SEAL teaches models to generate their own fine-tuning data ("self-edits"). When shown new information, the model transforms it into a format optimized for its own learning, then updates its weights via LoRA.
LoRA adds small trainable matrices to a frozen model, enabling fast weight updates without modifying the original parameters. It reduces compute and memory, making on-the-fly fine-tuning cheap enough for real-time adaptation.
- **The Results.** A 7B model with SEAL outperforms GPT-4.1 synthetic data on knowledge tasks (47.0% vs 46.3%). On abstract reasoning, SEAL achieves 72.5% success where in-context learning scores 0%.
- **The Business Case.** Smaller models that match larger-model performance mean lower inference costs. Internalized knowledge means no retrieval latency. Adaptable models mean no constant retraining cycles.
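LoRA's role can be made concrete with a small numerical sketch: the frozen weight `W` gains a low-rank additive correction `(alpha / r) * B @ A`, and only `A` and `B` receive gradients. Dimensions here are illustrative, not the paper's configuration.

```python
import numpy as np

# Hypothetical dimensions for illustration; real LLM layers are far larger.
d_out, d_in, rank, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection (zero init)

def lora_forward(x):
    # Base output plus the low-rank correction, scaled by alpha / rank
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x), W @ x)
```

Only `rank * (d_in + d_out)` parameters train, versus `d_in * d_out` for the full matrix, which is why applying one LoRA update per self-edit is cheap enough for SEAL's inner loop.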
Research overview
If you have built applications on top of LLMs, you have hit the knowledge cutoff problem. Your model does not know about events after its training date. It cannot learn from your proprietary documents. Every query requires stuffing relevant context into the prompt, burning tokens and hoping the model pays attention to the right parts.
The standard solutions have tradeoffs. Retrieval-Augmented Generation (RAG) helps but adds latency and retrieval errors. Fine-tuning works but requires curating datasets and significant compute. In-context learning hits context limits and struggles with complex adaptation.
RAG combines a language model with an external document store: the model first retrieves relevant passages, then conditions its generation on them. It improves access to up-to-date knowledge but introduces a separate retrieval step and possible mismatches between retrieved content and user intent.
RAG vs SEAL: When to use which
| Dimension | RAG | SEAL |
|---|---|---|
| Knowledge storage | External (database) | Internal (weights) |
| Query latency | Higher (retrieval step) | Lower (direct inference) |
| Context limits | Bounded by window | Not context-bound |
| Update speed | Instant (add to DB) | Minutes (fine-tune) |
| Accuracy on updates | Retrieval errors possible | Internalized knowledge |
| Best for | Frequently changing data | Stable domain knowledge |
RAG and SEAL are not mutually exclusive. RAG excels at rapidly changing information (news, prices, inventory). SEAL excels at stable domain knowledge that should be internalized (company policies, technical documentation, domain expertise). Production systems may use both.
Neural network weights are fixed after training. The model's knowledge is baked into billions of parameters that do not change when you send it new information. In-context learning is not really learning. The model is pattern-matching within your prompt, not updating its internal knowledge.
SEAL proposes something different: teach the model to teach itself. When given new information, the model generates its own training data (called "self-edits"), applies a lightweight fine-tuning step, and emerges with updated weights. The key insight is using reinforcement learning to optimize what kind of self-edits lead to better learning outcomes.
Key results
| Task | Before SEAL | After SEAL | Best Baseline |
|---|---|---|---|
| Knowledge QA | 32.7% | 47.0% | 46.3% (GPT-4.1) |
| Few-shot ARC | 0% (ICL) | 72.5% | 20% (no RL) |
The knowledge incorporation results are striking: a 7B parameter model generating its own training data outperforms synthetic data from GPT-4.1, a much larger model. The model has learned what format of information is easiest for itself to absorb.
The static model problem
Current LLMs face a fundamental limitation: they cannot update their own knowledge. This creates three practical problems for production systems.
Knowledge cutoff. Models freeze at their training date. A model trained in January 2025 knows nothing about February 2025. For rapidly evolving domains (security, finance, current events), this staleness is critical.
Proprietary knowledge. Your company's internal documents, customer data, and domain expertise are invisible to the model. RAG can surface relevant passages, but the model cannot truly internalize this knowledge.
Adaptation inefficiency. When models encounter new task patterns, they cannot learn from experience. Every similar query starts from scratch. A model that successfully handles a complex workflow cannot "remember" the solution for next time.
SEAL Framework: Self-Adapting LLMs
Model generates training data, updates weights, improves via RL
The diagram above shows SEAL's approach. Instead of treating the model as a static artifact, SEAL adds a self-modification loop. The model sees new input, generates training data for itself, updates its weights, and improves on the downstream task. Reinforcement learning optimizes this entire loop.
How SEAL works
SEAL operates through two nested loops: an outer RL loop that improves the self-edit generation, and an inner loop that applies each self-edit to update the model.
A self-edit is the model's generated output that becomes its own training data. For knowledge tasks, this might be implications and restatements of a passage. For few-shot learning, this includes data augmentation choices and hyperparameter settings. The model learns to generate self-edits that maximize its own improvement.
The training loop
1. **Generate candidates.** Given a new input (a passage, few-shot examples), the model generates multiple candidate self-edits.
2. **Apply updates.** Each self-edit is used to fine-tune the model via LoRA (lightweight adapter layers that do not require full-model updates).
3. **Evaluate performance.** The updated model is tested on a downstream task (answering questions about the passage, solving a held-out test case).
4. **Reward good edits.** Self-edits that led to correct answers receive positive reward. The model is trained via ReST-EM (a form of filtered behavior cloning) to generate more edits like the successful ones.
ReST-EM (Reinforced Self-Training with Expectation Maximization) is a simpler alternative to PPO or GRPO for reinforcement learning. It works like rejection sampling: generate many candidates, keep only the ones that worked, and fine-tune on those. The authors found this more stable than other RL methods for this task.
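As a sketch of that loop, with `generate`, `score`, and `sft` as hypothetical stand-ins for the model calls rather than the paper's API:

```python
def rest_em_iteration(model, contexts, generate, score, sft, n_candidates=5):
    """One ReST-EM iteration: sample self-edits, keep winners, imitate them."""
    kept = []
    for ctx in contexts:
        # E-step: sample several candidate self-edits per context
        candidates = [generate(model, ctx) for _ in range(n_candidates)]
        for edit in candidates:
            # Reward: did the LoRA-updated model improve on the downstream task?
            if score(model, ctx, edit) > 0:
                kept.append((ctx, edit))
    # M-step: supervised fine-tuning on the successful edits only
    return sft(model, kept)
```

Because the M-step is plain supervised fine-tuning on filtered samples, there is no critic or advantage estimator to tune, which is one reason this setup can be more stable than PPO-style training.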
Why not just use GPT-4 for synthetic data?
You could generate training data with a large model like GPT-4 and fine-tune on that. Many teams do exactly this. But SEAL demonstrates a counterintuitive result: a smaller model generating data for itself outperforms data from larger models.
The reason is optimization. GPT-4 generates generic "good" explanations. SEAL learns what specific format, phrasing, and structure works best for the target model's learning. It is personalized training data, optimized through trial and error.
The student analogy
The paper draws a compelling analogy to human learning. Consider a student preparing for an exam.
The raw material approach. Read the textbook and lecture notes verbatim. Hope the information sticks. This is equivalent to training directly on passage text.
The study notes approach. Rewrite the material in your own words. Create summaries, draw diagrams, make flashcards. Transform the information into a format that clicks for your brain. This is SEAL.
Different students learn differently. Some prefer visual diagrams. Others need concrete examples. Some condense information into bullet points, others expand it into narratives, and some convert it into Q&A flashcards. A tutor who knows your specific learning style will always outperform a generic "best practices" study guide.
This is exactly why SEAL beats GPT-4.1 synthetic data. GPT-4.1 generates high-quality, human-readable explanations optimized for general understanding. But SEAL discovers what specific format, structure, and phrasing works best for the target model's learning. It finds the model's "learning style" through experimentation.
GPT-4.1 optimizes for human readability. SEAL optimizes for model learnability. These are different objectives. A 7B model might learn better from atomic facts in Q&A format than from elegant prose explanations. SEAL discovers this through trial and error, while GPT-4.1 generates the same style regardless of the student.
Knowledge Incorporation: SQuAD QA Accuracy
SEAL (7B) outperforms GPT-4.1 synthetic data
The chart shows how self-edit quality improves over RL iterations. Early self-edits are generic (similar to what GPT-4 would produce). After training, the model learns to generate more specific, detailed transformations that improve its own learning.
Knowledge incorporation
The first application domain tests whether SEAL can help models retain new factual information without relying on context.
The setup
- Show the model a Wikipedia passage (from SQuAD dataset)
- Model generates "implications" of the passage (the self-edit)
- Fine-tune the model on these implications using LoRA
- Test: Can the model answer questions about the passage without seeing it again?
An implication restates information in new ways. "The Eiffel Tower is 330 meters tall" might become "The Eiffel Tower is taller than a 100-story building" or "The Eiffel Tower's height equals about 3.5 football fields." These restatements create multiple neural pathways to the same fact, improving retention.
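One way to picture the self-edit becoming training data, assuming the model emits one implication per line (the paper's exact formatting may differ):

```python
def implications_to_examples(passage_title, self_edit_text):
    """Split a generated self-edit into one training example per implication."""
    examples = []
    for line in self_edit_text.strip().splitlines():
        line = line.strip(" -*")  # drop common list markers
        if line:
            examples.append({"title": passage_title, "text": line})
    return examples

edit = """- The Eiffel Tower is taller than a 100-story building.
- The Eiffel Tower's height equals about 3.5 football fields."""
examples = implications_to_examples("Eiffel Tower", edit)
# Each restatement becomes its own fine-tuning example.
```

The per-implication split matters: fine-tuning on many small, self-contained statements is what gives the model the "multiple neural pathways" described above, rather than one long passage seen once.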
Results breakdown
| Method | Accuracy |
|---|---|
| Base model (no passage) | 32.7% |
| Fine-tune on passage only | 33.5% |
| Fine-tune + Qwen synthetic data | 39.7% |
| Fine-tune + GPT-4.1 synthetic data | 46.3% |
| SEAL | 47.0% |
Fine-tuning on the raw passage barely helps (+0.8 points). The information is there, but in a format the model struggles to absorb. Adding synthetic implications helps significantly. GPT-4.1 data provides a strong baseline. But SEAL, despite using a smaller model to generate data, achieves the best results.
Continued pretraining
SEAL also works at scale. When aggregating synthetic data from 200+ passages and doing continued pretraining (full fine-tuning, not LoRA), SEAL achieves 58.2% accuracy. The self-edit policy learned on single passages generalizes to batch scenarios.
Few-shot learning
The second application tests abstract reasoning on the ARC (Abstraction and Reasoning Corpus) benchmark.
A collection of visual-grid puzzles that require learning a transformation rule from only a few examples. Each task shows 2-3 input-output pairs demonstrating a pattern (rotation, color change, shape manipulation), then asks for the output given a new input. Success on ARC demonstrates true few-shot abstract reasoning, not memorization.
The challenge
ARC tasks show a few input-output grid transformations and ask the model to predict the output for a new input. These puzzles test pattern recognition and abstract reasoning. Standard language models struggle because each task requires learning a novel transformation rule from minimal examples.
Test-time training (TTT) adapts the model during inference, using the few-shot examples as training data. You augment the examples (rotations, flips, size changes), fine-tune the model on the augmented versions, then evaluate. The adaptation is temporary, lasting only for that specific task.
In SEAL, the inner loop performs temporary TTT adaptation to evaluate each self-edit. But the outer RL loop creates permanent improvement: the model learns which self-edits to generate, and this policy persists across all future tasks. The model is not just adapting to one task; it is learning how to adapt.
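A minimal sketch of the kind of grid augmentation TTT relies on: the same transform is applied to both sides of each demo pair so the underlying rule is preserved (function names are illustrative, not the paper's code):

```python
def rotate90(grid):
    """Rotate a 2D grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def augment_pairs(pairs):
    """Expand (input, output) demo pairs, applying the same transform to
    both sides so the transformation rule is unchanged."""
    out = list(pairs)
    for fn in (rotate90, flip_horizontal):
        out += [(fn(i), fn(o)) for i, o in pairs]
    return out

demo = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]
augmented = augment_pairs(demo)  # 1 original + 2 augmented = 3 pairs
```

With only 2-3 demos per ARC task, such rule-preserving augmentations are the main way to get enough fine-tuning signal for the temporary adaptation step.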
SEAL's approach
Instead of manually tuning augmentation strategies, SEAL learns to generate them. The self-edit is not just text; it is a structured specification of tools to invoke:
Data augmentation tools:
- `rotate(90)`, `rotate(180)`, `rotate(270)` - rotate grids
- `flip_horizontal()`, `flip_vertical()` - mirror transformations
- `transpose()` - swap rows and columns
- `resize(scale)` - change grid resolution
- `chain(aug1, aug2)` - combine multiple augmentations
Optimization parameters:
- `learning_rate` - typically 1e-4 to 1e-3
- `epochs` - number of training passes (1-10)
- `loss_tokens` - compute loss on "all" tokens or "output_only"
The model learns to output structured self-edits like `{"augmentations": ["rotate(90)", "flip_horizontal()"], "lr": 1e-4, "epochs": 5, "loss": "output_only"}`. This is not free-form text generation; it is tool use.
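Because the self-edit is structured output, it can be validated before anything is executed. A sketch, where the field names follow the example above but the allowed sets and ranges are assumptions, not the paper's:

```python
import json

# Illustrative whitelist of augmentation tool calls
ALLOWED_AUGS = {"rotate(90)", "rotate(180)", "rotate(270)",
                "flip_horizontal()", "flip_vertical()", "transpose()"}

def parse_self_edit(raw):
    """Parse and sanity-check a structured self-edit; raise on invalid specs."""
    spec = json.loads(raw)
    assert set(spec["augmentations"]) <= ALLOWED_AUGS, "unknown augmentation"
    assert 1e-5 <= spec["lr"] <= 1e-2, "learning rate out of range"
    assert 1 <= spec["epochs"] <= 10, "epoch count out of range"
    assert spec["loss"] in {"all", "output_only"}, "unknown loss mode"
    return spec

raw = ('{"augmentations": ["rotate(90)", "flip_horizontal()"], '
       '"lr": 1e-4, "epochs": 5, "loss": "output_only"}')
spec = parse_self_edit(raw)  # a malformed spec would raise instead
```

Constraining generation to a validated schema is what keeps the inner fine-tuning loop from wasting compute on unusable candidates.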
Few-Shot Learning: ARC Benchmark
SEAL achieves 72.5% where in-context learning fails completely
Results
| Method | Success Rate |
|---|---|
| In-Context Learning | 0% |
| TTT + Self-Edit (no RL) | 20% |
| SEAL | 72.5% |
| Oracle TTT (human-tuned) | 100% |
In-context learning completely fails on these tasks for the 1B model tested. Test-time training with randomly generated self-edits achieves 20%. SEAL improves this to 72.5% by learning which augmentations and settings work.
The gap to oracle (human-tuned optimal settings) shows room for improvement, but SEAL already provides massive gains over baselines without any manual tuning.
Benchmark results
Scaling behavior
SEAL's benefits appear across model sizes, though the gap narrows at larger scales where base models are already stronger.
What the model learns
Examining self-edits across RL iterations reveals how the model's strategy evolves:
Early (before RL): Generic implications, similar to what any LLM would produce. "The passage discusses X. One implication is Y."
Late (after RL): More specific, structured outputs. Atomic facts broken out individually. Question-answer format. Multiple phrasings of key information.
The model learns that breaking information into smaller, more explicit pieces improves its own retention. It discovers effective learning strategies through trial and error.
Implementation blueprint
Recommended stack
| Component | Option | Notes |
|---|---|---|
| Base model | Qwen 2.5 (7B+) | Paper uses Qwen for knowledge tasks |
| Adapter | LoRA | Fast updates, low memory |
| RL method | ReST-EM | Simpler than PPO, more stable |
| Evaluation | Task-specific | QA for knowledge, accuracy for ARC |
Core workflow
Step 1: Define the self-edit format
For knowledge incorporation:

```
Prompt: "List implications from this passage: {passage}"
Output: Multiple restatements and inferences
```

For few-shot learning:

```
Prompt: "Given these examples: {demos}, specify augmentations and hyperparameters"
Output: Tool calls for rotations, learning rate, etc.
```
Step 2: Generate candidate self-edits
Sample multiple self-edits per input (paper uses 5-15 depending on task). More candidates increase the chance of finding good ones but add compute cost.
Step 3: Apply each self-edit
```python
rewards = []
for edit in candidate_edits:
    # Fine-tune a copy of the base model with LoRA on this self-edit
    updated_model = apply_lora_sft(
        base_model,
        training_data=edit,
        lr=2e-4,
        epochs=3,
    )
    # Evaluate the updated model on the held-out task
    score = evaluate(updated_model, test_task)
    rewards.append((edit, score))
```

Step 4: Filter and train
Keep self-edits that improved performance. Fine-tune the base model to generate more edits like those.
```python
good_edits = [e for e, s in rewards if s > baseline]
model = sft(model, good_edits)
```

Key hyperparameters
| Parameter | Knowledge | Few-shot |
|---|---|---|
| LoRA rank | 8 | 8 |
| Learning rate | 2e-4 | 1e-4 |
| Epochs per edit | 3 | 5 |
| Candidates per input | 5 | 15 |
| RL iterations | 2 | 3 |
Where teams get stuck
Problem 1: Reward computation cost. Each self-edit requires a full fine-tuning and evaluation cycle. Budget 30-45 seconds per candidate. Parallelization across GPUs helps.
Problem 2: Evaluation task design. You need a measurable downstream task for each input. For passages, this means QA pairs. For few-shot, held-out test cases. Without evaluation, there is no reward signal.
Problem 3: Catastrophic forgetting. Sequential self-edits can interfere. The model learns new information but may forget old. The paper shows degradation after 5+ sequential updates.
Limitations
Catastrophic forgetting
SEAL does not solve the fundamental challenge of continual learning. When applying multiple sequential self-edits, performance on earlier tasks degrades.
The paper's experiments show the pattern clearly:
- 1-2 sequential updates: Minimal degradation, both old and new tasks work well
- 3-5 sequential updates: Noticeable decline on earliest tasks (10-15% accuracy drop)
- 5+ sequential updates: Significant forgetting, with oldest tasks dropping below baseline
Practical implication: SEAL works well for single updates or small batches of related knowledge. It is not yet suitable for continuous, long-running adaptation where the model must retain hundreds of sequential updates.
Potential mitigations include reward shaping to penalize forgetting, null-space constrained updates, or using RL instead of SFT for the inner loop.
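A harness for measuring this degradation can be sketched as follows, with `apply_self_edit` and `evaluate` as hypothetical stand-ins for the update and scoring steps:

```python
def forgetting_matrix(model, tasks, apply_self_edit, evaluate):
    """Apply self-edits sequentially; after each, re-score all tasks so far.

    rows[t][j] is accuracy on task j after the t-th update; reading down a
    column shows how a task decays as later updates accumulate.
    """
    rows = []
    for t, task in enumerate(tasks):
        model = apply_self_edit(model, task)
        rows.append([evaluate(model, tasks[j]) for j in range(t + 1)])
    return rows
```

The lower-triangular matrix this produces is the standard way to visualize catastrophic forgetting: the paper's reported 10-15% drops after 3-5 updates would show up as decay down the early columns.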
Computational overhead
SEAL's training cost is significantly higher than standard RL. Each self-edit candidate requires a full fine-tuning cycle (30-45 seconds on a single GPU). With 5 candidates per input and 50 inputs per batch, that is 250 fine-tuning runs per RL iteration. Budget accordingly.
The reward loop is expensive. Standard RL reward computation (preference models, regex matching) takes milliseconds. SEAL requires fine-tuning and evaluating a full model for each candidate, taking 30-45 seconds each.
Cost breakdown (per RL iteration):
- 50 contexts x 5 candidates = 250 fine-tuning runs
- 250 runs x 40 seconds = ~2.8 hours on single GPU
- 2-3 RL iterations = 6-9 hours total training time
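The arithmetic above as a budgeting sketch, where the 40-second figure is an assumed average within the 30-45 s range quoted earlier:

```python
contexts, candidates_per_context = 50, 5
seconds_per_run = 40  # assumed average fine-tune + eval time (30-45 s range)

runs_per_iteration = contexts * candidates_per_context
hours_per_iteration = runs_per_iteration * seconds_per_run / 3600

print(runs_per_iteration, round(hours_per_iteration, 1))
```

Scaling any of the three inputs (contexts per batch, candidates per context, seconds per run) scales total cost linearly, which is why parallelizing runs across GPUs is the first lever to pull.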
The tradeoff: high upfront training cost for potentially lower inference cost. A SEAL-trained model that internalizes knowledge does not need expensive context stuffing or retrieval at inference time. For high-volume production workloads, the training investment may pay off quickly.
This limits the number of RL iterations practical in a given compute budget. The paper uses only 2-3 iterations, suggesting that more could help but is costly.
Evaluation task dependency
Current SEAL requires paired evaluation tasks: passages with QA, few-shot with test cases. This prevents scaling to unlabeled corpora where no ground truth exists.
One potential solution: have the model generate its own evaluation questions while the context is fresh, then use those for reward computation.
Future directions
Synthetic data generation
As high-quality training data becomes increasingly scarce, SEAL points toward a solution: models that generate their own training data, optimized for their own learning. Unlike generic synthetic data from larger models, SEAL-generated data is personalized for the target model's learning patterns.
Agentic self-improvement
For agents operating over extended interactions, SEAL enables learning from experience. After completing a task, the agent could synthesize a self-edit capturing what it learned, updating weights for future tasks.
This moves toward systems that improve through use, developing expertise in domains they operate in frequently rather than relying solely on context retrieval or prompt engineering.
Reasoning integration
Modern reasoning models use RL to generate chain-of-thought traces. SEAL could complement this: the model chooses when to update weights mid-reasoning (to guide the current problem) or after (to retain insights for future problems).
Original paper: arXiv ・ PDF ・ HTML
Code: GitHub
Authors: Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal (Massachusetts Institute of Technology)
Cite this paper
Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, Pulkit Agrawal (2025). Self-Adapting Language Models. arXiv preprint, 2025.