- The Contrarian Take. Bigger models are not always better. For the repetitive, specialized tasks that AI agents perform (tool calls, JSON generation, parsing), sub-10B models often match or exceed frontier LLMs.
- The Economics. Running a 7B model costs 10-30x less than a 175B model. When an agent makes thousands of calls per session, this difference determines whether your system is profitable or bleeding money.
- The Recommendation. Use hybrid architectures: small specialized models for routine agent tasks (40-70% of calls), large models reserved for complex reasoning. NVIDIA provides a 6-step conversion algorithm.
Research Overview
The AI industry operates on an assumption: bigger models are better. GPT-4 outperforms GPT-3. Claude 3.5 beats Claude 3. So when building AI agents, the instinct is to use the most capable model available.
NVIDIA's research team challenges this assumption with a simple observation: AI agents don't need general intelligence. They need specialized competence.
An AI agent is a system that can take actions autonomously. Unlike a chatbot that just responds, an agent can call APIs, execute code, browse the web, and use tools to accomplish goals. Examples include coding assistants that run tests, customer service bots that process refunds, and research agents that search databases.
Consider what an agent actually does. It parses a user request into structured commands. It generates JSON to call an API. It summarizes a response. It decides which tool to use next. These are narrow, well-defined tasks repeated thousands of times. They don't require the broad knowledge of a 175B parameter model trained on the entire internet.
Asking a massive frontier model to generate a simple JSON object for a tool call is like using a supercar to deliver a pizza. It works, but it's expensive, inefficient, and overkill for the job.
The paper argues that small language models (SLMs) under 10 billion parameters are:
- Sufficiently powerful for most agent tasks
- Operationally more suitable due to faster iteration and easier alignment
- Economically superior with 10-30x lower inference costs
The authors tested three leading agent frameworks and found that 40-70% of large model calls could be replaced with specialized small models without performance loss.
The contrarian argument
The conventional wisdom goes like this: scaling laws show that larger models perform better. Therefore, use the largest model you can afford.
The paper identifies three flaws in this reasoning:
Flaw 1: Scaling laws assume identical architectures
Studies comparing 7B to 70B models use the same architecture at different scales. But modern small models use size-optimized designs. NVIDIA's Hymba-1.5B achieves 3.5x higher throughput than comparable transformers and outperforms 13B models on instruction-following. The architecture matters as much as the parameter count.
Flaw 2: Scaling laws measure general performance
Benchmarks test broad capabilities. But agents perform narrow tasks. A model that scores lower on general reasoning can still excel at parsing JSON or generating SQL. Salesforce's xLAM-2-8B achieves state-of-the-art tool-calling performance, surpassing GPT-4o despite being 20x smaller.
Flaw 3: Fine-tuning changes everything
Scaling studies use base models. In practice, you fine-tune for your specific task. With about 100 labeled examples, a well-tuned 7B model reaches parity with a 70B model on specialized tasks. The gap that exists in general benchmarks closes when you specialize.
[Figure: Small Models vs Large Models on Agent Tasks. Sub-10B models matching or exceeding frontier LLMs on specialized tasks.]
The key insight: capability is the binding constraint, not parameter count. If a small model can do the job, the extra parameters in a large model are wasted compute.
Performance evidence
The paper documents specific cases where small models match or exceed large ones on agent-relevant tasks.
Tool-calling and structured output
The most striking results come from specialized tool-calling models:
| Model | Size | Performance |
|---|---|---|
| xLAM-2-8B | 8B | State-of-the-art tool calling, beats GPT-4o and Claude 3.5 |
| Nemotron-H | 2-9B | Matches 30B dense models at 1/10th the FLOPs |
| Fine-tuned OPT-350M | 350M | 77.55% on ToolBench vs ChatGPT's 26% (3x better) |
The xLAM and Nemotron results are particularly significant because these are current-generation models optimized specifically for agent workloads, not general benchmarks.
Reasoning and code generation
| Model | Size | Comparison |
|---|---|---|
| Phi-2 | 2.7B | Matches 30B models on reasoning, runs 15x faster |
| Phi-3 small | 7B | Matches 70B models on language understanding |
| DeepSeek-R1-Distill-Qwen | 7B | Outperforms Claude-3.5-Sonnet on reasoning benchmarks |
| Hymba-1.5B | 1.5B | 3.5x higher throughput, outperforms 13B models |
| SmolLM2 | 125M-1.7B | Matches 70B models from 2 years prior on instruction following |
| RETRO-7.5B | 7.5B | Matches GPT-3 (175B) with 25x fewer parameters |
Inference enhancement techniques
The paper highlights that small models can exceed larger ones through smart inference strategies:
| Technique | Model | Result |
|---|---|---|
| Tool use | Toolformer (6.7B) | Outperforms GPT-3 (175B) via API calls |
| Structured reasoning | 1-3B models | Rival 30B+ LLMs on math problems |
| Test-time compute | SLMs generally | Significantly more affordable scaling than LLMs |
Toolformer shows that a 6.7B model calling calculators and search APIs beats a 175B model that tries to do everything in its head. Similarly, chain-of-thought and structured reasoning let tiny models punch above their weight. For agents that already use tools, this advantage compounds.
Real-world agent frameworks
The authors tested three open-source agent frameworks to measure how many LLM calls could be replaced:
| Framework | Purpose | SLM-Replaceable |
|---|---|---|
| MetaGPT | Multi-agent software development | ~60% |
| Open Operator | Workflow automation | ~40% |
| Cradle | GUI computer control | ~70% |
The pattern is consistent: routine operations (parsing, template generation, structured output) work with small models. Complex reasoning (architectural decisions, debugging, error recovery) benefits from large models.
Economic analysis
The cost difference is substantial.
[Figure: The Economics of Agent Architectures. Cost per 1,000 agent sessions comparing LLM-only vs hybrid approaches.]
Inference costs
Running a 7B model costs roughly 10-30x less than running a 175B model, depending on optimization and query length. At scale, this determines viability.
Consider an agent that makes 1,000 API calls per user session. At $0.01 per call with a large model, that's $10 per session. At $0.001 per call with a small model, it's $1. That gap is the difference between a sustainable business and one that burns money.
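The arithmetic is easy to sanity-check. Below is a minimal back-of-the-envelope cost model using the illustrative per-call prices above; the prices and call volume are assumptions for the example, not measured figures.

```python
# Back-of-the-envelope session cost, assuming 1,000 calls per session and the
# illustrative per-call prices from the text ($0.01 LLM, $0.001 SLM).
CALLS_PER_SESSION = 1_000
LLM_COST_PER_CALL = 0.01
SLM_COST_PER_CALL = 0.001

def session_cost(slm_share: float) -> float:
    """Cost of one session when `slm_share` of calls are routed to the SLM."""
    slm_calls = CALLS_PER_SESSION * slm_share
    llm_calls = CALLS_PER_SESSION - slm_calls
    return slm_calls * SLM_COST_PER_CALL + llm_calls * LLM_COST_PER_CALL

print(session_cost(0.0))   # 10.0 -> LLM-only
print(session_cost(0.6))   # 4.6  -> hybrid, 60% of calls on the SLM
print(session_cost(1.0))   # 1.0  -> SLM-only
```

At a 60% SLM share, the per-session cost drops from $10 to $4.60, the same 54% reduction discussed in the business section below.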
Fine-tuning agility
Large models require weeks and significant GPU clusters to fine-tune. Small models can be specialized overnight.
The key enabler is LoRA (Low-Rank Adaptation). Instead of updating all 7 billion parameters, LoRA trains small adapter layers (typically 0.1-1% of model size). A 7B model can be fine-tuned on a single A100 GPU in 2-4 hours. The same process for a 70B model requires a cluster and takes days to weeks.
This enables a faster iteration cycle. You can ship a fix today, not next quarter. When your agent starts failing on a new edge case, you can have a patched model in production by tomorrow morning.
Edge deployment
Small models run on consumer hardware. NVIDIA's ChatRTX demonstrates local SLM execution on consumer GPUs. This means:
- Lower latency (no round-trip to cloud)
- Better privacy (data stays on device)
- Reduced infrastructure costs
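For a concrete sense of what local execution looks like, the sketch below calls a locally served SLM through Ollama's REST API (Ollama appears as a prototyping option in the stack table below). The model tag and prompt are placeholder assumptions; use whichever model you have pulled locally.

```python
import requests

# Minimal local-inference sketch against Ollama's REST API (default port 11434).
def local_generate(prompt: str, model: str = "phi3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]  # full completion when stream=False

print(local_generate("Produce a JSON tool call for: schedule a meeting at 3pm tomorrow"))
```

No cloud round-trip, no data leaving the machine, and no per-call API fee.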
The market context
The agentic AI sector received over $2 billion in startup funding in 2024, with projections reaching $200 billion by 2034. Cloud infrastructure investment hit $57 billion in 2024.
Most of this assumes LLM-centric deployment. The authors argue this creates a massive efficiency opportunity for teams willing to use hybrid architectures.
When to use what
Not all agent tasks are equal. The paper provides guidance on task categorization.
[Figure: Hybrid Agent Architecture. Route routine tasks to the SLM, complex reasoning to the LLM.]
Task routing decision matrix
| Task Type | Recommended | Why |
|---|---|---|
| Command parsing | SLM | Narrow vocabulary, predictable structure |
| JSON/API payload generation | SLM | Templated output, schema-constrained |
| Tool selection | SLM | Fixed set of options, pattern matching |
| Short summarization | SLM | Routine compression, no novel reasoning |
| Template completion | SLM | Fill-in-the-blank, highly repetitive |
| Multi-step planning | LLM | Requires exploring solution space |
| Error diagnosis | LLM | Unexpected inputs, needs broad knowledge |
| Architectural decisions | LLM | Trade-off reasoning, context-dependent |
| Creative problem-solving | LLM | No clear template to follow |
| Context-heavy reasoning | LLM | Requires synthesizing broad information |
The pattern: if the task has predictable structure and limited variation, route to SLM. If it requires flexible reasoning about novel situations, route to LLM.
The hybrid pattern
The recommended architecture:
- Router layer: Classifies incoming requests by complexity
- SLM pool: Handles routine operations (60-70% of traffic)
- LLM fallback: Processes complex cases (30-40% of traffic)
- Feedback loop: Logs cases where SLM failed, uses them to improve routing
The goal is not to replace large models entirely, but to use the right tool for each job.
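How the router layer classifies requests is left open; one lightweight option is to train a small classifier over embeddings of logged tasks. A minimal sketch, assuming a labeled sample of logs (the embedding model, label names, and helper class are illustrative, not part of the paper's recipe):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works

def train_router_classifier(tasks: list[str], labels: list[str]) -> LogisticRegression:
    """Fit a linear classifier on task embeddings; labels are e.g. 'routine'/'complex'."""
    X = encoder.encode(tasks)
    return LogisticRegression(max_iter=1000).fit(X, labels)

class EmbeddingClassifier:
    """Adapter exposing the predict(task) interface used by the router sketch below."""
    def __init__(self, clf: LogisticRegression):
        self.clf = clf

    def predict(self, task: str) -> str:
        return self.clf.predict(encoder.encode([task]))[0]
```

A heuristic router (keyword rules, task-type tags from your framework) is a fine starting point; swap in a learned classifier once you have labeled routing data.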
Implementation blueprint
You've decided to stop paying 10x more than necessary for your agent's API calls. Now what?
The migration from LLM-only to hybrid architecture follows a pattern: instrument your current system, find the repetitive tasks, train small models to handle them, and route intelligently. The paper outlines six steps. Here's how to execute them.
Recommended tech stack
| Component | Recommended | Alternative |
|---|---|---|
| Base SLM | Phi-3 (7B), SmolLM2 (1.7B) | Llama 3.2 (3B), Nemotron-H |
| Tool-calling SLM | xLAM-2-8B | Fine-tuned Mistral 7B |
| Fine-tuning | LoRA via PEFT library | QLoRA for memory constraints |
| Serving | vLLM, TensorRT-LLM | Ollama for prototyping |
| Embedding Model | text-embedding-3-small | nomic-embed-text |
| Clustering | HDBSCAN | k-means |
| Orchestration | LangGraph, CrewAI | Custom router |
Core architecture pattern
The fundamental pattern is simple: classify each request, send routine tasks to cheap local models, and reserve expensive API calls for genuinely hard problems.
```python
class HybridAgentRouter:
    """Route requests to SLM or LLM based on task complexity."""

    def __init__(self, slm_client, llm_client, classifier, data_collector):
        self.slm = slm_client                  # e.g., local Phi-3 via vLLM
        self.llm = llm_client                  # e.g., GPT-4o API
        self.classifier = classifier           # trained routing model
        self.data_collector = data_collector   # log store for future fine-tuning

    def route(self, task: str, context: dict) -> str:
        # Classify task complexity
        complexity = self.classifier.predict(task)
        if complexity == "routine":
            # 60-70% of calls: parsing, JSON, templates
            return self.slm.generate(task, context)
        else:
            # 30-40% of calls: reasoning, planning, debugging
            return self.llm.generate(task, context)

    def log_for_improvement(self, task, response, success):
        # Every call is training data for future SLM fine-tuning
        self.data_collector.log(task, response, success)
```
The 6-step conversion process
Step 1: Instrument and collect
Before optimizing anything, you need to know what your agent actually does. Deploy logging on every LLM call. Capture:
- Input prompts
- Output responses
- Tool call contents
- Latency metrics
- Success/failure indicators
```python
import time
from datetime import datetime
from functools import wraps

# Example logging decorator; `secure_logger` stands in for your encrypted,
# access-controlled log store.
def log_llm_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        response = func(*args, **kwargs)
        log_entry = {
            "prompt": args[0],
            "response": response,
            "latency_ms": (time.time() - start) * 1000,
            "timestamp": datetime.utcnow().isoformat(),
        }
        # Encrypt and store with role-based access
        secure_logger.log(log_entry)
        return response
    return wrapper
```
Use encrypted pipelines with role-based access. Anonymize before storage.
Step 2: Curate and filter
You need 10,000-100,000 examples to fine-tune effectively. Clean the data aggressively: remove PII, strip sensitive information, and paraphrase application-specific content. Your training data will eventually define your model's behavior, so garbage in means garbage out.
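A minimal curation sketch, assuming the log entries produced in Step 1; the regexes and thresholds are illustrative placeholders, not a complete PII solution (use a dedicated scrubbing tool in production):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace obvious PII patterns with placeholder tokens."""
    return PHONE.sub("<PHONE>", EMAIL.sub("<EMAIL>", text))

def curate(logged_calls: list[dict], min_len: int = 10) -> list[dict]:
    """Drop failed, trivial, and duplicate calls; scrub what remains."""
    seen, examples = set(), []
    for call in logged_calls:
        prompt, response = scrub(call["prompt"]), scrub(call["response"])
        if len(prompt) < min_len or not call.get("success", True):
            continue  # skip trivial or failed calls
        if (prompt, response) in seen:
            continue  # skip exact duplicates
        seen.add((prompt, response))
        examples.append({"prompt": prompt, "response": response})
    return examples
```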
Step 3: Cluster tasks
Your logged prompts will naturally group into task types. Apply unsupervised clustering to find them. Common clusters include:
- Intent recognition
- Data extraction
- Document summarization
- Code generation for tools
- Response formatting
Use an embedding model (like OpenAI's text-embedding-3 or an open-source alternative) to convert each logged prompt into a vector. Apply k-means or HDBSCAN clustering to group similar prompts. Inspect the clusters manually to identify distinct task types. Each cluster becomes a candidate for a specialized SLM. The goal is to find groups where prompts share structure but differ in details (e.g., "Extract the price from this invoice" vs "Extract the date from this receipt").
```python
from sklearn.cluster import HDBSCAN

# `embedding_model` is any sentence-embedding model; `prompts` are the logged
# prompts collected in Step 1.
embeddings = embedding_model.encode(prompts)

# Cluster to find task types
clusterer = HDBSCAN(min_cluster_size=50, min_samples=10)
labels = clusterer.fit_predict(embeddings)

# Each cluster = candidate for specialized SLM
for cluster_id in set(labels):
    if cluster_id == -1:  # noise points left unclustered by HDBSCAN
        continue
    cluster_prompts = [p for p, l in zip(prompts, labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {len(cluster_prompts)} examples")
    print(f"  Sample: {cluster_prompts[0][:100]}...")
```
Step 4: Select SLM candidates
Pick your base models. Ignore general benchmarks like MMLU. Instead, evaluate on criteria that actually matter for your deployment:
- Task-relevant performance (test on your actual prompts)
- Licensing terms (can you use it commercially?)
- Memory footprint (will it fit on your target hardware?)
- Instruction-following quality (does it respect your format requirements?)
Phi-3, Nemotron, SmolLM2, and Llama 3.2 are solid starting points. For tool-calling specifically, xLAM-2-8B is hard to beat.
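One way to make that evaluation concrete is to score each candidate on a held-out sample of your own logged prompts. A minimal sketch for a JSON-generation cluster (the model names, `held_out_prompts`, and the validity metric are assumptions; use whatever acceptance check matches your task):

```python
import json
from transformers import pipeline

# Candidate model IDs are examples; `held_out_prompts` is a sample of logged
# prompts from Step 1, defined elsewhere.
CANDIDATES = [
    "microsoft/Phi-3-mini-4k-instruct",
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",
]

def json_validity_rate(generator, prompts) -> float:
    """Fraction of completions that parse as JSON -- a proxy for schema-following."""
    valid = 0
    for prompt in prompts:
        out = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
        completion = out[len(prompt):]  # strip the echoed prompt
        try:
            json.loads(completion.strip())
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(prompts)

for name in CANDIDATES:
    generator = pipeline("text-generation", model=name)
    print(f"{name}: {json_validity_rate(generator, held_out_prompts):.2%} valid JSON")
```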
Step 5: Specialize and fine-tune
For each task cluster, fine-tune a specialized model:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer

# Load base SLM
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Configure LoRA (trains ~0.1-1% of parameters)
lora_config = LoraConfig(
    r=16,               # rank
    lora_alpha=32,      # scaling
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture; Phi-3 uses a fused "qkv_proj"
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Fine-tune on cluster-specific data (`cluster_dataset` is the tokenized
# dataset for one task cluster from Step 3)
trainer = Trainer(
    model=model,
    train_dataset=cluster_dataset,
    # ... training args
)
trainer.train()  # Hours, not weeks
```
A useful trick: knowledge distillation. Run your existing LLM on the training prompts, capture its outputs, and train the SLM to mimic them. You're essentially compressing the LLM's behavior into a smaller, cheaper package.
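A minimal distillation-data sketch, assuming an OpenAI-compatible client as the teacher; the teacher model name and prompt source are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # teacher LLM endpoint (API key from the environment)

def build_distillation_set(prompts, teacher_model: str = "gpt-4o") -> list[dict]:
    """Answer each logged prompt with the teacher LLM; pairs become SFT data for the SLM."""
    pairs = []
    for prompt in prompts:
        completion = client.chat.completions.create(
            model=teacher_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic teacher outputs
        )
        pairs.append({"prompt": prompt, "response": completion.choices[0].message.content})
    return pairs
```

Train the SLM on these pairs with the same LoRA setup shown above.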
Step 6: Deploy and iterate
Ship it. Replace LLM calls with your new SLM endpoints. But don't walk away. Monitor performance closely in the first weeks. Log cases where the SLM fails and the system falls back to LLM. Those failures become your next training set.
The best hybrid systems improve continuously. Every production call is potential training data. Every failure teaches you where your routing logic or SLM capability falls short.
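For a JSON-generation cluster, the fallback can be as simple as validating the SLM's output and escalating on failure. A minimal sketch reusing the `HybridAgentRouter` from earlier (the JSON-validity check is an assumption; substitute whatever acceptance test fits your task):

```python
import json

def generate_with_fallback(router, task: str, context: dict) -> str:
    """Try the SLM first; escalate to the LLM when the output fails validation."""
    response = router.slm.generate(task, context)
    try:
        json.loads(response)  # acceptance test for a JSON-generation task
        router.log_for_improvement(task, response, success=True)
        return response
    except json.JSONDecodeError:
        # Failed SLM outputs become training data for the next fine-tuning round
        router.log_for_improvement(task, response, success=False)
        return router.llm.generate(task, context)
```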
Key parameters
When fine-tuning, these numbers are a reasonable starting point:
| Parameter | Recommended | Notes |
|---|---|---|
| Training examples | 10K-100K | Per task cluster |
| LoRA rank (r) | 8-32 | Higher = more capacity, more compute |
| Batch size | 4-8 | Adjust for GPU memory |
| Learning rate | 1e-4 to 2e-4 | Standard for LoRA |
| Epochs | 3-5 | Monitor for overfitting |
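Translated into Hugging Face `TrainingArguments` for the LoRA run in Step 5, those starting points might look like the sketch below; the output directory and logging cadence are placeholder choices:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="slm-tool-calling-lora",  # placeholder path
    per_device_train_batch_size=4,       # 4-8, adjust for GPU memory
    learning_rate=2e-4,                  # 1e-4 to 2e-4 is standard for LoRA
    num_train_epochs=3,                  # 3-5; watch eval loss for overfitting
    logging_steps=50,
    save_strategy="epoch",
)
```

Pass `training_args` to the `Trainer` in place of the `# ... training args` comment above.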
Pitfalls and gotchas
Teams who've done this before will tell you: the technical implementation is the easy part. The hard part is avoiding these traps:
- Don't optimize for general benchmarks: SLMs that score lower on MMLU can still beat GPT-4o on your specific tool-calling task. Test on your actual workload.
- Logging overhead matters: At thousands of calls per session, logging latency adds up. Use async logging and batch writes.
- Cluster quality is critical: Bad clustering = bad specialized models. Spend time validating clusters manually before fine-tuning.
- Router errors cascade: If your router misclassifies a complex task as routine, the SLM fails and the user sees the error. Build in fallback mechanisms.
- Monitor SLM drift: As your agent evolves, task distributions change. Re-cluster and re-train periodically.
- Licensing traps: Some models (e.g., Llama 3) carry commercial use restrictions. Verify before deploying.
Case studies
MetaGPT: Multi-agent software development
MetaGPT uses multiple agents (Product Manager, Architect, Engineer, QA) to collaboratively build software.
LLM usage: Role-based actions, prompt templates, RAG retrieval, dynamic intelligence
SLM opportunity: ~60% of calls are suitable for specialized models. Role-specific prompting, template completion, and retrieval augmentation work with fine-tuned small models.
LLM-required: Complex architectural reasoning and adaptive debugging benefit from generalist capabilities.
Open Operator: Workflow automation
Open Operator automates business workflows through API orchestration.
LLM usage: Natural language parsing, decision-making, content generation
SLM opportunity: ~40% of calls. Command parsing and template-based message generation are predictable enough for small models.
LLM-required: Multi-step reasoning chains and context-dependent decisions need larger models.
Cradle: GUI computer control
Cradle controls computers by understanding screenshots and simulating interactions.
LLM usage: Interface interpretation, task execution planning, error handling
SLM opportunity: ~70% of calls. Repetitive workflows and pre-learned UI sequences work with specialized small models.
LLM-required: Dynamic GUI adaptation and unstructured error resolution require flexible reasoning.
Business implications
Cost reduction at scale
For organizations running agent systems at scale, the savings are significant. Moving 60% of calls from $0.01 to $0.001 reduces costs by 54%. At millions of daily calls, this translates to substantial operational savings.
Faster iteration cycles
Small model fine-tuning takes hours, not weeks. This enables:
- Rapid bug fixes
- Quick adaptation to new requirements
- Continuous improvement from production data
Edge deployment opportunities
On-device inference opens new use cases:
- Privacy-sensitive applications where data can't leave the device
- Low-latency requirements where cloud round-trips are too slow
- Offline operation in disconnected environments
Competitive advantage
Organizations that adopt hybrid architectures gain cost advantages over competitors stuck on LLM-only stacks. As the agentic AI market grows toward $200 billion, efficiency becomes a differentiator.
Limitations
The paper is honest about constraints, identifying both inherent challenges and industry barriers.
Operational complexity
Managing multiple specialized models is harder than using one general model. You need:
- Routing logic to direct requests
- Multiple model serving infrastructure
- Separate fine-tuning pipelines
- Monitoring across model types
Data requirements
Effective fine-tuning requires 10,000+ examples. Organizations without significant usage data may struggle to specialize effectively.
Upfront investment
Existing infrastructure favors LLM-centric deployment. Switching requires new tooling, training, and potentially retraining teams.
Evolving definitions
What counts as "small" changes over time. Today's 10B model is large by 2022 standards. The specific recommendations will shift as hardware and models evolve.
Barriers to adoption
The paper identifies three structural barriers slowing SLM adoption:
B1: Infrastructure Lock-in
Large investments in centralized LLM inference infrastructure (estimated at $57 billion in 2024) create industry inertia. Organizations expect returns within 3-4 years, making them reluctant to shift architectures mid-investment.
B2: Benchmark Mismatch
SLM development still follows generalist benchmarks (MMLU, HumanEval) rather than agentic-specific metrics. The paper notes: "If one focuses solely on benchmarks measuring the agentic utility of agents, the studied SLMs easily outperform larger models." The industry lacks standardized benchmarks for tool-calling, JSON generation, and routing accuracy.
B3: Visibility Gap
SLMs "do not receive the level of marketing intensity and press attention LLMs do, despite their better suitability in many industrial scenarios." Frontier model releases dominate headlines; incremental SLM improvements go unnoticed.
The authors argue these are practical hurdles, not fundamental flaws, and will diminish as the market recognizes the economic advantages.
NVIDIA's perspective
The authors work for NVIDIA, which sells hardware for both training and inference at all scales. While the analysis is rigorous, readers should note the source.
Key takeaways
- Agent tasks are narrow. AI agents perform specialized, repetitive tasks. They don't need general intelligence.
- Small models can match large ones. On tool-calling, structured output, and templated generation, fine-tuned sub-10B models match or beat frontier LLMs.
- Economics favor hybrid architectures. A 10-30x cost reduction on 60-70% of calls creates significant savings at scale.
- Start large, specialize small. Use large models to prototype and identify patterns. Replace high-volume tasks with specialized small models.
- The infrastructure is ready. Tools like NVIDIA NeMo, LoRA fine-tuning, and optimized serving frameworks make hybrid deployment practical today.
The bigger-is-better assumption served the industry well during the capability race. For production agent systems where economics matter, the answer is more nuanced: use the smallest model that does the job.
Original paper: arXiv ・ PDF ・ HTML
NVIDIA Research Page: research.nvidia.com/labs/lpr/slm-agents
Cite this paper
Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov (2025). Small language models are the future of agentic AI. NVIDIA Research 2025.