arXiv 2026 · January 17, 2026

PaperScout: Teaching AI to Search Like a Researcher

Tao Yang et al.

Categories: LLM Agents, Information Retrieval, Reinforcement Learning

Key Findings

  1. Reframes paper search as sequential decisions: the agent chooses actions (search, expand, observe) based on what it has learned so far, not a fixed script

  2. Small models can match large ones: a 4B parameter model with PSPO training matches or beats untrained 70B+ models on search tasks

  3. Solves the credit assignment problem: PSPO teaches the agent which specific step led to the breakthrough, unlike standard RL, which blurs the signal across tokens

  4. 57% recall on real queries: PaperScout achieves 57.4% recall versus 54.1% for PaSa (the previous best multi-turn baseline)

  5. Efficient tool usage: reaches higher recall with fewer tool calls compared to baselines, reducing compute and API costs

  6. Open-source implementation: code and trained models available at github.com/pty12345/PaperScout

TL;DR
  1. The problem. Academic paper search is hard because queries are vague ("papers about LLM agents for code") and require iterative refinement. Fixed workflows cannot adapt to what the search returns.

  2. The solution. PaperScout treats search as a sequential decision process. An agent observes what it has found so far and decides whether to search with new keywords, expand by following citations, or stop.

  3. The results. A 4B parameter model trained with PSPO achieves 57.4% recall on real academic queries, beating larger untrained models and prior multi-turn baselines like PaSa (54.1%).

Research overview

If you have ever searched for academic papers, you know the frustration. You start with a vague query like "recent work on LLM agents." Google Scholar returns thousands of results. You skim a few, find a relevant one, and check its references. Then you notice a citation that looks promising and follow that thread. An hour later, you have 30 tabs open and still feel like you are missing something important.

This is not a workflow you can automate with a fixed pipeline. The search process depends on what you find. A good paper might reveal new keywords. A dead end might force you to backtrack. Human researchers make these decisions constantly without thinking about them.

Current AI search systems do not work this way. They either do a single embedding lookup (semantic match) or follow a rigid sequence: rewrite query, search, expand, select. When the first search returns poor results, these systems have no way to adapt. Think of the difference between a keyword engine and a junior PhD student. The engine just fetches what you asked for. The student reads the abstract, realizes their initial keywords were wrong, and pivots to a new search strategy. PaperScout emulates the student.

PaperScout takes a different approach. It models paper search as a Partially Observable Markov Decision Process (POMDP), which is a fancy way of saying: the agent makes decisions based on incomplete information, and each decision changes what information it can see next. At each step, PaperScout chooses between three actions:

  • Search: Query the database with keywords
  • Expand: Follow citations from a promising paper
  • Observe: Check the current paper pool and decide what to do next

The agent learns when to use each action through reinforcement learning. Not standard RL, which optimizes individual tokens, but a new method called PSPO (Proximal Sequence Policy Optimization) that optimizes entire tool-call sequences.

The gap: why fixed workflows fail

Consider a realistic academic query: "Methods for reducing hallucination in retrieval-augmented generation systems, focusing on citation verification."

A semantic matching system embeds this query, finds the top-k similar papers, and returns them. If the database does not have papers with those exact phrases, you get mediocre results. No iteration, no refinement.

A fixed workflow system might:

  1. Rewrite the query into multiple sub-queries
  2. Search each sub-query
  3. Expand by following citations
  4. Select the final set

This is better, but it cannot adapt. What if the "citation verification" sub-query returns nothing useful, but "factual grounding" would have worked? The system has no mechanism to discover this.

Here is a concrete example of what PaperScout does differently. Query: "Deep learning for weather prediction." A keyword search misses papers that use the term "Neural Earth System Modeling" because that phrase never appears in the query. PaperScout finds one relevant paper, sees the unfamiliar term in its abstract, and pivots to search for that. The fixed workflow never gets the chance.

The PaperScout paper tested these approaches on RealScholarQuery, a dataset of 100 complex queries with ground-truth relevant papers. Results:

Method            Recall    F1
Google Search     30.4%     0.254
Google Scholar    24.7%     0.208
PaSa (workflow)   54.1%     0.418
PaperScout        57.4%     0.441

The 3.3 percentage point improvement in recall might seem small, but consider what it means in practice. For a literature review with 50 ground-truth relevant papers, that is 1-2 additional papers found. Over hundreds of searches, this compounds significantly.

How PaperScout works

PaperScout has three core components: an LLM backbone that generates actions, a tool interface that executes them, and a paper pool that tracks what has been found.

The decision loop

At each timestep t, the agent:

  1. Observes the current state: query, papers found so far, previous actions
  2. Generates an action: Search(keywords), Expand(paper_id), or Stop
  3. Executes the action and updates the paper pool
  4. Repeats until Stop or max steps reached

This is fundamentally different from a workflow because the agent's choice at step t depends on what it found in steps 1 through t-1. If early searches return good results, it might search more. If they return noise, it might switch to expanding citations from the one good paper it found.

The paper shows this dynamic adaptation in action. On clear, well-defined queries, the agent relies mostly on Expand (following citations from good papers). On vague, exploratory queries, it uses Search for roughly 80% of its actions, trying different keyword combinations. Hard-coded workflows cannot make this distinction.

The Search tool

The Search tool takes keywords and queries Semantic Scholar's API. Unlike simple embedding lookup, PaperScout can generate different keyword combinations based on what it has learned. If "hallucination reduction" returns nothing, it might try "factual consistency" or "grounding methods."
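The paper does not publish the exact request its Search tool issues; as a rough sketch, here is what a call to Semantic Scholar's public paper-search endpoint could look like. The helper name and the field list are illustrative choices, not part of PaperScout.

import requests

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_papers(keywords, limit=20):
    """Query Semantic Scholar's keyword-search endpoint.

    Returns a list of paper dicts with the requested fields.
    """
    response = requests.get(
        S2_SEARCH_URL,
        params={
            "query": " ".join(keywords),
            "fields": "title,abstract,year",
            "limit": limit,
        },
        timeout=30,
    )
    response.raise_for_status()
    # The endpoint wraps results in a "data" list; each entry carries
    # a "paperId" plus the requested fields.
    return response.json().get("data", [])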

The Expand tool

The Expand tool takes a paper ID and retrieves its references and citations. This is how human researchers find related work. You find one good paper and follow its connections. PaperScout learns when expansion is more valuable than another search.
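A matching sketch for expansion, again an assumption about the backend rather than the paper's code: Semantic Scholar exposes per-paper references and citations endpoints, and one plausible Expand implementation simply merges both neighbor lists.

import requests

S2_PAPER_URL = "https://api.semanticscholar.org/graph/v1/paper"

def expand_paper(paper_id, limit=50):
    """Fetch references and citations for one paper."""
    fields = "title,abstract,year"
    neighbors = []
    for edge, key in [("references", "citedPaper"),
                      ("citations", "citingPaper")]:
        response = requests.get(
            f"{S2_PAPER_URL}/{paper_id}/{edge}",
            params={"fields": fields, "limit": limit},
            timeout=30,
        )
        response.raise_for_status()
        # Each row nests the neighboring paper under "citedPaper"
        # (for references) or "citingPaper" (for citations).
        neighbors += [row[key] for row in response.json().get("data", [])]
    return neighbors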

The Observe tool

The Observe tool lets the agent inspect its current paper pool. This sounds trivial but is critical. Without explicit observation, the agent might keep searching for papers it already found. Observe provides the feedback loop that makes adaptive search possible.

PSPO: Optimizing sequences, not tokens

Standard reinforcement learning for LLMs operates at the token level. Each token gets a reward, and the model learns to maximize cumulative token rewards. This works well for single-turn generation but poorly for multi-step tool use.

Why? Credit assignment. If the agent makes a good Search call at step 3 that leads to a successful Expand at step 5, token-level RL cannot easily attribute the final success back to that Search decision. The reward signal is too diffuse.

Here is a concrete example. For the query "neural earth-system modeling for climate prediction":

  1. Search("neural earth-system modeling") → 0 results (reward = 0)
  2. Search("climate prediction deep learning") → 3 papers, 1 relevant (reward = +0.4)
  3. Expand(Paper A) → finds Paper B, a perfect match (reward = +0.6)

With token-level PPO, the +0.4 reward gets split across every token in "climate prediction deep learning." The model cannot tell which words mattered. With PSPO, the entire Search action receives the +0.4 reward as a unit. The model learns: "this keyword combination works for climate queries."

PSPO (Proximal Sequence Policy Optimization) solves this by operating at the sequence level. Instead of rewarding individual tokens, it rewards entire tool-call sequences. The key insight is that tool calls are natural units of action. A Search call is one decision, regardless of how many tokens it takes to generate.
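To make the contrast concrete, here is a minimal PyTorch-style sketch of a clipped policy loss computed per tool call rather than per token. The paper does not publish its exact objective, so this is an illustration of sequence-level credit assignment under assumed tensor shapes, not PaperScout's implementation.

import torch

def pspo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy loss where each tool call is one optimization unit.

    logp_new, logp_old: (num_calls, max_tokens) token log-probs under the
        current and behavior policies, zero-padded past each call's length.
    advantages: (num_calls,) one advantage per tool call, so every token
        of a Search or Expand decision shares the same credit signal.
    """
    # Sum token log-probs within each call: the importance ratio is
    # computed per sequence, not per token.
    ratio = torch.exp(logp_new.sum(dim=-1) - logp_old.sum(dim=-1))
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Standard PPO-style pessimistic objective, applied per call.
    return -torch.min(ratio * advantages, clipped * advantages).mean()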

If this sounds familiar, it should. PSPO shares the same philosophy as DeepSeek-R1's GRPO (Group Relative Policy Optimization): optimize at the outcome or sequence level rather than the noisy token level. This is why small models can suddenly "reason" better when trained with these methods. The optimization unit matches the decision unit.

What the agent optimizes for

The reward signal has two components:

  • Relevance Gain: Finding new papers that match the ground-truth set. Each new relevant paper adds to the reward.
  • Repetition Penalty: Spamming the same tool calls (searching identical keywords, expanding the same paper) is penalized.

This combination teaches the agent to be both effective (find good papers) and efficient (do not waste steps). The penalty prevents degenerate strategies like expanding every paper in the pool.
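A sketch of what such a step reward could look like. The weights and the exact form of the penalty are assumptions; the paper only describes the two components qualitatively.

def step_reward(new_papers, relevant_ids, found_ids, call_signature,
                past_signatures, gain_weight=1.0, repeat_penalty=0.5):
    """Reward one tool call: pay for new relevant papers, charge for repeats.

    call_signature: a hashable summary of the call, e.g.
        ("search", "climate prediction deep learning").
    """
    # Relevance gain: only papers that are relevant AND not already
    # in the pool count toward the reward.
    new_relevant = {p for p in new_papers
                    if p in relevant_ids and p not in found_ids}
    reward = gain_weight * len(new_relevant)

    # Repetition penalty: re-issuing an identical search or expanding
    # the same paper again is discouraged.
    if call_signature in past_signatures:
        reward -= repeat_penalty
    return reward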

How PSPO differs from alternatives

PPO (token-level): Optimizes each token independently. High granularity but noisy credit assignment. Struggles to connect early actions to later outcomes.

GSPO (outcome-only): Only uses the final outcome reward. Avoids noisy intermediate rewards but provides no learning signal for how to improve intermediate steps.

PSPO (sequence-level): Treats each tool-call sequence as one unit. Uses a critic to provide dense process rewards while keeping the actor's optimization at sequence level. Best of both approaches.

The paper shows PSPO beats both alternatives:

Method    Precision   Recall   F1
PPO       0.405       0.537    0.408
GSPO      0.433       0.557    0.439
PSPO      0.442       0.574    0.441

The improvements are consistent across all metrics. PSPO learns better policies because it aligns the optimization granularity with the decision granularity.

Results: efficiency and scalability

PaperScout achieves the best recall (57.4%) but also uses its tool calls more efficiently than baselines.

The efficiency curve shows recall versus number of tool calls. PaperScout reaches high recall faster than alternatives:

  • Baseline (PaSa): needs ~70 tool calls to reach 55% recall
  • PaperScout: reaches 55% recall in ~30 tool calls

That is a 57% reduction in API usage for the same recall. At 30 tool calls, PaperScout has already hit 67% of its maximum recall, while PaSa is still climbing. This matters for both cost (fewer API calls) and latency (faster results for users).

Small models can compete

A surprising finding: Qwen3-4B with PSPO training matches or beats much larger untrained models.

Model                   Parameters   Recall
Qwen3-4B (untrained)    4B           49.7%
Qwen3-Max (untrained)   70B+         56.2%
Qwen3-4B + PSPO         4B           57.4%

The 4B model with PSPO beats the 70B+ model without it. This suggests that learning when to use tools matters more than raw model size for this task. A well-trained small model outperforms a larger model that has not learned the search dynamics.

The cost implications are significant:

  • Without PSPO (70B model): ~$0.12 per query → $1,200/day for 10,000 queries
  • With PSPO (4B model): ~$0.006 per query → $60/day for the same volume

That is a 20x reduction in inference costs while achieving better recall. PSPO training is a one-time cost that pays for itself within days of production deployment.

Implementation blueprint

Component    Recommended           Alternative
LLM          Qwen3-4B              Llama 3, Mistral
Training     PSPO (paper's code)   GRPO
Search API   Semantic Scholar      OpenAlex
Framework    PyTorch               JAX

Key parameters

These are the values from the paper that produced the benchmark results:

Parameter        Value   Notes
Max tool calls   100     Per query
Temperature      0.7     For exploration
Top-p            0.9     Nucleus sampling
Batch size       8       Per GPU
Learning rate    1e-5    AdamW
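Collected into a single config object for convenience. The dataclass wrapper below is not from the paper; only the values are.

from dataclasses import dataclass

@dataclass
class PaperScoutConfig:
    max_tool_calls: int = 100    # per query
    temperature: float = 0.7     # sampling temperature, for exploration
    top_p: float = 0.9           # nucleus sampling
    batch_size: int = 8          # per GPU
    learning_rate: float = 1e-5  # AdamW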

Core data structures

The paper uses two main data structures:

Paper Pool: A set of paper IDs with metadata (title, abstract, year). Updated after each Search or Expand action.

class PaperPool:
    def __init__(self):
        self.papers = {}  # id -> metadata
 
    def add(self, paper_id, metadata):
        self.papers[paper_id] = metadata
 
    def observe(self):
        return list(self.papers.values())

Action: A typed union of Search, Expand, Observe, or Stop.

from dataclasses import dataclass

@dataclass
class SearchAction:
    keywords: list[str]

@dataclass
class ExpandAction:
    paper_id: str

@dataclass
class ObserveAction:
    pass

@dataclass
class StopAction:
    pass

The decision loop

def search_loop(query, max_steps=100):
    pool = PaperPool()
    history = []
 
    for step in range(max_steps):
        action = agent.generate(
            query=query,
            pool=pool.observe(),
            history=history
        )
 
        if isinstance(action, SearchAction):
            results = s2_api.search(
                action.keywords
            )
            for p in results:
                pool.add(p.id, p)
 
        elif isinstance(action, ExpandAction):
            refs = s2_api.refs(
                action.paper_id
            )
            for p in refs:
                pool.add(p.id, p)
 
        elif isinstance(action, ObserveAction):
            # No extra tool call needed: the pool snapshot is
            # passed back to the agent on the next step
            pass

        elif isinstance(action, StopAction):
            break
 
        history.append(action)
 
    return pool.papers

Pitfalls and edge cases

API rate limits: Semantic Scholar has rate limits. Implement exponential backoff and caching. The paper's experiments used local caching to avoid repeated API calls.
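One way to handle both issues, building on the requests-based search sketch above; the retry policy, status-code handling, and in-memory cache are illustrative choices, not the paper's setup.

import time

import requests

_cache = {}

def cached_search(keywords, max_retries=5):
    """Retry with exponential backoff on rate limits; cache by keyword set."""
    key = tuple(sorted(keywords))
    if key in _cache:
        return _cache[key]
    for attempt in range(max_retries):
        response = requests.get(
            "https://api.semanticscholar.org/graph/v1/paper/search",
            params={"query": " ".join(keywords), "fields": "title,year"},
            timeout=30,
        )
        if response.status_code == 429:   # rate limited
            time.sleep(2 ** attempt)      # exponential backoff
            continue
        response.raise_for_status()
        _cache[key] = response.json().get("data", [])
        return _cache[key]
    raise RuntimeError("Semantic Scholar rate limit: retries exhausted")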

Circular expansions: The agent might expand a paper it has already expanded. Track expanded IDs and filter them from the Expand action space.

Keyword repetition: The agent might search the same keywords repeatedly. Add the search history to the prompt so it knows what it has tried.
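A small history tracker covering both repetition pitfalls, reusing the action dataclasses from the blueprint above; the class and method names are illustrative.

class CallHistory:
    """Tracks prior tool calls so repeats can be filtered or surfaced."""

    def __init__(self):
        self.expanded_ids = set()
        self.search_queries = []

    def record(self, action):
        if isinstance(action, ExpandAction):
            self.expanded_ids.add(action.paper_id)
        elif isinstance(action, SearchAction):
            self.search_queries.append(" ".join(action.keywords))

    def is_repeat_expand(self, paper_id):
        return paper_id in self.expanded_ids

    def prompt_summary(self):
        # Injected into the agent's prompt so it can see what it has tried
        return "Previous searches: " + "; ".join(self.search_queries)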

Empty results: Some searches return nothing. The agent needs to learn that this is information, not failure. An empty result means those keywords do not work for this query.

Context window pollution: The Observe action puts the entire paper pool into the context. For long sessions with 50+ papers, this explodes token usage and can exceed context limits. Implement summarization (keep only titles and relevance scores) or a sliding window that drops the oldest papers from the observation state.
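A minimal sketch of the truncated-observation idea, assuming dict metadata in the pool; the field names and the 30-paper window are arbitrary choices, not values from the paper.

def compact_observation(pool, window=30):
    """Keep only the most recent papers and a compact view of each one."""
    # Dicts preserve insertion order, so the slice keeps the newest papers.
    papers = list(pool.papers.values())[-window:]
    return [
        {
            "title": p.get("title", ""),
            "year": p.get("year"),
            "relevance": p.get("relevance_score"),  # assumed field
        }
        for p in papers
    ]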

Limitations and future work

Domain specificity: PaperScout was tested on computer science papers via Semantic Scholar. Performance on other domains (medicine, law, humanities) is unknown. Different fields have different citation patterns.

Query complexity: The benchmark uses complex, multi-faceted queries. Simple queries ("papers by Hinton on backprop") might not benefit from the adaptive approach.

Training data requirements: PSPO training requires thousands of query-paper examples. Creating this dataset for a new domain takes significant effort.

Semantic Scholar dependency: The current implementation is tied to Semantic Scholar's API and coverage. Papers not indexed there cannot be found.

The bottom line

For ML engineers building search systems: PaperScout demonstrates that modeling search as sequential decision-making beats fixed workflows. The PSPO training method was validated on academic search, but the core idea (sequence-level optimization for tool-using agents) could transfer to other multi-step retrieval tasks.

For researchers doing literature reviews: The approach achieves 57% recall on complex queries, which is state-of-the-art. A trained 4B model can match what larger models achieve without training.

For teams considering implementation: Start with the paper's open-source code. The main engineering challenge is setting up PSPO training and managing API rate limits.

A note on the GitHub repo: The code is research-grade. It has clear structure and reproduces the paper's results, but lacks production error handling and uses local Milvus for vector storage. Expect to rewrite the retrieval layer for your specific vector database (Pinecone, Weaviate, etc.) and add proper logging, retry logic, and monitoring. The PSPO training code is solid; the inference wrapper needs work.

The key insight is not the specific architecture but the framing: search is a sequential decision process, not a pipeline. Once you model it that way, standard RL techniques become applicable, and small models can learn to search effectively.

Authors

Tao Yang, Fangxiang Feng, Jingjing Guo, Xinjie Li, Xiaojie Wang (all Beijing University of Posts and Telecommunications)

Cite this paper

Tao Yang, Fangxiang Feng, Jingjing Guo, Xinjie Li, Xiaojie Wang (2026). PaperScout: Teaching AI to Search Like a Researcher. arXiv 2026.
