AI Research Papers

Educational analyses of cutting-edge AI research. Explore key findings, data visualizations, and practical implications from the latest advances in machine learning, NLP, and artificial intelligence.


Technical Deep-Dive arXiv 2026

Deep Delta Learning: Rethinking Residual Connections with Geometric Transformations

Yifan Zhang Jan 2026
Deep Learning · Neural Architecture · Machine Learning

DDL replaces the standard additive skip connection with a learnable Delta Operator (a rank-1 Householder transformation) that dynamically interpolates between identity, projection, and reflection. This enables networks to model complex, non-monotonic dynamics while preserving training stability.

Delta Operator generalizes identity shortcuts via rank-1 Householder transformations
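The geometry behind that finding can be checked directly. A minimal NumPy sketch (my own illustration, not the paper's code) of a rank-1 Householder-style operator with a gate β: β = 0 recovers the identity shortcut, β = 1 projects out the v direction, and β = 2 is a full Householder reflection.

```python
import numpy as np

def delta_op(x, v, beta):
    """Apply H(beta) = I - beta * v v^T to x, with v normalized to unit length.

    beta = 0 -> identity, beta = 1 -> projection onto v's orthogonal
    complement, beta = 2 -> Householder reflection across that hyperplane.
    """
    v = v / np.linalg.norm(v)          # ensure v is a unit vector
    return x - beta * v * (v @ x)      # rank-1 update, O(d) per application

x = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

print(delta_op(x, v, 0.0))  # identity:   [ 3.  4.]
print(delta_op(x, v, 1.0))  # projection: [ 0.  4.]
print(delta_op(x, v, 2.0))  # reflection: [-3.  4.]
```

In a network, β would be produced per-layer (or per-token) by a small learnable gate, which is what lets the block interpolate smoothly between the three regimes.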
Important Finding arXiv 2025

Recursive Language Models: Processing Unlimited Context Through Code

Alex L. Zhang, Tim Kraska et al. Dec 2025
Natural Language Processing · Large Language Models · Inference Scaling

LLMs have fixed context windows, but real-world documents can be millions of tokens. Recursive Language Models (RLMs) let models treat their prompts as programmable objects, recursively calling themselves over snippets to handle inputs 100x beyond their context limits while outperforming long-context baselines.

Breaks the context window barrier: handles documents 100x longer than the model's native limit without any architectural changes
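The recursive pattern can be sketched in a few lines. Note this is a simplification: actual RLMs let the model write code in a REPL to inspect and slice its own prompt, whereas this sketch hard-codes a recursive map-reduce over chunks, with `llm` as a stand-in for a model call.

```python
def recursive_answer(llm, query, context, limit=4000, chunk=2000):
    """Answer `query` over a context of any length by recursive reduction.

    `llm(prompt) -> str` stands in for a model call. If the context fits
    the window (`limit` chars here; tokens in practice), call the model
    once; otherwise summarize each chunk recursively, then recurse on
    the concatenated summaries. Terminates as long as the model's
    outputs are shorter than its inputs.
    """
    if len(context) <= limit:
        return llm(f"Context:\n{context}\n\nQuestion: {query}")
    parts = [context[i:i + chunk] for i in range(0, len(context), chunk)]
    notes = [recursive_answer(llm, query, p, limit, chunk) for p in parts]
    return recursive_answer(llm, query, "\n".join(notes), limit, chunk)

# Toy model: "answers" by keeping the first 200 chars of its prompt.
toy_llm = lambda prompt: prompt[:200]
out = recursive_answer(toy_llm, "What is this about?", "x" * 100_000)
print(len(out) <= 200)  # True: unbounded input reduced to one bounded call
```

The point of the structure is that no single model call ever sees more than `limit` characters, so the input length is unbounded while the per-call cost stays fixed.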
Important Finding arXiv 2025

GenZ: Using Foundation Models as Feature Generators

Marko Jojic, Nebojsa Jojic Dec 2025
Machine Learning · Foundation Models · Statistical Modeling

Foundation models struggle at direct prediction tasks like pricing or recommendations. GenZ shows how to use LLMs as semantic feature extractors within traditional statistical models, achieving 3.2x better house price predictions and cold-start recommendations equivalent to 4,000 user ratings.

Stop asking LLMs to guess numbers. Use them to answer yes/no questions, then let a simple regression model do the math. You get interpretable features and accurate predictions.
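That recipe can be sketched end to end. Everything here is hypothetical scaffolding (the question set, the keyword "oracle" standing in for real LLM calls, and the toy prices); the shape of the pipeline is the point: yes/no answers become 0/1 features, and ordinary least squares does the prediction.

```python
import numpy as np

# Hypothetical question set; in GenZ-style use, each answer would come
# from a foundation-model call about the listing's text.
QUESTIONS = ["Has a garden?", "Recently renovated?", "Near a school?"]

def featurize(listing, ask):
    """Turn yes/no answers into a 0/1 feature vector (with a bias term)."""
    return np.array([1.0] + [1.0 if ask(listing, q) else 0.0 for q in QUESTIONS])

def fit_prices(listings, prices, ask):
    """Ordinary least squares over the LLM-derived binary features."""
    X = np.stack([featurize(l, ask) for l in listings])
    w, *_ = np.linalg.lstsq(X, np.array(prices), rcond=None)
    return w

# Toy oracle: keyword match on the question's last word (stands in for an LLM).
stub_ask = lambda listing, q: q.split()[-1].rstrip("?").lower() in listing.lower()

listings = ["big garden, renovated kitchen", "near a school",
            "garden near a school", "plain flat"]
prices = [180.0, 120.0, 170.0, 100.0]
w = fit_prices(listings, prices, stub_ask)
print(round(float(featurize("sunny garden", stub_ask) @ w)))  # 150
```

Because the features are named questions, the fitted weights are directly interpretable: each coefficient is the price premium the model attributes to a "yes" answer.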
Important Finding arXiv 2025

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution

Hau-Shiang Shiu, Chin-Yang Lin et al. Dec 2025
Computer Vision · Generative Models · Video Processing

A diffusion-based framework that achieves 130x faster video super-resolution than previous methods, processing 720p frames in 0.328 seconds while improving perceptual quality. The first practical diffusion approach for real-time video enhancement.

130x faster than previous diffusion-based video super-resolution methods
Important Finding arXiv 2025

End-to-End Test-Time Training for Long Context

Arnuv Tandon, Karan Dalal et al. Dec 2025
Large Language Models · Machine Learning · Continual Learning

A novel approach that treats long-context language modeling as a continual learning problem, achieving 2.7× faster inference than full attention at 128K tokens while maintaining comparable performance, but with a critical weakness on retrieval tasks.

Treats long documents as a 'learning problem': instead of building complex attention patterns, the model 'studies' the context and stores knowledge in its weights
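The "store knowledge in weights" idea can be illustrated with a toy fast-weight memory, not the paper's architecture: at inference time, take gradient steps on a small weight matrix over pairs drawn from the context, then answer from those weights alone rather than by attending over the context.

```python
import numpy as np

def study(context_ids, vocab, lr=0.5, epochs=50):
    """'Study' the context: SGD on a fast-weight linear associative memory.

    W maps a one-hot token to a prediction of the next token; each pass
    takes a squared-error gradient step per (token, next-token) pair.
    """
    W = np.zeros((vocab, vocab))
    for _ in range(epochs):
        for a, b in zip(context_ids, context_ids[1:]):
            x = np.eye(vocab)[a]
            err = np.eye(vocab)[b] - W @ x     # prediction error
            W += lr * np.outer(err, x)         # gradient step on 0.5*||err||^2
    return W

context = [0, 1, 2, 0, 1, 2, 0, 1]             # repeating pattern 0 -> 1 -> 2
W = study(context, vocab=3)
print(int(np.argmax(W @ np.eye(3)[2])))        # after token 2 comes token 0
```

The contrast with attention: querying `W` is O(vocab) regardless of context length, while full attention pays a cost that grows with every token of context retained.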
Important Finding arXiv 2025

SAGA: Autonomous Goal-Evolving Agents for Scientific Discovery

Yuanqi Du, Botao Yu et al. Dec 2025
AI Agents · Scientific Discovery · Machine Learning

SAGA automates objective evolution in AI-driven science: instead of optimizing fixed objectives, it evolves the objectives themselves. This bi-level framework where LLMs propose, implement, and refine scientific goals has achieved strong results in antibiotic design, materials discovery, DNA engineering, and chemical process optimization.

Automating objective evolution (not just solution optimization) is the key to scientific discovery
Emerging Research arXiv 2025

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Tanghui Jia, Dongyu Yan et al. Dec 2025
Computer Vision · Generative Models · 3D Graphics

A two-stage 3D diffusion framework that generates detailed geometry by first creating coarse structures then refining them with voxel-based methods. Trained entirely on public datasets, it matches proprietary approaches in quality.

Two-stage pipeline separates coarse structure generation from fine detail synthesis
Important Finding arXiv 2025

Black-Box On-Policy Distillation: Learning from Closed-Source LLMs

Tianzhu Ye, Li Dong et al. Nov 2025
Large Language Models · Knowledge Distillation · Machine Learning

Generative Adversarial Distillation (GAD) enables training smaller models from proprietary LLMs like GPT-5 using only text outputs. By framing distillation as an adversarial game between student and discriminator, GAD achieves what was previously impossible: a 14B parameter model matching its closed-source teacher.

14B student model matches GPT-5-Chat teacher on LMSYS-Chat benchmark
Important Finding arXiv 2025

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen et al. Nov 2025
Computer Vision · 3D Vision · Deep Learning

A simplified approach to visual geometry that predicts spatially consistent depth from any number of images. Using just a plain transformer and single prediction target, DA3 outperforms prior methods by 44% on camera pose accuracy while matching Depth Anything 2's quality.

44.3% improvement in camera pose accuracy over prior state-of-the-art
Important Finding arXiv 2025

RAG-Anything: Unified Multimodal Retrieval for Real-World Documents

Zirui Guo, Xubin Ren et al. Oct 2025
Information Retrieval · Multimodal AI · Natural Language Processing

Traditional RAG systems only handle text, but real documents contain images, tables, and equations. RAG-Anything introduces a dual-graph architecture that treats all content types as interconnected knowledge entities, achieving 13+ percentage point improvements over baselines on long documents.

Solves the 'long document' problem: maintains high accuracy on 200+ page reports where standard RAG systems fail
Important Finding SSRN 2025

The Fold Principle: A Universal Pattern from Cosmos to Cognition

Jonas Jakob Gebendorfer Oct 2025
Artificial Intelligence · Complexity Theory · Neuroscience

A new theory proposing that intelligence emerges when systems 'hold tension' between contradictory constraints instead of collapsing to simple answers. This framework explains why Chain-of-Thought works, why jailbreaks happen, and how to build better AI.

Intelligence is not computational power. It's the capacity to hold productive tension between competing ideas without immediately collapsing to one answer.
Technical Deep-Dive arXiv 2025

Verbalized Sampling: Unlocking LLM Creativity with a Simple Prompt

Jiayi Zhang, Simon Yu et al. Oct 2025
Prompting · LLM Alignment · Machine Learning

Stanford researchers discovered that adding ~20 words to any prompt can boost LLM creativity by 1.6-2x. Verbalized Sampling bypasses RLHF's mode collapse by asking models to generate probability distributions instead of single answers, recovering the diverse capabilities lost during alignment.

RLHF causes mode collapse due to typicality bias in human preference data, not algorithmic limitations
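The intervention itself is just a prompt transformation, so it is easy to sketch. The wording below is illustrative, not the paper's verbatim template: the essential move is asking for several candidates with verbalized probabilities instead of one modal answer, then sampling from that returned distribution.

```python
def verbalize(prompt, k=5):
    """Wrap a prompt in the Verbalized Sampling style.

    Instead of eliciting the single most typical response (the mode
    that RLHF-tuned models collapse to), ask the model to enumerate k
    candidate responses with estimated probabilities; the caller can
    then sample from that distribution to recover diversity.
    """
    return (
        f"Generate {k} responses to the following prompt, each with its "
        f"estimated probability.\n\nPrompt: {prompt}"
    )

print(verbalize("Tell me a joke about coffee."))
```

Because the change lives entirely in the prompt, it works on closed models with no retraining, which is what makes the 1.6-2x diversity gain notable.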
Critical Warning arXiv 2025

On the Theoretical Limitations of Embedding-Based Retrieval

Orion Weller, Michael Boratko et al. Aug 2025
Information Retrieval · Machine Learning Theory · Natural Language Processing

A theoretical and empirical analysis proving that vector embeddings face fundamental dimensional constraints when handling diverse retrieval tasks, with state-of-the-art models achieving only 8-19% recall on the new LIMIT benchmark.

Vector search has hard mathematical limits that no amount of training can fix
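A toy version of the dimensionality argument (my illustration, not the paper's LIMIT benchmark): with d-dimensional embeddings and dot-product scoring, only a limited set of document rankings is realizable. For d = 1 the score is q · d_i, so every query ranks documents by their embedding value ascending or descending, and no training can reach the other orderings.

```python
from itertools import permutations

docs = [0.1, 0.5, 0.9]                      # any three 1-D document embeddings
reachable = set()
for x in range(-30, 31):                    # sweep many 1-D query embeddings
    q = x / 10
    order = tuple(sorted(range(3), key=lambda i: -q * docs[i]))  # rank by score
    reachable.add(order)
print(len(reachable), "of", len(list(permutations(range(3)))))  # 2 of 6
```

The paper generalizes this: for any fixed embedding dimension there exist combinations of relevant-document sets that no assignment of vectors can retrieve, which is why LIMIT drives state-of-the-art embedders down to 8-19% recall.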
Critical Warning ACL 2025

Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

Mohsen Fayyaz, Ali Modarressi et al. Jul 2025
Information Retrieval · Natural Language Processing · Machine Learning

A comprehensive study revealing critical vulnerabilities in dense retrieval models used in RAG systems, showing how biases for shorter documents, early positions, and literal matches cause models to ignore factual evidence.

AI search models lazily prefer short, repetitive text over detailed, correct answers
Important Finding NVIDIA Research 2025

Small language models are the future of agentic AI

Peter Belcak, Greg Heinrich et al. Jun 2025
Language Models · AI Agents · Efficient AI

NVIDIA researchers argue that sub-10B parameter models are better suited for AI agent tasks than frontier LLMs. With 10-30x lower inference costs and comparable performance on tool-calling, the economics of agentic AI may favor specialized small models over general-purpose giants.

Challenges the 'bigger is better' dogma: sub-10B models match or beat GPT-4o on specialized agent tasks
Important Finding arXiv 2025

DeepSeek-R1 & V3: The Open-Source Reasoning Revolution

DeepSeek-AI Jan 2025
Large Language Models · Reasoning Models · Machine Learning

DeepSeek shook the AI industry by matching OpenAI o1's reasoning capabilities at a fraction of the cost—then open-sourcing everything. DeepSeek-V3's efficient MoE architecture and R1's pure reinforcement learning approach demonstrate that frontier AI doesn't require frontier budgets.

DeepSeek-R1 matches OpenAI o1-1217 on reasoning benchmarks using pure RL
Important Finding arXiv 2025

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Flood Sung Jan 2025
Large Language Models · Reinforcement Learning · Reasoning

Moonshot AI's breakthrough in reasoning through simplicity. By removing complex RL components like Monte Carlo tree search and value functions, Kimi k1.5 matches OpenAI's o1 on math and coding benchmarks while enabling efficient short-response models that outperform GPT-4o by up to 550% on AIME math problems.

Matches OpenAI o1 on AIME (77.5) and MATH-500 (96.2) without MCTS or value functions
Important Finding ICLR 2025

Learning Dynamics of LLM Finetuning: Why Your Model Hallucinates and Forgets

Yi Ren, Danica J. Sutherland Jan 2025
Machine Learning · Natural Language Processing · LLM Training

ICLR 2025 Outstanding Paper reveals why finetuning makes LLMs hallucinate and why training too long actually hurts performance. A three-term decomposition framework explains the hidden mechanics of SFT, DPO, and RLHF.

Explains why finetuning causes hallucinations: the model borrows phrases from one answer to respond to unrelated questions
Important Finding arXiv 2024

The AI Scientist: Fully Automated Scientific Discovery

Chris Lu*, Cong Lu* et al. Aug 2024
AI Agents · Scientific Discovery · Machine Learning

The AI Scientist is a comprehensive framework enabling LLMs to autonomously conduct scientific research: generating ideas, writing code, running experiments, creating visualizations, writing papers, and simulating peer review. All for under $15 per paper.

First comprehensive framework for fully automated open-ended scientific discovery