GrepSeek: Training Search Agents for Direct Corpus Interaction via Shell Commands (93 HF Upvotes)
University of Massachusetts Amherst
GrepSeek (arXiv 2605.29307) trains LLM-based search agents to interact with text corpora through executable shell commands (grep, file reads, lightweight scripts) rather than pre-built vector indices, a paradigm called Direct Corpus Interaction (DCI). A two-stage training pipeline combines cold-start trajectory generation with Group Relative Policy Optimization (GRPO), and a sharded-parallel execution engine provides up to 7.6× speedup. The authors report top performance on seven open-domain QA benchmarks.
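To make the sharded-parallel idea concrete, here is a minimal Python sketch, not the paper's engine. It assumes the corpus is an in-memory list of (doc_id, text) pairs; the names grep_shard and sharded_grep are hypothetical. It uses threads only to show the split, scan, and merge structure. A real engine would more likely launch shell grep processes over on-disk shards, and this sketch makes no claim about reproducing the reported speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import re

def grep_shard(shard, pattern):
    """Return (doc_id, line) pairs from one shard whose line matches the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(doc_id, line)
            for doc_id, text in shard
            for line in text.splitlines()
            if rx.search(line)]

def sharded_grep(corpus, pattern, num_shards=4):
    """Split the corpus into shards, scan them concurrently, and merge the hits."""
    # Round-robin split; some shards may be empty for a small corpus.
    shards = [corpus[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = pool.map(lambda shard: grep_shard(shard, pattern), shards)
    # Sort so the merged output does not depend on the shard layout.
    return sorted(hit for hits in results for hit in hits)

# Toy corpus (illustrative data only).
corpus = [
    ("doc1", "Marie Curie won the Nobel Prize in Physics in 1903."),
    ("doc2", "The Eiffel Tower is in Paris."),
    ("doc3", "Curie later won the Nobel Prize in Chemistry in 1911."),
]
hits = sharded_grep(corpus, r"nobel prize")
```

Each shard is scanned independently, so the merge step is the only point of coordination; that independence is what makes this kind of search easy to parallelize.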
Why it matters
DCI removes the semantic-index bottleneck entirely: agents can perform exact lexical matching, conjunctive sparse-clue lookup, and multi-step hypothesis refinement directly on raw corpora, capabilities that embedding-based RAG systems struggle with. The paper received 93 upvotes on Hugging Face Daily Papers for June 1.
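As an illustration of conjunctive sparse clue lookup followed by refinement, the following sketch keeps only documents that contain every clue and then narrows the result by adding a clue. It is an assumption-laden toy: conjunctive_lookup is a hypothetical helper, matching is plain case-insensitive substring search, and the corpus is invented. In a shell, a rough analogue would be chaining grep -l calls so that each one filters the file list produced by the previous one.

```python
def conjunctive_lookup(corpus, clues):
    """Return ids of documents whose text contains every clue (case-insensitive)."""
    lowered = [clue.lower() for clue in clues]
    return [doc_id
            for doc_id, text in corpus
            if all(clue in text.lower() for clue in lowered)]

# Toy corpus (illustrative data only).
corpus = [
    ("doc1", "Marie Curie won the Nobel Prize in Physics in 1903."),
    ("doc2", "The Eiffel Tower is in Paris."),
    ("doc3", "Curie later won the Nobel Prize in Chemistry in 1911."),
]

# Step 1: a broad query with a single clue.
broad = conjunctive_lookup(corpus, ["Nobel Prize"])

# Step 2: refine the hypothesis by requiring a second exact clue.
refined = conjunctive_lookup(corpus, ["Nobel Prize", "Chemistry"])
```

The refinement step only adds an exact string constraint, so the second result is always a subset of the first.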
Importance: 3/5
93 HF upvotes (ranked second on June 1); novel DCI paradigm that bypasses vector index limitations for open-domain QA.