Arbor: Generalist Autonomous ML Research via Hypothesis-Tree Refinement

NLPIR Lab

Arbor is a framework for fully autonomous ML research. An LLM-based coordinator manages a persistent Hypothesis Tree that links hypotheses, experimental artifacts, and learned insights. Executor agents test individual hypotheses in isolated sandboxes, so knowledge accumulates across many experimental rounds instead of being discarded after each run. On MLE-Bench Lite, Arbor reaches an 86.36% Any Medal score, and the paper reports over 2.5× the relative held-out gains of both Codex and Claude Code under identical compute budgets.
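To make the coordinator/executor split concrete, here is a minimal sketch of how a hypothesis tree of this kind could be represented. It is an illustration only, not Arbor's actual implementation: the names HypothesisNode, pending, run_round, and toy_executor are hypothetical, and the toy executor stands in for a sandboxed agent run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class HypothesisNode:
    """One hypothesis plus what was learned from testing it (hypothetical schema)."""
    statement: str
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: List["HypothesisNode"] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)  # e.g. paths to logs or code
    insights: List[str] = field(default_factory=list)   # lessons kept across rounds
    score: Optional[float] = None                       # None means "not tested yet"

    def add_child(self, statement: str) -> "HypothesisNode":
        """Refine this hypothesis into a more specific one."""
        child = HypothesisNode(statement, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> List[str]:
        """Collect insights from the root down to this node, oldest first."""
        chain = []
        node: Optional[HypothesisNode] = self
        while node is not None:
            chain.append(node)
            node = node.parent
        collected: List[str] = []
        for ancestor in reversed(chain):
            collected.extend(ancestor.insights)
        return collected


def pending(root: HypothesisNode) -> List[HypothesisNode]:
    """Return untested nodes in depth-first, left-to-right order."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        if node.score is None:
            found.append(node)
        stack.extend(reversed(node.children))
    return found


# An executor takes (statement, prior insights) and returns (score, artifact, insight).
Executor = Callable[[str, List[str]], Tuple[float, str, str]]


def run_round(root: HypothesisNode, executor: Executor) -> int:
    """Coordinator step: test every pending hypothesis and record the results."""
    tested = 0
    for node in pending(root):
        score, artifact, insight = executor(node.statement, node.inherited_insights())
        node.score = score
        node.artifacts.append(artifact)
        node.insights.append(insight)
        tested += 1
    return tested


def toy_executor(statement: str, prior: List[str]) -> Tuple[float, str, str]:
    """Stand-in for a sandboxed run; the score just grows with available insights."""
    return len(prior) / 10, f"run-log for: {statement}", f"tested '{statement}'"
```

The point of the structure is that each executor call receives the insights accumulated along its branch, so a later round starts from what earlier rounds learned rather than from scratch.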

Why it matters

30 upvotes on Hugging Face (June 11). The work is a concrete step toward AI systems that conduct sustained, compounding scientific research. The reported advantage of over 2.5× against Codex and Claude Code on a standardized ML engineering benchmark is a strong empirical signal for autonomous research agents.

Importance: 3/5

Notable research paper; Hypothesis Tree framework for autonomous research; over 2.5× the relative held-out gains of Codex and Claude Code on MLE-Bench Lite.

agents reasoning autonomous-research rl software-engineering

Sources

official arXiv:2606.11926 — Toward Generalist Autonomous Research via Hypothesis-Tree Refinement