DeNovoSWE: Full Repository Generation Jumps from 5.8% to 47.2% with Synthetic Training Data
AweAI Team
DeNovoSWE addresses a gap in AI code agents: most training data covers bug-fixing in existing codebases, not building complete repositories from scratch. The benchmark provides 4,818 instances, each requiring generation of a full repository from documentation. A divide-and-conquer critic-repair pipeline with difficulty-aware filtering produces high-quality training trajectories. Fine-tuning Qwen3-30B-A3B on this data raises BeyondSWE-Doc2Repo performance from 5.8% to 47.2%.
Why it matters
The paper received 21 upvotes on HuggingFace on June 11. The roughly 8× benchmark jump (5.8% to 47.2%) indicates that training-data quality for long-horizon coding tasks is a major bottleneck, and that automated, sandboxed construction of that data can close the gap. The work moves AI coding agents toward acting as full software architects rather than just patch writers.
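The size of the jump from 5.8% to 47.2% can be checked directly. The short sketch below uses only the two scores reported in the summary; the variable names are illustrative and not taken from the paper.

```python
# Scores reported in the summary (percent on BeyondSWE-Doc2Repo).
baseline = 5.8      # before fine-tuning
fine_tuned = 47.2   # after fine-tuning on DeNovoSWE data

absolute_gain = fine_tuned - baseline   # difference in percentage points
relative_gain = fine_tuned / baseline   # multiplicative improvement factor

print(f"absolute gain: {absolute_gain:.1f} points")  # absolute gain: 41.4 points
print(f"relative gain: {relative_gain:.1f}x")        # relative gain: 8.1x
```

Stated as a ratio, the improvement is about 8.1×, or 41.4 percentage points in absolute terms.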
Importance: 3/5
Notable research paper; roughly 8× benchmark improvement on full-repo generation; new training-data paradigm for long-horizon coding agents.