Agentic Transformers Provably Learn to Search via Reinforcement Learning

A theoretical study showing that transformer-based agents trained via policy gradient on a stochastic k-ary tree environment provably develop a depth-first search mechanism, with one attention head tracking prior actions and another detecting failures and triggering backtracking. Policies trained on shallow trees generalize to deeper ones without additional training.
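As a point of reference for the behavior described above, here is a minimal hand-written sketch of depth-first search with backtracking on a k-ary tree. It is not the learned transformer policy and not the paper's environment: the single hidden goal leaf, the rule that failure is observed only at a leaf, and the names `make_goal_path` and `dfs_search` are all assumptions made for illustration. The `path` list plays the role of tracking prior actions, and the two backtracking branches play the role of failure detection.

```python
import random


def make_goal_path(k, depth, rng):
    """Sample a hidden goal leaf as a list of child indices (assumed setup)."""
    return [rng.randrange(k) for _ in range(depth)]


def dfs_search(k, depth, goal_path):
    """Depth-first search with backtracking.

    Returns (path, steps): the action sequence reaching the goal leaf
    (or None if no leaf matches) and the number of downward moves taken.
    """
    path = []          # actions taken from the root to the current node
    next_child = [0]   # next_child[i] = next untried action at depth i
    steps = 0
    while True:
        if len(path) == depth:
            if path == goal_path:
                return path, steps
            # Failure detected at a leaf: undo the last action.
            path.pop()
            next_child.pop()
            continue
        level = len(path)
        if next_child[level] == k:
            # Every child at this node has failed: backtrack further.
            if not path:
                return None, steps
            path.pop()
            next_child.pop()
            continue
        # Descend into the next untried child.
        action = next_child[level]
        next_child[level] += 1
        path.append(action)
        next_child.append(0)
        steps += 1


if __name__ == "__main__":
    rng = random.Random(0)
    for depth in (3, 6):  # the same procedure runs unchanged on a deeper tree
        goal = make_goal_path(3, depth, rng)
        found, steps = dfs_search(3, depth, goal)
        print(depth, found == goal, steps)
```

Nothing in the procedure depends on the tree depth, which gives some intuition for why a policy that implements this mechanism on shallow trees could carry over to deeper ones.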

Why it matters

Provides rare provable guarantees on emergent agentic search behaviors in transformers trained with RL, explaining mechanistically why curriculum-trained agents can generalize beyond their training distribution.

Importance: 2/5

A rare provable result on emergent depth-first search in RL-trained transformers, with generalization from shallow training trees to deeper ones; relevant to understanding how RL training shapes agentic reasoning.

Sources