InterleaveThinker: RL Planner+Critic Pipeline for Interleaved Text-and-Image Generation
CUHK Multimedia Lab
InterleaveThinker is a multi-agent pipeline, consisting of a planner agent and a critic agent, that equips any image generator with the ability to produce interleaved text-image sequences. The planner organizes input sequences; the critic evaluates the generated outputs and refines the instructions for regeneration. Training combines supervised fine-tuning (SFT) on 80K planner and 112K critic examples with GRPO reinforcement learning using step-wise rewards. The system achieves performance comparable to GPT-5-level models on interleaved generation benchmarks (WISE, RISE). Published on arXiv (2606.13679) with 124 upvotes on HuggingFace Daily Papers.
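The plan, generate, critique, regenerate loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Step` and `Verdict` structures, the `generate_interleaved` function, the `max_retries` parameter, and the toy planner, generator, and critic are all hypothetical stand-ins. In the real system the planner and critic would be trained models and the generator an actual image model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One planned element of the interleaved output (hypothetical structure)."""
    text: str          # text segment shown to the reader
    image_prompt: str  # instruction passed to the image generator


@dataclass
class Verdict:
    """Critic output (hypothetical structure)."""
    accepted: bool
    refined_prompt: str  # revised instruction to use when not accepted


def generate_interleaved(
    request: str,
    planner: Callable[[str], List[Step]],
    generator: Callable[[str], str],
    critic: Callable[[Step, str], Verdict],
    max_retries: int = 2,  # assumed retry budget, not taken from the paper
) -> List[Tuple[str, str]]:
    """Plan the sequence, then generate each image with critic-guided retries."""
    output = []
    for step in planner(request):
        prompt = step.image_prompt
        image = generator(prompt)
        for _ in range(max_retries):
            verdict = critic(step, image)
            if verdict.accepted:
                break
            # The critic rejected the image: regenerate from its refined instruction.
            prompt = verdict.refined_prompt
            image = generator(prompt)
        output.append((step.text, image))
    return output


# Toy stand-ins so the sketch runs without any model.
def toy_planner(request: str) -> List[Step]:
    return [Step(f"{request}: part {i}", f"draw part {i}") for i in (1, 2)]


def toy_generator(prompt: str) -> str:
    return f"<image:{prompt}>"


def toy_critic(step: Step, image: str) -> Verdict:
    # Accept only images whose prompt already asked for detail.
    return Verdict("detailed" in image, step.image_prompt + ", detailed")


result = generate_interleaved("story", toy_planner, toy_generator, toy_critic)
```

Because the generator is passed in as a plain callable, the same loop works with any image model, which mirrors the claim that the pipeline can wrap an arbitrary generator.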
Why it matters
Interleaved text-image generation (for example, illustrated stories and embodied instructions) is a key missing capability in open multimodal systems. This is the first work to apply RL to a planner+critic pipeline for this task, and it reaches performance comparable to proprietary frontier models on the relevant benchmarks.
Importance: 3/5
Novel RL approach to interleaved generation; performance comparable to GPT-5-level models; 124 HF upvotes indicate strong community interest.