InterleaveThinker: RL Planner+Critic Pipeline for Interleaved Text-and-Image Generation

CUHK Multimedia Lab


InterleaveThinker is a multi-agent pipeline, consisting of a planner agent and a critic agent, that lets any image generator produce interleaved text-image sequences. The planner organizes the input sequence; the critic evaluates each output and refines the instructions for regeneration. Training combines supervised fine-tuning (SFT) on 80K planner and 112K critic examples with GRPO reinforcement learning using step-wise rewards. The system achieves performance comparable to GPT-5-level models on interleaved generation benchmarks (WISE, RISE). The paper is published on arXiv (2606.13679) and has 124 upvotes on Hugging Face Daily Papers.
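The plan, generate, critique, and regenerate loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Step`, `Verdict`, `run_pipeline`, and `max_retries`, as well as the toy planner, generator, and critic, are all hypothetical stand-ins for the trained agents and the underlying image model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    text: str          # text segment of the interleaved output
    instruction: str   # prompt handed to the image generator


@dataclass
class Verdict:
    accepted: bool
    revised_instruction: Optional[str] = None  # critic's refined prompt, if any


# Hypothetical interfaces; the real agents are trained models.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]        # returns an image handle (a string here)
Critic = Callable[[Step, str], Verdict]


def run_pipeline(request: str, planner: Planner, generator: Generator,
                 critic: Critic, max_retries: int = 2) -> List[Tuple[str, str]]:
    """Plan the sequence, then generate and critique each image in turn."""
    output = []
    for step in planner(request):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            verdict = critic(Step(step.text, instruction), image)
            if verdict.accepted or verdict.revised_instruction is None:
                break
            # Regenerate with the critic's refined instruction.
            instruction = verdict.revised_instruction
            image = generator(instruction)
        output.append((step.text, image))
    return output


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_planner(request: str) -> List[Step]:
    return [Step(f"{request}, part {i}", f"draw part {i}") for i in (1, 2)]


def toy_generator(instruction: str) -> str:
    return f"<image: {instruction}>"


def toy_critic(step: Step, image: str) -> Verdict:
    if "detailed" in step.instruction:
        return Verdict(accepted=True)
    return Verdict(False, step.instruction + ", detailed")


result = run_pipeline("A short story", toy_planner, toy_generator, toy_critic)
```

Because the generator is passed in as a plain callable, the same loop would work with any image model, which is the property the summary attributes to the pipeline.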

Why it matters

Interleaved text-image generation (illustrated stories, embodied instructions) is a key missing capability in open multimodal systems. This is the first work to apply RL to a planner+critic pipeline for this task, matching proprietary frontier models on relevant benchmarks. It has received 124 upvotes on HF Daily Papers.

Importance: 3/5

Novel RL approach to interleaved generation; matches GPT-5-level performance; 124 HF upvotes indicate strong community interest.

multimodal agents rl image-generation paper research generation
