GLM-5V-Turbo: a natively multimodal foundation model for agents
Z.ai
Z.ai unveiled GLM-5V-Turbo, a multimodal foundation model in which visual perception is embedded as a first-class component of reasoning, planning, and tool use rather than bolted on after the fact. The model handles images, video, web pages, and documents; the authors report gains on multimodal coding, visual tool use, and agent tasks while preserving text-only quality. They also emphasize the role of end-to-end verification of agent trajectories during training.
Why it matters
One of the most-hyped releases of the week on HF Daily, with 2.28k upvotes. A bid for a natively multimodal agent (rather than a VLM with tacked-on tool use), a direction in which Z.ai is systematically competing with GPT-5 and Gemini.
Importance: 4/5
Flagship paper from Z.ai; 2.28k upvotes on HF Daily (well above the 100-upvote threshold, +1 over the base score of 3).