llama.cpp b9085: MiMo-V2.5 Flash Attention and Vertex AI Server Support
llama.cpp builds released May 8–9 include two notable features: b9077 adds a Vertex AI-compatible server endpoint, configured via `AIP_*` environment variables for drop-in cloud integration, and b9085 adds flash-attention MMA/tile support for MiMo-V2.5 models with grouped-query attention (GQA) handling optimizations. Other builds in the batch add a Hexagon HTP kernel for Gated Delta Net recurrence and GGUF conversion support for Gemma4_26B_A4B_NVFP4.
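The `AIP_*` convention comes from Vertex AI custom serving containers, which inject variables such as `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` into the container environment. A minimal sketch of how a compatible server picks up that configuration is below; the exact set of variables b9077 honors is an assumption here, not confirmed from the source.

```python
import os

# Hypothetical sketch: Vertex AI injects AIP_* variables into the serving
# container, and a Vertex-compatible server reads them instead of CLI flags.
# The three names below are the standard Vertex AI ones; which of them
# llama-server b9077 actually reads is an assumption.
os.environ.setdefault("AIP_HTTP_PORT", "8080")
os.environ.setdefault("AIP_HEALTH_ROUTE", "/health")
os.environ.setdefault("AIP_PREDICT_ROUTE", "/predict")

config = {
    "port": int(os.environ["AIP_HTTP_PORT"]),
    "health_route": os.environ["AIP_HEALTH_ROUTE"],
    "predict_route": os.environ["AIP_PREDICT_ROUTE"],
}
print(config)
# A server launched in this environment would then bind config["port"] and
# serve health checks and predictions on the two routes.
```

Because the configuration arrives through the environment rather than flags, the same container image works unchanged whether Vertex AI or a local test harness supplies the values.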
Why it matters
Vertex AI server compatibility lets developers swap llama.cpp into Google Cloud serving pipelines with minimal changes, while MiMo-V2.5 attention support extends efficient local inference to very large mixture-of-experts (MoE) models.
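The GQA handling the flash-attention work optimizes refers to grouped-query attention, where many query heads share a smaller number of key/value heads. A minimal NumPy sketch of the technique, not llama.cpp's actual kernel, assuming illustrative head counts and shapes:

```python
import numpy as np

# Grouped-query attention (GQA) sketch: n_q query heads share n_kv key/value
# heads, so each group of n_q // n_kv query heads attends against the same
# K/V tensors. This is what lets GQA models keep a much smaller KV cache.
def gqa(q, k, v):
    # q: (n_q, seq, d); k, v: (n_kv, seq, d) with n_kv dividing n_q
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # expand KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # per-row softmax
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # 2 KV heads -> groups of 4
v = rng.standard_normal((2, 4, 16))
out = gqa(q, k, v)
print(out.shape)  # (8, 4, 16)
```

An optimized kernel would avoid the explicit `np.repeat` and instead index the shared KV heads directly, which is the kind of handling the b9085 tile path targets.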
Importance: 2/5