llama.cpp b9085: MiMo-V2.5 Flash Attention and Vertex AI Server Support
llama.cpp builds released May 8–9 include two notable features: b9077 adds a Vertex AI-compatible server endpoint, configured via `AIP_*` environment variables for drop-in cloud integration, and b9085 adds flash-attention MMA/tile support for MiMo-V2.5 models with grouped-query attention (GQA) handling optimizations. Other builds in the batch add a Hexagon HTP kernel for Gated Delta Net recurrence and GGUF conversion support for Gemma4_26B_A4B_NVFP4.
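The `AIP_*` convention comes from Vertex AI custom serving containers, which inject variables such as `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` into the container environment. A minimal sketch of how a compatible server picks up that configuration is below; the exact set of variables b9077 honors is an assumption here, not confirmed from the source.

```python
import os

# Hypothetical sketch: Vertex AI injects AIP_* variables into the serving
# container, and a Vertex-compatible server reads them instead of CLI flags.
# The three names below are the standard Vertex AI ones; which of them
# llama-server b9077 actually reads is an assumption.
os.environ.setdefault("AIP_HTTP_PORT", "8080")
os.environ.setdefault("AIP_HEALTH_ROUTE", "/health")
os.environ.setdefault("AIP_PREDICT_ROUTE", "/predict")

config = {
    "port": int(os.environ["AIP_HTTP_PORT"]),
    "health_route": os.environ["AIP_HEALTH_ROUTE"],
    "predict_route": os.environ["AIP_PREDICT_ROUTE"],
}
print(config)
# A server launched in this environment would then bind config["port"] and
# serve health checks and predictions on the two routes.
```

Because the configuration arrives through the environment rather than flags, the same container image works unchanged whether Vertex AI or a local test harness supplies the values.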
Why it matters
Vertex AI server compatibility lets developers swap llama.cpp into Google Cloud serving pipelines with minimal changes, while MiMo-V2.5 attention support extends efficient local inference to very large mixture-of-experts (MoE) models.
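The GQA handling the flash-attention work optimizes refers to grouped-query attention, where many query heads share a smaller number of key/value heads. A minimal NumPy sketch of the technique, not llama.cpp's actual kernel, assuming illustrative head counts and shapes:

```python
import numpy as np

# Grouped-query attention (GQA) sketch: n_q query heads share n_kv key/value
# heads, so each group of n_q // n_kv query heads attends against the same
# K/V tensors. This is what lets GQA models keep a much smaller KV cache.
def gqa(q, k, v):
    # q: (n_q, seq, d); k, v: (n_kv, seq, d) with n_kv dividing n_q
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # expand KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # per-row softmax
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # 2 KV heads -> groups of 4
v = rng.standard_normal((2, 4, 16))
out = gqa(q, k, v)
print(out.shape)  # (8, 4, 16)
```

An optimized kernel would avoid the explicit `np.repeat` and instead index the shared KV heads directly, which is the kind of handling the b9085 tile path targets.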
Importance: 2/5