llama.cpp b9702–b9716 Builds: InternVL Multimodal Batching, CUDA col2im, and Nginx SSE Fix
llama.cpp shipped over a dozen builds on June 18–19 (b9702–b9716). Key additions: batching support for InternVL multimodal models in the mtmd pipeline; a CUDA col2im 1D operation; a streaming fix that adds the `X-Accel-Buffering: no` header so that Nginx does not buffer SSE responses; and HTTP 400 errors for invalid grammar inputs, which were previously dropped silently. Server schema and request validation were also added.
Why it matters
The Nginx SSE buffering fix addresses a problem widely encountered in production by anyone serving llama.cpp behind a reverse proxy, where buffered streams reach the client in delayed chunks instead of token by token. The grammar validation change improves debuggability for structured-output use cases.
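To make the SSE fix concrete, here is a minimal sketch in Python of a streaming endpoint that sends the `X-Accel-Buffering: no` header. It is not llama.cpp's server code. The handler class, the `format_sse_event` helper, and the `{"content": ...}` payload shape are hypothetical names chosen for illustration. Only the header itself comes from the release notes.

```python
import json
from http.server import BaseHTTPRequestHandler

# Response headers for a Server-Sent Events stream.
# "X-Accel-Buffering: no" asks Nginx to disable proxy buffering for this
# one response, so each event is forwarded as soon as it is written.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}


def format_sse_event(payload: dict) -> bytes:
    """Encode one JSON payload as an SSE "data:" frame (blank-line terminated)."""
    return f"data: {json.dumps(payload)}\n\n".encode("utf-8")


class StreamHandler(BaseHTTPRequestHandler):
    """Hypothetical handler (illustration only): streams three tokens as SSE."""

    def do_GET(self):
        self.send_response(200)
        for name, value in SSE_HEADERS.items():
            self.send_header(name, value)
        self.end_headers()
        for token in ["Hello", ",", " world"]:
            self.wfile.write(format_sse_event({"content": token}))
            self.wfile.flush()  # push each event to the socket immediately
```

Setting the header on the response keeps the fix inside the application, so it works without editing the proxy configuration. Operators who control Nginx can get the same effect with `proxy_buffering off;` on the relevant location.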
Importance: 2/5
Routine patch builds with a broadly impactful production bug fix (Nginx SSE buffering).
Sources
official
Releases — ggml-org/llama.cpp