llama.cpp b9589–b9592: CUDA SSM Sync Fix and Mamba Memory Optimization

Tools official 2 src. ~1 min

Four builds landed around June 10. b9589 fixes missing thread-sync barriers before shared memory reuse in CUDA SSM scan operations — a correctness bug affecting Mamba-family models running on GPU. b9591 consolidates D2D memory copies for MTP/Mamba into a single strided transfer and refactors ggml_gated_delta_net, reducing overhead. b9590 fixes LFM2/LFM2.5 ignoring json_schema from response_format. b9592 updates LibreSSL to 4.3.2.

Why it matters

The CUDA SSM sync fix addresses a silent correctness issue — affected users may have been getting subtly wrong outputs from Mamba models without knowing it. The memory transfer consolidation improves throughput for Mamba architectures gaining traction as attention alternatives.

Importance: 2/5

Correctness fix for Mamba/SSM GPU inference; silent bug that could affect output quality for local Mamba model users.

inference cuda ssm open-source local-llm

Sources

official llama.cpp b9589 — CUDA SSM sync fix — GitHub

official llama.cpp b9591 — MTP/Mamba memory optimization — GitHub