inference — AI Digest

7 May xAI Releases Grok 4.3 with 1M Context, 40-60% Price Cuts, and Agentic Benchmark Gains xAI models-llm
13 May Codex-Spark (GPT-5.3-Codex-Spark) Research Preview: 1000+ Tokens/Second Coding Model OpenAI tools
20 May Gemini 3.5 Flash Released at Google I/O 2026: Frontier Coding + Agentic at Flash Speed Google DeepMind models-llm
8 Jun NVIDIA Nemotron 3 Ultra: Open 550B MoE Model Now Available for Agentic Workloads NVIDIA models-llm
9 May Zyphra Releases ZAYA1-8B: Open Reasoning MoE Model Trained on AMD Hardware Zyphra models-llm
4 Jun JetBrains Open-Sources Mellum2: 12B MoE Coding Model for Multi-Model Pipelines JetBrains models-llm
16 Jun Kimi K2.7-Code HighSpeed: 6× Throughput for Production Coding Agent Pipelines Moonshot AI models-llm
19 Jun AWS Summit New York 2026: Bedrock AgentCore GA, Kiro iOS Preview, and AWS Context Previewed Amazon tools
19 May LongLive-2.0: NVFP4 Parallel Infrastructure for Long Video Generation (NVIDIA, 1,220 HF upvotes) NVIDIA research
14 Jun MiniMax Sparse Attention: 28× Compute Reduction at 1M-Token Context with No Quality Loss MiniMax research
6 May SGLang v0.5.11: Speculative Decoding V2 as Default and Eight New Model Architectures tools
10 May vLLM v0.20.2: TurboQuant 2-bit KV Cache and FlashAttention 4 Default for MoE Serving tools
15 May Hugging Face Transformers: Async Continuous Batching Achieves 22% Inference Speedup Hugging Face tools
16 May Orthrus: 7.8x Inference Speedup for Qwen3 via Autoregressive-Diffusion KV Sharing research
18 May vLLM v0.21.0: Blackwell MLA Backend, HMA KV Offload, Spec Decode for Reasoning Models vLLM Project tools
2 Jun vLLM v0.22.0: DeepSeek V4 Production Hardening, Rust Frontend, 28.9% Latency Drop tools
9 Jun vLLM Semantic Router v0.3 Themis: Stateful Production Routing with Session-Aware Agentic Routing tools
14 Jun vLLM Adds Day-0 Support for MiniMax M3 Open Weights with 1M-Context Sparse Attention MiniMax tools
17 Jun vLLM v0.23.0: Model Runner V2 Default for Llama and Mistral, Transformers v5, Multi-Tier KV Cache tools
29 Apr vLLM v0.20.0: third release in two weeks vLLM tools
30 Apr TIDE: cross-architecture distillation for diffusion LLMs Peking University research
2 May ESamp: LLMs explore by latent distilling for semantic-novelty sampling ShanghaiTech University research
11 May AutoTTS: LLM Agents Automatically Discover Test-Time Scaling Strategies for $40 research
2 Jun BadHost (CVE-2026-48710): Host-Header Auth Bypass in Starlette Exposes vLLM, LiteLLM, and MCP Servers tools
4 May Ollama v0.23.0 Adds Claude Desktop Support via Ollama Launch Ollama tools
4 May vLLM v0.20.1 Patches Critical DeepSeek V4 Instability Under Production Workloads vLLM Project tools
6 May vLLM v0.20.1: DeepSeek V4 Stabilization on CUDA 13 and PyTorch 2.11 tools
6 May Ollama v0.23.1: Gemma 4 MTP Speculative Decoding Delivers 2× Speed on Apple Silicon tools
7 May VK Video AI Character Recognition Boosts Watch Time 9% via Cascade Face Detection VK AI tools
7 May LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents Shanghai Jiao Tong University research
9 May llama.cpp b9085: MiMo-V2.5 Flash Attention and Vertex AI Server Support tools
12 May vLLM v0.21.0rc1: Python 3.14, CUDA 13.0, and Transformers v5 Compatibility tools
12 May TMAS: Scaling Test-Time Compute via Multi-Agent Synergy with Hierarchical Memory research
13 May vLLM v0.21.0rc1: PyTorch 2.11, HuggingFace Transformers v5, and Python 3.14 Support tools
15 May Ollama v0.24.0: Codex App Integration and MLX Sampler Improvements Ollama tools
16 May llama.cpp b9161/b9169: Codex CLI Compatibility and Qwen3A Multimodal Support ggml-org tools
18 May BetaPRM: Uncertainty-Aware Process Rewards Cut Reasoning Token Use by 33% research
8 Jun Code2LoRA: Hypernetwork Generates Repo-Specific Adapters for Code LMs with Zero Inference Overhead University of Waterloo research
9 Jun Ollama v0.30.7: Hermes Desktop Support, Gemma 4 QAT, and Nemotron-3-Ultra Ollama tools
11 Jun llama.cpp b9589–b9592: CUDA SSM Sync Fix and Mamba Memory Optimization tools
12 Jun llama.cpp b9603: Qualcomm Adreno OpenCL Kernels for On-Device Inference ggml-org tools
17 Jun Ollama v0.30.9: Cohere2Moe Support, Coding Agent Single-Token Output Bug Fixed tools
17 Jun llama.cpp June 16 Builds: Eagle3 Speculative Decoding, Vulkan UMA Memory, NVFP4 Fixes tools
19 Jun llama.cpp b9716 Builds: InternVL Multimodal Batching, CUDA col2im, and Nginx SSE Fix tools
12 May llama.cpp Adds gpt-oss-20b Support in May 12 Build tools
13 May Ollama v0.23.3: MLX Runner Fixes and macOS 26 Metal Compatibility Ollama tools