-
xAI Releases Grok 4.3 with 1M Context, 40-60% Price Cuts, and Agentic Benchmark Gains
xAI
models-llm
-
Codex-Spark (GPT-5.3-Codex-Spark) Research Preview: 1000+ Tokens/Second Coding Model
OpenAI
tools
-
Gemini 3.5 Flash Released at Google I/O 2026: Frontier Coding + Agentic at Flash Speed
Google DeepMind
models-llm
-
NVIDIA Nemotron 3 Ultra: Open 550B MoE Model Now Available for Agentic Workloads
NVIDIA
models-llm
-
Zyphra Releases ZAYA1-8B: Open Reasoning MoE Model Trained on AMD Hardware
Zyphra
models-llm
-
JetBrains Open-Sources Mellum2: 12B MoE Coding Model for Multi-Model Pipelines
JetBrains
models-llm
-
Kimi K2.7-Code HighSpeed: 6× Throughput for Production Coding Agent Pipelines
Moonshot AI
models-llm
-
AWS Summit New York 2026: Bedrock AgentCore GA, Kiro iOS Preview, and AWS Context Previewed
Amazon
tools
-
LongLive-2.0: NVFP4 Parallel Infrastructure for Long Video Generation (NVIDIA, 1,220 HF upvotes)
NVIDIA
research
-
MiniMax Sparse Attention: 28× Compute Reduction at 1M-Token Context with No Quality Loss
MiniMax
research
-
SGLang v0.5.11: Speculative Decoding V2 as Default and Eight New Model Architectures
tools
-
vLLM v0.20.2: TurboQuant 2-bit KV Cache and FlashAttention 4 Default for MoE Serving
tools
-
Hugging Face Transformers: Async Continuous Batching Achieves 22% Inference Speedup
Hugging Face
tools
-
Orthrus: 7.8× Inference Speedup for Qwen3 via Autoregressive-Diffusion KV Sharing
research
-
vLLM v0.21.0: Blackwell MLA Backend, HMA KV Offload, Spec Decode for Reasoning Models
vLLM Project
tools
-
vLLM v0.22.0: DeepSeek V4 Production Hardening, Rust Frontend, 28.9% Latency Drop
tools
-
vLLM Semantic Router v0.3 Themis: Stateful Production Routing with Session-Aware Agentic Routing
tools
-
vLLM Adds Day-0 Support for MiniMax M3 Open Weights with 1M-Context Sparse Attention
MiniMax
tools
-
vLLM v0.23.0: Model Runner V2 Default for Llama and Mistral, Transformers v5, Multi-Tier KV Cache
tools
-
vLLM v0.20.0: Third Release in Two Weeks
vLLM
tools
-
TIDE: Cross-Architecture Distillation for Diffusion LLMs
Peking University
research
-
ESamp: LLMs Explore by Latent Distilling for Semantic-Novelty Sampling
ShanghaiTech University
research
-
AutoTTS: LLM Agents Automatically Discover Test-Time Scaling Strategies for $40
research
-
BadHost (CVE-2026-48710): Host-Header Auth Bypass in Starlette Exposes vLLM, LiteLLM, and MCP Servers
tools
-
Ollama v0.23.0 Adds Claude Desktop Support via Ollama Launch
Ollama
tools
-
vLLM v0.20.1 Patches Critical DeepSeek V4 Instability Under Production Workloads
vLLM Project
tools
-
vLLM v0.20.1: DeepSeek V4 Stabilization on CUDA 13 and PyTorch 2.11
tools
-
Ollama v0.23.1: Gemma 4 MTP Speculative Decoding Delivers 2× Speed on Apple Silicon
tools
-
VK Video AI Character Recognition Boosts Watch Time 9% via Cascade Face Detection
VK AI
tools
-
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
Shanghai Jiao Tong University
research
-
llama.cpp b9085: MiMo-V2.5 Flash Attention and Vertex AI Server Support
tools
-
vLLM v0.21.0rc1: Python 3.14, CUDA 13.0, and Transformers v5 Compatibility
tools
-
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy with Hierarchical Memory
research
-
vLLM v0.21.0rc1: PyTorch 2.11, HuggingFace Transformers v5, and Python 3.14 Support
tools
-
Ollama v0.24.0: Codex App Integration and MLX Sampler Improvements
Ollama
tools
-
llama.cpp b9161/b9169: Codex CLI Compatibility and Qwen3A Multimodal Support
ggml-org
tools
-
BetaPRM: Uncertainty-Aware Process Rewards Cut Reasoning Token Use by 33%
research
-
Code2LoRA: Hypernetwork Generates Repo-Specific Adapters for Code LMs with Zero Inference Overhead
University of Waterloo
research
-
Ollama v0.30.7: Hermes Desktop Support, Gemma 4 QAT, and Nemotron-3-Ultra
Ollama
tools
-
llama.cpp b9589–b9592: CUDA SSM Sync Fix and Mamba Memory Optimization
tools
-
llama.cpp b9603: Qualcomm Adreno OpenCL Kernels for On-Device Inference
ggml-org
tools
-
Ollama v0.30.9: Cohere2Moe Support, Coding Agent Single-Token Output Bug Fixed
tools
-
llama.cpp June 16 Builds: Eagle3 Speculative Decoding, Vulkan UMA Memory, NVFP4 Fixes
tools
-
llama.cpp b9716 Builds: InternVL Multimodal Batching, CUDA col2im, and Nginx SSE Fix
tools
-
llama.cpp Adds gpt-oss-20b Support in May 12 Build
tools
-
Ollama v0.23.3: MLX Runner Fixes and macOS 26 Metal Compatibility
Ollama
tools