Daily digest
15 items · ~15 min · Week 2026-W20
Must-read (4)
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
Mind Lab · MinT is a production infrastructure system for managing millions of LoRA policy variants on top of shared base models up to 1T+ parameters. It scales in three directions: up to frontier-scale models, down by transferring only LoRA adapters (<1% of base model size), and out by supporting concurrent multi-policy training and cold-loading for a million-scale catalog. Efficiency gains: 18.3x on dense models, 2.85x on MoE models.
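The "<1% of base model size" figure follows directly from LoRA's low-rank factorization: an adapter for a d_out × d_in weight adds only rank × (d_in + d_out) parameters. A back-of-the-envelope sketch, using hypothetical layer shapes that are not taken from the MinT paper:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds two low-rank factors, B (d_out x r) and A (r x d_in),
    # instead of updating the full d_out x d_in weight.
    return rank * (d_in + d_out)

# Hypothetical transformer shapes (not MinT's actual configuration).
d_model, rank, n_layers, n_proj = 8192, 16, 80, 4
base = n_layers * n_proj * d_model * d_model                     # frozen weights being adapted
adapter = n_layers * n_proj * lora_param_count(d_model, d_model, rank)
print(f"adapter is {adapter / base:.4%} of the adapted base weights")  # 0.3906%
```

At rank 16 the adapter is well under 1% of the weights it adapts, which is what makes shipping only the adapter ("scaling down") cheap.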
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Images
Technion · MulTaBench introduces 40 datasets (20 image-tabular, 20 text-tabular) — the largest image-tabular benchmarking effort to date. The benchmark reveals that current tabular foundation models rely on frozen embeddings and that task-specific tuning substantially improves performance across text and image modalities and multiple encoder scales.
EVA-Bench: End-to-End Framework for Evaluating Voice Agents
ServiceNow AI · EVA-Bench provides end-to-end evaluation for voice agents through bot-to-bot audio conversation simulation. It introduces composite metrics EVA-A (task completion + speech fidelity) and EVA-X (conversation flow + turn-taking timing), plus a 213-scenario benchmark across three enterprise domains. Evaluation of 12 systems reveals no single system excels on both metrics, with a median gap of 0.44 between peak and reliable performance.
xAI Launches Grok Build: Agentic Coding CLI in Early Beta
xAI · xAI released Grok Build, an agentic coding CLI currently in early beta for SuperGrok Heavy subscribers. Grok Build operates from the terminal and can read repositories, propose structured plans, edit files across a codebase, run shell commands, install dependencies, and spin up parallel subagents in isolated worktrees. A Plan Mode lets users inspect and modify the proposed steps before execution begins.
Worth knowing (7)
SU-01: Gold-Medal-Level Olympiad Reasoning via Curriculum SFT and Two-Stage RL
SU-01 Team · SU-01 is a 30B-A3B model trained with reverse-perplexity curriculum SFT followed by two-stage RL (~340K SFT trajectories + 200 RL steps). The model achieves gold-medal-level performance on IMO, USAMO, and IPhO benchmarks and stably handles reasoning trajectories exceeding 100K tokens.
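A perplexity-ordered curriculum scores each trajectory under a reference model and schedules training data in perplexity order. The sketch below shows one plausible reading (lowest-perplexity, i.e. easiest, trajectories first); the scoring model, the exact ordering direction, and the log-prob values are assumptions, not details from the SU-01 report:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # exp of the mean negative log-likelihood over the sequence
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy trajectories with made-up per-token log-probs from a reference model.
batch = {
    "easy":   [-0.1, -0.2, -0.1],
    "medium": [-0.8, -1.1, -0.9],
    "hard":   [-2.5, -3.0, -2.2],
}

# Curriculum order: schedule low-perplexity samples first, harder ones later.
order = sorted(batch, key=lambda k: perplexity(batch[k]))
print(order)  # ['easy', 'medium', 'hard']
```

Reversing the sort key would give the hard-to-easy variant; which direction "reverse-perplexity" denotes is not specified in the summary.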
OpenAI Brings Codex to ChatGPT Mobile and Enables Remote SSH
OpenAI · OpenAI released Codex in the ChatGPT mobile app for iOS and Android, enabling users to monitor active Codex sessions remotely — reviewing diffs, terminal output, test results, and screenshots — and approve or reject proposed commands from their phone while the agent runs on a desktop or devbox. The update also brings Remote SSH to general availability with new programmatic access tokens for Business and Enterprise automation. Over 4 million users interact with Codex weekly.
Cursor 3.4: Cloud Agent Development Environments with Multi-Repo Docker Support
Cursor · Cursor 3.4 introduces development environments for cloud agents — Docker-based sandboxes with cloned repos, installed dependencies, credentials, and build system access. Teams can configure multi-repo environments reused across sessions, with builds up to 70% faster on cache hits. A May 11 update added Microsoft Teams integration for delegating coding tasks via @Cursor.
VS Code 1.120: Agents Window Ships to Stable with Terminal Risk Assessment
Microsoft · VS Code 1.120 graduates the Agents window from Insiders to Stable, providing a dedicated interface for multiple agents across multiple projects. New safety features include terminal command risk assessment with AI-generated Safe/Caution/Review badges and terminal output compression to reduce context window usage. BYOK token visibility and configurable thinking effort for reasoning models were also added.
IBM Granite Embedding Multilingual R2: 32K Context and Best Sub-100M Retrieval
IBM · IBM released two new open embedding models: granite-embedding-311m-multilingual-r2 (MTEB Multilingual 65.2) and granite-embedding-97m-multilingual-r2 (60.3, best sub-100M). Both support a 32,768-token context window — 64x more than R1 — 200+ languages, and 9 programming languages. Built on ModernBERT with Flash Attention 2.0. Apache 2.0 license; ONNX/OpenVINO weights included.
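Once documents are embedded, retrieval reduces to ranking by cosine similarity between the query vector and each document vector. A minimal sketch of that scoring step — the three-dimensional vectors below are made up for illustration, not actual Granite model outputs:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # cosine similarity: dot product over the product of L2 norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these came from an embedding model; real vectors would have
# hundreds of dimensions.
query = [0.1, 0.9, 0.2]
docs = {
    "invoice": [0.0, 1.0, 0.1],
    "recipe":  [0.9, 0.1, 0.0],
}

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # invoice
```

The 32K context window matters here because long documents need fewer chunks before embedding, so each stored vector summarizes more text.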
Hugging Face Transformers: Async Continuous Batching Achieves 22% Inference Speedup
Hugging Face · Hugging Face published a blog post describing asynchronous continuous batching in the Transformers library. Using CUDA streams to overlap CPU batch preparation with GPU compute, GPU utilization climbs from 76% to 99.4%, cutting generation time by 22% (300.6s → 234.5s) on an 8B model at batch size 32. The technique requires no model architecture changes.
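The core idea is pipelining: while step N runs on the GPU, batch N+1 is prepared on the CPU, so the accelerator never idles waiting for host-side work. This is a conceptual sketch using a worker thread in place of a CUDA stream — it is not the Transformers implementation, and the sleep-based "prepare" and "forward" functions are stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_batch(i: int) -> str:
    # Stand-in for CPU-side work: padding, scheduling, tensor assembly.
    time.sleep(0.01)
    return f"batch-{i}"

def run_step(batch: str) -> str:
    # Stand-in for the GPU forward pass.
    time.sleep(0.02)
    return f"out-{batch}"

outputs = []
with ThreadPoolExecutor(max_workers=1) as prep:
    next_batch = prep.submit(prepare_batch, 0)
    for i in range(1, 4):
        batch = next_batch.result()
        next_batch = prep.submit(prepare_batch, i)  # overlaps with compute below
        outputs.append(run_step(batch))
    outputs.append(run_step(next_batch.result()))
print(outputs)  # ['out-batch-0', 'out-batch-1', 'out-batch-2', 'out-batch-3']
```

In the real system the overlap is between host code and an asynchronous CUDA stream rather than two threads, but the scheduling shape — submit the next batch's prep before blocking on the current step — is the same.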
Runway Launches Runway Agent: End-to-End Agentic Video Production
Runway · Runway introduced Runway Agent — an agentic creative partner that takes a user from a text description to a finished, multi-scene, ready-to-publish video in a single conversation. The agent proposes concepts, develops story structure, generates multiple scenes with voiceover, dialogue, and music, and assembles the final video. Users can provide reference images and refine direction conversationally.
For reference (4)
xAI Retires 8 Legacy Models; Grok 4.3 Becomes Default API Model
xAI · Effective May 15, 2026, xAI retired eight legacy models from its API — including grok-4-fast-reasoning, grok-4-0709, grok-code-fast-1, and grok-3 variants — redirecting all traffic to Grok 4.3. Grok 4.3 is xAI's current flagship with built-in reasoning (four effort levels), a 1 million token context window, native video input, and pricing at $1.25/$2.50 per million input/output tokens. It tops the Artificial Analysis Intelligence Index (score 53 vs. median 35).
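At the listed prices, per-request cost is a simple linear function of token counts. A worked example with made-up request sizes (only the two per-million-token prices come from the item above):

```python
PRICE_IN, PRICE_OUT = 1.25, 2.50        # USD per million tokens (from the listing)
tokens_in, tokens_out = 200_000, 8_000  # hypothetical request sizes

cost = tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT
print(f"${cost:.4f}")  # $0.2700
```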
Claude Code v2.1.142: Opus 4.7 Fast Mode Default and Expanded Agents Flags
Anthropic · Claude Code v2.1.142 upgrades the fast mode default model from Opus 4.6 to Opus 4.7 and introduces new flags for the `claude agents` command: --add-dir, --settings, --mcp-config, --plugin-dir, --permission-mode, --model, --effort, and --dangerously-skip-permissions. Also fixes MCP_TOOL_TIMEOUT cap, background session worktree recognition, and a Windows network-drive deadlock.
OpenCode v1.15.0: Effect-Based Event System and Background Subagents
SST · OpenCode v1.15.0 introduces an Effect-based core event system for more complete event delivery across sessions and integrations. The preceding v1.14.51 shipped experimental background subagents, allowing tasks to continue running while the user keeps working in the foreground, plus NVIDIA billing header support and a LiteLLM v1.85+ requirement.
Ollama v0.24.0: Codex App Integration and MLX Sampler Improvements
Ollama · Ollama v0.24.0 introduces built-in Codex App integration with browser and review mode capabilities. The MLX sampler was refined for improved generation quality on Apple Silicon. Earlier v0.23.x releases added vision model support in `ollama launch opencode` and fixed Claude tool result formatting.