Daily digest
7 items · ~7 min · Week 2026-W27
Worth knowing (5)
Grok 4.5 Enters Private Beta at SpaceX and Tesla
xAI
xAI CEO Elon Musk announced on June 28 that Grok 4.5 has entered private beta at SpaceX and Tesla. Built on xAI's V9 foundation with 1.5 trillion parameters, a 50% increase over Grok 4.4, the model incorporates supplemental training data from the Cursor coding platform. Early internal evaluations reportedly show Grok 4.5 performing at or above Anthropic Claude Opus on certain tasks. No public release date has been announced.
JetSpec: Parallel Tree Drafting Achieves 9.64× Speculative Decoding Speedup
Hao AI Lab, UCSD
JetSpec introduces a causal parallel draft head that resolves the causality-efficiency dilemma in speculative decoding. Standard tree-based drafters either draft autoregressively (accurate but slow) or in one parallel pass (fast but incoherent). JetSpec trains a draft head over the target model's fused hidden states so candidate-tree token scores follow the target's autoregressive factorization, then verifies the full tree in a single forward pass. On coding and math benchmarks it achieves up to 9.64× speedup over standard autoregressive decoding on H100/B200 GPUs. Code is open-sourced.
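The tree-verification step can be illustrated with a toy sketch. It assumes greedy acceptance and stores the draft tree as a plain dictionary; `DraftTree`, `verify_tree`, and `toy_target` are hypothetical names for illustration, not JetSpec's API. A real system scores every tree node in one batched forward pass, whereas this sketch queries the stand-in target node by node for readability.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Maps an accepted draft path (tuple of tokens) to the candidate next tokens drafted after it.
DraftTree = Dict[Tuple[int, ...], List[int]]


def verify_tree(
    prefix: Sequence[int],
    tree: DraftTree,
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy tree verification: follow the branch that matches the target's choices.

    target_next is a stand-in for the target model's greedy next-token pick.
    Returns every matched draft token plus one final token chosen by the target.
    """
    accepted: List[int] = []
    path: Tuple[int, ...] = ()
    while True:
        want = target_next(list(prefix) + accepted)
        accepted.append(want)  # the target's own token is always valid output
        if want not in tree.get(path, []):
            return accepted  # no drafted branch matches, so stop here
        path = path + (want,)  # a drafted branch matched, so descend into it


def toy_target(tokens: Sequence[int]) -> int:
    """Hypothetical target model: always predicts last token + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


tree: DraftTree = {(): [2, 5], (2,): [3, 7], (2, 3): [9]}
print(verify_tree([1], tree, toy_target))  # [2, 3, 4]
```

Here two drafted tokens match and the target contributes a third, so one verification step yields three tokens instead of one. That gap is where speculative decoding's speedup comes from.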
OPID: On-Policy Skill Distillation Improves Long-Horizon Agent RL
Institute of Automation, Chinese Academy of Sciences
OPID adds dense, token-level supervision to outcome-based RL for LLM agents. During training, a lightweight LLM analyzer extracts two levels of hindsight skill from completed trajectories: episode-level workflow summaries and step-level action rationales at critical decision points. A critical-first routing mechanism injects the appropriate skill into the interaction history, letting the policy contrast responses with and without skill guidance for token-level advantage estimation. On ALFWorld, WebShop, and Search-QA, OPID improves task completion, sample efficiency, and robustness over baseline outcome-only RL.
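One plausible reading of the with/without-skill contrast is a per-token log-probability difference. The sketch below assumes that form; the summary above does not specify OPID's exact estimator, and `token_advantages` is a hypothetical helper rather than the authors' code.

```python
from typing import List


def token_advantages(
    logp_with_skill: List[float],
    logp_without_skill: List[float],
) -> List[float]:
    """Per-token signal: how much more likely each response token becomes
    when the hindsight skill is present in the interaction history.

    Assumed form (not confirmed by the source): A_t = logp_with[t] - logp_without[t].
    """
    if len(logp_with_skill) != len(logp_without_skill):
        raise ValueError("both sequences must score the same response tokens")
    return [w - wo for w, wo in zip(logp_with_skill, logp_without_skill)]


# Token 0 is favored by the skill context; token 1 is unaffected.
print(token_advantages([-0.5, -1.0], [-1.5, -1.0]))  # [1.0, 0.0]
```

Unlike a single end-of-episode reward, a signal of this shape assigns credit to individual tokens, which is the kind of dense supervision the item describes.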
SingGuard: Runtime Policy-Adaptive Multimodal LLM Guardrail with 56K-Example Benchmark
inclusionAI
SingGuard is a guardrail model for vision-language models that accepts natural-language safety policies at runtime rather than using rules baked in at training time. It evaluates content against policy rules one by one, with three inference speed modes (fast/hybrid/slow) to trade interpretability for latency. A new benchmark, SingGuard-Bench, contains 56,340 examples across 80+ risk categories, including cross-modal joint-risk cases where neither text nor image alone is harmful but their combination implies unsafe intent. On runtime policy changes, policy-following accuracy improves from ~64.6% for prior methods to ~74.1%.
DeepSeek Open-Sources DSpark: 57–85% Inference Speedup for V4 in Production
DeepSeek
DeepSeek and Peking University NLP Lab released DSpark (Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation), a framework that accelerates DeepSeek-V4-Flash inference by 60–85% and V4-Pro by 57–78% over the prior MTP-1 baseline. The framework is live in production for both V4 variants. The training and evaluation codebase DeepSpec is open-sourced under MIT on GitHub (`deepseek-ai/DeepSpec`), and Hugging Face model cards for DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark have been published.
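To put the percentages in perspective, the arithmetic below converts a percent speedup into a throughput multiplier and a share of time saved. It assumes "X% speedup" means X% higher throughput, which the summary above does not state explicitly; `speedup_factor` and `time_saved` are hypothetical helper names.

```python
def speedup_factor(percent: float) -> float:
    """Throughput multiplier for an 'X% faster' claim (assumed meaning)."""
    return 1.0 + percent / 100.0


def time_saved(percent: float) -> float:
    """Fraction of decoding time removed at that throughput multiplier."""
    return 1.0 - 1.0 / speedup_factor(percent)


for name, low, high in [("V4-Flash", 60, 85), ("V4-Pro", 57, 78)]:
    print(
        f"{name}: {speedup_factor(low):.2f}x to {speedup_factor(high):.2f}x, "
        f"{time_saved(low):.1%} to {time_saved(high):.1%} less time"
    )
```

Under that reading, the 57–85% range corresponds to roughly 36–46% less decoding time. These figures are measured against an MTP-1 baseline that already uses speculative decoding, so they are not directly comparable to multipliers reported against plain autoregressive decoding.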
For reference (2)
Wayfinder Router: Open-Source Offline LLM Query Router Trends on Hacker News
Wayfinder Router (Apache-2.0, Python) is a CLI tool that routes LLM queries between local models (Ollama, vLLM) and hosted APIs (OpenAI, Claude, Gemini, OpenAI-compatible endpoints) without making a model call for the routing decision. It scores prompt structural complexity on a 0–1 scale offline in under 1ms, dispatching low-complexity queries to local models and high-complexity ones to hosted APIs. It exposes an OpenAI-compatible gateway so callers need not change client code. The project reached 115 points on Hacker News on June 28.
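The general idea of model-free routing can be sketched in a few lines. Everything below is illustrative: the features, weights, threshold, and the names `complexity_score` and `route` are assumptions made for this example and are not taken from Wayfinder Router's code.

```python
import re


def complexity_score(prompt: str) -> float:
    """Toy structural score in [0, 1], computed offline with no model call.

    Features and weights are guesses for illustration only.
    """
    words = len(prompt.split())
    length_part = min(words / 200, 1.0)  # longer prompts score higher
    looks_like_code = "```" in prompt or re.search(r"\bdef |\bclass |[{};]", prompt)
    code_part = 1.0 if looks_like_code else 0.0
    # Count list-style lines such as "1. do this" or "- do that".
    steps = len(re.findall(r"^\s*(?:\d+\.|[-*])\s", prompt, flags=re.MULTILINE))
    step_part = min(steps / 5, 1.0)
    return 0.5 * length_part + 0.3 * code_part + 0.2 * step_part


def route(prompt: str, threshold: float = 0.4) -> str:
    """Send low-scoring prompts to a local model, the rest to a hosted API."""
    return "hosted" if complexity_score(prompt) >= threshold else "local"


print(route("What is 2+2?"))  # local
```

Because the score comes from string features alone, the routing decision adds no network round trip, which is the property the project advertises.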
llama.cpp Builds b9830–b9837: DFlash v2, MiniCPM5 Parser, --reasoning-preserve Flag
ggml-org
Six llama.cpp builds shipped June 28–29 (b9830–b9837). Key additions: b9830 adds an `--offline` flag to `llama download` for cache-only model access and fixes a use-after-free in URL-task callbacks; b9831 adds DFlash v2 with per-layer sliding window attention; b9833 implements a dedicated MiniCPM5 PEG parser with XML tool-call support; b9837 adds `--reasoning-preserve` to retain chain-of-thought tokens in Jinja and chat output.