Daily digest
10 items · ~10 min · Week 2026-W21
Must-read (2)
ExploitBench: Claude Mythos Preview and GPT-5.5 Develop Real Browser Exploits Autonomously
Anthropic · Carnegie Mellon University researchers published ExploitBench, a benchmark testing AI models on real-world V8 JavaScript engine vulnerabilities across 16 capability tiers. Anthropic's Claude Mythos Preview led all models with a score of 9.90/16 (with hints) and 9.55/16 autonomous, achieving arbitrary code execution on 21 of 41 tested vulnerabilities. OpenAI's GPT-5.5 scored 5.51. Researchers found 'reaching arbitrary code execution is an emerging frontier capability.'
RoPE Provably Fails at Long Contexts: Locality Bias and Token Consistency Both Break
A NeurIPS 2026 submission (arXiv:2605.15514) formally proves two fundamental failures of Rotary Positional Embeddings (RoPE) at long context lengths: locality bias collapses (the model cannot reliably favor nearby tokens), and token consistency breaks (attention scores for the same token differ by position). The authors prove these failures are in direct tension — adjusting RoPE's base parameter trades one failure for the other rather than resolving either.
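For readers who want the mechanics behind this item, here is a minimal sketch of standard RoPE scoring in plain Python. It is not taken from the paper and does not reproduce its proofs; the function names `rope_rotate` and `rope_score` are illustrative, and the default base of 10000 is the commonly used value, assumed here.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply standard RoPE to an even-length vector at integer position `pos`.

    Each consecutive pair (vec[i], vec[i+1]) is rotated by pos * base**(-i/d).
    Illustrative helper, not the paper's code.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # rotation frequency for this pair
        angle = pos * theta
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention logit between a query and a key after RoPE."""
    rq = rope_rotate(q, q_pos, base)
    rk = rope_rotate(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))
```

Calling `rope_score(q, k, 10, 3)` and `rope_score(q, k, 110, 103)` with the same vectors gives the same value up to floating-point error, because the logit depends only on the offset between the two positions. The `base` argument sets the rotation frequencies and therefore how the score varies with that offset, which is the knob the item describes as trading one failure for the other.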
Worth knowing (5)
DeepSeek Nears Close of Record ~$4B Funding Round at $50B Valuation
DeepSeek · DeepSeek is finalizing its first-ever external funding round, targeting $3–4 billion at a $50 billion valuation, a fivefold increase within weeks. China's National AI Industry Investment Fund ('Big Fund III') is leading the round, with Tencent participating. Founder Liang Wenfeng, holding ~90% of the company, is personally contributing up to $2.94 billion. The round is expected to close imminently as of mid-May 2026, per South China Morning Post.
OpenAI Reorganizes Product Teams Around Agentic Strategy, Brockman Takes Charge
OpenAI · OpenAI president Greg Brockman formally assumed control of the company's product strategy, with an internal memo outlining plans to merge ChatGPT, Codex, and the API into a single unified platform under one product team. The stated goal is building an 'agentic future' with Thibault Sottiaux (Codex CEO) leading the consolidated product org. The restructuring comes while CEO of AGI Deployment Fidji Simo is on medical leave.
SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems
A 64-mathematician consortium from CMU, EleutherAI, and Seoul National University published SOOHAK, a 439-problem research-level math benchmark. Frontier scores: Gemini 3 Pro 30.4%, GPT-5 26.4%, Claude Opus 4.5 10.4%. A 'refusal subset' of 99 intentionally ill-posed problems revealed no model exceeded 50% accuracy at refusing unsolvable questions — models regularly produced confident wrong answers on problems with no valid solution.
Judge Circuits: Mechanistic Explanation of LLM-as-Judge Format Inconsistency
Researchers applied causal circuit analysis to Gemma-3, Qwen2.5, and Llama-3 to explain why LLM judges produce inconsistent scores across output formats (e.g., 1–5 vs. True/False). They identified a sparse 'Latent Evaluator' sub-graph in mid-to-late layers shared across tasks; a single continuous judgment signal routes through fragile format-specific terminal branches, explaining format-driven score variance (arXiv:2605.16023).
vLLM v0.21.0: Blackwell MLA Backend, HMA KV Offload, Spec Decode for Reasoning Models
vLLM Project · vLLM v0.21.0 shipped May 15, 2026 (367 commits, 202 contributors). Key additions: TOKENSPEED_MLA attention backend for DeepSeek-R1 and Kimi-K2.5 on NVIDIA Blackwell GPUs; KV offloading integrated with the Hybrid Memory Allocator (HMA); speculative decoding now respects reasoning/thinking budgets for correctness with reasoning models; Docker image reduced by ~2.5 GB. Breaking changes: C++20 compiler required, Transformers v4 deprecated (must upgrade to v5).
For reference (3)
BetaPRM: Uncertainty-Aware Process Rewards Cut Reasoning Token Use by 33%
BetaPRM (arXiv:2605.15529) extends Process Reward Models (PRMs) by predicting both step-level reward scores and their reliability via a Beta-Binomial likelihood framework trained on Monte Carlo rollouts. An Adaptive Computation Allocation (ACA) strategy stops reasoning early when reward confidence is high and allocates more compute when uncertain, achieving up to 33.57% reduction in token usage while maintaining or improving accuracy across reasoning benchmarks.
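To make the stopping idea concrete, here is a minimal sketch that assumes a reward model has already produced Beta parameters (alpha, beta) for a reasoning step. The mean and variance formulas are the standard ones for a Beta distribution; the function names and the thresholds `min_mean` and `max_std` are hypothetical and are not the paper's ACA rule.

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var

def should_stop_early(alpha, beta, min_mean=0.9, max_std=0.05):
    """Hypothetical rule: stop when the predicted reward is high AND confident.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    mean, var = beta_mean_var(alpha, beta)
    return mean >= min_mean and var ** 0.5 <= max_std
```

With these assumed thresholds, Beta(95, 5) and Beta(1.9, 0.1) share the same mean of 0.95, but only the first is concentrated enough to trigger an early stop. The second would receive more compute, which matches the behavior the item describes for uncertain steps.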
OpenCode v1.15.2–v1.15.4: Async Context Fix and Custom LSP Events
SST · OpenCode (by SST) shipped three patch releases May 16–17, 2026. v1.15.2 reduced unnecessary prompting in shell and task flows. v1.15.3 fixed async commands losing active instance context, a bug that broke agent generation and GitHub-driven runs. v1.15.4 fixed project-scoped bus events for file watchers and added custom LSP server refresh event support. These follow the major v1.15.0 Effect-based event system and v1.15.1 collapsible thinking view.
OpenClaw v2026.5.16-beta.5/6: Grok OAuth, Mac Settings Redesign, Python Debugging
OpenClaw · OpenClaw released beta.5 (May 17) and beta.6 (May 18) of its 2026.5.16 series. beta.5 improved OpenAI stream handling and Codex thread management. beta.6 added a Mac app Settings redesign, a meme-maker skill, Python debugging support, and an HTTPS proxy endpoint. Earlier in the series, beta.3 (May 16) added xAI Grok OAuth login. The openclaw-code-agent plugin manages Claude Code and Codex as background coding sessions from Telegram, Slack, Discord, and WhatsApp.