Daily digest
14 items · ~14 min · Week 2026-W26
Must-read (2)
OpenAI and Broadcom Unveil Jalapeño: OpenAI's First Custom AI Inference Chip
OpenAI · OpenAI and Broadcom jointly announced Jalapeño, OpenAI's first custom ASIC designed exclusively for LLM inference, on June 24. The chip was co-developed from initial design to tape-out in nine months, with AI models accelerating parts of the chip design itself. OpenAI claims roughly 50% better cost-per-token versus current-generation GPUs. Prototype deployments are targeted for end of 2026, with production ramp in 2027–2028. The chip will not be sold to external customers.
Qualcomm Acquires Modular for $3.92B to Challenge CUDA Lock-in
Qualcomm · Qualcomm announced at its Investor Day on June 24 that it is acquiring Modular, the startup behind the Mojo programming language and MAX inference engine, in an all-stock deal valued at approximately $3.92B. The deal is expected to close in H2 2026, pending regulatory approval. Modular's stack runs AI models across Nvidia, AMD, Intel, and Apple Silicon without hardware-specific rewrites, directly attacking the developer lock-in that makes CUDA sticky.
Worth knowing (4)
Anthropic Accuses Alibaba of Largest Known Claude Distillation Attack: 28.8M Conversations
Anthropic · In a letter to the US Senate Banking Committee disclosed on June 24, Anthropic accused Alibaba's Qwen lab of conducting the largest known distillation attack against Claude: 28.8 million conversation exchanges via nearly 25,000 fraudulent accounts between April 22 and June 5, 2026. The campaign targeted Claude's software engineering and agentic reasoning capabilities. Anthropic had previously identified similar campaigns attributed to DeepSeek (150K interactions), Moonshot AI (3.4M), and MiniMax (13M).
Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence
A comprehensive survey of code intelligence systems that go beyond natural-language-only inputs, covering how LLMs process visual artifacts — screenshots, charts, vector drawings, interactive UI states — to generate executable code. The paper maps four domains: graphical user interfaces, scientific visualization, structured graphics, and emerging agent frameworks, and argues future progress requires multi-signal validation and agent transparency.
Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
Meta · An empirical study showing that post-training quantization of reasoning models paradoxically increases chain-of-thought length while reducing accuracy. In up to 52% of failures, quantized models reach the correct intermediate answer but then fail to select it, because high-entropy token positions cause them to oversample 'overthinking' markers like 'wait', 'but', 'alternatively'. A training-free logit penalty on these markers reduces reasoning length by 12–23% while maintaining or improving accuracy across 5 models (1.5B–32B), 3 quantization methods, and 5 benchmarks.
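To make the idea of a training-free logit penalty concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the marker set, the penalty value of 2.0, the toy vocabulary, and the function names are all assumptions chosen for illustration, and the digest does not say whether the paper applies the penalty at every step or only at high-entropy positions.

```python
import math

# Hypothetical marker set; the digest names these three only as examples.
OVERTHINKING_MARKERS = {"wait", "but", "alternatively"}

def penalize_markers(logits, vocab, penalty=2.0):
    """Return a copy of `logits` with `penalty` subtracted at marker tokens.

    `penalty` is an assumed value, not one reported in the paper.
    """
    return [
        logit - penalty if token.strip().lower() in OVERTHINKING_MARKERS else logit
        for logit, token in zip(logits, vocab)
    ]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token distribution over a five-token vocabulary.
vocab = ["the", "wait", "so", "but", "answer"]
logits = [1.0, 2.0, 0.5, 1.5, 1.0]

before = softmax(logits)
after = softmax(penalize_markers(logits, vocab))

for token, p_before, p_after in zip(vocab, before, after):
    print(f"{token:>8}: {p_before:.3f} -> {p_after:.3f}")
```

Because the adjustment happens on the logits at decode time, no weights change; the marker tokens simply become less likely to be sampled, and the remaining probability mass shifts to the other tokens.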
Gemini 3.5 Flash Gains Native Computer Use as Built-in Tool
Google DeepMind · Google announced on June 24 that computer use is now a native built-in tool in Gemini 3.5 Flash, available via the Gemini API and Gemini Enterprise Agent Platform. Previously available only as a standalone specialist model, the capability now lets agents see, click, type, and scroll across browser, mobile, and desktop environments. Targeted adversarial training mitigates prompt injection risks. The announcement also cites improved OSWorld benchmark performance versus prior implementations.
For reference (8)
Are We Ready For an Agent-Native Memory System? SJTU Benchmarks 12 Architectures
A systematic evaluation of AI agent memory through a data-management lens from SJTU and Tsinghua. The paper proposes a framework decomposing agent memory into four modules — representation and storage, extraction, retrieval and routing, and maintenance — then benchmarks 12 existing memory systems. Key finding: no single architecture performs optimally across all workloads; localized maintenance is more cost-efficient than full reorganization.
Wan-Streamer v0.1: End-to-End Real-Time Interactive Foundation Model Under 550ms Latency
Wan-AI · A unified foundation model for real-time multimodal interaction handling language, audio, and video in a single Transformer with block-causal attention. Unlike pipeline systems chaining separate ASR, reasoning, and TTS modules, Wan-Streamer jointly learns perception, reasoning, and generation, achieving ~200ms model-side latency and 550ms total interaction latency, with streaming units as short as 160ms at 25 fps. Currently at 192p resolution as proof of concept.
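The latency figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the 550ms total includes the ~200ms model-side latency, which the digest does not state explicitly; the variable names are illustrative only.

```python
FPS = 25                 # frame rate quoted in the digest
UNIT_MS = 160            # shortest streaming unit quoted in the digest
MODEL_LATENCY_MS = 200   # approximate model-side latency
TOTAL_LATENCY_MS = 550   # total interaction latency

frame_ms = 1000 / FPS                    # duration of one video frame
frames_per_unit = UNIT_MS / frame_ms     # frames carried by one streaming unit
# Assumption: the total includes the model-side share.
non_model_ms = TOTAL_LATENCY_MS - MODEL_LATENCY_MS

print(f"one frame lasts {frame_ms:.0f} ms")
print(f"one streaming unit spans {frames_per_unit:.0f} frames")
print(f"about {non_model_ms} ms remain outside the model")
```

Under that assumption, a 160ms unit at 25 fps covers four frames, and roughly 350ms of the end-to-end budget is spent outside the model (capture, transport, rendering, or similar).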
DomainShuttle: Subject-Driven Text-to-Video Across In-Domain and Cross-Domain Scenarios
A text-to-video system for subject-driven synthesis across two scenarios: in-domain (preserving reference subject features precisely) and cross-domain (flexible variation while retaining identity). Introduces Domain-MoT (domain-aware adaptive layer normalization), Video-Reference DualRoPE (separate rotary position encoding for reference and video tokens), and Cross-Pair Consistent Loss. Ranked third on HF Daily Papers for June 25 (34 upvotes).
GitHub Copilot Removes Manual Model Selection from Free and Student Plans
GitHub / Microsoft · Effective June 24, GitHub made Copilot auto model selection the default and only option for Free and Student plan users. The Auto system dynamically routes each request to the best available model across OpenAI, Anthropic, and Google families, within plan restrictions. GitHub simultaneously retired the (Preview) label from all Microsoft-released models.
Claude Code v2.1.191: /rewind Command, 37% CPU Reduction, MCP Retry Logic
Anthropic · Claude Code v2.1.191 (June 24) adds /rewind to resume conversations from before a /clear was run, cuts CPU usage during streaming by ~37% through text-update coalescing, adds MCP server retry logic for transient network errors, and reduces memory growth in long sessions. The prior v2.1.187 (June 23) had added two features: sandbox.credentials, which blocks sandboxed commands from reading secret files, and org-configured model restrictions in the model picker.
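For readers unfamiliar with the term, text-update coalescing generally means merging many small streamed chunks into fewer display updates. The sketch below shows that general idea only; it is not Claude Code's implementation, and the function name, the 50ms interval, and the timestamped-chunk input format are all hypothetical.

```python
def coalesce_updates(chunks, min_interval_ms=50):
    """Merge streamed text chunks into fewer render updates.

    `chunks` is an iterable of (timestamp_ms, text) pairs in arrival order.
    A flush is emitted for the first chunk and then at most once per
    `min_interval_ms`; leftover text is flushed at the end.
    """
    flushes = []
    buffer = []
    last_flush = None
    last_ts = None
    for ts, text in chunks:
        buffer.append(text)
        last_ts = ts
        if last_flush is None or ts - last_flush >= min_interval_ms:
            flushes.append((ts, "".join(buffer)))
            buffer.clear()
            last_flush = ts
    if buffer:
        # Flush whatever arrived after the last scheduled update.
        flushes.append((last_ts, "".join(buffer)))
    return flushes

stream = [(0, "He"), (10, "llo"), (20, ", "), (60, "wor"), (70, "ld")]
updates = coalesce_updates(stream)
print(len(stream), "chunks ->", len(updates), "updates")
print("".join(text for _, text in updates))
```

Here five incoming chunks collapse into three render updates while the final text is unchanged, which is the kind of saving that reduces per-update work during streaming.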
OpenCode v1.17.10: MCP Server Instructions in Context, --mini CLI Mode
SST · OpenCode v1.17.10 (June 24) ships MCP server instructions integrated directly into session context, a new --mini CLI mode for lightweight invocation, MCP resource template listing and read tools, opencode-managed provider integration support, and fixed MCP OAuth callbacks for local authentication.
OpenAI Codex CLI v0.142.1: Opt-in Windows System Proxy Support
OpenAI · Codex CLI v0.142.1 (June 25, stable) adds opt-in Windows system proxy support covering PAC, WPAD, static proxies, and bypass rules. The 0.143.0-alpha series continued with 9+ pre-release builds across June 23–25, suggesting a larger feature update is in progress.
Google Brings Veo 3.1 Audio to All Flow Editing Tools, Adds Insert and Remove
Google DeepMind · On June 22, Google extended Veo 3.1's audio generation to existing Flow creation features (Ingredients to Video, Frames to Video, and Extend) that previously produced silent output. Two new precision editing tools were also added: Insert (adding elements to a scene with matched lighting) and Remove (deleting objects with automatic background reconstruction). Available in Gemini API, Vertex AI, the Gemini app, and Flow.