benchmark — AI Digest

2 Jun Microsoft Build 2026: MAI Model Family Launched to Power GitHub Copilot Without OpenAI Dependency Microsoft models-llm
7 May xAI Releases Grok 4.3 with 1M Context, 40-60% Price Cuts, and Agentic Benchmark Gains xAI models-llm
13 May SenseNova-U1: Open-Source Unified Multimodal Understanding and Generation via NEO-unify SenseTime research
15 May MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Images Technion research
15 May EVA-Bench: End-to-End Framework for Evaluating Voice Agents ServiceNow AI research
18 May ExploitBench: Claude Mythos Preview and GPT-5.5 Develop Real Browser Exploits Autonomously Anthropic research
17 Jun VibeThinker-3B Reaches Frontier-Level Reasoning Benchmarks via Curriculum RL WeiboAI research
10 May Google DeepMind's AI Co-Mathematician Reaches 48% on FrontierMath Tier 4 Google DeepMind research
13 May Baidu Releases ERNIE 5.1 at 6% of Industry Pre-Training Cost, Enters Global Top-10 Search Baidu models-llm
13 May RubricEM: Meta-RL with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards Google research
18 May SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems research
19 Jun ENPIRE: AI Coding Agents Close the Loop on Physical Robotics Research Without Human Intervention NVIDIA / Carnegie Mellon University / UC Berkeley research
14 Jun MaxProof: MiniMax Model Exceeds IMO and USAMO Gold-Medal Thresholds on Formal Math MiniMax research
8 May AI Co-Mathematician: Google DeepMind Achieves 48% on FrontierMath Tier 4 Google DeepMind research
16 May MemLens: Benchmark for Multimodal Long-Term Memory in Vision-Language Models NVIDIA research
18 May Judge Circuits: Mechanistic Explanation of LLM-as-Judge Format Inconsistency research
19 May CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes) Peking University / Shanghai Artificial Intelligence Laboratory research
19 May MMSkills: Reusable Multimodal Skills for General Visual Agents (105 HF upvotes) Shanghai Jiao Tong University research
2 Jun Crafter: Multi-Agent Harness for Editable Scientific Figure Generation Scores +16pt Over Baselines (103 HF Upvotes) Tsinghua University research
14 Jun EvoArena: LLM Agents Score Only 40% on Dynamic Evolving Environments MIT / NUS / Salesforce research
14 Jun WeaveBench: Computer-Use Agents Fail at Hybrid GUI+CLI Tasks (41% Pass Rate) Microsoft Research research
17 Jun Anthropic Study: Domain Expertise Drives Agentic Coding Success, Not Programming Background Anthropic research
30 Apr Programming with Data: test-driven data engineering for self-improving LLMs OpenDataLab research
1 May AutoResearchBench: a benchmark for autonomous scientific literature search by AI agents BAAI research
8 May GigaChat Passes Engineering Certification at Moscow Power Engineering Institute Sber industry
11 May Soohak: 64 Mathematicians Build Research-Level Benchmark That Stumps Frontier LLMs Seoul National University research
12 Jun EvoArena: LLM Agents Score Only 39.6% on Dynamic Evolving Environments Benchmark MIT research
7 May Executable World Models for ARC-AGI-3: Coding-Agent Approach Without Game-Specific Logic research
13 May Learning, Fast and Slow: Dual-Weight Architecture for Continual LLM Adaptation research
8 Jun SubtleMemory: Benchmark Reveals Agents Systematically Fail Fine-Grained Relational Memory research
8 Jun VideoKR: 315K-Example Training Corpus for Knowledge- and Reasoning-Intensive Video Understanding Yale University research
9 Jun SWE-Explore: Benchmarking Repository Exploration as the Binding Constraint in Coding Agents Shanghai Jiao Tong University research
19 Jun StylisticBias: 15 Visual Attributes Account for 80% of Social Bias in Multimodal LLMs research