#benchmark
- Microsoft Build 2026: MAI Model Family Launched to Power GitHub Copilot Without OpenAI Dependency Microsoft models-llm
- xAI Releases Grok 4.3 with 1M Context, 40-60% Price Cuts, and Agentic Benchmark Gains xAI models-llm
- SenseNova-U1: Open-Source Unified Multimodal Understanding and Generation via NEO-unify SenseTime research
- MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Images Technion research
- EVA-Bench: End-to-End Framework for Evaluating Voice Agents ServiceNow AI research
- ExploitBench: Claude Mythos Preview and GPT-5.5 Develop Real Browser Exploits Autonomously Anthropic research
- VibeThinker-3B Reaches Frontier-Level Reasoning Benchmarks via Curriculum RL WeiboAI research
- Google DeepMind's AI Co-Mathematician Reaches 48% on FrontierMath Tier 4 Google DeepMind research
- Baidu Releases ERNIE 5.1 at 6% of Industry Pre-Training Cost, Enters Global Top-10 Search Baidu models-llm
- RubricEM: Meta-RL with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards Google research
- SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems research
- ENPIRE: AI Coding Agents Close the Loop on Physical Robotics Research Without Human Intervention NVIDIA / Carnegie Mellon University / UC Berkeley research
- MaxProof: MiniMax Model Exceeds IMO and USAMO Gold-Medal Thresholds on Formal Math MiniMax research
- AI Co-Mathematician: Google DeepMind Achieves 48% on FrontierMath Tier 4 Google DeepMind research
- MemLens: Benchmark for Multimodal Long-Term Memory in Vision-Language Models NVIDIA research
- Judge Circuits: Mechanistic Explanation of LLM-as-Judge Format Inconsistency research
- CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes) Peking University / Shanghai Artificial Intelligence Laboratory research
- MMSkills: Reusable Multimodal Skills for General Visual Agents (105 HF upvotes) Shanghai Jiao Tong University research
- Crafter: Multi-Agent Harness for Editable Scientific Figure Generation Scores +16pt Over Baselines (103 HF upvotes) Tsinghua University research
- EvoArena: LLM Agents Score Only 40% on Dynamic Evolving Environments MIT / NUS / Salesforce research
- WeaveBench: Computer-Use Agents Fail at Hybrid GUI+CLI Tasks — 41% Pass Rate Microsoft Research research
- Anthropic Study: Domain Expertise Drives Agentic Coding Success, Not Programming Background Anthropic research
- Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs OpenDataLab research
- AutoResearchBench: A Benchmark for Autonomous Scientific Literature Search by AI Agents BAAI research
- GigaChat Passes Engineering Certification at Moscow Power Engineering Institute Sber industry
- Soohak: 64 Mathematicians Build Research-Level Benchmark That Stumps Frontier LLMs Seoul National University research
- EvoArena: LLM Agents Score Only 39.6% on Dynamic Evolving Environments Benchmark MIT research
- Executable World Models for ARC-AGI-3: Coding-Agent Approach Without Game-Specific Logic research
- Learning, Fast and Slow: Dual-Weight Architecture for Continual LLM Adaptation research
- SubtleMemory: Benchmark Reveals Agents Systematically Fail Fine-Grained Relational Memory research
- VideoKR: 315K-Example Training Corpus for Knowledge- and Reasoning-Intensive Video Understanding Yale University research
- SWE-Explore: Benchmarking Repository Exploration as the Binding Constraint in Coding Agents Shanghai Jiao Tong University research
- StylisticBias: 15 Visual Attributes Account for 80% of Social Bias in Multimodal LLMs research