SubtleMemory: Benchmark Reveals Agents Systematically Fail Fine-Grained Relational Memory
SubtleMemory is a 1,522-instance benchmark that tests whether AI agents can handle memories that reinforce, diverge from, or contradict each other, rather than perform simple recall. It is built over 10 long histories grounded in 1,090 relation-controlled memory-variant sets, and the authors use it to evaluate 11 memory systems. Every tested system fails systematically at fine-grained relational memory discrimination, with distinct failure modes at the preservation, retrieval, and downstream reasoning stages.
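To make the recall-versus-relation distinction concrete, here is a minimal Python sketch. It is not the benchmark's actual data format or scoring code: the `Relation` enum, the `MemoryVariantSet` class, both scoring functions, and the toy memories are all hypothetical names and data invented for illustration. The sketch only shows how a system can retrieve every related memory and still misjudge how those memories relate.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # The three relation types named in the summary; the encoding is assumed.
    REINFORCE = "reinforce"
    DIVERGE = "diverge"
    CONTRADICT = "contradict"


@dataclass(frozen=True)
class MemoryVariantSet:
    # Hypothetical container: a group of memories plus how they relate.
    relation: Relation
    variants: tuple[str, ...]


def recall_only_score(retrieved: set[str], vset: MemoryVariantSet) -> float:
    """Fraction of variants retrieved; ignores the relation entirely."""
    hits = sum(1 for v in vset.variants if v in retrieved)
    return hits / len(vset.variants)


def relation_aware_correct(
    retrieved: set[str], predicted: Relation, vset: MemoryVariantSet
) -> bool:
    """Correct only if every variant is retrieved AND the relation is right."""
    return recall_only_score(retrieved, vset) == 1.0 and predicted is vset.relation


# Toy data, invented for this example (not taken from the benchmark).
toy = MemoryVariantSet(
    relation=Relation.CONTRADICT,
    variants=("User is vegetarian.", "User ordered a steak last week."),
)
retrieved = set(toy.variants)  # the system found both memories

recall = recall_only_score(retrieved, toy)
naive = relation_aware_correct(retrieved, Relation.REINFORCE, toy)
careful = relation_aware_correct(retrieved, Relation.CONTRADICT, toy)
print(recall, naive, careful)
```

In this toy case a recall-style metric gives full credit, while the relation-aware check fails the system that treats the two memories as mutually reinforcing.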
Why it matters
Existing agent memory benchmarks measure recall, not relational reasoning over conflicting memories. SubtleMemory exposes this blind spot in all 11 systems it tests, motivating a new generation of memory architectures for long-horizon agents.
Importance: 2/5
Official arXiv + HuggingFace paper page; systematic evaluation of 11 memory systems revealing a shared fundamental weakness.