MemLens: Benchmark for Multimodal Long-Term Memory in Vision-Language Models

NVIDIA


MemLens (arXiv 2605.14906, 62 HF Daily upvotes) evaluates long-term multimodal memory in vision-language models through 789 questions spanning five memory capabilities and four context lengths, testing 27 models and 7 memory-augmented agents. Key finding: long-context LVLMs succeed via direct visual grounding in short contexts but degrade sharply as conversations grow, while memory-augmented agents remain stable across lengths yet lose visual fidelity. Multi-session reasoning challenges virtually all tested systems.

Why it matters

As multimodal agents are deployed in long-horizon settings (customer service, tutoring, embodied robots), memory limitations become critical. MemLens provides the first systematic evaluation across multiple memory types and context lengths, revealing a clear gap that motivates hybrid architectures combining long-context attention with structured retrieval.

Importance: 3/5

62 HF Daily upvotes; first systematic multimodal long-term memory evaluation across 27 models; reveals sharp degradation in long conversations
