MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Images

Technion


MulTaBench introduces 40 datasets (20 image-tabular, 20 text-tabular), the largest image-tabular benchmarking effort to date. The benchmark shows that current tabular foundation models rely on frozen multimodal embeddings, and that task-specific tuning of the encoders substantially improves performance across both text and image modalities and across multiple encoder scales.
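The frozen-embedding vs. task-specific-tuning distinction can be sketched in a few lines. This is an illustrative toy setup, not the benchmark's actual pipeline: the encoder here is a hypothetical stand-in for a pretrained image or text encoder, and the dimensions are arbitrary.

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained image/text foundation-model encoder
# (the real benchmark evaluates actual encoders at multiple scales).
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
# Fusion head: concatenates the 16-dim embedding with 8 numeric tabular features.
head = nn.Linear(16 + 8, 2)

def trainable_params(*modules):
    """Count parameters that the optimizer would actually update."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Frozen-embedding setup: the encoder is fixed, only the fusion head trains.
for p in encoder.parameters():
    p.requires_grad_(False)
frozen_count = trainable_params(encoder, head)

# Task-specific tuning: unfreeze the encoder so it adapts to the downstream task.
for p in encoder.parameters():
    p.requires_grad_(True)
tuned_count = trainable_params(encoder, head)

print(frozen_count, tuned_count)  # far fewer trainable parameters when frozen
```

The benchmark's finding is that the second regime, updating the encoder itself, substantially outperforms the first on multimodal tabular tasks.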

Why it matters

Real-world tabular data routinely includes images and free text alongside numeric columns, yet existing benchmarks largely ignore these modalities. MulTaBench exposes a concrete weakness in current tabular foundation models. 122 upvotes on HF Daily (May 14).

Importance: 4/5

122 HF Daily upvotes (+1 bump); fills a recognized gap in tabular ML benchmarking
