ELDR: Expert-Locality-Aware Routing Cuts MoE Serving Latency by up to 14%

Microsoft Research


Microsoft Research introduces ELDR, a routing system for prefill-decode (PD) disaggregated serving of Mixture-of-Experts (MoE) models. During prefill, it builds an expert signature for each request. For decode, it combines offline K-means clustering with online locality-band routing to minimize distinct expert weight loads across workers. Tested on up to 40 GPUs with three MoE models, ELDR improves median time per output token by 5.9–13.9% over load-balancing baselines.
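The routing idea above can be illustrated with a minimal sketch. It is not the paper's implementation: the signature definition (a normalized histogram of activated experts), the one-cluster-per-worker mapping, and the reading of "locality band" as "workers whose load is within `band` requests of the least-loaded worker" are all assumptions made for illustration, and every function and parameter name is hypothetical.

```python
from collections import Counter
from math import dist

def expert_signature(expert_ids, num_experts):
    """Assumed signature: normalized histogram of experts hit during prefill."""
    counts = Counter(expert_ids)
    total = len(expert_ids) or 1  # avoid division by zero for empty input
    return [counts.get(e, 0) / total for e in range(num_experts)]

def nearest_cluster(signature, centroids):
    """Index of the closest centroid; centroids are assumed to come from offline K-means."""
    return min(range(len(centroids)), key=lambda k: dist(signature, centroids[k]))

def route(signature, centroids, worker_cluster, worker_load, band=2):
    """Pick a decode worker, preferring expert locality within a load band (assumed policy)."""
    cluster = nearest_cluster(signature, centroids)
    min_load = min(worker_load)
    # Workers whose load is within `band` requests of the least-loaded worker.
    eligible = [w for w, load in enumerate(worker_load) if load <= min_load + band]
    # Among eligible workers, prefer those already serving this request's cluster.
    local = [w for w in eligible if worker_cluster[w] == cluster]
    candidates = local or eligible  # fall back to plain load balancing
    return min(candidates, key=lambda w: worker_load[w])

# Hypothetical setup: 4 experts, 2 clusters, 2 decode workers.
centroids = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
sig = expert_signature([0, 1, 1, 0], num_experts=4)
chosen_wide = route(sig, centroids, worker_cluster=[0, 1], worker_load=[3, 2], band=2)
chosen_tight = route(sig, centroids, worker_cluster=[0, 1], worker_load=[3, 2], band=0)
```

In this toy setup, a wider band lets the request go to the slightly busier worker that already holds its cluster's experts, while a band of zero degenerates to pure load balancing. The band width is the knob that trades locality against balance under these assumptions.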

Why it matters

MoE models are increasingly common in production, but serving them efficiently in disaggregated deployments remains an open problem. ELDR's gains come purely from routing policy, with no model changes required, which should make it close to drop-in for existing PD-disaggregated MoE serving stacks.

Importance: 2/5

Drop-in routing optimization for MoE serving; 5.9–13.9% median time-per-output-token improvement; 21 HF Daily Papers upvotes

inference moe efficiency serving infrastructure

Sources

official ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving — arxiv