ELDR: Expert-Locality-Aware Routing Cuts MoE Serving Latency by up to 14%
Microsoft Research
Microsoft Research introduces ELDR, a routing system for prefill-decode disaggregated serving of MoE models. During prefill, it builds an expert signature for each request. Offline K-means clustering and online locality-band routing then steer decode requests so as to minimize distinct expert weight loads across workers. Evaluated on up to 40 GPUs with three MoE models, ELDR achieves a 5.9–13.9% improvement in median time-per-output-token over load-balancing baselines.
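To make the pipeline concrete, here is a minimal Python sketch of the three stages named above. It is an illustration under stated assumptions, not ELDR's actual implementation: the signature is assumed to be a normalized histogram of experts activated during prefill, each decode worker is assumed to own exactly one K-means centroid, and the "locality band" is interpreted as "among workers whose load is within `band` requests of the least-loaded worker, pick the one with the closest centroid". All function names and the `band` parameter are hypothetical.

```python
from collections import Counter

def expert_signature(prefill_expert_ids, num_experts):
    """Assumed signature: normalized histogram of experts hit during prefill."""
    counts = Counter(prefill_expert_ids)
    total = sum(counts.values()) or 1
    return [counts.get(e, 0) / total for e in range(num_experts)]

def sq_dist(a, b):
    """Squared Euclidean distance between two signatures."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(signatures, k, iters=20):
    """Plain Lloyd's K-means, seeded with the first k signatures (deterministic)."""
    centroids = [list(s) for s in signatures[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for s in signatures:
            nearest = min(range(k), key=lambda c: sq_dist(s, centroids[c]))
            groups[nearest].append(s)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def route(signature, centroids, loads, band=2):
    """Assumed locality-band rule: worker w owns centroids[w]; choose the
    closest centroid among workers within `band` of the minimum load."""
    min_load = min(loads)
    eligible = [w for w in range(len(loads)) if loads[w] - min_load <= band]
    return min(eligible, key=lambda w: sq_dist(signature, centroids[w]))

# Offline: cluster historical signatures (4 experts, 2 decode workers).
history = [
    expert_signature([0, 0, 1, 0], 4),
    expert_signature([2, 3, 3, 2], 4),
    expert_signature([0, 1, 1, 0], 4),
    expert_signature([3, 3, 2, 3], 4),
]
centroids = kmeans(history, k=2)

# Online: a request that mostly used experts 2 and 3 during prefill.
new_sig = expert_signature([3, 2, 3], 4)
balanced_choice = route(new_sig, centroids, loads=[1, 2])  # locality decides
skewed_choice = route(new_sig, centroids, loads=[0, 5])    # band excludes worker 1
```

With near-equal loads, the request goes to the worker whose cluster matches its experts. When that worker falls outside the load band, the sketch falls back to the less-loaded worker, which shows the trade-off between locality and load balance that such a band is meant to control.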
Why it matters
MoE models are increasingly dominant in production, but serving them efficiently at disaggregated scale remains unsolved. ELDR's gains come purely from routing policy, with no model changes required, making it drop-in deployable in existing MoE serving stacks.
Importance: 2/5
Drop-in routing optimization for MoE serving; 5.9–13.9% latency improvement; 21 HF Daily Papers upvotes