Sep 2025 – Dec 2025
AI · SWE
Beyond Binary Priorities: Multi-Tier SLA Scheduling for LLM Serving
This project extends Llumnix (Sun et al., OSDI 2024), a dynamic scheduling system from Alibaba Group for LLM inference, with multi-priority SLA support. The goal is to enable fine-grained service-level differentiation across tenants in a shared inference cluster, while preserving low tail latency and high GPU utilization.
We reimplemented the relevant scheduling architecture in Microsoft Research's Vidur simulator and evaluated latency, throughput, and cost efficiency through large-scale simulations.
The result: P99 latency for high-priority tenants improves by up to about 3×, with diminishing returns beyond four priority tiers.
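To make the idea of multi-tier priorities concrete, here is a minimal Python sketch of a strict-priority request queue with FIFO ordering inside each tier. It is an illustration only, not the project's scheduler and not Llumnix's or Vidur's API: the names `Request`, `TieredQueue`, `submit`, and `next_batch` are hypothetical, the convention that tier 0 is the highest priority is an assumption, and the default of four tiers is chosen only to echo the result above.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Hypothetical request record; ordering compares (tier, seq) only.
    tier: int   # assumed convention: 0 is the highest-priority tier
    seq: int    # arrival counter, keeps FIFO order within a tier
    request_id: str = field(compare=False)


class TieredQueue:
    """Strict-priority queue over a fixed number of SLA tiers (illustrative)."""

    def __init__(self, num_tiers: int = 4):
        self.num_tiers = num_tiers
        self._heap: list[Request] = []
        self._counter = itertools.count()

    def submit(self, request_id: str, tier: int) -> None:
        # Reject tiers outside the configured range.
        if not 0 <= tier < self.num_tiers:
            raise ValueError(f"tier must be in [0, {self.num_tiers})")
        heapq.heappush(self._heap, Request(tier, next(self._counter), request_id))

    def next_batch(self, max_size: int) -> list[str]:
        # Pop up to max_size requests, highest-priority tier first.
        batch = []
        while self._heap and len(batch) < max_size:
            batch.append(heapq.heappop(self._heap).request_id)
        return batch


queue = TieredQueue(num_tiers=4)
queue.submit("a", tier=2)
queue.submit("b", tier=0)
queue.submit("c", tier=2)
queue.submit("d", tier=1)
first = queue.next_batch(max_size=3)
second = queue.next_batch(max_size=3)
```

Strict ordering like this can starve low tiers under sustained high-priority load, so a production scheduler would typically add a safeguard such as aging or per-tier quotas.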
Affiliation
UC Berkeley
Partners
Report
- Manuscript
Keywords
- Priority Scheduling
- SLO-aware Scheduling
- KV-Cache Management
- Live Migration
- Tail Latency Optimization
- Discrete-Event Simulation
- Python
- Vidur
- LLM Serving
- Slurm
- Weights & Biases
▸ Deepdive
Under development.