Beyond Binary Priorities: Multi-Tier SLA Scheduling for LLM Serving

This project extends Llumnix (Sun et al., OSDI 2024), a dynamic scheduling system from Alibaba Group for LLM inference, with multi-priority SLA support. The goal is to enable fine-grained service-level differentiation across tenants in a shared inference cluster, while preserving low tail latency and high GPU utilization.

We reimplemented the relevant scheduling architecture in Microsoft Research's Vidur simulator and evaluated latency, throughput, and cost efficiency through large-scale simulations.

In simulation, high-priority tenants see up to roughly 3× lower P99 latency, with diminishing returns beyond four priority tiers.
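To make the idea of multi-tier priorities and the P99 metric concrete, here is a minimal sketch of a strict-priority queue with FIFO ordering inside each tier, plus a nearest-rank P99 helper. It is an illustration only: `TieredQueue`, `Request`, `p99`, the tier numbering (0 is highest), and the tenant names are all hypothetical, and none of this is the Llumnix or Vidur API or the scheduling policy evaluated in the project.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Request:
    tier: int                         # 0 = highest priority (assumed convention)
    seq: int                          # arrival order; keeps FIFO within a tier
    name: str = field(compare=False)  # excluded from ordering


class TieredQueue:
    """Strict-priority queue: lower tier first, FIFO inside a tier."""

    def __init__(self, num_tiers: int) -> None:
        self.num_tiers = num_tiers
        self._heap: List[Request] = []
        self._counter = itertools.count()

    def submit(self, name: str, tier: int) -> None:
        if not 0 <= tier < self.num_tiers:
            raise ValueError(f"tier must be in [0, {self.num_tiers})")
        heapq.heappush(self._heap, Request(tier, next(self._counter), name))

    def next_request(self) -> Optional[str]:
        # Pop the request with the smallest (tier, seq) pair, if any.
        return heapq.heappop(self._heap).name if self._heap else None


def p99(latencies: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    if not latencies:
        raise ValueError("latencies must be non-empty")
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


# Four tiers, chosen here only to mirror the tier count mentioned above.
queue = TieredQueue(num_tiers=4)
queue.submit("batch-job", tier=3)
queue.submit("chat-a", tier=0)
queue.submit("chat-b", tier=0)
queue.submit("standard", tier=1)
order = [queue.next_request() for _ in range(4)]
print(order)
```

Strict priority is the simplest possible policy and can starve low tiers under sustained load; a real SLA-aware scheduler would typically add aging, per-tier budgets, or migration, which this sketch deliberately leaves out.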

Affiliation

UC Berkeley

Partners

Report

Manuscript

Keywords

Priority Scheduling
SLO-aware Scheduling
KV-Cache Management
Live Migration
Tail Latency Optimization
Discrete-Event Simulation
Python
Vidur
LLM Serving
Slurm
Weights & Biases

Deep Dive

Under development.