Sep 2025 – Dec 2025

AI · SWE

Beyond Binary Priorities: Multi-Tier SLA Scheduling for LLM Serving

This project extends Llumnix (Sun et al., OSDI 2024), a dynamic scheduling system from Alibaba Group for LLM inference, with multi-priority SLA support. The goal is to enable fine-grained service-level differentiation across tenants in a shared inference cluster, while preserving low tail latency and high GPU utilization.

We reimplemented the relevant scheduling architecture in Microsoft Research's Vidur simulator and evaluated latency, throughput, and cost efficiency through large-scale simulations.

The result: up to a ~3× reduction in P99 latency for high-priority tenants, with diminishing returns beyond four priority tiers.

Affiliation

UC Berkeley

Report

  • Manuscript

Keywords

  • Priority Scheduling
  • SLO-aware Scheduling
  • KV-Cache Management
  • Live Migration
  • Tail Latency Optimization
  • Discrete-Event Simulation
  • Python
  • Vidur
  • LLM Serving
  • Slurm
  • Weights & Biases

Deepdive

Under development.