Projects

Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems

CPU scheduling overhead can dominate LLM inference time, accounting for up to 50% of it in systems like vLLM. As model forwarding gets faster and schedulers take on more tasks, scheduling overhead can no longer be ignored.
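
A minimal sketch of how such a measurement might look, not vLLM's actual code: each serving iteration is timed separately for its CPU-side scheduling step and its GPU forward pass, and the stand-in callables below are hypothetical placeholders for a real engine's scheduler and model executor.

```python
# Minimal sketch (not vLLM's code): split one serving iteration into its
# CPU-side scheduling time and its model-forwarding time.
import time


def run_iteration(schedule_step, forward_step):
    """Time one iteration; both arguments are hypothetical callables."""
    t0 = time.perf_counter()
    batch = schedule_step()          # CPU work: batching, memory bookkeeping, etc.
    t1 = time.perf_counter()
    forward_step(batch)              # GPU work: the model forward pass
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1          # (scheduling time, forwarding time)


if __name__ == "__main__":
    # Stand-in workloads; in a real system these would be the engine's
    # scheduler and model executor.
    sched_s, fwd_s = run_iteration(lambda: list(range(10_000)),
                                   lambda batch: sum(batch))
    total = sched_s + fwd_s
    print(f"scheduling: {100 * sched_s / total:.1f}% of iteration time")
```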

Learn more →
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

MuxServe is a serving system that uses flexible spatial-temporal multiplexing, leveraging dynamic LLM popularity and unbalanced resource utilization to achieve high GPU utilization and reduce serving costs. It outperforms baselines by up to 1.8x in throughput and 2.9x in SLO attainment on synthetic workloads.
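
To illustrate the intuition only, not MuxServe's actual algorithm: co-located models can share a GPU's compute in proportion to their observed request rates, so popular models get more resources instead of every model idling on a dedicated GPU. The model names and rates below are made up.

```python
# Illustrative sketch (not MuxServe's placement algorithm): split a GPU's
# compute quota across co-located LLMs in proportion to their request rates.
def assign_quotas(request_rates, total_quota=100):
    """request_rates: hypothetical {model_name: requests/sec} measurements."""
    total_rate = sum(request_rates.values())
    return {
        model: round(total_quota * rate / total_rate)
        for model, rate in request_rates.items()
    }


if __name__ == "__main__":
    rates = {"llama-7b": 40.0, "llama-13b": 8.0, "mistral-7b": 2.0}
    print(assign_quotas(rates))  # {'llama-7b': 80, 'llama-13b': 16, 'mistral-7b': 4}
```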

Learn more →
Preble: Efficient Prompt Scheduling for Augmented Large Language Models

LLM prompts are getting longer and are increasingly shared across requests through agent instructions, tool definitions, documents, and so on. We introduce Preble, the first distributed LLM serving system targeting long and shared prompts. Preble reduces latency by 1.5x-14.5x over SOTA serving systems.
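
A hedged sketch of the underlying idea, not Preble's scheduler: route each request to the worker whose cached prompts share the longest prefix with it, so shared prompt prefixes (system prompts, documents, tool specs) are reused rather than recomputed, falling back to the least-loaded worker when nothing matches. The worker-state layout here is a made-up stand-in.

```python
# Sketch only: prefix-aware request routing with a load-balancing fallback.
import os


def shared_prefix_len(a: str, b: str) -> int:
    return len(os.path.commonprefix([a, b]))


def route(prompt: str, workers: dict) -> str:
    """workers: hypothetical {worker_id: {"cached": [prompts], "load": int}}."""
    best_id, best_match = None, 0
    for wid, state in workers.items():
        match = max((shared_prefix_len(prompt, p) for p in state["cached"]), default=0)
        if match > best_match:
            best_id, best_match = wid, match
    if best_id is None:  # no shared prefix anywhere: balance load instead
        best_id = min(workers, key=lambda wid: workers[wid]["load"])
    return best_id


if __name__ == "__main__":
    workers = {
        "gpu0": {"cached": ["You are a helpful agent. Tools: search, code."], "load": 3},
        "gpu1": {"cached": ["Translate the following text."], "load": 1},
    }
    print(route("You are a helpful agent. Tools: search, code. Query: ...", workers))
```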

Learn more →
Efficient Augmented LLM Serving With InferCept

Today, LLMs are constantly being augmented with tools, agents, models, RAG, etc. We built InferCept [ICML'24], the first serving framework designed for augmented LLMs. InferCept sustains a 1.6x-2x higher serving load than SOTA LLM serving systems.
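
A purely illustrative sketch of the kind of decision an augmented-LLM server faces, not InferCept's actual policy or cost model: when a request pauses for an external call, the server can keep its KV cache on the GPU, swap it to host memory, or discard and recompute it on resume, whichever is cheapest for that pause. All numbers and cost formulas below are hypothetical.

```python
# Illustrative sketch (not InferCept's policy): pick the cheapest way to
# handle a paused request's KV cache during an external tool/model call.
def handle_interception(pause_s, kv_bytes, swap_bw, recompute_s, gpu_cost_per_byte_s):
    keep_cost = kv_bytes * pause_s * gpu_cost_per_byte_s   # GPU memory held idle
    swap_cost = 2 * kv_bytes / swap_bw                     # copy out + copy back
    discard_cost = recompute_s                             # rerun prefill on resume
    return min(
        [("keep", keep_cost), ("swap", swap_cost), ("discard", discard_cost)],
        key=lambda option: option[1],
    )


if __name__ == "__main__":
    # Hypothetical numbers: 2 s tool call, 200 MB of KV cache,
    # 10 GB/s swap bandwidth, 0.3 s to recompute the prompt.
    print(handle_interception(2.0, 200e6, 10e9, 0.3, 1e-9))
```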

Learn more →
Consistency Large Language Models: A Family of Efficient Parallel Decoders

Large language models (LLMs) traditionally decode tokens sequentially. Our research introduces Consistency Large Language Models (CLLMs), which can be fine-tuned to efficiently decode entire token sequences in a single step, reducing inference latency by up to 3.5x.
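
To make the parallel-decoding idea concrete, here is a toy sketch of Jacobi-style fixed-point decoding, the iteration that CLLMs are fine-tuned to converge under quickly; the "model" is a made-up stand-in function, not a real LLM.

```python
# Toy sketch of Jacobi-style parallel decoding: refine all n positions at
# once each iteration and stop at a fixed point, which matches the output
# that sequential (autoregressive) decoding would have produced.
def jacobi_decode(model_step, prompt, n_tokens, max_iters=50):
    guess = [0] * n_tokens                      # arbitrary initial guess
    for _ in range(max_iters):
        new = model_step(prompt, guess)         # update every position in parallel
        if new == guess:                        # converged to the fixed point
            return new
        guess = new
    return guess


def toy_model_step(prompt, tokens):
    """Hypothetical stand-in: each position depends only on the one before it."""
    context = [prompt] + tokens[:-1]
    return [(c * 31 + 7) % 1000 for c in context]


if __name__ == "__main__":
    print(jacobi_decode(toy_model_step, prompt=42, n_tokens=8))
```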

Learn more →
DistServe: Prefill-decode Disaggregation for LLM Serving Optimization

DistServe is a goodput-optimized LLM serving system that supports prefill-decode disaggregation, i.e., splitting the prefill and decode phases onto different GPUs, to account for both cost and user satisfaction. DistServe achieves up to 4.48x higher goodput or a 10.2x tighter SLO than existing state-of-the-art serving systems while staying within tight latency constraints.
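
A minimal conceptual sketch of disaggregation, not DistServe's implementation: prefill and decode run in separate worker pools, with the prefill worker handing its KV cache to a decode worker, so each phase can be provisioned against its own latency target (time to first token vs. time per output token). All functions and data types below are placeholders.

```python
# Conceptual sketch: separate prefill and decode workers with a KV handoff.
from dataclasses import dataclass


@dataclass
class PrefillResult:
    first_token: str
    kv_cache: list            # stand-in for the real KV tensors


def prefill_worker(prompt: str) -> PrefillResult:
    # Placeholder for the compute-heavy prompt pass on a prefill GPU.
    return PrefillResult(first_token="<tok>", kv_cache=list(prompt.split()))


def decode_worker(result: PrefillResult, max_new_tokens: int) -> list:
    # Placeholder for the memory-bound token-by-token pass on a decode GPU,
    # reusing the KV cache transferred from the prefill worker.
    tokens = [result.first_token]
    for i in range(max_new_tokens - 1):
        tokens.append(f"<tok{i}>")
    return tokens


if __name__ == "__main__":
    out = prefill_worker("Summarize the following document ...")
    print(decode_worker(out, max_new_tokens=4))
```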

Learn more →