Institute for Interdisciplinary Information Sciences, Tsinghua University

SpecScheduler: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Speaker: Zikun Li, Carnegie Mellon University
Time: 2024-12-25, 14:00-15:00
Venue: Online (DingTalk): https://meeting.dingtalk.com/j/duj0dqF7yAT
Abstract:

This paper introduces SpecScheduler, the first LLM serving system that supports service-level objective (SLO) customization with fine-grained speculative decoding. SpecScheduler uses the logits of a draft model to approximate the speculative performance of each token and introduces a theoretically optimal algorithm to select speculated tokens for verification. To enable SLO customization for individual requests while maintaining high throughput, SpecScheduler introduces a novel fine-grained speculative decoding pipeline that constructs speculative token trees for requests and customizes token selection for each request to meet its SLO constraints. Evaluation results show that SpecScheduler significantly outperforms state-of-the-art serving systems and various serving strategies, achieving up to 73% higher SLO attainment and 74% higher goodput than the best existing approaches.
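
As a rough illustration of the token-selection idea in the abstract, the sketch below picks nodes from a speculative token tree under a fixed verification budget. It is not SpecScheduler's algorithm. It assumes that a token's chance of being accepted can be approximated by the product of draft-model probabilities along its path from the root, and it uses a simple greedy rule on that estimate. The names `Node`, `select_tokens`, and `budget` are hypothetical and do not come from the paper.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    token: str
    prob: float  # assumed: draft model's conditional probability given the parent
    children: List["Node"] = field(default_factory=list)


def select_tokens(roots, budget):
    """Greedily pick up to `budget` nodes with the highest path probability.

    A child's path probability never exceeds its parent's, so expanding
    from a max-heap always yields a connected subtree of the token tree.
    """
    heap = []
    counter = 0  # tie-breaker so heapq never has to compare Node objects
    for root in roots:
        heap.append((-root.prob, counter, root))
        counter += 1
    heapq.heapify(heap)

    selected = []
    while heap and len(selected) < budget:
        neg_prob, _, node = heapq.heappop(heap)
        path_prob = -neg_prob
        selected.append((node.token, path_prob))
        # Children only become candidates once their parent is selected.
        for child in node.children:
            heapq.heappush(heap, (-(path_prob * child.prob), counter, child))
            counter += 1
    return selected


# Toy tree with made-up probabilities (illustrative only).
roots = [
    Node("the", 0.6, [Node("cat", 0.7), Node("dog", 0.2)]),
    Node("a", 0.3, [Node("bird", 0.9)]),
]
for token, path_prob in select_tokens(roots, budget=3):
    print(token, round(path_prob, 2))
```

In this sketch, a request with a tighter latency target could be given a larger `budget` so that more of its speculated tokens are sent for verification per step. How SpecScheduler actually divides the budget across requests is not described in the abstract.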

Biography:

Zikun Li is a third-year PhD student in the Computer Science Department at Carnegie Mellon University, working under the mentorship of Prof. Zhihao Jia. His research focuses on machine learning systems, with a particular emphasis on large language model serving. Zikun earned his Bachelor’s degree in Computer Science from Peking University in 2021. During his undergraduate studies, he collaborated with Prof. Tong Yang to develop cutting-edge algorithms for data stream mining.