Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference
STOYANOV, Radostin; Viktória SPIŠAKOVÁ; Adrian REBER; Wesley ARMOUR; Marcin COPIK et al.
Basic information
Original name
Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference
Authors
STOYANOV, Radostin; Viktória SPIŠAKOVÁ; Adrian REBER; Wesley ARMOUR; Marcin COPIK and Rodrigo BRUNO
Edition
New York, Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 114-125, 12 pp., 2025
Publisher
Association for Computing Machinery
Other information
Language
English
Type of result
Paper in conference proceedings
Field
10200 1.2 Computer and information sciences
Confidentiality degree
Is not subject to a state or trade secret
Form of publication
Electronic version "online"
Marked for transfer to RIV
Yes
Organizational unit
Ústav výpočetní techniky
ISBN
979-8-4007-1871-7
Keywords in English
Cloud Computing; Containers; LLM Inference; GPU Checkpointing
Abstract
In the original
The widespread adoption of Large Language Models (LLMs) has led to an increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources with two key mechanisms: (1) a demand-aware preemption policy that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM speeds up model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
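
To make the hot-swapping primitive concrete, the sketch below illustrates how a controller could evict an idle inference engine from the GPU and later restore it using NVIDIA's cuda-checkpoint utility together with CRIU, the transparent GPU checkpointing stack the abstract alludes to. This is a minimal hypothetical sketch, not the authors' implementation: the function names and orchestration are assumptions, and SwapServeLLM's demand-aware preemption and request-routing logic are not shown.

#!/usr/bin/env python3
# Hypothetical hot-swap controller sketch. Assumes the cuda-checkpoint
# utility (NVIDIA) and CRIU are installed, and that the inference
# engine runs as an ordinary Linux process identified by its PID.
import subprocess


def suspend_engine(pid: int, image_dir: str) -> None:
    """Evict an idle engine from the GPU and checkpoint it to disk."""
    # Toggle the process's CUDA state: the driver locks CUDA APIs and
    # copies device memory to host memory, freeing the GPU.
    subprocess.run(
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        check=True,
    )
    # The process is now CPU-only, so CRIU can checkpoint it like any
    # other process tree. By default, CRIU kills the tree after dump,
    # releasing its host memory as well.
    subprocess.run(
        ["criu", "dump", "--tree", str(pid),
         "--images-dir", image_dir, "--shell-job"],
        check=True,
    )


def resume_engine(pid: int, image_dir: str) -> None:
    """Restore a checkpointed engine and re-attach it to the GPU."""
    # CRIU restores the process under its original PID.
    subprocess.run(
        ["criu", "restore", "--images-dir", image_dir,
         "--shell-job", "--restore-detached"],
        check=True,
    )
    # Toggle CUDA state back: device memory is copied onto the GPU
    # and the engine resumes serving requests where it left off.
    subprocess.run(
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        check=True,
    )

Because the checkpoint operates at the process level, the same suspend/resume pair works for any engine (vLLM, Ollama, or others) without engine-specific integration, which is what "engine-agnostic" means in this context.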