
Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference

STOYANOV, Radostin; Viktória SPIŠAKOVÁ; Adrian REBER; Wesley ARMOUR; Marcin COPIK et al.

Basic information

Original name

Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference

Authors

STOYANOV, Radostin; Viktória SPIŠAKOVÁ; Adrian REBER; Wesley ARMOUR; Marcin COPIK and Rodrigo BRUNO

Edition

New York, Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 114-125, 12 pp., 2025

Publisher

Association for Computing Machinery

Other information

Language

English

Type of result

Article in proceedings

Field of study

10200 1.2 Computer and information sciences

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version "online"

Links

Marked for transfer to RIV

Yes

Organizational unit

Institute of Computer Science

ISBN

979-8-4007-1871-7

Keywords in English

Cloud Computing; Containers; LLM Inference; GPU Checkpointing

Tags


Abstract

In the original

The widespread adoption of Large Language Models (LLMs) has led to increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources with two key mechanisms: (1) demand-aware preemption that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM improves model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
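
The hot-swapping idea outlined in the abstract can be illustrated with a short sketch. The Python fragment below is not taken from the paper; it is a minimal sketch assuming that NVIDIA's cuda-checkpoint utility (one implementation of the driver-level transparent GPU checkpointing the abstract refers to) is available on the host, and that each model is already served by its own engine process. HotSwapScheduler, toggle_gpu_state, and the preemption policy are hypothetical names and simplifications introduced for illustration, not SwapServeLLM's actual API.

    import subprocess

    def toggle_gpu_state(pid: int) -> None:
        # Toggle a CUDA process between running and checkpointed state using
        # NVIDIA's cuda-checkpoint utility (transparent driver-level GPU checkpointing).
        subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)

    class HotSwapScheduler:
        # Hypothetical demand-aware scheduler: one model is resident on the GPU at a time.
        def __init__(self, engines, active):
            self.engines = engines  # model name -> PID of its inference-engine process
            self.active = active    # model currently resident on the GPU

        def route(self, model, pending):
            # Return the PID able to serve `model` now, or None if the request must wait.
            if model == self.active:
                return self.engines[model]
            # Demand-aware preemption (illustrative policy): swap only when the
            # requested model has more queued requests than the resident one.
            if pending.get(model, 0) > pending.get(self.active, 0):
                toggle_gpu_state(self.engines[self.active])  # checkpoint resident model's GPU state to host memory
                toggle_gpu_state(self.engines[model])        # restore the requested model's GPU state
                self.active = model
                return self.engines[model]
            return None  # keep the request queued until demand shifts

Because the checkpoint/restore step operates on the engine process transparently, the same scheduler works regardless of which inference engine (for example vLLM or Ollama) serves each model, which is the engine-agnostic property the abstract emphasizes. A full system would additionally need the request routing and memory reservation logic described in the paper.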