Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference
STOYANOV, Radostin; Viktória SPIŠAKOVÁ; Adrian REBER; Wesley ARMOUR; Marcin COPIK et al.
Basic information
Original name
Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference
Authors
STOYANOV, Radostin; Viktória SPIŠAKOVÁ; Adrian REBER; Wesley ARMOUR; Marcin COPIK and Rodrigo BRUNO
Edition
New York, Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 114-125, 12 pp., 2025
Publisher
Association for Computing Machinery
Other information
Language
English
Type of result
Paper in conference proceedings
Field
10200 1.2 Computer and information sciences
Confidentiality degree
Is not subject to a state or trade secret
Form of publication
Electronic version "online"
Marked for transfer to RIV
Yes
Organizational unit
Ústav výpočetní techniky
ISBN
979-8-4007-1871-7
Keywords in English
Cloud Computing; Containers; LLM Inference; GPU Checkpointing
Abstract
In the original
The widespread adoption of Large Language Models (LLMs) has led to an increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM, a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources with two key mechanisms: (1) a demand-aware preemption policy that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM speeds up model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
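
To make the hot-swapping primitive concrete, the sketch below illustrates how a controller could evict an idle inference engine from the GPU and later restore it using NVIDIA's cuda-checkpoint utility together with CRIU, the transparent GPU checkpointing stack the abstract alludes to. This is a minimal hypothetical sketch, not the authors' implementation: the function names and orchestration are assumptions, and SwapServeLLM's demand-aware preemption and request-routing logic are not shown.

#!/usr/bin/env python3
# Hypothetical hot-swap controller sketch. Assumes the cuda-checkpoint
# utility (NVIDIA) and CRIU are installed, and that the inference
# engine runs as an ordinary Linux process identified by its PID.
import subprocess


def suspend_engine(pid: int, image_dir: str) -> None:
    """Evict an idle engine from the GPU and checkpoint it to disk."""
    # Toggle the process's CUDA state: the driver locks CUDA APIs and
    # copies device memory to host memory, freeing the GPU.
    subprocess.run(
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        check=True,
    )
    # The process is now CPU-only, so CRIU can checkpoint it like any
    # other process tree. By default, CRIU kills the tree after dump,
    # releasing its host memory as well.
    subprocess.run(
        ["criu", "dump", "--tree", str(pid),
         "--images-dir", image_dir, "--shell-job"],
        check=True,
    )


def resume_engine(pid: int, image_dir: str) -> None:
    """Restore a checkpointed engine and re-attach it to the GPU."""
    # CRIU restores the process under its original PID.
    subprocess.run(
        ["criu", "restore", "--images-dir", image_dir,
         "--shell-job", "--restore-detached"],
        check=True,
    )
    # Toggle CUDA state back: device memory is copied onto the GPU
    # and the engine resumes serving requests where it left off.
    subprocess.run(
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        check=True,
    )

Because the checkpoint operates at the process level, the same suspend/resume pair works for any engine (vLLM, Ollama, or others) without engine-specific integration, which is what "engine-agnostic" means in this context.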