Optimizing CUDA code by kernel fusion: application on BLAS

FILIPOVIČ, Jiří, Matúš MADZIN, Jan FOUSEK and Luděk MATYSKA. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing. Springer US, vol. 71, No 10, p. 3934-3957. ISSN 0920-8542. doi:10.1007/s11227-015-1483-z. 2015.

Basic information
Original name	Optimizing CUDA code by kernel fusion: application on BLAS
Name in Czech	Optimalizace CUDA kódu pomocí fúzí kernelů: aplikace na BLAS
Authors	FILIPOVIČ, Jiří (203 Czech Republic, guarantor, belonging to the institution), Matúš MADZIN (703 Slovakia, belonging to the institution), Jan FOUSEK (203 Czech Republic, belonging to the institution) and Luděk MATYSKA (203 Czech Republic, belonging to the institution).
Edition	The Journal of Supercomputing, Springer US, 2015, 0920-8542.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	United States of America
Confidentiality degree	is not subject to a state or trade secret
Impact factor	1.088
RIV identification code	RIV/00216224:14330/15:00083436
Organization unit	Faculty of Informatics
Doi	http://dx.doi.org/10.1007/s11227-015-1483-z
UT WoS	000361531500013
Keywords (in Czech)	GPU; CUDA; BLAS; fúze kernelů; generování kódu
Keywords in English	GPU; CUDA; BLAS; Kernel fusion; Code generation
Tags	J-Q2
Tags	International impact, Reviewed
Changed by	RNDr. Jiří Filipovič, Ph.D., učo 72898. Changed: 9/7/2019 13:16.

Abstract

Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
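
The fusion idea in the abstract can be illustrated with a minimal CPU-side sketch. It is only an analogy in plain Python, not the output of the authors' compiler and not CUDA code. The function names (axpy, dot, unfused, fused) and the chosen sequence (an AXPY-style map followed by a dot-product reduce) are assumptions made for illustration. In the unfused version the intermediate vector is fully materialized between the two steps, which on a GPU would correspond to a round trip through global memory. In the fused version each intermediate element stays in a local variable, which would correspond to a register or other on-chip storage.

```python
def axpy(alpha, x, y):
    """Map step: return alpha * x + y as a new list (the intermediate vector)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(a, b):
    """Reduce step: sum of elementwise products of a and b."""
    return sum(ai * bi for ai, bi in zip(a, b))


def unfused(alpha, x, y, z):
    # Two separate "kernels": the intermediate vector t is written out in
    # full by the first step and read back in full by the second step.
    t = axpy(alpha, x, y)
    return dot(t, z)


def fused(alpha, x, y, z):
    # One "kernel": each intermediate element exists only in a local
    # variable and is consumed by the reduction immediately.
    acc = 0.0
    for xi, yi, zi in zip(x, y, z):
        t = alpha * xi + yi  # map, kept local
        acc += t * zi        # reduce, applied right away
    return acc


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [4.0, 5.0, 6.0]
    z = [1.0, 0.5, 2.0]
    # Both variants compute the same value; only the data movement differs.
    print(unfused(2.0, x, y, z), fused(2.0, x, y, z))
```

Both functions return the same result. The difference is that the fused variant never stores or reloads the intermediate vector, which is the kind of memory traffic that limits memory-bound BLAS-1 and BLAS-2 sequences.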

Links
EE2.3.30.0037, research and development project	Name: Zaměstnáním nejlepších mladých vědců k rozvoji mezinárodní spolupráce
MUNI/A/0945/2015, MU internal code	Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V.
MUNI/A/0945/2015, MU internal code	Investor: Masaryk University, Category A
MUNI/A/1159/2014, MU internal code	Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace IV.
MUNI/A/1159/2014, MU internal code	Investor: Masaryk University, Category A
