FILIPOVIČ, Jiří, Matúš MADZIN, Jan FOUSEK and Luděk MATYSKA. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing, Springer US, 2015, vol. 71, No 10, p. 3934-3957. ISSN 0920-8542. doi:10.1007/s11227-015-1483-z.
Basic information
Original name Optimizing CUDA code by kernel fusion: application on BLAS
Name in Czech Optimalizace CUDA kódu pomocí fúzí kernelů: aplikace na BLAS
Authors FILIPOVIČ, Jiří (203 Czechia, guarantor, belonging to the institution), Matúš MADZIN (703 Slovakia, belonging to the institution), Jan FOUSEK (203 Czechia, belonging to the institution) and Luděk MATYSKA (203 Czechia, belonging to the institution).
Edition The Journal of Supercomputing, Springer US, 2015, 0920-8542.
Other information
Original language English
Type of outcome article in a journal
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher United States of America
Confidentiality degree is not subject to a state or trade secret
Impact factor 1.088
RIV identification code RIV/00216224:14330/15:00083436
Organization unit Faculty of Informatics
Doi http://dx.doi.org/10.1007/s11227-015-1483-z
UT WoS 000361531500013
Keywords (in Czech) GPU; CUDA; BLAS; fúze kernelů; generování kódu
Keywords in English GPU; CUDA; BLAS; Kernel fusion; Code generation
Tags J-Q2
Tags International impact, Reviewed
Changed by RNDr. Jiří Filipovič, Ph.D., učo 72898. Changed: 9/7/2019 13:16.
Abstract
Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing the shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of sequences of these routines can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
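The idea of fusing a map with a following reduce can be illustrated with a minimal CPU-side sketch. This is not CUDA and not output of the authors' compiler; the function names `axpy`, `dot`, `unfused` and `fused` are hypothetical and chosen only for illustration. The unfused variant materializes the whole intermediate vector between the two steps (analogous to passing it through off-chip global memory), whereas the fused variant keeps each intermediate element in a local variable (analogous to keeping it on chip).

```python
# Conceptual sketch of kernel fusion on the CPU (assumption: plain Python
# lists stand in for GPU arrays; this is not the paper's generated code).

def axpy(alpha, x, y):
    """Map step: return z with z[i] = alpha * x[i] + y[i]."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(u, v):
    """Reduce step: return the sum of u[i] * v[i]."""
    total = 0.0
    for ui, vi in zip(u, v):
        total += ui * vi
    return total


def unfused(alpha, x, y, w):
    # Two passes: the intermediate vector z is fully written out and
    # then read back, like a round trip through global memory.
    z = axpy(alpha, x, y)
    return dot(z, w)


def fused(alpha, x, y, w):
    # One pass: each intermediate value zi lives only in a local
    # variable and is consumed immediately by the reduction.
    total = 0.0
    for xi, yi, wi in zip(x, y, w):
        zi = alpha * xi + yi
        total += zi * wi
    return total
```

Both variants compute the same value; the fused one simply avoids storing and reloading the intermediate vector, which is the memory-traffic saving the abstract refers to.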
Links
EE2.3.30.0037, research and development project
Name: Zaměstnáním nejlepších mladých vědců k rozvoji mezinárodní spolupráce (Employing the best young scientists to develop international cooperation)
MUNI/A/0945/2015, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V. (Large-scale computing systems: models, applications and verification V.)
Investor: Masaryk University, Grant Agency of Masaryk University, Category A
MUNI/A/1159/2014, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace IV. (Large-scale computing systems: models, applications and verification IV.)
Investor: Masaryk University, Grant Agency of Masaryk University, Category A