2015
Optimizing CUDA code by kernel fusion: application on BLAS
FILIPOVIČ, Jiří; Matúš MADZIN; Jan FOUSEK and Luděk MATYSKA
Basic information
Original name
Optimizing CUDA code by kernel fusion: application on BLAS
Name in Czech
Optimalizace CUDA kódu pomocí fúzí kernelů: aplikace na BLAS
Authors
FILIPOVIČ, Jiří (203 Czech Republic, guarantor, belonging to the institution); Matúš MADZIN (703 Slovakia, belonging to the institution); Jan FOUSEK (203 Czech Republic, belonging to the institution) and Luděk MATYSKA (203 Czech Republic, belonging to the institution)
Edition
The Journal of Supercomputing, Springer US, 2015, 0920-8542
Other information
Language
English
Type of outcome
Article in a journal
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
United States of America
Confidentiality degree
is not subject to a state or trade secret
Impact factor
1.088
RIV identification code
RIV/00216224:14330/15:00083436
Organization unit
Faculty of Informatics
UT WoS
000361531500013
EID Scopus
2-s2.0-84942372773
Keywords (in Czech)
GPU; CUDA; BLAS; fúze kernelů; generování kódu
Keywords in English
GPU; CUDA; BLAS; Kernel fusion; Code generation
Tags
International impact, Reviewed
Changed: 9/7/2019 13:16, doc. RNDr. Jiří Filipovič, Ph.D.
Abstract
In the original language
Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory-bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing the shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce, or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
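The fusion idea in the abstract can be illustrated with a minimal CPU-side sketch. This is not the paper's compiler output or its CUDA code. It is a plain C++ analogue in which each loop stands in for one kernel, and the function names `axpy_then_dot` and `fused_axpy_dot` are hypothetical. The sketch assumes `x` and `y` have the same length. It computes a map (z = alpha*x + y) followed by a reduce (the dot product of z with itself), first as two separate passes that exchange z through memory, then as one fused pass in which each element of z stays in a local variable.

```cpp
#include <cstddef>
#include <vector>

// Unfused: two separate passes, analogous to two kernel launches.
// The intermediate vector z is written to memory and then read back.
// Assumes x.size() == y.size().
double axpy_then_dot(double alpha, const std::vector<double>& x,
                     const std::vector<double>& y) {
    std::vector<double> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {  // pass 1: map
        z[i] = alpha * x[i] + y[i];
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {  // pass 2: reduce
        sum += z[i] * z[i];
    }
    return sum;
}

// Fused: a single pass. Each z[i] exists only in a local variable,
// so the intermediate vector is never stored or reloaded.
// Assumes x.size() == y.size().
double fused_axpy_dot(double alpha, const std::vector<double>& x,
                      const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double zi = alpha * x[i] + y[i];  // map step
        sum += zi * zi;                         // reduce step
    }
    return sum;
}
```

Both functions return the same value, but the fused version avoids one full write and one full read of the intermediate vector. On a GPU the corresponding saving would be in off-chip global memory traffic, which is the bottleneck for memory-bound BLAS-1 and BLAS-2 operations.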
Links
EE2.3.30.0037, research and development project
MUNI/A/0945/2015, internal MU code
MUNI/A/1159/2014, internal MU code