Optimizing CUDA code by kernel fusion: application on BLAS

J 2015

FILIPOVIČ, Jiří; Matúš MADZIN; Jan FOUSEK and Luděk MATYSKA

Basic information

Original name

Optimizing CUDA code by kernel fusion: application on BLAS

Name in Czech

Optimalizace CUDA kódu pomocí fúzí kernelů: aplikace na BLAS

Authors

FILIPOVIČ, Jiří (203 Czech Republic, guarantor, belonging to the institution); Matúš MADZIN (703 Slovakia, belonging to the institution); Jan FOUSEK (203 Czech Republic, belonging to the institution) and Luděk MATYSKA (203 Czech Republic, belonging to the institution)

Edition

The Journal of Supercomputing, Springer US, 2015, 0920-8542

Other information

Language

English

Type of outcome

Article in a journal

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

United States of America

Confidentiality degree

is not subject to a state or trade secret

Impact factor

1.088

RIV identification code

RIV/00216224:14330/15:00083436

Organization unit

Faculty of Informatics

DOI

http://dx.doi.org/10.1007/s11227-015-1483-z

UT WoS

000361531500013

EID Scopus

2-s2.0-84942372773

Keywords (in Czech)

GPU; CUDA; BLAS; fúze kernelů; generování kódu

Keywords in English

GPU; CUDA; BLAS; Kernel fusion; Code generation

Abstract

In the original language

Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory-bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
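The memory-traffic argument in the abstract can be illustrated with a small sequential model. The sketch below is not the authors' compiler or its generated CUDA code; it is a plain-Python analogy in which a map (AXPY, z = alpha*x + y) is followed by a reduce (DOT, r = z·w). The function names, the choice of AXPY followed by DOT, and the `traffic` counter (one unit per element read from or written to what would be global memory) are all assumptions made for illustration. The model also assumes that z is not needed after the sequence, and it ignores the on-chip cooperation between threads that a real GPU reduction requires.

```python
def unfused_axpy_dot(alpha, x, y, w):
    """Two separate 'kernels'; the intermediate z round-trips through memory."""
    n = len(x)
    traffic = 0  # hypothetical count of global-memory element transfers

    # Kernel 1 (map): read x[i] and y[i], write z[i].
    z = [0.0] * n
    for i in range(n):
        z[i] = alpha * x[i] + y[i]
        traffic += 3

    # Kernel 2 (reduce): read z[i] back, read w[i].
    r = 0.0
    for i in range(n):
        r += z[i] * w[i]
        traffic += 2

    return r, traffic


def fused_axpy_dot(alpha, x, y, w):
    """One fused 'kernel'; z[i] lives only in a local variable."""
    traffic = 0
    r = 0.0
    for i in range(len(x)):
        z_i = alpha * x[i] + y[i]  # analogue of a register / on-chip value
        r += z_i * w[i]
        traffic += 3               # reads of x[i], y[i], w[i] only
    return r, traffic


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 1.0, 1.0]
w = [1.0, 0.0, 1.0, 0.0]
print(unfused_axpy_dot(2.0, x, y, w))  # (10.0, 20)
print(fused_axpy_dot(2.0, x, y, w))    # (10.0, 12)
```

In this toy model, fusion cuts the element transfers from 5n to 3n while producing the same result. For a memory-bound sequence, a reduction of this kind is where a speedup would be expected to come from; the actual gains reported in the paper depend on the GPU and the routines involved.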

Links

EE2.3.30.0037, research and development project

Name: Zaměstnáním nejlepších mladých vědců k rozvoji mezinárodní spolupráce

MUNI/A/0945/2015, internal MU code

Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V.

Investor: Masaryk University, Category A

MUNI/A/1159/2014, internal MU code

Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace IV.

Investor: Masaryk University, Category A

Cite

FILIPOVIČ, Jiří; Matúš MADZIN; Jan FOUSEK and Luděk MATYSKA. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing. Springer US, 2015, vol. 71, No 10, p. 3934-3957. ISSN 0920-8542. Available from: https://dx.doi.org/10.1007/s11227-015-1483-z.

@article{1306828,
   author = {Filipovič, Jiří and Madzin, Matúš and Fousek, Jan and Matyska, Luděk},
   title = {Optimizing CUDA code by kernel fusion: application on BLAS},
   journal = {The Journal of Supercomputing},
   publisher = {Springer US},
   year = {2015},
   volume = {71},
   number = {10},
   pages = {3934--3957},
   issn = {0920-8542},
   doi = {10.1007/s11227-015-1483-z},
   url = {http://link.springer.com/article/10.1007/s11227-015-1483-z},
   keywords = {GPU; CUDA; BLAS; Kernel fusion; Code generation},
   language = {eng}
}

TY  - JOUR
ID  - 1306828
AU  - Filipovič, Jiří
AU  - Madzin, Matúš
AU  - Fousek, Jan
AU  - Matyska, Luděk
PY  - 2015
TI  - Optimizing CUDA code by kernel fusion: application on BLAS
JF  - The Journal of Supercomputing
VL  - 71
IS  - 10
SP  - 3934
EP  - 3957
PB  - Springer US
SN  - 0920-8542
DO  - 10.1007/s11227-015-1483-z
KW  - GPU
KW  - CUDA
KW  - BLAS
KW  - Kernel fusion
KW  - Code generation
UR  - http://link.springer.com/article/10.1007/s11227-015-1483-z
N2  - Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory-bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
ER  -

FILIPOVIČ, Jiří; Matúš MADZIN; Jan FOUSEK and Luděk MATYSKA. Optimizing CUDA code by kernel fusion: application on BLAS. \textit{The Journal of Supercomputing}. Springer US, 2015, vol.~71, No~10, p.~3934-3957. ISSN~0920-8542. Available from: https://dx.doi.org/10.1007/s11227-015-1483-z.
