J 2015

Optimizing CUDA code by kernel fusion: application on BLAS

FILIPOVIČ, Jiří; Matúš MADZIN; Jan FOUSEK and Luděk MATYSKA

Basic information

Original name

Optimizing CUDA code by kernel fusion: application on BLAS

Name in Czech

Optimalizace CUDA kódu pomocí fúzí kernelů: aplikace na BLAS

Authors

FILIPOVIČ, Jiří (203 Czech Republic, guarantor, belonging to the institution); Matúš MADZIN (703 Slovakia, belonging to the institution); Jan FOUSEK (203 Czech Republic, belonging to the institution) and Luděk MATYSKA (203 Czech Republic, belonging to the institution)

Edition

The Journal of Supercomputing, Springer US, 2015, ISSN 0920-8542

Other information

Language

English

Type of outcome

Article in a journal

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

United States of America

Confidentiality degree

is not subject to a state or trade secret

Impact factor

1.088

RIV identification code

RIV/00216224:14330/15:00083436

Organization unit

Faculty of Informatics

UT WoS

000361531500013

EID Scopus

2-s2.0-84942372773

Keywords (in Czech)

GPU; CUDA; BLAS; fúze kernelů; generování kódu

Keywords in English

GPU; CUDA; BLAS; Kernel fusion; Code generation

Tags

International impact, Reviewed
Changed: 9/7/2019 13:16, doc. RNDr. Jiří Filipovič, Ph.D.

Abstract

In the original language

Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory-bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing the shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24× faster for the examples tested.
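
Illustrative note (not part of the published abstract): the idea of fusing a map with a reduce can be sketched on the CPU in plain C++. This is only an analogy under stated assumptions, not the code produced by the authors' compiler. The function names `scaled_dot_unfused` and `scaled_dot_fused` are hypothetical, the temporary vector `t` stands in for data passed through off-chip global memory between two kernels, and the local variable `ti` stands in for data kept in on-chip memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused version: two separate passes, mirroring two kernel launches.
// The intermediate vector t is written out in full and then read back,
// which is the extra memory traffic that fusion aims to avoid.
double scaled_dot_unfused(double alpha,
                          const std::vector<double>& x,
                          const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> t(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        t[i] = alpha * x[i];            // pass 1: map (scale x by alpha)
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        sum += t[i] * y[i];             // pass 2: map + reduce (dot product)
    }
    return sum;
}

// Fused version: a single pass. Each scaled element lives only in a
// local variable, so no intermediate vector is stored or reloaded.
double scaled_dot_fused(double alpha,
                        const std::vector<double>& x,
                        const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double ti = alpha * x[i]; // map result stays local
        sum += ti * y[i];               // consumed immediately by the reduce
    }
    return sum;
}
```

Both functions compute the same value, alpha times the dot product of x and y. The fused one reads each input element once and never materializes the intermediate vector, which is the kind of saving the abstract attributes to improved memory locality.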

Links

EE2.3.30.0037, research and development project
Name: Zaměstnáním nejlepších mladých vědců k rozvoji mezinárodní spolupráce
MUNI/A/0945/2015, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V.
Investor: Masaryk University, Category A
MUNI/A/1159/2014, internal MU code
Name: Rozsáhlé výpočetní systémy: modely, aplikace a verifikace IV.
Investor: Masaryk University, Category A