FILIPOVIČ, Jiří, Jan PLHÁK and David STŘELÁK. Acceleration of dRMSD Calculation and Efficient Usage of GPU Caches. Online. In Waleed Smari. Proceedings of IEEE International Conference on High Performance Computing & Simulation. IEEE, 2015. p. 47-54. ISBN 978-1-4673-7812-3. Available from: https://dx.doi.org/10.1109/HPCSim.2015.7237020. [cited 2024-04-23]
Basic information
Original name Acceleration of dRMSD Calculation and Efficient Usage of GPU Caches
Name in Czech Akcelerace dRMSD výpočtu a efektivní užití GPU cache
Authors FILIPOVIČ, Jiří (203 Czech Republic, guarantor, belonging to the institution), Jan PLHÁK (203 Czech Republic, belonging to the institution) and David STŘELÁK (203 Czech Republic, belonging to the institution)
Edition not stated, Proceedings of IEEE International Conference on High Performance Computing & Simulation, p. 47-54, 8 pp. 2015.
Publisher IEEE
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Netherlands
Confidentiality degree is not subject to a state or trade secret
Publication form printed version
RIV identification code RIV/00216224:14330/15:00083460
Organization unit Faculty of Informatics
ISBN 978-1-4673-7812-3
DOI http://dx.doi.org/10.1109/HPCSim.2015.7237020
UT WoS 000375684100006
Keywords (in Czech) RMSD; GPU; optimalizace kódu; cache
Keywords in English RMSD; GPU; code optimization; cache
Tags firank_B
Tags International impact, Reviewed
Changed by RNDr. Jiří Filipovič, Ph.D., učo 72898. Changed: 13/7/2016 11:10.
Abstract
In this paper, we introduce the GPU acceleration of the dRMSD algorithm, which is used to compare different structures of a molecule. Compared to a multithreaded CPU implementation, we have reached a 13.4x speedup in clustering and a 62.7x speedup in 1:1 dRMSD computation using a mid-range GPU. The dRMSD computation exhibits strong memory locality and is thus compute-bound. Along with a conservative implementation using shared memory, we have implemented variants of the algorithm that rely on GPU caches to maintain memory locality. Our cache-based implementation reaches 96.5 % and 91.6 % of the shared memory performance on Fermi and Maxwell, respectively. We have identified several performance pitfalls related to cache blocking in compute-bound codes and suggest optimization techniques to improve the performance.
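Illustrative sketch (not the authors' code): a minimal CUDA kernel for the 1:1 dRMSD of two conformations a and b of the same n-atom molecule, in the spirit of the shared-memory ("conservative") variant mentioned in the abstract. It tiles atom coordinates through shared memory so that each pairwise distance reuses on-chip data. The names (drmsd_kernel, drmsd), the tile size TILE = 128, and the launch configuration are assumptions made for this sketch, not values from the paper.

// Sketch only: dRMSD = sqrt( sum_{i<j} (|a_i-a_j| - |b_i-b_j|)^2 / (n(n-1)/2) ).
// Launch drmsd_kernel with blockDim.x == TILE.
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 128   // assumed tile/block size for this sketch

__global__ void drmsd_kernel(const float3 *a, const float3 *b, int n, float *partial)
{
    __shared__ float3 sa[TILE], sb[TILE];
    __shared__ float red[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float3 ai = (i < n) ? a[i] : make_float3(0.f, 0.f, 0.f);
    float3 bi = (i < n) ? b[i] : make_float3(0.f, 0.f, 0.f);
    float acc = 0.0f;

    // Sweep all atoms j in tiles staged in shared memory.
    for (int tile = 0; tile < n; tile += TILE) {
        int j = tile + threadIdx.x;
        if (j < n) { sa[threadIdx.x] = a[j]; sb[threadIdx.x] = b[j]; }
        __syncthreads();
        int lim = min(TILE, n - tile);
        for (int k = 0; k < lim; ++k) {
            int jj = tile + k;
            if (i < n && jj > i) {                 // count each pair {i,j} once
                float dax = ai.x - sa[k].x, day = ai.y - sa[k].y, daz = ai.z - sa[k].z;
                float dbx = bi.x - sb[k].x, dby = bi.y - sb[k].y, dbz = bi.z - sb[k].z;
                float da = sqrtf(dax*dax + day*day + daz*daz);   // |a_i - a_j|
                float db = sqrtf(dbx*dbx + dby*dby + dbz*dbz);   // |b_i - b_j|
                float d = da - db;
                acc += d * d;
            }
        }
        __syncthreads();
    }

    // Block-wide tree reduction of per-thread partial sums (blockDim.x is a power of two).
    red[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) red[threadIdx.x] += red[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = red[0];
}

// Host wrapper: sums per-block partials and finishes the dRMSD formula.
float drmsd(const float3 *d_a, const float3 *d_b, int n)
{
    int blocks = (n + TILE - 1) / TILE;
    float *d_partial;
    cudaMalloc(&d_partial, blocks * sizeof(float));
    drmsd_kernel<<<blocks, TILE>>>(d_a, d_b, n, d_partial);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    double sum = 0.0;
    for (float p : partial) sum += p;
    return (float)std::sqrt(sum / (n * (n - 1) / 2.0));
}

The cache-based variants discussed in the paper would replace the explicit shared-memory staging with ordinary global loads blocked so that the working set fits in the L1/texture cache; this sketch only shows the baseline shared-memory pattern.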
Links
EE2.3.30.0037, research and development project. Name: Zaměstnáním nejlepších mladých vědců k rozvoji mezinárodní spolupráce (Employing the best young scientists to develop international cooperation)