Multi-GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

Jan Masek, Radim Burget, Lukas Povoda, Malay Kishore Dutta


Modern Graphics Processing Units (GPUs) have become very useful for complex and time-consuming computations, and they offer high-performance computing capabilities at a low price. This paper presents multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm and compares the performance of the two, each of which is better suited to a different number of attributes. The proposed CUDA algorithm achieves a speed-up of up to 880x compared with a single-threaded CPU version. The standard k-NN algorithm was modified to run faster when a small number k of neighbors is set. Performance was evaluated on two dual-GPU NVIDIA GeForce GTX 690 cards and an Intel Core i7 3770 CPU running at 4.1 GHz, and the speed-up was measured for one, two, three, and four GPUs. We performed several tests with data sets containing up to 4 million elements with varying numbers of attributes.
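The abstract does not say how k-NN was modified for small k or how work was divided among the GPUs. As an illustration only, the following C++ sketch shows a brute-force, single-query k-NN of the kind a CPU baseline might use. It keeps a sorted buffer of the k best candidates, so each insertion costs O(k), which stays cheap when k is small; this is one plausible approach, assumed here, not the authors' method. The helper splitAcrossDevices shows one assumed way of dividing queries among one to four GPUs. All function names are hypothetical, and neither function is the paper's CUDA or OpenCL code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Squared Euclidean distance between two points with `dim` attributes each.
// The square root is skipped because it does not change the neighbor ranking.
float squaredDistance(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < dim; ++j) {
        float diff = a[j] - b[j];
        sum += diff * diff;
    }
    return sum;
}

// Hypothetical brute-force k-NN for one query point. `train` stores n points
// row by row (n * dim floats). Returns indices of the k nearest points, nearest
// first. Only a sorted buffer of k candidates is kept, so each insertion costs
// O(k): an assumed way to favor small k, not the paper's described modification.
std::vector<std::size_t> kNearest(const std::vector<float>& train, std::size_t dim,
                                  const float* query, std::size_t k) {
    assert(dim > 0 && train.size() % dim == 0);
    std::vector<std::size_t> result;
    if (k == 0) return result;

    std::vector<std::pair<float, std::size_t>> best;  // (distance, index), ascending
    best.reserve(k + 1);
    const std::size_t n = train.size() / dim;
    for (std::size_t i = 0; i < n; ++i) {
        float d = squaredDistance(&train[i * dim], query, dim);
        // Skip points that cannot enter an already full top-k buffer.
        if (best.size() == k && d >= best.back().first) continue;
        auto pos = std::upper_bound(
            best.begin(), best.end(), d,
            [](float value, const std::pair<float, std::size_t>& p) { return value < p.first; });
        best.insert(pos, std::make_pair(d, i));
        if (best.size() > k) best.pop_back();  // drop the current farthest candidate
    }
    for (const auto& p : best) result.push_back(p.second);
    return result;
}

// Hypothetical helper: split `count` query points into `devices` contiguous
// [begin, end) ranges of near-equal size, one range per GPU.
std::vector<std::pair<std::size_t, std::size_t>> splitAcrossDevices(std::size_t count,
                                                                    std::size_t devices) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (devices == 0) return ranges;
    const std::size_t base = count / devices;
    const std::size_t extra = count % devices;
    std::size_t begin = 0;
    for (std::size_t dev = 0; dev < devices; ++dev) {
        std::size_t size = base + (dev < extra ? 1 : 0);  // first `extra` devices get one more
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

On a GPU, the distance loop is the part that would typically be parallelized (for example, one thread per training point), with each device processing its own query range independently; for 10 queries on 4 devices, the split above yields ranges of 3, 3, 2, and 2 queries.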
