Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

Modern Graphics Processing Units (GPUs) are very useful for computing complex and time-consuming processes, and they provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. It compares the performance of the OpenCL and CUDA implementations, each of which is better suited to a different number of attributes. The proposed CUDA algorithm achieves an acceleration of up to 880x in comparison with a single-threaded CPU version. The common k-NN algorithm was modified to run faster when a low number of k neighbors is set. The performance of the algorithm was verified on two dual-GPU NVIDIA GeForce GTX 690 graphics cards and an Intel Core i7 3770 CPU running at 4.1 GHz. The speed-up was measured for one, two, three and four GPUs. We performed several tests with data sets containing up to 4 million elements with various numbers of attributes.
Keywords: artificial intelligence, big data, comparison, CUDA, GPU, high performance computing, k-NN, multi-GPU, OpenCL.


I. INTRODUCTION
Parallel computing is a way to accelerate many computationally intensive algorithms. Such algorithms can be found in image, sound and video applications, simulations, data mining, security [1], forecasting systems, etc.
k-NN is an artificial intelligence algorithm and one of the most widely used algorithms in data mining applications. It can be used for classification in a wide range of problems from business and science. Sometimes large data sets with high-dimensional data have to be processed, and such problems can take days to compute. With parallel computing they can be solved faster than with a non-parallel implementation. GPUs have many more cores than CPUs, so they are well suited to parallelization. A further advantage of GPUs is their relatively low price for the performance they deliver. The k-NN algorithm is a good candidate for GPU parallelization.
In this paper, OpenCL [2] and CUDA [3] accelerated versions of the k-Nearest Neighbor machine learning algorithm are introduced. This work is based on our previous work [4]. The algorithm is very computationally intensive, mainly when big data sets with high-dimensional data have to be processed; the process can take hours or days. To solve this problem, we modified the common k-NN algorithm to run on multiple GPUs. We used two common gaming dual-GPU NVIDIA GeForce GTX 690 graphics cards [5] with 2x3072 CUDA cores in total. The theoretical single-precision computing performance is 11.24 TFLOPS for both cards together. We also used these GPUs to speed up the Viola-Jones object detector [6], which was also used in [7], [8], [9].
The main contribution of this paper is the creation of OpenCL and CUDA versions of the k-NN algorithm that can be executed on several GPU cards in parallel. With this relatively cheap hardware, computation can be sped up by up to 880 times in comparison with a CPU running at 4.1 GHz. The newly created algorithms were tested on data sets containing millions of elements with various numbers of attributes (4, 10, 100 and 1000) and then compared with each other.
The rest of the paper is organized as follows: Section II describes other GPU implementations of the k-NN algorithm. Section III describes the k-NN algorithm. The OpenCL and CUDA platforms are introduced in Section IV. Section V describes our GPU implementation. Results and discussion are presented in Section VI. Section VII concludes the paper.

II. RELATED WORK
GPU computing has become very popular during the last several years. There is also an increasing need to process larger amounts of data with artificial intelligence. The following paragraphs describe several articles dealing with CUDA implementations of the k-NN algorithm on GPUs and various use cases of the algorithm.
In [10], a new brute-force algorithm for building the k-Nearest Neighbor graph is described. The proposed algorithm has two parts: the first finds the distances between the input vectors and the second selects the k neighbors for each testing sample. A new algorithm based on quicksort was also implemented for faster sorting of distance pairs. The algorithm achieves a higher speed-up as k increases.
The paper [11] compares a GPU implementation of brute-force k-NN with several CPU-based implementations and with the algorithms from the ANN library (A Library for Approximate Nearest Neighbor Searching).
In [12] [13] several optimization techniques were applied to maximize the utilization of the GPU.
The work [14] describes the MST (minimum spanning tree) problem, which is solved by combining the classical Boruvka MST algorithm with the k-NN graph structure. The achieved speed-ups were between 30 and 40 times in comparison with a CPU implementation.
In [15], the authors describe how to use a GPU k-NN algorithm for image processing (texture analysis). Their algorithm is 150 times faster than the CPU version when processing synthetic data and up to 75 times faster when processing image data.
In [16], the LSH (Locality Sensitive Hashing) algorithm was used for k-NN computation. The results were demonstrated on large image data sets, and the achieved acceleration was 40 times in comparison with the CPU version.
Several comparison tests between the OpenCL and CUDA frameworks have been performed. In [17], the authors ran 16 benchmark tests in which CUDA achieved about 30 % better performance than OpenCL. They also tested the portability of OpenCL and did not find differences in performance. In [18], the authors compared the execution time of the CUDA Driver API with that of the OpenCL platform; CUDA was about 5 % faster than OpenCL.
The implementation of a GPU k-NN algorithm in the RapidMiner data mining platform was described in [19]. The algorithm was written in the Java programming language using the jcuda library, which is responsible for executing the CUDA kernel from Java. The algorithm achieves a 170x speed-up, but this depends mainly on the number of attributes used (using more than 128 attributes decreases the speed-up).
Our approach differs in that it uses the OpenCL platform and not only CUDA, and our algorithm has several improvements in comparison with some of the approaches described in this section. The main improvement is the option to run our algorithm on multiple GPUs in parallel. The OpenCL kernel was partially vectorized, and the algorithm was designed so that it does not need any sorting algorithm. These improvements speed up the algorithm. Our solution was tested on a very large data set, where the processing time was in minutes, whereas the other works processed quite small data sets in seconds.

IV. OPENCL AND CUDA INTRODUCTION
Nowadays, two widely used platforms for GPU computing exist. The first platform to be developed was CUDA [3] and the second is OpenCL [2]. CUDA is used more, but it can only be used with NVIDIA GPUs. OpenCL is used less than CUDA, but it can run on many different devices.
Compared with a common CPU solution, GPU hardware is much more specialized for intensive, highly parallel computing. This can be seen in Fig. 2 and Fig. 3, where ALU (Arithmetic Logic Unit) elements are used for computing. GPU hardware can run many more computing units in parallel than a CPU.

A. OpenCL
OpenCL (Open Computing Language) [2] is an open, royalty-free standard for parallel programming of suitable devices such as CPUs, GPUs and other devices. With OpenCL, many problems can be solved more efficiently than on a CPU alone.
GPU computing is currently very popular and many applications have been developed in OpenCL.
There are two types of OpenCL code. The first, executed on the CPU, is called the host part; the second, executed on the OpenCL device, is called the device part. An OpenCL kernel is executed on a device and can contain optimized code with OpenCL functions. OpenCL devices use the SIMT (Single Instruction, Multiple Threads) architecture. An OpenCL device consists of Streaming Multiprocessors (SMs), each of which contains many simple cores. These cores can perform only simple operations, which makes OpenCL programming more complex. The cores can execute many work-items (threads) in parallel. Work-items are grouped into work-groups; the work-items of one work-group can communicate with each other and use the same (shared) local memory. The number of work-groups and work-items has to be set at the start of the process. An OpenCL device contains on-chip and off-chip memories. On-chip memories (private, local) are faster than off-chip memories (global memory, constant memory, texture cache). However, some of the off-chip memories can be fast too, because they are cached [20].

B. CUDA
CUDA [21] is a parallel computing and programming platform; only newer NVIDIA graphics cards are supported by CUDA. Many GPU computing applications have been developed in CUDA, for example the deep learning algorithm [22] that is used for training a neural network for image recognition.
A common CUDA GPU uses the same principles as described in the OpenCL section; only the names of the terms differ (shown in Table I). In CUDA, the CPU is referred to as the host and the GPU as the device.
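The launch geometry described above (work-items grouped into work-groups, known as threads and thread blocks in CUDA) can be illustrated with a short sketch. It is written in Python for brevity and is not the authors' code: the dictionary lists the commonly used correspondence between OpenCL and CUDA terms, which may not match Table I entry for entry, and the function name launch_geometry and the group size of 256 are assumptions chosen for illustration.

```python
import math

# Commonly used correspondence between OpenCL and CUDA terminology.
OPENCL_TO_CUDA = {
    "work-item": "thread",
    "work-group": "thread block",
    "NDRange": "grid",
    "local memory": "shared memory",
    "compute unit": "streaming multiprocessor",
}


def launch_geometry(num_items, group_size):
    """Return (num_groups, padded_global_size) for a one-dimensional launch.

    The global size is rounded up to a multiple of the group size, so a
    kernel has to ignore padded work-items whose index is >= num_items.
    """
    if num_items <= 0 or group_size <= 0:
        raise ValueError("num_items and group_size must be positive")
    num_groups = math.ceil(num_items / group_size)
    return num_groups, num_groups * group_size


if __name__ == "__main__":
    # 1000 testing samples with a hypothetical group size of 256.
    print(launch_geometry(1000, 256))
```

Rounding the global size up keeps every work-group full, at the cost of a bounds check inside the kernel.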

V. OPTIMIZATION OF k-NN FOR GPUS
First, we tried to process large data sets in RapidMiner, but the original CPU version of k-NN was too slow. We therefore decided to create a GPU-accelerated algorithm that can be executed from RapidMiner. Our implementation was created in the Java programming language, because RapidMiner is also written in Java. The first step was to create the OpenCL kernel, which was written in the C programming language using OpenCL syntax. The jocl library was used for cooperation between OpenCL and Java [21]. Based on the OpenCL kernel, we created a CUDA version of this kernel using CUDA version 7.5 and the jcuda library, which was used as the Java wrapper.
Fig. 3. Scheme of GPU.
Training and testing data sets have to be converted into float arrays before they are transferred to the GPU. We used the OpenCL vector type float4, which has a big advantage: it holds four float values that are processed in one step instead of the four steps needed for plain float values. Every training and testing example is therefore stored in a float4 array. We also optimized the kernel by using local memory.
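The float4 layout can be sketched as follows. This is a CPU illustration in Python, not the authors' kernel. It assumes that unused lanes are padded with zeros and that the squared Euclidean distance is used; neither detail is stated in the text, and the function names pack_float4 and squared_distance are hypothetical.

```python
def pack_float4(attributes):
    """Pad one example to a multiple of four values and group it into 4-tuples.

    Returns (vectors, padding), where padding is the number of unused lanes
    in the last vector (an assumption: they are filled with zeros).
    """
    values = [float(a) for a in attributes]
    padding = (-len(values)) % 4
    values.extend([0.0] * padding)  # zero lanes add nothing to a distance
    vectors = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return vectors, padding


def squared_distance(a_vectors, b_vectors):
    """Squared Euclidean distance accumulated one 4-wide vector at a time."""
    total = 0.0
    for a, b in zip(a_vectors, b_vectors):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total


if __name__ == "__main__":
    vectors, padding = pack_float4(range(10))
    print(len(vectors), padding)
```

With 4 attributes every lane of the single vector is used, whereas 10 attributes need three vectors with two lanes left unused.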
In our implementation, the classical principle of k-NN was slightly modified. The differences are compared with each other as they are computed, and the k lowest differences are kept as the final nearest neighbors. Algorithm 1 shows the principle of the modified algorithm. After this modification, the algorithm runs faster for lower values of k. The classification results were the same as those of the CPU version.
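One possible reading of this modification is sketched below in Python. It is an illustration under stated assumptions, not a reproduction of Algorithm 1: the squared Euclidean distance, the majority vote and the names k_nearest and classify are all assumptions. The sketch keeps a buffer of the k best candidates and compares each new distance only with the worst buffered one, so the full list of distances is never sorted.

```python
def k_nearest(train, query, k):
    """Return k (squared_distance, train_index) pairs closest to query.

    Each new distance is compared only with the worst buffered candidate,
    so the cost grows roughly as O(n * k) and no full sort is performed.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    best = []    # at most k (distance, index) pairs, in no particular order
    worst = -1   # position of the largest distance inside best
    for index, sample in enumerate(train):
        dist = sum((a - b) ** 2 for a, b in zip(sample, query))
        if len(best) < k:
            best.append((dist, index))
        elif dist < best[worst][0]:
            best[worst] = (dist, index)
        else:
            continue
        # Rescan the small buffer to locate the new worst candidate.
        worst = max(range(len(best)), key=lambda i: best[i][0])
    return best


def classify(train, labels, query, k):
    """Majority vote over the labels of the k nearest training samples."""
    votes = {}
    for _, index in k_nearest(train, query, k):
        votes[labels[index]] = votes.get(labels[index], 0) + 1
    return max(votes, key=votes.get)


if __name__ == "__main__":
    train = [(0, 0), (1, 1), (5, 5), (6, 6), (0, 1)]
    labels = ["a", "a", "b", "b", "a"]
    print(classify(train, labels, (0.2, 0.2), 3))
```

Because the buffer is rescanned after every replacement, the work per sample grows with k, which is consistent with the observation that the approach pays off mainly for low values of k.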

A. Multi-GPU support
For multi-GPU support we created a Java library that is able to utilize all GPUs that are found. This library is available only for OpenCL. It can automatically split the input and output data and transfer them in equal parts to all devices, which decreases the amount of data transferred. Another advantage is that code can be written in Java very easily, with minimal knowledge of OpenCL. For multi-GPU support on the CUDA platform, we had to split the data among the GPUs and start the computation on each GPU in a separate Java thread, in parallel.
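The splitting scheme can be sketched as follows. The sketch is in Python rather than Java and does not use the authors' library: predict_on_device is a hypothetical CPU stand-in for one GPU that runs a 1-nearest-neighbor search, and the assumption that every device receives the full training set while the testing samples are divided is made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def predict_on_device(device_id, train, labels, test_chunk):
    """Hypothetical stand-in for one GPU: 1-nearest-neighbor search on the CPU."""
    predictions = []
    for query in test_chunk:
        nearest = min(
            range(len(train)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], query)),
        )
        predictions.append(labels[nearest])
    return predictions


def predict_multi_device(train, labels, test, num_devices):
    """Give every device the training set and one chunk of the testing set."""
    chunks = split_evenly(test, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [
            pool.submit(predict_on_device, d, train, labels, chunk)
            for d, chunk in enumerate(chunks)
        ]
        # Collect in device order so predictions line up with the test samples.
        return [label for f in futures for label in f.result()]


if __name__ == "__main__":
    train = [(0.0, 0.0), (10.0, 10.0)]
    labels = ["low", "high"]
    test = [(1.0, 1.0), (9.0, 9.0), (2.0, 0.0), (8.0, 11.0), (0.0, 3.0)]
    print(predict_multi_device(train, labels, test, 2))
```

Collecting the results in device order keeps the merged prediction vector aligned with the original order of the testing samples.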
In the case of the k-NN algorithm, the training data vector had to be transferred to all GPU devices. The testing data vector and the vector with the final predictions were split equally among all devices to lower the load on GPU memory. We also tried a version of the algorithm in which the data were not split among the GPU devices but copied in full to each device. The differences between the computing times of the two versions were negligible.

VI. RESULTS AND DISCUSSION
We performed several comparison tests to verify the functionality of our accelerated k-NN algorithms. The tests were performed with data sets containing different numbers of elements and different numbers of attributes. Since the algorithm was modified to perform well for a low k parameter (k = 5), we also carried out several tests for higher values of k (k = 10, k = 20). For the comparison between the CPU and GPU versions, we used the RapidMiner platform, which contains many machine learning and data mining algorithms. First, we generated a polynomial data set using one of the RapidMiner algorithms. This data set was then divided into two parts: the training part contained 25 % of the elements and the testing part 75 %. In the next step, several tests with different numbers of elements and attributes were performed. We used the CPU version of the k-NN algorithm integrated in RapidMiner and our GPU versions of k-NN, which were also executed in RapidMiner. The tests were performed using one CPU core, one GPU, 2 GPUs, 3 GPUs and 4 GPUs. The results show how much time each scenario took; they are given in Table IV for the OpenCL implementation and in Table V for the CUDA implementation. These measurements were performed for k = 5. As the tables show, some CPU computations can take days, whereas the GPU computation takes minutes.
Fig. 4 shows the speed-up of our OpenCL GPU implementation of the k-NN algorithm. Increasing the number of attributes can decrease the speed-up, while a higher number of elements can increase it. The scenarios for the CUDA implementation are shown in Table V, and the overall speed-up of the CUDA implementation is shown in Fig. 5. The best speed-up, 882 times, was achieved in the scenario with 1 million elements and 4 attributes.
The comparison between the CUDA and OpenCL implementations is shown in Fig. 6. For the scenarios with 100 and 1000 attributes, CUDA was about 3 % faster than OpenCL. For the scenarios with 10 attributes, the OpenCL implementation was about 11 % faster, and for 4 attributes CUDA was about 18 % faster than OpenCL. These differences for 4 and 10 attributes may be caused by the use of the float4 data type for storing the attribute array: CUDA may handle 4 attributes in one float4 array much better than 10 attributes in 3 float4 arrays, where two elements are unused. Averaged over all scenarios, CUDA was about 0.5 % faster.
Table II shows the results for different values of k (measured for all GPUs). Our OpenCL implementation was designed to work efficiently with fewer than 10 neighbors (k < 10); otherwise, the speed-up of the algorithm decreases sharply. A comparison with the CUDA implementation (see Table III) shows that CUDA is slightly faster than the OpenCL implementation.

VII. CONCLUSION
The main contribution of this work is a set of GPU-accelerated implementations of the k-Nearest Neighbor machine learning algorithm using OpenCL and CUDA. The algorithm can be executed on multiple GPUs in parallel. We created a modified version of the algorithm that achieves very good results for fewer than 10 neighbors (k < 10). We found that with relatively cheap hardware (2x NVIDIA GeForce GTX 690), it is possible to process 4 million elements (each with 10 attributes) in 3 minutes, in comparison with using a single CPU core [...]. On the other hand, OpenCL gives better results on data sets with 10 attributes. Comparing the overall results, OpenCL and CUDA achieve similar speed-ups.

Fig. 6. Speed-up comparison between the CUDA and OpenCL implementations of the k-NN algorithm.

TABLE II. OPENCL: COMPARISON FOR DIFFERENT k NEIGHBORS.