Experimenting with OpenCL on Apalis iMX6Q
Friday, December 2, 2016
Introduction
Nowadays, technological devices are demanding increasingly higher processing power and speed than ever before. To cope with the advances in technology, companies create various ways to achieve better processing performance. One such way that Apple Inc. found was to create the Open Computing Language (OpenCL). On June 16, 2008, Apple submitted a proposal to the Khronos Group to work on OpenCL. After five months of work, OpenCL 1.0 was released to the public on December 8, 2008.
OpenCL is a low level API for parallel programming of diverse processors found in personal computers, servers, mobile devices, as well as embedded devices. The programming language for OpenCL is a C-like language. It is made to work in heterogeneous platforms containing CPU's, GPU's, and processors from such popular manufacturers as NXP®, NVIDIA®, Intel®, AMD, IBM, etc. The purpose of OpenCL is to improve the speed and responsiveness of a wide range of applications such as gaming and, entertainment, as well as scientific and medical software.
In this post, we experiment with OpenCL using Toradex's Apalis iMX6Q SoM to compare two applications - one of them running on the GPU and the other on the CPU. At the end we share the results found in this experiment.
Hardware used
Toradex's Apalis iMX6Q SoM is based on NXP's iMX6Quad processor which offers efficient processing capabilities particularly suited to multimedia applications. The processor has four ARM® Cortex®-A9 cores up to 800 MHz per core. In addition to the processor, the Apalis SoM also offers up to 2GB DDR3 RAM (64 Bit) and 4GB eMMC FLASH.
Focusing on graphics and multimedia purposes, this processor also offers a Vivante GC2000 3D GPU which is capable of running OpenCL EP (Embedded Profile); therefore, we can use the i.MX6Q GPU processing power in any program.
OpenCL support in Toradex Embedded Linux image
We start from the point where we already have an OpenEmbedded build system configured ready to build an image for Apalis iMX6. This can be achieved following our OpenEmbedded (core) article.
For building an Embedded Linux image which supports OpenCL and includes its libraries, some additional steps need to be taken.
First, edit the following file in the directory;
/meta-toradex/recipes-fsl/packagegroups/packagegroup-fsl-tools-gpu.bbappend
adding the following content:
SOC_TOOLS_GPU_append_mx6 = " \ libopencl-mx6 \ libgles-mx6 \ "
Also, add the package imx-gpu-viv in local.conf file:
IMAGE_INSTALL_append = "imx-gpu-viv"
Now, build a Desktop based image:
bitbake angstrom-lxde-image
GPU and CPU code
All the code in this post can be found on GitHub.
As an example, we used two basic applications which perform a simple sum of arrays. The first code runs on the GPU and the second on the CPU. The consumed time is printed after the applications finish. The header needed to use OpenCL is cl.h which can be found at /usr/include/CL in the roofts. The libraries needed to link the program are libGAL.so and libOpenCL.so. They can be found at /usr/lib.
To calculate the consumed time, we created a queue with profiling enabled and got the profiling data at the end.
Follow the OpenCL code:
//************************************************************ // Demo OpenCL application to compute a simple vector addition // computation between 2 arrays on the GPU // ************************************************************ #include #include #include #include <CL/cl.h> // // OpenCL source code const char* OpenCLSource[] = { "__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)", "{", " // Index of the elements to add \n", " unsigned int n = get_global_id(0);", " // Sum the nth element of vectors a and b and store in c \n", " c[n] = a[n] + b[n];", "}" }; // Some interesting data for the vectors Int InitialData1[80] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17}; int InitialData2[80] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15}; // Number of elements in the vectors to be added #define SIZE 600000 // Main function // ************************************************************ int main(int argc, char **argv) { // Two integer source vectors in Host memory int HostVector1[SIZE], HostVector2[SIZE]; //Output Vector int HostOutputVector[SIZE]; // Initialize with some interesting repeating data for(int c = 0; c < SIZE; c++) { HostVector1[c] = InitialData1[c%20]; HostVector2[c] = InitialData2[c%20]; HostOutputVector[c] = 0; } //Get an OpenCL platform cl_platform_id cpPlatform; clGetPlatformIDs(1, &cpPlatform, NULL); // Get a GPU device cl_device_id cdDevice; clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL); char cBuffer[1024]; clGetDeviceInfo(cdDevice, CL_DEVICE_NAME, sizeof(cBuffer), &cBuffer, NULL); printf("CL_DEVICE_NAME: %s\n", cBuffer); clGetDeviceInfo(cdDevice, CL_DRIVER_VERSION, sizeof(cBuffer), &cBuffer, NULL); printf("CL_DRIVER_VERSION: %s\n\n", cBuffer); // Create a context to run OpenCL enabled GPU cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL); // Create a command-queue on the GPU device cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, CL_QUEUE_PROFILING_ENABLE, NULL); // Allocate GPU memory for source vectors AND initialize from CPU memory cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL); cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL); // Allocate output memory on GPU cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, NULL); // Create OpenCL program with source code cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL); // Build the program (OpenCL JIT compilation) clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL); // Create a handle to the compiled OpenCL function (Kernel) cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL); // In the next step we associate the GPU memory with the Kernel arguments clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector); clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1); clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2); //create event cl_event event = clCreateUserEvent(GPUContext, NULL); // Launch the Kernel on the GPU // This kernel only uses global data size_t WorkSize[1] = {SIZE}; // one dimensional Range clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, &event); // Copy the output in GPU memory back to CPU memory clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0, SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL); // Cleanup clReleaseKernel(OpenCLVectorAdd); clReleaseProgram(OpenCLProgram); clReleaseCommandQueue(cqCommandQueue); clReleaseContext(GPUContext); clReleaseMemObject(GPUVector1); clReleaseMemObject(GPUVector2); clReleaseMemObject(GPUOutputVector); clWaitForEvents(1, &event); cl_ulong start = 0, end = 0; double total_time; clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL); clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL); total_time = end - start; printf("\nExecution time in milliseconds = %0.3f ms", (total_time / 1000000.0) ); printf("\nExecution time in seconds = %0.3f s\n\n", ((total_time / 1000000.0))/1000 ); return 0; }
The CPU code is a simple C program which computes the same sum of arrays from above. To calculate the time consumed, we used the library time.h. The code is as follows:
#include #include #include int InitialData1[80] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17}; int InitialData2[80] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15}; #define SIZE 600000 int main(int argc, char **argv) { time_t start, stop; clock_t ticks; time(&start); // Two integer source vectors in Host memory int HostVector1[SIZE], HostVector2[SIZE]; //Output Vector int HostOutputVector[SIZE]; // Initialize with some interesting repeating data //int n; for(int c = 0; c < SIZE; c++) { HostVector1[c] = InitialData1[c%20]; HostVector2[c] = InitialData2[c%20]; HostOutputVector[c] = 0; } for(int i = 0; i < SIZE; i++) { HostOutputVector[i] = HostVector1[i] + HostVector2[i]; ticks = clock(); } time(&stop); printf("\nExecution time in miliseconds = %0.3f ms",((double)ticks/CLOCKS_PER_SEC)*1000); printf("\nExecution time in seconds = %0.3f s\n\n", (double)ticks/CLOCKS_PER_SEC); return 0; }
Cross compiling the applications
One Makefile can be used to cross compile both GPU and CPU applications. The following three variables need your attention. You can change them according to your system:
- ROOTFS_DIR -> the sysroots path of Apalis iMX6
- APPNAME -> the name of you application
- TOOLCHAIN -> the the cross compiler toolchain path
export ARCH=arm export ROOTFS_DIR=/usr/local/toradex-linux-v2.5/oe-core/build/out-glibc/sysroots/apalis-imx6 APPNAME = proc_sample TOOLCHAIN = /home/prjs/toolchain/gcc-linaro CROSS_COMPILER = $(TOOLCHAIN)/bin/arm-linux-gnueabihf- CC= $(CROSS_COMPILER)gcc DEL_FILE = rm -rf CP_FILE = cp -rf TARGET_PATH_LIB = $(ROOTFS_DIR)/usr/lib TARGET_PATH_INCLUDE = $(ROOTFS_DIR)/usr/include CFLAGS = -DLINUX -DUSE_SOC_MX6 -Wall -std=c99 -O2 -fsigned-char -march=armv7-a -mfpu=neon -DEGL_API_FB -DGPU_TYPE_VIV -DGL_GLEXT_PROTOTYPES -DENABLE_GPU_RENDER_20 -I../include -I$(TARGET_PATH_INCLUDE) LFLAGS = -Wl,--library-path=$(TARGET_PATH_LIB),-rpath-link=$(TARGET_PATH_LIB) -lm -lglib-2.0 -lOpenCL -lCLC -ldl -lpthread OBJECTS = $(APPNAME).o first: all all: $(APPNAME) $(APPNAME): $(OBJECTS) $(CC) $(LFLAGS) -o $(APPNAME) $(OBJECTS) $(APPNAME).o: $(APPNAME).c $(CC) $(CFLAGS) -c -o $(APPNAME).o $(APPNAME).c clean: $(DEL_FILE) $(APPNAME)
Save the Makefile in the same folder of your applications and run make.
Copy the generated binaries to Apalis iMX6.
End results
After running both applications we got the following results:
### Processor time Execution time in miliseconds = 778.999 ms Execution time in seconds = 0.779 s ### GPU time Execution time in milliseconds = 12.324 ms Execution time in seconds = 0.012 s
Based on these results, we can clearly see that we were able to speed up the array sum by using OpenCL with Apalis iMX6Q GPU's processing capabilities.
Conclusion
Customers willing to take advantage of Apalis iMX6Q GPU can, among other methods, use OpenCL to increase computing power. With OpenCL capabilities, it is possible to run code in devices ranging from graphics cards to supercomputers as well as embedded devices, as seen in this post. Customers could also go further combining, for example, OpenCL with OpenCV to increase performance in computer vision applications. This can be used as an example of endless possibilities of applications that a company can develop.
References
https://www.khronos.org/opencl/
https://en.wikipedia.org/wiki/OpenCL
http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854
https://community.freescale.com/docs/DOC-93984
https://community.freescale.com/docs/DOC-100694
http://developer.toradex.com/products/apalis-imx6
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
http://parallelis.com/how-to-measure-opencl-kernel-execution-time/
https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro
https://en.wikipedia.org/wiki/OpenCL
http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854
https://community.freescale.com/docs/DOC-93984
https://community.freescale.com/docs/DOC-100694
http://developer.toradex.com/products/apalis-imx6
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
http://parallelis.com/how-to-measure-opencl-kernel-execution-time/
https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro
This blog post was originally featured on Embarcados.com in Portuguese. See here.
Author: Giovanni Bauermeister, Toradex Brasil
Comments
Post a Comment