Experimenting with OpenCL on Apalis iMX6Q

Friday, December 2, 2016

Introduction

Nowadays, technological devices are demanding increasingly higher processing power and speed than ever before. To cope with the advances in technology, companies create various ways to achieve better processing performance. One such way that Apple Inc. found was to create the Open Computing Language (OpenCL). On June 16, 2008, Apple submitted a proposal to the Khronos Group to work on OpenCL. After five months of work, OpenCL 1.0 was released to the public on December 8, 2008.

OpenCL is a low level API for parallel programming of diverse processors found in personal computers, servers, mobile devices, as well as embedded devices. The programming language for OpenCL is a C-like language. It is made to work in heterogeneous platforms containing CPU's, GPU's, and processors from such popular manufacturers as NXP®, NVIDIA®, Intel®, AMD, IBM, etc. The purpose of OpenCL is to improve the speed and responsiveness of a wide range of applications such as gaming and, entertainment, as well as scientific and medical software.

In this post, we experiment with OpenCL using Toradex's Apalis iMX6Q SoM to compare two applications - one of them running on the GPU and the other on the CPU. At the end we share the results found in this experiment.

Hardware used

Toradex's Apalis iMX6Q SoM is based on NXP's iMX6Quad processor which offers efficient processing capabilities particularly suited to multimedia applications. The processor has four ARM® Cortex®-A9 cores up to 800 MHz per core. In addition to the processor, the Apalis SoM also offers up to 2GB DDR3 RAM (64 Bit) and 4GB eMMC FLASH.

Focusing on graphics and multimedia purposes, this processor also offers a Vivante GC2000 3D GPU which is capable of running OpenCL EP (Embedded Profile); therefore, we can use the i.MX6Q GPU processing power in any program.

OpenCL support in Toradex Embedded Linux image

We start from the point where we already have an OpenEmbedded build system configured ready to build an image for Apalis iMX6. This can be achieved following our OpenEmbedded (core) article.

For building an Embedded Linux image which supports OpenCL and includes its libraries, some additional steps need to be taken.

First, edit the following file in the directory;

/meta-toradex/recipes-fsl/packagegroups/packagegroup-fsl-tools-gpu.bbappend

adding the following content:

SOC_TOOLS_GPU_append_mx6 = " \
    libopencl-mx6 \
    libgles-mx6 \
"

Also, add the package imx-gpu-viv in local.conf file:

IMAGE_INSTALL_append = "imx-gpu-viv"

Now, build a Desktop based image:

bitbake angstrom-lxde-image

GPU and CPU code

All the code in this post can be found on GitHub.

As an example, we used two basic applications which perform a simple sum of arrays. The first code runs on the GPU and the second on the CPU. The consumed time is printed after the applications finish. The header needed to use OpenCL is cl.h which can be found at /usr/include/CL in the roofts. The libraries needed to link the program are libGAL.so and libOpenCL.so. They can be found at /usr/lib.

To calculate the consumed time, we created a queue with profiling enabled and got the profiling data at the end.

Follow the OpenCL code:

//************************************************************
// Demo OpenCL application to compute a simple vector addition
// computation between 2 arrays on the GPU
// ************************************************************
#include 
#include 
#include 
#include <CL/cl.h>
//
// OpenCL source code
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
"{",
" // Index of the elements to add \n",
" unsigned int n = get_global_id(0);",
" // Sum the nth element of vectors a and b and store in c \n",
" c[n] = a[n] + b[n];",
"}"
};
// Some interesting data for the vectors
Int InitialData1[80] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[80] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
// Number of elements in the vectors to be added
#define SIZE 600000
// Main function
// ************************************************************
int main(int argc, char **argv)
{ 
     // Two integer source vectors in Host memory
     int HostVector1[SIZE], HostVector2[SIZE];
     //Output Vector
     int HostOutputVector[SIZE];
     // Initialize with some interesting repeating data
     for(int c = 0; c < SIZE; c++)
     {
          HostVector1[c] = InitialData1[c%20];
          HostVector2[c] = InitialData2[c%20];
          HostOutputVector[c] = 0;
     }
     //Get an OpenCL platform
     cl_platform_id cpPlatform;
     clGetPlatformIDs(1, &amp;cpPlatform, NULL);
     // Get a GPU device
     cl_device_id cdDevice;
     clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &amp;cdDevice, NULL);
     char cBuffer[1024];
     clGetDeviceInfo(cdDevice, CL_DEVICE_NAME, sizeof(cBuffer), &amp;cBuffer, NULL);
     printf("CL_DEVICE_NAME: %s\n", cBuffer);
     clGetDeviceInfo(cdDevice, CL_DRIVER_VERSION, sizeof(cBuffer), &amp;cBuffer, NULL);
     printf("CL_DRIVER_VERSION: %s\n\n", cBuffer);
     // Create a context to run OpenCL enabled GPU
     cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);     
     // Create a command-queue on the GPU device
     cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, cdDevice, CL_QUEUE_PROFILING_ENABLE, NULL);
     // Allocate GPU memory for source vectors AND initialize from CPU memory
     cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
     CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
     cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
     CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);
     // Allocate output memory on GPU
     cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
     sizeof(int) * SIZE, NULL, NULL);
     // Create OpenCL program with source code
     cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
     // Build the program (OpenCL JIT compilation)
     clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
     // Create a handle to the compiled OpenCL function (Kernel)
     cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
     // In the next step we associate the GPU memory with the Kernel arguments
     clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&amp;GPUOutputVector);
     clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&amp;GPUVector1);
     clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&amp;GPUVector2);
 
     //create event
     cl_event event = clCreateUserEvent(GPUContext, NULL);
 
     // Launch the Kernel on the GPU
     // This kernel only uses global data
     size_t WorkSize[1] = {SIZE}; // one dimensional Range
     clEnqueueNDRangeKernel(cqCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, NULL, 0, NULL, &amp;event);
     // Copy the output in GPU memory back to CPU memory
     clEnqueueReadBuffer(cqCommandQueue, GPUOutputVector, CL_TRUE, 0,
     SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
     // Cleanup
     clReleaseKernel(OpenCLVectorAdd);
     clReleaseProgram(OpenCLProgram);
     clReleaseCommandQueue(cqCommandQueue);
     clReleaseContext(GPUContext);
     clReleaseMemObject(GPUVector1);
     clReleaseMemObject(GPUVector2);
     clReleaseMemObject(GPUOutputVector);    
 
     clWaitForEvents(1, &amp;event);
     cl_ulong start = 0, end = 0;
     double total_time;     
 
     clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &amp;start, NULL);
     clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &amp;end, NULL);
 
     total_time = end - start;     
 
     printf("\nExecution time in milliseconds = %0.3f ms", (total_time / 1000000.0) );
     printf("\nExecution time in seconds = %0.3f s\n\n", ((total_time / 1000000.0))/1000 );          
 
     return 0;
}

The CPU code is a simple C program which computes the same sum of arrays from above. To calculate the time consumed, we used the library time.h. The code is as follows:

#include 
#include 
#include  
 
int InitialData1[80] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17,37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[80] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15,35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
 
#define SIZE 600000
 
int main(int argc, char **argv)
{
time_t start, stop;
clock_t ticks;
 
time(&amp;start);    
// Two integer source vectors in Host memory
int HostVector1[SIZE], HostVector2[SIZE];
//Output Vector
int HostOutputVector[SIZE];
// Initialize with some interesting repeating data
//int n;
for(int c = 0; c < SIZE; c++)
{
HostVector1[c] = InitialData1[c%20];
HostVector2[c] = InitialData2[c%20];
HostOutputVector[c] = 0;
}
 
for(int i = 0; i < SIZE; i++)
{
        HostOutputVector[i] = HostVector1[i] + HostVector2[i];
        ticks = clock();
}     
 
time(&amp;stop);
 
printf("\nExecution time in miliseconds = %0.3f ms",((double)ticks/CLOCKS_PER_SEC)*1000);
 
printf("\nExecution time in seconds = %0.3f s\n\n", (double)ticks/CLOCKS_PER_SEC);
 
return 0;
}

Cross compiling the applications

One Makefile can be used to cross compile both GPU and CPU applications. The following three variables need your attention. You can change them according to your system:

ROOTFS_DIR -> the sysroots path of Apalis iMX6
APPNAME -> the name of you application
TOOLCHAIN -> the the cross compiler toolchain path

export ARCH=arm
export ROOTFS_DIR=/usr/local/toradex-linux-v2.5/oe-core/build/out-glibc/sysroots/apalis-imx6
 
APPNAME = proc_sample
TOOLCHAIN = /home/prjs/toolchain/gcc-linaro
 
CROSS_COMPILER = $(TOOLCHAIN)/bin/arm-linux-gnueabihf-
CC= $(CROSS_COMPILER)gcc
DEL_FILE = rm -rf
CP_FILE = cp -rf
TARGET_PATH_LIB = $(ROOTFS_DIR)/usr/lib
TARGET_PATH_INCLUDE = $(ROOTFS_DIR)/usr/include
CFLAGS = -DLINUX -DUSE_SOC_MX6 -Wall -std=c99 -O2 -fsigned-char -march=armv7-a -mfpu=neon -DEGL_API_FB -DGPU_TYPE_VIV -DGL_GLEXT_PROTOTYPES -DENABLE_GPU_RENDER_20 -I../include -I$(TARGET_PATH_INCLUDE)
LFLAGS = -Wl,--library-path=$(TARGET_PATH_LIB),-rpath-link=$(TARGET_PATH_LIB) -lm -lglib-2.0 -lOpenCL -lCLC -ldl -lpthread
OBJECTS = $(APPNAME).o
first: all
all: $(APPNAME)
$(APPNAME): $(OBJECTS)
$(CC) $(LFLAGS) -o $(APPNAME) $(OBJECTS)
$(APPNAME).o: $(APPNAME).c
$(CC) $(CFLAGS) -c -o $(APPNAME).o $(APPNAME).c
clean:
$(DEL_FILE) $(APPNAME)

Save the Makefile in the same folder of your applications and run make.
Copy the generated binaries to Apalis iMX6.

End results

After running both applications we got the following results:

### Processor time
Execution time in miliseconds = 778.999 ms
Execution time in seconds = 0.779 s 
 
### GPU time 
Execution time in milliseconds = 12.324 ms
Execution time in seconds = 0.012 s

Based on these results, we can clearly see that we were able to speed up the array sum by using OpenCL with Apalis iMX6Q GPU's processing capabilities.

Conclusion

Customers willing to take advantage of Apalis iMX6Q GPU can, among other methods, use OpenCL to increase computing power. With OpenCL capabilities, it is possible to run code in devices ranging from graphics cards to supercomputers as well as embedded devices, as seen in this post. Customers could also go further combining, for example, OpenCL with OpenCV to increase performance in computer vision applications. This can be used as an example of endless possibilities of applications that a company can develop.

References

https://www.khronos.org/opencl/
https://en.wikipedia.org/wiki/OpenCL
http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854
https://community.freescale.com/docs/DOC-93984
https://community.freescale.com/docs/DOC-100694
http://developer.toradex.com/products/apalis-imx6
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
http://parallelis.com/how-to-measure-opencl-kernel-execution-time/
https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro

This blog post was originally featured on Embarcados.com in Portuguese. See here.

Author: Giovanni Bauermeister, Toradex Brasil

Search This Blog

Embedded Press

Experimenting with OpenCL on Apalis iMX6Q

Comments

Post a Comment

Popular posts from this blog

Patching Kernel in OpenEmbedded

Toradex WinCE Libraries' Evolution: The New Library Approach Explained

Android 7.1 on Apalis TK1