DPC++
Hello DPC++
Data Parallel C++, or DPC++, is an open, standards-based and cross-architecture language for heterogeneous computing.
DPC++ = C++ + SYCL + community extensions
The extensions aim to simplify programming and improve performance by giving programs access to hardware-specific features.
DPC++ Demo
The DPC++ example shown below allocates memory for an array and initializes it with some values. The task of doubling the array values is then offloaded to a device, and finally the result is printed on the host.
Input array is initialized to: 0 1 2 3 4 5 6 7
Expected output after computation on device: 0 2 4 6 8 10 12 14
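For comparison, the same computation can be written as ordinary host-only C++ with no device involved. The sketch below is not DPC++; the helper name double_on_host is hypothetical and exists only to make the input and expected output concrete.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only baseline (plain C++, no SYCL): build the input
// 0..n-1 and double every element in a sequential loop.
std::vector<int> double_on_host(std::size_t n) {
    std::vector<int> data(n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<int>(i);  // input: 0 1 2 ... n-1
    }
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= 2;                   // the work that DPC++ will offload
    }
    return data;
}
```

With n = 8 this yields 0 2 4 6 8 10 12 14, the same values expected from the device.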
The sections below show how to write a simple Data Parallel C++ (DPC++) program.
Include Header File
A DPC++ program has to include the SYCL header file and can optionally specify the namespace:
#include <CL/sycl.hpp>
using namespace sycl;
Set up a SYCL queue and target device
To execute a task on a device, we have to set up a SYCL queue:
queue q;
The code above uses the default selector, which picks the best device available on the system. You can also pick a specific device to offload computation to, for example a GPU:
queue q(gpu_selector{});
You can target computation to any desired device by using cpu_selector, gpu_selector, accelerator_selector or intel::fpga_selector, depending on your application.
Set up memory that can be accessed by the target device
DPC++ introduces Unified Shared Memory (USM), which lets you allocate memory that both host and device can access and modify, and which supports both implicit and explicit movement of data between host and device.
int *data = malloc_shared<int>(N, q);
malloc_shared allocates memory that can be accessed on both host and device, and implicitly moves the data between them.
You can also use SYCL buffers and accessors to manage memory on the target device. These are explained in one of the other lessons, and a later lesson covers Unified Shared Memory in more detail.
Offload compute task to target device
q.parallel_for submits the task to the device associated with the queue, which computes on and modifies the memory on the device. The function is executed once for each of N items. The function that executes on the device for each item is also referred to as a kernel function or lambda function. We will learn more about how kernel functions work in later lessons.
q.parallel_for(range<1>(N), [=] (id<1> i){
data[i] *= 2;
}).wait();
.wait() ensures that the computation is complete before the host proceeds to print the output.
Since the memory is allocated using malloc_shared, it is implicitly moved to the device before the computation and copied back to the host after the computation is complete. The Unified Shared Memory lesson explores USM and how it operates in more detail.
Compiling DPC++ code
To compile a DPC++ program, initialize the environment variables and then use dpcpp to compile, as shown below:
source /opt/intel/inteloneapi/setvars.sh
dpcpp hello.cpp
Summary
As the example shows, DPC++ uses familiar C++ constructs and SYCL classes to offload computation to heterogeneous devices, and uses the new DPC++ extension "Unified Shared Memory" to simplify memory management on those devices.