
Hello DPC++

Data Parallel C++ (DPC++) is an open, standards-based, cross-architecture language for heterogeneous computing.

DPC++ = C++ + SYCL + community extensions

The extensions aim to simplify programming and improve performance by giving programs access to hardware-specific features.

DPC++ Demo

The DPC++ example in this lesson allocates memory for an array and initializes it, offloads the task of doubling the array values to a device, and finally prints the result on the host.

Input array is initialized to: 0 1 2 3 4 5 6 7

Expected output after computation on device: 0 2 4 6 8 10 12 14


The sections below show how to write a simple Data Parallel C++ (DPC++) program.

Include Header File

A DPC++ program must include the SYCL header file and can optionally bring the sycl namespace into scope:

#include <CL/sycl.hpp>

using namespace sycl;

Set up a SYCL queue and target device

To execute a task on a device, we have to set up a SYCL queue:

queue q;

The code above uses the default selector, which lets the runtime pick the device it considers best among those available on the system. You can also pick a specific device to offload computation to, for example a GPU:

queue q(gpu_selector{});

You can target computation at any desired device by using cpu_selector, gpu_selector, accelerator_selector or intel::fpga_selector, depending on your application.

Set up memory that can be accessed by the target device

DPC++ introduces Unified Shared Memory (USM), which lets you allocate memory that both the host and the device can access and modify, and which supports both implicit and explicit movement of data between host and device.

int *data = malloc_shared<int>(N, q);

malloc_shared allocates memory that can be accessed on both the host and the device, and the runtime implicitly moves the data between them.

You can also use SYCL buffers and accessors to manage memory on the target device. These are explained in another lesson, and Unified Shared Memory is covered in more detail in a later lesson.

Offload compute task to target device

q.parallel_for submits the task to the device associated with the queue, which computes on and modifies the memory on the device. The function is executed once for each of the N items. This function, which executes on the device for each item, is also referred to as a kernel function or lambda function. We will learn more about how kernel functions work in later lessons.

  q.parallel_for(range<1>(N), [=] (id<1> i){
    data[i] *= 2;
  }).wait();

.wait() ensures that the computation is complete before the host proceeds to print the output.
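One way to picture what parallel_for does with the kernel is a plain loop that calls the kernel once per index. The host-only sketch below is not DPC++ code and runs nothing on a device. The names emulate_parallel_for and double_with_emulation are hypothetical, and a real device may execute the kernel invocations concurrently and in any order.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for q.parallel_for(range<1>(n), kernel):
// it calls the kernel once per index. A real device may run these calls
// concurrently and in any order; this loop runs them one after another.
template <typename Kernel>
void emulate_parallel_for(std::size_t n, Kernel kernel) {
  for (std::size_t i = 0; i < n; ++i) {
    kernel(i);
  }
}

// Applies the same "double each element" kernel as the snippet above.
std::vector<int> double_with_emulation(std::vector<int> data) {
  int *ptr = data.data();  // captured by value, like the USM pointer
  emulate_parallel_for(data.size(), [=](std::size_t i) { ptr[i] *= 2; });
  return data;
}
```

Because each invocation touches only its own element, the result does not depend on the order in which the indices are processed, which is what makes this kernel safe to run in parallel.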

Because the memory is allocated with malloc_shared, it is implicitly moved to the device before the computation and copied back to the host after the computation completes. The Unified Shared Memory lesson explores USM and how it operates in more detail.

Compiling DPC++ code

To compile a DPC++ program, initialize environment variables and then use dpcpp to compile as shown below:

source /opt/intel/inteloneapi/setvars.sh

dpcpp hello.cpp


As the example shows, DPC++ uses familiar C++ constructs and SYCL classes to offload computation to heterogeneous devices, and uses the DPC++ extension "Unified Shared Memory" to simplify memory management on those devices.


Data Parallel C++ Reference

Data Parallel C++ Specification
