Sycl, OneAPI, DPC++

41.1 : Logistics
41.2 : Platforms and devices
41.3 : Queues
41.3.1 : Device selectors
41.3.2 : Queue execution
41.3.3 : Kernel ordering
41.4 : Kernels
41.5 : Parallel operations
41.5.1 : Loops
41.5.1.1 : Loop bounds: ranges
41.5.1.2 : Loop indices
41.5.1.3 : Multi-dimensional indexing
41.5.2 : Task dependencies
41.5.3 : Race conditions
41.5.4 : Reductions
41.6 : Memory access
41.6.1 : Unified shared memory
41.6.2 : Buffers and accessors
41.6.3 : Querying
41.7 : Parallel output
41.8 : DPCPP extensions
41.9 : Intel devcloud notes
41.10 : Examples
41.10.1 : Kernels in a loop
41.10.2 : Stencil computations

41 Sycl, OneAPI, DPC++

This chapter explains the basic concepts of Sycl/Dpc++, and helps you get started on running your first program.

  • SYCL is a C++-based language for portable parallel programming.
  • DPCPP is Intel's extension of SYCL.
  • OneAPI is Intel's compiler suite, which contains the DPCPP compiler.

\begin{dpcppnote} The various Intel extensions are listed here: https://spec.oneapi.com/versions/latest/elements/dpcpp/source/index.html#extensions-table \end{dpcppnote}

41.1 Logistics


Headers:

#include <CL/sycl.hpp>

You can import the namespace, but with care! If you use

using namespace cl;

you still have to prefix all SYCL classes with sycl::, which is a bit of a bother. However, if you use

using namespace cl::sycl;

you run into the fact that SYCL has its own versions of many STL commands, so you will get name collisions. The most obvious example is that the cl::sycl namespace has its own versions of cout and endl. Therefore you have to write std::cout and std::endl explicitly. Using the wrong I/O will cause tons of inscrutable error messages. Additionally, SYCL has its own version of free, and of several math routines.
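For instance, a minimal sketch with the full namespace imported (names such as the output string are illustrative):

#include <CL/sycl.hpp>
using namespace cl::sycl;

int main() {
  queue q;                   // SYCL class, no prefix needed
  std::cout << "Device: "    // explicit std::, not the sycl version
            << q.get_device().get_info<info::device::name>()
            << std::endl;    // std::endl, not sycl::endl
  return 0;
}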

\begin{dpcppnote} In DPCPP the namespace is simply sycl:

    using namespace sycl;

\end{dpcppnote}

41.2 Platforms and devices


Since DPCPP is cross-platform, we first need to discover the available devices.

First we list the platforms:

// devices.cxx
std::vector<sycl::platform> platforms = sycl::platform::get_platforms();
for (const auto &plat : platforms) {
  // get_info is templated over the property that is requested
  std::cout << "Platform: "
            << plat.get_info<sycl::info::platform::name>() << " "
            << plat.get_info<sycl::info::platform::vendor>() << " "
            << plat.get_info<sycl::info::platform::version>() << std::endl;

Then for each platform we list the devices:

std::vector<sycl::device> devices = plat.get_devices();
for (const auto &dev : devices) {
  std::cout << "-- Device: "
            << dev.get_info<sycl::info::device::name>()
            << (dev.is_host() ? ": is the host" : "")
            << (dev.is_cpu() ? ": is a cpu" : "")
            << (dev.is_gpu() ? ": is a gpu" : "")
            << std::endl;

You can query what type of device you are dealing with through the is_host, is_cpu, is_gpu member functions, as used above.
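Putting the two loops together, a minimal sketch of a complete discovery program (error handling omitted):

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  for (const auto &plat : sycl::platform::get_platforms()) {
    std::cout << "Platform: "
              << plat.get_info<sycl::info::platform::name>() << std::endl;
    for (const auto &dev : plat.get_devices()) {
      std::cout << "-- Device: "
                << dev.get_info<sycl::info::device::name>()
                << (dev.is_cpu() ? " (cpu)" : "")
                << (dev.is_gpu() ? " (gpu)" : "")
                << std::endl;
    }
  }
  return 0;
}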

41.3 Queues


The execution mechanism of SYCL is the queue: a sequence of actions that will be executed on a selected device. The only user action is submitting work to a queue; the queue is executed at the end of the scope where it is declared.

Queue execution is asynchronous with host code.
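Schematically (a sketch in the placeholder style used below, not a complete program):

sycl::queue myqueue;
myqueue.submit( /* kernel 1 */ );  // returns immediately
myqueue.submit( /* kernel 2 */ );  // host code continues running
/* ... host work can overlap with the kernels ... */
myqueue.wait();                    // block the host until all work is done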

41.3.1 Device selectors


You need to select a device on which to execute the queue. A queue is coupled to one specific device, so it can not spread work over multiple devices. Declaring a queue without an explicit device gives you a default device:

  sycl::queue myqueue;

The following example explicitly assigns the queue to the CPU device using the sycl::cpu_selector:

// cpuname.cxx
sycl::queue myqueue( sycl::cpu_selector{} );

The sycl::host_selector makes the code run on the host.

It is good for your sanity to print the name of the device you are running on:

// devname.cxx
std::cout << myqueue.get_device().get_info<sycl::info::device::name>()
          << std::endl;

If you try to select a device that is not available, a sycl::runtime_error exception is thrown.
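This suggests guarding the selection; a minimal sketch, assuming the SYCL 1.2.1 selector and exception classes:

sycl::queue myqueue;
try {
  // this throws if no gpu is available
  myqueue = sycl::queue( sycl::gpu_selector{} );
} catch (sycl::exception &e) {
  std::cout << "No gpu: " << e.what() << std::endl;
  myqueue = sycl::queue( sycl::cpu_selector{} ); // fall back to the cpu
}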

\begin{dpcppnote} DPCPP has a selector for FPGA devices:

    #include "CL/sycl/intel/fpga_extensions.hpp"
    fpga_selector

\end{dpcppnote}

41.3.2 Queue execution


It seems that kernels are also executed when they go out of scope, even if the queue itself does not:

cpu_selector selector;
queue q(selector);
{
  q.submit( /* some kernel */ );
} // here the kernel executes

41.3.3 Kernel ordering


Kernels are not necessarily executed in the order in which they are submitted. You can enforce submission order by specifying an in-order queue:

sycl::queue myqueue{ sycl::property::queue::in_order() };
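On such a queue, kernels execute strictly in submission order; a sketch:

sycl::queue myqueue{ sycl::property::queue::in_order() };
myqueue.submit( /* kernel A */ );  // runs first
myqueue.submit( /* kernel B */ );  // starts only after A finishes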

41.4 Kernels


One kernel per submit.

myqueue.submit( [&] ( handler &commandgroup ) {
    commandgroup.parallel_for<uniquename>
      ( range<1>{N},
        [=] ( id<1> idx ) { ... idx }
      );
  } );

Note that the lambda in the kernel captures by value. Capturing by reference makes no sense, since the kernel is executed on a device.

cgh.single_task(
  [=]() {
    // kernel function is executed EXACTLY once on a SINGLE work-item
});
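Putting these elements together, a minimal sketch of a complete program; buffers and accessors are explained in section 41.6.2, and the kernel name `scale` is arbitrary:

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  const size_t N = 100;
  std::vector<float> values(N, 1.f);
  sycl::queue myqueue;
  { // buffer scope: data is copied back when the buffer is destroyed
    sycl::buffer<float,1> buf( values.data(), sycl::range<1>(N) );
    myqueue.submit( [&] (sycl::handler &h) {
      auto acc = buf.get_access<sycl::access::mode::read_write>(h);
      h.parallel_for<class scale>
        ( sycl::range<1>(N),
          [=] (sycl::id<1> idx) { acc[idx] *= 2.f; } );
    } );
  }
  std::cout << "values[0] = " << values[0] << std::endl; // prints 2
  return 0;
}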

The submit call returns an event object:

auto myevent = myqueue.submit( /* stuff */ );

This can be used for two purposes:

  1. It becomes possible to wait for this specific event:

    myevent.wait();
    
  2. It can be used to indicate kernel dependencies:

    myqueue.submit( [=] (handler &h) {
        h.depends_on(myevent);
        /* stuff */
        } );
    

41.5 Parallel operations


41.5.1 Loops


In its simplest form, parallel_for iterates over a range, handing the kernel a multi-dimensional index:

cgh.parallel_for(
  range<3>(1024,1024,1024),
  // using 3D in this example
  [=](id<3> myID) {
    // kernel function is executed on an n-dimensional range (ND-range)
});


An nd_range specifies both the global size and the work-group size; the kernel then receives an nd_item:

cgh.parallel_for(
  nd_range<3>( {1024,1024,1024},{16,16,16} ),
  // using 3D in this example
  [=](nd_item<3> myID) {
    // kernel function is executed on an n-dimensional range (ND-range)
});


In hierarchical parallelism, parallel_for_work_group executes its kernel once per work group:

cgh.parallel_for_work_group(
  range<2>(1024,1024),
  // using 2D in this example
  [=](group<2> myGroup) {
    // kernel function is executed once per work-group
});


Inside a work-group kernel, parallel_for_work_item iterates over the work-items of the group:

grp.parallel_for_work_item(
  range<1>(1024),
  // using 1D in this example
  [=](h_item<1> myItem) {
    // kernel function is executed once per work-item
});

41.5.1.1 Loop bounds: ranges


SYCL adopts the modern C++ philosophy that one does not iterate by explicitly enumerating indices, but by indicating their range. This is realized with the sycl::range class, which is templated over the number of space dimensions.

sycl::range<2> matrix{10,10};

Some compilers are sensitive to the type of the integer arguments:

sycl::range<1> array{ static_cast<size_t>(size) };

41.5.1.2 Loop indices


Kernels such as parallel_for expect two arguments:

  • a range over which to iterate, and
  • a lambda of one argument: an index.

There are several ways of indexing. The simplest is the sycl::id class of multi-dimensional indices:

myHandle.parallel_for<class uniqueID>
   ( mySize,
     [=]( id<1> index ) {
       float x = index.get(0) * h;
       deviceAccessorA[index] *= 2.;
     }
   )

cgh.parallel_for<class foo>(
    range<1>{D*D*D},
    [=](id<1> item) {
        xx[ item[0] ] = 2 * item[0] + 1;
    }
)

While the C++ vectors remain one-dimensional, DPCPP allows you to iterate over them with multi-dimensional ranges:

std::vector<int> y(D*D*D);
buffer<int,1> y_buf(y.data(), range<1>(D*D*D));
cgh.parallel_for<class foo2D>
   (range<2>{D,D*D},
    [=](id<2> item) {
        yy[ item[0] + D*item[1] ] = 2;
    }
   );

\begin{dpcppnote} There is an implicit conversion from the one-dimensional sycl::id<1> to size_t, so

[=](sycl::id<1> i) {
   data[i] = i;
}

is legal, which in SYCL requires

data[i[0]] = i[0];

\end{dpcppnote}

41.5.1.3 Multi-dimensional indexing


As a two-dimensional example, consider a stencil update on the interior of a range:

// stencil2d.cxx
sycl::range<2> stencil_range(N, M);
sycl::range<2> alloc_range(N + 2, M + 2);
std::vector<float>
  input(alloc_range.size()),
  output(alloc_range.size());
sycl::buffer<float, 2> input_buf(input.data(), alloc_range);
sycl::buffer<float, 2> output_buf(output.data(), alloc_range);

constexpr size_t B = 4;
sycl::range<2> local_range(B, B);
sycl::range<2> tile_range = local_range + sycl::range<2>(2, 2); // Includes boundary cells
auto tile = local_accessor<float, 2>(tile_range, h); // see templated def'n above

We first copy global data into an array local to the work group:

sycl::id<2> offset(1, 1);
h.parallel_for
  ( sycl::nd_range<2>(stencil_range, local_range, offset),
    [=] ( sycl::nd_item<2> it ) {
// Load this tile into work-group local memory
    sycl::id<2>    lid    = it.get_local_id();
    sycl::range<2> lrange = it.get_local_range();
    for   (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
      for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
        int gi = ti + B * it.get_group(0);
        int gj = tj + B * it.get_group(1);
        tile[ti][tj] = input[gi][gj];
      }
    }

Global coordinates in the input are computed from the work-group number and the coordinates within the tile:

[=] ( sycl::nd_item<2> it ) {
for   (int ti ... ) {
  for (int tj ... ) {
    int gi = ti + B * it.get_group(0);
    int gj = tj + B * it.get_group(1);
    ... = input[gi][gj];

Local coordinates in the tile, including the boundary layer, are traversed with a stride equal to the work-group size, so that the work-items collectively load the whole tile:

[=] ( sycl::nd_item<2> it ) {
sycl::id<2>    lid    = it.get_local_id();
sycl::range<2> lrange = it.get_local_range();
for   (int ti = lid[0]; ti < B + 2; ti += lrange[0]) {
  for (int tj = lid[1]; tj < B + 2; tj += lrange[1]) {
    tile[ti][tj] = ..

41.5.2 Task dependencies


Each submit call returns a token. This makes it possible to specify task dependencies: a later task can declare a dependency by referring to the token of an earlier one.

queue myQueue;
auto myTokA = myQueue.submit
   ( [&](handler& h) {
       h.parallel_for<class taskA>(...);
     }
   );
auto myTokB = myQueue.submit
   ( [&](handler& h) {
       h.depends_on(myTokA);
       h.parallel_for<class taskB>(...);
     }
   );

41.5.3 Race conditions


SYCL has the same problems with race conditions that other shared memory systems have:

// sum1d.cxx
auto array_accessor =
  array_buffer.get_access<sycl::access::mode::read>(h);
auto scalar_accessor =
  scalar_buffer.get_access<sycl::access::mode::read_write>(h);
h.parallel_for<class uniqueID>
  ( array_range,
    [=](sycl::id<1> index)
    {
      scalar_accessor[0] += array_accessor[index];
    }
    ); // end of parallel for

To get this working correctly we need either a reduction primitive or atomics on the accumulator; the SYCL 2020 standard has both reductions and improved atomics. Here is the same summation as a reduction:

// reduct1d.cxx
auto input_values = array_buffer.get_access<sycl::access::mode::read>(h);
auto sum_reduction = sycl::reduction( scalar_buffer,h,std::plus<>() );
h.parallel_for
  ( array_range,sum_reduction,
    [=]( sycl::id<1> index,auto& sum )
    {
      sum += input_values[index];
    }
    ); // end of parallel for
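Alternatively, SYCL 2020 atomics can repair the race by making each update atomic; a minimal sketch, assuming a float accumulator and a SYCL 2020 compiler:

h.parallel_for<class atomicsum>
  ( array_range,
    [=] (sycl::id<1> index) {
      // wrap the accumulator in an atomic reference
      sycl::atomic_ref<float,
                       sycl::memory_order::relaxed,
                       sycl::memory_scope::device>
        atomic_sum( scalar_accessor[0] );
      atomic_sum.fetch_add( array_accessor[index] );
    } );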

41.5.4 Reductions


Reduction operations were added in the SYCL 2020 Provisional Standard, meaning that they are not yet finalized.

Here is one approach, which works in hipsycl:

// reductscalar.cxx
auto reduce_to_sum =
  sycl::reduction( sum_array, static_cast<float>(0.), std::plus<float>() );
myqueue.parallel_for
  ( array_range,   // sycl::range<1>(input_size)
    reduce_to_sum, // sycl::reduction(output, identity, op)
    [=] (sycl::id<1> idx, auto& reducer) { // reducer type is implementation-dependent, so use auto
      reducer.combine(shared_array[idx[0]]);
      // equivalently: reducer += shared_array[idx[0]];
    } ).wait();

Here a sycl::reduction object is created from the target data, the identity, and the reduction operator. This object is then passed to the parallel_for, whose kernel applies the combine method to each element.

41.6 Memory access


Memory treatment in SYCL is a little complicated, because there is (at the very least) host memory and device memory, which are not necessarily coherent.

There are also three mechanisms:

  • Unified Shared Memory, based on ordinary C/C++ pointers.
  • Buffers, using the sycl::buffer class; kernels access these through the sycl::accessor class.
  • Images.

TABLE: Memory types and treatments

Location | allocation    | coherence | copy to/from device
Host     | malloc        |           | explicit transfer
Device   | malloc_device |           | explicit memcpy
Shared   | malloc_shared | coherent  | automatic

41.6.1 Unified shared memory


Memory allocated with malloc_host or malloc_shared is visible on the host:

// outshared.cxx
floattype
  *host_float = (floattype*)malloc_host( sizeof(floattype),ctx ),
  *shar_float = (floattype*)malloc_shared( sizeof(floattype),dev,ctx );
cgh.single_task
  ( [=] () {
      shar_float[0] = 2 * host_float[0];
      sout << "Device sets " << shar_float[0] << sycl::endl;
    } );

Device memory is allocated with malloc_device, passing the device and context (or the queue) as parameters:

// reductimpl.cxx
floattype
  *host_float = (floattype*)malloc( sizeof(floattype) ),
  *devc_float = (floattype*)malloc_device( sizeof(floattype),dev,ctx );
[&](sycl::handler &cgh) {
  cgh.memcpy(devc_float,host_float,sizeof(floattype));
}

Note the corresponding sycl::free call, which also takes the queue as parameter.

Note that you need to be in a parallel task. The following gives a segmentation error:

  [&](sycl::handler &cgh) {
    shar_float[0] = host_float[0];
  }

Ordinary memory, for instance from malloc, has to be copied in a kernel:

[&](sycl::handler &cgh) {
  cgh.memcpy(devc_float,host_float,sizeof(floattype));
}
[&](sycl::handler &cgh) {
  sycl::stream sout(1024, 256, cgh);
  cgh.single_task
    ( [=] () {
        sout << "Number " << devc_float[0] << sycl::endl;
      } );
} // end of submitted lambda
free(host_float);
sycl::free(devc_float,myqueue);
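As a summary of the mechanism, a minimal sketch of shared memory with the SYCL 2020 queue-based allocation calls (N is an assumed array size):

sycl::queue myqueue;
// shared memory is visible on host and device
float *data = (float*)sycl::malloc_shared( N*sizeof(float), myqueue );
myqueue.submit( [&] (sycl::handler &cgh) {
  cgh.parallel_for( sycl::range<1>(N),
    [=] (sycl::id<1> i) { data[i] = 2.f * i[0]; } );
} ).wait();
std::cout << data[N-1] << std::endl; // the host reads the result directly
sycl::free( data, myqueue );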

41.6.2 Buffers and accessors


Arrays need to be declared in such a way that they can be accessed from any device.

// forloop.cxx
std::vector<int> myArray(SIZE);
range<1> mySize{myArray.size()};
buffer<int, 1> bufferA(myArray.data(), mySize);

Inside the kernel, the array is then unpacked from the buffer:

myqueue.submit( [&] (handler &h) {
  auto deviceAccessorA =
    bufferA.get_access<access::mode::read_write>(h);

However, the result of get_access is not the array itself but a sycl::accessor object. The precise type is templated and complicated, so this is a good place to use auto.

Accessors can have an access mode associated: sycl::access::mode::read, sycl::access::mode::write, or sycl::access::mode::read_write.

\begin{dpcppnote} The buffer type and size can be deduced from a std::array:

    array<floattype,1> leftsum{0.};
#ifdef __INTEL_CLANG_COMPILER
    sycl::buffer leftbuf(leftsum);
#else
    sycl::range<1> scalar{1};
    sycl::buffer<floattype,1> leftbuf(leftsum.data(),scalar);
#endif

\end{dpcppnote}

\begin{dpcppnote} There are shorthand access modes:

// standard
sycl::accessor acc = buffer.get_access<sycl::access::mode::write>(h);
// dpcpp extension
sycl::accessor acc( buffer,h,sycl::read_only );
sycl::accessor acc( buffer,h,sycl::write_only );

\end{dpcppnote}

41.6.3 Querying


The get_range function can query the size of a buffer or an accessor: \cxxverbatimsnippet[code/dpcpp/cxx/range2.cxx]{syclbufrange} \cxxverbatimsnippet[code/dpcpp/cxx/range2.cxx]{syclaccrange}
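Since those snippets are not reproduced here, a sketch of what such queries look like; the names data, N, M, h are hypothetical:

sycl::buffer<float,2> buf( data.data(), sycl::range<2>(N,M) );
sycl::range<2> buf_range = buf.get_range();  // gives {N,M}
// inside a command group, the accessor carries the same information:
auto acc = buf.get_access<sycl::access::mode::read>(h);
sycl::range<2> acc_range = acc.get_range();
size_t elements = acc_range.size();          // N*M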

41.7 Parallel output


There is a sycl::stream class for output from within a kernel:

// hello.cxx
[&](sycl::handler &cgh) {
  sycl::stream sout(1024, 256, cgh);
  cgh.parallel_for<class hello_world>
    ( sycl::range<1>(global_range),
      [=](sycl::id<1> idx) {
        sout << "Hello, World: World rank " << idx << sycl::endl;
      } ); // End of the kernel function
}

Since the end of a queue does not flush stdout, it may be necessary to call sycl::queue::wait:

myQueue.wait();

41.8 DPCPP extensions


Intel has made some extensions to SYCL:

  • Unified Shared Memory,
  • Ordered queues.

41.9 Intel devcloud notes


  • qsub -I for an interactive session.
  • gdb-oneapi for debugging.
  • https://community.intel.com/t5/Intel-oneAPI-Toolkits/ct-p/oneapi for support.

41.10 Examples


41.10.1 Kernels in a loop


The following idiom works: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{syclkernelloop}
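Schematically, the idiom amounts to submitting one kernel per iteration and swapping buffers; the runtime orders the kernels through their buffer accesses. The names old_buf, new_buf, update_range, nsteps are hypothetical:

for (int step=0; step<nsteps; step++) {
  myqueue.submit( [&] (sycl::handler &h) {
    auto in  = old_buf.get_access<sycl::access::mode::read>(h);
    auto out = new_buf.get_access<sycl::access::mode::write>(h);
    h.parallel_for( update_range,
      [=] (sycl::id<1> i) { /* compute out[i] from in */ } );
  } );
  std::swap( old_buf, new_buf ); // accessor use orders the kernels correctly
}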

41.10.2 Stencil computations


The problem with stencil computations is that only interior points are updated. Translated to SYCL: we need to iterate over a subrange of the range over which the buffer is defined. First let us define these ranges: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{syclrangebc} Note the boundary value $1.$ on the right boundary.

Restricting the iteration to the interior points is done as follows: \cxxverbatimsnippet[code/dpcpp/cxx/jacobi1d.cxx]{sycliteratebc}
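Schematically, the interior update shifts the loop index past the boundary point; a sketch with hypothetical accessors in and out:

// update the N interior points of an array of N+2 points
h.parallel_for( sycl::range<1>(N),
  [=] (sycl::id<1> i) {
    size_t g = i[0] + 1;  // skip the left boundary point
    out[g] = ( in[g-1] + in[g+1] ) / 2.f;
  } );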
