OpenMP topic: Offloading

Experimental html version of downloadable textbook, see
28.1.1 : Targets and tasks
28.2 : Data on the device
28.3 : Execution on the device

28 OpenMP topic: Offloading

This chapter explains the mechanisms for offloading work to a GPU.

The memory of a processor and that of an attached GPU are not coherent: there are separate memory spaces, and data written in one is not automatically reflected in the other.

OpenMP transfers data (or maps it) when you enter a target construct.

#pragma omp target
  // do stuff on the GPU

You can test whether the target region is indeed executed on a device with omp_is_initial_device:

#pragma omp target
  if (omp_is_initial_device()) printf("Offloading failed\n");
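As a minimal sketch of this test in a complete function: the name target_ran_on_host is hypothetical, and the _OPENMP guard is an assumption added so that the code also builds when OpenMP is disabled. The flag is mapped explicitly with map(from:) because a scalar written inside the region would otherwise not be copied back.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns 1 if the target region ran on the host,
   0 if it ran on an attached device. */
int target_ran_on_host(void) {
  int on_host = 1;   /* default when compiled without OpenMP */
#ifdef _OPENMP
  /* map(from:) copies the flag back to the host after the region */
  #pragma omp target map(from: on_host)
  {
    on_host = omp_is_initial_device();
  }
#endif
  return on_host;
}
```

A caller could then report the outcome, for instance with if (target_ran_on_host()) printf("Offloading failed\n");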

28.1.1 Targets and tasks

Executing a target construct creates a target task. This is a task running on the host, dedicated to managing the offloaded region.

The target region itself is executed on the device by a new initial task. This is distinct from the initial task that executes the main program.

The task that created the target task is called the generating task.

By default, the generating task is blocked while the task on the device is running, but adding the nowait clause makes the target task asynchronous. This then requires a taskwait directive to synchronize host and device.
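The asynchronous case can be sketched as follows. The function name scale_and_sum and its arguments are made up for illustration. The device scales one array while the host, in the meantime, sums a second, unrelated array; the array section notation a[0:n] tells OpenMP how many elements a pointer refers to.

```c
/* Hypothetical example: offload the scaling of a[] without blocking the
   host, let the host sum b[] meanwhile, then synchronize with taskwait. */
double scale_and_sum(double *a, const double *b, int n, double factor) {
  /* nowait: the generating task continues past the target region */
  #pragma omp target nowait map(tofrom: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] *= factor;

  /* host work that does not touch a[] */
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += b[i];

  /* wait for the target task; after this a[] is valid on the host again */
  #pragma omp taskwait
  return sum;
}
```

The host must not read or write a[] between the target directive and the taskwait, since the device copy may still be in flight.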

28.2 Data on the device

  • Scalars are treated as firstprivate, that is, they are copied in but not out.
  • Stack arrays are mapped tofrom by default.
  • Heap arrays are not mapped by default.
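A small sketch of these defaults, with no map clause at all. The function name default_mapping_demo and the size N are illustrative assumptions. The scalar is only read on the device, and the stack array is copied in and back out implicitly.

```c
#define N 4   /* illustrative array size */

/* Hypothetical example relying only on the default mapping rules. */
double default_mapping_demo(void) {
  double factor = 3.0;                 /* scalar: firstprivate on the device */
  double x[N] = {1.0, 2.0, 3.0, 4.0};  /* stack array: implicitly tofrom */

  #pragma omp target
  for (int i = 0; i < N; i++)
    x[i] *= factor;   /* an assignment to factor here would not reach the host */

  return x[0] + x[1] + x[2] + x[3];
}
```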

For explicit mapping, use the map clause:

#pragma omp target map(...)
  // do stuff on the GPU

The following map options exist:

  • map(to: x,y,z) copies from host to device when entering the target region.
  • map(from: x,y,z) copies from device to host when exiting the target region.
  • map(tofrom: x,y,z) is equivalent to combining the previous two.
  • map(alloc: x,y,z) allocates data on the device without copying.
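The map options can be combined on one directive. The following is a sketch under assumptions: the function square_plus_one and the scratch array tmp are invented for illustration. Heap arrays need an array section such as in[0:n], since the pointer alone does not carry a length.

```c
#include <stdlib.h>

/* Hypothetical example: out[i] = in[i]*in[i] + 1, using a scratch array
   that only needs to exist on the device. Returns 0 on success. */
int square_plus_one(const double *in, double *out, int n) {
  double *tmp = malloc((size_t)n * sizeof *tmp);
  if (tmp == NULL) return -1;

  /* to: input is copied in; from: output is copied out;
     alloc: scratch space is allocated on the device but never copied */
  #pragma omp target map(to: in[0:n]) map(from: out[0:n]) map(alloc: tmp[0:n])
  {
    for (int i = 0; i < n; i++)
      tmp[i] = in[i] * in[i];
    for (int i = 0; i < n; i++)
      out[i] = tmp[i] + 1.0;
  }

  free(tmp);
  return 0;
}
```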

Fortran note

If the compiler can deduce the array bounds and size, it is not necessary to specify them in the map clause.

Data transfer to a device is comparatively slow, so mapping the data at the start of every offloaded section of code is often not the best idea. Additionally, in many cases data will stay resident on the device throughout several iterations of, for instance, a time-stepping PDE solver. For such reasons, it is possible to move data onto, and off from, the device explicitly, using the target enter data and target exit data directives:

#pragma omp target enter data map(to: x,y)
#pragma omp target
  // do something
#pragma omp target exit data map(from: x,y)
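For the time-stepping scenario, a minimal sketch could look as follows. The function time_step and its update formula are placeholders, not taken from a real solver. The array is copied to the device once, updated there in every step, and copied back once at the end.

```c
/* Hypothetical example: keep x[] resident on the device over nsteps
   iterations of a placeholder update x[i] = 0.5*x[i] + 1. */
void time_step(double *x, int n, int nsteps) {
  /* one transfer host -> device before the loop */
  #pragma omp target enter data map(to: x[0:n])

  for (int step = 0; step < nsteps; step++) {
    /* x[] is already present on the device, so this map copies nothing */
    #pragma omp target map(tofrom: x[0:n])
    for (int i = 0; i < n; i++)
      x[i] = 0.5 * x[i] + 1.0;
  }

  /* one transfer device -> host after the loop */
  #pragma omp target exit data map(from: x[0:n])
}
```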

Also available are target update to (synchronize data from host to device) and target update from (synchronize data from device to host).
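A sketch of how explicit synchronization of resident data might be used, with a hypothetical function update_demo: the host changes data that already lives on the device, pushes the change, lets the device work on it, and pulls the result back before releasing the device copy.

```c
/* Hypothetical example of target update on device-resident data. */
double update_demo(double *x, int n) {
  #pragma omp target enter data map(to: x[0:n])

  /* the host modifies its copy; the device copy is now stale */
  for (int i = 0; i < n; i++)
    x[i] += 1.0;
  #pragma omp target update to(x[0:n])     /* host -> device */

  #pragma omp target map(tofrom: x[0:n])   /* already present: no copy */
  for (int i = 0; i < n; i++)
    x[i] *= 2.0;
  #pragma omp target update from(x[0:n])   /* device -> host */

  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += x[i];

  /* drop the device copy without another transfer */
  #pragma omp target exit data map(delete: x[0:n])
  return sum;
}
```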

28.3 Execution on the device

For parallel execution of a loop on the device use the teams construct, combined with distribute and parallel for (parallel do in Fortran):

#pragma omp target teams distribute parallel for

On GPU devices and the like, there is a structure to threads:

  • threads are grouped in teams, and they can be synchronized only within these teams;
  • teams are grouped in a league, and no synchronization between teams is possible inside a target region.

The combination teams distribute splits the iteration space over teams. By default a static schedule is used, but the dist_schedule clause can adjust it, for instance by specifying a chunk size. However, this combination only gives the chunk of the iteration space to the master thread in each team. Next we need parallel for or parallel do to spread the chunk over the threads in the team.
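Both levels of parallelism can be sketched for a vector addition. The function names and the constant CHUNK are illustrative assumptions. The first version uses the combined construct; the second makes the two levels explicit, with teams distribute handing out chunks and parallel for spreading each chunk over the threads of a team.

```c
enum { CHUNK = 256 };   /* hypothetical chunk size per team iteration */

/* Combined construct: iterations are split over teams, then over threads. */
void vector_add(const double *a, const double *b, double *c, int n) {
  #pragma omp target teams distribute parallel for \
          map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

/* Same computation with the two levels written out separately. */
void vector_add_two_level(const double *a, const double *b, double *c, int n) {
  /* outer loop: chunks are distributed over the teams */
  #pragma omp target teams distribute \
          map(to: a[0:n], b[0:n]) map(from: c[0:n])
  for (int start = 0; start < n; start += CHUNK) {
    int end = (start + CHUNK < n) ? start + CHUNK : n;
    /* inner loop: the chunk is spread over the threads of one team */
    #pragma omp parallel for
    for (int i = start; i < end; i++)
      c[i] = a[i] + b[i];
  }
}
```

The combined form is usually the simpler choice; the two-level form is mainly useful when per-chunk work has to happen between the two loops.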