OpenMP remaining topics

28.1 : Runtime functions and internal control variables
28.2 : Timing
28.3 : Thread safety
28.4 : Performance and tuning
28.5 : Accelerators

28 OpenMP remaining topics

28.1 Runtime functions and internal control variables

OpenMP has a number of settings that can be set through environment variables, and both queried and set through library routines. These settings are called internal control variables (ICVs): an OpenMP implementation behaves as if there is an internal variable storing each setting.

The runtime functions are:

• omp_set_dynamic
• omp_get_dynamic
• omp_set_nested
• omp_get_nested
• omp_get_wtime
• omp_get_wtick
• omp_set_schedule
• omp_get_schedule
• omp_set_max_active_levels
• omp_get_max_active_levels
• omp_get_level
• omp_get_active_level
• omp_get_team_size

Here are the OpenMP environment variables :

• OMP_CANCELLATION Set whether cancellation is activated
• OMP_DISPLAY_ENV Show OpenMP version and environment variables
• OMP_DEFAULT_DEVICE Set the device used in target regions
• OMP_MAX_ACTIVE_LEVELS Set the maximum number of nested parallel regions; section 17.1.
• OMP_NESTED Nested parallel regions
• OMP_PROC_BIND Whether threads may be moved between CPUs; section 24.1.
• OMP_PLACES Specifies on which CPUs the threads should be placed; section 24.1.
• OMP_STACKSIZE Set default thread stack size
• OMP_SCHEDULE How threads are scheduled
• OMP_WAIT_POLICY How waiting threads are handled; ICV wait-policy-var . Values: ACTIVE for keeping threads spinning, PASSIVE for possibly yielding the processor when threads are waiting.

The following ICVs behave as if each thread has its own copy of them. The default is implementation-defined unless otherwise noted.

• It may be possible to adjust dynamically the number of threads for a parallel region. Variable: OMP_DYNAMIC ; routines: omp_set_dynamic , omp_get_dynamic .
• If a code contains nested parallel regions , the inner regions may create new teams, or they may be executed by the single thread that encounters them. Variable: OMP_NESTED ; routines omp_set_nested , omp_get_nested . Allowed values are TRUE and FALSE ; the default is false.
• The schedule for a parallel loop can be set. Variable: OMP_SCHEDULE ; routines omp_set_schedule , omp_get_schedule .

Nonobvious syntax:

export OMP_SCHEDULE="static,100"


Other settings:

• omp_get_num_threads : query the number of threads active at the current place in the code; this can be lower than what was set with omp_set_num_threads . For a meaningful answer, this should be done in a parallel region.
• omp_in_parallel : test if you are in a parallel region (see for instance section  16.5 ).
• omp_get_num_procs : query the number of processors available to the program; depending on the system, this may count hardware threads rather than physical cores.

Other environment variables:

• OMP_STACKSIZE controls the amount of space that is allocated as the per-thread stack, that is, the space where private variables are stored.
• OMP_WAIT_POLICY determines the behaviour of threads that wait, for instance at a critical section:

• ACTIVE puts the thread in a spin-lock , where it actively checks whether it can continue;
• PASSIVE puts the thread to sleep until the OS wakes it up.

The 'active' strategy uses CPU while the thread is waiting; on the other hand, activating it after the wait is instantaneous. With the 'passive' strategy, the thread does not use any CPU while waiting, but activating it again is expensive. Thus, the passive strategy only makes sense if threads will be waiting for a (relatively) long time.

• OMP_PROC_BIND with values TRUE and FALSE can bind threads to a processor. On the one hand, doing so can minimize data movement; on the other hand, it may increase load imbalance.

28.2 Timing

OpenMP has a wall clock timer routine omp_get_wtime :

double omp_get_wtime(void);


The starting point is arbitrary and is different for each program run; however, in one run it is identical for all threads. This timer has a resolution given by omp_get_wtick .

Exercise

Use the timing routines to demonstrate speedup from using multiple threads.

• Write a code segment that takes a measurable amount of time, that is, it should take a large multiple of the tick time.
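• Write a parallel loop and measure the speedup. You can for instance do this

for (int use_threads=1; use_threads<=nthreads; use_threads++) {
  double tstart = omp_get_wtime();
#pragma omp parallel for num_threads(use_threads)
  for (int i=0; i<nthreads; i++) {
    .....
  }
  double tend = omp_get_wtime();
  if (use_threads==1)
    time1 = tend-tstart;
  else // compute speedup
    printf("%d threads: speedup %f\n", use_threads, time1/(tend-tstart));
}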

• In order to prevent the compiler from optimizing your loop away, let the body compute a result and use a reduction to preserve these results.

28.3 Thread safety

With OpenMP it is relatively easy to take existing code and make it parallel by introducing parallel sections. If you're careful to declare the appropriate variables shared and private, this may work fine. However, your code may include calls to library routines that contain a race condition; such code is said not to be thread-safe.

For example, a routine

static int isave;
int next_one() {
  int i = isave;
  isave += 1;
  return i;
}

...
for ( .... ) {
  int ivalue = next_one();
}


has a clear race condition when the loop is executed in parallel: two iterations can read isave before either has updated it, so they may or may not get the distinct next_one values that they are supposed to get. This can be solved by using a critical pragma for the next_one call; another solution is to use a threadprivate declaration for isave . The latter is for instance the right solution if the next_one routine implements a random number generator.

28.4 Performance and tuning

[epcc-ompbench]

The performance of an OpenMP code can be influenced by the following.

• [Amdahl effects] Your code needs to have enough parts that are parallel (see Eijkhout:IntroHPC). Sequential parts may be sped up by having them executed redundantly on each thread, since that keeps data local.
• [Dynamism] Creating a thread team takes time. In practice, a team is not created and deleted for each parallel region, but creating teams of different sizes, or recursive thread creation, may introduce overhead.
• [Load imbalance] Even if your program is parallel, you need to worry about load balance. In the case of a parallel loop you can set the dynamic schedule, which evens out the work, but may cause increased communication.
• [Communication] Cache coherence causes communication. Threads should, as much as possible, refer to their own data.

• Threads are likely to read from each other's data. That is largely unavoidable.
• Threads writing to each other's data should be avoided: it may require synchronization, and it causes coherence traffic.
• If threads can migrate, data that was local at one time is no longer local after migration.
• Reading data from one socket that was allocated on another socket is inefficient; see section  24.2 .
• [Affinity] Both data and execution threads can be bound to a specific locale to some extent. Using local data is more efficient than remote data, so you want to use local data, and minimize the extent to which data or execution can move.

• See the above points about phenomena that cause communication.
• Section 24.1.1 describes how you can specify the binding of threads to places. There can, but need not, be an effect on affinity. For instance, if an OpenMP thread can migrate between hardware threads of the same core, cached data will stay local. Leaving an OpenMP thread completely free to migrate can be advantageous for load balancing, but you should only do that if data affinity is of lesser importance.
• Static loop schedules have a higher chance of using data that has affinity with the place of execution, but they are worse for load balancing. On the other hand, the can alleviate some of the problems with static loop schedules.
• [Binding] You can choose to put OpenMP threads close together or to spread them apart. Having them close together makes sense if they use lots of shared data. Spreading them apart may increase bandwidth. (See the examples in section  24.1.2 .)
• [Synchronization] Barriers are a form of synchronization. They are expensive by themselves, and they expose load imbalance. Implicit barriers happen at the end of worksharing constructs; they can be removed with nowait .

Critical sections imply a loss of parallelism, but they are also slow as they are realized through operating system functions. These are often quite costly, taking many thousands of cycles. Critical sections should be used only if the parallel work far outweighs their cost.

28.5 Accelerators

Since OpenMP 4.0 there is support for offloading work to an accelerator or co-processor:

#pragma omp target [clauses]


with clauses such as

• data : place data
• update : make data consistent between host and device