# OpenMP topic: Affinity


# 23.1 OpenMP thread affinity control


The matter of thread affinity becomes important on multi-socket nodes; see the example in section 23.2.

Thread placement can be controlled with two environment variables:

• the environment variable OMP_PROC_BIND describes how threads are bound to OpenMP places, while

• the variable OMP_PLACES describes these places in terms of the available hardware.

When you are experimenting with these variables it is a good idea to set OMP_DISPLAY_ENV to true, so that OpenMP will report at runtime how it has interpreted your specification. The examples in the following sections will display this output.
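For instance, a session could look as follows (the program name ./your_program is a placeholder):

```shell
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_DISPLAY_ENV=true
./your_program   # OpenMP reports its interpretation of the settings at startup
```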

The variable OMP_PLACES defines a series of places to which the threads are assigned.

Example: if you have two sockets and you define

OMP_PLACES=sockets

then

• thread 0 goes to socket 0,

• thread 1 goes to socket 1,

• thread 2 goes to socket 0 again,

• and so on.

On the other hand, if the two sockets have a total of sixteen cores and you define
OMP_PLACES=cores
OMP_PROC_BIND=close

then

• thread 0 goes to core 0, which is on socket 0,

• thread 1 goes to core 1, which is on socket 0,

• thread 2 goes to core 2, which is on socket 0,

• and so on, until thread 7 goes to core 7 on socket 0, and

• thread 8 goes to core 8, which is on socket 1,

• et cetera.

The value OMP_PROC_BIND=close means that the assignment goes successively through the available places. The variable OMP_PROC_BIND can also be set to spread, which spreads the threads over the places. With
OMP_PLACES=cores
OMP_PROC_BIND=spread

you find that

• thread 0 goes to core 0, which is on socket 0,

• thread 1 goes to core 8, which is on socket 1,

• thread 2 goes to core 1, which is on socket 0,

• thread 3 goes to core 9, which is on socket 1,

• and so on, until thread 14 goes to core 7 on socket 0, and

• thread 15 goes to core 15, which is on socket 1.

So you see that OMP_PLACES=cores with OMP_PROC_BIND=spread is very similar to OMP_PLACES=sockets. The difference is that the latter choice does not bind a thread to a specific core, so the operating system can move threads about, and it can put more than one thread on the same core, even if another core is still unused.

The value OMP_PROC_BIND=master puts the threads in the same place as the master of the team. This is convenient if you create teams recursively. In that case you would use the proc_bind clause rather than the environment variable, set to spread for the initial team, and to master for the recursively created teams.

## 23.1.2 Effects of thread binding

Let's consider two example programs. First we consider the program for computing $\pi$, which is purely compute-bound.

| #threads | cores,close | sockets | cores,spread |
|---:|---:|---:|---:|
| 1 | 0.359 | 0.354 | 0.353 |
| 2 | 0.177 | 0.177 | 0.177 |
| 4 | 0.088 | 0.088 | 0.088 |
| 6 | 0.059 | 0.059 | 0.059 |
| 8 | 0.044 | 0.044 | 0.044 |
| 12 | 0.029 | 0.045 | 0.029 |
| 16 | 0.022 | 0.050 | 0.022 |

We see pretty much perfect speedup for the OMP_PLACES=cores strategy; with OMP_PLACES=sockets we probably get occasional collisions where two threads wind up on the same core.

Next we take a program for computing the time evolution of the heat equation: $$t=0,1,2,\ldots\colon \forall_i\colon x^{(t+1)}_i = 2x^{(t)}_i-x^{(t)}_{i-1}-x^{(t)}_{i+1}$$ This is a bandwidth-bound operation because the amount of computation per data item is low.

| #threads | cores,close | sockets | cores,spread |
|---:|---:|---:|---:|
| 1 | 2.88 | 2.89 | 2.88 |
| 2 | 1.71 | 1.41 | 1.42 |
| 4 | 1.11 | 0.74 | 0.74 |
| 6 | 1.09 | 0.57 | 0.57 |
| 8 | 1.12 | 0.57 | 0.53 |
| 12 | 0.72 | 0.53 | 0.52 |
| 16 | 0.52 | 0.61 | 0.53 |

Again we see that OMP_PLACES=sockets gives worse performance for high core counts, probably because of threads winding up on the same core. The thing to observe in this example is that with 6 or 8 cores the OMP_PROC_BIND=spread strategy gives twice the performance of OMP_PROC_BIND=close.

The reason for this is that a single socket does not have enough bandwidth for all eight cores on the socket. Therefore, dividing the eight threads over two sockets gives each thread a higher available bandwidth than putting all threads on one socket.

## 23.1.3 Place definition


There are three predefined values for the OMP_PLACES variable: sockets, cores, threads. You have already seen the first two; the threads value becomes relevant on processors that have hardware threads. In that case, OMP_PLACES=cores does not tie an OpenMP thread to a specific hardware thread, leading again to possible collisions as in the above example. Setting OMP_PLACES=threads ties each OpenMP thread to a specific hardware thread.

There is also a very general syntax for defining places, of the form

  location:number:stride

Examples:

• 
OMP_PLACES="{0:8:1},{8:8:1}"

is equivalent to sockets on a two-socket design with eight cores per socket: it defines two places, each having eight consecutive cores. The threads are then placed alternating between the two places, but not further pinned inside the place.

• The setting cores is equivalent to

OMP_PLACES="{0},{1},{2},...,{15}"


• On a four-socket design, the specification

OMP_PLACES="{0:4:8}:4:1"

states that the place 0,8,16,24 needs to be repeated four times, with a stride of one. In other words, thread 0 winds up on core 0 of some socket, thread 1 winds up on core 1 of some socket, et cetera.
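For instance, on a hypothetical two-socket, sixteen-core node, the sockets placement can be written out explicitly in this interval syntax; the two forms below should be equivalent:

```shell
# two places of eight consecutive cores each
export OMP_PLACES="{0:8:1},{8:8:1}"
# the same, using place replication: the place {0..7}, repeated twice with stride 8
export OMP_PLACES="{0:8}:2:8"
```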

## 23.1.4 Binding possibilities


Values for OMP_PROC_BIND are: false, true, master, close, spread.

• false: threads are not bound to places, and may be migrated;

• true: threads are bound, in an implementation-defined manner;

• master: threads are placed in the same place as their master thread;

• close: threads are placed close to the master in the places list;

• spread: threads are spread out over the available places.

This effect can be made local by giving the proc_bind clause in the parallel directive.

A safe default setting is

export OMP_PROC_BIND=true

which prevents the operating system from migrating a thread. This prevents many scaling problems.

Good examples of thread placement on the Intel Knights Landing: https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200

As an example, consider a code where two threads write to a shared location.

// sharing.c
#pragma omp parallel
{ // not a parallel for: just a bunch of reps
  for (int j = 0; j < reps; j++) {
#pragma omp for schedule(static,1)
    for (int i = 0; i < N; i++) {
#pragma omp atomic
      a++;
    }
  }
}

There is now a big difference in runtime depending on how close the threads are. We test this on a processor with both cores and hyperthreads. First we bind the OpenMP threads to the cores:

OMP_NUM_THREADS=2 OMP_PLACES=cores OMP_PROC_BIND=close ./sharing
run time = 4752.231836usec
sum = 80000000.0

Next we force the OpenMP threads to bind to hyperthreads inside one core:
OMP_PLACES=threads OMP_PROC_BIND=close ./sharing
run time = 941.970110usec
sum = 80000000.0

Of course in this example the inner loop is pretty much meaningless and parallelism does not speed up anything:
OMP_NUM_THREADS=1 OMP_PLACES=cores OMP_PROC_BIND=close ./sharing
run time = 806.669950usec
sum = 80000000.0

However, we see that the two-thread result is almost as fast, meaning that there is very little parallelization overhead.

# 23.2 First-touch


The affinity issue shows up in the first-touch phenomenon. Memory allocated with malloc and similar routines is not actually allocated until data is written to it; the page then goes to the memory of the socket whose thread touched it first. In light of this, consider the following OpenMP code:

double *x = (double*) malloc(N*sizeof(double));

for (i=0; i<N; i++)
  x[i] = 0;

#pragma omp parallel for
for (i=0; i<N; i++)
  // ... something with x[i] ...

Since the initialization loop is not parallel, it is executed by the master thread, so all of the memory becomes associated with the socket of that thread. Threads on the other socket will subsequently access data from memory not attached to their socket.

Exercise

Finish the following fragment and run it with first all the cores of one socket, then all cores of both sockets. (If you know how to do explicit placement, you can also try fewer cores.)

for (int i=0; i<nlocal+2; i++)
  in[i] = 1.;
for (int i=0; i<nlocal; i++)
  out[i] = 0.;

for (int step=0; step<nsteps; step++) {
#pragma omp parallel for schedule(static)
  for (int i=0; i<nlocal; i++) {
    out[i] = ( in[i]+in[i+1]+in[i+2] )/3.;
  }
#pragma omp parallel for schedule(static)
  for (int i=0; i<nlocal; i++)
    in[i+1] = out[i];
  in[0] = 0; in[nlocal+1] = 1;
}

Exercise

How do the OpenMP dynamic schedules relate to this?

The C++ valarray constructor initializes its data, so the memory is effectively allocated on thread 0.

You could move pages explicitly with the Linux move_pages system call.

By paying attention to affinity, you are in effect adopting an SPMD style of programming. You could make this explicit by having each thread allocate its part of the arrays separately, and storing a private pointer as threadprivate [Liu:2003:OMP-SPMD]. However, this makes it impossible for threads to access each other's parts of the distributed array, so this is only suitable for purely data parallel or embarrassingly parallel applications.

# 23.3 Affinity control outside OpenMP


There are various utilities to control process and thread placement.

Process placement can be controlled on the operating system level with numactl on Linux (also taskset), and with start /affinity on Windows. (The TACC utility tacc_affinity is a wrapper around numactl.)
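For instance (assuming a two-socket Linux node; the program name ./a.out is a placeholder):

```shell
# run on socket 0 only, with memory allocated there as well
numactl --cpunodebind=0 --membind=0 ./a.out
# restrict a process to cores 0-3
taskset -c 0-3 ./a.out
```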

Corresponding system calls: pbind on Solaris, sched_setaffinity on Linux, SetThreadAffinityMask on Windows.

Corresponding environment variables: SUNW_MP_PROCBIND on Solaris, KMP_AFFINITY for the Intel compiler.

The Intel compiler has an environment variable for affinity control:

export KMP_AFFINITY=verbose,scatter

Values: none, scatter, compact.

For gcc :

export GOMP_CPU_AFFINITY=0,8,1,9


For the Sun compiler:

SUNW_MP_PROCBIND