OpenMP topic: Affinity

24.1 : OpenMP thread affinity control
24.1.1 : Thread binding
24.1.2 : Effects of thread binding
24.1.3 : Place definition
24.1.4 : Binding possibilities
24.2 : First-touch
24.3 : Affinity control outside OpenMP

24 OpenMP topic: Affinity

24.1 OpenMP thread affinity control

The matter of thread affinity becomes important on multi-socket nodes; see the example in section 24.2.

Thread placement can be controlled with two environment variables:

  • the environment variable OMP_PROC_BIND describes how threads are bound to OpenMP places, while
  • the variable OMP_PLACES describes these places in terms of the available hardware.

When you're experimenting with these variables it is a good idea to set OMP_DISPLAY_ENV to true, so that OpenMP will print out at runtime how it has interpreted your specification. The examples in the following sections display this output.

24.1.1 Thread binding

The variable OMP_PLACES defines a series of places to which the threads are assigned.

Example: if you have two sockets and you define

OMP_PLACES=sockets

then

  • thread 0 goes to socket 0,
  • thread 1 goes to socket 1,
  • thread 2 goes to socket 0 again,
  • and so on.

On the other hand, if the two sockets have a total of sixteen cores and you define

OMP_PLACES=cores
OMP_PROC_BIND=close

then

  • thread 0 goes to core 0, which is on socket 0,
  • thread 1 goes to core 1, which is on socket 0,
  • thread 2 goes to core 2, which is on socket 0,
  • and so on, until thread 7 goes to core 7 on socket 0, and
  • thread 8 goes to core 8, which is on socket 1,
  • et cetera.

The value OMP_PROC_BIND=close means that the assignment goes successively through the available places. The variable OMP_PROC_BIND can also be set to spread, which spreads the threads over the places. With

OMP_PLACES=cores
OMP_PROC_BIND=spread

you find that

  • thread 0 goes to core 0, which is on socket 0,
  • thread 1 goes to core 8, which is on socket 1,
  • thread 2 goes to core 1, which is on socket 0,
  • thread 3 goes to core 9, which is on socket 1,
  • and so on, until thread 14 goes to core 7 on socket 0, and
  • thread 15 goes to core 15, which is on socket 1.

So you see that OMP_PLACES=cores with OMP_PROC_BIND=spread is very similar to OMP_PLACES=sockets. The difference is that the latter choice does not bind a thread to a specific core, so the operating system can move threads about, and it can put more than one thread on the same core, even if another core is still unused.
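
You can verify such assignments from inside a program by querying the place of each thread. The following is a minimal sketch (the file name check_places.c is made up), using the OpenMP routines omp_get_num_places and omp_get_place_num:

// check_places.c: report the place of each thread (illustrative sketch)
#include <omp.h>
#include <stdio.h>

int main(void) {
  printf("number of places: %d\n", omp_get_num_places());
#pragma omp parallel
  {
#pragma omp critical
    printf("thread %2d runs in place %2d\n",
           omp_get_thread_num(), omp_get_place_num());
  }
  return 0;
}

Running this with OMP_PLACES=cores and OMP_PROC_BIND set to close or spread should show the assignment patterns listed above.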

The value OMP_PROC_BIND=master puts the threads in the same place as the master of the team. This is convenient if you create teams recursively. In that case you would use the proc_bind clause on the parallel directive rather than the environment variable, set to spread for the initial team, and to master for the recursively created team.
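
A minimal sketch of such a recursive setup with the proc_bind clause (the thread counts are arbitrary, and nested parallelism must be enabled):

// nested_bind.c: spread the outer team, keep inner teams with their master
#include <omp.h>
#include <stdio.h>

int main(void) {
  omp_set_max_active_levels(2);        // allow one level of nested parallelism
#pragma omp parallel proc_bind(spread) num_threads(2)
  {
    int outer = omp_get_thread_num();
#pragma omp parallel proc_bind(master) num_threads(4)
    {
      // the inner threads share the place of the master of their team
      printf("outer %d / inner %d in place %d\n",
             outer, omp_get_thread_num(), omp_get_place_num());
    }
  }
  return 0;
}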

24.1.2 Effects of thread binding

Let's consider two example programs. First we consider the program for computing $\pi$, which is purely compute-bound.

#threads   close/cores   spread/sockets   spread/cores
       1         0.359            0.354          0.353
       2         0.177            0.177          0.177
       4         0.088            0.088          0.088
       6         0.059            0.059          0.059
       8         0.044            0.044          0.044
      12         0.029            0.045          0.029
      16         0.022            0.050          0.022

We see pretty much perfect speedup for the OMP_PLACES=cores strategy; with OMP_PLACES=sockets we probably get occasional collisions where two threads wind up on the same core.

Next we take a program for computing the time evolution of the heat equation: \[ t=0,1,2,\ldots\colon\quad \forall_i\colon x^{(t+1)}_i = 2x^{(t)}_i-x^{(t)}_{i-1}-x^{(t)}_{i+1} \] This is a bandwidth-bound operation because the amount of computation per data item is low.
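
One time step of this update looks roughly as follows (a sketch; the pointers xold, xnew and the sizes N, nsteps are assumed to be set up elsewhere):

// stencil sketch corresponding to the update formula above
for (int t=0; t<nsteps; t++) {
#pragma omp parallel for schedule(static)
  for (int i=1; i<N-1; i++)
    xnew[i] = 2*xold[i] - xold[i-1] - xold[i+1];
  double *tmp = xold; xold = xnew; xnew = tmp;   // swap time levels
}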

#threads   close/cores   spread/sockets   spread/cores
       1          2.88             2.89           2.88
       2          1.71             1.41           1.42
       4          1.11             0.74           0.74
       6          1.09             0.57           0.57
       8          1.12             0.57           0.53
      12          0.72             0.53           0.52
      16          0.52             0.61           0.53

Again we see that OMP_PLACES=sockets gives worse performance for high core counts, probably because of threads winding up on the same core. The thing to observe in this example is that with 6 or 8 cores the OMP_PROC_BIND=spread strategy gives twice the performance of OMP_PROC_BIND=close.

The reason for this is that a single socket does not have enough bandwidth for all eight cores on the socket. Therefore, dividing the eight threads over two sockets gives each thread a higher available bandwidth than putting all threads on one socket.

24.1.3 Place definition

There are three predefined values for the OMP_PLACES variable: sockets, cores, threads. You have already seen the first two; the threads value becomes relevant on processors that have hardware threads. In that case, OMP_PLACES=cores does not tie a thread to a specific hardware thread, leading again to possible collisions as in the above example. Setting OMP_PLACES=threads ties each OpenMP thread to a specific hardware thread.

There is also a general syntax for defining places, of the form

  location:number:stride

Examples:

  • OMP_PLACES="{0:8:1},{8:8:1}"
    

    is equivalent to sockets on a two-socket design with eight cores per socket: it defines two places, each consisting of eight consecutive cores. The threads are then placed alternating between the two places, but not bound to a specific core within a place.

  • The setting cores is equivalent to

    OMP_PLACES="{0},{1},{2},...,{15}"
    
  • On a four-socket design, the specification

    OMP_PLACES="{0:4:8}:4:1"
    

    states that the place 0,8,16,24 needs to be repeated four times, with a stride of one. In other words, thread 0 winds up on core 0 of some socket, thread 1 winds up on core 1 of some socket, et cetera.
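
In addition to setting OMP_DISPLAY_ENV, you can inspect from inside a program how a place specification was parsed; a minimal sketch using omp_get_place_num_procs and omp_get_place_proc_ids:

// list the processor ids contained in each OpenMP place
#include <omp.h>
#include <stdio.h>

int main(void) {
  int nplaces = omp_get_num_places();
  for (int p=0; p<nplaces; p++) {
    int nprocs = omp_get_place_num_procs(p);
    int ids[256];                       // assumed upper bound on procs per place
    if (nprocs>256) nprocs = 256;
    omp_get_place_proc_ids(p,ids);
    printf("place %2d:",p);
    for (int i=0; i<nprocs; i++)
      printf(" %d",ids[i]);
    printf("\n");
  }
  return 0;
}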

24.1.4 Binding possibilities

Values for OMP_PROC_BIND are: false, true, master, close, spread .

  • false: set no binding
  • true: lock threads to a core
  • master: collocate threads with the master thread
  • close: place threads close to the master in the places list
  • spread: spread out threads as much as possible

This effect can be made local by giving the proc_bind clause on the parallel directive.

A safe default setting is

export OMP_PROC_BIND=true

which prevents the operating system from migrating a thread to another core. This avoids many scaling problems.

Good examples of thread placement on the Intel Knights Landing can be found at https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200

As an example, consider a code where two threads write to a shared location.

// sharing.c
#pragma omp parallel
  { // not a parallel for: just a bunch of reps
    for (int j = 0; j < reps; j++) {
#pragma omp for schedule(static,1)
      for (int i = 0; i < N; i++) {
        // atomic update of a shared variable: the cache line holding `a'
        // bounces between the threads that share it
#pragma omp atomic
        a++;
      }
    }
  }
There is now a big difference in runtime depending on how close the threads are. We test this on a processor with both cores and hyperthreads. First we bind the OpenMP threads to the cores:

OMP_NUM_THREADS=2 OMP_PLACES=cores OMP_PROC_BIND=close ./sharing
run time = 4752.231836usec
sum = 80000000.0

Next we force the OpenMP threads to bind to hyperthreads inside one core:

OMP_PLACES=threads OMP_PROC_BIND=close ./sharing
run time = 941.970110usec
sum = 80000000.0

Of course in this example the inner loop is pretty much meaningless and parallelism does not speed up anything:

OMP_NUM_THREADS=1 OMP_PLACES=cores OMP_PROC_BIND=close ./sharing
run time = 806.669950usec
sum = 80000000.0

However, we see that the two-thread result is almost as fast, meaning that there is very little parallelization overhead.

24.2 First-touch

The affinity issue shows up in the first-touch phenomenon. Memory allocated with malloc and similar routines is not actually allocated, that is, mapped to physical pages, until data is written to it. In light of this, consider the following OpenMP code:

double *x = (double*) malloc(N*sizeof(double));

// serial initialization: every page of x is first touched by the master thread
for (int i=0; i<N; i++)
  x[i] = 0;

// parallel use: threads on the other socket access memory attached to socket 0
#pragma omp parallel for
for (int i=0; i<N; i++)
  .... something with x[i] ...

Since the initialization loop is not parallel it is executed by the master thread, so all the memory gets placed on the socket of that thread. Threads running on the other socket will subsequently be accessing data from memory not attached to their socket, which is slower.
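
A common remedy, sketched here under the assumption that N is defined and that the later loops use a static schedule, is to initialize the array in a parallel loop with the same schedule, so that each thread first-touches exactly the pages it will later work on:

// first-touch initialization sketch: each thread touches its own part of x
double *x = (double*) malloc(N*sizeof(double));

#pragma omp parallel for schedule(static)
for (int i=0; i<N; i++)
  x[i] = 0;

// later loops with the same static schedule access mostly local memory
#pragma omp parallel for schedule(static)
for (int i=0; i<N; i++)
  x[i] = 2*x[i];                       // stands in for the actual work on x[i]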

Exercise

Finish the following fragment and run it, first with all the cores of one socket, then with all the cores of both sockets. (If you know how to do explicit placement, you can also try fewer cores.)

  for (int i=0; i<nlocal+2; i++)
    in[i] = 1.;
  for (int i=0; i<nlocal; i++)
    out[i] = 0.;


  for (int step=0; step<nsteps; step++) {
#pragma omp parallel for schedule(static)
    for (int i=0; i<nlocal; i++) {
      out[i] = ( in[i]+in[i+1]+in[i+2] )/3.;
    }
#pragma omp parallel for schedule(static)
    for (int i=0; i<nlocal; i++)
      in[i+1] = out[i];
    in[0] = 0; in[nlocal+1] = 1;
  }

Exercise

How do the OpenMP dynamic schedules relate to this?

The C++ valarray class initializes its elements upon construction, so its memory will be first-touched, and therefore allocated, by thread 0.

You could explicitly migrate pages with the move_pages system call.
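
A minimal Linux-specific sketch of such page migration (assuming libnuma's numaif.h header and linking with -lnuma; the target node number is arbitrary):

// migrate one freshly touched page of x to NUMA node 1
#include <numaif.h>    // move_pages, MPOL_MF_MOVE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  long pagesize = sysconf(_SC_PAGESIZE);
  double *x;
  posix_memalign((void**)&x, pagesize, pagesize);
  x[0] = 1.;                           // first touch places the page
  void *pages[1] = { x };
  int nodes[1]   = { 1 };              // arbitrary target NUMA node
  int status[1];
  if (move_pages(0 /* this process */, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
    perror("move_pages");
  else
    printf("page of x now on node %d\n", status[0]);
  free(x);
  return 0;
}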

By taking affinity into account you are, in effect, adopting an SPMD style of programming. You could make this explicit by having each thread allocate its part of the arrays separately, and storing a private pointer as threadprivate [Liu:2003:OMP-SPMD]. However, this makes it impossible for threads to access each other's parts of the distributed array, so this is only suitable for fully data parallel or embarrassingly parallel applications.

24.3 Affinity control outside OpenMP

There are various utilities to control process and thread placement.

Process placement can be controlled at the operating system level with numactl on Linux (the TACC utility is a wrapper around this); also taskset; on Windows: start /affinity.

Corresponding system calls: pbind on Solaris, sched_setaffinity on Linux, SetThreadAffinityMask on Windows.

Corresponding environment variables: SUNW_MP_PROCBIND on Solaris, KMP_AFFINITY on Intel.

The Intel compiler has an environment variable for affinity control:

export KMP_AFFINITY=verbose,scatter

The placement values are none, scatter, and compact; the verbose modifier makes the runtime report the resulting binding.

For gcc:

export GOMP_CPU_AFFINITY=0,8,1,9

For the Sun compiler:

SUNW_MP_PROCBIND
