
23 OpenMP topic: Affinity

23.1 OpenMP thread affinity control

The matter of thread affinity becomes important on multi-socket nodes; see the example in section 23.2.

Thread placement can be controlled with two environment variables: OMP_PLACES, which describes the places to which threads can be bound, and OMP_PROC_BIND, which describes how threads are assigned to those places.

23.1.1 Thread binding

The variable OMP_PLACES defines a series of places to which the threads are assigned.

Example: if you have two sockets and you define

OMP_PLACES=sockets

then the threads are assigned to the sockets round-robin: thread 0 goes to socket 0, thread 1 to socket 1, thread 2 to socket 0 again, et cetera. On the other hand, if the two sockets have a total of sixteen cores and you define

OMP_PLACES=cores
OMP_PROC_BIND=close

then thread 0 goes to core 0, thread 1 to core 1, and so on: the threads fill up the cores of the first socket before any thread is placed on the second socket. The value OMP_PROC_BIND=close means that the assignment goes successively through the available places. The variable OMP_PROC_BIND can also be set to spread, which spreads the threads over the places. With

OMP_PLACES=cores
OMP_PROC_BIND=spread

you find that the threads are again bound to specific cores, but now they are divided as evenly as possible over the two sockets, rather than filling up the first socket before the second one is used.

So you see that OMP_PLACES=cores with OMP_PROC_BIND=spread is very similar to OMP_PLACES=sockets. The difference is that the latter choice does not bind a thread to a specific core, so the operating system can move threads about, and it can put more than one thread on the same core, even if there is another core still unused.

The value OMP_PROC_BIND=master puts the threads in the same place as the master of the team. This is convenient if you create teams recursively. In that case you would use the proc_bind clause rather than the environment variable, set to spread for the initial team, and to master for the recursively created team.
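For instance, the following is a minimal sketch (not code from this chapter; the thread counts assume a two-socket node with OMP_PLACES=cores set):

  #include <omp.h>
  #include <stdio.h>
  int main(void) {
    omp_set_max_active_levels(2);        // allow one level of nested parallelism
  #pragma omp parallel num_threads(2) proc_bind(spread)  // outer team: one thread per socket
    {
      int outer = omp_get_thread_num();
  #pragma omp parallel num_threads(4) proc_bind(master)  // inner team: same place as its parent
      {
        printf("outer %d inner %d on place %d\n",
               outer, omp_get_thread_num(), omp_get_place_num());
      }
    }
    return 0;
  }

Here the outer threads land on different sockets, while each inner team stays on its parent thread's place.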

23.1.2 Effects of thread binding

Let's consider two example programs. First we consider a program for computing $\pi$, which is purely compute-bound.
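In outline, such a program is a single reduction loop; the following is a sketch (it assumes a simple midpoint quadrature rule; names and constants are illustrative, not the book's source code):

  #include <stdio.h>
  int main(void) {
    long N = 100000000;
    double h = 1.0/N, pi = 0.;
  #pragma omp parallel for reduction(+:pi)
    for (long i=0; i<N; i++) {
      double x = (i+0.5)*h;                // midpoint of subinterval i
      pi += 4.0/(1.0+x*x);                 // integrate 4/(1+x^2) over [0,1]
    }
    pi *= h;
    printf("pi is approximately %.10f\n",pi);
    return 0;
  }

Each thread accumulates into a private partial sum and touches hardly any memory, so memory bandwidth plays no role.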

  #threads   close/cores   spread/sockets   spread/cores
       1        0.359          0.354           0.353
       2        0.177          0.177           0.177
       4        0.088          0.088           0.088
       6        0.059          0.059           0.059
       8        0.044          0.044           0.044
      12        0.029          0.045           0.029
      16        0.022          0.050           0.022

We see pretty much perfect speedup for the OMP_PLACES=cores strategy; with OMP_PLACES=sockets we probably get occasional collisions where two threads wind up on the same core.

Next we take a program for computing the time evolution of the heat equation: \begin{equation} t=0,1,2,\ldots\colon \forall_i\colon x^{(t+1)}_i = 2x^{(t)}_i-x^{(t)}_{i-1}-x^{(t)}_{i+1} \end{equation} This is a bandwidth-bound operation because the amount of computation per data item is low.
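In outline the time-stepping loop looks as follows (a sketch with assumed array names, sizes, and boundary handling; not the book's exact source):

  #include <stdlib.h>
  int main(void) {
    long N = 10000000; int nsteps = 100;
    double *x    = (double*) malloc(N*sizeof(double));
    double *xnew = (double*) malloc(N*sizeof(double));
    for (long i=0; i<N; i++) x[i] = xnew[i] = 0.;   // sequential init; see the first-touch section below
    for (int t=0; t<nsteps; t++) {
  #pragma omp parallel for schedule(static)
      for (long i=1; i<N-1; i++)
        xnew[i] = 2*x[i] - x[i-1] - x[i+1];         // few flops per 8-byte load/store: bandwidth-bound
      double *tmp = x; x = xnew; xnew = tmp;        // swap time levels
    }
    free(x); free(xnew);
    return 0;
  }

Every time step streams the whole x and xnew arrays through memory, so performance is limited by how much bandwidth the threads can get.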

  #threads   close/cores   spread/sockets   spread/cores
       1        2.88           2.89            2.88
       2        1.71           1.41            1.42
       4        1.11           0.74            0.74
       6        1.09           0.57            0.57
       8        1.12           0.57            0.53
      12        0.72           0.53            0.52
      16        0.52           0.61            0.53

Again we see that OMP_PLACES=sockets gives worse performance for high core counts, probably because of threads winding up on the same core. The thing to observe in this example is that with 6 or 8 cores the OMP_PROC_BIND=spread strategy gives twice the performance of OMP_PROC_BIND=close.

The reason for this is that a single socket does not have enough bandwidth for all eight cores on the socket. Therefore, dividing the eight threads over two sockets gives each thread a higher available bandwidth than putting all threads on one socket.
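As a rough, illustrative calculation (the numbers are assumptions, not measurements): each update of $x^{(t+1)}_i$ streams on the order of 16 bytes through memory, one load and one store per point given cache reuse of the neighbouring values, while performing only about three floating point operations. If a socket delivers a few tens of GB/s of memory bandwidth, its eight cores together can sustain only a few Gflop/s on this kernel, far below their combined peak; using the second socket's memory controllers doubles the attainable rate, which is consistent with the factor of two in the table.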

23.1.3 Place definition

There are three predefined values for the OMP_PLACES variable: sockets, cores, threads. You have already seen the first two; the threads value becomes relevant on processors that have hardware threads. In that case, OMP_PLACES=cores does not tie a thread to a specific hardware thread, leading again to possible collisions as in the above example. Setting OMP_PLACES=threads ties each OpenMP thread to a specific hardware thread.

There is also a very general syntax for defining places that uses a

  location:number:stride
  
syntax. Examples:
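For instance, on a hypothetical sixteen-core, two-socket node the following illustrative settings (not taken from the text) all define four places of four cores each:

  export OMP_PLACES="{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}"
  export OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"
  export OMP_PLACES="{0:4}:4:4"

The last form uses the location:number:stride notation: it takes the place {0:4} and replicates it four times with a stride of four cores.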

23.1.4 Binding possibilities

Values for OMP_PROC_BIND are: false, true, master, close, spread.

The binding can also be controlled per parallel region by giving the proc_bind clause on the parallel directive.

A safe default setting is

  export OMP_PROC_BIND=true
  
which prevents the operating system from migrating a thread. This prevents many scaling problems.
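To check what binding is actually in effect, you can query it from inside the program; here is a minimal sketch using the standard OpenMP affinity query routines (the output text is illustrative):

  #include <omp.h>
  #include <stdio.h>
  int main(void) {
    printf("bind policy=%d, number of places=%d\n",
           (int)omp_get_proc_bind(), omp_get_num_places());
  #pragma omp parallel
    printf("thread %2d runs on place %d\n",
           omp_get_thread_num(), omp_get_place_num());
    return 0;
  }

Running this under different OMP_PLACES and OMP_PROC_BIND settings shows directly where each thread ends up.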

Good examples of thread placement on the Intel Knights Landing can be found at https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200

As an example, consider a code where two threads write to a shared location.

  // sharing.c
  #pragma omp parallel
  { // not a parallel for: just a bunch of reps
    for (int j = 0; j < reps; j++) {
  #pragma omp for schedule(static,1)
      for (int i = 0; i < N; i++) {
  #pragma omp atomic
        a++;                      // every thread updates the same shared variable
      }
    }
  }

There is now a big difference in runtime depending on how close the threads are. We test this on a processor with both cores and hyperthreads. First we bind the OpenMP threads to the cores:

  OMP_NUM_THREADS=2 OMP_PLACES=cores OMP_PROC_BIND=close ./sharing
  run time = 4752.231836usec
  sum = 80000000.0
  
Next we force the OpenMP threads to bind to hyperthreads inside one core:
  OMP_PLACES=threads OMP_PROC_BIND=close ./sharing
  run time = 941.970110usec
  sum = 80000000.0
  
Of course in this example the inner loop is pretty much meaningless and parallelism does not speed up anything:
  OMP_NUM_THREADS=1 OMP_PLACES=cores OMP_PROC_BIND=close ./sharing
  run time = 806.669950usec
  sum = 80000000.0
  
However, we see that the two-thread run with OMP_PLACES=threads is almost as fast as the sequential run, meaning that there is very little parallelization overhead.

23.2 First-touch

The affinity issue shows up in the first-touch phenomenon. Memory allocated with malloc and similar routines is not actually allocated right away; the physical pages are only assigned when data is first written to them. In light of this, consider the following OpenMP code:

  double *x = (double*) malloc(N*sizeof(double));

  for (i=0; i<N; i++)
    x[i] = 0;

  #pragma omp parallel for
  for (i=0; i<N; i++)
    .... something with x[i] ...

Since the initialization loop is not parallel, it is executed by the master thread, so all the memory becomes associated with the socket of that thread. Threads running on the other socket will subsequently be accessing data from memory that is not attached to their socket.
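The usual remedy (a sketch, not the book's code) is to do the initialization in parallel with the same schedule as the compute loop, so that every thread first-touches, and thereby places, the pages it will later work on:

  #include <stdlib.h>
  int main(void) {
    long N = 100000000;
    double *x = (double*) malloc(N*sizeof(double));
  #pragma omp parallel for schedule(static)
    for (long i=0; i<N; i++)
      x[i] = 0.;                  // pages end up on the socket of the thread touching them
  #pragma omp parallel for schedule(static)
    for (long i=0; i<N; i++)
      x[i] = 2*x[i];              // stand-in for the actual computation on x[i]
    free(x);
    return 0;
  }

With static scheduling both loops distribute the iterations identically, so each thread computes on data in its own socket's memory.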

Exercise

Finish the following fragment and run it, first with all the cores of one socket, then with all the cores of both sockets. (If you know how to do explicit placement, you can also try fewer cores.)

  for (int i=0; i<nlocal+2; i++)
    in[i] = 1.;
  for (int i=0; i<nlocal; i++)
    out[i] = 0.;

  for (int step=0; step<nsteps; step++) {
  #pragma omp parallel for schedule(static)
    for (int i=0; i<nlocal; i++) {
      out[i] = ( in[i]+in[i+1]+in[i+2] )/3.;
    }
  #pragma omp parallel for schedule(static)
    for (int i=0; i<nlocal; i++)
      in[i+1] = out[i];
    in[0] = 0; in[nlocal+1] = 1;
  }

Exercise

How do the OpenMP dynamic schedules relate to this?

The C++ valarray class initializes its elements on construction, so its memory will be first-touched, and therefore allocated, by thread 0.

You could explicitly migrate pages with the Linux move_pages system call.

By taking affinity into account you are, in effect, adopting an SPMD style of programming. You could make this explicit by having each thread allocate its part of the arrays separately, and storing a private pointer as threadprivate [Liu:2003:OMP-SPMD]. However, this makes it impossible for threads to access each other's parts of the distributed array, so this is only suitable for fully data parallel or embarrassingly parallel applications.
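The following is a minimal sketch of that idea (illustrative only, not code from the cited paper):

  #include <omp.h>
  #include <stdlib.h>

  static double *mypart;               // each thread's own piece of the distributed array
  #pragma omp threadprivate(mypart)

  int main(void) {
    long nlocal = 1000000;
  #pragma omp parallel
    {
      // each thread allocates and first-touches only its own part
      mypart = (double*) malloc(nlocal*sizeof(double));
      for (long i=0; i<nlocal; i++) mypart[i] = 0.;
      // ... compute on mypart; other threads' parts are not reachable from here ...
      free(mypart);
    }
    return 0;
  }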

23.3 Affinity control outside OpenMP

There are various utilities to control process and thread placement.

Process placement can be controlled at the operating-system level with numactl on Linux (the TACC utility tacc_affinity is a wrapper around it); taskset serves a similar purpose. On Windows there is start /affinity.

Corresponding system calls: pbind on Solaris, sched_setaffinity on Linux, SetThreadAffinityMask on Windows.

Corresponding environment variables: SUNW_MP_PROCBIND on Solaris, KMP_AFFINITY on Intel.

The Intel compiler has an environment variable for affinity control:

  export KMP_AFFINITY=verbose,scatter
  
Possible values are none, scatter, and compact; the verbose modifier in the example above makes the runtime report the resulting binding.

For gcc:

  export GOMP_CPU_AFFINITY=0,8,1,9
  

For the Sun compiler:

  SUNW_MP_PROCBIND
  
