44 Process and thread affinity

In the preceding chapters we mostly considered all MPI processes or OpenMP threads as being part of one flat pool. However, for high performance you need to worry about affinity: the question of which process or thread is placed where, and how efficiently they can interact.

FIGURE 44.1: The NUMA structure of a Ranger node

Here are some situations where affinity becomes a concern.

  • In pure MPI mode, processes that are on the same node can typically communicate faster than processes on different nodes. Since processes are typically placed sequentially, this means that a scheme where process $p$ interacts mostly with $p+1$ will be efficient, while communication with large jumps will be less so; the first sketch after this list illustrates such a ring pattern.
  • If the cluster network has a structure (a processor grid, as opposed to a fat-tree), the placement of processes affects program efficiency. MPI tries to address this with graph topologies; see section 11.2. The same sketch after this list shows the relevant call.
  • Even on a single node there can be asymmetries. Figure 44.1 illustrates the structure of the four sockets of the Ranger supercomputer (no longer in production). Two cores have no direct connection.

    This asymmetry affects both MPI processes and threads on that node.

  • Another problem with multi-socket designs is that each socket has memory attached to it. While every socket can address all the memory on the node, its local memory is faster to access. This asymmetry becomes quite visible in the first-touch phenomenon; see section 24.2 and the second sketch after this list.
  • If a node has fewer MPI processes than there are cores, you want to be in control of their placement. Also, the operating system can migrate processes, which is detrimental to performance since it negates data locality. For this reason, utilities such as numactl (and at TACC tacc_affinity) can be used to pin a thread or process to a specific core; the last sketch after this list shows how a program can do this itself.

  • Processors with hyperthreading or hardware threads introduce another level of worry about where threads go.
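
The first two points can be made concrete with a small sketch. The code below is an illustration only, not taken from the text: each rank exchanges a value with its neighbors, which is efficient under sequential placement, and the same ring pattern is then declared as a distributed graph topology so that the MPI library may renumber ranks to fit the hardware.

    // Sketch: nearest-neighbor (ring) communication, and the same pattern
    // declared as an MPI graph topology. Illustrative only.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc,char **argv) {
      MPI_Init(&argc,&argv);
      int rank,size;
      MPI_Comm_rank(MPI_COMM_WORLD,&rank);
      MPI_Comm_size(MPI_COMM_WORLD,&size);

      int left  = (rank-1+size)%size;
      int right = (rank+1)%size;

      // exchange with the left/right neighbor: with sequential placement
      // these partners usually live on the same node
      double sendval = rank, recvval;
      MPI_Sendrecv(&sendval,1,MPI_DOUBLE,right,0,
                   &recvval,1,MPI_DOUBLE,left,0,
                   MPI_COMM_WORLD,MPI_STATUS_IGNORE);
      printf("rank %d received %g from rank %d\n",rank,recvval,left);

      // declare the same pattern as a graph topology (section 11.2);
      // reorder=1 lets the implementation renumber ranks to match the network
      int sources[1] = {left}, destinations[1] = {right};
      MPI_Comm ring_comm;
      MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
          1,sources,MPI_UNWEIGHTED,
          1,destinations,MPI_UNWEIGHTED,
          MPI_INFO_NULL,/*reorder=*/1,&ring_comm);

      MPI_Comm_free(&ring_comm);
      MPI_Finalize();
      return 0;
    }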
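
The first-touch effect can likewise be illustrated with a short OpenMP fragment. This is a sketch under the assumption of a Linux-style first-touch page placement policy: pages end up in the memory of the socket whose thread first writes them, so the initialization loop should use the same thread distribution as the later compute loops.

    // Sketch: first-touch initialization. Each thread writes "its" part of
    // the array, so those pages land in that thread's local memory.
    #include <stdlib.h>

    int main(void) {
      long n = 100000000;
      double *x = malloc(n*sizeof(double));

    #pragma omp parallel for schedule(static)
      for (long i=0; i<n; i++)
        x[i] = 0.;

      // subsequent loops with the same static schedule access mostly
      // socket-local memory
    #pragma omp parallel for schedule(static)
      for (long i=0; i<n; i++)
        x[i] += 1.;

      free(x);
      return 0;
    }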
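
Finally, pinning can be done from the command line with the utilities just mentioned, but a program can also request it directly. The following Linux-specific sketch (the core number is a made-up example) pins the calling process to one core with sched_setaffinity; numactl achieves a similar effect without modifying the program.

    // Sketch: pin the calling process to a single core (Linux only).
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
      int core = 2;                  // hypothetical target core
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(core,&mask);
      // pid 0 means "this process"; the scheduler will keep it on that core,
      // preserving cache contents and NUMA locality
      if ( sched_setaffinity(0,sizeof(mask),&mask)!=0 )
        perror("sched_setaffinity");
      return 0;
    }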

44.1 What does the hardware look like?


If you want to optimize affinity, you should first know what the hardware looks like. The hwloc utility is valuable here [goglin:hwloc] ( https://www.open-mpi.org/projects/hwloc/ ).

FIGURE 44.2: Structure of a Stampede compute node

FIGURE 44.3: Structure of a Stampede largemem four-socket compute node

FIGURE 44.4: Structure of a Lonestar5 compute node

Figure 44.2 depicts a Stampede compute node, which is a two-socket Intel Sandybridge design; figure 44.3 shows a Stampede largemem node, which is a four-socket design. Finally, figure 44.4 shows a Lonestar5 compute node, a two-socket design with 12-core Intel Haswell processors with two hardware threads each.
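
A program can also query this structure through the hwloc C API. The following sketch (assuming hwloc is installed) counts the packages, cores, and hardware threads of the node it runs on; the hwloc tool lstopo displays the same information graphically.

    // Sketch: query the node structure with hwloc.
    #include <stdio.h>
    #include <hwloc.h>

    int main(void) {
      hwloc_topology_t topo;
      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      // HWLOC_OBJ_PACKAGE is a socket; older hwloc versions call it HWLOC_OBJ_SOCKET
      printf("packages: %d, cores: %d, hardware threads: %d\n",
             hwloc_get_nbobjs_by_type(topo,HWLOC_OBJ_PACKAGE),
             hwloc_get_nbobjs_by_type(topo,HWLOC_OBJ_CORE),
             hwloc_get_nbobjs_by_type(topo,HWLOC_OBJ_PU));

      hwloc_topology_destroy(topo);
      return 0;
    }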

44.2 Affinity control


See the chapter OpenMP topic: Affinity for OpenMP affinity control.
