
Experimental html version of downloadable textbook, see http://www.tacc.utexas.edu/~eijkhout/istc/istc.html

45 Hybrid computing

So far, you have learned to use MPI for distributed memory and OpenMP for shared memory parallel programming. However, distributed memory architectures actually have a shared memory component, since each cluster node is typically of a multicore design. Accordingly, you could program your cluster using MPI for inter-node and OpenMP for intra-node parallelism.

Say you use 100 cluster nodes, each with 16 cores. You could then start 1600 MPI processes, one for each core, but you could also start 100 processes, and give each access to 16 OpenMP threads.

In your SLURM scripts, the first scenario would be specified as -N 100 -n 1600, and the second as

#SBATCH -N 100
#SBATCH -n 100


There is a third choice, in between these extremes, that makes sense. A cluster node often has more than one socket, so you could put one MPI process on each socket, and use a number of threads equal to the number of cores per socket.

The script for this would be:

#SBATCH -N 100
#SBATCH -n 200

ibrun tacc_affinity yourprogram

The tacc_affinity script unsets a number of affinity-related environment variables. If you don't use tacc_affinity you may want to do this by hand; otherwise mvapich2 will use its own affinity rules.

FIGURE 45.1: Three modes of MPI/OpenMP usage on a multi-core cluster

Figure 45.1 illustrates these three modes: pure MPI with no threads used; one MPI process per node and full multi-threading; two MPI processes per node, one per socket, and multiple threads on each socket.

45.1 Discussion


The performance implications of the pure MPI strategy versus hybrid are subtle.

  • First of all, we note that there is no obvious speedup: in a well balanced MPI application all cores are busy all the time, so using threading gives no immediate improvement.
  • Both MPI and OpenMP are subject to Amdahl's law, which quantifies the influence of sequential code; in hybrid computing there is a new version of this law, regarding the amount of code that is MPI-parallel but not OpenMP-parallel.
  • MPI processes run unsynchronized, so small variations in load or in processor behaviour can be tolerated. The frequent barriers in OpenMP constructs make a hybrid code more tightly synchronized, so load balancing becomes more critical.
  • On the other hand, in OpenMP codes it is easier to divide the work into more tasks than there are threads, so statistically a certain amount of load balancing happens automatically.
  • Each MPI process has its own buffers, so a hybrid code, having fewer processes, incurs less buffer overhead.
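The hybrid version of Amdahl's law mentioned above can be made explicit (a sketch in our own notation, not the book's): if a fraction \(s\) of the runtime is MPI-parallel but not thread-parallel, then running with \(t\) threads per process gives a per-process time of

\[ T(t) = s + \frac{1-s}{t}, \]

so the speedup from threading is bounded: \(T(1)/T(t) < 1/s\), no matter how many threads each process uses.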


Review the scalability argument for 1D versus 2D matrix decomposition in [Eijkhout:IntroHPC]. Would you get scalable performance from doing a 1D decomposition (for instance, of the rows) over MPI processes, and decomposing the other direction (the columns) over OpenMP threads?

Another performance argument we need to consider concerns message traffic. If we let all threads make MPI calls (see section 45.2) there is going to be little difference. However, in one popular hybrid computing strategy we keep MPI calls out of the OpenMP regions, so that they are in effect done by the master thread. In that case there are only MPI messages between nodes, instead of between cores. This leads to a decrease in message traffic, though this is hard to quantify. The number of messages goes down approximately by the number of cores per node, so this is an advantage if the average message size is small. On the other hand, the amount of data sent is only reduced if there is overlap in content between the messages.

Limiting MPI traffic to the master thread also means that no buffer space is needed for the on-node communication.

45.2 Hybrid MPI-plus-threads execution


In hybrid execution, the main question is whether all threads are allowed to make MPI calls. To indicate this, replace the MPI_Init call by

int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)

MPI_Init_thread(required, provided, ierror)
INTEGER, INTENT(IN) :: required
INTEGER, INTENT(OUT) :: provided

Here the required and provided parameters can take the following (monotonically increasing) values:

  • MPI_THREAD_SINGLE : Only a single thread will execute.
  • MPI_THREAD_FUNNELED : The program may use multiple threads, but only the main thread will make MPI calls.

    The main thread is usually the one selected by the master directive, but technically it is the one that executes MPI_Init_thread. If you call this routine in a parallel region, the main thread may be different from the master.

  • MPI_THREAD_SERIALIZED : The program may use multiple threads, all of which may make MPI calls, but there will never be simultaneous MPI calls in more than one thread.
  • MPI_THREAD_MULTIPLE : Multiple threads may issue MPI calls, without restrictions.

After the initialization call, you can query the support level with

int MPI_Query_thread(int *provided)

MPI_Query_thread(provided, ierror)
INTEGER, INTENT(OUT) :: provided

In case more than one thread performs communication,

int MPI_Is_thread_main(int *flag)

MPI_Is_thread_main(flag, ierror)
LOGICAL, INTENT(OUT) :: flag

can determine whether a thread is the main thread.

MPL note: MPL always calls MPI_Init_thread requesting the highest level MPI_THREAD_MULTIPLE.

enum mpl::threading_modes {
  mpl::threading_modes::single = MPI_THREAD_SINGLE,
  mpl::threading_modes::funneled = MPI_THREAD_FUNNELED,
  mpl::threading_modes::serialized = MPI_THREAD_SERIALIZED,
  mpl::threading_modes::multiple = MPI_THREAD_MULTIPLE
};

threading_modes mpl::environment::threading_mode();
bool mpl::environment::is_thread_main();
End of MPL note

The mvapich implementation of MPI does have the required threading support, but you need to set an environment variable to activate it.


Another solution is to run your code like this:

  ibrun tacc_affinity <my_multithreaded_mpi_executable>

Intel MPI uses an environment variable to turn on thread support; it can have the following values:

  • release : multi-threaded with global lock
  • release_mt : multi-threaded with per-object lock for thread-split

The mpiexec program usually propagates environment variables, so the value of OMP_NUM_THREADS when you call mpiexec will be seen by each MPI process.
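For instance (a hypothetical launch line; the program name is illustrative):

```shell
# each of the 4 MPI processes will see OMP_NUM_THREADS=8
# and can run 8 OpenMP threads
export OMP_NUM_THREADS=8
mpiexec -n 4 ./yourprogram
```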

Some remarks about sending and receiving in threads:

  • It is possible to use blocking sends in threads, and let the threads block. This does away with the need for polling.
  • You can not send to a thread number: MPI addresses processes, not threads. Use the MPI message tag to direct a message to a specific thread.


Consider the 2D heat equation and explore the mix of MPI/OpenMP parallelism:

  • Give each node one MPI process that is fully multi-threaded.
  • Give each core an MPI process and don't use multi-threading.

Discuss theoretically why the former can give higher performance. Implement both schemes as special cases of the general hybrid case, and run tests to find the optimal mix.

// thread.c

if (procno==0) {
  switch (threading) {
  case MPI_THREAD_MULTIPLE : printf("Glorious multithreaded MPI\n"); break;
  case MPI_THREAD_SERIALIZED : printf("No simultaneous MPI from threads\n"); break;
  case MPI_THREAD_FUNNELED : printf("MPI from main thread\n"); break;
  case MPI_THREAD_SINGLE : printf("no threading supported\n"); break;
  }
}