30 OpenMP Examples

30.1 N-body problems


So-called N-body problems come up when we describe the interactions between a, probably large, number of entities under a force such as gravity. Examples are molecular dynamics and star clusters.

While clever algorithms exist that take into account the decay of the force over distance, we here consider the naive algorithm that explicitly computes all interactions.

A particle has $x,y$ coordinates and a mass $c$. For two particles $(x_1,y_1,c_1)$, $(x_2,y_2,c_2)$ the force on particle 1 from particle 2 is: \[ \overrightarrow F_{12} = \frac{c_1\cdot c_2}{\sqrt{ (x_2-x_1)^2+(y_2-y_1)^2 }} \cdot \overrightarrow r_{12} \] where $\overrightarrow r_{12}$ is the unit vector pointing from particle 2 to 1. With $n$ particles, each particle $i$ feels a force \[ \overrightarrow F_i = \sum_{j\not=i} \overrightarrow F_{ij}. \]

Let's start with a couple of building blocks.

// molecular.c
#include <math.h>

struct point{ double x,y; double c; };
struct force{ double x,y; double f; };

/* Force on p1 from p2: x,y give the direction, f the magnitude */
struct force force_calc( struct point p1,struct point p2 ) {
  double dx = p2.x - p1.x, dy = p2.y - p1.y;
  double  f = p1.c * p2.c / sqrt( dx*dx + dy*dy );
  struct force exert = {dx,dy,f};
  return exert;
}
Force accumulation:

void add_force( struct force *f,struct force g ) {
  f->x += g.x; f->y += g.y; f->f += g.f;
}
/* the reverse force has opposite components,
   but its magnitude still adds up */
void sub_force( struct force *f,struct force g ) {
  f->x -= g.x; f->y -= g.y; f->f += g.f;
}

For reference, this is the sequential code:

for (int ip=0; ip<N; ip++) {
  for (int jp=ip+1; jp<N; jp++) {
    struct force f = force_calc(points[ip],points[jp]);
    add_force( forces+ip,f );
    sub_force( forces+jp,f );
  }
}
Here $\overrightarrow F_{ij}$ is only computed for $j>i$, and then applied to both $\overrightarrow F_i$ and $\overrightarrow F_j$: added to the one, subtracted from the other.

Exercise

Argue that neither the outer loop nor the inner one is directly parallelizable.

We will now explore a number of different strategies for parallelization. All tests are done on the TACC Frontera cluster, which has dual-socket Intel Cascade Lake nodes, with a total of 56 cores. Our code uses 10 thousand particles, and each interaction evaluation is repeated 10 times to eliminate cache loading effects.

30.1.1 Solution 1: no conflicting writes


In our first attempt at an efficient parallel code, we compute the full set of $N^2$ interactions: computing $\overrightarrow F_{ij}$ for all $i,j$ turns each $\overrightarrow F_i$ into an independent reduction, so that there are no conflicting writes.

for (int ip=0; ip<N; ip++) {
  double sumx=0., sumy=0., sumf=0.;
#pragma omp parallel for reduction(+:sumx,sumy,sumf)
  for (int jp=0; jp<N; jp++) {
    if (ip==jp) continue;
    struct force f = force_calc(points[ip],points[jp]);
    sumx += f.x; sumy += f.y; sumf += f.f;
  } // end parallel jp loop
  struct force sumforce = {sumx,sumy,sumf};
  add_force( forces+ip, sumforce );
} // end ip loop

This increases the scalar work by a factor of two, but surprisingly, on a single thread the run time improves: we measure a speedup of 6.51 over the supposedly 'optimal' code.

Exercise

What would be an explanation?

FIGURE 30.1: Speedup of reduction variant over sequential

However, increasing the number of threads has limited benefits for this strategy. Figure 30.1 shows that the speedup is not only sublinear: it actually decreases with increasing core count.

Exercise

What would be an explanation?

30.1.2 Solution 2: using atomics


Next we try to parallelize the outer loop.

#pragma omp parallel for schedule(guided,4)
for (int ip=0; ip<N; ip++) {
  for (int jp=ip+1; jp<N; jp++) {
    struct force f = force_calc(points[ip],points[jp]);
    add_force( forces+ip,f );
    sub_force( forces+jp,f );
  }
}
The guided schedule deals with the load imbalance of the shrinking inner loop. Each ip iteration now belongs to a single thread, so the writes to forces[ip] are conflict-free; the writes to forces[jp], however, can conflict between threads, so we make them atomic:

void sub_force( struct force *f,struct force g ) {
#pragma omp atomic
  f->x -= g.x;
#pragma omp atomic
  f->y -= g.y;
#pragma omp atomic
  f->f += g.f;
}

FIGURE 30.2: Speedup of triangular loop with atomic update

This works fairly well, as figure 30.2 shows.

30.1.3 Solution 3: all interactions atomic


But if we decide to use atomic updates, we can take the full square loop, collapse the two loops, and make every write atomic.

#pragma omp parallel for collapse(2)
for (int ip=0; ip<N; ip++) {
  for (int jp=0; jp<N; jp++) {
    if (ip==jp) continue;
    struct force f = force_calc(points[ip],points[jp]);
    add_force( forces+ip, f );
  } // end parallel jp loop
} // end ip loop
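
With the loops collapsed, iterations with the same ip can run on different threads, so the forces[ip] updates now conflict as well, and add_force needs the same atomic treatment as sub_force above. A sketch of what that looks like:

void add_force( struct force *f,struct force g ) {
#pragma omp atomic
  f->x += g.x;
#pragma omp atomic
  f->y += g.y;
#pragma omp atomic
  f->f += g.f;
}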

FIGURE 30.3: Speedup of atomic full interaction calculation

Figure 30.3 shows that this is pretty close to perfect.

Figure 30.4 shows all strategies together in one plot.

FIGURE 30.4: All strategies together

30.2 Tree traversal


OpenMP tasks are a great way of handling trees.

In post-order tree traversal you visit the subtrees before visiting the root. This is the traversal that you use to find summary information about a tree, for instance the sum of all nodes, and the sums of nodes of all subtrees:

For all children $c$: compute the sum $s_c$.

$s \leftarrow \sum_c s_c$
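
A minimal sketch of this computation with OpenMP tasks; the node structure and the compute_sums routine are our own illustration, not code from this chapter's sources:

#include <omp.h>
#include <vector>

struct node {
  long value;
  std::vector<node*> children;
  long sum; // summary information, filled in post-order
};

void compute_sums( node *n ) {
  // spawn a task for each subtree ...
  for ( auto c : n->children ) {
#pragma omp task firstprivate(c)
    compute_sums(c);
  }
  // ... and wait for the children before combining their sums
#pragma omp taskwait
  n->sum = n->value;
  for ( auto c : n->children )
    n->sum += c->sum;
}

As usual, the initial call compute_sums(root) goes inside a parallel region with a single directive.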

Another example is matrix factorization: \[ S = A_{33} - A_{31}A_{11}^{-1} A_{13} - A_{32}A_{22}^{-1} A_{23} \] where the two inverses $A_{11}^{-1},A_{22}^{-1}$ can be computed independently and recursively.
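
The two independent inverses map naturally onto tasks; a schematic sketch, in which the block type and the factor routine are hypothetical stand-ins for an actual matrix library:

#include <omp.h>

struct block { /* matrix entries omitted */ };

// hypothetical routine: recursively factor/invert one diagonal block
void factor( block &A ) { /* ... recurse on A's diagonal blocks ... */ }

void factor_diagonal_blocks( block &A11, block &A22 ) {
  // the two inverses are independent, so each becomes a task
#pragma omp task shared(A11)
  factor(A11);
#pragma omp task shared(A22)
  factor(A22);
  // both must be available before the Schur complement S is formed
#pragma omp taskwait
}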

30.3 Depth-first search


In this section we look at the 'eight queens' problem as an example of DFS: is it possible to put eight queens on a chess board so that none of them threaten each other? With DFS, the search space of possibilities is organized as a tree -- each partial solution leads to several possibilities for the next step -- which is traversed in a particular manner: a chain of possibilities is extended as far as feasible, after which the search backtracks to the next chain.

The sequential implementation is easy enough. The main program fires off:

placement initial; initial.fill(empty);
auto solution = place_queen(0,initial);

where I hope you can take the details on trust.

The recursive call then has this structure:

optional<placement> place_queen(int iqueen,const placement& current) {
  for (int col=0; col<N; col++) {
    placement next = current;
    next.at(iqueen) = col;
    if (feasible(next)) {
      if (iqueen==N-1)
        return next;
      auto attempt = place_queen(iqueen+1,next);
      if (attempt.has_value())
        return attempt;
    } // end if(feasible)
  }
  return {};
}

(This uses the C++17 optional header.) At each iqueen level we

  • go through a loop of all column positions;
  • filter out positions that are not feasible;
  • report success if this was the last level; or
  • recursively continue the next level otherwise.

This problem seems a prime candidate for OpenMP tasks, so we start with the usual idiom for the main program:

placement initial; initial.fill(empty);
#pragma omp parallel
#pragma omp single

We create a task for each column, and since they are in a loop we use taskgroup rather than taskwait.

#pragma omp taskgroup
  for (int col=0; col<N; col++) {
    placement next = current;
    next.at(iqueen) = col;
#pragma omp task firstprivate(next)
    if (feasible(next)) {
    // stuff
    } // end if(feasible)
  }

However, the sequential program had return and break statements in the loop, and branching out of an OpenMP construct such as taskgroup is not allowed. Therefore we introduce a return variable, declared as shared:

// queens0.cxx
optional<placement> result = {};
#pragma omp taskgroup
for (int col=0; col<N; col++) {
  placement next = current;
  next.at(iqueen) = col;
#pragma omp task firstprivate(next) shared(result)
  if (feasible(next)) {
    if (iqueen==N-1) {
      result = next;
    } else { // do next level
      auto attempt = place_queen(iqueen+1,next);
      if (attempt.has_value()) {
        result = attempt;
      }
    } // end if(iqueen==N-1)
  } // end if(feasible)
}
return result;

So that was easy: this computes the right solution, and it uses OpenMP tasks. Done?

FIGURE 30.6: Using taskgroups for $N=12$; left Intel compiler, right GCC

Actually this runs very slowly because, now that we've dispensed with all early breaks from the loop, we in effect traverse the whole search tree. (It's not quite breadth-first, though.) Figure 30.6 shows this for $N=12$, with the Intel compiler (version 2019) in the left panel and the GNU compiler (version 9.1) in the right. In both cases, the blue bars give the result for the code with only the taskgroup directive, with time plotted as a function of core count.

We see that, for the Intel compiler, running time indeed goes down with core count. So, while we compute too much (the whole search space), at least parallelization helps. With a number of threads greater than the problem size, the benefit of parallelization disappears, which makes some sort of sense.

We also see that the GCC compiler is really bad at OpenMP tasks: the running time actually increases with the number of threads.

Fortunately, as of OpenMP 4 we can break out of the loop with a cancel of the task group:

// queenfinal.cxx
if (feasible(next)) {
  if (iqueen==N-1) {
    result = next;
#pragma omp cancel taskgroup
  } else { // do next level
    auto attempt = place_queen(iqueen+1,next);
    if (attempt.has_value()) {
      result = attempt;
#pragma omp cancel taskgroup
    }
  } // end if (iqueen==N-1)
} // end if (feasible)

Surprisingly, this does not immediately give a performance improvement. The reason for this is that cancellation is disabled by default, and we have to set the environment variable

OMP_CANCELLATION=true
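
Whether cancellation is active can be checked at run time; a minimal test, using the standard omp_get_cancellation routine:

#include <omp.h>
#include <cstdio>

int main() {
  // reports the setting of the OMP_CANCELLATION environment variable
  if (omp_get_cancellation())
    printf("cancellation enabled\n");
  else
    printf("cancellation disabled; set OMP_CANCELLATION=true\n");
  return 0;
}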

FIGURE 30.7: Using taskgroup cancelling, Intel compiler

With that, we get very good performance, as figure 30.7 shows: it lists the sequential time, and the multicore running time of the code with cancel directives. Running time is now approximately the same as the sequential time. Some questions are still left:

  • Why does the time go up with core count?
  • Why is the multicore code slower than the sequential code, and would the parallel code be faster than the sequential one if the amount of scalar work (for instance in the feasible function) were larger?

One observation not reported here is that the GNU compiler has basically the same running time with and without cancellation. This again shows that the GNU compiler is really bad at OpenMP tasks.
