crumb trail: > omp-examples > N-body problems
So-called N-body problem s come up with we describe the interactions between a, probably large, number of entities under a force such as gravity. Examples are molecular dynamics and star clusters.
While clever algorithms exist that take into account the decay of the force over distance, we here consider the naive algorithm that explicitly computes all interactions.
A particle has $x,y$ coordinates and a mass~$c$. For two particles $(x_1,y_1,c_1)$, $(x_2,y_2,c_2)$ the force on particle~1 from particle~2 is: \[ \overrightarrow F_{12} = \frac{c_1\cdot c_2}{\sqrt{ (x_2-x_1)^2+(y_2-y_1)^2 }} \cdot \overrightarrow r_{12} \] where $\overrightarrow r_{12}$ is the unit vector pointing from particle 2 to~1. With $n$ particles, each particle~$i$ feels a force \[ \overrightarrow F_i = \sum_{j\not=i} \overrightarrow F_{ij}. \]
Let's start with a couple of building blocks.
// molecular.c struct point{ double x,y; double c; }; struct force{ double x,y; double f; }; /* Force on p1 from p2 */ struct force force_calc( struct point p1,struct point p2 ) { double dx = p2.x - p1.x, dy = p2.y - p1.y; double f = p1.c * p2.c / sqrt( dx*dx + dy*dy ); struct force exert = {dx,dy,f}; return exert; }
void add_force( struct force *f,struct force g ) { f->x += g.x; f->y += g.y; f->f += g.f; } void sub_force( struct force *f,struct force g ) { f->x -= g.x; f->y -= g.y; f->f += g.f; }
For reference, this is the sequential code:
for (int ip=0; ip<N; ip++) { for (int jp=ip+1; jp<N; jp++) { struct force f = force_calc(points[ip],points[jp]); add_force( forces+ip,f ); sub_force( forces+jp,f ); } }
Argue that both the outer loop and the inner are not directly parallelizable.
We will now explore a number of different strategies for parallelization. All tests are done on the TACC Frontera cluster, which has dual-socket Intel Cascade Lake nodes, with a total of 56 cores. Our code uses 10 thousand particles, and each interaction evaluation is repeated 10 times to eliminate cache loading effects.
crumb trail: > omp-examples > N-body problems > Solution 1: no conflicting writes
In our first attempt at an efficient parallel code, we compute the full $N^2$ interactions. One solution would be to compute the $\overrightarrow F_{ij}$ interactions for all $i,j$, so that there are no conflicting writes.
for (int ip=0; ip<N; ip++) { double sumx=0., sumy=0., sumf=0.; #pragma omp parallel for reduction(+:sumx,sumy,sumf) for (int jp=0; jp<N; jp++) { if (ip==jp) continue; struct force f = force_calc(points[ip],points[jp]); sumx += f.x; sumy += f.y; sumf += f.f; } // end parallel jp loop struct force sumforce = {sumx,sumy,sumf}; add_force( forces+ip, sumforce ); } // end ip loop
This increases the scalar work by a factor of two, but surprisingly, on a single thread the run time improves: we measure a speedup of 6.51 over the supposedly `optimal' code.
What would be an explanation?
However, increasing the number of threads has limited benefits for this strategy. Figure 31.1 shows that the speedup is not only sublinear: it actually decreases with increasing core count.
What would be an explanation?
crumb trail: > omp-examples > N-body problems > Solution 2: using atomics
Next we try to parallelize the outer loop.
#pragma omp parallel for schedule(guided,4) for (int ip=0; ip<N; ip++) { for (int jp=ip+1; jp<N; jp++) { struct force f = force_calc(points[ip],points[jp]); add_force( forces+ip,f ); sub_force( forces+jp,f ); } }
void sub_force( struct force *f,struct force g ) { #pragma omp atomic f->x -= g.x; #pragma omp atomic f->y -= g.y; #pragma omp atomic f->f += g.f; }
This works fairly well, as figure 31.2 shows.
crumb trail: > omp-examples > N-body problems > Solution 3: all interactions atomic
But if we decide to use atomic updates, we can take the full square loop, collapse the two loops, and make every write atomic.
#pragma omp parallel for collapse(2) for (int ip=0; ip<N; ip++) { for (int jp=0; jp<N; jp++) { if (ip==jp) continue; struct force f = force_calc(points[ip],points[jp]); add_force( forces+ip, f ); } // end parallel jp loop } // end ip loop
Figure 31.3 shows that this is pretty close to perfect.
Everything in one plot in figure 31.4 .
crumb trail: > omp-examples > Tree traversal
OpenMP tasks are a great way of handling trees.
In post-order tree traversal you visit the subtrees before visiting the root. This is the traversal that you use to find summary information about a tree, for instance the sum of all nodes, and the sums of nodes of all subtrees:
\For{all children $c$} {compute the sum $s\_c$}\
$ s \leftarrow \sum\_c s\_c$Another example is matrix factorization: \[ S = A_{33} - A_{31}A_{11}\inv A_{13} - A_{32}A_{22}\inv A_{23} \] where the two inverses $A_{11}\inv,A_{22}\inv$ can be computed indepedently and recursively.
crumb trail: > omp-examples > Depth-first search
{r}{3in} In this section we look at the `eight queens' problem, as an example of DFS : is it possible to put eight queens on a chess board so that none of them threaten each other? With DFS , the search space of possibilities is organized as a tree -- each partial solution leads to several possibilities for the next steps -- which is traversed in a particular manner: a chain of possibilities is extended as far as feasible, after which the search backtracks to the next chain.
The sequential implementation is easy enough. The main program fires off:
placement initial; initial.fill(empty); auto solution = place_queen(0,initial);
where I hope you can take the details on trust.
The recursive call then has this structure:
optional<placement> place_queen(int iqueen,const placement& current) { for (int col=0; col<N; col++) { placement next = current; next.at(iqueen) = col; if (feasible(next)) { if (iqueen==N-1) return next; auto attempt = place_queen(iqueen+1,next); if (attempt.has_value()) return attempt; } // end if(feasible) } return {}; };
(This uses the C++17 optional header.) At each iqueen level we
This problem seems a prime candidate for OpenMP tasks, so we start with the usual idiom for the main program:
placement initial; initial.fill(empty); #pragma omp parallel #pragma omp single
We create a task for each column, and since they are in a loop we use taskgroup rather than taskwait .
#pragma omp taskgroup for (int col=0; col<N; col++) { placement next = current; next.at(iqueen) = col; #pragma omp task firstprivate(next) if (feasible(next)) { // stuff } // end if(feasible) }
However, the sequential program had return and break statements in the loop, which is not allowed in workshare constructs such as taskgroup . Therefore we introduce a return variable, declared as shared:
// queens0.cxx optional<placement> result = {}; #pragma omp taskgroup for (int col=0; col<N; col++) { placement next = current; next.at(iqueen) = col; #pragma omp task firstprivate(next) shared(result) if (feasible(next)) { if (iqueen==N-1) { result = next; } else { // do next level auto attempt = place_queen(iqueen+1,next); if (attempt.has_value()) { result = attempt; } } // end if(iqueen==N-1) } // end if(feasible) } return result;
So that was easy, this computes the right solution, and it uses OpenMP tasks. Done?
Actually this runs very slowly because, now that we've dispensed with all early breaks from the loop, we in effect traverse the whole search tree. (It's not quite breadth-first, though.) Figure 31.6 shows this for $N=12$ with the Intel compiler (version 2019) in the left panel, and the GNU compiler (version 9.1) in the middle. In both cases, the blue bars give the result for the code with only the taskgroup directive, with time plotted as function of core count.
We see that, for the Intel compiler, running time indeed goes down with core count. So, while we compute too much (the whole search space), at least parallelization helps. With a number of threads greater than the problem size, the benefit of parallelization disappears, which makes some sort of sense.
We also see that the GCC compiler is really bad at OpenMP tasks: the running time actually increases with the number of threads.
Fortunately, with OpenMP- we can break out of the loop with a cancel of the task group:
// queenfinal.cxx if (feasible(next)) { if (iqueen==N-1) { result = next; #pragma omp cancel taskgroup } else { // do next level auto attempt = place_queen(iqueen+1,next); if (attempt.has_value()) { result = attempt; #pragma omp cancel taskgroup } } // end if (iqueen==N-1) } // end if (feasible)
Surprisingly, this does not immediately give a performance improvement. The reason for this is that cancellation is disabled by default, and we have to set the environment variable
OMP_CANCELLATION=true
With that, we get very good performance, as figure 31.7 shows, which lists sequential time, and multicore running time on the code with cancel directives. Running time is now approximately the same as the sequential time. Some questions are still left:
One observation not reported here is that the GNU compiler has basically the same running time with and without cancellation. This is again shows that the GNU compiler is really bad at OpenMP tasks.