Parallel tasks often produce some quantity that needs to be summed or otherwise combined. In section 14.5 you saw an example, and it was stated that the solution given there was not very good.
The problem in that example was the race condition involving the result variable. The simplest solution is to eliminate the race condition by declaring a critical section :
double result = 0; #pragma omp parallel { double local_result; int num = omp_get_thread_num(); if (num==0) local_result = f(x); else if (num==1) local_result = g(x); else if (num==2) local_result = h(x); #pragma omp critical result += local_result; }
This is a good solution if the amount of serialization in the critical section is small compared to computing the functions $f,g,h$. On the other hand, you may not want to do that in a loop:
double result = 0; #pragma omp parallel { double local_result; #pragma omp for for (i=0; i<N; i++) { local_result = f(x,i); #pragma omp critical result += local_result; } // end of for loop }
Can you think of a small modification of this code, that still uses a critical section, that is more efficient? Time both codes.
The easiest way to effect a reduction is of course to use the clause. Adding this to an omp for or an omp sections construct has the following effect:
This is one of those cases where the parallel execution can have a slightly different value from the one that is computed sequentially, because floating point operations are not associative. See Eijkhout:IntroHPC for more explanation.
If your code can not be easily structure as a reduction, you can realize the above scheme by hand by `duplicating' the global variable and gather the contributions later. This example presumes three threads, and gives each a location of their own to store the result computed on that thread:
double result,local_results[3]; #pragma omp parallel { int num = omp_get_thread_num(); if (num==0) local_results[num] = f(x) else if (num==1) local_results[num] = g(x) else if (num==2) local_results[num] = h(x) } result = local_results[0]+local_results[1]+local_results[2]
While this code is correct, it may be inefficient because of a phenomemon called false sharing . Even though the threads write to separate variables, those variables are likely to be on the same cacheline (see Eijkhout:IntroHPC for an explanation). This means that the cores will be wasting a lot of time and bandwidth updating each other's copy of this cacheline.
False sharing can be prevent by giving each thread its own cacheline:
double result,local_results[3][8]; #pragma omp parallel { int num = omp_get_thread_num(); if (num==0) local_results[num][1] = f(x) // et cetera }
A more elegant solution gives each thread a true local variable, and uses a critical section to sum these, at the very end:
double result = 0; #pragma omp parallel { double local_result; local_result = ..... #pragam omp critical result += local_result; }
crumb trail: > omp-reduction > Built-in reduction operators
Arithmetic reductions: $+,*,-,\max,\min$
Logical operator reductions in C: & && | || ^
Logical operator reductions in Fortran: .and. .or. .eqv. .neqv. .iand. .ior. .ieor.
The maximum and minimum reductions were not added to OpenMP until version 3.1. Write a parallel loop that computes the maximum and minimum values in an array. Discuss the various options. Do timings to evaluate the speedup that is attained and to find the best option.
crumb trail: > omp-reduction > Initial value for reductions
The treatment of initial values in reductions is slightly involved.
x = init_x #pragma omp parallel for reduction(min:x) for (int i=0; i<N; i++) x = min(x,data[i]);
Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value dependent on the operator. In the end, the partial results will then be combined with the user initial value. The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For min and max they are respectively the maximal and minimal representable value of the result type.
Figure 19.1 illustrates this, where 1,2,3,4 are four data items, i is the OpenMP initialization, and u is the user initialization; each p stands for a partial reduction value. The figure is based on execution using two threads.
Write a program to test the fact that the partial results are initialized to the unit of the reduction operator.
crumb trail: > omp-reduction > User-defined reductions
With user-defined reductions , the programmer specifies the function that does the elementwise comparison. This takes two steps.
This is the syntax of the definition of the reduction, which can then be used in multiple
#pragma omp declare reduction ( identifier : typelist : combiner ) [initializer(initializer-expression)]
where:
For instance, recreating the maximum reduction would look like this:
// ireduct.c int mymax(int r,int n) { // r is the already reduced value // n is the new value int m; if (n>r) { m = n; } else { m = r; } return m; } #pragma omp declare reduction \ (rwz:int:omp_out=mymax(omp_out,omp_in)) \ initializer(omp_priv=INT_MIN) m = INT_MIN; #pragma omp parallel for reduction(rwz:m) for (int idata=0; idata
Write a reduction routine that operates on an array of non-negative integers, finding the smallest nonzero one. If the array has size zero, or entirely consists of zeros, return -1 .
Support for C++ iterators
#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
crumb trail: > omp-reduction > Reductions and floating-point math
The mechanisms that OpenMP uses to make a reduction parallel go against the strict rules for floating point expression evaluation in C; see Eijkhout:IntroHPC . OpenMP ignores this issue: it is the programmer's job to ensure proper rounding behaviour.