Parallel tasks often produce some quantity that needs to be summed or otherwise combined. In section labelstring you saw an example, and it was stated that the solution given there was not very good.
The problem in that example was the
race condition
involving the
result
variable. The simplest solution is to eliminate the race condition
by declaring a
critical section
:
double result = 0; #pragma omp parallel { double local_result; int num = omp_get_thread_num(); if (num==0) local_result = f(x); else if (num==1) local_result = g(x); else if (num==2) local_result = h(x); #pragma omp critical result += local_result; }
This is a good solution if the amount of serialization in the critical section
is small compared to computing the functions $f,g,h$. On the other hand, you
may not want to do that in a loop:
double result = 0; #pragma omp parallel { double local_result; #pragma omp for for (i=0; i<N; i++) { local_result = f(x,i); #pragma omp critical result += local_result; } // end of for loop }
Can you think of a small modification of this code, that still uses a critical section, that is more efficient? Time both codes.
The easiest way to effect a reduction is of course to use the
\indexclause{reduction}
clause. Adding this to an
omp for
or an
omp sections
construct
has the following effect:
OpenMP will make a copy of the reduction variable per thread, initialized to the identity of the reduction operator, for instance $1$ for multiplication.
Each thread will then reduce into its local variable;
At the end of the loop, the local results are combined, again using the reduction operator, into the global variable.
If your code can not be easily structure as a reduction, you can
realize the above scheme by hand by
`duplicating' the global variable and gather the contributions later.
This example presumes three threads, and gives each a location of their
own to store the result computed on that thread:
double result,local_results[3]; #pragma omp parallel { int num = omp_get_thread_num(); if (num==0) local_results[num] = f(x) else if (num==1) local_results[num] = g(x) else if (num==2) local_results[num] = h(x) } result = local_results[0]+local_results[1]+local_results[2]While this code is correct, it may be inefficient because of a phenomemon called false sharing . Even though the threads write to separate variables, those variables are likely to be on the same cacheline (see Eijkhout:IntroHPC for an explanation). This means that the cores will be wasting a lot of time and bandwidth updating each other's copy of this cacheline.
False sharing can be prevent by giving each thread its own cacheline:
double result,local_results[3][8]; #pragma omp parallel { int num = omp_get_thread_num(); if (num==0) local_results[num][1] = f(x) // et cetera }A more elegant solution gives each thread a true local variable, and uses a critical section to sum these, at the very end:
double result = 0; #pragma omp parallel { double local_result; local_result = ..... #pragam omp critical result += local_result; }
Arithmetic reductions: $+,*,-,\max,\min$
Logical operator reductions in C: & && | || ^
Logical operator reductions in Fortran: .and. .or. .eqv. .neqv. .iand. .ior. .ieor.
The maximum and minimum reductions were not added to OpenMP until version 3.1. Write a parallel loop that computes the maximum and minimum values in an array. Discuss the various options. Do timings to evaluate the speedup that is attained and to find the best option.
The treatment of initial values in reductions is slightly involved.
x = init_x #pragma omp parallel for reduction(min:x) for (int i=0; i<N; i++) x = min(x,data[i]);Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value dependent on the operator. In the end, the partial results will then be combined with the user initial value. The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For min and max they are respectively the maximal and minimal representable value of the result type.
Figure 1 illustrates this, where 1,2,3,4 are four data items, i is the OpenMP initialization, and u is the user initialization; each p stands for a partial reduction value. The figure is based on execution using two threads.
Write a program to test the fact that the partial results are initialized to the unit of the reduction operator.
With
user-defined reductions
, the programmer specifies the
function that does the elementwise comparison.
This takes two steps.
You need a function of two arguments that returns the result of the comparison. You can do this yourself, but, especially with the C++ standard library, you can use functions such as std::vector::insert .
Specifying how this function operates on two variables \indexompshow{omp_out} and \indexompshow{omp_in}, corresponding to the partially reduced result and the new operand respectively. The new partial result should be left in omp_out .
Optionally, you can specify the value to which the reduction should be initialized.
This is the syntax of the definition of the reduction, which can then
be used in multiple
\indexclause{reduction} clauses.
#pragma omp declare reduction ( identifier : typelist : combiner ) [initializer(initializer-expression)]where:
[ identifier ] is a name; this can be overloaded for different types, and redefined in inner scopes.
[ typelist ] is a list of types.
[ combiner ] is an expression that updates the internal variable \indexompshow{omp_out} as function of itself and \indexompshow{omp_in}.
[ initializer ] sets \indexompshow{omp_priv} to the identity of the reduction; this can be an expression or a brace initializer.
For instance, recreating the maximum reduction would look like this:
// ireduct.c int mymax(int r,int n) { // r is the already reduced value // n is the new value int m; if (n>r) { m = n; } else { m = r; } return m; } #pragma omp declare reduction \ (rwz:int:omp_out=mymax(omp_out,omp_in)) \ initializer(omp_priv=INT_MIN) m = INT_MIN; #pragma omp parallel for reduction(rwz:m) for (int idata=0; idata
Write a reduction routine that operates on an array of non-negative integers, finding the smallest nonzero one. If the array has size zero, or entirely consists of zeros, return -1 .
The mechanisms that OpenMP uses to make a reduction parallel go against the strict rules for floating point expression evaluation in C; see Eijkhout:IntroHPC . OpenMP ignores this issue: it is the programmer's job to ensure proper rounding behaviour.