# OpenMP topic: Reductions

$\newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{{\equiv\atop\mathrm{\scriptstyle D}}}}}$ 20.1 : Built-in reduction operators
20.2 : Initial value for reductions
20.3 : User-defined reductions
20.4 : Reductions and floating-point math

# 20 OpenMP topic: Reductions

Parallel tasks often produce some quantity that needs to be summed or otherwise combined. In section  labelstring you saw an example, and it was stated that the solution given there was not very good.

The problem in that example was the race condition involving the result variable. The simplest solution is to eliminate the race condition by declaring a critical section :

double result = 0;
#pragma omp parallel
{
double local_result;
if (num==0)      local_result = f(x);
else if (num==1) local_result = g(x);
else if (num==2) local_result = h(x);
#pragma omp critical
result += local_result;
}


This is a good solution if the amount of serialization in the critical section is small compared to computing the functions $f,g,h$. On the other hand, you may not want to do that in a loop:

double result = 0;
#pragma omp parallel
{
double local_result;
#pragma omp for
for (i=0; i<N; i++) {
local_result = f(x,i);
#pragma omp critical
result += local_result;
} // end of for loop
}

Exercise

Can you think of a small modification of this code, that still uses a critical section, that is more efficient? Time both codes.

The easiest way to effect a reduction is of course to use the \indexclause{reduction} clause. Adding this to an omp for or an omp sections construct has the following effect:

• OpenMP will make a copy of the reduction variable per thread, initialized to the identity of the reduction operator, for instance $1$ for multiplication.

• Each thread will then reduce into its local variable;

• At the end of the loop, the local results are combined, again using the reduction operator, into the global variable.

This is one of those cases where the parallel execution can have a slightly different value from the one that is computed sequentially, because floating point operations are not associative. See  Eijkhout:IntroHPC for more explanation.

If your code can not be easily structure as a reduction, you can realize the above scheme by hand by duplicating' the global variable and gather the contributions later. This example presumes three threads, and gives each a location of their own to store the result computed on that thread:

double result,local_results[3];
#pragma omp parallel
{
if (num==0)      local_results[num] = f(x)
else if (num==1) local_results[num] = g(x)
else if (num==2) local_results[num] = h(x)
}
result = local_results[0]+local_results[1]+local_results[2]

While this code is correct, it may be inefficient because of a phenomemon called false sharing . Even though the threads write to separate variables, those variables are likely to be on the same cacheline (see Eijkhout:IntroHPC for an explanation). This means that the cores will be wasting a lot of time and bandwidth updating each other's copy of this cacheline.

False sharing can be prevent by giving each thread its own cacheline:

double result,local_results[3][8];
#pragma omp parallel
{
if (num==0)      local_results[num][1] = f(x)
// et cetera
}

A more elegant solution gives each thread a true local variable, and uses a critical section to sum these, at the very end:
double result = 0;
#pragma omp parallel
{
double local_result;
local_result = .....
#pragam omp critical
result += local_result;
}


# 20.1 Built-in reduction operators

Top > Built-in reduction operators

Arithmetic reductions: $+,*,-,\max,\min$

Logical operator reductions in C: & && | || ^

Logical operator reductions in Fortran: .and. .or. .eqv. .neqv. .iand. .ior. .ieor.

Exercise

The maximum and minimum reductions were not added to OpenMP until version 3.1. Write a parallel loop that computes the maximum and minimum values in an array. Discuss the various options. Do timings to evaluate the speedup that is attained and to find the best option.

# 20.2 Initial value for reductions

Top > Initial value for reductions

The treatment of initial values in reductions is slightly involved.

x = init_x
#pragma omp parallel for reduction(min:x)
for (int i=0; i<N; i++)
x = min(x,data[i]);

Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value dependent on the operator. In the end, the partial results will then be combined with the user initial value. The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For min and max they are respectively the maximal and minimal representable value of the result type.

Figure  1 illustrates this, where 1,2,3,4 are four data items, i  is the OpenMP initialization, and u  is the user initialization; each p  stands for a partial reduction value. The figure is based on execution using two threads.

Exercise

Write a program to test the fact that the partial results are initialized to the unit of the reduction operator.

# 20.3 User-defined reductions

Top > User-defined reductions

With user-defined reductions , the programmer specifies the function that does the elementwise comparison. This takes two steps.

1. You need a function of two arguments that returns the result of the comparison. You can do this yourself, but, especially with the C++ standard library, you can use functions such as std::vector::insert .

2. Specifying how this function operates on two variables \indexompshow{omp_out} and \indexompshow{omp_in}, corresponding to the partially reduced result and the new operand respectively. The new partial result should be left in omp_out .

3. Optionally, you can specify the value to which the reduction should be initialized.

This is the syntax of the definition of the reduction, which can then be used in multiple \indexclause{reduction} clauses.

#pragma omp declare reduction
( identifier : typelist : combiner )
[initializer(initializer-expression)]

where:

• [ identifier ] is a name; this can be overloaded for different types, and redefined in inner scopes.

• [ typelist ] is a list of types.

• [ combiner ] is an expression that updates the internal variable \indexompshow{omp_out} as function of itself and \indexompshow{omp_in}.

• [ initializer ] sets \indexompshow{omp_priv} to the identity of the reduction; this can be an expression or a brace initializer.

For instance, recreating the maximum reduction would look like this:

// ireduct.c
int mymax(int r,int n) {
// r is the already reduced value
// n is the new value
int m;
if (n>r) {
m = n;
} else {
m = r;
}
return m;
}
#pragma omp declare reduction \
(rwz:int:omp_out=mymax(omp_out,omp_in)) \
initializer(omp_priv=INT_MIN)
m = INT_MIN;
#pragma omp parallel for reduction(rwz:m)
for (int idata=0; idata
`

Exercise

Write a reduction routine that operates on an array of non-negative integers, finding the smallest nonzero one. If the array has size zero, or entirely consists of zeros, return  -1 .

# 20.4 Reductions and floating-point math

Top > Reductions and floating-point math

The mechanisms that OpenMP uses to make a reduction parallel go against the strict rules for floating point expression evaluation in C; see  Eijkhout:IntroHPC . OpenMP ignores this issue: it is the programmer's job to ensure proper rounding behaviour.