OpenMP topic: Reductions

Experimental HTML version of downloadable textbook.
20.1 : Reductions: why, what, how?
20.1.1 : Code your own solution
20.1.2 : Reduction clause
20.2 : Built-in reduction operators
20.3 : Initial value for reductions
20.4 : User-defined reductions
20.5 : Reductions and floating-point math

20 OpenMP topic: Reductions

20.1 Reductions: why, what, how?


Parallel tasks often produce some quantity that needs to be summed or otherwise combined. In section 17.5 you saw an example, and it was stated that the solution given there was not very good.

The problem in that example was the race condition involving the result variable. We will discuss several strategies of dealing with this.

20.1.1 Code your own solution


The simplest solution is to eliminate the race condition by declaring a critical section :

double result = 0;
#pragma omp parallel
{
  double local_result;
  int num = omp_get_thread_num();
  if (num==0)      local_result = f(x);
  else if (num==1) local_result = g(x);
  else if (num==2) local_result = h(x);
#pragma omp critical
  result += local_result;
}

This is a good solution if the amount of serialization in the critical section is small compared to computing the functions $f,g,h$. On the other hand, you may not want to do that in a loop:

double result = 0;
#pragma omp parallel
{
  double local_result;
#pragma omp for
  for (int i=0; i<N; i++) {
    local_result = f(x,i);
#pragma omp critical
    result += local_result;
  } // end of for loop
}

Can you think of a small modification of this code, that still uses a critical section, that is more efficient? Time both codes.

20.1.2 Reduction clause


The easiest way to effect a reduction is of course to use the reduction clause. Adding this clause to an omp parallel region has the following effect:

  • OpenMP will make a copy of the reduction variable per thread, initialized to the identity of the reduction operator, for instance $1$ for multiplication.
  • Each thread will then reduce into its local variable;
  • At the end of the parallel region, the local results are combined, again using the reduction operator, into the global variable.
// reductpar.c
m = INT_MIN;
#pragma omp parallel reduction(max:m) num_threads(ndata)
{
  int t = omp_get_thread_num();
  int d = data[t];
  m = d>m ? d : m;
}

If you want to reduce multiple variables with the same operator, list them in a single clause: reduction(+:x,y,z). For multiple reductions with different operators, use more than one clause.

A reduction is one of those cases where the parallel execution can have a slightly different value from the one that is computed sequentially, because floating point operations are not associative. See  Eijkhout:IntroHPC for more explanation.

If your code cannot easily be structured as a reduction, you can realize the above scheme by hand, by 'duplicating' the global variable and gathering the contributions later. This example presumes three threads, and gives each a location of its own to store the result computed on that thread:

double result,local_results[3];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num] = f(x);
  else if (num==1) local_results[num] = g(x);
  else if (num==2) local_results[num] = h(x);
}
result = local_results[0]+local_results[1]+local_results[2];

While this code is correct, it may be inefficient because of a phenomenon called false sharing . Even though the threads write to separate variables, those variables are likely to be on the same cacheline (see Eijkhout:IntroHPC for an explanation). This means that the cores will be wasting a lot of time and bandwidth updating each other's copy of this cacheline.

False sharing can be prevented by giving each thread its own cacheline:

double result,local_results[3][8];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num][1] = f(x);
  // et cetera
}

A more elegant solution gives each thread a true local variable, and uses a critical section to sum these, at the very end:

double result = 0;
#pragma omp parallel
{
  double local_result;
  local_result = .....
#pragma omp critical
  result += local_result;
}

20.2 Built-in reduction operators


Arithmetic reductions: +,*,-,max,min .

Logical and bitwise operator reductions in C: && || & | ^

Logical operator reductions in Fortran: .and. .or. .eqv. .neqv., and the bit functions iand ior ieor


The maximum and minimum reductions were not added to OpenMP until OpenMP-3.1. Write a parallel loop that computes the maximum and minimum values in an array without using the reduction clause. Discuss the various options. Do timings to evaluate the speedup that is attained and to find the best option.

20.3 Initial value for reductions


The treatment of initial values in reductions is slightly involved.

  x = init_x;
#pragma omp parallel for reduction(min:x)
  for (int i=0; i<N; i++)
    x = min(x,data[i]);

Each thread does a partial reduction, but its initial value is not the user-supplied init_x value, but a value dependent on the operator. In the end, the partial results will then be combined with the user initial value. The initialization values are mostly self-evident, such as zero for addition and one for multiplication. For min and max they are respectively the maximal and minimal representable value of the result type.

Figure 20.1: Reduction of four items on two threads, taking into account initial values.

Figure 20.1 illustrates this, where 1,2,3,4 are four data items, i is the OpenMP initialization, and u is the user initialization; each p stands for a partial reduction value. The figure is based on execution using two threads.


Write a program to test the fact that the partial results are initialized to the unit of the reduction operator.

20.4 User-defined reductions


With user-defined reductions , the programmer specifies the function that combines two operands. This takes the following steps.

  1. You need a function of two arguments that returns the combined result. You can write this yourself, but, especially with the C++ standard library, you can also use existing functions such as std::vector::insert .
  2. You specify how this function operates on the two variables omp_out and omp_in , corresponding to the partially reduced result and the new operand respectively. The new partial result should be left in omp_out .
  3. Optionally, you can specify the value to which the reduction should be initialized.

This is the syntax of the definition of the reduction, which can then be used in multiple reduction clauses:

#pragma omp declare reduction
    ( identifier : typelist : combiner )
    [initializer(initializer-expression)]


  • [ identifier ] is a name; this can be overloaded for different types, and redefined in inner scopes.
  • [ typelist ] is a list of types.
  • [ combiner ] is an expression that updates the internal variable omp_out as function of itself and omp_in .
  • [ initializer ] sets omp_priv to the identity of the reduction; this can be an expression or a brace initializer.

For instance, recreating the maximum reduction would look like this:

// ireduct.c
int mymax(int r,int n) {
// r is the already reduced value
// n is the new value
  int m;
  if (n>r) {
    m = n;
  } else {
    m = r;
  }
  return m;
}
#pragma omp declare reduction \
  (rwz:int:omp_out=mymax(omp_out,omp_in))
m = INT_MIN;
#pragma omp parallel for reduction(rwz:m)
  for (int idata=0; idata<ndata; idata++)
    m = mymax(m,data[idata]);


Write a reduction routine that operates on an array of nonnegative integers, finding the smallest nonzero one. If the array has size zero, or entirely consists of zeros, return  -1 .

C++ note: Support for C++ iterators

#pragma omp declare reduction (merge : std::vector<int>
    : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))


20.5 Reductions and floating-point math


The mechanisms that OpenMP uses to make a reduction parallel go against the strict rules for floating point expression evaluation in C; see  Eijkhout:IntroHPC . OpenMP ignores this issue: it is the programmer's job to ensure proper rounding behavior.
