OpenMP Review

Experimental HTML version of the downloadable textbook; see http://www.tacc.utexas.edu/~eijkhout/istc/istc.html
29.1 : Concepts review
29.1.1 : Basic concepts
29.1.2 : Parallel regions
29.1.3 : Work sharing
29.1.4 : Data scope
29.1.5 : Synchronization
29.1.6 : Tasks
29.2 : Review questions
29.2.1 : Directives
29.2.2 : Parallelism
29.2.3 : Data and synchronization
29.2.3.1 :
29.2.3.2 :
29.2.3.3 :
29.2.4 : Reductions
29.2.4.1 :
29.2.4.2 :
29.2.5 : Barriers
29.2.6 : Data scope
29.2.7 : Tasks
29.2.8 : Scheduling

29 OpenMP Review

29.1 Concepts review

29.1.1 Basic concepts

  • process / thread / thread team
  • threads / cores / tasks
  • directives / library functions / environment variables

29.1.2 Parallel regions

execution by a team

29.1.3 Work sharing

  • loop / sections / single / workshare
  • implied barrier
  • loop scheduling, reduction
  • sections
  • single vs master
  • (F) workshare

29.1.4 Data scope

  • shared vs private, C vs F
  • loop variables and reduction variables
  • default declaration
  • firstprivate, lastprivate

29.1.5 Synchronization

  • barriers, implied and explicit
  • nowait
  • critical sections
  • locks, difference with critical

29.1.6 Tasks

  • generation vs execution
  • dependencies

29.2 Review questions

29.2.1 Directives

What do the following programs output?

#include <stdio.h>
#include <omp.h>

int main() {
  printf("procs %d\n",
    omp_get_num_procs());
  printf("threads %d\n",
    omp_get_num_threads());
  printf("num %d\n",
    omp_get_thread_num());
  return 0;
}

#include <stdio.h>
#include <omp.h>

int main() {
#pragma omp parallel
  {
  printf("procs %d\n",
    omp_get_num_procs());
  printf("threads %d\n",
    omp_get_num_threads());
  printf("num %d\n",
    omp_get_thread_num());
  }
  return 0;
}

Program main
  use omp_lib
  print *,"Procs:",&
    omp_get_num_procs()
  print *,"Threads:",&
    omp_get_num_threads()
  print *,"Num:",&
    omp_get_thread_num()
End Program

Program main
  use omp_lib
!$OMP parallel
  print *,"Procs:",&
    omp_get_num_procs()
  print *,"Threads:",&
    omp_get_num_threads()
  print *,"Num:",&
    omp_get_thread_num()
!$OMP end parallel
End Program

29.2.2 Parallelism

Can the following loops be parallelized? If so, how? (Assume that all arrays are already filled in, and that there are no out-of-bounds errors.)

// variant #1
for (i=0; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i] + c[i+1];
}

// variant #2
for (i=0; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i+1] + c[i+1];
}

// variant #3
for (i=1; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i] = 2*x[i-1] + c[i+1];
}

// variant #4
for (i=1; i<N; i++) {
  x[i] = a[i]+b[i+1];
  a[i+1] = 2*x[i-1] + c[i+1];
}

! variant #1
do i=1,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i) + c(i+1)
end do

! variant #2
do i=1,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i+1) + c(i+1)
end do

! variant #3
do i=2,N
  x(i) = a(i)+b(i+1)
  a(i) = 2*x(i-1) + c(i+1)
end do

! variant #4
do i=2,N
  x(i) = a(i)+b(i+1)
  a(i+1) = 2*x(i-1) + c(i+1)
end do

29.2.3 Data and synchronization

29.2.3.1

What is the output of the following fragments? Assume that there are four threads.

// variant #1
int nt;
#pragma omp parallel
  {
  nt = omp_get_thread_num();
  printf("thread number: %d\n",nt);
  }

// variant #2
int nt;
#pragma omp parallel private(nt)
  {
  nt = omp_get_thread_num();
  printf("thread number: %d\n",nt);
  }

// variant #3
int nt;
#pragma omp parallel
  {
#pragma omp single
    {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
    }
  }

// variant #4
int nt;
#pragma omp parallel
  {
#pragma omp master
    {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
    }
  }

// variant #5
int nt;
#pragma omp parallel
  {
#pragma omp critical
    {
    nt = omp_get_thread_num();
    printf("thread number: %d\n",nt);
    }
  }

! variant #1
  integer nt
!$OMP parallel
  nt = omp_get_thread_num()
  print *,"thread number:",nt
!$OMP end parallel

! variant #2
  integer nt
!$OMP parallel private(nt)
  nt = omp_get_thread_num()
  print *,"thread number:",nt
!$OMP end parallel

! variant #3
  integer nt
!$OMP parallel
!$OMP single
    nt = omp_get_thread_num()
    print *,"thread number:",nt
!$OMP end single
!$OMP end parallel

! variant #4
  integer nt
!$OMP parallel
!$OMP master
    nt = omp_get_thread_num()
    print *,"thread number:",nt
!$OMP end master
!$OMP end parallel

! variant #5
  integer nt
!$OMP parallel
!$OMP critical
    nt = omp_get_thread_num()
    print *,"thread number:",nt
!$OMP end critical
!$OMP end parallel

29.2.3.2

The following is an attempt to parallelize a serial code. Assume that all variables and arrays are defined. What errors and potential problems do you see in this code? How would you fix them?

#pragma omp parallel
{
  x = f();
  #pragma omp for
  for (i=0; i<N; i++)
    y[i] = g(x,i);
  z = h(y);
}

!$OMP parallel
  x = f()
!$OMP do
  do i=1,N
    y(i) = g(x,i)
  end do
!$OMP end do
  z = h(y)
!$OMP end parallel

29.2.3.3

Assume two threads. What does the following program output?

int a;
#pragma omp parallel private(a)
{
  ...
  a = 0;
  #pragma omp for
  for (int i = 0; i < 10; i++)
  {
    #pragma omp atomic
    a++;
  }
  #pragma omp single
    printf("a=%d\n",a);
}

29.2.4 Reductions

29.2.4.1

Is the following code correct? Is it efficient? If not, can you improve it?

#pragma omp parallel shared(r)
{
  int x;
  x = f(omp_get_thread_num());
#pragma omp critical
  r += f(x);
}

29.2.4.2

Compare two fragments:

// variant 1
#pragma omp parallel reduction(+:s)
#pragma omp for
  for (i=0; i<N; i++)
    s += f(i);

// variant 2
#pragma omp parallel
#pragma omp for reduction(+:s)
  for (i=0; i<N; i++)
    s += f(i);

! variant 1
!$OMP parallel reduction(+:s)
!$OMP do
  do i=1,N
    s = s + f(i)
  end do
!$OMP end do
!$OMP end parallel

! variant 2
!$OMP parallel
!$OMP do reduction(+:s)
  do i=1,N
    s = s + f(i)
  end do
!$OMP end do
!$OMP end parallel

Do they compute the same thing?

29.2.5 Barriers

Are the following two code fragments well defined?

#pragma omp parallel
{
#pragma omp for
for (mytid=0; mytid<nthreads; mytid++)
  x[mytid] = some_calculation();
#pragma omp for
for (mytid=0; mytid<nthreads-1; mytid++)
  y[mytid] = x[mytid]+x[mytid+1];
}

#pragma omp parallel
{
#pragma omp for
for (mytid=0; mytid<nthreads; mytid++)
  x[mytid] = some_calculation();
#pragma omp for nowait
for (mytid=0; mytid<nthreads-1; mytid++)
  y[mytid] = x[mytid]+x[mytid+1];
}

29.2.6 Data scope

The following program is supposed to initialize as many rows of the array as there are threads.

int main() {
  int i,icount,iarray[100][100];
  icount = -1;
#pragma omp parallel private(i)
  {
#pragma omp critical
    { icount++; }
    for (i=0; i<100; i++)
      iarray[icount][i] = 1;
  }
  return 0;
}

Program main
  integer :: i,icount,iarray(100,100)
  icount = 0
!$OMP parallel private(i)
!$OMP critical
    icount = icount + 1
!$OMP end critical
    do i=1,100
      iarray(icount,i) = 1
    end do
!$OMP end parallel
End program

Describe the behavior of the program, with argumentation,

  • as given;
  • if you add a clause private(icount) to the parallel directive;
  • if you add a clause firstprivate(icount).

What do you think of this solution:

#pragma omp parallel private(i) shared(icount)
  {
#pragma omp critical
    { icount++;
      for (i=0; i<100; i++)
        iarray[icount][i] = 1;
    }
  }

!$OMP parallel private(i) shared(icount)
!$OMP critical
    icount = icount+1
    do i=1,100
      iarray(icount,i) = 1
    end do
!$OMP end critical
!$OMP end parallel

29.2.7 Tasks

Fix two things in the following example:

#pragma omp parallel
#pragma omp single
{
  int x,y,z;
#pragma omp task
  x = f();
#pragma omp task
  y = g();
#pragma omp task
  z = h();
  printf("sum=%d\n",x+y+z);
}

  integer :: x,y,z
!$OMP parallel
!$OMP single


!$OMP task
  x = f()
!$OMP end task


!$OMP task
  y = g()
!$OMP end task


!$OMP task
  z = h()
!$OMP end task


  print *,"sum=",x+y+z
!$OMP end single
!$OMP end parallel

29.2.8 Scheduling

Compare these two fragments. Do they compute the same result? What can you say about their efficiency?

#pragma omp parallel
#pragma omp single
  {
    for (i=0; i<N; i++) {
    #pragma omp task
      x[i] = f(i);
    }
    #pragma omp taskwait
  }

#pragma omp parallel
#pragma omp for schedule(dynamic)
  for (i=0; i<N; i++) {
    x[i] = f(i);
  }

How would you make the second loop more efficient? Can you do something similar for the first loop?
