Getting started with OpenMP

16.1 : The OpenMP model
16.1.1 : Target hardware
16.1.2 : Target software
16.1.3 : About threads and cores
16.1.4 : About thread data
16.2 : Compiling and running an OpenMP program
16.2.1 : Compiling
16.2.2 : Running an OpenMP program
16.3 : Your first OpenMP program
16.3.1 : Directives
16.3.2 : Parallel regions
16.3.3 : An actual OpenMP program!
16.3.4 : Code and execution structure
16.4 : Thread data
16.5 : Creating parallelism

16 Getting started with OpenMP

This chapter explains the basic concepts of OpenMP, and helps you get started on running your first OpenMP program.

16.1 The OpenMP model

We start by establishing a mental picture of the hardware and software that OpenMP targets.

16.1.1 Target hardware

Modern computers have a multi-layered design. Maybe you have access to a cluster, and maybe you have learned how to use MPI to communicate between cluster nodes. OpenMP, the topic of this chapter, is concerned with a single cluster node or motherboard, and with getting the most out of the parallelism available there.

FIGURE 16.1: A node with two sockets and a co-processor

Figure 16.1 pictures a typical design of a node: within one enclosure you find two sockets : single processor chips. Your personal laptop or desktop computer will probably have one socket; most supercomputers have nodes with two or four sockets (the picture is of a Stampede node with two sockets; it also shows a co-processor, and OpenMP is increasingly targeting those too), although the recent Intel Knights Landing is again a single-socket design.

FIGURE 16.2: Structure of an Intel Sandybridge eight-core socket

To see where OpenMP operates we need to dig into the sockets. Figure 16.2 shows a picture of an Intel Sandybridge socket. You recognize a structure with eight cores : independent processing units that all have access to the same memory. (In figure 16.1 you saw four memory banks attached to each of the two sockets; all of the sixteen cores have access to all that memory.)

To summarize the structure of the architecture that OpenMP targets:

  • A node has up to four sockets;
  • each socket has up to 60 cores;
  • each core is an independent processing unit, with access to all the memory on the node.

16.1.2 Target software

OpenMP is based on two concepts: the use of threads and the fork/join model of parallelism. For now you can think of a thread as a sort of process: the computer executes a sequence of instructions. The fork/join model says that a thread can split itself (`fork') into a number of threads that are identical copies. At some point these copies go away and the original thread is left (`join'), but while the team of threads created by the fork exists, you have parallelism available to you. The part of the execution between fork and join is known as a parallel region .

Figure 16.3 gives a simple picture of this: a thread forks into a team of threads, and these threads themselves can fork again.

FIGURE 16.3: Thread creation and deletion during parallel execution

The threads that are forked are all copies of the master thread : they have access to all that was computed so far; this is their shared data . Of course, if the threads were completely identical the parallelism would be pointless, so they also have private data, and they can identify themselves: they know their thread number. This allows you to do meaningful parallel computations with threads.

This brings us to the third important concept: that of work sharing constructs. In a team of threads, initially there will be replicated execution; a work sharing construct divides available parallelism over the threads.

So there you have it: OpenMP uses teams of threads, and inside a parallel region the work is distributed over the threads with a work sharing construct. Threads can access shared data, and they have some private data.

An important difference between OpenMP and MPI is that parallelism in OpenMP is dynamically activated by a thread spawning a team of threads. Furthermore, the number of threads used can differ between parallel regions, and threads can create threads recursively. This is known as dynamic mode . By contrast, in an MPI program the number of running processes is (mostly) constant throughout the run, and determined by factors external to the program.
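
For instance, here is a minimal sketch (hypothetical; it uses the num_threads clause, which is only introduced later) of how the thread count can differ per parallel region, and how regions can be nested:

#pragma omp parallel num_threads(4)
{
  // a team of four threads executes this region
}
#pragma omp parallel num_threads(2)
{
  // a team of two threads; each of them can fork again:
#pragma omp parallel num_threads(3)
  {
    // up to six threads in total (if nested parallelism is enabled)
  }
}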

16.1.3 About threads and cores

OpenMP programming is typically done to take advantage of multicore processors. Thus, to get a good speedup you would typically let your number of threads be equal to the number of cores. However, there is nothing to prevent you from creating more threads: the operating system will use time slicing to let them all be executed. You just don't get a speedup beyond the number of actually available cores.

On some modern processors there are hardware threads , meaning that a core can actually let more than one thread be executed, with some speedup over the single thread. To use such a processor efficiently you would let the number of OpenMP threads be $2\times$ or $4\times$ the number of cores, depending on the hardware.

16.1.4 About thread data

In most programming languages, visibility of data is governed by rules on the scope of variables : a variable is declared in a block, and it is then visible to any statement in that block and blocks with a lexical scope contained in it, but not in surrounding blocks:

int main() {
  // no variable `x' defined here
  {
    int x = 5;
    if (somecondition) { x = 6; }
    printf("x=%d\n",x); // prints 5 or 6
  }
  printf("x=%d\n",x); // compilation error: `x' undeclared
}

In C, you can redeclare a variable inside a nested scope:

{
  int x;
  if (something) {
    double x; // same name, different entity
  }
  x = ... // this refers to the integer again
}

Doing so makes the outer variable inaccessible.

Fortran has simpler rules, since it does not have blocks inside blocks.

In OpenMP the situation is a bit more tricky because of the threads. When a team of threads is created they can all see the data of the master thread. However, they can also create data of their own. We will go into the details later.

16.2 Compiling and running an OpenMP program

16.2.1 Compiling

Your C/C++ source file or Fortran module needs to contain

#include "omp.h"

in C, and

use omp_lib

or

#include "omp_lib.h"

for Fortran.

OpenMP is handled by extensions to your regular compiler, typically by adding an option to your commandline:

# gcc
gcc -o foo foo.c -fopenmp
# Intel compiler
icc -o foo foo.c -qopenmp

If you have separate compile and link stages, you need that option in both.

When you use the OpenMP compiler option, the preprocessor macro _OPENMP will be defined. Thus, you can have conditional compilation by writing

#ifdef _OPENMP
   ...
#else
   ...
#endif
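
For instance, here is a minimal sketch (not from the text) of a program that compiles both with and without the OpenMP option; the _OPENMP macro expands to the specification release date in the form yyyymm:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
  // compiled with the OpenMP compiler option
  printf("Compiled with OpenMP, version macro %d\n",_OPENMP);
#else
  // plain compilation: no OpenMP, the program runs sequentially
  printf("Compiled without OpenMP: running sequentially\n");
#endif
  return 0;
}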

16.2.2 Running an OpenMP program

You run an OpenMP program by invoking it the regular way (for instance ./a.out ), but its behaviour is influenced by some OpenMP environment variables . The most important one is OMP_NUM_THREADS :

export OMP_NUM_THREADS=8

which sets the number of threads that a program will use. See section  28.1 for a list of all environment variables.
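
You can also set the variable for a single run; for instance, assuming your executable is called ./a.out :

OMP_NUM_THREADS=4 ./a.out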

16.3 Your first OpenMP program

In this section you will see just enough of OpenMP to write a first program and to explore its behaviour. For this we need to introduce a couple of OpenMP language constructs. They will all be discussed in much greater detail in later chapters.

16.3.1 Directives

OpenMP is not magic, so you have to tell it when something can be done in parallel. This is mostly done through directives ; additional specifications can be done through library calls.

In C/C++ the pragma mechanism is used: annotations for the benefit of the compiler that are otherwise not part of the language. This looks like:

#pragma omp somedirective clause(value,othervalue)
  parallel statement;

or

#pragma omp somedirective clause(value,othervalue)
 {
  parallel statement 1;
  parallel statement 2;
 }

with

  • the #pragma omp sentinel to indicate that an OpenMP directive is coming;
  • a directive, such as parallel ;
  • and possibly clauses with values.
  • After the directive comes either a single statement or a block in curly braces .

Directives in C/C++ are case-sensitive. Directives can be broken over multiple lines by escaping the line end.

The sentinel in Fortran looks like a comment:

!$omp directive clause(value)
  statements
!$omp end directive

The difference with the C directive is that Fortran can not have a block, so there is an explicit end-of-directive line.

If you break a directive over more than one line, all but the last line need to have a continuation character, and each line needs to have the sentinel:

!$OMP parallel do &
!$OMP    private(x), shared(y)

The directives are case-insensitive. In Fortran fixed-form source files, c$omp and *$omp are allowed too.

16.3.2 Parallel regions

The simplest way to create parallelism in OpenMP is to use the parallel pragma. A block preceded by the omp parallel pragma is called a parallel region ; it is executed by a newly created team of threads. This is an instance of the SPMD model: all threads execute (redundantly) the same segment of code.

#pragma omp parallel
{
  // this is executed by a team of threads
}

Exercise

Write a `hello world' program, where the print statement is in a parallel region. Compile and run.

Run your program with different values of the environment variable OMP_NUM_THREADS . If you know how many cores your machine has, can you set the value higher?

We will go into much more detail in section  16.5 .

16.3.3 An actual OpenMP program!

Let's start exploring how OpenMP handles parallelism, using the following functions:

  • omp_get_num_threads reports how many threads are currently active,
  • omp_get_thread_num reports the number of the thread that makes the call, and
  • omp_get_num_procs reports the number of available cores.

Exercise

Take the hello world program of exercise  16.3.2 and insert the above functions, before, in, and after the parallel region. What are your observations?

Exercise

Extend the program from exercise  16.3.3 . Make a complete program based on these lines:

int tsum=0;
#pragma omp parallel
  tsum += /* the thread number */
printf("Sum is %d\n",tsum);

Compile and run again. (In fact, run your program a number of times.) Do you see something unexpected? Can you think of an explanation?

If the above puzzles you, read about race conditions in section  9.3.8 .

16.3.4 Code and execution structure

Here are a couple of important concepts:

  • An OpenMP directive is followed by a structured block ; in C this is a single statement, a compound statement, or a block in braces; in Fortran it is delimited by the directive and its matching ` end ' directive.

    A structured block cannot be jumped into, so it cannot start with a labeled statement, or contain a jump statement that leaves the block.

  • An OpenMP construct is the section of code starting with a directive and spanning the following structured block, plus in Fortran the end-directive. This is a lexical concept: it contains the statements directly enclosed, and not any subroutines called from them.
  • A region of code is defined as all statements that are dynamically encountered while executing the code of an OpenMP construct. This is a dynamic concept: unlike a `construct', it does include any subroutines that are called from the code in the structured block.
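
To illustrate the last two points, here is a small sketch (the function work is hypothetical): the parallel construct consists of the directive and the block following it, while the corresponding region additionally contains the statements of work , since that function is called from inside the block.

void work() {
  // not lexically part of the parallel construct below,
  // but executed as part of its region when called from it
}

int main() {
#pragma omp parallel
  {          // construct: the directive plus this structured block
    work();  // dynamically encountered: part of the region
  }
  return 0;
}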

16.4 Thread data

In most programming languages, visibility of data is governed by rules on the scope of variables : a variable is declared in a block, and it is then visible to any statement in that block and blocks with a lexical scope contained in it, but not in surrounding blocks:

int main() {
  // no variable `x' defined here
  {
    int x = 5;
    if (somecondition) { x = 6; }
    printf("x=%d\n",x); // prints 5 or 6
  }
  printf("x=%d\n",x); // compilation error: `x' undeclared
}

Fortran has simpler rules, since it does not have blocks inside blocks.

OpenMP has similar rules concerning data in parallel regions and other OpenMP constructs. First of all, data is visible in enclosed scopes:

int main() {
  int x;
#pragma omp parallel
  {
     // you can use and set `x' here
  }
  printf("x=%d\n",x); // value depends on what
                      // happened in the parallel region
}

In C, you can redeclare a variable inside a nested scope:

{
  int x;
  if (something) {
    double x; // same name, different entity
  }
  x = ... // this refers to the integer again
}

Doing so makes the outer variable inaccessible.

OpenMP has a similar mechanism:

{
  int x;
#pragma omp parallel
  {
    double x;
  }
}

There is an important difference: each thread in the team gets its own instance of the enclosed variable.

FIGURE 16.4: Locality of variables in threads

This is illustrated in figure  16.4 .

In addition to such scoped variables, which live on a stack , there are variables on the heap , typically created by a call to malloc (in C) or new (in C++). Rules for them are more complicated.
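
For instance, in the following sketch (hypothetical) the array allocated with malloc before the parallel region is visible to all threads: the heap storage itself is shared, and all threads access the same allocated memory.

#include <stdlib.h>

int main() {
  double *data = (double*) malloc(100*sizeof(double));
#pragma omp parallel
  {
    // every thread sees the same 100 doubles through `data';
    // simultaneous updates of the same element would be a race
  }
  free(data);
  return 0;
}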

Summarizing the above, there are

  • shared variables , where each thread refers to the same data item, and
  • private variables , where each thread has its own instance.

In addition to using scoping, OpenMP also uses options on the directives to control whether data is private or shared.
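
For instance, a minimal sketch (anticipating the clauses of that later section):

int x=5, y=0;
#pragma omp parallel private(x) shared(y)
{
  x = 1; // each thread gets its own, uninitialized, copy of `x'
  y = 2; // every thread refers to the same `y'
}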

Many of the difficulties of parallel programming with OpenMP stem from the use of shared variables. For instance, if two threads update a shared variable, there is no guarantee on the order of the updates.

We will discuss all this in detail in section  19.3 .

16.5 Creating parallelism

The fork/join model of OpenMP means that you need some way of indicating where an activity can be forked for independent execution. There are two ways of doing this:

  1. You can declare a parallel region and split one thread into a whole team of threads. We will discuss this next in section  16.5 . The division of the work over the threads is controlled by work sharing constructs (section  18.8 ).
  2. Alternatively, you can use tasks and indicate one parallel activity at a time. You will see this in section  22.4 .

Note that OpenMP only indicates how much parallelism is present; whether independent activities are in fact executed in parallel is a runtime decision. The factors influencing this are discussed in section  16.5 .

Declaring a parallel region tells OpenMP that a team of threads can be created. The actual size of the team depends on various factors (see section  28.1 for variables and functions mentioned in this section).

  • The environment variable OMP_NUM_THREADS limits the number of threads that can be created.
  • You can also set this limit dynamically with the library routine omp_set_num_threads (see the sketch after this list); this routine takes precedence over the environment variable if both are used.
  • A limit on the number of threads can also be set as a clause on a parallel region.
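
Here is a minimal sketch (not from the text) of the last two mechanisms; the num_threads clause takes precedence over omp_set_num_threads , which in turn overrides OMP_NUM_THREADS :

#include <omp.h>

int main() {
  omp_set_num_threads(4);             // overrides OMP_NUM_THREADS
#pragma omp parallel
  { /* up to four threads */ }

#pragma omp parallel num_threads(2)   // the clause overrides the routine
  { /* up to two threads */ }
  return 0;
}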

If you specify a greater amount of parallelism than the hardware supports, the runtime system will probably ignore your specification and choose a lower value. To ask how much parallelism is actually used in your parallel region, use omp_get_num_threads . To query these hardware limits, use omp_get_num_procs . You can query the maximum number of threads with omp_get_max_threads . This equals the value of OMP_NUM_THREADS , not the number of actually active threads in a parallel region.

The following program, proccount.c, queries these numbers both sequentially and inside (nested) parallel regions:

// proccount.c
#include <stdio.h>
#include <omp.h>

void nested_report() {
#pragma omp parallel
#pragma omp master
  printf("Parallel  : count %2d cores and %2d threads out of max %2d\n",
	 omp_get_num_procs(),omp_get_num_threads(),omp_get_max_threads());
}

int main(int argc,char **argv) {

  printf("---------------- Parallelism report ----------------\n");

  printf("Sequential: count %2d cores and %2d threads out of max %2d\n",
	 omp_get_num_procs(),omp_get_num_threads(),omp_get_max_threads());
#pragma omp parallel
#pragma omp master
  printf("Parallel  : count %2d cores and %2d threads out of max %2d\n",
	 omp_get_num_procs(),omp_get_num_threads(),omp_get_max_threads());

#pragma omp parallel
#pragma omp master
  nested_report();

  return 0;
}
Here is the output of a run on a four-core machine, for increasing values of OMP_NUM_THREADS:

for t in 1 2 4 8 16 ; do OMP_NUM_THREADS=$t ./proccount ; done
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  1
Parallel  : count  4 cores and  1 threads out of max  1
Parallel  : count  4 cores and  1 threads out of max  1
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  2
Parallel  : count  4 cores and  2 threads out of max  2
Parallel  : count  4 cores and  1 threads out of max  2
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  4
Parallel  : count  4 cores and  4 threads out of max  4
Parallel  : count  4 cores and  1 threads out of max  4
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max  8
Parallel  : count  4 cores and  8 threads out of max  8
Parallel  : count  4 cores and  1 threads out of max  8
---------------- Parallelism report ----------------
Sequential: count  4 cores and  1 threads out of max 16
Parallel  : count  4 cores and 16 threads out of max 16
Parallel  : count  4 cores and  1 threads out of max 16


Another limit on the number of threads is imposed when you use nested parallel regions. This can arise if you have a parallel region in a subprogram which is sometimes called sequentially, sometimes in parallel. The variable OMP_NESTED controls whether the inner region will create a team of more than one thread.
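
A minimal sketch of that situation (hypothetical; setting OMP_NESTED=true , or calling omp_set_nested(1) , enables the nesting):

void sub() {
#pragma omp parallel  // nested region when sub() is called from a parallel region
  {
    // with nesting disabled this team has only one thread
  }
}

int main() {
  sub();              // called sequentially: the region in sub() forks a team
#pragma omp parallel
  sub();              // called in parallel: each thread encounters a nested region
  return 0;
}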
