Getting started with OpenMP

Experimental html version of downloadable textbook, see http://www.tacc.utexas.edu/~eijkhout/istc/istc.html
15.1 : The OpenMP model
15.1.1 : Target hardware
15.1.2 : Target software
15.1.3 : About threads and cores
15.1.4 : About thread data
15.2 : Compiling and running an OpenMP program
15.2.1 : Compiling
15.2.2 : Running an OpenMP program
15.3 : Your first OpenMP program
15.3.1 : Directives
15.3.2 : Parallel regions
15.3.3 : An actual OpenMP program!
15.3.4 : Code and execution structure

15 Getting started with OpenMP

This chapter explains the basic concepts of OpenMP, and helps you get started on running your first OpenMP program.

15.1 The OpenMP model


We start by establishing a mental picture of the hardware and software that OpenMP targets.

15.1.1 Target hardware


Modern computers have a multi-layered design. Maybe you have access to a cluster, and maybe you have learned how to use MPI to communicate between cluster nodes. OpenMP, the topic of this chapter, is concerned with a single cluster node or motherboard, and with getting the most out of the parallelism available there.

A node with two sockets and a co-processor

The figure above pictures a typical design of a node: within one enclosure you find two sockets: single processor chips. Your personal laptop or desktop computer will probably have one socket; most supercomputers have nodes with two or four sockets (the picture is of a Stampede node with two sockets; in that picture you also see a co-processor: OpenMP is increasingly targeting those too), although the recent Intel Knights Landing is again a single-socket design.

Structure of an Intel Sandybridge eight-core socket

To see where OpenMP operates we need to dig into the sockets. The figure above shows an Intel Sandybridge socket. You recognize a structure with eight cores: independent processing units that all have access to the same memory. (In the figure of the node above you saw four memory banks attached to each of the two sockets; all of the sixteen cores have access to all that memory.)

To summarize the structure of the architecture that OpenMP targets:

  • A node has up to four sockets;

  • each socket has up to 60 cores;
  • each core is an independent processing unit, with access to all the memory on the node.

15.1.2 Target software


OpenMP is based on two concepts: the use of threads and the fork/join model of parallelism. For now you can think of a thread as a sort of process: the computer executes a sequence of instructions. The fork/join model says that a thread can split itself (`fork') into a number of threads that are identical copies. At some point these copies go away and the original thread is left (`join'), but while the team of threads created by the fork exists, you have parallelism available to you. The part of the execution between fork and join is known as a parallel region.

The figure below gives a simple picture of this: a thread forks into a team of threads, and these threads themselves can fork again.

Thread creation and deletion during parallel execution

The threads that are forked are all copies of the master thread: they have access to all that was computed so far; this is their shared data. Of course, if the threads were completely identical the parallelism would be pointless, so they also have private data, and they can identify themselves: they know their thread number. This allows you to do meaningful parallel computations with threads.

This brings us to the third important concept: that of work sharing constructs. In a team of threads, initially there will be replicated execution; a work sharing construct divides available parallelism over the threads.

So there you have it: OpenMP uses teams of threads, and inside a parallel region the work is distributed over the threads with a work sharing construct. Threads can access shared data, and they have some private data.
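
As a preview of constructs that are treated in detail below and in later chapters, here is a minimal sketch of these ideas (the array size 8 and the squaring are arbitrary choices for illustration): each thread in the team uses its private thread number to write to its own element of a shared array.

#include <stdio.h>
#include <omp.h>

int main() {
  int shared_array[8] = {0};        // shared data: visible to every thread in the team
  #pragma omp parallel
  {
    int tid = omp_get_thread_num(); // private data: each thread has its own copy
    if (tid < 8)                    // guard in case more than 8 threads are running
      shared_array[tid] = tid*tid;  // threads use their number to divide the work
  }
  for (int i=0; i<8; i++)
    printf("%d ",shared_array[i]);
  printf("\n");
  return 0;
}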

An important difference between OpenMP and MPI is that parallelism in OpenMP is dynamically activated by a thread spawning a team of threads. Furthermore, the number of threads used can differ between parallel regions, and threads can create threads recursively. This is known as dynamic mode. By contrast, in an MPI program the number of running processes is (mostly) constant throughout the run, and determined by factors external to the program.
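
As a sketch of this dynamic behaviour (the num_threads clause, discussed in a later chapter, is used here only to request a specific team size), two successive parallel regions in the same program can use teams of different sizes:

int main() {
  #pragma omp parallel num_threads(2)
  {
    // executed by a team of two threads
  }
  #pragma omp parallel num_threads(4)
  {
    // executed by a team of four threads
  }
  return 0;
}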

15.1.3 About threads and cores


OpenMP programming is typically done to take advantage of multicore processors. Thus, to get a good speedup you would typically let your number of threads be equal to the number of cores. However, there is nothing to prevent you from creating more threads: the operating system will use time slicing to let them all be executed. You just don't get a speedup beyond the number of actually available cores.

On some modern processors there are hardware threads, meaning that a core can actually execute more than one thread, with some speedup over a single thread. To use such a processor efficiently you would let the number of OpenMP threads be $2\times$ or $4\times$ the number of cores, depending on the hardware.
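
A minimal sketch of this idea (the factor 2 is an assumption; whether oversubscribing the cores helps depends entirely on the processor):

#include <stdio.h>
#include <omp.h>

int main() {
  int ncores = omp_get_num_procs();  // processing units visible to the OpenMP runtime
  omp_set_num_threads(2*ncores);     // e.g. two threads per core for 2-way hardware threads
  #pragma omp parallel
  {
    // ... parallel work here ...
  }
  printf("Requested %d threads on %d cores\n",2*ncores,ncores);
  return 0;
}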

15.1.4 About thread data


In most programming languages, visibility of data is governed by rules on the scope of variables : a variable is declared in a block, and it is then visible to any statement in that block and blocks with a lexical scope contained in it, but not in surrounding blocks:

int main() {
  // no variable `x' defined here
  {
    int x = 5;
    if (somecondition) { x = 6; }
    printf("x=%d\n",x); // prints 5 or 6
  }
  printf("x=%d\n",x); // compiler error: `x' is not in scope here
}

In C, you can redeclare a variable inside a nested scope:

{
  int x;
  if (something) {
    double x; // same name, different entity
  }
  x = ... // this refers to the integer again
}

Doing so makes the outer variable inaccessible.

Fortran has simpler rules, since it does not have blocks inside blocks.

Locality of variables in threads

In OpenMP the situation is a bit more tricky because of the threads. When a team of threads is created they can all see the data of the master thread. However, they can also create data of their own. This is illustrated in the figure above. We will go into the details later.

15.2 Compiling and running an OpenMP program


15.2.1 Compiling


Your source file or Fortran module needs to contain

#include "omp.h"

in C, and

use omp_lib

or

#include "omp_lib.h"

for Fortran.

OpenMP is handled by extensions to your regular compiler, typically by adding an option to your commandline:

# gcc
gcc -o foo foo.c -fopenmp
# Intel compiler
icc -o foo foo.c -openmp

If you have separate compile and link stages, you need that option in both.

When you use the OpenMP compiler option, a preprocessor (cpp) macro _OPENMP will be defined. Thus, you can have conditional compilation by writing

#ifdef _OPENMP
...
#else
...
#endif
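
For instance, here is a minimal sketch that reports at runtime whether the program was translated with OpenMP support (the _OPENMP macro expands to a date of the form yyyymm indicating the supported version of the standard):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
  printf("Compiled with OpenMP support, version macro: %d\n",_OPENMP);
#else
  printf("Compiled without OpenMP support\n");
#endif
  return 0;
}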

15.2.2 Running an OpenMP program


You run an OpenMP program by invoking it the regular way (for instance ./a.out ), but its behaviour is influenced by some OpenMP environment variables. The most important one is OMP_NUM_THREADS:

export OMP_NUM_THREADS=8

which sets the number of threads that a program will use. See section 26.1 for a list of all environment variables.
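
The value set this way is the default team size for parallel regions; as a minimal sketch, it can be inspected from inside the program with the library routine omp_get_max_threads (covered in more detail later):

#include <stdio.h>
#include <omp.h>

int main() {
  // reports the value of OMP_NUM_THREADS, or the implementation default
  printf("The next parallel region will use up to %d threads\n",
         omp_get_max_threads());
  return 0;
}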

15.3 Your first OpenMP program


In this section you will see just enough of OpenMP to write a first program and to explore its behaviour. For this we need to introduce a couple of OpenMP language constructs. They will all be discussed in much greater detail in later chapters.

15.3.1 Directives


OpenMP is not magic, so you have to tell it when something can be done in parallel. This is mostly done through directives; additional specifications can be done through library calls.

In C/C++ the pragma mechanism is used: annotations for the benefit of the compiler that are otherwise not part of the language. This looks like:

#pragma omp somedirective clause(value,othervalue)
parallel statement;

or

#pragma omp somedirective clause(value,othervalue)
{
parallel statement 1;
parallel statement 2;
}

with

  • the #pragma omp sentinel to indicate that an OpenMP directive is coming;

  • a directive, such as parallel;
  • and possibly clauses with values.
  • After the directive comes either a single statement or a block in curly braces.

Directives in C/C++ are case-sensitive. Directives can be broken over multiple lines by escaping the line end.
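
For instance (the num_threads clause is used here only as an illustration of a clause spilling onto a continuation line):

#pragma omp parallel \
    num_threads(4)
{
  // parallel statements
}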

The sentinel in Fortran looks like a comment:

!$omp directive clause(value)
statements
!$omp end directive

The difference with the C directive is that Fortran cannot have a block, so there is an explicit end-of-directive line.

If you break a directive over more than one line, all but the last line need to have a continuation character, and each line needs to have the sentinel:

!$OMP parallel do &
!$OMP    private(i)

The directives are case-insensitive. In Fortran fixed-form source files, c$omp and *$omp are allowed too.

15.3.2 Parallel regions


The simplest way to create parallelism in OpenMP is to use the parallel pragma. A block preceded by the omp parallel pragma is called a parallel region ; it is executed by a newly created team of threads. This is an instance of the SPMD model: all threads execute the same segment of code.

#pragma omp parallel
{
// this is executed by a team of threads
}

We will go into much more detail about parallel regions later.

15.3.3 An actual OpenMP program!

Exercise 15.1

Write a program that contains the following lines:

printf("There are
#pragma omp parallel
printf("There are
/* !!!! something missing here !!!! */ );

The first print statement tells you the number of available cores in the hardware. Your assignment is to supply the missing function that reports the number of threads used. Compile and run the program. Experiment with the OMP_NUM_THREADS environment variable. What do you notice about the number of lines printed?

Exercise 15.2

Extend the program from exercise 15.1. Make a complete program based on these lines:

int tsum=0;
#pragma omp parallel
tsum += /* the thread number */;
printf("Sum is %d\n",tsum);

Compile and run again. (In fact, run your program a number of times.) Do you see something unexpected? Can you think of an explanation?

15.3.4 Code and execution structure


Here are a couple of important concepts:

Definition

  • [structured block] An OpenMP directive is followed by a structured block; in C this is a single statement, a compound statement, or a block in braces; in Fortran it is delimited by the directive and its matching `end' directive.

    A structured block can not be jumped into, so it can not start with a labeled statement, or contain a jump statement leaving the block.

  • [construct] An OpenMP construct is the section of code starting with a directive and spanning the following structured block, plus in Fortran the end-directive. This is a lexical concept: it contains the statements directly enclosed, and not any subroutines called from them.
  • [region of code] A region of code is defined as all statements that are dynamically encountered while executing the code of an OpenMP construct. This is a dynamic concept: unlike a `construct', it does include any subroutines that are called from the code in the structured block.
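
To make the difference between a construct and a region concrete, here is a minimal sketch (the function name report is arbitrary): the body of report is executed inside the parallel region, but it is not part of the parallel construct, since it is not lexically enclosed by it.

#include <stdio.h>
#include <omp.h>

// This function body belongs to the *region* of the parallel construct
// in main (it is executed from inside it), but not to the *construct*,
// because it is not lexically enclosed by the directive's structured block.
void report() {
  printf("hello from thread %d\n",omp_get_thread_num());
}

int main() {
  #pragma omp parallel   // construct: the directive plus the structured block below
  {
    report();
  }
  return 0;
}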
