MPI leftover topics

14.1 : Contextual information, attributes, etc.
14.1.1 : Info objects
14.1.1.1 : Environment information
14.1.1.2 : Communicator and window information
14.1.1.3 : File information
14.1.2 : Attributes
14.1.2.1 : Create new keyval attributes
14.1.3 : Processor name
14.1.4 : Version information
14.2 : Error handling
14.2.1 : Error codes
14.2.2 : Error handling
14.2.2.1 : Abort
14.2.2.2 : Return
14.2.2.3 : Error printing
14.2.3 : Defining your own MPI errors
14.3 : Fortran issues
14.3.1 : Assumed-shape arrays
14.4 : Progress
14.5 : Fault tolerance
14.6 : Performance, tools, and profiling
14.6.1 : Timing
14.6.1.1 : Global timing
14.6.1.2 : Local timing
14.6.2 : Simple profiling
14.6.3 : Tools interface
14.6.4 : Programming for performance
14.6.5 : MPIR
14.7 : Determinism
14.8 : Subtleties with processor synchronization
14.9 : Shell interaction
14.9.1 : Standard input
14.9.2 : Standard out and error
14.9.3 : Process status
14.9.4 : Multiple program start
14.10 : The origin of one-sided communication in ShMem
14.11 : Leftover topics
14.11.1 : MPI constants
14.11.2 : Cancelling messages
14.11.3 : Constants
14.12 : Literature

14 MPI leftover topics

14.1 Contextual information, attributes, etc.

14.1.1 Info objects

Certain MPI routines can accept MPI_Info objects. These contain key-value pairs that can offer system-dependent or implementation-dependent information.

Create an info object with MPI_Info_create :

  MPI_INFO_CREATE(info)
  OUT info : info object created (handle)

  C:
  int MPI_Info_create(MPI_Info *info)

  Fortran legacy:
  MPI_INFO_CREATE(INFO, IERROR)
  INTEGER INFO, IERROR

and delete it with MPI_Info_free :

  MPI_INFO_FREE(info)
  INOUT info : info object (handle)

  C:
  int MPI_Info_free(MPI_Info *info)

  Fortran legacy:
  MPI_INFO_FREE(INFO, IERROR)
  INTEGER INFO, IERROR

Keys are then set with MPI_Info_set :

  MPI_INFO_SET(info, key, value)
  INOUT info : info object (handle)
  IN key : key (string)
  IN value : value (string)

  int MPI_Info_set(MPI_Info info, char *key, char *value)

  MPI_INFO_SET(INFO, KEY, VALUE, IERROR)
  INTEGER INFO, IERROR
  CHARACTER*(*) KEY, VALUE

and they can be queried with MPI_Info_get :

  MPI_INFO_GET(info, key, valuelen, value, flag)
  IN info : info object (handle)
  IN key : key (string)
  IN valuelen : length of value arg (integer)
  OUT value : value (string)
  OUT flag : true if key defined, false if not (boolean)

  int MPI_Info_get(MPI_Info info, char *key, int valuelen, char *value,
      int *flag)

  MPI_INFO_GET(INFO, KEY, VALUELEN, VALUE, FLAG, IERROR)
  INTEGER INFO, VALUELEN, IERROR
  CHARACTER*(*) KEY, VALUE
  LOGICAL FLAG

Note that the output of the `get' routine is not allocated: it is a buffer that is passed. The maximum length of a key is given by the parameter MPI_MAX_INFO_KEY . You can delete a key from an info object with MPI_Info_delete :

  MPI_INFO_DELETE(info, key)
  INOUT info : info object (handle)
  IN key : key (string)

  int MPI_Info_delete(MPI_Info info, char *key)

  MPI_INFO_DELETE(INFO, KEY, IERROR)
  INTEGER INFO, IERROR
  CHARACTER*(*) KEY

There is a straightforward duplication of info objects with MPI_Info_dup :

  MPI_INFO_DUP(info, newinfo)
  IN info : info object (handle)
  OUT newinfo : info object (handle)

  int MPI_Info_dup(MPI_Info info, MPI_Info *newinfo)

  MPI_INFO_DUP(INFO, NEWINFO, IERROR)
  INTEGER INFO, NEWINFO, IERROR

You can also query the number of keys in an info object with MPI_Info_get_nkeys :

  MPI_INFO_GET_NKEYS(info, nkeys)
  IN info : info object (handle)
  OUT nkeys : number of defined keys (integer)

  int MPI_Info_get_nkeys(MPI_Info info, int *nkeys)

  MPI_INFO_GET_NKEYS(INFO, NKEYS, IERROR)
  INTEGER INFO, NKEYS, IERROR

after which the keys can be queried in succession with MPI_Info_get_nthkey .

Info objects that are marked as `In' or `Inout' arguments are parsed before that routine returns. This means that in nonblocking routines they can be freed immediately, unlike, for instance, send buffers.

MPI 4 Standard only

The routines MPI_Info_get and MPI_Info_get_valuelen are not robust with respect to the C language null terminator . Therefore, they are deprecated, and should be replaced with MPI_Info_get_string , which always returns a null-terminated string.

int MPI_Info_get_string
   (MPI_Info info, const char *key,
    int *buflen, char *value, int *flag)
End of MPI 4 note
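
As a minimal sketch of how these calls fit together; the key `foo' and its value are purely illustrative, not predefined keys:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"foo","bar");

char value[MPI_MAX_INFO_VAL+1]; int flag;
// pre-4.0 style query: the caller supplies the buffer and its length
MPI_Info_get(info,"foo",MPI_MAX_INFO_VAL,value,&flag);
if (flag) printf("key foo has value <<%s>>\n",value);

MPI_Info_free(&info);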

14.1.1.1 Environment information

The object MPI_INFO_ENV is predefined, containing:

  • command Name of program executed.
  • argv Space separated arguments to command.
  • maxprocs Maximum number of MPI processes to start.
  • soft Allowed values for number of processors.
  • host Hostname.
  • arch Architecture name.
  • wdir Working directory of the MPI process.
  • file Value is the name of a file in which additional information is specified.
  • thread_level Requested level of thread support, if requested before the program started execution.

Note that these are the requested values; the running program may, for instance, have been granted a lower level of thread support than requested.
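
For instance, a sketch of querying the name under which the program was started; procid is assumed to hold the process rank as in the other examples in this book:

char cmd[MPI_MAX_INFO_VAL+1]; int flag;
MPI_Info_get(MPI_INFO_ENV,"command",MPI_MAX_INFO_VAL,cmd,&flag);
if (flag && procid==0)
  printf("This program was started as <<%s>>\n",cmd);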

14.1.1.2 Communicator and window information

MPI has a built-in mechanism for attaching information to communicators and windows through the calls MPI_Comm_get_info , MPI_Comm_set_info , MPI_Win_get_info , MPI_Win_set_info .

Copying a communicator with MPI_Comm_dup does not cause the info to be copied; to propagate information to the copy there is MPI_Comm_dup_with_info .
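
A sketch of attaching a hint to a communicator; the key mpi_assert_no_any_source is one of the assertion keys introduced in MPI-4, and an implementation is free to ignore it:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"mpi_assert_no_any_source","true");
MPI_Comm_set_info(comm,info);
MPI_Info_free(&info);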

14.1.1.3 File information

An MPI_Info object can be passed to the following file routines:

  • MPI_File_open
  • MPI_File_set_view
  • MPI_File_set_info ; collective.

The following keys are defined in the MPI-2 standard:

  • access_style : A comma separated list of one or more of: read_once , write_once , read_mostly , write_mostly , sequential , reverse_sequential , random
  • collective_buffering : true or false; enables or disables buffering on collective I/O operations
  • cb_block_size : integer block size for collective buffering, in bytes
  • cb_buffer_size : integer buffer size for collective buffering, in bytes
  • cb_nodes : integer number of MPI processes used in collective buffering
  • chunked : a comma separated list of integers describing the dimensions of a multidimensional array to be accessed using subarrays, starting with the most significant dimension (1st in C, last in Fortran)
  • chunked_item : a comma separated list specifying the size of each array entry, in bytes
  • chunked_size : a comma separated list specifying the size of the subarrays used in chunking
  • file_perm : UNIX file permissions at time of creation, in octal
  • io_node_list : a comma separated list of I/O nodes to use

    MPI 4 Standard only

  • mpi_minimum_memory_alignment : alignment of allocated memory.

    End of MPI 4 note
  • nb_proc : integer number of processes expected to access a file simultaneously
  • num_io_nodes : integer number of I/O nodes to use
  • striping_factor : integer number of I/O nodes/devices a file should be striped across
  • striping_unit : integer stripe size, in bytes

Additionally, file system-specific keys can exist.
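
For example, a sketch of passing striping hints when creating a file; whether these hints have any effect depends on the underlying file system:

MPI_Info info; MPI_File fh;
MPI_Info_create(&info);
MPI_Info_set(info,"striping_factor","4");     // spread over 4 I/O devices
MPI_Info_set(info,"striping_unit","1048576"); // 1MB stripes
MPI_File_open(comm,"outputfile",
              MPI_MODE_CREATE|MPI_MODE_WRONLY,info,&fh);
MPI_Info_free(&info);
/* ... write to the file ... */
MPI_File_close(&fh);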

14.1.2 Attributes

Some runtime (or installation-dependent) values are available as attributes through

int MPI_Comm_get_attr(
    MPI_Comm comm, int keyval, void *attribute_val, int *flag)

Python:
MPI.Comm.Get_attr(self, int keyval)

return codes:
MPI_SUCCESS : No error; MPI routine completed successfully.
MPI_ERR_COMM : Invalid communicator.
MPI_ERR_KEYVAL : Invalid keyval
MPI_Comm_get_attr . (The older MPI-1 routine MPI_Attr_get is deprecated.) The flag parameter has two functions:

  • it returns whether the attribute was found;
  • if on entry it was set to false, the value parameter is ignored and the routine only tests whether the key is present.

The return value parameter is subtle: while it is declared void* , what you actually pass is the address of a void* pointer, as the following example shows.

// tags.c
int tag_upperbound;
void *v; int flag=1;
ierr = MPI_Comm_get_attr(comm,MPI_TAG_UB,&v,&flag);
tag_upperbound = *(int*)v;

## tags.py
tag_upperbound = comm.Get_attr(MPI.TAG_UB)
if procid==0:
    print("Determined tag upperbound: {}".format(tag_upperbound))

Attributes are:

  • MPI_TAG_UB Upper bound for tag value . (The lower bound is zero.) Note that MPI_TAG_UB is the key, not the actual upper bound! This value has to be at least 32767.
  • MPI_HOST Rank of the host process, if such exists; MPI_PROC_NULL otherwise.
  • MPI_IO rank of a node that has regular I/O facilities (possibly myrank). Nodes in the same communicator may return different values for this parameter.
  • MPI_WTIME_IS_GLOBAL Boolean variable that indicates whether clocks are synchronized.

Also:

  • MPI_UNIVERSE_SIZE : the total number of processes that can be created. This can be more than the size of MPI_COMM_WORLD if the host list is larger than the number of initially started processes. See section  8.1 .

    Python: mpi4py.MPI.UNIVERSE_SIZE .

  • MPI_APPNUM : if MPI is used in MPMD mode, or if MPI_Comm_spawn_multiple is used, this attribute reports which of the programs (counting from zero) the current process belongs to.
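
As another illustration of the attribute mechanism, a sketch of querying MPI_UNIVERSE_SIZE ; not every installation defines this attribute, hence the test on the flag:

int *universe_size_p,flag;
MPI_Comm_get_attr(MPI_COMM_WORLD,MPI_UNIVERSE_SIZE,
                  &universe_size_p,&flag);
if (flag)
  printf("Up to %d processes can be created\n",*universe_size_p);
else
  printf("MPI_UNIVERSE_SIZE is not set\n");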

14.1.2.1 Create new keyval attributes

Create a key value with MPI_Comm_create_keyval , MPI_Type_create_keyval , or MPI_Win_create_keyval . Use this key to set new attributes with MPI_Comm_set_attr , MPI_Type_set_attr , MPI_Win_set_attr ; delete an attribute with MPI_Comm_delete_attr , MPI_Type_delete_attr , MPI_Win_delete_attr .

Creating the key value takes copy and delete callback functions, of types MPI_Comm_copy_attr_function and MPI_Comm_delete_attr_function in the communicator case; the copy callback is invoked when a communicator is duplicated; section  7.2 . Free the key value itself with MPI_Comm_free_keyval .
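
A sketch of the full life cycle of a user-defined attribute; here the predefined null callbacks are used, so the attribute is simply not propagated when the communicator is duplicated:

int my_keyval;
MPI_Comm_create_keyval
  (MPI_COMM_NULL_COPY_FN,MPI_COMM_NULL_DELETE_FN,&my_keyval,NULL);

static int my_data = 42;                    // value to attach
MPI_Comm_set_attr(comm,my_keyval,&my_data);

void *retrieved; int flag;
MPI_Comm_get_attr(comm,my_keyval,&retrieved,&flag);
if (flag) printf("attribute value: %d\n",*(int*)retrieved);

MPI_Comm_delete_attr(comm,my_keyval);
MPI_Comm_free_keyval(&my_keyval);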

14.1.3 Processor name

You can query the hostname of a processor with MPI_Get_processor_name . This name need not be unique between different processor ranks.

You have to pass in the character storage: the character array must be at least MPI_MAX_PROCESSOR_NAME characters long. The actual length of the name is returned in the resultlen parameter.
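
A sketch of its use; procid is assumed to hold the process rank:

char procname[MPI_MAX_PROCESSOR_NAME]; int resultlen;
MPI_Get_processor_name(procname,&resultlen);
printf("Rank %d runs on host <<%s>> (name length %d)\n",
       procid,procname,resultlen);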

14.1.4 Version information

For runtime determination, the MPI version is available through the two parameters MPI_VERSION and MPI_SUBVERSION , or the function

Semantics:
  MPI_GET_VERSION( version, subversion )
  OUT version version number (integer)
  OUT subversion subversion number (integer)

C:
  int MPI_Get_version(int *version, int *subversion)

Fortran:
  MPI_GET_VERSION(VERSION, SUBVERSION, IERROR)
  INTEGER VERSION, SUBVERSION, IERROR

MPI_Get_version .
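
A sketch comparing the compile-time and runtime values:

int version,subversion;
MPI_Get_version(&version,&subversion);
printf("Compiled against MPI %d.%d, running MPI %d.%d\n",
       MPI_VERSION,MPI_SUBVERSION,version,subversion);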

14.2 Error handling

Errors in normal programs can be tricky to deal with; errors in parallel programs can be even harder. This is because in addition to everything that can go wrong with a single executable (floating point errors, memory violation) you now get errors that come from faulty interaction between multiple executables.

A few examples of what can go wrong:

  • MPI errors: an MPI routine can exit prematurely for various reasons, such as receiving much more data than its buffer can accommodate. Such errors, as well as the more common type mentioned above, typically cause your whole execution to terminate. That is, if one incarnation of your executable exits, the MPI runtime will kill all others.
  • Deadlocks and other hanging executions: there are various scenarios where your processes individually do not exit, but are all waiting for each other. This can happen if two processes are both waiting for a message from each other, and this can be helped by using nonblocking calls. In another scenario, through an error in program logic, one process will be waiting for more messages (including nonblocking ones) than are sent to it.

14.2.1 Error codes

There are a bunch of error codes. These are all positive int values, while MPI_SUCCESS is zero. The maximum value of any built-in error code is MPI_ERR_LASTCODE . User-defined error codes are all larger than this.

  • MPI_ERR_ARG : an argument was invalid that is not covered by another error code.
  • MPI_ERR_BUFFER The buffer pointer is invalid; this typically means that you have supplied a null pointer.
  • MPI_ERR_COMM : invalid communicator. A common error is to use a null communicator in a call.
  • MPI_ERR_INTERN An internal error in MPI has been detected.
  • MPI_ERR_IN_STATUS A function returning an array of statuses has at least one status where the MPI_ERROR field is set to something other than MPI_SUCCESS . See section  4.3.2.3 .
  • MPI_ERR_INFO : invalid info object.
  • MPI_ERR_NO_MEM is returned by MPI_Alloc_mem if memory is exhausted.
  • MPI_ERR_OTHER : an error occurred; use MPI_Error_string to retrieve further information about this error; see section  14.2.2.3 .
  • MPI_ERR_PORT : invalid port; this applies to MPI_Comm_connect and such.

    MPI 4 Standard only

  • MPI_ERR_PROC_ABORTED is returned if a process tries to communicate with a process that has aborted.

    End of MPI 4 note
  • MPI_ERR_SERVICE : invalid service in MPI_Unpublish_name ; section  8.2.3 .

14.2.2 Error handling

The MPI library has a general mechanism for dealing with errors that it detects: one can specify an error handler, specific to MPI objects.

  • Most commonly, an error handler is associated with a communicator: MPI_Comm_set_errhandler (and likewise it can be retrieved with MPI_Comm_get_errhandler );
  • other possibilities are MPI_File_set_errhandler , MPI_File_call_errhandler ,

    MPI 4 Standard only

    MPI_Session_set_errhandler , MPI_Session_call_errhandler ,

    End of MPI 4 note
    MPI_Win_set_errhandler , MPI_Win_call_errhandler .

Remark

The routine MPI_Errhandler_set is deprecated, replaced by its MPI-2 variant MPI_Comm_set_errhandler .

Some handlers of type MPI_Errhandler are predefined, but you can define your own with MPI_Errhandler_create , to be freed later with MPI_Errhandler_free .

14.2.2.1 Abort

The default behaviour, where the full run is aborted, is equivalent to your code containing the following call:

MPI_Comm_set_errhandler(MPI_COMM_WORLD,MPI_ERRORS_ARE_FATAL);

The handler MPI_ERRORS_ARE_FATAL , even though it is associated with a communicator, causes the whole application to abort.

MPI 4 Standard only

The handler MPI_ERRORS_ABORT (introduced in MPI-4) aborts on the processes in the communicator for which it is specified.

End of MPI 4 note

14.2.2.2 Return

Another simple possibility is to specify

MPI_Comm_set_errhandler(MPI_COMM_WORLD,MPI_ERRORS_RETURN);

which causes the error code to be returned to the user. This gives you the opportunity to write code that handles the error return value.

14.2.2.3 Error printing

If the MPI_Errhandler value MPI_ERRORS_RETURN is used, you can compare the return code to MPI_SUCCESS and print out debugging information:

int ierr;
ierr = MPI_Something();
if (ierr!=MPI_SUCCESS) {
    // print out information about what your program was doing
    MPI_Abort(MPI_COMM_WORLD,1);
}

For instance,

Fatal error in MPI_Waitall:
See the MPI_ERROR field in MPI_Status for the error code

You could then retrieve the MPI_ERROR field of the status, and print out an error string of maximal size MPI_MAX_ERROR_STRING with MPI_Error_string :

MPI_Comm_set_errhandler(MPI_COMM_WORLD,MPI_ERRORS_RETURN);
ierr = MPI_Waitall(2*ntids-2,requests,status);
if (ierr!=0) {
   char errtxt[MPI_MAX_ERROR_STRING];
   for (int i=0; i<2*ntids-2; i++) {
     int err = status[i].MPI_ERROR;
     int len=MPI_MAX_ERROR_STRING;
     MPI_Error_string(err,errtxt,&len);
     printf("Waitall error: %d %s\n",err,errtxt);
   }
   MPI_Abort(MPI_COMM_WORLD,0);
}

One case where errors can be handled is that of MPI file I/O : if an output file has the wrong permissions, the code can possibly progress without writing data, or write to a temporary file instead.

MPI operators ( MPI_Op ) do not return an error code. In case of an error they call MPI_Abort ; if MPI_ERRORS_RETURN is the error handler, error codes may be silently ignored.

You can create your own error handler with

Synopsis
MPI_Comm_create_errhandler( errhandler_fn, err_handler )
Input argument:
errhandler_fn

Output argument:
err_handler

C:
int MPI_Comm_create_errhandler
   ( MPI_Comm_errhandler_function *errhandler_fn,
     MPI_Errhandler *err_handler )
typedef void MPI_Comm_errhandler_function(MPI_Comm *, int *, ...)

Fortran:
Subroutine MPI_Comm_errhandler_function(comm, error_code) BIND(C)
TYPE(MPI_Comm) ::  comm
INTEGER ::  error_code
MPI_Comm_create_errhandler , which is then installed with MPI_Comm_set_errhandler . You can retrieve the error handler with MPI_Comm_get_errhandler .
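
As a sketch, a handler that prints the error string and then aborts; the function and variable names are of course arbitrary:

void my_error_handler(MPI_Comm *comm,int *errcode,...) {
  char errtxt[MPI_MAX_ERROR_STRING]; int len;
  MPI_Error_string(*errcode,errtxt,&len);
  fprintf(stderr,"MPI error %d: %s\n",*errcode,errtxt);
  MPI_Abort(*comm,*errcode);
}

MPI_Errhandler my_handler;
MPI_Comm_create_errhandler(my_error_handler,&my_handler);
MPI_Comm_set_errhandler(MPI_COMM_WORLD,my_handler);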

MPL note

MPL does not allow for access to the wrapped communicators. However, for MPI_COMM_WORLD , the routine MPI_Comm_set_errhandler can be called directly. End of MPL note

14.2.3 Defining your own MPI errors

You can define your own errors that behave like MPI errors. As an example, let's write a send routine that refuses to send zero-sized data.

The first step to defining a new error is to define an error class with MPI_Add_error_class :

int nonzero_class;
MPI_Add_error_class(&nonzero_class);

This error number is larger than MPI_ERR_LASTCODE , the upper bound on built-in error codes. The attribute MPI_LASTUSEDCODE records the last issued value.

Your new error code is then defined in this class with MPI_Add_error_code , and an error string can be added with MPI_Add_error_string :

int nonzero_code;
MPI_Add_error_code(nonzero_class,&nonzero_code);
MPI_Add_error_string(nonzero_code,"Attempting to send zero buffer");

You can then call an error handler with this code. For instance to have a wrapped send routine that will not send zero-sized messages:

// errorclass.c
int MyPI_Send( void *buffer,int n,MPI_Datatype type, int target,int tag,MPI_Comm comm) {
  if (n==0)
    MPI_Comm_call_errhandler( comm,nonzero_code );
  MPI_Ssend(buffer,n,type,target,tag,comm);
  return MPI_SUCCESS;
}
Here we used the default error handler associated with the communicator; one can create and install a different one with MPI_Comm_create_errhandler and MPI_Comm_set_errhandler .

We test our example:

for (int msgsize=1; msgsize>=0; msgsize--) {
  double buffer;
  if (procno==0) {
    printf("Trying to send buffer of length %d\n",msgsize);
    MyPI_Send(&buffer,msgsize,MPI_DOUBLE, 1,0,comm);
    printf(".. success\n");
  } else if (procno==1) {
    MPI_Recv (&buffer,msgsize,MPI_DOUBLE, 0,0,comm,MPI_STATUS_IGNORE);
  }
}
which gives:

Trying to send buffer of length 1
.. success
Trying to send buffer of length 0
Abort(1073742081) on node 0 (rank 0 in comm 0):
Fatal error in MPI_Comm_call_errhandler: Attempting to send zero buffer

14.3 Fortran issues

MPI is typically written in C; what if you program in Fortran ?

See section  6.2.2.3 for MPI types corresponding to Fortran90 types .

14.3.1 Assumed-shape arrays

Use of noncontiguous data, for instance an array section such as A(1:N:2) , used to be a problem in MPI calls, especially nonblocking ones; in that case it was best to copy the data to a contiguous array. This has been fixed in MPI-3.

  • Fortran routines have the same prototype as C routines except for the addition of an integer error parameter.
  • The call for MPI_Init in Fortran does not have the commandline arguments; they need to be handled separately.
  • The routine MPI_Sizeof is only available in Fortran; it provides the functionality of the C/C++ operator sizeof .

14.4 Progress

The term asynchronous progress describes the ability of MPI messages to continue on their way through the network while the application is otherwise busy.

The problem here is that, unlike straight MPI_Send and MPI_Recv calls, communication of this sort can typically not be off-loaded to the network card, so different mechanisms are needed.

This can happen in a number of ways:

  • Compute nodes may have a dedicated communications processor. The Intel Paragon was of this design; modern multicore processors are a more efficient realization of this idea.
  • The MPI library may reserve a core or thread for communications processing. This is implementation dependent; see Intel MPI information below.
  • Absent such dedicated resources, the application can force MPI to make progress by occasional calls to a polling routine such as MPI_Iprobe .
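
As a sketch of the last option: interleave the computation with occasional polling calls, so that a pending nonblocking receive can be completed; the request, the step count, and the work routine are assumed to exist:

int flag;
for (int step=0; step<nsteps; step++) {
  do_local_work(step);                        // hypothetical compute routine
  // polling gives the MPI library a chance to make progress:
  MPI_Test(&request,&flag,MPI_STATUS_IGNORE);
  // alternatively:
  // MPI_Iprobe(MPI_ANY_SOURCE,MPI_ANY_TAG,comm,&flag,MPI_STATUS_IGNORE);
}
MPI_Wait(&request,MPI_STATUS_IGNORE);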

Remark

The MPI_Probe call is somewhat similar, in spirit if not quite in functionality, to MPI_Test . However, they behave differently with respect to progress. Quoting the standard:

The MPI implementation of MPI_Probe and MPI_Iprobe needs to guarantee progress: if a call to MPI_Probe has been issued by a process, and a send that matches the probe has been initiated by some process, then the call to MPI_Probe will return.

In other words: probing causes MPI to make progress. On the other hand,

A call to MPI_Test returns flag = true if the operation identified by request is complete.

In other words, if progress has been made, then testing will report completion, but by itself it does not cause completion.

A similar problem arises with passive target synchronization: it is possible that the origin process may hang until the target process makes an MPI call.

The following calls force progress: MPI_Win_test , MPI_Request_get_status .

Intel note

Asynchronous progress support is only available with the release_mt and debug_mt versions of the Intel MPI library. Set I_MPI_ASYNC_PROGRESS to 1 to enable asynchronous progress threads, and I_MPI_ASYNC_PROGRESS_THREADS to set the number of progress threads.

See https://software.intel.com/en-us/mpi-developer-guide-linux-asynchronous-progress-control and https://software.intel.com/en-us/mpi-developer-reference-linux-environment-variables-for-asynchronous-progress-control . End of Intel note

14.5 Fault tolerance

Processors are not completely reliable, so it may happen that one `breaks': for software or hardware reasons it becomes unresponsive. For an MPI program this means that it becomes impossible to send data to it, and any collective operation involving it will hang. Can we deal with this case? Yes, but it involves some programming.

First of all, one of the possible MPI error return codes (section  14.2 ) is MPI_ERR_COMM , which can be returned if a processor in the communicator is unavailable. You may want to catch this error, and add a `replacement processor' to the program. For this, the MPI_Comm_spawn routine can be used (see section  8.1 for details). But this requires a change of program design: the communicator containing the new process(es) is not part of the old MPI_COMM_WORLD , so it is better to set up your code as a collection of inter-communicators to begin with.

14.6 Performance, tools, and profiling

In most of this book we talk about functionality of the MPI library. There are cases where a problem can be solved in more than one way, and then we wonder which one is the most efficient. In this section we will explicitly address performance. We start with two sections on the mere act of measuring performance.

14.6.1 Timing

MPI has a wall clock timer:

 C:
double MPI_Wtime(void);

Fortran:
DOUBLE PRECISION MPI_WTIME()

Python:
MPI.Wtime()
MPI_Wtime , which gives the number of seconds from a certain point in the past. (Note the absence of the error parameter in the Fortran call.)

double t;
t = MPI_Wtime();
for (int n=0; n<NEXPERIMENTS; n++) {
  // do something;
}
t = MPI_Wtime()-t; t /= NEXPERIMENTS;

The timer has a resolution of

C:
double MPI_Wtick(void);

Fortran:
DOUBLE PRECISION MPI_WTICK()

Python
MPI.Wtick()

MPI_Wtick .

MPL note

The timing routines wtime , wtick , and wtime_is_global are environment methods:

double 	mpl::environment::wtime ();
double 	mpl::environment::wtick ();
bool mpl::environment::wtime_is_global ();
End of MPL note

Timing in parallel is a tricky issue. For instance, most clusters do not have a central clock, so you cannot relate start and stop times on one process to those on another. You can test for a global clock by querying the attribute MPI_WTIME_IS_GLOBAL :

int *v,flag;
MPI_Comm_get_attr( comm, MPI_WTIME_IS_GLOBAL, &v, &flag );
if (mytid==0) printf("Time synchronized? %d->%d\n",flag,*v);

Normally you don't worry about the starting point for this timer: you call it before and after an event and subtract the values.

t = MPI_Wtime();
// something happens here
t = MPI_Wtime()-t;

If you execute this on a single processor you get fairly reliable timings, except that you would need to subtract the overhead for the timer. This is the usual way to measure timer overhead:

t = MPI_Wtime();
// absolutely nothing here
t = MPI_Wtime()-t;

14.6.1.1 Global timing

However, if you try to time a parallel application you will most likely get different times for each process, so you would have to take the average or maximum. Another solution is to synchronize the processors by using a barrier MPI_Barrier :

MPI_Barrier(comm);
t = MPI_Wtime();
// something happens here
MPI_Barrier(comm);
t = MPI_Wtime()-t;

Exercise

This scheme also has some overhead associated with it. How would you measure that?

14.6.1.2 Local timing

Now suppose you want to measure the time for a single send. It is not possible to start a clock on the sender and do the second measurement on the receiver, because the two clocks need not be synchronized. Usually a ping-pong is done:

if ( proc_source ) {
  MPI_Send( /* to target */ );
  MPI_Recv( /* from target */ );
} else if ( proc_target ) {
  MPI_Recv( /* from source */ );
  MPI_Send( /* to source */ );
}

No matter what sort of timing you are doing, it is good to know the accuracy of your timer. The routine MPI_Wtick gives the smallest possible timer increment. If you find that your timing result is too close to this `tick', you need to find a better timer (for CPU measurements there are cycle-accurate timers), or you need to increase your running time, for instance by increasing the amount of data.
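
Putting these elements together, a sketch of a timed ping-pong between ranks 0 and 1, with procno and comm as in the other examples; the one-way latency is estimated as half the averaged round-trip time:

#define NPING 1000
double buffer=1., t=MPI_Wtime();
for (int i=0; i<NPING; i++) {
  if (procno==0) {
    MPI_Send(&buffer,1,MPI_DOUBLE, 1,0,comm);
    MPI_Recv(&buffer,1,MPI_DOUBLE, 1,0,comm,MPI_STATUS_IGNORE);
  } else if (procno==1) {
    MPI_Recv(&buffer,1,MPI_DOUBLE, 0,0,comm,MPI_STATUS_IGNORE);
    MPI_Send(&buffer,1,MPI_DOUBLE, 0,0,comm);
  }
}
t = MPI_Wtime()-t;
if (procno==0)
  printf("Estimated one-way latency: %e sec\n",t/(2*NPING));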

14.6.2 Simple profiling

MPI allows you to write your own profiling interface. To make this possible, every routine MPI_Something calls a routine PMPI_Something that does the actual work. You can now write your own MPI_... routine, which inserts your profiling calls and then calls PMPI_... to do the actual work.

FIGURE 14.1: Calling hierarchy of MPI and PMPI routines

See figure  14.1 .

By default, the MPI routines are defined as weak linker symbols as a synonym of the PMPI ones. In the gcc case:

#pragma weak MPI_Send = PMPI_Send

FIGURE 14.2: A stack trace, showing the \texttt{PMPI} calls.

As you can see in figure  14.2 , normally only the PMPI routines show up in the stack trace.
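
As a sketch, a wrapper that accumulates the time spent in sends and then hands off to the underlying implementation through the PMPI entry point:

static double send_time = 0.;
int MPI_Send(const void *buf,int count,MPI_Datatype datatype,
             int dest,int tag,MPI_Comm comm) {
  double t = MPI_Wtime();
  int ierr = PMPI_Send(buf,count,datatype,dest,tag,comm);
  send_time += MPI_Wtime()-t;
  return ierr;
}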

14.6.3 Tools interface

Recent versions of MPI have a standardized way of reading out performance variables: the tools interface , which improves on the old profiling interface described in section  14.6.2 .

The realization of the tools interface is installation-dependent, so you first need to query how much of the tools interface is provided.

// mpit.c
MPI_Init_thread(&argc,&argv,MPI_THREAD_SINGLE,&tlevel);
MPI_T_init_thread(MPI_THREAD_SINGLE,&tlevel);
int npvar;
MPI_T_pvar_get_num(&npvar);

int name_len=256,desc_len=256,
  verbosity,var_class,binding,isreadonly,iscontiguous,isatomic;
char var_name[256],description[256];
MPI_Datatype datatype; MPI_T_enum enumtype;
for (int pvar=0; pvar<npvar; pvar++) {
  MPI_T_pvar_get_info(pvar,var_name,&name_len,
                      &verbosity,&var_class,
                      &datatype,&enumtype,
                      description,&desc_len,
                      &binding,&isreadonly,&iscontiguous,&isatomic);
  if (procid==0)
    printf("pvar %d: %d/%s = %s\n",pvar,var_class,var_name,description);
}

14.6.4 Programming for performance

We outline some issues pertaining to performance.

Eager limit

Short blocking messages are handled by a simpler mechanism than longer ones. The limit on what is considered `short' is known as the eager limit (section  4.1.2.2 ), and you could tune your code by increasing its value. However, note that a process will typically have a buffer accommodating eager sends for every single other process, which may eat into your available memory.

Blocking versus nonblocking
The issue of blocking versus nonblocking communication is something of a red herring. While nonblocking communication allows latency hiding , we can not consider it an alternative to blocking sends, since replacing nonblocking by blocking calls will usually give deadlock .

Still, even if you use nonblocking communication for the mere avoidance of deadlock or serialization (section  4.1.2.3 ), bear in mind the possibility of overlap of communication and computation. This also brings us to our next point.

Looking at it the other way around, in a code with blocking sends you may get better performance from nonblocking, even if that is not structurally necessary.

Progress

MPI is not magically active in the background, especially if the user code is doing scalar work that does not involve MPI. As sketched in section  14.4 , there are various ways of ensuring that latency hiding actually happens.

Persistent sends

If a communication between the same pair of processes, involving the same buffer, happens regularly, it is possible to set up a persistent communication . See section  5.1 .

Buffering

MPI uses internal buffers, and the copying from user data to these buffers may affect performance. For instance, derived types (section  6.3 ) can typically not be streamed straight through the network (this requires special hardware support  [LI:MpiDataUMR] ) so they are first copied. Somewhat surprisingly, we find that buffered communication (section  5.5 ) does not help. Perhaps MPI implementors have not optimized this mode since it is so rarely used.

This issue is extensively investigated in  [Eijkhout:MPItype-arxiv] .

Graph topology and neighborhood collectives

Load balancing and communication minimization are important in irregular applications. There are dedicated programs for this ( ParMetis , Zoltan ), and libraries such as PETSc may offer convenient access to such capabilities.

In the declaration of a graph topology (section  11.2 ) MPI is allowed to reorder processes, which could be used to support such activities. It can also serve for better message sequencing when neighborhood collectives are used.

Network issues

In the discussion so far we have assumed that the network is a perfect conduit for data. However, there are issues of port design, in particular caused by oversubscription, that adversely affect performance. While in an ideal world it may be possible to set up routing to avoid this, in the actual practice of a supercomputer cluster, network contention or message collisions from different user jobs are hard to avoid.

Offloading and onloading

There are different philosophies of network card design: Mellanox , being a network card manufacturer, believes in off-loading network activity to the NIC , while Intel , being a processor manufacturer, believes in `on-loading' activity to the processor. There are arguments either way.

Either way, investigate the capabilities of your network.

14.6.5 MPIR

MPIR is the informally specified debugging interface for process acquisition and message queue extraction.

14.7 Determinism

MPI processes are only synchronized to a certain extent, so you may wonder what guarantees there are that running a code twice will give the same result. You need to consider two cases: first of all, if the two runs are on different numbers of processors there are already numerical problems; see  [Eijkhout:IntroHPC] .

Let us then limit ourselves to two runs on the same set of processors. In that case, MPI is deterministic as long as you do not use wildcards such as MPI_ANY_SOURCE . Formally, MPI messages are `nonovertaking': two messages between the same sender-receiver pair will be matched in sequence. Strictly speaking they may not physically arrive in sequence: if the second message is much smaller than the first, it may arrive earlier in the lower transport layer; but the receives in the user program match them in the order in which they were sent.

14.8 Subtleties with processor synchronization

Blocking communication involves a complicated dialog between the two processors involved. Processor one says `I have this much data to send; do you have space for that?', to which processor two replies `yes, I do; go ahead and send', upon which processor one does the actual send. This back-and-forth (technically known as a handshake ) takes a certain amount of communication overhead. For this reason, network hardware will sometimes forgo the handshake for small messages, and just send them regardless, knowing that the other process has a small buffer for such occasions.

One strange side-effect of this strategy is that a code that should deadlock according to the MPI specification does not do so. In effect, you may be shielded from your own programming mistake! Of course, if you then run a larger problem, and the small message becomes larger than the threshold, the deadlock will suddenly occur. So you find yourself in the situation that a bug only manifests itself on large problems, which are usually harder to debug. In this case, replacing every MPI_Send with an MPI_Ssend will force the handshake, even for small messages.

Conversely, you may sometimes wish to avoid the handshake on large messages. MPI has a solution for this: the MPI_Rsend (`ready send') routine sends its data immediately, but it needs the receiver to be ready for this. How can you guarantee that the receiving process is ready? You could for instance do the following (this uses nonblocking routines, which are explained in section  4.2.1 ):

if ( receiving ) {
  MPI_Irecv()   // post nonblocking receive
  MPI_Barrier() // synchronize
} else if ( sending ) {
  MPI_Barrier() // synchronize
  MPI_Rsend()   // send data fast
}
When the barrier is reached, the receive has been posted, so it is safe to do a ready send. However, global barriers are not a good idea. Instead you would just synchronize the two processes involved.

Exercise

Give pseudo-code for a scheme where you synchronize the two processes through the exchange of a blocking zero-size message.

14.9 Shell interaction

MPI programs are not run directly from the shell, but are started through an ssh tunnel . We briefly discuss ramifications of this.

14.9.1 Standard input

Letting MPI processes interact with the environment is not entirely straightforward. For instance, shell input redirection as in

mpiexec -n 2 mpiprogram < someinput

may not work.

Instead, use a script programscript that has one parameter:

#!/bin/bash
mpiprogram < $1

and run this in parallel:

mpiexec -n 2 programscript someinput

14.9.2 Standard out and error

The stdout and stderr streams of an MPI process are returned through the ssh tunnel. Thus they can be caught as the stdout/err of mpiexec .

// outerr.c
fprintf(stdout,"This goes to std out\n");
fprintf(stderr,"This goes to std err\n");

14.9.3 Process status

The return code of MPI_Abort is returned as the process status of mpiexec . Running

// abort.c
if (procno==nprocs-1)
  MPI_Abort(comm,37);
as

mpiexec -n 4 ./abort ; \
echo "Return code from ${MPIRUN} is <<$?>>"

gives

TACC:  Starting up job 3760534
TACC:  Starting parallel tasks...
application called MPI_Abort(MPI_COMM_WORLD, 37) - process 3
TACC:  MPI job exited with code: 37
TACC:  Shutdown complete. Exiting.
Return code from ibrun is <<37>>

14.9.4 Multiple program start

The sort of script used in section  14.9.1 can also be used to implement MPMD runs: we let the script start one of a number of programs. For this, we use the fact that the MPI rank is available in the environment as PMI_RANK (the exact variable name may depend on the MPI implementation). Use a script mpmdscript :

#!/bin/bash
if [ ${PMI_RANK} -eq 0 ] ; then
  ./programmaster
else
  ./programworker
fi

which is then run in parallel:

mpiexec -n 25 mpmdscript

14.10 The origin of one-sided communication in ShMem

The Cray T3E had a library called shmem which offered a type of shared memory. Rather than having a true global address space, it worked by supporting variables that were guaranteed to be identical between processors, and indeed were guaranteed to occupy the same location in memory. Variables could be declared to be shared with a `symmetric' pragma or directive; their values could be retrieved or set by shmem_get and shmem_put calls.

14.11 Leftover topics

14.11.1 MPI constants

MPI has a number of built-in constants . These do not all behave the same.

  • Some are compile-time constants. Examples are MPI_VERSION and MPI_MAX_PROCESSOR_NAME . Thus, they can be used in array size declarations, even before MPI_Init .
  • Some link-time constants get their value by MPI initialization, such as MPI_COMM_WORLD . Such symbols, which include all predefined handles, can be used in initialization expressions.
  • Some link-time symbols can not be used in initialization expressions, such as MPI_BOTTOM and MPI_STATUS_IGNORE .

For symbols, the binary realization is not defined. For instance, MPI_COMM_WORLD is of type MPI_Comm , but the implementation of that type is not specified.

See Annex A of the MPI-3.1 standard for full lists.

The following are the compile-time constants:

  • MPI_MAX_PROCESSOR_NAME
  • MPI_MAX_LIBRARY_VERSION_STRING
  • MPI_MAX_ERROR_STRING
  • MPI_MAX_DATAREP_STRING
  • MPI_MAX_INFO_KEY
  • MPI_MAX_INFO_VAL
  • MPI_MAX_OBJECT_NAME
  • MPI_MAX_PORT_NAME
  • MPI_VERSION
  • MPI_SUBVERSION
  • MPI_STATUS_SIZE (Fortran only)
  • MPI_ADDRESS_KIND (Fortran only)
  • MPI_COUNT_KIND (Fortran only)
  • MPI_INTEGER_KIND (Fortran only)
  • MPI_OFFSET_KIND (Fortran only)
  • MPI_SUBARRAYS_SUPPORTED (Fortran only)
  • MPI_ASYNC_PROTECTS_NONBLOCKING (Fortran only)

The following are the link-time constants:

  • MPI_BOTTOM
  • MPI_STATUS_IGNORE
  • MPI_STATUSES_IGNORE
  • MPI_ERRCODES_IGNORE
  • MPI_IN_PLACE
  • MPI_ARGV_NULL
  • MPI_ARGVS_NULL
  • MPI_UNWEIGHTED
  • MPI_WEIGHTS_EMPTY

Assorted constants:

  • MPI_PROC_NULL
  • MPI_ANY_SOURCE
  • MPI_ANY_TAG
  • MPI_UNDEFINED
  • MPI_BSEND_OVERHEAD
  • MPI_KEYVAL_INVALID
  • MPI_LOCK_EXCLUSIVE
  • MPI_LOCK_SHARED
  • MPI_ROOT

(This section was inspired by http://blogs.cisco.com/performance/mpi-outside-of-c-and-fortran .)

14.11.2 Cancelling messages

In an earlier section we showed a master-worker example where the master accepts the messages from the workers in arbitrary order. Here we will show a slightly more complicated example, where only the result of the first task to complete is needed. Thus, we issue an MPI_Recv with MPI_ANY_SOURCE as source. When a result comes in, we broadcast its source to all processes. All the other workers then use this information to cancel their message with an MPI_Cancel operation.

// cancel.c
fprintf(stderr,"get set, go!\n");
if (procno==nprocs-1) {
  MPI_Status status;
  MPI_Recv(dummy,0,MPI_INT, MPI_ANY_SOURCE,0,comm,
                  &status);
  first_tid = status.MPI_SOURCE;
  MPI_Bcast(&first_tid,1,MPI_INT, nprocs-1,comm);
  fprintf(stderr,"[%d] first msg came from %d\n",procno,first_tid);
} else {
  float randomfraction = (rand() / (double)RAND_MAX);
  int randomwait = (int) ( nprocs * randomfraction );
  MPI_Request request;
  fprintf(stderr,"[%d] waits for %e/%d=%d\n",
	   procno,randomfraction,nprocs,randomwait);
  sleep(randomwait);
  MPI_Isend(dummy,0,MPI_INT, nprocs-1,0,comm,
                   &request);
  MPI_Bcast(&first_tid,1,MPI_INT, nprocs-1,comm
                  );
  if (procno!=first_tid) {
    MPI_Cancel(&request);
    fprintf(stderr,"[%d] canceled\n",procno);
  }
}

After the cancelling operation it is still necessary to call MPI_Request_free , MPI_Wait , or MPI_Test in order to free the request object.

The MPI_Cancel operation is local, so it can not be used for nonblocking collectives or one-sided transfers.

Remark

As of MPI-4, cancelling a send request is deprecated.

14.11.3 Constants

MPI constants such as MPI_COMM_WORLD or MPI_INT are not necessarily statically defined, such as by a #define statement: the best you can say is that they have a value after MPI_Init or MPI_Init_thread . That means you can not transfer a compiled MPI file between platforms, or even between compilers on one platform. However, a working MPI source on one MPI implementation will also work on another.

14.12 Literature

Online resources:

Tutorial books on MPI:
