MPI topic: Communication modes

5.1 : Persistent communication requests
5.1.1 : Persistent point-to-point communication
5.1.2 : Persistent collectives
5.1.3 : Persistent neighbor communications
5.2 : Partitioned communication
5.3 : Synchronous and asynchronous communication
5.3.1 : Synchronous send operations
5.4 : Local and nonlocal operations
5.5 : Buffered communication
5.5.1 : Buffer treatment
5.5.2 : Buffered send calls
5.5.3 : Persistent buffered communication

5 MPI topic: Communication modes

5.1 Persistent communication requests


Persistent communication is a mechanism for dealing with a repeating communication transaction, where the parameters of the transaction, such as sender, receiver, tag, root, and buffer type and size, stay the same. Only the contents of the buffers involved change between transactions.

You can imagine that setting up a communication carries some overhead, and if the same communication structure is repeated many times, this overhead may be avoided.

  1. For nonblocking communications MPI_Ixxx (both point-to-point and collective) there is a persistent variant MPI_Xxx_init with the same calling sequence. The `init' call produces an MPI_Request output parameter, which can be used to test for completion of the communication.
  2. The `init' routine does not start the actual communication: that is done in MPI_Start , or MPI_Startall for multiple requests.
  3. Any of the MPI `wait' calls can then be used to conclude the communication.
  4. The communication can then be restarted with another `start' call.
  5. The wait call does not release the request object, since it can be used for repeat occurrences of this transaction. The request object is freed, as usual, with MPI_Request_free .

MPI_Send_init( /* ... */ &request);
while ( /* ... */ ) {
  MPI_Start( &request );
  MPI_Wait( &request, &status );
}
MPI_Request_free( &request );

MPL note

MPL returns a prequest from persistent `init' routines, rather than an irequest (MPL note  4.2.2 ):

template<typename T >
prequest send_init (const T &data, int dest, tag t=tag(0)) const;

Likewise, there is a prequest_pool instead of an irequest_pool (section  4.2.2.1.3 ). End of MPL note

5.1.1 Persistent point-to-point communication


The main persistent point-to-point routines are MPI_Send_init , which has the same calling sequence as MPI_Isend , and MPI_Recv_init , which has the same calling sequence as MPI_Irecv .

In the following example a ping-pong is implemented with persistent communication. Since we use persistent operations for both send and receive on the `ping' process, we use MPI_Startall to start both at the same time, and MPI_Waitall to wait for their completion.

// persist.c
if (procno==src) {
  MPI_Send_init(send,s,MPI_DOUBLE,tgt,0,comm,requests+0);
  MPI_Recv_init(recv,s,MPI_DOUBLE,tgt,0,comm,requests+1);
  printf("Size %d\n",s);
  t[cnt] = MPI_Wtime();
  for (int n=0; n<NEXPERIMENTS; n++) {
	fill_buffer(send,s,n);
	MPI_Startall(2,requests);
	MPI_Waitall(2,requests,MPI_STATUSES_IGNORE);
	int r = chck_buffer(send,s,n);
	if (!r) printf("buffer problem %d\n",s);
  }
  t[cnt] = MPI_Wtime()-t[cnt];
  MPI_Request_free(requests+0); MPI_Request_free(requests+1);
} else if (procno==tgt) {
  for (int n=0; n<NEXPERIMENTS; n++) {
	MPI_Recv(recv,s,MPI_DOUBLE,src,0,comm,MPI_STATUS_IGNORE);
	MPI_Send(recv,s,MPI_DOUBLE,src,0,comm);
  }
}
## persist.py
sendbuf = np.ones(size,dtype=np.intc)
recvbuf = np.ones(size,dtype=np.intc)
if procid==src:
    print("Size:",size)
    times[isize] = MPI.Wtime()
    for n in range(nexperiments):
        requests[0] = comm.Isend(sendbuf[0:size],dest=tgt)
        requests[1] = comm.Irecv(recvbuf[0:size],source=tgt)
        MPI.Request.Waitall(requests)
        sendbuf[0] = sendbuf[0]+1
    times[isize] = MPI.Wtime()-times[isize]
elif procid==tgt:
    for n in range(nexperiments):
        comm.Recv(recvbuf[0:size],source=src)
        comm.Send(recvbuf[0:size],dest=src)

As with ordinary send commands, there are persistent variants of the other send modes:

  • MPI_Bsend_init for buffered communication, section  5.5 ;
  • MPI_Ssend_init for synchronous communication, section  5.3.1 ;
  • MPI_Rsend_init .

5.1.2 Persistent collectives


MPI 4 Standard only

For each collective call, there is a persistent variant (MPI-4). As with persistent point-to-point calls (section  5.1.1 ), these have the same calling sequence as the nonpersistent variants, except for an added MPI_Info parameter and a final MPI_Request parameter. This request (or an array of requests) can then be used by MPI_Start (or MPI_Startall ) to initiate the actual communication.

Some points.

  • Metadata arrays, such as the arrays of counts and datatypes, must not be altered until the MPI_Request_free call.
  • The initialization call is nonlocal, so it can block until all processes have performed it.
  • Multiple persistent collectives can be initialized, in which case they satisfy the same restrictions as ordinary collectives, in particular on ordering. Thus, the following code is incorrect:

    // WRONG
    if (procid==0) {
      MPI_Reduce_init( /* ... */ &req1);
      MPI_Bcast_init( /* ... */ &req2);
    } else {
      MPI_Bcast_init( /* ... */ &req2);
      MPI_Reduce_init( /* ... */ &req1);
    }
    

    However, after initialization the start calls can be in arbitrary order, and in different order among the processes.

Available persistent collectives are: MPI_Barrier_init , MPI_Bcast_init , MPI_Reduce_init , MPI_Allreduce_init , MPI_Reduce_scatter_init , MPI_Reduce_scatter_block_init , MPI_Gather_init , MPI_Gatherv_init , MPI_Allgather_init , MPI_Allgatherv_init , MPI_Scatter_init , MPI_Scatterv_init , MPI_Alltoall_init , MPI_Alltoallv_init , MPI_Alltoallw_init , MPI_Scan_init , MPI_Exscan_init .
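
For instance, a persistent reduction that is restarted in every iteration of a loop could be sketched as follows. This is illustrative rather than one of this book's example codes: the helper compute_local_contribution and the bound niterations are hypothetical, and an MPI-4 library is assumed. Note the MPI_Info argument of the initialization call.

double localsum,globalsum;
MPI_Request reduce_request;
MPI_Allreduce_init
  (&localsum,&globalsum,1,MPI_DOUBLE,MPI_SUM,
   comm,MPI_INFO_NULL,&reduce_request);
for (int it=0; it<niterations; it++) {
  localsum = compute_local_contribution(it);  // hypothetical helper
  MPI_Start(&reduce_request);
  MPI_Wait(&reduce_request,MPI_STATUS_IGNORE);
  // globalsum now holds this iteration's reduction
}
MPI_Request_free(&reduce_request);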

End of MPI 4 note

5.1.3 Persistent neighbor communications


MPI 4 Standard only

The available persistent neighbor collectives are: MPI_Neighbor_allgather_init , MPI_Neighbor_allgatherv_init , MPI_Neighbor_alltoall_init , MPI_Neighbor_alltoallv_init , MPI_Neighbor_alltoallw_init .

End of MPI 4 note

5.2 Partitioned communication


MPI 4 Standard only

Partitioned communication is a variant on persistent communication , where a message is constructed in partitions.

  • The normal MPI_Send_init is replaced by MPI_Psend_init .
  • After this, MPI_Start does not actually start the transfer; instead:
  • Each partition of the message is separately declared as ready-to-be-sent with MPI_Pready .

A common scenario for this is in multi-threaded environments, where each thread can construct its own part of a message. Having partitioned messages means that partially constructed message buffers can be sent off without having to wait for all threads to finish.

Indicating that parts of a message are ready for sending is done by one of the following calls:

  • MPI_Pready for a single partition;
  • MPI_Pready_range for a range of partitions; and
  • MPI_Pready_list for an explicitly enumerated list of partitions.

The MPI_Psend_init call yields an MPI_Request object that can be used to test for completion (see sections 4.2.2.1 and  4.2.3 ) of the full operation started with MPI_Start .

On the receiving side:

  • A call to MPI_Recv_init is replaced by MPI_Precv_init .
  • Arrival of a partition can be tested with MPI_Parrived .

Again, the MPI_Request object from the receive-init call can be used to test for completion of the full receive operation.
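
As an illustration, the sending side could be sketched as follows. This is not one of this book's example codes: the fill_partition routine and the receiver rank are hypothetical, and an MPI-4 library is assumed. Note that the count argument counts elements per partition, and that the initialization call takes an MPI_Info argument.

#define NPARTS 4
#define PARTLEN 1000
double sendbuffer[NPARTS*PARTLEN];
MPI_Request psend_request;
MPI_Psend_init
  (sendbuffer,NPARTS,PARTLEN,MPI_DOUBLE,
   receiver,0,comm,MPI_INFO_NULL,&psend_request);
MPI_Start(&psend_request);            // no data is transferred yet
for (int p=0; p<NPARTS; p++) {
  fill_partition(sendbuffer+p*PARTLEN,PARTLEN,p);  // hypothetical helper
  MPI_Pready(p,psend_request);        // partition p may now be sent
}
MPI_Wait(&psend_request,MPI_STATUS_IGNORE);
MPI_Request_free(&psend_request);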

End of MPI 4 note

5.3 Synchronous and asynchronous communication


It is easiest to think of blocking as a form of synchronization with the other process, but that is not quite true. Synchronization is a concept in itself, and we talk about synchronous communication if there is actual coordination going on with the other process, and asynchronous communication if there is not. Blocking then only refers to the program waiting until the user data is safe to reuse; in the synchronous case a blocking call means that the data is indeed transferred, while in the asynchronous case it only means that the data has been transferred to some system buffer.

FIGURE 5.1: Blocking and synchronicity

The four possible cases are illustrated in figure  5.1 .

5.3.1 Synchronous send operations


MPI has a number of routines for synchronous communication, such as MPI_Ssend . Driving home the point that nonblocking and asynchronous are different concepts, there is a routine MPI_Issend , which is synchronous but nonblocking. These routines have the same calling sequence as their not-explicitly synchronous variants, and only differ in their semantics.

See section  4.1.2.2 for examples.
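
As a minimal sketch (not one of this book's example codes; senddata , receiver , and do_local_work are placeholder names): the send call returns immediately, but the wait can only complete once the receiver has started receiving the message.

MPI_Request ssend_request;
MPI_Issend(senddata,count,MPI_DOUBLE,receiver,0,comm,&ssend_request);
do_local_work();                              // overlap work with the handshake
MPI_Wait(&ssend_request,MPI_STATUS_IGNORE);   // completes only after the receive has started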

5.4 Local and nonlocal operations


The MPI standard does not dictate whether communication is buffered. If a message is buffered, a send call can complete, even if no corresponding receive has been posted yet. See section  4.1.2.2 . Thus, in the standard communication mode, a send operation is nonlocal : its completion may depend on whether the corresponding receive has been posted.

On the other hand, buffered communication (routines MPI_Bsend , MPI_Ibsend , MPI_Bsend_init ; section  5.5 ) is local : the presence of an explicit buffer means that a send operation can complete no matter whether the receive has been posted.

The synchronous send (routines MPI_Ssend , MPI_Issend , MPI_Ssend_init ; section  14.8 ) is again nonlocal (even in the nonblocking variant) since it will only complete when the matching receive has started.

Finally, the ready mode send ( MPI_Rsend , MPI_Irsend ) is nonlocal in the sense that its only correct use is when the corresponding receive has been issued.
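
The difference can be made concrete with a small sketch; this is illustrative, not one of this book's example codes, and it assumes a two-process communicator comm with rank variable procno and a sufficiently large buffer already attached (section  5.5.1 ).

int other = 1-procno;   // the other process in a two-process run
double xsend=1.,xrecv;
/* Nonlocal: if both processes call MPI_Ssend before their receive,
   neither send can complete, and the code deadlocks: */
// MPI_Ssend(&xsend,1,MPI_DOUBLE,other,0,comm);
// MPI_Recv (&xrecv,1,MPI_DOUBLE,other,0,comm,MPI_STATUS_IGNORE);
/* Local: the buffered send completes regardless of the receive,
   so the same exchange pattern is correct: */
MPI_Bsend(&xsend,1,MPI_DOUBLE,other,0,comm);
MPI_Recv (&xrecv,1,MPI_DOUBLE,other,0,comm,MPI_STATUS_IGNORE);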

5.5 Buffered communication


FIGURE 5.2: User communication routed through an attached buffer

By now you have probably got the notion that managing buffer space in MPI is important: data has to be somewhere, either in user-allocated arrays or in system buffers. Using buffered communication is yet another way of managing buffer space.

  1. You allocate your own buffer space, and you attach it to your process. This buffer is not a send buffer: it is a replacement for buffer space used inside the MPI library or on the network card; see figure  5.2 . If high-bandwidth memory is available, you could create your buffer there.
  2. You use the
    C:
    int MPI_Bsend
       (const void *buf, int count, MPI_Datatype datatype,
        int dest, int tag, MPI_Comm comm)
    
    Input Parameters
    buf : initial address of send buffer (choice)
    count : number of elements in send buffer (nonnegative integer)
    datatype : datatype of each send buffer element (handle)
    dest : rank of destination (integer)
    tag : message tag (integer)
    comm : communicator (handle)
    
    MPI_Bsend (or its nonblocking variant MPI_Ibsend ) call for sending, using otherwise normal send and receive buffers;
  3. You detach the buffer when you're done with the buffered sends.

One advantage of buffered sends is that they are nonblocking: since there is a guaranteed buffer long enough to contain the message, it is not necessary to wait for the receiving process.

We illustrate the use of buffered sends:

// bufring.c
int bsize = BUFLEN*sizeof(float);
float
  *sbuf = (float*) malloc( bsize ),
  *rbuf = (float*) malloc( bsize );
MPI_Pack_size( BUFLEN,MPI_FLOAT,comm,&bsize);
bsize += MPI_BSEND_OVERHEAD;
float
  *buffer = (float*) malloc( bsize );

MPI_Buffer_attach( buffer,bsize );
err = MPI_Bsend(sbuf,BUFLEN,MPI_FLOAT,next,0,comm);
MPI_Recv (rbuf,BUFLEN,MPI_FLOAT,prev,0,comm,MPI_STATUS_IGNORE);
MPI_Buffer_detach( &buffer,&bsize );

5.5.1 Buffer treatment


There can be only one buffer per process, attached with MPI_Buffer_attach :

int MPI_Buffer_attach( void *buffer, int size );

Input arguments:
buffer : initial buffer address (choice)
size : buffer size, in bytes (integer)

Its size should be enough for all MPI_Bsend calls that are simultaneously outstanding. You can compute the needed size of the buffer with MPI_Pack_size ; see section  6.6.2 . Additionally, a term of MPI_BSEND_OVERHEAD is needed. See the above code fragment.
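
If several buffered sends can be outstanding simultaneously, the attached buffer must accommodate all of them, each with its own overhead term. A minimal sketch, where the number noutstanding and the message length BUFLEN are illustrative:

int onesize,bsize;
MPI_Pack_size( BUFLEN,MPI_FLOAT,comm,&onesize );
bsize = noutstanding * ( onesize + MPI_BSEND_OVERHEAD );
char *buffer = (char*) malloc( bsize );
MPI_Buffer_attach( buffer,bsize );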

The buffer is detached with MPI_Buffer_detach :

int MPI_Buffer_detach(
  void *buffer, int *size);

This returns the address and size of the buffer; the call blocks until all buffered messages have been delivered.

Note that both MPI_Buffer_attach and MPI_Buffer_detach have a void* argument for the buffer, but

  • in the attach routine this is the address of the buffer,
  • while in the detach routine it is the address of the buffer pointer.

This is done so that the detach routine can zero the buffer pointer.

While the buffered send is nonblocking like an MPI_Isend , there is no corresponding wait call. You can force delivery by detaching the buffer and immediately re-attaching it:

MPI_Buffer_detach( &b, &n );
MPI_Buffer_attach( b, n );

MPL note

Creating and attaching a buffer is done through bsend_buffer , with a support routine bsend_size to compute the buffer size:

// bufring.cxx
vector<float> sbuf(BUFLEN), rbuf(BUFLEN);
int size{ comm_world.bsend_size<float>(mpl::contiguous_layout<float>(BUFLEN)) };
mpl::bsend_buffer<> buff(size);
comm_world.bsend(sbuf.data(),mpl::contiguous_layout<float>(BUFLEN), next);

The constant mpl::bsend_overhead corresponds to the MPI constant MPI_BSEND_OVERHEAD . End of MPL note

MPL note

There is a separate attach routine, but normally this is called by the constructor of the bsend_buffer . Likewise, the detach routine is called in the buffer destructor.

void mpl::environment::buffer_attach (void *buff, int size);
std::pair< void *, int > mpl::environment::buffer_detach ();
End of MPL note

5.5.2 Buffered send calls


The possible error codes are

  • MPI_SUCCESS the routine completed successfully.
  • MPI_ERR_BUFFER The buffer pointer is invalid; this typically means that you have supplied a null pointer.
  • MPI_ERR_INTERN An internal error in MPI has been detected.

The asynchronous version is MPI_Ibsend , the persistent (see section  5.1 ) call is MPI_Bsend_init .
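
Note that MPI aborts on errors by default; to actually inspect these codes, set the error handler on the communicator to MPI_ERRORS_RETURN . A minimal sketch, reusing the names of the earlier bufring.c fragment:

MPI_Comm_set_errhandler( comm,MPI_ERRORS_RETURN );
int err = MPI_Bsend(sbuf,BUFLEN,MPI_FLOAT,next,0,comm);
if (err!=MPI_SUCCESS)
  printf("Buffered send failed with code %d\n",err);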

5.5.3 Persistent buffered communication


There is a persistent variant MPI_Bsend_init of buffered sends, as with regular sends (section  5.1 ):

Synopsis
int MPI_Bsend_init
   (const void *buf, int count, MPI_Datatype datatype,
    int dest, int tag, MPI_Comm comm,
    MPI_Request *request)

Input Parameters
buf : initial address of send buffer (choice)
count : number of elements sent (integer)
datatype : type of each element (handle)
dest : rank of destination (integer)
tag : message tag (integer)
comm : communicator (handle)

Output Parameters
request : communication request (handle)
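
A sketch of its use, combining the persistent start/wait cycle of section  5.1 with an attached buffer as in section  5.5.1 . This is illustrative rather than one of this book's example codes; the names buffer , bsize , sbuf , next , and fill_buffer are modeled on the earlier fragments, and niterations is hypothetical.

MPI_Request bsend_request;
MPI_Buffer_attach( buffer,bsize );   // buffer sized as in section 5.5.1
MPI_Bsend_init(sbuf,BUFLEN,MPI_FLOAT,next,0,comm,&bsend_request);
for (int it=0; it<niterations; it++) {
  fill_buffer(sbuf,BUFLEN,it);       // refill the send buffer
  MPI_Start(&bsend_request);
  MPI_Wait(&bsend_request,MPI_STATUS_IGNORE);
}
MPI_Request_free(&bsend_request);
MPI_Buffer_detach( &buffer,&bsize );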
