MPI topic: One-sided communication

9.1 : Windows
9.1.1 : Window creation and allocation
9.2 : Active target synchronization: epochs
9.3 : Put, get, accumulate
9.3.1 : Put
9.3.2 : Get
9.3.3 : Put and get example: halo update
9.3.4 : Accumulate
9.3.5 : Ordering and coherence of RMA operations
9.3.6 : Request-based operations
9.3.7 : More active target synchronization
9.3.8 : Atomic operations
9.3.8.1 : A case study in atomic operations
9.4 : Passive target synchronization
9.4.1 : Lock types
9.4.2 : Lock all
9.4.3 : Completion and consistency in passive target synchronization
9.4.3.1 : Local completion
9.4.3.2 : Remote completion
9.4.3.3 : Window synchronization
9.5 : More about window memory
9.5.1 : Memory models
9.5.2 : Dynamically attached memory
9.5.3 : Window usage hints
9.5.4 : Window information
9.6 : Assertions
9.7 : Implementation
9.8 : Review questions

9 MPI topic: One-sided communication

Above, you saw point-to-point operations of the two-sided type: they require the co-operation of a sender and receiver. This co-operation could be loose: you can post a receive with MPI_ANY_SOURCE as sender, but there still has to be both a send and a receive call. This two-sidedness can be limiting. Consider code where the receiving process is a dynamic function of the data:

x = f();
p = hash(x);
MPI_Send( x, /* to: */ p );

The problem is now: how does p know to post a receive, and how does everyone else know not to?

In this section, you will see one-sided communication routines where a process can do a `put' or `get' operation, writing data to or reading it from another processor, without that other processor's involvement.

In one-sided MPI operations, known as RMA operations in the standard, or as RDMA in other literature, there are still two processes involved: the origin , which is the process that originates the transfer, whether this is a `put' or a `get', and the target whose memory is being accessed. Unlike with two-sided operations, the target does not perform an action that is the counterpart of the action on the origin.

That does not mean that the origin can access arbitrary data on the target at arbitrary times. First of all, one-sided communication in MPI is limited to accessing only a specifically declared memory area on the target: the target declares an area of user-space memory that is accessible to other processes. This is known as a window . Windows limit how origin processes can access the target's memory: you can only `get' data from a window or `put' it into a window; all the other memory is not reachable from other processes. On the origin there is no such limitation; any data can function as the source of a `put' or the recipient of a `get' operation.

The alternative to having windows is to use distributed shared memory or virtual shared memory : memory is distributed but acts as if it is shared. The so-called PGAS languages such as UPC use this model. The MPI RMA model makes it possible to lock a window, which makes programming slightly more cumbersome, but the implementation more efficient.

Within one-sided communication, MPI has two modes: active RMA and passive RMA. In active RMA , or active target synchronization , the target sets boundaries on the time period (the `epoch') during which its window can be accessed. The main advantage of this mode is that the origin program can perform many small transfers, which are aggregated behind the scenes. Active RMA acts much like asynchronous transfer with a concluding MPI_Waitall .

In passive RMA , or passive target synchronization , the target process puts no limitation on when its window can be accessed. ( PGAS languages such as UPC are based on this model: data is simply read or written at will.) While intuitively it is attractive to be able to write to and read from a target at arbitrary time, there are problems. For instance, it requires a remote agent on the target, which may interfere with execution of the main thread, or conversely it may not be activated at the optimal time. Passive RMA is also very hard to debug and can lead to strange deadlocks.

9.1 Windows


FIGURE 9.1: Collective definition of a window for one-sided data access

In one-sided communication, each processor can make an area of memory, called a window , available to one-sided transfers. This is stored in a variable of type MPI_Win . A process can put an arbitrary item from its own memory (not limited to any window) into the window of another process, or get something from the other process' window into its own memory.

A window can be characterized as follows:

  • The window is defined on a communicator, so the create call is collective; see figure  9.1 .
  • The window size can be set individually on each process. A zero size is allowed, but since window creation is collective, it is not possible to skip the create call.
  • You can set a `displacement unit' for the window: this is a number of bytes that will be used as the indexing unit. For example if you use sizeof(double) as the displacement unit, an  MPI_Put to location 8 will go to the 8th double. That's easier than having to specify the 64th byte.
  • The window is the target of data in a put operation, or the source of data in a get operation; see figure  9.2 .
  • There can be memory associated with a window, so it needs to be freed explicitly.

The typical calls involved are:

MPI_Info info;
MPI_Win window;
MPI_Win_allocate( /* size info */, info, comm, &memory, &window );
// do put and get calls
MPI_Win_free( &window );

FIGURE 9.2: Put and get between process memory and windows

9.1.1 Window creation and allocation


The memory for a window is at first sight ordinary data in user space. There are multiple ways you can associate data with a window:

  1. You can pass a user buffer to
    C:
    int MPI_Win_create
       (void *base, MPI_Aint size, int disp_unit,
        MPI_Info info, MPI_Comm comm, MPI_Win *win)
    
    Fortran:
    MPI_Win_create(base, size, disp_unit, info, comm, win, ierror)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: base
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
    INTEGER, INTENT(IN) :: disp_unit
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Win), INTENT(OUT) :: win
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror
    
    Python:
    MPI.Win.Create
       (memory, int disp_unit=1,
        Info info=INFO_NULL, Intracomm comm=COMM_SELF)
    
    MPI_Win_create . This buffer can be an ordinary array, or it can be created with MPI_Alloc_mem . (In the former case, it may not be possible to lock the window; section  9.4 .)
  2. You can let MPI do the allocation, so that MPI can perform various optimizations regarding placement of the memory. The user code then receives the pointer to the data from MPI. This can again be done in two ways:

    • Use
      Semantics:
      MPI_WIN_ALLOCATE(size, disp_unit, info, comm, baseptr, win)
      
      Input parameters:
      size: size of local window in bytes (non-negative integer)
      disp_unit local unit size for displacements, in bytes (positive
      integer)
      info: info argument (handle)
      comm: intra-communicator (handle)
      
      Output parameters:
      baseptr: address of local allocated window segment (choice)
      win: window object returned by the call (handle)
      
      C:
      int MPI_Win_allocate
         (MPI_Aint size, int disp_unit, MPI_Info info,
          MPI_Comm comm, void *baseptr, MPI_Win *win)
      
      Fortran:
      MPI_Win_allocate
         (size, disp_unit, info, comm, baseptr, win, ierror)
      USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
      INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
      INTEGER, INTENT(IN) :: disp_unit
      TYPE(MPI_Info), INTENT(IN) :: info
      TYPE(MPI_Comm), INTENT(IN) :: comm
      TYPE(C_PTR), INTENT(OUT) :: baseptr
      TYPE(MPI_Win), INTENT(OUT) :: win
      INTEGER, OPTIONAL, INTENT(OUT) :: ierror
      
      MPI_Win_allocate to create the data and the window in one call.
    • If a communicator is on a shared memory (see section  12.1 ) you can create a window in that shared memory with MPI_Win_allocate_shared . This will be useful for MPI shared memory ; see chapter  MPI topic: Shared memory .
  3. Finally, you can create a window with MPI_Win_create_dynamic which postpones the allocation; see section  9.5.2 .

First of all, MPI_Win_create creates a window from a pointer to memory. The data array must not be PARAMETER or static const .

The size parameter is measured in bytes. In C it can be computed with the sizeof operator;

// putfencealloc.c
MPI_Win the_window;
int *window_data;
MPI_Win_allocate(2*sizeof(int),sizeof(int),
		   MPI_INFO_NULL,comm,
		   &window_data,&the_window);
for doing this calculation in Fortran, see section  14.3.1 .

Python note

For computing the displacement in bytes, here is a good way for finding the size of numpy datatypes:

## putfence.py
intsize = np.dtype('int').itemsize
window_data = np.zeros(2,dtype=int)
win = MPI.Win.Create(window_data,intsize,comm=comm)

Next, one can obtain the memory from MPI by using MPI_Win_allocate , which has the data pointer as output. Note the void* in the C prototype; it is still necessary to pass a pointer to a pointer:

double *window_data;
MPI_Win_allocate( ... &window_data ... );

The routine

int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
MPI_Alloc_mem performs only the allocation part of MPI_Win_allocate , after which you still need to call MPI_Win_create .

  • An error of MPI_ERR_NO_MEM indicates that no memory could be allocated.

    MPI 4 Standard only

  • Allocated memory can be aligned by specifying an MPI_Info key of mpi_minimum_memory_alignment .

    End of MPI 4 note

This memory is freed with

MPI_Free_mem()

These calls reduce to malloc and free if there is no special memory area; SGI is an example where such memory does exist.
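
For illustration, here is a minimal sketch of this route: allocate memory with MPI_Alloc_mem , wrap a window around it with MPI_Win_create , and release both afterwards. (The communicator comm and the buffer size are assumptions for the example.)

// sketch: window memory obtained through MPI_Alloc_mem
double *buffer;
MPI_Win win;
MPI_Aint bytes = 10*sizeof(double);
MPI_Alloc_mem( bytes, MPI_INFO_NULL, &buffer );
MPI_Win_create( buffer, bytes, sizeof(double),
                MPI_INFO_NULL, comm, &win );
/* ... epochs with put/get calls ... */
MPI_Win_free( &win );   // free the window first,
MPI_Free_mem( buffer ); // then release the memory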

There will be more discussion of window memory in section  9.5.1 .

Python note

Unlike in C, the Python window allocate call does not return a pointer to the buffer memory. Should you need it, there are the following options:

  • Window objects expose the Python buffer interface. So you can do Pythonic things like

    mview = memoryview(win)
    array = numpy.frombuffer(win, dtype='i4')
    
  • If you really want the raw base pointer (as an integer), you can do any of these:

    base, size, disp_unit = win.atts
    base = win.Get_attr(MPI.WIN_BASE)
    
  • You can use mpi4py's builtin memoryview/buffer-like type, but I do not recommend it; it is much better to use NumPy as above:

    mem = win.tomemory() # type(mem) is MPI.memory, similar to memoryview, but quite limited in functionality
    base = mem.address
    size = mem.nbytes
    

9.2 Active target synchronization: epochs


One-sided communication has an obvious complication over two-sided: if you do a put call instead of a send, how does the recipient know that the data is there? This process of letting the target know the state of affairs is called `synchronization', and there are various mechanisms for it. First of all we will consider active target synchronization . Here the target knows when the transfer may happen (the communication epoch ), but does not do any data-related calls.

In this section we look at the first mechanism, which is to use a fence operation:

Semantics:
MPI_WIN_FENCE(assert, win)
IN assert: program assertion (integer)
IN win: window object (handle)

C:
int MPI_Win_fence(int assert, MPI_Win win)

F:
MPI_Win_fence(assert, win, ierror)
INTEGER, INTENT(IN) :: assert
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

Python:
win.Fence(self, int assertion=0)
MPI_Win_fence . This operation is collective on the communicator of the window. It is comparable to MPI_Wait calls for nonblocking communication. (Another, more sophisticated mechanism for active target synchronization is discussed in section  9.3.7 .)

The use of fences is somewhat complicated. The interval between two fences is known as an epoch . You can give various hints to the system about this epoch versus the ones before and after through the assert parameter.

MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
MPI_Get( /* operands */, win);
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);

In between the two fences the window is exposed, and while it is exposed you should not access it locally. If you absolutely need to access it locally, you can use an RMA operation for that. Also, there can be only one remote process that does a put; multiple accumulate accesses are allowed.

Fences are, together with other window calls, collective operations. That means they imply some amount of synchronization between processes. Consider:

MPI_Win_fence( ... win ... ); // start an epoch
if (mytid==0) // do lots of work
MPI_Win_fence( ... win ... ); // end the epoch

and assume that all processes execute the first fence more or less at the same time. The zero process does work before it can do the second fence call, but all other processes can call it immediately. However, they can not finish that second fence call until all one-sided communication is finished, which means they wait for the zero process.

FIGURE: A trace of a one-sided communication epoch where only process zero originates a one-sided transfer

As a further restriction, you cannot mix MPI_Get with MPI_Put or MPI_Accumulate calls in a single epoch. Hence, we can characterize an epoch as an access epoch on the origin, and as an exposure epoch on the target.

Assertions are an integer parameter: you can combine assertions by adding them or using bitwise OR. The value zero is always correct. For further information, see section  9.6 .

9.3 Put, get, accumulate


We will now look at the first three routines for doing one-sided operations: the Put, Get, and Accumulate calls. (We will look at so-called `atomic' operations in section  9.3.8 .) These calls are somewhat similar to a Send, Receive, and Reduce, except that of course only one process makes a call. Since one process does all the work, its calling sequence contains both a description of the data on the origin (the calling process) and the target (the affected other process).

As in the two-sided case, MPI_PROC_NULL can be used as a target rank.

Of the Put/Get/Accumulate routines, only Accumulate has an MPI_Op argument; it can be any of the predefined operators, but not a user-defined one (see section  3.10.1 ).

9.3.1 Put


The

C:
int MPI_Put(
  const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
  int target_rank,
  MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype,
  MPI_Win win)

Semantics:
IN origin_addr: initial address of origin buffer (choice)
IN origin_count: number of entries in origin buffer (non-negative integer)
IN origin_datatype: datatype of each entry in origin buffer (handle)
IN target_rank: rank of target (non-negative integer)
IN target_disp: displacement from start of window to target buffer (non-negative integer)
IN target_count: number of entries in target buffer (non-negative integer)
IN target_datatype: datatype of each entry in target buffer (handle)
IN win: window object used for communication (handle)

Fortran:
MPI_Put(origin_addr, origin_count, origin_datatype,
  target_rank, target_disp, target_count, target_datatype, win, ierror)
TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: origin_addr
INTEGER, INTENT(IN) :: origin_count, target_rank, target_count
TYPE(MPI_Datatype), INTENT(IN) :: origin_datatype, target_datatype
INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: target_disp
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

Python:

win.Put(self, origin, int target_rank, target=None)
MPI_Put call can be considered as a one-sided send. As such, it needs to specify

  • the target rank
  • the data to be sent from the origin, and
  • the location where it is to be written on the target.

The description of the data on the origin is the usual trio of buffer/count/datatype. However, the description of the data on the target is more complicated. It has a count and a datatype, but instead of an address it has a displacement with respect to the start of the window on the target. This displacement has type MPI_Aint , so it can be given in bytes, but strictly speaking it is a multiple of the displacement unit that was specified in the window definition.

Specifically, data is written starting at \[ \mathtt{window\_base} + \mathtt{target\_disp}\times \mathtt{disp\_unit}. \]

Here is a single put operation. Note that the window create and window fence calls are collective, so they have to be performed on all processors of the communicator that was used in the create call.

// putfence.c
MPI_Win the_window;
MPI_Win_create
  (&window_data,2*sizeof(int),sizeof(int),
   MPI_INFO_NULL,comm,&the_window);
MPI_Win_fence(0,the_window);
if (procno==0) {
  MPI_Put
    ( /* data on origin: */   &my_number, 1,MPI_INT,
	/* data on target: */   other,1,    1,MPI_INT,
	the_window);
}
MPI_Win_fence(0,the_window);
MPI_Win_free(&the_window);

Fortran note

The target displacement (here target_displacement ) is declared as an integer of `kind' MPI_ADDRESS_KIND :

// putfence.F90
integer(kind=MPI_ADDRESS_KIND) :: target_displacement
   target_displacement = 1
   call MPI_Put( my_number, 1,MPI_INTEGER, &
        other,target_displacement, &
        1,MPI_INTEGER, &
        the_window)

Prior to Fortran2008 , specifying a literal constant, such as  0 , could lead to bizarre runtime errors; the solution was to specify a zero-valued variable of the right type. With the mpi_f08 module this is no longer allowed. Instead you get an error such as

error #6285: There is no matching specific subroutine for this generic subroutine call.   [MPI_PUT]

Exercise

Revisit exercise  4.1.2.3 and solve it using MPI_Put . (Skeleton code: rightput)

Exercise

Write code where:

  • process 0 computes a random number r
  • if r < 0.5, zero writes in the window on 1;
  • if r ≥ 0.5, zero writes in the window on 2.

(Skeleton code: randomput)

9.3.2 Get


The

C:
int MPI_Get(
  const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
  int target_rank,
  MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype,
  MPI_Win win)

Semantics:
IN origin_addr: initial address of origin buffer (choice)
IN origin_count: number of entries in origin buffer (non-negative integer)
IN origin_datatype: datatype of each entry in origin buffer (handle)
IN target_rank: rank of target (non-negative integer)
IN target_disp: displacement from start of window to target buffer (non-negative integer)
IN target_count: number of entries in target buffer (non-negative integer)
IN target_datatype: datatype of each entry in target buffer (handle)
IN win: window object used for communication (handle)

Fortran:
MPI_Get(origin_addr, origin_count, origin_datatype,
  target_rank, target_disp, target_count, target_datatype, win, ierror)
TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: origin_addr
INTEGER, INTENT(IN) :: origin_count, target_rank, target_count
TYPE(MPI_Datatype), INTENT(IN) :: origin_datatype, target_datatype
INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: target_disp
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

Python:

win.Get(self, origin, int target_rank, target=None)
MPI_Get call is very similar.

Example:

// getfence.c
MPI_Win_create(&other_number,2*sizeof(int),sizeof(int),
               MPI_INFO_NULL,comm,&the_window);
MPI_Win_fence(0,the_window);
if (procno==0) {
  MPI_Get( /* data on origin: */   &my_number, 1,MPI_INT,
	       /* data on target: */   other,1,    1,MPI_INT,
	       the_window);
}
MPI_Win_fence(0,the_window);

We make a null window on processes that do not participate:

## getfence.py
if procid==0 or procid==nprocs-1:
    win_mem = np.empty( 1,dtype=np.float64 )
    win = MPI.Win.Create( win_mem,comm=comm )
else:
    win = MPI.Win.Create( None,comm=comm )

# put data on another process
win.Fence()
if procid==0 or procid==nprocs-1:
    putdata = np.empty( 1,dtype=np.float64 )
    putdata[0] = mydata
    print("[%d] putting %e" % (procid,mydata))
    win.Put( putdata,other )
win.Fence()

9.3.3 Put and get example: halo update



As an example, let's look at a halo update . The array  A is updated using the local values and the halo that comes from bordering processors, either through Put or Get operations.

In a first version we separate computation and communication. Each iteration has two fences. Between the two fences in the loop body we do the MPI_Put operation; between the second fence and the first one of the next iteration there is only computation, so we add the NOPRECEDE and NOSUCCEED assertions. The NOSTORE assertion states that the local window was not updated: the Put operation only works on remote windows.

for ( .... ) {
  update(A);
  MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
  for(i=0; i < toneighbors; i++)
    MPI_Put( ... );
  MPI_Win_fence((MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED), win);
  }

For much more about assertions, see section  9.6 below.

Next, we split the update in the core part, which can be done purely from local values, and the boundary, which needs local and halo values. Update of the core can overlap the communication of the halo.

for ( .... ) {
  update_boundary(A);
  MPI_Win_fence((MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE), win);
  for(i=0; i < fromneighbors; i++)
    MPI_Get( ... );
  update_core(A);
  MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
  }

The NOPRECEDE and NOSUCCEED assertions still hold, but the Get operation implies that instead of NOSTORE in the second fence, we use NOPUT in the first.

9.3.4 Accumulate


A third one-sided routine is

C:
int MPI_Accumulate
   (const void *origin_addr, int origin_count,MPI_Datatype origin_datatype,
    int target_rank,MPI_Aint target_disp, int target_count,MPI_Datatype target_datatype,
    MPI_Op op, MPI_Win win)
int MPI_Raccumulate
   (const void *origin_addr, int origin_count,MPI_Datatype origin_datatype,
    int target_rank,MPI_Aint target_disp, int target_count,MPI_Datatype target_datatype,
    MPI_Op op, MPI_Win win,MPI_Request *request)

Input Parameters

origin_addr : Initial address of buffer (choice).
origin_count : Number of entries in buffer (nonnegative integer).
origin_datatype : Data type of each buffer entry (handle).
target_rank : Rank of target (nonnegative integer).
target_disp : Displacement from start of window to beginning of target buffer (nonnegative integer).
target_count : Number of entries in target buffer (nonnegative integer).
target_datatype : Data type of each entry in target buffer (handle).
op : Reduce operation (handle).
win : Window object (handle).

Output Parameter

MPI_Raccumulate: RMA request
IERROR (Fortran only): Error status (integer).

Fortran:

MPI_ACCUMULATE
   (ORIGIN_ADDR, ORIGIN_COUNT, ORIGIN_DATATYPE,
    TARGET_RANK,TARGET_DISP, TARGET_COUNT, TARGET_DATATYPE,
    OP, WIN, IERROR)
<type> ORIGIN_ADDR(*)
INTEGER(KIND=MPI_ADDRESS_KIND) :: TARGET_DISP
INTEGER :: ORIGIN_COUNT, ORIGIN_DATATYPE,
           TARGET_RANK, TARGET_COUNT,TARGET_DATATYPE,
           OP, WIN, IERROR
MPI_RACCUMULATE
   (ORIGIN_ADDR, ORIGIN_COUNT, ORIGIN_DATATYPE,
    TARGET_RANK,TARGET_DISP, TARGET_COUNT, TARGET_DATATYPE,
    OP, WIN, REQUEST, IERROR)
<type> ORIGIN_ADDR(*)
INTEGER(KIND=MPI_ADDRESS_KIND) :: TARGET_DISP
INTEGER :: ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_COUNT, TARGET_DATATYPE,
           OP, WIN, REQUEST, IERROR

Python:
MPI.Win.Accumulate(self, origin, int target_rank, target=None, Op op=SUM)
MPI_Accumulate , which does a reduction operation on the data that is being put.

Accumulate is an atomic reduction with remote result. This means that multiple accumulates to a single target give the correct result. As with MPI_Reduce , the order in which the operands are accumulated is undefined.

The same predefined operators are available, but no user-defined ones. There is one extra operator, MPI_REPLACE : it has the effect that only the last result to arrive is retained.
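
As an illustration, here is a minimal sketch in the style of the earlier put example (the rank other , the communicator, and the window setup are assumptions carried over from that example): every process adds its contribution into the first element of the window on some process other .

// sketch: add a value into element 0 of the window on process `other'
int my_contribution = procno;
MPI_Win_fence(0,the_window);
MPI_Accumulate
  ( /* data on origin: */ &my_contribution, 1,MPI_INT,
    /* data on target: */ other, 0, 1,MPI_INT,
    MPI_SUM, the_window);
MPI_Win_fence(0,the_window);

Since the accumulation is atomic, it does not matter how many processes target the same element in the same epoch.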

Exercise

Implement an `all-gather' operation using one-sided communication: each processor stores a single number, and you want each processor to build up an array that contains the values from all processors. Note that you do not need a special case for a processor collecting its own value: doing `communication' between a processor and itself is perfectly legal.

Exercise

Implement a shared counter:

  • One process maintains a counter;
  • Iterate: all others at random moments update this counter.
  • When the counter is no longer positive, everyone stops iterating.

The problem here is data synchronization: does everyone see the counter the same way?

9.3.5 Ordering and coherence of RMA operations


There are few guarantees about what happens inside one epoch.

  • No ordering of Get and Put/Accumulate operations: if you do both, there is no guarantee whether the Get will find the value before or after the update.
  • No ordering of multiple Puts. It is safer to do an Accumulate.

The following operations are well-defined inside one epoch:

  • Instead of multiple Put operations, use Accumulate with MPI_REPLACE .
  • MPI_Get_accumulate with MPI_NO_OP is safe; see the sketch after this list.
  • Multiple Accumulate operations from one origin are done in program order by default. To allow reordering, for instance to have all reads happen after all writes, use the info parameter when the window is created; section  9.5.3 .
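
For instance, reading a single window element safely inside an epoch could look like this; it is only a sketch, with the rank other and the window the_window assumed from the earlier examples.

// sketch: atomic read of one element with MPI_Get_accumulate and MPI_NO_OP
int dummy=0, readout;
MPI_Get_accumulate
  ( /* origin data, ignored under MPI_NO_OP: */ &dummy,   1,MPI_INT,
    /* result is returned here:              */ &readout, 1,MPI_INT,
    /* data on target:                       */ other, 0, 1,MPI_INT,
    MPI_NO_OP, the_window);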

9.3.6 Request-based operations


Analogous to MPI_Isend there are request-based one-sided operations:

C:
int MPI_Rput(
  const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
  int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype,
  MPI_Win win, MPI_Request *request)

Semantics:
IN origin_addr: initial address of origin buffer (choice)
IN origin_count: number of entries in origin buffer (non-negative integer)
IN origin_datatype: datatype of each entry in origin buffer (handle)
IN target_rank: rank of target (non-negative integer)
IN target_disp: displacement from start of window to target buffer (non-negative integer)
IN target_count: number of entries in target buffer (non-negative integer)
IN target_datatype: datatype of each entry in target buffer (handle)
IN win: window object used for communication (handle)
OUT request: RMA request (handle)
MPI_Rput and similarly MPI_Rget and MPI_Raccumulate and MPI_Rget_accumulate .

These can be used only with passive target synchronization. Any MPI_Win_flush... call also completes these transfers.
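
As a sketch of how such a request might be used (the rank other , the buffer my_number , and the window the_window are assumptions carried over from earlier examples):

// sketch: request-based put in a passive target epoch
MPI_Request put_req;
MPI_Win_lock(MPI_LOCK_SHARED,other,0,the_window);
MPI_Rput( &my_number,1,MPI_INT,
          other,0,1,MPI_INT,
          the_window,&put_req);
// waiting on the request gives local completion: the origin buffer can be reused
MPI_Wait(&put_req,MPI_STATUS_IGNORE);
MPI_Win_unlock(other,the_window); // remote completion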

9.3.7 More active target synchronization


The `fence' mechanism (section  9.2 ) uses a global synchronization on the communicator of the window. As such it is good for applications where the processes are largely synchronized, but it may lead to performance inefficiencies if processors are not in step with each other. There is a mechanism that is more fine-grained, by using synchronization only on a processor group . This takes four different calls, two for starting and two for ending the epoch, separately for target and origin.

FIGURE 9.5: Window locking calls in fine-grained active target synchronization

You start and complete an exposure epoch with MPI_Win_post and MPI_Win_wait :

int MPI_Win_post(MPI_Group group, int assert, MPI_Win win)
int MPI_Win_wait(MPI_Win win)

In other words, this turns your window into the target for a remote access.

You start and complete an access epoch with MPI_Win_start and MPI_Win_complete :

int MPI_Win_start(MPI_Group group, int assert, MPI_Win win)
int MPI_Win_complete(MPI_Win win)

In other words, these calls border the access to a remote window, with the current processor being the origin of the remote access.

In the following snippet a single processor puts data on one other process. Note that they both have their own definition of the group, and that the receiving process only does the post and wait calls.

// postwaitwin.c
if (procno==origin) {
  MPI_Group_incl(all_group,1,&target,&two_group);
// access
  MPI_Win_start(two_group,0,the_window);
  MPI_Put( /* data on origin: */   &my_number, 1,MPI_INT,
           /* data on target: */   target,0,   1,MPI_INT,
	       the_window);
  MPI_Win_complete(the_window);
}

if (procno==target) {
  MPI_Group_incl(all_group,1,&origin,&two_group);
// exposure
  MPI_Win_post(two_group,0,the_window);
  MPI_Win_wait(the_window);
}

Both pairs of operations declare a group of processors ; see section  7.5.1 for how to get such a group from a communicator. On an origin processor you would specify a group that includes the targets you will interact with; on a target processor you specify a group that includes the possible origins.

9.3.8 Atomic operations


One-sided calls are said to emulate shared memory in MPI, but the put and get calls are not enough for certain scenarios with shared data. Consider the scenario where:

  • One process stores a table of work descriptors, and a pointer to the first unprocessed descriptor;
  • Each process reads the pointer, reads the corresponding descriptor, and increments the pointer; and
  • A process that has read a descriptor then executes the corresponding task.

The problem is that reading and updating the pointer is not an atomic operation, so it is possible that multiple processes get hold of the same value; conversely, multiple updates of the pointer may lead to work descriptors being skipped. These different overall behaviors, depending on the precise timing of lower level events, are called a race condition .

In the MPI 3 standard some atomic routines have been added. Both

Semantics:

MPI_FETCH_AND_OP(origin_addr, result_addr, datatype, target_rank,
    target_disp, op, win)
IN origin_addr: initial address of buffer (choice)
OUT result_addr: initial address of result buffer (choice)
IN datatype: datatype of the entry in origin, result, and target buffers
(handle)
IN target_rank: rank of target (non-negative integer)
IN target_disp: displacement from start of window to beginning of target
buffer (non-negative integer)
IN op: reduce operation (handle)
IN win: window object (handle)

C:
int MPI_Fetch_and_op
   (const void *origin_addr, void *result_addr,
    MPI_Datatype datatype, int target_rank, MPI_Aint target_disp,
    MPI_Op op, MPI_Win win)

Fortran:
MPI_Fetch_and_op(origin_addr, result_addr, datatype, target_rank,
    target_disp, op, win, ierror)
TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: origin_addr
TYPE(*), DIMENSION(..), ASYNCHRONOUS :: result_addr
TYPE(MPI_Datatype), INTENT(IN) :: datatype
INTEGER, INTENT(IN) :: target_rank
INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: target_disp
TYPE(MPI_Op), INTENT(IN) :: op
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_Fetch_and_op and
C:
int MPI_Get_accumulate
   (const void *origin_addr, int origin_count,MPI_Datatype origin_datatype,
    void *result_addr, int result_count, MPI_Datatype result_datatype,
    int target_rank,
    MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype,
    MPI_Op op, MPI_Win win)

Input Parameters
origin_addr : initial address of buffer (choice)
origin_count : number of entries in buffer (nonnegative integer)
origin_datatype : datatype of each buffer entry (handle)

result_addr : initial address of result buffer (choice)
result_count : number of entries in result buffer (non-negative integer)
result_datatype : datatype of each entry in result buffer (handle)
target_rank : rank of target (nonnegative integer)
target_disp : displacement from start of window to beginning
    of target buffer (nonnegative integer)
target_count : number of entries in target buffer (nonnegative integer)
target_datatype : datatype of each entry in target buffer (handle)
op : predefined reduce operation (handle)
win : window object (handle)
MPI_Get_accumulate atomically retrieve data from the window indicated, and apply an operator, combining the data on the target with the data on the origin. Unlike Put and Get, it is safe to have multiple atomic operations in the same epoch.

Both routines perform the same operations: they return the data as it was before the operation, and then atomically update the data on the target; but MPI_Get_accumulate is more flexible in data type handling. The simpler routine MPI_Fetch_and_op , which operates on only a single element, allows for faster implementations, in particular through hardware support.

Use of MPI_NO_OP as the MPI_Op turns these routines into an atomic Get. Similarly, using MPI_REPLACE turns them into an atomic Put.
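
As a sketch (inside some epoch, with the rank other and window the_window again assumed from earlier examples), a single integer can be read or overwritten atomically this way:

// sketch: atomic single-element get and put through MPI_Fetch_and_op
int dummy=0, readout, newval=5, oldval;
// atomic get: MPI_NO_OP leaves the target unchanged
MPI_Fetch_and_op( &dummy,&readout, MPI_INT, other,0, MPI_NO_OP, the_window);
// atomic put: MPI_REPLACE overwrites the target; the old value is returned
MPI_Fetch_and_op( &newval,&oldval, MPI_INT, other,0, MPI_REPLACE, the_window);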

Exercise

Redo exercise  9.3.4 using MPI_Fetch_and_op . The problem is again to make sure all processes have the same view of the shared counter.

Does it work to make the fetch-and-op conditional? Is there a way to do it unconditionally? What should the `break' test be, seeing that multiple processes can update the counter at the same time?

Example

A root process has a table of data; the other processes do atomic gets and updates of that data using passive target synchronization through MPI_Win_lock .

// passive.cxx
if (procno==repository) {
// Repository processor creates a table of inputs
// and associates that with the window
}
if (procno!=repository) {
  float contribution=(float)procno,table_element;
  int loc=0;
  MPI_Win_lock(MPI_LOCK_EXCLUSIVE,repository,0,the_window);
// read the table element by getting the result from adding zero
  MPI_Fetch_and_op
    (&contribution,&table_element,MPI_FLOAT,
     repository,loc,MPI_SUM,the_window);
  MPI_Win_unlock(repository,the_window);
}
## passive.py
if procid==repository:
    # repository process creates a table of inputs
    # and associates it with the window
    win_mem = np.empty( ninputs,dtype=np.float32 )
    win = MPI.Win.Create( win_mem,comm=comm )
else:
    # everyone else has an empty window
    win = MPI.Win.Create( None,comm=comm )
if procid!=repository:
    contribution = np.empty( 1,dtype=np.float32 )
    contribution[0] = 1.*procid
    table_element = np.empty( 1,dtype=np.float32 )
    win.Lock( repository,lock_type=MPI.LOCK_EXCLUSIVE )
    win.Fetch_and_op( contribution,table_element,repository,0,MPI.SUM)
    win.Unlock( repository )

Finally,

C:
int MPI_Compare_and_swap
   (const void *origin_addr, const void *compare_addr,
    void *result_addr, MPI_Datatype datatype,
    int target_rank, MPI_Aint target_disp,
    MPI_Win win)

Input Parameters

origin_addr : initial address of buffer (choice)
compare_addr : initial address of compare buffer (choice)
result_addr : initial address of result buffer (choice)
datatype : datatype of the entry in origin, result, and target buffers (handle)
target_rank : rank of target (nonnegative integer)
target_disp : displacement from start of window to beginning
    of target buffer (non-negative integer)
win : window object (handle)
MPI_Compare_and_swap compares the compare buffer with the data on the target; if they are equal, the target data is replaced with the origin data. In either case the original target data is returned in the result buffer.
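
For example, a flag could be claimed only if it still has its initial value; this is a sketch, to be executed inside an epoch, where the rank flag_owner , the zero displacement, and the window are assumptions for the example.

// sketch: set a flag from 0 to my rank, but only if it is still 0
int compare=0, newval=procno, oldval;
MPI_Compare_and_swap
  ( /* value to write:           */ &newval,
    /* value to compare against: */ &compare,
    /* previous target contents: */ &oldval,
    MPI_INT, flag_owner, 0, the_window);
if (oldval==compare)
  printf("process %d claimed the flag\n",procno);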

9.3.8.1 A case study in atomic operations


Let us consider an example where a process, identified by counter_process , has a table of work descriptors, and all processes, including the counter process, take items from it to work on. To avoid duplicate work, the counter process has a counter that indicates the highest numbered available item. The part of this application that we simulate is this:

  1. a process reads the counter, to find an available work item; and
  2. subsequently decrements the counter by one.

We initialize the window content, under the separate memory model:

// countdownop.c
MPI_Win_fence(0,the_window);
if (procno==counter_process)
  MPI_Put(&counter_init,1,MPI_INT,
          counter_process,0,1,MPI_INT,
          the_window);
MPI_Win_fence(0,the_window);

We start by considering the naive approach, where we execute the above scheme literally with MPI_Get and MPI_Put :

// countdownput.c
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
         counter_process,0,1,MPI_INT,
         the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
  my_counter_values[ n_my_counter_values++ ] = counter_value;
  total_decrement++;
  int decrement = -1;
  counter_value += decrement;
  MPI_Put
    ( &counter_value,   1,MPI_INT,
      counter_process,0,1,MPI_INT,
      the_window);
}
MPI_Win_fence(0,the_window);

This scheme is correct if only one process has a true value for i_am_available : that process `owns' the current counter value, and it correctly updates the counter through the MPI_Put operation. However, if more than one process is available, they get duplicate counter values, and the update is also incorrect. If we run this program, we see that the counter did not get decremented by the total number of `put' calls.

Exercise

Supposing only one process is available, what is the function of the middle of the three fences? Can it be omitted?

We can fix the decrement of the counter by using MPI_Accumulate for the counter update, since it is atomic: multiple updates in the same epoch all get processed.

// countdownacc.c
MPI_Win_fence(0,the_window);
int counter_value;
MPI_Get( &counter_value,1,MPI_INT,
         counter_process,0,1,MPI_INT,
         the_window);
MPI_Win_fence(0,the_window);
if (i_am_available) {
  my_counter_values[n_my_counter_values++] = counter_value;
  total_decrement++;
  int decrement = -1;
  MPI_Accumulate
    ( &decrement,       1,MPI_INT,
      counter_process,0,1,MPI_INT,
      MPI_SUM,
      the_window);
}
MPI_Win_fence(0,the_window);

This scheme still suffers from the problem that processes will obtain duplicate counter values. The true solution is to combine the `get' and `put' operations into one atomic action; in this case MPI_Fetch_and_op :

MPI_Win_fence(0,the_window);
int
  counter_value;
if (i_am_available) {
  int
    decrement = -1;
  total_decrement++;
  MPI_Fetch_and_op
    ( /* operate with data from origin: */   &decrement,
      /* retrieve data from target:     */   &counter_value,
      MPI_INT, counter_process, 0, MPI_SUM,
      the_window);
}
MPI_Win_fence(0,the_window);
if (i_am_available) {
  my_counter_values[n_my_counter_values++] = counter_value;
}

Now, if there are multiple accesses, each retrieves the counter value and updates it in one atomic, that is, indivisible, action.

9.4 Passive target synchronization


In passive target synchronization only the origin is actively involved: the target makes no calls whatsoever. This means that the origin process remotely locks the window on the target, performs a one-sided transfer, and releases the window by unlocking it again.

During an access epoch, also called a passive target epoch in this case (the concept of `exposure epoch' makes no sense with passive target synchronization), a process can initiate and finish a one-sided transfer. Typically it will lock the window with

C:
int MPI_Win_lock(int lock_type, int rank, int assert, MPI_Win win)

Input Parameters:
lock_type - Indicates whether other processes may access the target window at the
    same time (if MPI_LOCK_SHARED) or not (MPI_LOCK_EXCLUSIVE)
rank - rank of locked window (nonnegative integer)
assert - Used to optimize this call; zero may be used as a default. (integer)
win - window object (handle)

Python:
MPI.Win.Lock(self,
    int rank, int lock_type=LOCK_EXCLUSIVE, int assertion=0)
MPI_Win_lock :

if (rank == 0) {
  MPI_Win_lock (MPI_LOCK_EXCLUSIVE, 1, 0, win);
  MPI_Put (outbuf, n, MPI_INT, 1, 0, n, MPI_INT, win);
  MPI_Win_unlock (1, win);
}

Remark

The possibility to lock a window is only guaranteed for windows whose memory was allocated, possibly internally, by MPI_Alloc_mem ; in practice that means all creation calls except MPI_Win_create on an ordinary user buffer.

9.4.1 Lock types


A lock is needed to start an access epoch , that is, for an origin to acquire the capability to access a target. You can either acquire a lock on a specific process with MPI_Win_lock , or on all processes (in a communicator) with MPI_Win_lock_all . Unlike MPI_Win_fence , this is not a collective call. Also, it is possible to have multiple access epochs through MPI_Win_lock active simultaneously.

The two lock types are:

  • MPI_LOCK_SHARED : multiple processes can access the window on the same rank. If multiple processes perform an MPI_Get call there is no problem; with MPI_Put and similar calls there is a consistency problem; see below.
  • MPI_LOCK_EXCLUSIVE : an origin gets exclusive access to the window on a certain target. Unlike the shared lock, this has no consistency problems.

To unlock a window, use

C:
int MPI_Win_unlock(int rank, MPI_Win win)
int MPI_Win_unlock_all(MPI_Win win)

Py:
MPI.Win.Unlock(self, int rank)
MPI.Win.Unlock_all(self)
MPI_Win_unlock , respectively MPI_Win_unlock_all .

Exercise

Investigate atomic updates using passive target synchronization. Use MPI_Win_lock with an exclusive lock, which means that each process only acquires the lock when it absolutely has to.

  • All processes but one update a window:

    int one=1;
    MPI_Fetch_and_op(&one, &readout,
        MPI_INT, repo, zero_disp, MPI_SUM,
        the_win);
    
  • while the remaining process spins until the others have performed their update.

Use an atomic operation for the latter process to read out the shared value. Can you replace the exclusive lock with a shared one? (Skeleton code: lockfetch)

Exercise

As exercise  9.4.1 , but now use a shared lock: all processes acquire the lock simultaneously and keep it as long as is needed.

The problem here is that coherence between window buffers and local variables is now not forced by a fence or releasing a lock. Use MPI_Win_flush_local to force coherence of a window (on another process) and the local variable from MPI_Fetch_and_op . (Skeleton code: lockfetchshared)

9.4.2 Lock all


To lock the windows of all processes in the group of the window, use

C:
int MPI_Win_lock_all(int assert, MPI_Win win)

Input Parameters:
assert - Used to optimize this call; zero may be used as a default. (integer)
win - window object (handle)
MPI_Win_lock_all . This is not a collective call: the `all' part refers to the fact that one process is locking the window on all processes.

  • The assertion value can be zero, or MPI_MODE_NOCHECK , which asserts that no other process will acquire a competing lock.
  • There is no `locktype' parameter: this is a shared lock.

The corresponding unlock is MPI_Win_unlock_all .

The expected use of a `lock/unlock all' is that they surround an extended epoch with get/put and flush calls.
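
A sketch of such an extended epoch follows (names are assumptions from the earlier examples, and the window on each process is assumed to hold nprocs integers); each process writes its rank into element procno of the window on every process:

// sketch: extended passive target epoch over all processes
MPI_Win_lock_all(0,the_window);
for (int p=0; p<nprocs; p++)
  MPI_Put( &procno,1,MPI_INT,
           p,procno,1,MPI_INT,   // element `procno' on process p
           the_window);
MPI_Win_flush_all(the_window);   // all puts are now complete at their targets
/* ... more one-sided calls and flushes ... */
MPI_Win_unlock_all(the_window);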

9.4.3 Completion and consistency in passive target synchronization


In one-sided transfer one should keep straight the multiple instances of the data, and the various completions that effect their consistency.

  • The user data. This is the buffer that is passed to a Put or Get call. For instance, after a Put call, but still in an access epoch, the user buffer is not safe to reuse. Making sure the buffer has been transferred is called local completion .
  • The window data. While this may be publicly accessible, it is not necessarily always consistent with internal copies.
  • The remote data. Even a successful Put does not guarantee that the other process has received the data. A successful transfer is a remote completion .

As observed, RMA operations are nonblocking, so we need mechanisms to ensure that an operation is completed, and to ensure consistency of the user and window data.

Completion of the RMA operations in a passive target epoch is ensured with MPI_Win_unlock or MPI_Win_unlock_all , similar to the use of MPI_Win_fence in active target synchronization.

If the passive target epoch is of greater duration, and no unlock operation is used to ensure completion, the following calls are available.

Remark

If you use flush routines with active target synchronization (or generally outside a passive target epoch), you are likely to get an error message such as

Wrong synchronization of RMA calls

9.4.3.1 Local completion


The call

Synopsis:
MPI_WIN_FLUSH_LOCAL(rank, win)
Input arguments:
rank: rank of target window (non-negative integer)
win : window object (handle)

C:
int MPI_Win_flush_local(int rank, MPI_Win win)

Fortran:
MPI_Win_flush_local(rank, win, ierror)
INTEGER, INTENT(IN) :: rank
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_WIN_FLUSH_LOCAL(RANK, WIN, IERROR)
INTEGER RANK, WIN, IERROR

Synopsis:
MPI_WIN_FLUSH_LOCAL_ALL(win)
Input arguments:
win : window object (handle)

C:
int MPI_Win_flush_local_all(MPI_Win win)

Fortran:
MPI_Win_flush_local_all(win, ierror)
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_WIN_FLUSH_LOCAL_ALL(WIN, IERROR)
INTEGER WIN, IERROR
MPI_Win_flush_local ensures that all operations with a given target are completed at the origin. For instance, for calls to MPI_Get or MPI_Fetch_and_op the local result is available after the MPI_Win_flush_local .

With MPI_Win_flush_local_all local operations are concluded for all targets. This will typically be used with MPI_Win_lock_all (section  9.4.2 ).
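
For instance, the result of a get can be used while the epoch is still open, as in this sketch (rank and window are again assumed as in the earlier examples):

// sketch: use the result of a get before the lock is released
int remote_value;
MPI_Win_lock(MPI_LOCK_SHARED,other,0,the_window);
MPI_Get( &remote_value,1,MPI_INT,
         other,0,1,MPI_INT, the_window);
MPI_Win_flush_local(other,the_window); // remote_value is now valid
printf("got %d\n",remote_value);
MPI_Win_unlock(other,the_window);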

9.4.3.2 Remote completion


The calls

Synopsis
MPI_WIN_FLUSH(rank, win)
Input arguments:
rank : rank of target window (non-negative integer)
win : window object (handle)

C:
int MPI_Win_flush(int rank, MPI_Win win)

Fortran:
MPI_Win_flush(rank, win, ierror)
INTEGER, INTENT(IN) :: rank
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_WIN_FLUSH(RANK, WIN, IERROR)
INTEGER RANK, WIN, IERROR

Synopsis:
MPI_WIN_FLUSH_ALL(win)
Input arguments:
win : window object (handle)

C:
int MPI_Win_flush_all(MPI_Win win)

Fortran:
MPI_Win_flush_all(win, ierror)
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_WIN_FLUSH_ALL(WIN, IERROR)
INTEGER WIN, IERROR
MPI_Win_flush and MPI_Win_flush_all effect completion of all outstanding RMA operations on the target, so that other processes can access its data. This is useful for MPI_Put operations, but can also be used for atomic operations such as MPI_Fetch_and_op .

9.4.3.3 Window synchronization


Under the separate memory model , the user code can hold a buffer that is not coherent with the internal window data. The call MPI_Win_sync synchronizes private and public copies of the window.
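
A minimal sketch of its use under the separate model, assuming the buffer window_data from the earlier examples; the synchronization has to happen inside a passive target epoch, here a lock on the process' own window:

// sketch: make a local store visible in the public window copy
MPI_Win_lock(MPI_LOCK_EXCLUSIVE,procno,0,the_window);
window_data[0] = 37;      // update the private copy
MPI_Win_sync(the_window); // synchronize private and public copies
MPI_Win_unlock(procno,the_window);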

9.5 More about window memory


9.5.1 Memory models


You may think that the window memory is the same as the buffer you pass to MPI_Win_create or that you get from MPI_Win_allocate (section  9.1.1 ). This is not necessarily true, and the actual state of affairs is called the memory model . There are two memory models:

  • Under the unified memory model, the buffer in process space is indeed the window memory, or at least they are kept coherent . This means that after completion of an epoch you can read the window contents from the buffer. To get this, the window needs to be created with MPI_Win_allocate_shared .
  • Under the separate memory model, the buffer in process space is the private window ; the target of put/get operations is the public window ; the two are not the same and are not kept coherent. Under this model, you need to do an explicit get to read the window contents.

(Window models can be queried as attributes; see section  9.5.4 .)

9.5.2 Dynamically attached memory


In section  9.1.1 we looked at simple ways to create a window and its memory.

It is also possible to have windows where the size is dynamically set. Create a dynamic window with

int MPI_Win_create_dynamic(MPI_Info info, MPI_Comm comm, MPI_Win *win)

Input Parameters
info : info argument (handle)
comm : communicator (handle)

Output Parameters
win : window object returned by the call (handle)
MPI_Win_create_dynamic and attach memory to the window with
Semantics:
MPI_Win_attach(win, base, size)

Input Parameters:
win : window object (handle)
base : initial address of memory to be attached
size : size of memory to be attached in bytes

C:
int MPI_Win_attach(MPI_Win win, void *base, MPI_Aint size)

Fortran:
MPI_Win_attach(win, base, size, ierror)
TYPE(MPI_Win), INTENT(IN) :: win
TYPE(*), DIMENSION(..), ASYNCHRONOUS :: base
INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_Win_attach .

At first sight, the code looks like splitting up a MPI_Win_create call into separate creation of the window and declaration of the buffer:

// windynamic.c
MPI_Win_create_dynamic(MPI_INFO_NULL,comm,&the_window);
if (procno==data_proc) {
  window_buffer = (int*) malloc( 2*sizeof(int) );
  MPI_Win_attach(the_window,window_buffer,2*sizeof(int));
}
(where the window_buffer represents memory that has been allocated.)

However, there is an important difference in how the window is addressed in RMA operations. With all other window models, the displacement parameter is measured in units relative to the start of the buffer; here the displacement is an absolute address. This means that we need to get the address of the window buffer with MPI_Get_address and communicate it to the other processes:

MPI_Aint data_address;
if (procno==data_proc) {
  MPI_Get_address(window_buffer,&data_address);
}
MPI_Bcast(&data_address,1,MPI_LONG,data_proc,comm);

The location of the data, that is, the displacement parameter, is then given as the absolute address of the start of the buffer plus an offset in bytes; in other words, the displacement unit is 1. In this example we use MPI_Get to find the second integer in a window buffer:

MPI_Aint disp = data_address+1*sizeof(int);
MPI_Get( /* data on origin: */           retrieve, 1,MPI_INT,
	       /* data on target: */ data_proc,disp,     1,MPI_INT,
	       the_window);

Notes.

  • The attached memory can be released with
    Semantics:
    MPI_Win_detach(win, base)
    
    Input parameters:
    win : window object (handle)
    base : initial address of memory to be detached
    
    C:
    int MPI_Win_detach(MPI_Win win, const void *base)
    
    Fortran:
    MPI_Win_detach(win, base, ierror)
    TYPE(MPI_Win), INTENT(IN) :: win
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: base
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror
    
    MPI_Win_detach .
  • The above fragments show that an origin process has the actual address of the window buffer. It is an error to use this if the buffer is not attached to a window.
  • In particular, one has to make sure that the attach call is concluded before performing RMA operations on the window.

9.5.3 Window usage hints


The following keys can be passed as info argument:

  • no_locks : if set to true, passive target synchronization (section  9.4 ) will not be used on this window.
  • accumulate_ordering : a comma-separated list of the keywords rar , raw , war , waw can be specified. This indicates that reads or writes from MPI_Accumulate or MPI_Get_accumulate can be reordered, subject to certain constraints.
  • accumulate_ops : the value same_op indicates that concurrent Accumulate calls use the same operator; same_op_no_op indicates the same operator or MPI_NO_OP .
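
A sketch of passing such hints at window creation (the buffer size and communicator are assumptions for the example):

// sketch: create a window with usage hints
MPI_Info win_info;
MPI_Info_create(&win_info);
MPI_Info_set(win_info,"no_locks","true");
MPI_Info_set(win_info,"accumulate_ordering","raw,waw");
int *window_data; MPI_Win the_window;
MPI_Win_allocate( 2*sizeof(int),sizeof(int),
                  win_info,comm,
                  &window_data,&the_window);
MPI_Info_free(&win_info);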

9.5.4 Window information


The MPI_Info parameter can be used to pass implementation-dependent information; see section  14.1.1 .

A number of attributes are stored with a window when it is created.

Obtaining a pointer to the start of the window area:

void *base;
MPI_Win_get_attr(win, MPI_WIN_BASE, &base, &flag)

Obtaining the size and window displacement unit :

MPI_Aint *size;
MPI_Win_get_attr(win, MPI_WIN_SIZE, &size, &flag),
int *disp_unit;
MPI_Win_get_attr(win, MPI_WIN_DISP_UNIT, &disp_unit, &flag),

The type of create call used:

int *create_kind;
MPI_Win_get_attr(win, MPI_WIN_CREATE_FLAVOR, &create_kind, &flag)

with possible values:

  • MPI_WIN_FLAVOR_CREATE if the window was created with MPI_Win_create ;
  • MPI_WIN_FLAVOR_ALLOCATE if the window was created with MPI_Win_allocate ;
  • MPI_WIN_FLAVOR_DYNAMIC if the window was created with MPI_Win_create_dynamic . In this case the base is MPI_BOTTOM and the size is zero;
  • MPI_WIN_FLAVOR_SHARED if the window was created with MPI_Win_allocate_shared .

The window model:

int *memory_model;
MPI_Win_get_attr(win, MPI_WIN_MODEL, &memory_model, &flag);

with possible values:

  • MPI_WIN_SEPARATE ,
  • MPI_WIN_UNIFIED ,

Get the group of processes associated with a window:

int MPI_Win_get_group(MPI_Win win, MPI_Group *group)

Window information objects:

int MPI_Win_set_info(MPI_Win win, MPI_Info info)
int MPI_Win_get_info(MPI_Win win, MPI_Info *info_used)

9.6 Assertions


The MPI_Win_fence call, as well as MPI_Win_start and such, take an argument through which assertions can be passed about the activity before, after, and during the epoch. The value zero is always allowed, but you can make your program more efficient by specifying one or more of the following, combined by bitwise OR in C/C++ or IOR in Fortran.

  • MPI_Win_start supports the option:

    • MPI_MODE_NOCHECK : the matching calls to MPI_Win_post have already completed on all target processes when the call to MPI_Win_start is made. The nocheck option can be specified in a start call if and only if it is specified in each matching post call. This is similar to the optimization of `ready-send' that may save a handshake when the handshake is implicit in the code. (However, ready-send is matched by a regular receive, whereas both start and post must specify the nocheck option.)
  • MPI_Win_post supports the following options:

    • MPI_MODE_NOCHECK : the matching calls to MPI_Win_start have not yet occurred on any origin processes when the call to MPI_Win_post is made. The nocheck option can be specified by a post call if and only if it is specified by each matching start call.
    • MPI_MODE_NOSTORE : the local window was not updated by local stores (or local get or receive calls) since the last synchronization. This may avoid the need for cache synchronization at the post call.
    • MPI_MODE_NOPUT : the local window will not be updated by put or accumulate calls after the post call, until the ensuing (wait) synchronization. This may avoid the need for cache synchronization at the wait call.
  • MPI_Win_fence supports the following options:

    • MPI_MODE_NOSTORE : the local window was not updated by local stores (or local get or receive calls) since the last synchronization.
    • MPI_MODE_NOPUT : the local window will not be updated by put or accumulate calls after the fence call, until the ensuing (fence) synchronization.
    • MPI_MODE_NOPRECEDE : the fence does not complete any sequence of locally issued RMA calls. If this assertion is given by any process in the window group, then it must be given by all processes in the group.
    • MPI_MODE_NOSUCCEED : the fence does not start any sequence of locally issued RMA calls. If the assertion is given by any process in the window group, then it must be given by all processes in the group.
  • MPI_Win_lock and MPI_Win_lock_all support the following option:

    • MPI_MODE_NOCHECK : no other process holds, or will attempt to acquire, a conflicting lock while the caller holds the window lock. This is useful when mutual exclusion is achieved by other means, but the coherence operations that may be attached to the lock and unlock calls are still required.

9.7 Implementation


You may wonder how one-sided communication is realized. (For more on this subject, see  [thakur:ijhpca-sync] .) Can a processor somehow get at another processor's data? Unfortunately, no.

Active target synchronization is implemented in terms of two-sided communication. Imagine that the first fence operation does nothing, unless it concludes prior one-sided operations. The Put and Get calls do nothing involving communication, except for marking with what processors they exchange data. The concluding fence is where everything happens: first a global operation determines which targets need to issue send or receive calls, then the actual sends and receives are executed.

Exercise

Assume that only Get operations are performed during an epoch. Sketch how these are translated to send/receive pairs. The problem here is how the senders find out that they need to send. Show that you can solve this with an MPI_Reduce_scatter call.

The previous paragraph noted that a collective operation was necessary to determine the two-sided traffic. Since collective operations induce some amount of synchronization, you may want to limit this.

Exercise

Argue that the mechanism with window post/wait/start/complete operations still needs a collective, but that this is less burdensome.

Passive target synchronization needs another mechanism entirely. Here the target process needs to have a background task (process, thread, daemon,…) running that listens for requests to lock the window. This can potentially be expensive.


9.8 Review questions


Find all the errors in this code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>


#define MASTER 0


int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  int r, p;
  MPI_Comm_rank(comm, &r);
  MPI_Comm_size(comm, &p);
  printf("Hello from %d\n", r);
  int result[1] = {0};
  //int assert = MPI_MODE_NOCHECK;
  int assert = 0;
  int one = 1;
  MPI_Win win_res;
  MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0], &win_res);
  MPI_Win_lock_all(assert, win_res);
  if (r == MASTER) {
    result[0] = 0;
    do{
      MPI_Fetch_and_op(&result, &result , MPI_INT, r, 0, MPI_NO_OP, win_res);
      printf("result: %d\n", result[0]);
    } while(result[0] != 4);
    printf("Master is done!\n");
  } else {
    MPI_Fetch_and_op(&one, &result, MPI_INT, 0, 0, MPI_SUM, win_res);
  }
  MPI_Win_unlock_all(win_res);
  MPI_Win_free(&win_res);
  MPI_Finalize();
  return 0;
}