MPI topic: Shared memory

12.1 : Recognizing shared memory
12.2 : Shared memory for windows
12.2.1 : Pointers to a shared window
12.2.2 : Querying the shared structure
12.2.3 : Heat equation example
12.2.4 : Shared bulk data

12 MPI topic: Shared memory

Some programmers are under the impression that MPI would not be efficient on shared memory, since all operations are done through what looks like network calls. This is not correct: many MPI implementations have optimizations that detect shared memory and can exploit it, so that data is copied, rather than going through a communication layer. (Conversely, programming systems for shared memory such as OpenMP can actually have inefficiencies associated with thread handling.) The main inefficiency associated with using MPI on shared memory is then that processes can not actually share data.

The one-sided MPI calls (chapter MPI topic: One-sided communication ) can also be used to emulate shared memory, in the sense that an origin process can access data from a target process without the target's active involvement. However, these calls do not distinguish between actual shared memory and one-sided access across the network.

In this chapter we will look at the ways MPI can interact with the presence of actual shared memory. (This functionality was added in the MPI-3 standard.) This relies on the MPI_Win window concept, but otherwise uses direct access of other processes' memory.

12.1 Recognizing shared memory


MPI's one-sided routines take a very symmetric view of processes: each process can access the window of every other process (within a communicator). Of course, in practice there will be a difference in performance depending on whether the origin and target are actually on the same shared memory, or whether they can only communicate through the network. For this reason MPI makes it easy to group processes by shared memory domain, using MPI_Comm_split_type :

C:
int MPI_Comm_split_type(
  MPI_Comm comm, int split_type, int key,
  MPI_Info info, MPI_Comm *newcomm)

Fortran:
MPI_Comm_split_type(comm, split_type, key, info, newcomm, ierror)
TYPE(MPI_Comm), INTENT(IN) :: comm
INTEGER, INTENT(IN) :: split_type, key
TYPE(MPI_Info), INTENT(IN) :: info
TYPE(MPI_Comm), INTENT(OUT) :: newcomm
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

Python:
MPI.Comm.Split_type(
  self, int split_type, int key=0, Info info=INFO_NULL)

Here the split_type parameter has to be from the following (short) list:

  • MPI_COMM_TYPE_SHARED : split the communicator into subcommunicators of processes sharing a memory area.

    MPI 4 Standard only

  • MPI_COMM_TYPE_HW_GUIDED (MPI-4): split using an info value from MPI_Get_hw_resource_types .
  • MPI_COMM_TYPE_HW_UNGUIDED (MPI-4): similar to MPI_COMM_TYPE_HW_GUIDED , but the resulting communicators should be a strict subset of the original communicator. On processes where this condition cannot be fulfilled, MPI_COMM_NULL will be returned.

    End of MPI 4 note

MPL note

Similar to ordinary communicator splitting (section 7.4): communicator::split_shared .

End of MPL note

The following example splits MPI_COMM_WORLD into shared-memory subcommunicators and queries the size and rank within the resulting node communicator:

// commsplittype.c
MPI_Comm_split_type(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,
                    procno,MPI_INFO_NULL,&sharedcomm);
MPI_Comm_size(sharedcomm,&new_nprocs);
MPI_Comm_rank(sharedcomm,&new_procno);
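
For illustration, here is a minimal self-contained version of the above; it is a sketch, not the book's code, and names such as sharedcomm and onnode_procno are just the choices of this example:

// minimal sketch: split the world communicator by shared-memory domain
// and report the node-local rank of each process
#include <mpi.h>
#include <stdio.h>

int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);
  int procno,nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD,&procno);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);

  MPI_Comm sharedcomm;            // communicator of processes on the same node
  MPI_Comm_split_type(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,
                      procno,MPI_INFO_NULL,&sharedcomm);

  int onnode_procno,onnode_nprocs;
  MPI_Comm_rank(sharedcomm,&onnode_procno);
  MPI_Comm_size(sharedcomm,&onnode_nprocs);
  printf("global rank %d of %d is node-local rank %d of %d\n",
         procno,nprocs,onnode_procno,onnode_nprocs);

  MPI_Comm_free(&sharedcomm);
  MPI_Finalize();
  return 0;
}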

12.2 Shared memory for windows


Processes that exist on the same physical shared memory should be able to move data by copying, rather than through MPI send/receive calls -- which of course will do a copy operation under the hood. In order to do such user-level copying:

  1. We need to create a shared memory area with MPI_Win_allocate_shared , and
  2. We need to get pointers to where a process' area is in this shared space; this is done with MPI_Win_shared_query .

12.2.1 Pointers to a shared window


The first step is to create a window (in the sense of one-sided MPI; section 9.1 ) on the processes on one node. This is done with MPI_Win_allocate_shared :

Semantics:
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)

Input parameters:
size: size of local window in bytes (non-negative integer)
disp_unit: local unit size for displacements, in bytes (positive integer)
info: info argument (handle)
comm: intra-communicator (handle)

Output parameters:
baseptr: address of local allocated window segment (choice)
win: window object returned by the call (handle)

C:
int MPI_Win_allocate_shared
   (MPI_Aint size, int disp_unit, MPI_Info info,
    MPI_Comm comm, void *baseptr, MPI_Win *win)

Fortran:
MPI_Win_allocate_shared
   (size, disp_unit, info, comm, baseptr, win, ierror)
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
INTEGER, INTENT(IN) :: disp_unit
TYPE(MPI_Info), INTENT(IN) :: info
TYPE(MPI_Comm), INTENT(IN) :: comm
TYPE(C_PTR), INTENT(OUT) :: baseptr
TYPE(MPI_Win), INTENT(OUT) :: win
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

The MPI_Win_allocate_shared call will presumably put the memory close to the socket on which the process runs.

// sharedbulk.c
MPI_Aint window_size; double *window_data; MPI_Win node_window;
if (onnode_procid==0)
  window_size = sizeof(double);
else window_size = 0;
MPI_Win_allocate_shared
  ( window_size,sizeof(double),MPI_INFO_NULL,
    nodecomm,
    &window_data,&node_window);

The memory allocated by MPI_Win_allocate_shared is contiguous between the processes. This makes it possible to do address calculation. However, if a cluster node has a NUMA structure, for instance if two sockets each have memory directly attached to them, this would increase latency for some processes. To prevent this, the key alloc_shared_noncontig can be set to true in the MPI_Info object.

MPI 4 Standard only

In the contiguous case, the mpi_minimum_memory_alignment info argument (section 9.1.1 ) applies only to the memory on the first process; in the noncontiguous case it applies to all.

End of MPI 4 note

// numa.c
MPI_Info window_info;
MPI_Info_create(&window_info);
MPI_Info_set(window_info,"alloc_shared_noncontig","true");
MPI_Win_allocate_shared( window_size,sizeof(double),window_info,
                         nodecomm,
                         &window_data,&node_window);
MPI_Info_free(&window_info);

Let's now consider a scenario where you spawn two MPI processes per node, and the node has 100G of memory. Using the above option to allow for noncontiguous window allocation, you might hope that the windows of the two processes are placed 50G apart. However, if you print out the addresses, you will find that they are placed considerably closer together. For a small window that distance may be as little as 4K, the size of a small page .

The reason for this mismatch is that an address that you obtain with the ampersand operator in C is not a physical address , but a virtual address . The translation of where pages are placed in physical memory is determined by the page table .
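
You can observe this for yourself with the following sketch: each process allocates a noncontiguous window segment and prints the virtual address it gets back. The variable names (nodecomm, window_data, node_window) are this sketch's choices; the MPI calls are the ones discussed above.

// sketch: observe the virtual addresses of noncontiguously allocated segments
#include <mpi.h>
#include <stdio.h>

int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);
  int procno; MPI_Comm_rank(MPI_COMM_WORLD,&procno);

  // communicator of the processes on this node
  MPI_Comm nodecomm;
  MPI_Comm_split_type(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,
                      procno,MPI_INFO_NULL,&nodecomm);

  // ask for noncontiguous placement of the window segments
  MPI_Info window_info;
  MPI_Info_create(&window_info);
  MPI_Info_set(window_info,"alloc_shared_noncontig","true");

  double *window_data; MPI_Win node_window;
  MPI_Win_allocate_shared(sizeof(double),sizeof(double),window_info,
                          nodecomm,&window_data,&node_window);
  MPI_Info_free(&window_info);

  // these are virtual addresses: their spacing says nothing about
  // where the pages end up in physical memory
  int onnode_procno; MPI_Comm_rank(nodecomm,&onnode_procno);
  printf("node rank %d: window segment at %p\n",
         onnode_procno,(void*)window_data);

  MPI_Win_free(&node_window);
  MPI_Comm_free(&nodecomm);
  MPI_Finalize();
  return 0;
}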

12.2.2 Querying the shared structure


Even though the window created above is shared, that doesn't mean it's contiguous. Hence it is necessary to retrieve the pointer to the area of each process that you want to communicate with:

Semantics:
MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)

Input arguments:
win:  shared memory window object (handle)
rank: rank in the group of window win (non-negative integer)
      or MPI_PROC_NULL

Output arguments:
size: size of the window segment (non-negative integer)
disp_unit: local unit size for displacements,
           in bytes (positive integer)
baseptr: address for load/store access to window segment (choice)

C:
int MPI_Win_shared_query
   (MPI_Win win, int rank, MPI_Aint *size, int *disp_unit,
    void *baseptr)

Fortran:
MPI_Win_shared_query(win, rank, size, disp_unit, baseptr, ierror)
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
TYPE(MPI_Win), INTENT(IN) :: win
INTEGER, INTENT(IN) :: rank
INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: size
INTEGER, INTENT(OUT) :: disp_unit
TYPE(C_PTR), INTENT(OUT) :: baseptr
INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_Aint window_size0; int window_unit; double *win0_addr;
MPI_Win_shared_query( node_window,0,
                      &window_size0,&window_unit, &win0_addr );
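
If you need to address the segment of every process on the node, the same call can be made in a loop over the node-local ranks. The following fragment is only a sketch; it assumes that nodecomm and node_window are the node communicator and shared window created earlier.

// sketch: collect the segment pointers of all processes on the node
int onnode_nprocs;
MPI_Comm_size(nodecomm,&onnode_nprocs);
double **segment_addr =
  (double**) malloc( onnode_nprocs*sizeof(double*) );
for (int p=0; p<onnode_nprocs; p++) {
  MPI_Aint segment_size; int segment_unit;
  MPI_Win_shared_query( node_window,p,
                        &segment_size,&segment_unit,&segment_addr[p] );
}
// segment_addr[p] can now be dereferenced directly,
// subject to proper synchronization between the processes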

12.2.3 Heat equation example


As an example, consider the 1D heat equation. On each process we create a local area of three points:

// sharedshared.c
MPI_Win_allocate_shared( 3*sizeof(int),sizeof(int),info,
                         sharedcomm,&shared_baseptr,&shared_window);
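
How the stencil can then reach its halo values is sketched below. This fragment is not the book's code: it assumes the sharedcomm and shared_window of the line above, with three ints per process, and locates the relevant points in the neighboring processes' segments.

// sketch: find pointers to the halo points in the neighbors' segments
int onnode_procno,onnode_nprocs;
MPI_Comm_rank(sharedcomm,&onnode_procno);
MPI_Comm_size(sharedcomm,&onnode_nprocs);

MPI_Aint segment_size; int segment_unit;
int *left_ptr=NULL, *right_ptr=NULL;
if (onnode_procno>0) {                   // last point of the left neighbor
  MPI_Win_shared_query( shared_window,onnode_procno-1,
                        &segment_size,&segment_unit,&left_ptr );
  left_ptr += 2;
}
if (onnode_procno<onnode_nprocs-1) {     // first point of the right neighbor
  MPI_Win_shared_query( shared_window,onnode_procno+1,
                        &segment_size,&segment_unit,&right_ptr );
}
// after synchronizing (for instance with a barrier on sharedcomm)
// *left_ptr and *right_ptr can be read with ordinary loads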

12.2.4 Shared bulk data


In applications such as ray tracing , there is a large read-only data object (the objects in the scene to be rendered) that is needed by all processes. In traditional MPI, this would need to be stored redundantly on each process, which leads to large memory demands. With MPI shared memory we can store the data object once per node. Using MPI_Comm_split_type as above to find a communicator per NUMA domain, we store the object on process zero of this node communicator.
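
A sketch of this scheme, reusing the nodecomm, node_window, and window_data names of the sharedbulk.c fragment above; a single double stands in for the scene data, and the fence synchronization is one possible choice.

// sketch: store data once per node, let all node processes read it
int onnode_procno;
MPI_Comm_rank(nodecomm,&onnode_procno);

MPI_Win_fence(0,node_window);
if (onnode_procno==0)
  *window_data = 3.14;                   // stand-in for the bulk data
MPI_Win_fence(0,node_window);

// every process finds process zero's segment and reads from it
MPI_Aint data_size; int data_unit; double *data_ptr;
MPI_Win_shared_query( node_window,0,&data_size,&data_unit,&data_ptr );
printf("node rank %d reads value %e\n",onnode_procno,*data_ptr);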

Exercise

Let the `shared' data originate on process zero in MPI_COMM_WORLD . Then:

  • create a communicator per shared memory domain;
  • create a communicator for all the processes with number zero on their node;
  • broadcast the shared data to the processes zero on each node.

(A skeleton for this exercise is available under the name shareddata .)
