MPI topic: Topologies

11.1 : Cartesian grid topology
11.1.1 : Cartesian routines
11.2 : Distributed graph topology
11.2.1 : Graph creation
11.2.2 : Neighbor collectives
11.2.3 : Query
11.2.4 : Graph topology (deprecated)

11 MPI topic: Topologies

A communicator describes a group of processes, but the structure of your computation may not be such that every process will communicate with every other process. For instance, in a computation that is mathematically defined on a Cartesian 2D grid, the processes themselves act as if they are two-dimensionally ordered and communicate with N/S/E/W neighbors. If MPI had this knowledge about your application, it could conceivably optimize for it, for instance by renumbering the ranks so that communicating processes are closer together physically in your cluster.

The mechanism to declare this structure of a computation to MPI is known as a virtual topology . The following types of topology are defined:

  • MPI_UNDEFINED : this value holds for communicators where no topology has explicitly been specified.
  • MPI_CART : this value holds for Cartesian topologies, where processes act as if they are ordered in a multi-dimensional 'brick'; see section 11.1.
  • MPI_GRAPH : this value describes the graph topology that was defined in MPI-1; see section 11.2.4. It is unnecessarily burdensome, since each process needs to know the total graph, and should therefore be considered obsolete; the type MPI_DIST_GRAPH should be used instead.
  • MPI_DIST_GRAPH : this value describes the distributed graph topology, where each process only describes the edges in the process graph that touch itself; see section 11.2.

These values can be discovered with the routine MPI_Topo_test :

int MPI_Topo_test(MPI_Comm comm, int *status)

where status is one of:
MPI_UNDEFINED
MPI_CART
MPI_GRAPH
MPI_DIST_GRAPH
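
The following is a minimal sketch, not from the book's example codes, of how one might report the topology type; it assumes comm is any valid communicator.

// hypothetical example: query and report the topology type of comm
int topo_type;
MPI_Topo_test(comm,&topo_type);
if (topo_type==MPI_CART)
  printf("Cartesian topology\n");
else if (topo_type==MPI_DIST_GRAPH)
  printf("Distributed graph topology\n");
else if (topo_type==MPI_GRAPH)
  printf("Graph topology (deprecated)\n");
else // MPI_UNDEFINED
  printf("No topology attached\n");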

11.1 Cartesian grid topology


A Cartesian grid is a structure, typically in 2 or 3 dimensions, of points that have two neighbors in each of the dimensions. Thus, if a Cartesian grid has sizes $K\times M\times N$, its points have coordinates $(k,m,n)$ with $0\leq k<K$, $0\leq m<M$, and $0\leq n<N$. A grid in which the points on one boundary are also connected to the points on the opposite boundary through wraparound connections is called a periodic grid .

The most common use of Cartesian coordinates is to find the rank of a process by referring to it in grid terms. For instance, one could ask 'what are my neighbors offset by $(1,0,0)$, $(-1,0,0)$, $(0,1,0)$, et cetera'.

While the Cartesian topology interface is fairly easy to use, as opposed to the more complicated general graph topology below, it is not actually sufficient for all Cartesian grid uses. Notably, in a stencil with diagonal connections, such as the nine-point stencil , a diagonal neighbor cannot be reached in a single step: it is necessary to take a separate step along each coordinate dimension. In higher dimensions this is of course fairly awkward.

Thus, even for Cartesian structures, it may be advisable to use the general graph topology interface.

11.1.1 Cartesian routines


The Cartesian topology is specified by giving MPI_Cart_create the sizes of the processor grid along each axis, and whether the grid is periodic along that axis.

int MPI_Cart_create(
  MPI_Comm comm_old, int ndims, int *dims, int *periods,
  int reorder, MPI_Comm *comm_cart)

Each point in this new communicator has a coordinate and a rank. They can be queried with MPI_Cart_coords and MPI_Cart_rank respectively.

int MPI_Cart_coords(
  MPI_Comm comm, int rank, int maxdims,
  int *coords);
int MPI_Cart_rank(
  MPI_Comm comm, const int *coords,
  int *rank);

Note that these routines can give the coordinates for any process, not just for the current process.

// cart.c
// assumption: comm and procno (this process' rank) are set elsewhere,
// and idim*jdim equals the number of processes in comm
MPI_Comm comm2d;
int ndim, dimensions[2], periodic[2], coord_2d[2], rank_2d;
ndim = 2; periodic[0] = periodic[1] = 0;
dimensions[0] = idim; dimensions[1] = jdim;
MPI_Cart_create(comm,ndim,dimensions,periodic,/*reorder=*/1,&comm2d);
MPI_Cart_coords(comm2d,procno,ndim,coord_2d);
MPI_Cart_rank(comm2d,coord_2d,&rank_2d);
printf("I am %d: (%d,%d); originally %d\n",rank_2d,coord_2d[0],coord_2d[1],procno);

The reorder parameter to MPI_Cart_create indicates whether processes may get a rank in the new communicator that is different from their rank in the old one.

Strangely enough, you can only shift in one coordinate direction at a time; you cannot specify a shift vector.

int MPI_Cart_shift(MPI_Comm comm, int direction, int displ, int *source,
                  int *dest)

If the shift takes you outside the grid of a non-periodic dimension, the resulting rank is MPI_PROC_NULL ; sends and receives involving MPI_PROC_NULL succeed but have no effect.

// assumption: comm2d, procno, rank_2d are as in the previous fragment
char mychar = 65+procno, indata[4]; int idata = 0;
int rank_left,rank_right,rank_up,rank_down;
// the `source' output argument is not needed here; rank_2d is reused for it
MPI_Cart_shift(comm2d,0,+1,&rank_2d,&rank_right);
MPI_Cart_shift(comm2d,0,-1,&rank_2d,&rank_left);
MPI_Cart_shift(comm2d,1,+1,&rank_2d,&rank_up);
MPI_Cart_shift(comm2d,1,-1,&rank_2d,&rank_down);
int irequest = 0; MPI_Request *requests = malloc(8*sizeof(MPI_Request));
// on the boundary of the (non-periodic) grid some neighbors are
// MPI_PROC_NULL: those sends and receives are no-ops
MPI_Isend(&mychar,1,MPI_CHAR,rank_right, 0,comm2d, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_left,  0,comm2d, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_up,    0,comm2d, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_down,  0,comm2d, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_right, 0,comm2d, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_left,  0,comm2d, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_up,    0,comm2d, requests+irequest++);
MPI_Irecv( indata+idata++, 1,MPI_CHAR, rank_down,  0,comm2d, requests+irequest++);
MPI_Waitall(irequest,requests,MPI_STATUSES_IGNORE);

11.2 Distributed graph topology


Figure 11.1: Illustration of a distributed graph topology where each node has four neighbors

In many calculations on a grid (using the term in its mathematical, FEM, sense), a grid point will collect information from grid points around it. Under a sensible distribution of the grid over processes, this means that each process will collect information from a number of neighbor processes, where that number depends on the process. For instance, in a 2D grid, and assuming a five-point stencil for the computation, most processes communicate with four neighbors; processes on the edge with three, and processes in the corners with two.

Such a topology is illustrated in figure  11.1 .

MPI's notion of neighborhood collectives offers an elegant way of expressing such communication structures. There are various reasons for using graph topologies over the older, simpler methods.

  • MPI is allowed to reorder the processes, so that network proximity in the cluster corresponds to proximity in the structure of the code.
  • Ordinary collectives cannot directly be used for graph problems, unless one adopts a subcommunicator for each graph neighborhood; however, scheduling those collectives would then lead to deadlock or serialization.
  • The normal way of dealing with graph problems is through nonblocking communications. However, since the user indicates an explicit order in which they are posted, congestion at certain processes may occur.
  • Collectives can pipeline data, while send/receive operations need to transfer their data in its entirety.
  • Collectives can use spanning trees, while send/receive uses a direct connection.

Thus the minimal description of a process graph contains for each process:

  • Degree: the number of neighbor processes; and
  • the ranks of the processes to communicate with.

However, this ignores that communication is not always symmetric: maybe the processes you receive from are not the ones you send to. Worse, maybe only one side of this duality is easily described. Therefore, there are two routines:

  • MPI_Dist_graph_create_adjacent assumes that a process knows both whom it sends to, and who will send to it. This is the most work for the programmer to specify, but it is ultimately the most efficient.
  • MPI_Dist_graph_create specifies on each process only what it is the source for; that is, whom this process will be sending to. Consequently, some amount of processing -- including communication -- is needed to build the converse information, the ranks that will be sending to a process.

11.2.1 Graph creation


There are two creation routines for process graphs. These routines are fairly general in that they allow any process to specify any part of the topology. In practice, of course, you will mostly let each process describe its own neighbor structure.

The routine MPI_Dist_graph_create_adjacent assumes that a process knows both whom it sends to, and who will send to it. This means that every edge in the communication graph is represented twice, so the memory footprint is double what is strictly necessary. However, no communication is needed to build the graph.
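
As a minimal sketch of this routine (hypothetical, not taken from the book's example codes; it assumes that comm, procno, and nprocs are already set), each process could declare that it receives from its left neighbor and sends to its right neighbor:

// hypothetical sketch of MPI_Dist_graph_create_adjacent
int indegree=0, outdegree=0, sources[1], destinations[1];
if (procno>0)        sources[indegree++]       = procno-1; // I receive from the left
if (procno<nprocs-1) destinations[outdegree++] = procno+1; // I send to the right
MPI_Comm comm_adj;
MPI_Dist_graph_create_adjacent
  (comm,
   indegree, sources,      MPI_UNWEIGHTED,
   outdegree,destinations, MPI_UNWEIGHTED,
   MPI_INFO_NULL,/*reorder=*/1,&comm_adj);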

The second creation routine,

int MPI_Dist_graph_create
   (MPI_Comm comm_old, int n, const int sources[],
    const int degrees[], const int destinations[], const int weights[],
    MPI_Info info, int reorder,
    MPI_Comm *comm_dist_graph)

Input Parameters:
comm_old : input communicator (handle)
n : number of source nodes for which this process specifies edges (non-negative integer)
sources : array containing the n source nodes for which this process specifies edges (array of non-negative integers)
degrees : array specifying the number of destinations for each source node in the source node array (array of non-negative integers)
destinations : destination nodes for the source nodes in the source node array (array of non-negative integers)
weights : weights for source to destination edges (array of non-negative integers or MPI_UNWEIGHTED)
info : hints on optimization and interpretation of weights (handle)
reorder : the process may be reordered (true) or not (false) (logical)

Output Parameters:
comm_dist_graph : communicator with distributed graph topology added (handle)

Python:
MPI.Comm.Create_dist_graph
    (self, sources, degrees, destinations, weights=None, Info info=INFO_NULL, bool reorder=False)
returns graph communicator
This routine, MPI_Dist_graph_create , is probably easier to use, especially in cases where the communication structure of your program is symmetric, meaning that a process sends to the same neighbors that it receives from. You now specify on each process only what it is the source for; that is, whom this process will be sending to. (I disagree with this design decision. Specifying your sources is usually easier than specifying your destinations.) Consequently, some amount of processing -- including communication -- is needed to build the converse information, the ranks that will be sending to a process.

MPL note

The class mpl::dist_graph_communicator only has a constructor corresponding to MPI_Dist_graph_create . End of MPL note

Figure 11.1 describes the common five-point stencil structure. If we let each process only describe itself, we get the following (a code sketch follows the list):

  • nsources $=1$ because the calling process describes one node in the graph: itself.
  • sources is an array of length 1, containing the rank of the calling process.
  • degrees is an array of length 1, containing the degree (probably: 4) of this process.
  • destinations is an array of length the degree of this process, probably again 4. The elements of this array are the ranks of the neighbor nodes; strictly speaking the ones that this process will send to.
  • weights is an array declaring the relative importance of the destinations. For an unweighted graph use MPI_UNWEIGHTED . If the graph is weighted, but the degree of a source is zero, you can pass MPI_WEIGHTS_EMPTY as the (empty) weights array.
  • reorder ( int in C, LOGICAL in Fortran) indicates whether MPI is allowed to shuffle processes to achieve greater locality.
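
The following sketch is hypothetical, not from the book's example codes; it assumes variables comm, procno, rows, and cols describing a rows-by-cols process grid. Each process declares only itself, with its N/S/E/W neighbors as destinations:

// hypothetical sketch: each process describes itself and its four neighbors
int row = procno / cols, col = procno % cols;
int sources[1] = { procno };            // this process describes one node: itself
int degrees[1] = { 0 }, destinations[4];
if (row>0)      destinations[degrees[0]++] = procno-cols; // up
if (row<rows-1) destinations[degrees[0]++] = procno+cols; // down
if (col>0)      destinations[degrees[0]++] = procno-1;    // left
if (col<cols-1) destinations[degrees[0]++] = procno+1;    // right
MPI_Comm comm_dist;
MPI_Dist_graph_create
  (comm,/* n= */1,sources,degrees,destinations,
   MPI_UNWEIGHTED,MPI_INFO_NULL,/* reorder= */1,&comm_dist);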

The resulting communicator has all the processes of the original communicator, with the same ranks. In other words, MPI_Comm_size and MPI_Comm_rank give the same values on the graph communicator as on the intra-communicator that it is constructed from. To get information about the grouping, use MPI_Dist_graph_neighbors and MPI_Dist_graph_neighbors_count ; see section 11.2.3.

Python note

Graph communicator creation is a method of the Comm class, and the graph communicator is the function's return result:

graph_comm = oldcomm.Create_dist_graph(sources, degrees, destinations)

The weights, info, and reorder arguments have default values.

MPL note

The constructor dist_graph_communicator

dist_graph_communicator
   (const communicator &old_comm, const source_set &ss,
    const dest_set &ds, bool reorder = true);

is a wrapper around MPI_Dist_graph_create_adjacent . End of MPL note

MPL note

Methods indegree and outdegree are wrappers around MPI_Dist_graph_neighbors_count . Sources and targets can be queried with inneighbors and outneighbors , which are wrappers around MPI_Dist_graph_neighbors . End of MPL note

11.2.2 Neighbor collectives


We can now use the graph topology to perform a gather or allgather with MPI_Neighbor_allgather , which combines only the processes directly connected to the calling process.

Synopsis

int MPI_Neighbor_allgather
   (const void *sendbuf, int sendcount,MPI_Datatype sendtype,
    void *recvbuf, int recvcount, MPI_Datatype recvtype,
    MPI_Comm comm)

Input Parameters:
sendbuf : starting address of the send buffer (choice)
sendcount : number of elements sent to each neighbor (non-negative integer)
sendtype : data type of send buffer elements (handle)
recvcount : number of elements received from each neighbor (non-negative integer)
recvtype : data type of receive buffer elements (handle)
comm : communicator (handle)

Output Parameters
recvbuf : starting address of the receive buffer (choice)

The neighbor collectives have the same argument list as the regular collectives, but they apply to a graph communicator.
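
For instance, here is a minimal sketch (hypothetical, assuming comm_dist is a distributed graph communicator such as the one created above) in which every process gathers the ranks of the processes it receives from:

// hypothetical sketch of MPI_Neighbor_allgather on a graph communicator
int indegree,outdegree,weighted;
MPI_Dist_graph_neighbors_count(comm_dist,&indegree,&outdegree,&weighted);
int myrank;
MPI_Comm_rank(comm_dist,&myrank);
int *neighbor_ranks = (int*) malloc(indegree*sizeof(int));
// send my rank to all out-neighbors; receive one integer from each in-neighbor
MPI_Neighbor_allgather
  (&myrank,1,MPI_INT, neighbor_ranks,1,MPI_INT, comm_dist);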

Figure 11.2: Solving the right-send exercise with neighborhood collectives

Exercise

Revisit exercise  4.1.2.3 and solve it using MPI_Dist_graph_create . Use figure  11.2 for inspiration.

Use a degree value of 1. (There is a skeleton code for this exercise under the name rightgraph.)

The previous exercise can be done with a degree value of:

  • 1, reflecting that each process communicates with just 1 other; or
  • 2, reflecting that you really gather from two processes.

In the latter case, results do not wind up in the receive buffer in order of increasing process number as with a traditional gather. Rather, you need to use  MPI_Dist_graph_neighbors to find their sequencing; see section  11.2.3 .

Another neighbor collective is MPI_Neighbor_alltoall .

The vector variants are MPI_Neighbor_allgatherv and MPI_Neighbor_alltoallv .

There is a heterogeneous (multiple datatypes) variant: MPI_Neighbor_alltoallw .

The list is: MPI_Neighbor_allgather , MPI_Neighbor_allgatherv , MPI_Neighbor_alltoall , MPI_Neighbor_alltoallv , MPI_Neighbor_alltoallw .

Nonblocking: MPI_Ineighbor_allgather , MPI_Ineighbor_allgatherv , MPI_Ineighbor_alltoall , MPI_Ineighbor_alltoallv , MPI_Ineighbor_alltoallw .

For unclear reasons there is no MPI_Neighbor_allreduce .

11.2.3 Query


There are two routines for querying the neighbors of a process: MPI_Dist_graph_neighbors_count and MPI_Dist_graph_neighbors .

While this information seems derivable from the graph construction, that is not entirely true for two reasons.

  1. With the non-adjacent version MPI_Dist_graph_create , only outdegrees and destinations are specified; this call then supplies the indegrees and sources;
  2. As observed above, the order in which data is placed in the receive buffer of a gather call is not determined by the create call, but can only be queried this way.
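
A minimal sketch of the two query calls (hypothetical; comm_dist is assumed to be a distributed graph communicator, and the weight information is not requested):

// hypothetical sketch of the two query routines
int indegree,outdegree,weighted;
MPI_Dist_graph_neighbors_count(comm_dist,&indegree,&outdegree,&weighted);
int *sources      = (int*) malloc(indegree *sizeof(int)),
    *destinations = (int*) malloc(outdegree*sizeof(int));
MPI_Dist_graph_neighbors
  (comm_dist,
   indegree, sources,      MPI_UNWEIGHTED,
   outdegree,destinations, MPI_UNWEIGHTED);
// `sources' now contains the ranks this process receives from, in the order
// used by the neighbor collectives; `destinations' the ranks it sends to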

11.2.4 Graph topology (deprecated)


The original MPI-1 standard had a graph topology interface, MPI_Graph_create , which required each process to specify the full process graph. Since this is not scalable, it should be considered deprecated; use the distributed graph topology (section 11.2) instead.

Other legacy routines: MPI_Graph_neighbors , MPI_Graph_neighbors_count , MPI_Graph_get , MPI_Graphdims_get .
