A communicator describes a group of processes, but the structure of your computation may not be such that every process will communicate with every other process. For instance, in a computation that is mathematically defined on a Cartesian 2D grid, the processes themselves act as if they are two-dimensionally ordered and communicate with N/S/E/W neighbors. If MPI had this knowledge about your application, it could conceivably optimize for it, for instance by renumbering the ranks so that communicating processes are closer together physically in your cluster.
The mechanism to declare this structure of a computation to MPI is known as a virtual topology. The following types of topology are defined: Cartesian, graph, and distributed graph.
These values can be discovered with the routine

int MPI_Topo_test(MPI_Comm comm, int *status)

where status is one of MPI_UNDEFINED, MPI_CART, MPI_GRAPH, MPI_DIST_GRAPH.
Cartesian grid topology
A Cartesian grid is a structure, typically in 2 or 3 dimensions, of points that have two neighbors in each of the dimensions. Thus, if a Cartesian grid has sizes $K\times M\times N$, its points have coordinates $(k,m,n)$ with $0\leq k<K$, $0\leq m<M$, $0\leq n<N$.
The most common use of Cartesian coordinates is to find the rank of a process by referring to it in grid terms. For instance, one could ask 'what are my neighbors offset by $(1,0,0)$, $(-1,0,0)$, $(0,1,0)$, et cetera'.
While the Cartesian topology interface is fairly easy to use, as opposed to the more complicated general graph topology below, it is not actually sufficient for all Cartesian grid uses. Notably, in a so-called box stencil, such as the nine-point stencil, there are diagonal connections, which cannot be described in a single step. Instead, it is necessary to take a separate step along each coordinate dimension. In higher dimensions this is of course fairly awkward.
Thus, even for Cartesian structures, it may be advisable to use the general graph topology interface.
Cartesian routines
The Cartesian topology is specified by giving MPI_Cart_create the sizes of the processor grid along each axis, and whether the grid is periodic along that axis.
int MPI_Cart_create(
    MPI_Comm comm_old, int ndims, const int dims[], const int periods[],
    int reorder, MPI_Comm *comm_cart)
Each point in this new communicator has a coordinate and a rank. They can be queried with MPI_Cart_coords and MPI_Cart_rank respectively.
int MPI_Cart_coords(
    MPI_Comm comm, int rank, int maxdims, int coords[]);
int MPI_Cart_rank(
    MPI_Comm comm, const int coords[], int *rank);
Note that these routines can give the coordinates for any process, not just for the current process.
// cart.c
MPI_Comm comm2d;
ndim = 2;
periodic[0] = periodic[1] = 0;
dimensions[0] = idim; dimensions[1] = jdim;
MPI_Cart_create(comm,ndim,dimensions,periodic,1,&comm2d);
MPI_Cart_coords(comm2d,procno,ndim,coord_2d);
MPI_Cart_rank(comm2d,coord_2d,&rank_2d);
printf("I am %d: (%d,%d); originally %d\n",
       rank_2d,coord_2d[0],coord_2d[1],procno);
The reorder parameter to MPI_Cart_create indicates whether processes can have a rank in the new communicator that is different from in the old one.
The rank of a neighboring process can be found by shifting with MPI_Cart_shift. Strangely enough, you can only shift in one direction at a time: you cannot specify a shift vector.
int MPI_Cart_shift(MPI_Comm comm, int direction, int displ, int *source, int *dest)
If you specify a process outside the grid, the result is MPI_PROC_NULL.
char mychar = 65+procno;
MPI_Cart_shift(comm2d,0,+1,&rank_2d,&rank_right);
MPI_Cart_shift(comm2d,0,-1,&rank_2d,&rank_left);
MPI_Cart_shift(comm2d,1,+1,&rank_2d,&rank_up);
MPI_Cart_shift(comm2d,1,-1,&rank_2d,&rank_down);
int irequest = 0;
MPI_Request *requests = malloc(8*sizeof(MPI_Request));
MPI_Isend(&mychar,1,MPI_CHAR,rank_right,0,comm, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_left, 0,comm, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_up,   0,comm, requests+irequest++);
MPI_Isend(&mychar,1,MPI_CHAR,rank_down, 0,comm, requests+irequest++);
MPI_Irecv(indata+idata++,1,MPI_CHAR,rank_right,0,comm, requests+irequest++);
MPI_Irecv(indata+idata++,1,MPI_CHAR,rank_left, 0,comm, requests+irequest++);
MPI_Irecv(indata+idata++,1,MPI_CHAR,rank_up,   0,comm, requests+irequest++);
MPI_Irecv(indata+idata++,1,MPI_CHAR,rank_down, 0,comm, requests+irequest++);
Distributed graph topology
In many calculations on a grid (using the term in its mathematical, FEM, sense), a grid point will collect information from grid points around it. Under a sensible distribution of the grid over processes, this means that each process will collect information from a number of neighbor processes. The number of neighbors depends on the location of that process. For instance, in a 2D grid (and assuming a five-point stencil for the computation) most processes communicate with four neighbors; processes on the edge with three, and processes in the corners with two.
Such a topology is illustrated in figure 11.1 .
MPI's notion of neighborhood collectives offers an elegant way of expressing such communication structures. There are various reasons for using graph topologies over the older, simpler methods.
Thus, the minimal description of a process graph contains, for each process, the set of processes it communicates with.
However, this ignores that communication is not always symmetric: maybe the processes you receive from are not the ones you send to. Worse, maybe only one side of this duality is easily described. Therefore, there are two routines: MPI_Dist_graph_create_adjacent and MPI_Dist_graph_create.
Graph creation
There are two creation routines for process graphs. These routines are fairly general in that they allow any process to specify any part of the topology. In practice, of course, you will mostly let each process describe its own neighbor structure.
The routine MPI_Dist_graph_create_adjacent assumes that a process knows both whom it sends to, and who will send to it. This means that every edge in the communication graph is represented twice, so the memory footprint is double what is strictly necessary. However, no communication is needed to build the graph.
The second creation routine, MPI_Dist_graph_create, lets any process specify edges for any source process:

int MPI_Dist_graph_create(
    MPI_Comm comm_old, int n, const int sources[], const int degrees[],
    const int destinations[], const int weights[],
    MPI_Info info, int reorder, MPI_Comm *comm_dist_graph)

Input parameters:
comm_old : input communicator (handle)
n : number of source nodes for which this process specifies edges (non-negative integer)
sources : array containing the n source nodes for which this process specifies edges (array of non-negative integers)
degrees : array specifying the number of destinations for each source node in the source node array (array of non-negative integers)
destinations : destination nodes for the source nodes in the source node array (array of non-negative integers)
weights : weights for source to destination edges (array of non-negative integers or MPI_UNWEIGHTED)
info : hints on optimization and interpretation of weights (handle)
reorder : the process may be reordered (true) or not (false) (logical)

Output parameters:
comm_dist_graph : communicator with distributed graph topology added (handle)

Python:
MPI.Comm.Create_dist_graph(self, sources, degrees, destinations,
    weights=None, Info info=INFO_NULL, bool reorder=False)
returns graph communicator
MPL note: MPL only has a constructor corresponding to MPI_Dist_graph_create. End of MPL note
Figure 11.1 describes the common five-point stencil structure. If we let each process only describe itself, we get the following:
The resulting communicator has all the processes of the original communicator, with the same ranks. In other words, MPI_Comm_size and MPI_Comm_rank give the same values on the graph communicator as on the intra-communicator that it is constructed from. To get information about the grouping, use MPI_Dist_graph_neighbors and MPI_Dist_graph_neighbors_count; section 11.2.3.
Graph communicator creation is a method of the Comm class, and the graph communicator is a function return result:
graph_comm = oldcomm.Create_dist_graph(sources, degrees, destinations)
The weights, info, and reorder arguments have default values.
MPL note: The constructor

dist_graph_communicator(const communicator &old_comm, const source_set &ss,
    const dest_set &ds, bool reorder = true);

is a wrapper around MPI_Dist_graph_create_adjacent. End of MPL note
MPL note: The methods indegree and outdegree are wrappers around MPI_Dist_graph_neighbors_count. Sources and targets can be queried with inneighbors and outneighbors, which are wrappers around MPI_Dist_graph_neighbors. End of MPL note
Neighbor collectives
We can now use the graph topology to perform a gather or allgather over a process's neighbors: MPI_Neighbor_allgather.

int MPI_Neighbor_allgather(
    const void *sendbuf, int sendcount, MPI_Datatype sendtype,
    void *recvbuf, int recvcount, MPI_Datatype recvtype,
    MPI_Comm comm)

Input parameters:
sendbuf : starting address of the send buffer (choice)
sendcount : number of elements sent to each neighbor (non-negative integer)
sendtype : data type of send buffer elements (handle)
recvcount : number of elements received from each neighbor (non-negative integer)
recvtype : data type of receive buffer elements (handle)
comm : communicator (handle)

Output parameters:
recvbuf : starting address of the receive buffer (choice)
The neighbor collectives have the same argument list as the regular collectives, but they apply to a graph communicator.
Revisit exercise 4.1.2.3 and solve it using MPI_Dist_graph_create . Use figure 11.2 for inspiration.
Use a degree value of 1. \skeleton{rightgraph}
The previous exercise can be done with a degree value of 1, or with a higher degree.
In the latter case, results do not wind up in the receive buffer in order of increasing process number as with a traditional gather. Rather, you need to use MPI_Dist_graph_neighbors to find their sequencing; see section 11.2.3 .
Another neighbor collective is MPI_Neighbor_alltoall .
The vector variants are MPI_Neighbor_allgatherv and MPI_Neighbor_alltoallv .
There is a heterogeneous (multiple datatypes) variant: MPI_Neighbor_alltoallw.
The list is: MPI_Neighbor_allgather , MPI_Neighbor_allgatherv , MPI_Neighbor_alltoall , MPI_Neighbor_alltoallv , MPI_Neighbor_alltoallw .
Nonblocking: MPI_Ineighbor_allgather , MPI_Ineighbor_allgatherv , MPI_Ineighbor_alltoall , MPI_Ineighbor_alltoallv , MPI_Ineighbor_alltoallw .
For unclear reasons there is no MPI_Neighbor_allreduce .
Query
There are two routines for querying the neighbors of a process: MPI_Dist_graph_neighbors_count and MPI_Dist_graph_neighbors.
While this information seems derivable from the graph construction, that is not entirely true for two reasons. First, with the general creation routine MPI_Dist_graph_create, any process can specify edges for any other process, so a process does not necessarily know its own neighbors from its own input. Second, if reordering was allowed, the ranks in the graph communicator may differ from those used during its construction.
Graph topology (deprecated)
The original MPI-1 standard had a graph topology interface MPI_Graph_create, which required each process to specify the full process graph. Since this is not scalable, it should be considered deprecated. Use the distributed graph topology (section 11.2) instead.
Other legacy routines: MPI_Graph_neighbors , MPI_Graph_neighbors_count , MPI_Graph_get , MPI_Graphdims_get .