MPI topic: File I/O

10.1 : File handling
10.2 : File reading and writing
10.2.1 : Nonblocking read/write
10.2.2 : Individual file pointers, contiguous writes
10.2.3 : File views
10.2.4 : Shared file pointers
10.3 : Consistency
10.4 : Constants
10.5 : Review questions

10 MPI topic: File I/O

This chapter discusses the I/O support of MPI, which is intended to alleviate the problems inherent in parallel file access. Let us first explore the issues. This story partly depends on what sort of parallel computer you are running on.

  • On networks of workstations each node will have a separate drive with its own file system.
  • On many clusters there will be a shared file system that acts as if every process can access every file.
  • Cluster nodes may or may not have a private file system.

Based on this, the following strategies are possible, even before we start talking about MPI I/O.

  • One process can collect all data with MPI_Gather and write it out. There are at least three things wrong with this: it uses network bandwidth for the gather, it may require a large amount of memory on the root process, and centralized writing is a bottleneck.
  • Absent a shared file system, writing can be parallelized by letting every process create a unique file and merge these after the run. This makes the I/O symmetric, but collecting all the files is a bottleneck.
  • Even with a shared file system this approach is possible, but it can put a lot of strain on the file system, and the post-processing can be a significant task.
  • Using a shared file system, there is nothing against every process opening the same existing file for reading, and using an individual file pointer to get its unique data.
  • …but having every process open the same file for output is probably not a good idea. For instance, if two processes try to write at the end of the file, you may need to synchronize them, and synchronize the file system flushes.

For these reasons, MPI has a number of routines that make it possible to read and write a single file from a large number of processes, giving each process a well-defined location to access the data. In fact, MPI-IO uses MPI derived datatypes for both the source data (that is, in memory) and target data (that is, on disk). Thus, in one call that is collective on a communicator, each process can address data that is not contiguous in memory, and place it in locations that are not contiguous on disk.

There are dedicated libraries for file I/O, such as hdf5, netcdf, or silo. However, these often add header information to a file that may not be understandable to post-processing applications. With MPI I/O you are in complete control of what goes to the file. (A useful tool for viewing your file is the unix utility od.)

TACC note

Each node has a private file system (typically flash storage), to which you can write files. Considerations:

  • Since these drives are separate from the shared file system, you don't have to worry about stress on the file servers.
  • These temporary file systems are wiped after your job finishes, so you have to do the post-processing in your job script.
  • The capacity of these local drives is fairly limited; see the userguide for exact numbers.

10.1 File handling


MPI has its own file handle: MPI_File.

You open a file with

Semantics:
MPI_FILE_OPEN(comm, filename, amode, info, fh)
IN comm: communicator (handle)
IN filename: name of file to open (string)
IN amode: file access mode (integer)
IN info: info object (handle)
OUT fh: new file handle (handle)

C:
int MPI_File_open
    (MPI_Comm comm, char *filename, int amode,
     MPI_Info info, MPI_File *fh)

Fortran:
MPI_FILE_OPEN(COMM, FILENAME, AMODE, INFO, FH, IERROR)
CHARACTER*(*) FILENAME
INTEGER COMM, AMODE, INFO, FH, IERROR

Python:
Open(type cls, Intracomm comm, filename,
     int amode=MODE_RDONLY, Info info=INFO_NULL)
MPI_File_open. This routine is collective, even if only certain processes will access the file with a read or write call. Similarly, MPI_File_close is collective.

Python note

Note the slightly unusual syntax for opening a file:

mpifile = MPI.File.Open(comm,filename,mode)

Even though the file is opened on a communicator, it is a class method of the MPI.File class, rather than a method of the communicator object; the latter is passed in as an argument.

File access modes:

  • MPI_MODE_RDONLY : read only,
  • MPI_MODE_RDWR : reading and writing,
  • MPI_MODE_WRONLY : write only,
  • MPI_MODE_CREATE : create the file if it does not exist,
  • MPI_MODE_EXCL : error if creating file that already exists,
  • MPI_MODE_DELETE_ON_CLOSE : delete file on close,
  • MPI_MODE_UNIQUE_OPEN : file will not be concurrently opened elsewhere,
  • MPI_MODE_SEQUENTIAL : file will only be accessed sequentially,
  • MPI_MODE_APPEND : set initial position of all file pointers to end of file.

These modes can be added or bitwise-or'ed.

You can delete a file with MPI_File_delete.

Buffers can be flushed with MPI_File_sync, which is a collective call.
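
A minimal sketch of the open/close sequence, assuming we want to create a file output.dat (the name is arbitrary) for writing on all processes:

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  MPI_File fh;
  // access modes are combined with bitwise-or:
  MPI_File_open(MPI_COMM_WORLD, "output.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY,
                MPI_INFO_NULL, &fh);

  /* ... individual or collective read/write calls go here ... */

  MPI_File_close(&fh);   // collective, just like the open call
  MPI_Finalize();
  return 0;
}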

10.2 File reading and writing


The basic file operations, in between the open and close calls, are the following POSIX-like, noncollective calls; a sketch combining them follows the list.

  • MPI_File_seek - Updates individual file pointers (noncollective)
    
    C:
    #include <mpi.h>
    int MPI_File_seek(MPI_File fh, MPI_Offset offset,int whence)
    
    Fortran 2008:
    USE mpi_f08
    MPI_File_seek(fh, offset, whence, ierror)
        TYPE(MPI_File), INTENT(IN) :: fh
        INTEGER(KIND=MPI_OFFSET_KIND), INTENT(IN) :: offset
        INTEGER, INTENT(IN) :: whence
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror
    
    Fortran 90:
    USE MPI
    ! or the older form: INCLUDE 'mpif.h'
    MPI_FILE_SEEK(FH, OFFSET, WHENCE, IERROR)
        INTEGER    FH, WHENCE, IERROR
        INTEGER(KIND=MPI_OFFSET_KIND)    OFFSET
    
    Input parameters:
    fh     : File handle (handle).
    offset : File offset (integer).
    whence : Update mode (integer).
    
    Output parameters:
    IERROR : Fortran only: Error status (integer)
    
    MPI_File_seek. The whence parameter can be:

    • MPI_SEEK_SET The pointer is set to offset.
    • MPI_SEEK_CUR The pointer is set to the current pointer position plus offset.
    • MPI_SEEK_END The pointer is set to the end of the file plus offset.
  • Synopsis:
    write at current file pointer
    MPI_File_write : noncollective
    MPI_File_write_all : collective
    
    C Syntax
    #include <mpi.h>
    int MPI_File_write(MPI_File fh, const void *buf,
        int count, MPI_Datatype datatype,
        MPI_Status *status)
    int MPI_File_write_all(MPI_File fh, const void *buf,
        int count, MPI_Datatype datatype,
        MPI_Status *status)
    
    Input parameters:
    buf : Initial address of buffer (choice).
    count : Number of elements in buffer (integer).
    datatype : Data type of each buffer element (handle).
    
    Output parameters:
    status : Status object (status).
    IERROR : Fortran only: Error status (integer).
    
    USE mpi_f08
    MPI_File_write(fh, buf, count, datatype, status, ierror)
    MPI_File_write_all(fh, buf, count, datatype, status, ierror)
        TYPE(MPI_File), INTENT(IN) :: fh
        TYPE(*), DIMENSION(..), INTENT(IN) :: buf
        INTEGER, INTENT(IN) :: count
        TYPE(MPI_Datatype), INTENT(IN) :: datatype
        TYPE(MPI_Status) :: status
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror
    
    USE MPI
    ! or the older form: INCLUDE 'mpif.h'
    MPI_FILE_WRITE(FH, BUF, COUNT,
        DATATYPE, STATUS, IERROR)
            BUF(*)
        INTEGER    FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
    
    
    MPI_File_write. This routine writes the specified data in the locations given by the current file view. The number of items written is returned in the MPI_Status argument; all other fields of this argument are undefined. It cannot be used if the file was opened with MPI_MODE_SEQUENTIAL.
  • If all processes execute a write at the same logical time, it is better to use the collective call MPI_File_write_all.
  • Synopsis:
    Reads a file starting at the location specified by the
    individual file pointer (blocking, noncollective).
    
    C Syntax
    #include <mpi.h>
    int MPI_File_read(MPI_File fh, void *buf,
        int count, MPI_Datatype datatype, MPI_Status *status)
    
    USE mpi_f08
    MPI_File_read(fh, buf, count, datatype, status, ierror)
        TYPE(MPI_File), INTENT(IN) :: fh
        TYPE(*), DIMENSION(..) :: buf
        INTEGER, INTENT(IN) :: count
        TYPE(MPI_Datatype), INTENT(IN) :: datatype
        TYPE(MPI_Status) :: status
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror
    
    USE MPI
    ! or the older form: INCLUDE 'mpif.h'
    MPI_FILE_READ(FH, BUF, COUNT,
        DATATYPE, STATUS, IERROR)
            BUF(*)
        INTEGER    FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE),IERROR
    
    Input parameters:
    fh : File handle (handle).
    count : Number of elements in buffer (integer).
    datatype : Data type of each buffer element (handle).
    
    Output parameters:
    buf : Initial address of buffer (choice).
    status : Status object (status).
    IERROR : Fortran only: Error status (integer).
    
    MPI_File_read. This routine attempts to read the specified data from the locations specified in the current file view. The number of items read is returned in the MPI_Status argument; all other fields of this argument are undefined. It cannot be used if the file was opened with MPI_MODE_SEQUENTIAL.
  • If all processes execute a read at the same logical time, it is better to use the collective call MPI_File_read_all.
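
As a minimal sketch of this pattern, assume a file handle fh that was opened with MPI_MODE_RDWR on MPI_COMM_WORLD; each process positions its individual file pointer at a rank-dependent byte offset, writes one integer, and reads it back:

int rank, value, readback;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
value = rank;

MPI_Offset offset = (MPI_Offset)rank * sizeof(int);

MPI_File_seek(fh, offset, MPI_SEEK_SET);            // position the individual pointer
MPI_File_write(fh, &value, 1, MPI_INT, &status);    // the write advances the pointer by one int

MPI_File_seek(fh, offset, MPI_SEEK_SET);            // go back to the same location
MPI_File_read(fh, &readback, 1, MPI_INT, &status);  // read the value just written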

For thread safety it is good to combine seek and read/write operations:

  • MPI_File_read_at : combine read and seek. The collective variant is MPI_File_read_at_all.
  • MPI_File_write_at : combine write and seek. The collective variant is MPI_File_write_at_all; see section 10.2.2.

Writing to and reading from a parallel file is rather similar to sending and receiving:

  • The process uses an elementary data type or a derived datatype to describe what elements in an array go to file, or are read from file.
  • In the simplest case, you read or write that data to the file at an explicit offset, or after first doing a seek operation.
  • But you can also set a 'file view' to describe explicitly what elements in the file will be involved.

10.2.1 Nonblocking read/write


Just like there are blocking and nonblocking sends, there are also nonblocking writes and reads:

Synopsis
nonblocking write using individual file pointer
MPI_File_iwrite: non-collective

C syntax:
int MPI_File_iwrite(MPI_File fh,
    const void *buf, int count, MPI_Datatype datatype,
    MPI_Request *request)

Input parameters:
fh  : File handle (handle).
buf : Initial address of buffer (choice).
count : Number of elements in buffer (integer).
datatype : Data type of each buffer element (handle).

Output parameters:
request : Request object (handle).
IERROR : Fortran only: Error status (integer).
MPI_File_iwrite and MPI_File_iread operations, and their collective versions MPI_File_iwrite_all and MPI_File_iread_all.

There are also MPI_File_iwrite_at_all and MPI_File_iread_at_all.

These routines output an MPI_Request object, which can then be tested with MPI_Wait or MPI_Test.
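
A minimal sketch, assuming a file handle fh already opened for writing: the write is started, some unrelated work is done, and completion is enforced with MPI_Wait.

double buffer[100];
MPI_Request request;
MPI_Status  status;

for (int i = 0; i < 100; i++)
  buffer[i] = 1.0 * i;

MPI_File_iwrite(fh, buffer, 100, MPI_DOUBLE, &request);   // start the write

/* ... computation that does not touch `buffer` ... */

MPI_Wait(&request, &status);    // the write is only guaranteed complete here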

Nonblocking collective I/O functions much like other nonblocking collectives: the request is satisfied when all processes finish the collective.

There are also split collective routines that function like nonblocking collective I/O, but without the request/wait mechanism: MPI_File_write_all_begin / MPI_File_write_all_end (and similarly MPI_File_read_all_begin / MPI_File_read_all_end), where the second routine blocks until the collective write/read has been concluded.
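
A sketch of the split-collective form, under the assumption that fh was opened collectively and that buffer contains count doubles on each process:

MPI_Status status;

MPI_File_write_all_begin(fh, buffer, count, MPI_DOUBLE);

/* ... other work; the buffer may not be modified in between ... */

MPI_File_write_all_end(fh, buffer, &status);   // blocks until the collective write is done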

There are also the shared file pointer variants MPI_File_iread_shared and MPI_File_iwrite_shared.

10.2.2 Individual file pointers, contiguous writes


After the collective open call, each process holds an individual file pointer, and each process can individually position this pointer somewhere in the shared file. Let's explore this modality.

The simplest way of writing data to a file is much like a send call: a buffer is specified with the usual count/datatype specification, and a target location in the file is given. The routine

MPI_File_write_at(fh,offset,buf,count,datatype)

Semantics:
Input Parameters
fh : File handle (handle).
offset : File offset (integer).
buf : Initial address of buffer (choice).
count : Number of elements in buffer (integer).
datatype : Data type of each buffer element (handle).

Output Parameters:
status : Status object (status).

C:
int MPI_File_write_at
   (MPI_File fh, MPI_Offset offset, const void *buf,
    int count, MPI_Datatype datatype, MPI_Status *status)

Fortran:
MPI_FILE_WRITE_AT
   (FH,  OFFSET,  BUF, COUNT, DATATYPE, STATUS,  IERROR)
    BUF(*)
INTEGER :: FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
INTEGER(KIND=MPI_OFFSET_KIND) :: OFFSET

Python:
MPI.File.Write_at(self, Offset offset, buf, Status status=None)
MPI_File_write_at gives this location in absolute terms, with a parameter of type MPI_Offset, which counts bytes.
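
For instance, in this sketch (assuming an open file handle fh; variable names are only for illustration) each process writes a single double at a byte offset proportional to its rank:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

double value = 1.0 * rank;
MPI_Offset offset = (MPI_Offset)rank * sizeof(double);
MPI_Status status;

MPI_File_write_at(fh, offset, &value, 1, MPI_DOUBLE, &status);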

FIGURE 10.1: Writing at an offset

Exercise

Create a buffer of length nwords=3 on each process, and write these buffers as a sequence to one file with MPI_File_write_at. (Skeleton code: blockwrite.)

Instead of giving the position in the file explicitly, you can also use an MPI_File_seek call to position the file pointer, and write with MPI_File_write at the pointer location. The write call itself also advances the file pointer, so consecutive writes of contiguous elements need no intermediate seek calls with MPI_SEEK_CUR.

Exercise

Rewrite the code of exercise  10.1 to use a loop where each iteration writes only one item to file. Note that no explicit advance of the file pointer is needed.

Exercise

Construct a file with the consecutive integers $0,…,WP-1$, where $W$ is some integer and $P$ the number of processes. Each process $p$ writes the numbers $p,p+P,p+2P,…$. Use a loop where each iteration

  1. writes a single number with MPI_File_write , and
  2. advances the file pointer with MPI_File_seek with a whence parameter of MPI_SEEK_CUR.

10.2.3 File views


The previous mode of writing is enough for writing simple contiguous blocks in the file. However, you can also access noncontiguous areas in the file. For this you use

Semantics:
MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info)
INOUT fh: file handle (handle)
IN disp: displacement (integer)
IN etype: elementary datatype (handle)
IN filetype: filetype (handle)
IN datarep: data representation (string)
IN info: info object (handle)

C:
int MPI_File_set_view
   (MPI_File fh,
    MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype,
    char *datarep, MPI_Info info)

Fortran:
MPI_FILE_SET_VIEW(FH, DISP, ETYPE, FILETYPE, DATAREP, INFO, IERROR)
INTEGER FH, ETYPE, FILETYPE, INFO, IERROR
CHARACTER*(*) DATAREP
INTEGER(KIND=MPI_OFFSET_KIND) DISP

Python:
mpifile = MPI.File.Open( .... )
mpifile.Set_view
  (self,
   Offset disp=0, Datatype etype=None, Datatype filetype=None,
   datarep=None, Info info=INFO_NULL)
MPI_File_set_view. This call is collective, even if not all processes access the file.

  • The disp displacement parameter is measured in bytes. It can differ between processes. On sequential files such as tapes or network streams it does not make sense to set a displacement; for those, the MPI_DISPLACEMENT_CURRENT value can be used.
  • The etype describes the data type of the file; it needs to be the same on all processes.
  • The filetype describes how this process sees the file, so it can differ between processes.
  • The datarep string can have the following values:

    • native : data on disk is represented in exactly the same format as in memory;
    • internal : data on disk is represented in whatever internal format is used by the MPI implementation;
    • external32 : data on disk is represented using portable (XDR-like) data formats.
  • The info parameter is an MPI_Info object, or MPI_INFO_NULL. See section 14.1.1.3 for more. (See T3PIO [t3pio-git] for a tool that assists in setting this object.)
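
As a minimal sketch (assuming a file handle fh and a per-process buffer of blocksize integers), each process sets a view that starts at its own byte displacement and then writes its block with a collective call:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Offset disp = (MPI_Offset)rank * blocksize * sizeof(int);

MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write_all(fh, buffer, blocksize, MPI_INT, MPI_STATUS_IGNORE);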

FIGURE 10.2: Writing at a view

Exercise

(Skeleton code: viewwrite.) Write a file in the same way as in exercise 10.1, but now use MPI_File_write and use MPI_File_set_view to set a view that determines where the data is written.

You can get very creative effects by setting the view to a derived datatype.

FIGURE 10.3: Writing at a derived type
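
For instance, a strided filetype gives a round-robin layout in the file. The following sketch (assuming variables rank, nprocs, nblocks, a buffer of nblocks integers, and an open file handle fh) lets process p own the file elements p, p+nprocs, p+2*nprocs, and so on:

MPI_Datatype filetype;
MPI_Type_vector(nblocks, 1, nprocs, MPI_INT, &filetype);   // nblocks single ints, stride nprocs
MPI_Type_commit(&filetype);

MPI_Offset disp = (MPI_Offset)rank * sizeof(int);          // each rank starts one int further in
MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

MPI_File_write_all(fh, buffer, nblocks, MPI_INT, MPI_STATUS_IGNORE);

MPI_Type_free(&filetype);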

Fortran note

In Fortran you have to ensure that the displacement parameter is of 'kind' MPI_OFFSET_KIND. In particular, you cannot specify a literal zero '0' as the displacement; use 0_MPI_OFFSET_KIND instead.

Related routines: MPI_File_set_size, MPI_File_get_size, MPI_File_preallocate, MPI_File_get_view.

10.2.4 Shared file pointers


It is possible to have a file pointer that is shared (and therefore identical) between all processes of the communicator that was used to open the file. This file pointer is set with MPI_File_seek_shared. For reading and writing there are then two sets of routines:

  • Individual accesses are done with MPI_File_read_shared and MPI_File_write_shared. Nonblocking variants are MPI_File_iread_shared and MPI_File_iwrite_shared.
  • Collective accesses are done with MPI_File_read_ordered and MPI_File_write_ordered, which execute the operations in order ascending by rank.

Shared file pointers require that the same view is used on all processes. Also, these operations are less efficient because of the need to maintain the shared pointer.
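
A sketch of rank-ordered output through the shared file pointer, assuming an open file handle fh, the process rank in rank, and stdio.h included for snprintf:

char record[64];
int  len = snprintf(record, sizeof(record), "result from rank %d\n", rank);

// the records appear in the file in order of ascending rank
MPI_File_write_ordered(fh, record, len, MPI_CHAR, MPI_STATUS_IGNORE);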

10.3 Consistency


It is possible for one process to read data previously written by another process. For this it is of course necessary to impose a temporal order, for instance by using MPI_Barrier, or by using a zero-byte send from the writing process to the reading process.

However, the file also needs to be declared atomic, with MPI_File_set_atomicity.
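
A sketch of the pattern, assuming a file handle fh opened with MPI_MODE_RDWR on MPI_COMM_WORLD and a buffer data of count integers: process 0 writes, and process 1 reads the same location after a barrier.

MPI_File_set_atomicity(fh, 1);    // collective: switch the file to atomic mode

if (rank == 0)
  MPI_File_write_at(fh, 0, data, count, MPI_INT, MPI_STATUS_IGNORE);

MPI_Barrier(MPI_COMM_WORLD);      // impose the temporal order: writer before reader

if (rank == 1)
  MPI_File_read_at(fh, 0, data, count, MPI_INT, MPI_STATUS_IGNORE);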

10.4 Constants


MPI_SEEK_SET used to be called SEEK_SET, which gave conflicts with the C++ library. This had to be circumvented with

make CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK -DMPICH_SKIP_MPICXX"

and such.


10.5 Review questions


Exercise

T/F? After your SLURM job ends, you can copy from the login node the files you've written to /tmp.

Exercise

T/F? File views (MPI_File_set_view) are intended to

  • write MPI derived types to file; without them you can only write contiguous buffers;
  • prevent collisions in collective writes; they are not needed for individual writes.

Exercise

The sequence MPI_File_seek_shared, MPI_File_read_shared can be replaced by MPI_File_seek, MPI_File_read if you make what changes?
