A brief introduction to the theory of the Integrative Model for Parallelism



Distributed data



For purposes of discussion, by object we mean an array, and so an object can be indexed by an integer that runs $0\ldots N-1$.

A distribution is then a mapping from processors to sets of indices, that is, a function \[ d\colon P\rightarrow 2^N \]
(where $P$ is the set of processors and $2^N$ is the usual shorthand for the collection of all subsets of $\{0,\ldots,N-1\}$).
Note that mapping processors to data, rather than the other way around, easily allows redundant assignment of data
and, later, redundant replication of work.
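
As a concrete illustration, here is a minimal Python sketch (not the IMP system's actual interface; the name block_distribution is invented for this example) of a block distribution of $N=8$ indices over $P=2$ processors:

    # A distribution maps each processor p to a set of indices in 0..N-1.
    def block_distribution(N, P):
        """Illustrative helper: processor -> contiguous block of indices."""
        block = (N + P - 1) // P  # ceiling division
        return {p: set(range(p * block, min((p + 1) * block, N)))
                for p in range(P)}

    alpha = block_distribution(8, 2)
    # alpha == {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
    # Since the mapping runs processor -> indices, nothing prevents
    # redundancy: alpha[0].add(4) would assign index 4 to both processors.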

Data parallel work



A data parallel computation $y=f(x)$, where $x,y$ are distributed objects as just defined,
is a computation of the form \[ \forall_i\colon y_i=f(x_{i_1},\ldots,x_{i_k}) \] for some $k$ and indices $i_1,\ldots,i_k$, both possibly dependent on $i$.

We define the signature function of a data parallel computation as the function
\[ \sigma_f(i) = \{ i_1,\ldots,i_k \} \]
describing, for each output index $i$, which input indices are needed.
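
For instance, a three-point stencil $y_i=f(x_{i-1},x_i,x_{i+1})$ has signature function $\sigma_f(i)=\{i-1,i,i+1\}$, clipped at the array boundaries. Continuing the illustrative Python sketch:

    # Signature function of a three-point stencil, clipped to 0..N-1.
    def sigma_stencil(i, N):
        """Input indices needed to compute output index i."""
        return {j for j in (i - 1, i, i + 1) if 0 <= j < N}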

Beta distribution



Now let $x,y$ have distributions $\alpha,\gamma$ respectively. It is straightforward to extend the definition of the signature function
from single indices to sets of indices, and hence to distributions: $\sigma_f(S)=\bigcup_{i\in S}\sigma_f(i)$.
We can now define the $\beta$ distribution as \[ \beta=\sigma_f\gamma, \quad\text{that is,}\quad \beta(p)=\sigma_f\bigl(\gamma(p)\bigr). \]
Informally, the $\beta$ distribution describes, for each processor $p$, which indices of $x$ it needs to compute its share of $y$.

The $\beta$ distribution is a dynamically defined distribution that combines knowledge of the data (the $\gamma$ distribution) and of
the structure of the algorithm (the $\sigma_f$ function).
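
In the running sketch (again illustrative, with invented names), $\beta$ is obtained by applying the signature function to each processor's share of $y$:

    # beta(p) = union of sigma_f(i) over all i in gamma(p).
    def beta_distribution(gamma, sigma, N):
        """For each processor, the input indices needed for its share of y."""
        return {p: set().union(*(sigma(i, N) for i in indices))
                for p, indices in gamma.items()}

    gamma = block_distribution(8, 2)
    beta = beta_distribution(gamma, sigma_stencil, 8)
    # beta == {0: {0, 1, 2, 3, 4}, 1: {3, 4, 5, 6, 7}}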

Since $x$ has the distribution $\alpha$, not $\beta$, some data needs to be communicated/synchronized before the local computation
of $y$ can commence.

Communication and synchronization



Let $q$ be a process that has just computed part of $x$, and let $p$ be a process that is about to compute part of $y$. Then
$q$ sends data to $p$, that is, $q$ is a predecessor of $p$, if \[ \alpha(q) \cap \beta(p) \neq \emptyset. \]
Thus all messages and predecessor relations are derived by the system from, ultimately, the user-specified $\alpha,\gamma$
data distributions and the $\sigma_f$ function that describes the structure of the operation.
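
In the sketch, the messages follow from a pairwise intersection test (pairs with $q=p$ correspond to data that is already local):

    # q is a predecessor of p iff alpha(q) intersects beta(p).
    def messages(alpha, beta):
        """Return {(q, p): index set} for all nonempty overlaps."""
        return {(q, p): alpha[q] & beta[p]
                for q in alpha for p in beta
                if alpha[q] & beta[p]}

    # messages(alpha, beta)[(1, 0)] == {4}: processor 1 owns x_4,
    # which processor 0 needs for its share of y.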

Practical importance



A programming system that lets the user specify the data distributions and the signature function can therefore
derive all communication and synchronization from that specification.

If we define a task as the computation of an object on a specific processor, then the IMP system can derive the
task data flow graph from a sequential program.
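
To make this concrete in the running sketch: for a chain of data-parallel steps, a task can be modeled as a pair (step, processor), and the edges of the task data flow graph then follow from the same intersection test per step. (This is a toy model of the derivation, not the actual IMP implementation.)

    # Toy task graph for a chain of steps y^(k) = f_k(y^(k-1)).
    # A task is (step k, processor p); edges run from producing tasks
    # at step k-1 to consuming tasks at step k.
    def task_graph(distributions, sigmas, N):
        """distributions[k]: distribution of step k's output;
        sigmas[k]: signature function of step k (k >= 1)."""
        edges = []
        for k in range(1, len(distributions)):
            alpha = distributions[k - 1]  # producers of step k's input
            beta = beta_distribution(distributions[k], sigmas[k], N)
            edges += [((k - 1, q), (k, p))
                      for q in alpha for p in beta
                      if alpha[q] & beta[p]]
        return edges

    # Two stencil sweeps on the block distribution:
    # task_graph([alpha, alpha, alpha],
    #            [None, sigma_stencil, sigma_stencil], 8)
    # yields, among others, the edge ((0, 1), (1, 0)).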