Parallel Prefix

Experimental html version of downloadable textbook, see
\[ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%% %%%% This text file is part of the source of %%%% `Introduction to High-Performance Scientific Computing' %%%% by Victor Eijkhout, copyright 2012-2020 %%%% %%%% mathjax.tex : macros to facility mathjax use in html version %%%% %%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}} \newcommand\macro[1]{$\langle$#1$\rangle$} \newcommand\dtdxx{\frac{\alpha\Delta t}{\Delta x^2}} \] 20.1 : Parallel prefix
20.2 : Sparse matrix vector product as parallel prefix
Back to Table of Contents

20 Parallel Prefix

For operations to be executable in parallel they need to be independent. That makes recurrences problematic to evaluate in parallel. Recurrences occur in obvious places such as solving a triangular system of equations (section~ ), but they can also appear in sorting and many other operations.

In this appendix we look at parallel prefix operations: the parallel execution of an operation that is defined by a recurrence involving an associative operator. Computing the sum of an array of elements is an example of this type of operation (disregarding the non-associativy for the moment). Let $\pi(x,y)$ be the binary sum operator: \[ \pi(x,y)\equiv x+y, \] then we define the prefix sum of $n\geq 2$ terms as \[ \Pi(x_1,\ldots,x_n) = \begin{cases} \pi(x_1,x_2)&\hbox{if $n=2$}\\ \pi\bigl( \Pi(x_1,\ldots,x_{n-1}),x_n\bigr)&\hbox{otherwise} \\ \end{cases} \]

As a non-obvious of a prefix operation, we could count the number of elements of an array that have a certain property.


Let $p(\cdot)$ be a predicate, $p(x)=1$ if it holds for~$x$ and 0~otherwise. Define a binary operator $\pi(x,y)$ so that its reduction over an array of numbers yields the number of elements for which $p$ is true.

So let us now assume the existence of an associative operator~$\oplus$, an array of values~$x_1,\ldots,x_n$. Then we define the prefix problem as the computation of $X_1,\ldots,X_n$, where \[ \begin{cases} X_1=x_1\\ X_k=\oplus_{i\leq k} x_i \end{cases} \]

20.1 Parallel prefix

crumb trail: > prefix > Parallel prefix

The key to parallelizing this is the realization that we can compute partial reductions in parallel: \[ x_1\oplus x_2, \quad x_3\oplus x_4, \ldots \] are all independent. Furthermore, partial reductions of these reductions, \[ (x_1\oplus x_2) \oplus (x_3\oplus x_4),\quad \ldots \] are also independent. We use the notation \[ X_{i,j}=x_i\oplus\cdots\oplus x_j \] for these partial reductions.

You have seen this before in section~ : an array of $n$ numbers can be reduced in~$\lceil \log_2 n\rceil$ steps. What is missing to make this a full prefix operation is computation of all intermediate values.

Observing that, for instance, $X_3=(x_1\oplus x_2)\oplus x_3=X_2\oplus x_3$, you can now imagine the whole process; see figure~ for the case of $8$~elements.

FIGURE 20.1: A prefix operation applied to 8 elements

To compute, say, $X_{13}$, you express $13=8+4+1$ and compute $X_{13}=X_8\oplus X_{9,12} \oplus x_13$.

In this figure, operations over the same `distance' have been horizontally aligned corresponding to a SIMD type execution. If the execution proceeds with a task graph, some steps can be executed earlier than the figure suggests; for instance $X_3$ can be computed simultaneously with~$X_6$.

Regardless the arrangement of the computational steps, it is not hard to see that the whole prefix calculation can be done in $2\log_2n$ steps: $\log_2 n$~steps for computing the final reduction~$X_n$, then another $\log_2 n$ steps for filling in the intermediate values.

20.2 Sparse matrix vector product as parallel prefix

crumb trail: > prefix > Sparse matrix vector product as parallel prefix

It has been observed that the sparse matrix vector product can be considered a prefix operation; see~ [Blelloc:segmented-report] . The reasoning here is we first compute all $y_{ij}\equiv a_{ij}x_j$, and subsequently compute the sums $y_i=\sum_j y_{ij}$ with a prefix operation.

A~prefix sum as explained above does not compute the right result. The first couple of $y_{ij}$ terms do indeed sum to~$y_1$, but then continuing the prefix sum gives $y_1+y_2$, instead of~$y_2$. The trick to making this work is to consider two-component quantities $\langle y_{ij},s_{ij}\rangle$, where \[ s_{ij} = \begin{cases} 1&\hbox{if $j$ is the first nonzero index in row $i$}\\ 0&\hbox{otherwise} \end{cases} \] Now we can define prefix sums that are `reset' every time $s_{ij}=1$.

Back to Table of Contents