# N-body problems

9.2 : The Fast Multipole Method
9.3 : Full computation
9.4 : Implementation
9.4.1 : Vectorization
9.4.2 : Shared memory implementation
9.4.3 : Distributed memory implementation

# 9 N-body problems

In chapter Numerical treatment of differential equations we looked at continuous phenomena, such as the behaviour of a heated rod in the entire interval $[0,1]$ over a certain time period. There are also applications where you may be interested in a finite number of points. One such application is the study of collections of particles, possibly very big particles such as planets or stars, under the influence of a force such as gravity or the electrical force. (There can also be external forces, which we will ignore; also we assume there are no collisions, otherwise we need to incorporate nearest-neighbour interactions.) This type of problems is known as N-body problems; for an introduction see http://www.scholarpedia.org/article/N-body_simulations_(gravitational) .

A basic algorithm for this problem is easy enough:

• choose some small time interval,

• calculate the forces on each particle, given the locations of all the particles,

• move the position of each particle as if the force on it stays constant throughout that interval.

For a small enough time interval this algorithm gives a reasonable approximation to the truth.

The last step, updating the particle positions, is easy and completely parallel: the problem is in evaluating the forces. In a naive way this calculation is simple enough, and even completely parallel:

for each particle $i$

\>for each particle $j$

\>\> let $\bar r_{ij}$ be the vector between $i$ and $j$;

\>\> then the force on $i$ because of $j$ is

\>\> $\quad f_{ij} = -\bar r_{ij}\frac{m_im_j}{|r_{ij}|}$

\>\> (where $m_i,m_j$ are the masses or charges) and

\>\> $f_{ji}=-f_{ij}$.

The main objection to this algorithm is that it has quadratic computational complexity: for $N$ particles, the number of operations is $O(N^2)$.

Exercise

If we had $N$ processors, the computations for one update step would take time $O(N)$. What is the communication complexity? Hint: is there a collective operations you can use?

Several algorithms have been invented to get the sequential complexity down to $O(N\log N)$ or even $O(N)$. As might be expected, these are harder to implement than the naive algorithm. We will discuss a popular method: the Barnes-Hut algorithm   [BarnesHut] , which has $O(N\log N)$ complexity.

# 9.1 The Barnes-Hut algorithm

Top > The Barnes-Hut algorithm

The basic observation that leads to a reduction in complexity is the following. If you are calculating the forces on two particles $i_1,i_2$ that are close together, coming from two particles $j_1,j_2$ that are also close together, you can clump $j_1,j_2$ together into one particle, and use that for both $i_1,i_2$.

Next, the algorithm uses a recursive division of space, in two dimensions in quadrants and in three dimensions in octants; see figure  .

The algorithm is then as follows. First total mass and center of mass are computed for all cells on all levels:

for each level $\ell$, from fine to coarse:

\>for each cell $c$ on level $\ell$:

\>\> compute the total mass and center of mass

\>\>\> for cell $c$ by considering its children

\> if there are no particles in this cell,

\>\> set its mass to zero

Then the levels are used to compute the interaction with each particle:

for each particle $p$:

\>for each cell $c$ on the top level

\>\>if $c$ is far enough away from $p$:

\>\>\>use the total mass and center of mass of $c$;

\>\>otherwise consider the children of $c$

The test on whether a cell is far enough away is typically implemented as the ratio of its diameter to its distance being small enough. This is sometimes referred to as the cell opening criterium'. In this manner, each particle interacts with a number of concentric rings of cells, each next ring of double width; see figure  .

This algorithm is easy to realize if the cells are organized in a tree. In the three-dimensional case, each cell has eight children, so this is known as an octtree .

The computation of centres of masses has to be done each time after the particles move. Updating can be less expensive than computing from scratch. Also, it can happen that a particle crosses a cell border, in which case the data structure needs to be updated. In the worst case, a particle moves into a cell that used to be empty.

# 9.2 The Fast Multipole Method

Top > The Fast Multipole Method

The \indexac{FMM} computes an expression for the potential at every point, not the force as does Barnes-Hut. FMM uses more information than the mass and center of the particles in a box. This more complicated expansion is more accurate, but also more expensive. In compensation, the FMM uses a fixed set of boxes to compute the potential, rather than a set varying with the accuracy parameter theta, and location of the center of mass.

However, computationally the FMM is much like the Barnes-Hut method so we will discuss their implementation jointly.

# 9.3 Full computation

Top > Full computation

Despite the above methods for judicious approximation, there are also efforts at full calculation of the $N^2$ interactions; see for instance the NBODY6 code of Sverre Aarseth; see http://www.ast.cam.ac.uk/~sverre/web/pages/home.htm . Such codes use high order integrators and adaptive time steps. Fast implementation on the Grape computer exist; general parallelization is typically hard because of the need for regular load balancing.

# 9.4 Implementation

Top > Implementation

Octtree methods offer some challenges on high performance architectures. First of all, the problem is irregular, and secondly, the irregularity dynamically changes. The second aspect is mostly a problem in distributed memory, and it needs load rebalancing ; see section  2.9 . In this section we concentrated on the force calculation in a single step.

## 9.4.1 Vectorization

Top > Implementation > Vectorization

The structure of a problem as in figure  is quite irregular. This is a problem for vectorization on the small scale of \indexac{SSE}/ \indexac{AVX} instructions and on the large scale of vector pipeline processors (see section  2.3.1 for an explanation of both). Program steps for all children of a certain box do something' will be of irregular length, and data will possibly be not stored in a regular manner.

This problem can be alleviated by subdividing the grid even if this means having empty boxes. If the bottom level is fully divided, there will always be eight (in three dimension) particles to operate on. Higher levels can also be filled in, but this means an increasing number of empty boxes on the lower levels, so there is a trade-off between increased work and increasing efficiency.

## 9.4.2 Shared memory implementation

Top > Implementation > Shared memory implementation

Executed on a sequential architecture, this algorithm has complexity $O(N\log N)$. It is clear that this algorithm will also work on shared memory if each particle is turned into a task. Since not all cells contain particles, tasks will have a different running time.

## 9.4.3 Distributed memory implementation

Top > Implementation > Distributed memory implementation

The above shared-memory version of the Barnes-Hut algorithm can not immediately be used in a distributed memory context, since each particle can in principle access information from any part of the total data. It is possible to realize an implementation along these lines using a hashed octtree , but we will not persue this.

We observe data access is more structured than it seems at first. Consider a particle $p$ and the cells on level $\ell$ that it interacts with. Particles located close to $p$ will interact with the same cells, so we can rearrange interaction by looking at cells on level $\ell$ and the other cells on the same level that they interact with.

This gives us the following algorithm  [Katzenelson:nbody] : the calculation of centres of mass become a calculation of the force $g^{(\ell)}_p$ exerted by a particle $p$ on level $\ell$:

for level $\ell$ from one above the finest to the coarsest:

\>for each cell $c$ on level $\ell$

\>\>let $g^{(\ell)}_c$ be the combination of the $g^{(\ell+1)}_i$

for all children $i$ of $c$

With this we compute the force on a cell:

for level $\ell$ from one below the coarses to the finest:

\>for each cell $c$ on level $\ell$:

\>\>let $f^{(\ell)}_c$ be the sum of

\>\>\>1. the force $f^{(\ell-1)}_p$ on the parent $p$ of $c$, and

\>\>\>2. the sums $g^{(\ell)}_i$ for all $i$ on level $\ell$ that

\>\>\>\>satisfy the cell opening criterium

We see that on each level, each cell now only interacts with a small number of neighbours on that level. In the first half of the algorithm we go up the tree using only parent-child relations between cells. Presumably this is fairly easy.

The second half of the algorithm uses more complicated data access. The cells $i$ in the second term are all at some distance from the cell $c$ on which we are computing the force. In graph terms these cells can be described as cousins: children of a sibling of $c$'s parent. If the opening criterium is made sharper, we use second cousins: grandchildren of the sibling of $c$'s grandparent, et cetera.

Exercise

Argue that this force calculation operation has much in common, structurally, with the sparse matrix-vector product.

In the shared memory case we already remarked that different subtrees take different time to process, but, since we are likely to have more tasks than processor cores, this will all even out. With distributed memory we lack the possibility to assign work to arbitrary processors, so we need to assign load carefully. SFC can be used here to good effect (see section  ).