
28 Scientific Libraries

There are many libraries for scientific computing. Some are specialized to certain application areas, others are quite general. In this section we will take a brief look at the PETSc library for sparse matrix computations, and the BLAS/Lapack libraries for dense computations.

28.1 The Portable Extendable Toolkit for Scientific Computing


PETSc, the Portable Extendable Toolkit for Scientific Computation  [] , is a large, powerful library, mostly concerned with linear and nonlinear systems of equations that arise from discretized PDEs . Since it has many hundreds of routines (and a good manual already exists) we limit this tutorial to going through a few simple, yet representative, PETSc programs.

28.1.1 What is in PETSc?


PETSc can be used as a library in the traditional sense, where you use some high level functionality, such as solving a nonlinear system of equations, in your program. However, it can also be used as a toolbox, to compose your own numerical applications using low-level tools.

The basic functionality of PETSc can be extended through external packages:

28.1.2 Design of PETSc


PETSc has an object-oriented design, even though it is implemented in C, a language that is not object oriented. PETSc is also parallel: all objects in PETSc are defined on an MPI communicator (section ). For an object such as a matrix this means that it is stored distributed over the processors in the communicator; for objects such as solvers it means that their operation is distributed over the communicator. PETSc objects can only interact if they are based on the same communicator. Most of the time objects will be defined on the MPI_COMM_WORLD communicator, but subcommunicators can be used too. This way, you can define a matrix or a linear system on all your available processors, or on a subset.

Parallelism is handled through MPI. You may need to have occasional MPI calls in a PETSc code, but the vast bulk of communication is done behind the scenes inside PETSc calls. Shared memory parallelism (such as through OpenMP; section  2.6.2 ) is not used explicitly, but the user can incorporate it without problems.

The object-oriented design means that a call such as matrix-vector multiplication

  MatMult(A,x,y); // y <- A x
looks the same, regardless whether $A$ is sparse or dense, sequential or parallel.

One implication of this is that the actual data are usually hidden from the user. Data can be accessed through routines such as VecGetArray, which gives a

  double *array;

pointer to the local data, but most of the time the application does not explicitly maintain the data, only the PETSc objects containing the data. As the ultimate consequence of this, the user usually does not allocate data; rather, matrix and vector arrays get created through the PETSc create calls, and subsequent values are inserted through PETSc calls:
  MatSetValue(A,i,j,v,INSERT_VALUES); // A[i,j] <- v
This may feel like the user is giving up control, but it actually makes programming a lot easier, since the user does not have to worry about details of storage schemes, especially in parallel.

28.1.3 Small examples


In this section we will go through a number of successively more complicated examples of the use of PETSc. The files can be downloaded from the repository of this book. While doing these examples it is a good idea to keep the manual page open:\\ \url{http://tinyurl.com/PETSc-man-page}.

When you do these examples, make sure to use a version of PETSc that has debug mode enabled! \begin{istc} At TACC, do module load petsc/3.5, or other version number as applicable; and module load petsc/3.5-debug for the corresponding debug version. \end{istc}

Program structure

The first example (we only list C sources in this book; the download includes their Fortran equivalents) illustrates the basic structure of a PETSc program: the include file at the top and the calls to \n{PetscInitialize}, PetscFinalize. Furthermore it uses two routines:


Just about the only lines of MPI that you need to know when programming PETSc are:

// C:
ierr = MPI_Comm_size(comm,&ntids);
ierr = MPI_Comm_rank(comm,&mytid);
! Fortran:
call MPI_Comm_size(comm,ntids,ierr)
call MPI_Comm_rank(comm,mytid,ierr)
The first line gives the size of the communicator, meaning how many processes there are; the second one gives the rank of the current process as a number from zero to the number of processes minus one.


Add these two lines to your program. Look up the routine PetscSynchronizedPrintf in the documentation and use it to let each process print out a line like \n{Process 3 out of 7}. You may also need to use PetscSynchronizedFlush.

Vectors

Next we start making PETSc objects, first a vector. \listing{vec.c}{tutorials/petsc/exercises/cexercises/vec.c} Note how it takes several calls to fully create the vector object:

At the end of the program there is a call to VecDestroy, which deallocates the memory for the vector. While this is, strictly speaking, not necessary for the functioning of the program, it is a good idea to issue a \n{Destroy} call for each Create call, since it can prevent potential memory leaks .

Exercise Comment out the VecDestroy call, and run the program with the option -malloc_dump. PETSc will now report on all memory that had not been freed at the time of the PetscFinalize call.

If you run the program in parallel the vector will be created distributed over the processors. The program contains a call to VecGetOwnershipRange to discover what part of the vector lives on what processor. You see that the VecView calls display for each processor precisely that part of the vector.

Exercise The listing has a call to VecSet to set the vector elements to a constant value. Remove this line. Use the myfirst,mylast variables and the PETSc routine VecSetValue to let every process set its local vector elements to the processor number. (The last argument of \n{VecSetValue} should be INSERT_VALUES.) That is, if the vector is six elements long and there are three processors, the resulting vector should be $(0,0,1,1,2,2)$.


Change the code from the previous exercise. Now every vector element should be the sum of the processor number and the previous processor numbers; in the above example the result should be $(0,0,1,1,3,3)$. Read the man page for VecSetValue!

Run the code from the previous two exercises again, adding a commandline argument -log_summary. Observe that the first code has no messages in the VecAssemblyBegin/End calls, but the second one does.

Matrices

Let's move up to matrices. Creating a matrix is similar to creating a vector. In this case the type is MPIAIJ which stands for a distributed sparse matrix. \listing{mat.c}{tutorials/petsc/exercises/cexercises/mat.c} In this example a diagonal matrix is constructed, with each processor setting only its locally stored elements.


In addition to setting the diagonal, also set the first subdiagonal and the first superdiagonal, that is, the elements $(i,i-1)$ and $(i,i+1)$. To set all three elements in a row with one call you can do this:

vm = 1.*mytid;
for (i=myfirst; i<mylast; i++) {
  PetscInt j[3]; PetscReal v[3];
  j[0] = i-1; j[1] = i; j[2] = i+1;
  v[0] = -vm-1; v[1] = 2*vm; v[2] = -vm+1;
  ierr = MatSetValues
            (A,1,&i,3,j,v,INSERT_VALUES); CHKERRQ(ierr);
}
However, this code is not entirely correct. Edit the program using this fragment and run it. Diagnose the problem and fix it.

Matrix-vector operations

Next we will create a file \n{mul.c} based on mat.c and multiply the matrix and the vector. Make sure that the size declarations of the matrix and the vector are compatible. You also need a second vector to store the result of the multiplication. This is easiest done by

ierr = VecDuplicate(x,&y); CHKERRQ(ierr);

Exercise Look up the MatMult routine in the documentation and use it in your program. Use VecView to inspect the result. Note that no size parameters or anything pertaining to parallelism appears in the calling sequence.

Solvers

Copy your \n{mat.c} file to sys.c: we are going to explore linear system solving. First you need to create a solver object and give it the matrix as operator:

ierr = KSPCreate(comm,&solver); CHKERRQ(ierr);
ierr = KSPSetOperators(solver,A,A,0); CHKERRQ(ierr);
ierr = KSPSolve(solver,x,y); CHKERRQ(ierr);
ierr = VecView(y,0); CHKERRQ(ierr);


Add these lines to your code. Make sure you know what the correct solution is by using MatMult to obtain the right hand side.

You have just used the default linear system solver. Run the program again, but with the option -ksp_view. This will tell you all the details of what solver was used.

Solving the linear system is a one line call to KSPSolve. The story would end there if it weren't for some complications:


Use the routine KSPGetConvergedReason to inspect the status of the solution vector. Use KSPGetIterationNumber to see how many iterations it took.


Add code to your program to compute the residual and its norm. For the residual, look up the routines \n{VecDuplicate} and VecAXPY; for computing its norm look up VecNorm.

Exercise Add a call to KSPSetFromOptions to your code. Use the option -ksp_monitor to observe the convergence behaviour.

The workings of a PETSc program can be customized to a great degree through the use of commandline options. This includes setting the type of the solver. In order for such options to be obeyed, you first need to put a call to \n{KSPSetFromOptions} before the KSPSolve call.

Exercise The -ksp_view option told you what solver and preconditioner were used. Look up the routines \n{KSPSetType} and PCSetType and use those to change the iterative method to CG and the preconditioner to Jacobi. Do this first by using commandline options, and then by editing the code.

28.1.4 A realistic program


This section will give you some further help towards solving a realistic PDE problem.

Construction of the coefficient matrix

In the examples above you used a commandline argument to determine the matrix size directly. Here we construct the matrix of the 5-point stencil for the Poisson operator (see section  4.1.6 and in particular figure  1 ). Determining its size takes two steps: you need to read the domain size $n=1/h-1$ and compute the matrix size from it.


  // C:
  int domain_size,matrix_size;
  PetscOptionsGetInt(NULL,"-n",&domain_size,&flag);
  matrix_size = domain_size*domain_size;

  ! Fortran:
  integer :: domain_size,matrix_size
  call PetscOptionsGetInt(PETSC_NULL_CHARACTER,
 >     "-n",domain_size,flag)
  matrix_size = domain_size*domain_size

Now you use the matrix_size parameter for constructing the matrix.

Filling in matrix elements

Just like in the examples above, you want each processor to set only its local rows. The easiest way to iterate over those is to iterate over all variables / matrix rows and select only the local ones.

We will now set matrix elements; the loops refer to the full domain, but each process only inserts those elements that fall in its block of matrix rows.


for ( i=0; i<domain_size; i++ ) {
  for ( j=0; j<domain_size; j++ ) {
    I = j + domain_size*i;
    if (I>=myfirst && I<mylast) {
      J = I; // for the diagonal element
      J = .... // for the other points
      J = .... 


call MatGetOwnershipRange(A,myfirst,mylast) 
do i=0,domain_size-1
   do j=0,domain_size-1
      ii = j + domain_size*i
      if (ii>=myfirst .and. ii<mylast) then
        jj = ii ! for the diagonal element
        call MatSetValues
 >           (A,1,ii,1,jj,v,INSERT_VALUES) 
        jj = ii... ! for the other elements
        jj = ii...
      end if
   end do
end do
call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY)
call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY)


Construct the matrix from equation \eqref{eq:5starmatrix} in section  4.1.6 . Compute and output the product with the identity vector (meaning that all elements are $1$), and check that the result is correct. Make sure to test your program in parallel.

Finite Element Matrix assembly

PETSc is versatile in dealing with Finite Element matrices (see sections 4.1.8 and  6.5.2 ), where elements are constructed by adding together contributions, sometimes from different processors. This is no problem in PETSc: any processor can set (or add to) any matrix element. The assembly calls will move data to their eventual location on the correct processors.

for (e=myfirstelement; e<mylastelement; e++) {
  for (i=0; i<nlocalnodes; i++) {
    I = localtoglobal(e,i);
    for (j=0; j<nlocalnodes; j++) {
      J = localtoglobal(e,j);
      v = integration(e,i,j);
      MatSetValues(mat,1,&I,1,&J,&v,ADD_VALUES);
      ....
    }
  }
}
MatAssemblyBegin(mat,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(mat,MAT_FINAL_ASSEMBLY);

Linear system solver

We have covered the basics of setting up the solver and solving the system above.

As an illustration of the toolbox nature of PETSc, you can now use routines you have already seen to compute the residual and its norm.

Exercise Create a new vector~\n{z} (use VecDuplicate) and store the product of \n{A} and the computed solution~\n{y} (use MatMult) in it. If you solved the system accurately, z~should now be equal to~x. To see how close it is, use

PetscReal norm;
VecAXPY(z,-1,x);
VecNorm(z,NORM_2,&norm);

to subtract \n{x} from~z and compute the norm of the result.

28.1.5 Quick experimentation


Reading a parameter from the commandline above is actually a special case of a general mechanism for influencing PETSc's workings through commandline options.

Here is an example of setting the iterative solver and preconditioner from the commandline:

yourprog -ksp_type gmres -ksp_gmres_restart 25
    -pc_type ilu -pc_factor_levels 3

In order for this to work, your code needs to call KSPSetFromOptions before the system solution. This mechanism is very powerful, and it obviates the need for much code recompilation.


28.1.6 Review questions



Write a PETSc program that does the following:

For a small problem (say, $n=10$) print out the first couple $x_i$ vectors. What do you observe? Explanation?

\end{istc} \begin{istc}

Exercise Extend the previous exercise: if a commandline option -inverse is present, the sequence should be generated as $y_{i+1}=A^{-1}x_i$. Use the routine PetscOptionsHasName.

What do you observe now about the norms of the $y_i$ vectors?


28.2 Libraries for dense linear algebra: BLAS, Lapack, Scalapack


In this section we will discuss libraries for dense linear algebra operations.

Dense linear algebra, that is, linear algebra on matrices that are stored as two-dimensional arrays (as opposed to sparse linear algebra; see section  , as well as the tutorial on PETSc  ), has been standardized for a considerable time. The basic operations are defined by the three levels of \indexac{BLAS}:

The name `BLAS' suggests a certain amount of generality, but the original authors were clear  [Lawson:blas] that these subprograms only covered dense linear algebra. Attempts to standardize sparse operations have never met with equal success.

Based on these building blocks, libraries have been built that tackle the more sophisticated problems such as solving linear systems, or computing eigenvalues or singular values. Linpack \footnote{The linear system solver from this package later became the Linpack benchmark ; see section  .} and Eispack were the first to formalize these operations, using Blas Level 1 and Blas Level 2 respectively. Lapack, a later development, uses the blocked operations of Blas Level 3. As you saw in section  , this is needed to get high performance on cache-based CPUs. (Note: the reference implementation of the BLAS  [reference-blas] will not give good performance with any compiler; most platforms have vendor-optimized implementations, such as the MKL library from Intel.)

With the advent of parallel computers, several projects arose that extended the Lapack functionality to distributed computing, most notably Scalapack   [Choi:scalapack,scalapack-users-guide] , PLapack   [PLAPACK,PLAPACK:UG] , and most recently Elemental  [Elemental:TOMS] . These packages are harder to use than Lapack because of the need for a two-dimensional cyclic distribution; see sections  and  . We will not go into the details here.

28.2.1 Some general remarks

The Fortran heritage


The original BLAS routines were written in Fortran, and the reference implementation is still in Fortran. For this reason you will see the routine definitions first in Fortran in this tutorial. It is possible to use the Fortran routines from a C/C++ program:

There are also C/C++ interfaces:

Routine naming

Routines conform to a general naming scheme: XYYZZZ where


  • [X] precision: S,D,C,Z stand for single and double, single complex and double complex, respectively.

  • [YY] storage scheme: general rectangular, triangular, banded.

  • [ZZZ] operation. See the manual for a list.

Data formats

Lapack and Blas use a number of data formats, including:

  • [GE] General matrix: stored two-dimensionally as A(LDA,*)
  • [SY/HE] Symmetric/Hermitian: general storage; UPLO parameter to indicate upper or lower (e.g. SPOTRF)
  • [GB/SB/HB] General/symmetric/Hermitian band; these formats use column-major storage; in SGBTRF overallocation is needed because of pivoting
  • [PB] Symmetric or Hermitian positive definite band; no overallocation in SPBTRF

Lapack operations

For Lapack, we can further divide the routines into an organization with three levels:

Expert driver names end on 'X'.

BLAS matrix storage

There are a few points to bear in mind about the way matrices are stored in the BLAS and LAPACK\footnote{We are not going into band storage here.}:

Array indexing

Since these libraries originated in a Fortran environment, they use 1-based indexing. Users of languages such as C/C++ are only affected by this when routines use index arrays, such as the location of pivots in LU factorizations.

Fortran column-major ordering

Since computer memory is one-dimensional, some conversion is needed from two-dimensional matrix coordinates to memory locations. The Fortran language uses column-major storage, that is, elements in a column are stored consecutively; see figure  .

Column-major storage of an array in Fortran

This is also described informally as `the leftmost index varies quickest'.

Arrays in C, on the other hand, are laid out in row-major order. How to create a C array that can be handled by Blas routines is discussed in section  .

Submatrices and the {\tt LDA} parameter

Using the storage scheme described above, it is clear how to store an $m\times n$ matrix in $mn$ memory locations. However, there are many cases where software needs access to a matrix that is a subblock of another, larger, matrix. As you see in figure 

A subblock out of a larger matrix

such a subblock is no longer contiguous in memory. The way to describe this is by introducing a third parameter in addition to {\tt M,N}: we let {\tt LDA} be the `leading dimension of {\tt A}', that is, the allocated first dimension of the surrounding array. This is illustrated in figure  .

A subblock out of a larger matrix, using {\tt LDA}

To pass the subblock to a routine, you would specify it as

call routine( A(3,2), /* M= */ 2, /* N= */ 3, /* LDA= */ Mbig, ... )

28.2.2 Performance issues

The collection of BLAS and LAPACK routines is a de facto standard: the API is fixed, but the implementation is not. You can find reference implementations on the netlib website (\texttt{netlib.org}), but these will be very low in performance.

On the other hand, many LAPACK routines can be based on the matrix-matrix product (BLAS routine \texttt{gemm}), which, as you saw in section  , has the potential for a substantial fraction of peak performance. To achieve this, you should use an optimized version, such as

28.2.3 Some simple examples

Let's look at some simple examples.

The routine xscal scales a vector in place.

! Fortran
subroutine dscal(integer N, double precision DA,
    double precision, dimension(*) DX, integer INCX)

// C
void cblas_dscal (const MKL_INT n, const double a,
    double *x, const MKL_INT incx);

A simple example: \verbatimsnippet{scale}

The same in C: \verbatimsnippet{scalecb}

Many routines have an increment parameter, specifying the stride between the vector elements operated on. For xscal that is the final parameter.


The matrix-vector product xgemv computes $y\leftarrow \alpha Ax+\beta y$, rather than $y\leftarrow Ax$. The specification of the matrix takes the M,N size parameters, and a character argument 'N' to indicate that the matrix is not transposed. Both of the vectors have an increment argument.

subroutine dgemv(character TRANS,
    integer M,integer N,
    double precision ALPHA,
    double precision, dimension(lda,*) A,integer LDA,
    double precision, dimension(*) X,integer INCX,
    double precision BETA,
    double precision, dimension(*) Y,integer INCY)

An example of the use of this routine: \verbatimsnippet{mvp}

The same example in C has an extra parameter to indicate whether the matrix is stored in row or column major storage: \verbatimsnippet{mvpc}
