There are many libraries for scientific computing. Some are specialized to certain application areas, others are quite general. In this section we will take a brief look at the PETSc library for sparse matrix computations, and the BLAS/Lapack libraries for dense computations.
PETSc, the Portable, Extensible Toolkit for Scientific Computation [], is a large, powerful library, mostly concerned with linear and nonlinear systems of equations that arise from discretized PDEs. Since it has many hundreds of routines (and a good manual already exists) we limit this tutorial to going through a few simple, yet representative, PETSc programs.
PETSc can be used as a library in the traditional sense, where you use some high level functionality, such as solving a nonlinear system of equations, in your program. However, it can also be used as a toolbox, to compose your own numerical applications using low-level tools.
Linear system solvers (sparse/dense, iterative/direct)
Nonlinear system solvers
Tools for distributed matrices
Support for profiling, debugging, graphical output
The basic functionality of PETSc can be extended through external packages.
PETSc has an object-oriented design, even though it is implemented in C, a language that is not object oriented. PETSc is also parallel: all objects in PETSc are defined on an MPI communicator (section 2.6.3.3 ). For an object such as a matrix this means that it is stored distributed over the processors in the communicator; for objects such as solvers it means that their operation is distributed over the communicator. PETSc objects can only interact if they are based on the same communicator. Most of the time objects will be defined on the MPI_COMM_WORLD communicator, but subcommunicators can be used too. This way, you can define a matrix or a linear system on all your available processors, or on a subset.
Parallelism is handled through MPI. You may need to have occasional MPI calls in a PETSc code, but the vast bulk of communication is done behind the scenes inside PETSc calls. Shared memory parallelism (such as through OpenMP; section 2.6.2 ) is not used explicitly, but the user can incorporate it without problems.
The object-oriented design means that a call such as matrix-vector multiplication
MatMult(A,x,y); // y <- A x
looks the same, regardless of whether $A$ is sparse or dense, sequential or parallel.
One implication of this is that the actual data are usually hidden from the user. Data can be accessed through routines such as
double *array;
VecGetArray(myvector,&array);
but most of the time the application does not explicitly maintain the data, only the PETSc objects containing the data. As the ultimate consequence of this, the user usually does not allocate data; rather, matrix and vector arrays get created through the PETSc create calls, and subsequent values are inserted through PETSc calls:
MatSetValue(A,i,j,v,INSERT_VALUES); // A[i,j] <- v
This may feel like the user is giving up control, but it actually makes programming a lot easier, since the user does not have to worry about details of storage schemes, especially in parallel.
In this section we will go through a number of successively more complicated examples of the use of PETSc. The files can be downloaded from the repository of this book. While doing these examples it is a good idea to keep the manual page open:\\ \url{http://tinyurl.com/PETSc-man-page}.
When you do these examples, make sure to use a version of PETSc that has debug mode enabled! \begin{istc} At TACC, do module load petsc/3.5, or other version number as applicable; and module load petsc/3.5-debug for the corresponding debug version. \end{istc}
The first example (we only list C sources in this book; the download includes their Fortran equivalents) illustrates the basic structure of a PETSc program: the include file at the top and the calls to \n{PetscInitialize}, PetscFinalize. Furthermore it uses two routines:
PetscOptionsGetInt, to parse a commandline argument, here with default value 45;
PetscPrintf, so that only one processor executes the print command. If you ran your program as \n{mpiexec -np 4 init} and you used printf, you would get four copies of the output.
\verbatimsnippet{petscprint}
Just about the only lines of MPI that you need to know when programming PETSc are:
C:
ierr = MPI_Comm_size(comm,&ntids);
ierr = MPI_Comm_rank(comm,&mytid);
Fortran:
call MPI_Comm_size(comm,ntids,ierr)
call MPI_Comm_rank(comm,mytid,ierr)
The first line gives the size of the communicator, meaning how many processes there are; the second one gives the rank of the current process as a number from zero to the number of processes minus one.
Exercise
Add these two lines to your program. Look up the routine PetscSynchronizedPrintf in the documentation and use it to let each process print out a line like \n{Process 3 out of 7}. You may also need to use PetscSynchronizedFlush.
Next we start making PETSc objects, first a vector. \listing{vec.c}{tutorials/petsc/exercises/cexercises/vec.c} Note how it takes several calls to fully create the vector object:
The routine setting the size has two size parameters; the second specifies that the global size is~n, and the first one says that you leave the distribution for PETSc to decide.
The VecDestroy call deallocates the memory for the vector. While this is strictly speaking not necessary for the functioning of the program, it is a good idea to issue a \n{Destroy} call for each Create call, since it can prevent potential memory leaks.
Exercise Comment out the VecDestroy call, and run the program with the option -malloc_dump. PETSc will now report on all memory that had not been freed at the time of the PetscFinalize call.
If you run the program in parallel the vector will be created distributed over the processors. The program contains a call to VecGetOwnershipRange to discover what part of the vector lives on what processor. You see that the VecView calls display for each processor precisely that part of the vector.
Exercise The listing has a call to VecSet to set the vector elements to a
constant value. Remove this line. Use the myfirst,mylast variables and the PETSc routine VecSetValue to let every process set its local vector elements to the processor number. (The last argument of \n{VecSetValue} should be INSERT_VALUES.) That is, if the vector is six elements long and there are three processors, the resulting vector should be $(0,0,1,1,2,2)$.
Exercise
Change the code from the previous exercise. Now every vector element should be the sum of the processor number and the previous processor numbers; in the above example the result should be $(0,0,1,1,3,3)$. Read the man page for VecSetValue!
Run the code from the previous two exercises again, adding a commandline argument -log_summary. Observe that the first code has no messages in the VecAssemblyBegin/End calls, but the second one does.
Let's move up to matrices. Creating a matrix is similar to creating a vector. In this case the type is MPIAIJ which stands for a distributed sparse matrix. \listing{mat.c}{tutorials/petsc/exercises/cexercises/mat.c} In this example a diagonal matrix is constructed, with each processor setting only its locally stored elements.
Exercise
In addition to setting the diagonal, also set the first subdiagonal and the first superdiagonal, that is, the elements $(i,i-1)$ and $(i,i+1)$. To set all three elements in a row with one call you can do this:
vm = 1.*mytid;
for (i=myfirst; i<mylast; i++) {
  PetscInt j[3]; PetscReal v[3];
  j[0] = i-1; j[1] = i; j[2] = i+1;
  v[0] = -vm-1; v[1] = 2*vm; v[2] = -vm+1;
  ierr = MatSetValues(A,1,&i,3,j,v,INSERT_VALUES); CHKERRQ(ierr);
}
However, this code is not entirely correct. Edit the program using this fragment and run it. Diagnose the problem and fix it.
The next step is to compute the product of the matrix and the vector. Make sure that the size declarations of the matrix and the vector are compatible. You also need a second vector to store the result of the multiplication. This is easiest done by
ierr = VecDuplicate(x,&y); CHKERRQ(ierr);
Exercise Look up the MatMult routine in the documentation and use it in your program. Use VecView to inspect the result. Note that no size
parameters or anything pertaining to parallelism appears in the calling sequence.
Next we turn to linear system solving. First you need to create a solver object and give it the matrix as operator:
ierr = KSPCreate(comm,&solver); CHKERRQ(ierr);
ierr = KSPSetOperators(solver,A,A,0); CHKERRQ(ierr);
ierr = KSPSolve(solver,x,y); CHKERRQ(ierr);
ierr = VecView(y,0); CHKERRQ(ierr);
Exercise
Add these lines to your code. Make sure you know what the correct solution is by using MatMult to obtain the right hand side.
You have just used the default linear system solver. Run the program again, but with the option -ksp_view. This will tell you all the details of what solver was used.
Solving the linear system is a one line call to KSPSolve. The story would end there if it weren't for some complications:
Iterative methods can fail, and the solve call does not tell us whether that happened.
If the system was solved successfully, we would like to know in how many iterations.
There can be other reasons for the iterative method to halt, such as reaching its maximum number of iterations without converging.
Exercise
Use the routine KSPGetConvergedReason to inspect the status of the solution vector. Use KSPGetIterationNumber to see how many iterations it took.
Exercise
Add code to your program to compute the residual and its norm. For the residual, look up the routines \n{VecDuplicate} and VecAXPY; for computing its norm look up VecNorm.
Exercise Add a call to KSPSetFromOptions to your code. Use the option -ksp_monitor to observe the convergence behaviour.
The workings of a PETSc program can be customized to a great degree through the use of commandline options. This includes setting the type of the solver. In order for such options to be obeyed, you first need to put a command \n{KSPSetFromOptions} before the KSPSolve call.
Exercise The -ksp_view option told you what solver and preconditioner were used. Look up the routines \n{KSPSetType} and PCSetType and
use those to change the iterative method to CG and the preconditioner to Jacobi. Do this first by using commandline options, and then by editing the code.
This section will give you some further help towards solving a realistic PDE problem.
In the examples above you used a commandline argument to determine the matrix size directly. Here we construct the matrix of 5-point stencil for the Poisson operator (see section 4.1.6 and in particular figure 1 ). Determining its size takes two steps: you need to read the domain size $n=1/h-1$ and compute the matrix size from it.
C:
int domain_size,matrix_size;
PetscOptionsGetInt(PETSC_NULL,"-n",&domain_size,&flag);
matrix_size = domain_size*domain_size;
Fortran:
integer :: domain_size,matrix_size
call PetscOptionsGetInt(PETSC_NULL_CHARACTER,
>    "-n",domain_size,flag)
matrix_size = domain_size*domain_size
Now you use the matrix_size parameter for constructing the matrix.
Just like in the examples above, you want each processor to set only its local rows. The easiest way to iterate over those is to iterate over all variables / matrix rows and select only the local ones.
We will now set the matrix elements. The loop refers to the full domain, but each process only inserts those elements that are in its own matrix block row.
C:
MatGetOwnershipRange(A,&myfirst,&mylast);
for ( i=0; i<domain_size; i++ ) {
  for ( j=0; j<domain_size; j++ ) {
    I = j + domain_size*i;
    if (I>=myfirst && I<mylast) {
      J = I; // for the diagonal element
      MatSetValues(A,1,&I,1,&J,&v,INSERT_VALUES);
      J = .... // for the other points
      J = ....
    }
  }
}
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
Fortran:
call MatGetOwnershipRange(A,myfirst,mylast)
do i=0,domain_size-1
  do j=0,domain_size-1
    ii = j + domain_size*i
    if (ii>=myfirst .and. ii<mylast) then
      jj = ii            ! for the diagonal element
      call MatSetValues
>          (A,1,ii,1,jj,v,INSERT_VALUES)
      jj = ii...         ! for the other elements
      jj = ii...
    end if
  end do
end do
call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY)
call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY)
Exercise
Construct the matrix from equation \eqref{eq:5starmatrix} in section 4.1.6 . Compute and output the product with the identity vector (meaning that all elements are $1$), and check that the result is correct. Make sure to test your program in parallel.
PETSc's versatility shows in dealing with Finite Element matrices (see sections 4.1.8 and 6.5.2 ), where elements are constructed by adding together contributions, sometimes from different processors. This is no problem in PETSc: any processor can set (or add to) any matrix element. The assembly calls will move data to their eventual location on the correct processors.
for (e=myfirstelement; e<mylastelement; e++) {
  for (i=0; i<nlocalnodes; i++) {
    I = localtoglobal(e,i);
    for (j=0; j<nlocalnodes; j++) {
      J = localtoglobal(e,j);
      v = integration(e,i,j);
      MatSetValues(mat,1,&I,1,&J,&v,ADD_VALUES);
      ....
    }
  }
}
MatAssemblyBegin(mat,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(mat,MAT_FINAL_ASSEMBLY);
We have covered the basics of setting up the solver and solving the system above.
As an illustration of the toolbox nature of PETSc, you can now use routines you have already seen to compute the residual and its norm.
Exercise Create a new vector~\n{z} (use VecDuplicate) and store the product of \n{A} and the computed solution~\n{y} (use MatMult) in it. If you solved the system accurately, z~should now be equal to~x. To see how close it is, use
PetscReal norm;
VecAXPY(z,-1,x);
VecNorm(z,NORM_2,&norm);
to subtract \n{x} from~z and compute the norm of the result.
Reading a parameter from the commandline above is actually a special case of a general mechanism for influencing PETSc's workings through commandline options.
Here is an example of setting the iterative solver and preconditioner from the commandline:
yourprog -ksp_type gmres -ksp_gmres_restart 25 -pc_type ilu -pc_factor_levels 3
In order for this to work, your code needs to call
KSPSetFromOptions(solver);
before the system solution. This mechanism is very powerful, and it obviates the need for much code recompilation.
\begin{istc}
Exercise
Write a PETSc program that does the following:
Construct the matrix \begin{equation} A= \begin{pmatrix} 2&-1\\ -1&2&-1\\ &\ddots&\ddots&\ddots\\ &&-1&2&-1\\ &&&-1&2\\ \end{pmatrix} \end{equation}
Compute the sequence \begin{equation} x_0=(1,0,\ldots,0)^t,\quad y_{i+1}=Ax_i,\quad x_i=y_i/\|y_i\|_2. \end{equation} This is the power method (section 15.3 ), which is expected to converge to the dominant eigenvector.
In each iteration of this process, print out the norm of $y_i$ and for $i>0$ the norm of the difference $x_i-x_{i-1}$. Do this for some different problem sizes. What do you observe?
The number of iterations and the size of the problem should be specified through commandline options. Use the routine PetscOptionsGetInt.
\end{istc} \begin{istc}
Exercise Extend the previous exercise: if a commandline option -inverse
is present, the sequence should be generated as $y_{i+1}=A^{-1}x_i$. Use the routine PetscOptionsHasName.
What do you observe now about the norms of the $y_i$ vectors?
\end{istc}
In this section we will discuss libraries for dense linear algebra operations.
Dense linear algebra, that is linear algebra on matrices that are stored as two-dimensional arrays (as opposed to sparse linear algebra; see section , as well as the tutorial on PETSc ) has been standardized for a considerable time. The basic operations are defined by the three levels of \indexac{BLAS}:
Level 1 defines vector operations that are characterized by a single loop [Lawson:blas] .
Level 2 defines matrix vector operations, both explicit such as the matrix-vector product, and implicit such as the solution of triangular systems [BLAS2] .
Level 3 defines matrix-matrix operations, most notably the matrix-matrix product [BLAS3] .
Based on these building blocks, libraries have been built that tackle the more sophisticated problems such as solving linear systems, or computing eigenvalues or singular values. Linpack \footnote{The linear system solver from this package later became the Linpack benchmark ; see section .} and Eispack were the first to formalize these operations, using Blas Level 1 and Blas Level 2 respectively. A later development, Lapack, uses the blocked operations of Blas Level 3. As you saw in section , this is needed to get high performance on cache-based CPUs. (Note: the reference implementation of the BLAS [reference-blas] will not give good performance with any compiler; most platforms have vendor-optimized implementations, such as the MKL library from Intel.)
With the advent of parallel computers, several projects arose that extended the Lapack functionality to distributed computing, most notably Scalapack [Choi:scalapack,scalapack-users-guide] , PLapack [PLAPACK,PLAPACK:UG] , and most recently Elemental [Elemental:TOMS] . These packages are harder to use than Lapack because of the need for a two-dimensional cyclic distribution; sections and . We will not go into the details here.
The original BLAS routines were written in Fortran, and the reference implementation is still in Fortran. For this reason you will see the routine definitions in Fortran first in this tutorial. It is possible to use the Fortran routines from a C/C++ program:
You typically need to append an underscore to the Fortran name;
Every argument needs to be a `star'-argument, so you can not pass literal constants: you need to pass the address of a variable.
You need to create a column-major matrix.
There is also a C interface, Cblas, in which Fortran character arguments have been replaced by enumerated constants, for instance \n{CblasNoTrans} instead of the~'N' parameter.
The Cblas interface can accommodate both row-major and column-major storage.
Array indices are 1-based, rather than 0-based; this mostly becomes apparent in error messages and when specifying pivot indices.
Blas and Lapack routine names follow a scheme \n{XYYZZZ}:
\begin{description}
[X] precision: \n{S}, \n{D}, \n{C}, \n{Z} for single, double, complex and double complex, respectively.
[YY] storage scheme: general rectangular, triangular, banded.
[ZZZ] operation. See the manual for a list. \end{description}
Lapack and Blas use a number of data formats, including \begin{description}
[GB/SB/HB] General/symmetric/Hermitian band; these formats use column-major storage; in SGBTRF overallocation is needed because of pivoting.
[PB] Symmetric or Hermitian positive definite band; no overallocation in SPBTRF. \end{description}
For Lapack, we can further divide the routines into an organization with three levels:
Drivers. These are powerful top level routines for problems such as solving linear systems or computing an SVD. There are simple and expert drivers; the expert ones have more numerical sophistication.
Computational routines. These are the routines that drivers are built up out of\footnote{Ha! Take that, Winston.}. A user may have occasion to call them by themselves.
Auxiliary routines.
Linear system solving. Simple drivers: \n{-SV} (e.g., DGESV): solve $AX=B$, overwrite A with LU (with pivoting), overwrite B with X.
Expert driver: -SVX. Also transpose solve, condition estimation, refinement, equilibration.
Least squares problems. Drivers: \begin{description}
\end{description}
Also: LSE \& GLM linear equality constraint \& general linear model
Eigenvalue routines. Symmetric/Hermitian: \n{xSY} or \n{xHE} (also \n{SP}, \n{SB}, ST)\\ simple driver -EV\\ expert driver -EVX\\ divide and conquer -EVD\\ relative robust representation -EVR
General (only xGE)\\ Schur decomposition \n{-ES} and -ESX\\ eigenvalues \n{-EV} and -EVX
SVD (only xGE)\\ simple driver -SVD\\ divide and conquer SDD
Generalized symmetric (\n{SY} and \n{HE}; \n{SP}, SB)\\ simple driver GV\\ expert GVX\\ divide-conquer GVD
Nonsymmetric:\\ Schur: simple \n{GGES}, expert GGESX\\ eigen: simple \n{GGEV}, expert GGEVX\\ svd: GGSVD
There are a few points to bear in mind about the way matrices are stored in the BLAS and LAPACK\footnote{We are not going into band storage here.}:
Since these libraries originated in a Fortran environment, they use 1-based indexing. Users of languages such as C/C++ are only affected by this when routines use index arrays, such as the location of pivots in LU factorizations.
Since computer memory is one-dimensional, some conversion is needed from two-dimensional matrix coordinates to memory locations. The Fortran language uses column-major storage, that is, elements in a column are stored consecutively; see figure .
This is also described informally as `the leftmost index varies quickest'.
Arrays in C, on the other hand, are laid out in row-major order. How to create a C array that can be handled by Blas routines is discussed in section .
Using the storage scheme described above, it is clear how to store an $m\times n$ matrix in $mn$ memory locations. However, there are many cases where software needs access to a matrix that is a subblock of another, larger, matrix. As you see in figure
such a subblock is no longer contiguous in memory. The way to describe this is by introducing a third parameter in addition to {\tt M,N}: we let {\tt LDA} be the `leading dimension of {\tt A}', that is, the allocated first dimension of the surrounding array. This is illustrated in figure .
To pass the subblock to a routine, you would specify it as
call routine( A(3,2), /* M= */ 2, /* N= */ 3, /* LDA= */ Mbig, ... )
The BLAS and LAPACK libraries are in effect a standard: the API is fixed, but the implementation is not. You can find reference implementations on the netlib website (\texttt{netlib.org}), but these will be very low in performance.
On the other hand, many LAPACK routines can be based on the matrix-matrix product (BLAS routine \texttt{gemm}), which you saw in section has the potential for a substantial fraction of peak performance. To achieve this, you should use an optimized version, such as
MKL , the Intel math-kernel library;
OpenBlas (\texttt{http://www.openblas.net/}), an open source version of the original Goto BLAS ; or
blis (\texttt{https://code.google.com/p/blis/}), a BLAS replacement and extension project.
Let's look at some simple examples.
The routine \n{xscal} scales a vector in place. A simple example: \verbatimsnippet{scale}
! Fortran
subroutine dscal(integer N, double precision DA,
     double precision, dimension(*) DX, integer INCX)
// C
void cblas_dscal(const MKL_INT n, const double a,
     double *x, const MKL_INT incx);
The same in C: \verbatimsnippet{scalecb}
Many routines have an increment parameter. For \n{xscal} that's the final parameter:
\verbatimsnippet{scaleinc}
The matrix-vector product \n{xgemv} computes $y\leftarrow \alpha Ax+\beta y$, rather than $y\leftarrow Ax$. The specification of the matrix takes the M,N size parameters, and a character argument 'N' to indicate that the matrix is not transposed. Both of the vectors have an increment argument.
subroutine dgemv(character TRANS, integer M, integer N,
     double precision ALPHA,
     double precision, dimension(lda,*) A, integer LDA,
     double precision, dimension(*) X, integer INCX,
     double precision BETA,
     double precision, dimension(*) Y, integer INCY)
An example of the use of this routine: \verbatimsnippet{mvp}
The same example in C has an extra parameter to indicate whether the matrix is stored in row or column major storage: \verbatimsnippet{mvpc}