$\newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{{\equiv\atop\mathrm{\scriptstyle D}}}}}$ Back to Table of Contents

# 32 Performance measurement

Much of the teaching in this book is geared towards enabling you to write fast code, whether this is through the choice of the right method, or through optimal coding of a method. Consequently, you sometimes want to measure just how fast your code is. If you have a simulation that runs for many hours, you'd think just looking on the clock would be enough measurement. However, as you wonder whether your code could be faster than it is, you need more detailed measurements. This tutorial will teach you some ways to measure the behaviour of your code in more or less detail.

Here we will discuss

• timers: ways of measuring the execution time (and sometimes other measurements) of a particular piece of code, and

• profiling tools: ways of measuring how much time each piece of code, typically a subroutine, takes during a specific run.

Top > Timers

# 32.1 Timers

There are various ways of timing your code, but mostly they come down to calling a routine twice that tells you the clock values:

  tstart = clockticks()
....
tend = clockticks()
runtime = (tend-tstart)/ticks_per_sec

For instance, in Fortran there is the system_clock routine \index{systemclock@system_clock} :

implicit none
INTEGER :: rate, tstart, tstop
REAL    :: time
real    :: a
integer :: i

CALL SYSTEM_CLOCK(COUNT_RATE = rate)
if (rate==0) then
print *,"No clock available"
stop
else
print *,"Clock frequency:",rate
end if
CALL SYSTEM_CLOCK(COUNT = tstart)
a = 5
do i=1,1000000000
a = sqrt(a)
end do
CALL SYSTEM_CLOCK(COUNT = tstop)
time = REAL( ( tstop - tstart )/ rate )
print *,a,tstart,tstop,time
end



with output

 Clock frequency:       10000
1.000000       813802544   813826097   2.000000

In C there is the clock function:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
int main() {
double start, stop, time;
int i; double a=5;
i = CLOCKS_PER_SEC; printf("clock resolution: %d\n",i);
start = (double)clock()/CLOCKS_PER_SEC;
for (i=0; i<1000000000; i++)
a = sqrt(a);
stop = (double)clock()/CLOCKS_PER_SEC;
time = stop - start;
printf("res: %e\nstart/stop: %e,%e\nTime: %e\n",a,start,stop,time);
return 0;
}


with output

clock resolution: 1000000
res: 1.000000e+00
start/stop: 0.000000e+00,2.310000e+00
Time: 2.310000e+00

Do you see a difference between the Fortran and C approaches? Hint: what happens in both cases when the execution time becomes long? At what point do you run into trouble?

Top > Timers > System utilities

## 32.1.1 System utilities

There are unix system calls that can be used for timing: getrusage and gettimeofday. These timers have the advantage that they can

distinguish between user time and system time, that is, exclusively timing program execution or giving wallclock time including all system activities.

Top > Accurate counters

# 32.2 Accurate counters

The timers in the previous section had a resolution of at best a millisecond, which corresponds to several thousand cycles on a modern CPU. For more accurate counting it is typically necessary to use assembly language, such as the Intel RDTSC (ReaD Time Stamp Counter) instruction \url{http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM}. However, this approach of using processor-specific timers is not portable. For this reason, the PAPI package (\url{http://icl.cs.utk.edu/papi/}) provides a uniform interface to hardware counters . You can see this package in action in the codes in appendix  .

In addition to timing, hardware counters can give you information about such things as cache misses and instruction counters. A processor typically has only a limited number of counters, but they can be assigned to various tasks. Additionally, PAPI has the concept of derived metrics.

Top > Profiling tools

# 32.3 Profiling tools

Profiling tools will give you the time spent in various events in the program, typically functions and subroutines, or parts of the code that you have declared as such. The tool will then report how many time the event occurred, total and average time spent, et cetera.

The only tool we mention here is gprof , the profiler of the GNU compiler. The TAU tool, discussed in section  32.4.1 for the purposes of tracing, also has profiling capabilities, presented in a nice graphic way. Finally, we mention that the PETSc library allows you to define your own timers and events.

Top > Profiling tools > MPI profiling

## 32.3.1 MPI profiling

The MPI library has been designed to make it easy to profile. Each function such as MPI_Send does no more than calling PMPI_Send. This means that a profiling library can redefine MPI_Send to

• finish the profiling stage.

Top > Tracing

# 32.4 Tracing

In profiling we are only concerned with aggregate information: how many times a routine was called, and with what total/average/min/max runtime. However sometimes we want to know about the exact timing of events. This is especially relevant in a parallel context when we care about load unbalance and idle time .

Tools such as Vampyr can collect trace information about events and in particular messages, and render them in displays such as figure  1 .

Top > Tracing > TAU

## 32.4.1 TAU

The TAU tool  [TAU:ijhpca] (see \url{http://www.cs.uoregon.edu/research/tau/home.php} for the official documentation) uses instrumentation to profile and trace your code. That is, it adds profiling and trace calls to your code. You can then inspect the output after the run.

There are two ways to instrument your code:

• You can use dynamic instrumentation , where TAU adds the measurement facility at runtime:

# original commandline:
# with TAU dynamic instrumentation:


• You can have the instrumentation added at compile time. For this, you need to let TAU take over the compilation in some sense. \begin{enumerate}

• TAU has its own makefiles. The names and locations depend on your installation, but typically it will be something like

export TAU_MAKEFILE=\$TAU_HOME/lib/Makefile.tau-mpi-pdt


• Now you can invoke the TAU compilers \n{tau_cc,sh}, \n{tau_cxx.sh}, tau_f90.sh.

\end{enumerate}

When you run your program you need to tell TAU what to do:

export TAU_TRACE=1
export TAU_PROFILE=1
export TRACEDIR=/some/dir
export PROFILEDIR=/some/dir


In order to generate trace plots you need to convert TAU output:

cd /some/dir # where the trace and profile output went
tau_treemerge.pl
tau2slog2 tau.trc tau.edf -o yourrun.slog2


The \texttt{slog2} file can be displayed with jumpshot .