\[ \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}} \] Back to Table of Contents

32 Performance measurement

Much of the teaching in this book is geared towards enabling you to write fast code, whether this is through the choice of the right method, or through optimal coding of a method. Consequently, you sometimes want to measure just how fast your code is. If you have a simulation that runs for many hours, you'd think just looking on the clock would be enough measurement. However, as you wonder whether your code could be faster than it is, you need more detailed measurements. This tutorial will teach you some ways to measure the behaviour of your code in more or less detail.

Here we will discuss

Top > Timers

32.1 Timers

There are various ways of timing your code, but mostly they come down to calling a routine twice that tells you the clock values:

  tstart = clockticks()
  ....
  tend = clockticks()
  runtime = (tend-tstart)/ticks_per_sec
For instance, in Fortran there is the system_clock routine \index{systemclock@system_clock} :

implicit none INTEGER :: rate, tstart, tstop REAL :: time real :: a integer :: i CALL SYSTEM_CLOCK(COUNT_RATE = rate) if (rate==0) then print *,"No clock available" stop else print *,"Clock frequency:",rate end if CALL SYSTEM_CLOCK(COUNT = tstart) a = 5 do i=1,1000000000 a = sqrt(a) end do CALL SYSTEM_CLOCK(COUNT = tstop) time = REAL( ( tstop - tstart )/ rate ) print *,a,tstart,tstop,time end

with output

 Clock frequency:       10000
   1.000000       813802544   813826097   2.000000  
In C there is the clock function:

#include <stdlib.h> #include <stdio.h> #include <time.h> int main() { double start, stop, time; int i; double a=5; i = CLOCKS_PER_SEC; printf("clock resolution: %d\n",i); start = (double)clock()/CLOCKS_PER_SEC; for (i=0; i<1000000000; i++) a = sqrt(a); stop = (double)clock()/CLOCKS_PER_SEC; time = stop - start; printf("res: %e\nstart/stop: %e,%e\nTime: %e\n",a,start,stop,time); return 0; }

with output

clock resolution: 1000000
res: 1.000000e+00
start/stop: 0.000000e+00,2.310000e+00
Time: 2.310000e+00
Do you see a difference between the Fortran and C approaches? Hint: what happens in both cases when the execution time becomes long? At what point do you run into trouble?

Top > Timers > System utilities

32.1.1 System utilities

There are unix system calls that can be used for timing: getrusage and gettimeofday. These timers have the advantage that they can

distinguish between user time and system time, that is, exclusively timing program execution or giving wallclock time including all system activities.

Top > Accurate counters

32.2 Accurate counters

The timers in the previous section had a resolution of at best a millisecond, which corresponds to several thousand cycles on a modern CPU. For more accurate counting it is typically necessary to use assembly language, such as the Intel RDTSC (ReaD Time Stamp Counter) instruction \url{http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM}. However, this approach of using processor-specific timers is not portable. For this reason, the PAPI package (\url{http://icl.cs.utk.edu/papi/}) provides a uniform interface to hardware counters . You can see this package in action in the codes in appendix  .

In addition to timing, hardware counters can give you information about such things as cache misses and instruction counters. A processor typically has only a limited number of counters, but they can be assigned to various tasks. Additionally, PAPI has the concept of derived metrics.

Top > Profiling tools

32.3 Profiling tools

Profiling tools will give you the time spent in various events in the program, typically functions and subroutines, or parts of the code that you have declared as such. The tool will then report how many time the event occurred, total and average time spent, et cetera.

The only tool we mention here is gprof , the profiler of the GNU compiler. The TAU tool, discussed in section  32.4.1 for the purposes of tracing, also has profiling capabilities, presented in a nice graphic way. Finally, we mention that the PETSc library allows you to define your own timers and events.



Top > Profiling tools > MPI profiling

32.3.1 MPI profiling

The MPI library has been designed to make it easy to profile. Each function such as MPI_Send does no more than calling PMPI_Send. This means that a profiling library can redefine MPI_Send to

Top > Tracing

32.4 Tracing

In profiling we are only concerned with aggregate information: how many times a routine was called, and with what total/average/min/max runtime. However sometimes we want to know about the exact timing of events. This is especially relevant in a parallel context when we care about load unbalance and idle time .

Tools such as Vampyr can collect trace information about events and in particular messages, and render them in displays such as figure  1 .

\includegraphics{graphics/vampyrtrace}

A Vampyr timeline diagram of a parallel process.

Top > Tracing > TAU

32.4.1 TAU

The TAU tool  [TAU:ijhpca] (see \url{http://www.cs.uoregon.edu/research/tau/home.php} for the official documentation) uses instrumentation to profile and trace your code. That is, it adds profiling and trace calls to your code. You can then inspect the output after the run.

There are two ways to instrument your code:

When you run your program you need to tell TAU what to do:

export TAU_TRACE=1
export TAU_PROFILE=1
export TRACEDIR=/some/dir
export PROFILEDIR=/some/dir

In order to generate trace plots you need to convert TAU output:

cd /some/dir # where the trace and profile output went
tau_treemerge.pl
tau2slog2 tau.trc tau.edf -o yourrun.slog2

The \texttt{slog2} file can be displayed with jumpshot .

Back to Table of Contents