39 Timing and benchmarking

Much of the teaching in this book is geared towards enabling you to write fast code, whether through the choice of the right method or through optimal coding of a method. Consequently, you sometimes want to measure just how fast your code is. If you have a simulation that runs for many hours, you might think that just looking at the clock is measurement enough. However, as soon as you wonder whether your code could be faster than it is, you need more detailed measurements. This tutorial will teach you some ways to measure the behaviour of your code in more or less detail.

Here we will discuss

  • timers: ways of measuring the execution time (and sometimes other measurements) of a particular piece of code, and

  • profiling tools: ways of measuring how much time each piece of code, typically a subroutine, takes during a specific run.

39.1 Timers


There are various ways of timing your code, but mostly they come down to calling a timer routine twice that tells you the clock values:

tstart = clockticks()
....
tend = clockticks()
runtime = (tend-tstart)/ticks_per_sec

39.1.1 Fortran


For instance, in Fortran there is the system_clock routine:

implicit none
INTEGER :: rate, tstart, tstop
REAL    :: time
real    :: a
integer :: i

! query the clock resolution
CALL SYSTEM_CLOCK(COUNT_RATE = rate)
if (rate==0) then
   print *,"No clock available"
   stop
else
   print *,"Clock frequency:",rate
end if

CALL SYSTEM_CLOCK(COUNT = tstart)
a = 5
do i=1,1000000000
   a = sqrt(a)
end do
CALL SYSTEM_CLOCK(COUNT = tstop)
! note: the integer division truncates the elapsed time to whole seconds
time = REAL( ( tstop - tstart )/ rate )
print *,a,tstart,tstop,time
end

with output

Clock frequency:       10000
1.000000       813802544   813826097   2.000000

39.1.2 C


In C there is the clock function:

#include <time.h>
#include <stdio.h>
#include <math.h>

int main() {
  double start, stop, time;
  int i; double a=5;

  /* CLOCKS_PER_SEC gives the resolution of the clock() timer */
  i = CLOCKS_PER_SEC; printf("clock resolution: %d\n",i);

  start = (double)clock()/CLOCKS_PER_SEC;
  for (i=0; i<1000000000; i++)
    a = sqrt(a);
  stop = (double)clock()/CLOCKS_PER_SEC;
  time = stop - start;
  printf("res: %e\nstart/stop: %e,%e\nTime: %e\n",a,start,stop,time);
  return 0;
}

with output

clock resolution: 1000000
res: 1.000000e+00
start/stop: 0.000000e+00,2.310000e+00
Time: 2.310000e+00

Do you see a difference between the Fortran and C approaches? Hint: what happens in both cases when the execution time becomes long? At what point do you run into trouble?

39.1.3 C++


While the C routines are also available in C++, there is in addition the chrono library, which can do many things, including handling different time formats.

std::chrono::system_clock::time_point start_time;
start_time = std::chrono::system_clock::now();
// ... code to be timed ...
auto duration =
  std::chrono::system_clock::now()-start_time;
auto millisec_duration =
  std::chrono::duration_cast<std::chrono::milliseconds>(duration);
std::cout << "Time in seconds: "
          << .001 * millisec_duration.count() << std::endl;

39.1.4 System utilities


There are unix system calls that can be used for timing: getrusage

#include <sys/resource.h>
double time00(void)
{
  struct rusage ruse;
  getrusage(RUSAGE_SELF, &ruse);
  return( (double)(ruse.ru_utime.tv_sec+ruse.ru_utime.tv_usec
                   / 1000000.0) );
}

and gettimeofday

#include <sys/time.h>
double time00(void)
{
  struct timeval tp;
  gettimeofday(&tp, NULL);
  return( (double) (tp.tv_sec + tp.tv_usec/1000000.0) ); /* wall clock time */
}

These timers let you distinguish between user time and wall clock time: getrusage exclusively times your own program's execution (and, through the ru_stime field, can also report system time), while gettimeofday gives wall clock time, which includes all system activities.
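
As an illustration, here is a minimal, hedged sketch (not from the book's example codes) that wraps the two calls above and prints both user time and wall clock time around a spin loop; on an otherwise idle machine the two numbers will be close, but they diverge when the process is interrupted or waits for the system:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* user time of this process, in seconds */
double user_time(void) {
  struct rusage ruse;
  getrusage(RUSAGE_SELF, &ruse);
  return (double)ruse.ru_utime.tv_sec + ruse.ru_utime.tv_usec/1000000.0;
}
/* wall clock time, in seconds */
double wall_time(void) {
  struct timeval tp;
  gettimeofday(&tp, NULL);
  return (double)tp.tv_sec + tp.tv_usec/1000000.0;
}

int main() {
  double u0 = user_time(), w0 = wall_time();
  double a = 5;
  for (long i=0; i<100000000; i++)
    a = a*.5 + 1;                          /* some busy work */
  printf("user: %e s, wall: %e s (a=%e)\n",
         user_time()-u0, wall_time()-w0, a);
  return 0;
}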

39.2 Accurate counters


The timers in the previous section had a resolution of at best a microsecond, which still corresponds to thousands of cycles on a modern CPU. For more accurate counting it is typically necessary to use assembly language, such as the Intel RDTSC (ReaD Time Stamp Counter) instruction; see http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM .

static inline void microtime(unsigned *lo, unsigned *hi)
{
  __asm __volatile (
        ".byte 0x0f; .byte 0x31   # RDTSC instruction
        movl    %%edx,%0          # high order word
        movl    %%eax,%1          # low order word"
        : "=g" (*hi), "=g" (*lo) :: "eax", "edx");
}

However, this approach of using processor-specific timers is not portable. For this reason, the PAPI package (http://icl.cs.utk.edu/papi/) provides a uniform interface to hardware counters. You can see this package in action in the codes in appendix \ref{app:codes}.

In addition to timing, hardware counters can give you information about such things as cache misses and instruction counts. A processor typically has only a limited number of physical counters, but they can be assigned to a variety of tasks. Additionally, PAPI has the concept of derived metrics.
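
As a rough illustration, a PAPI measurement typically looks like the following sketch; PAPI_TOT_CYC and PAPI_L1_DCM are standard preset events, but which events are actually available depends on your processor, and real code should check the return code of every call:

#include <stdio.h>
#include <papi.h>

int main() {
  /* count total cycles and level-1 data cache misses */
  int events[2] = {PAPI_TOT_CYC, PAPI_L1_DCM};
  long long values[2];
  int eventset = PAPI_NULL;

  PAPI_library_init(PAPI_VER_CURRENT);
  PAPI_create_eventset(&eventset);
  PAPI_add_events(eventset, events, 2);

  PAPI_start(eventset);
  double a = 5;
  for (long i=0; i<10000000; i++)
    a = a*.5 + 1;                          /* code to be measured */
  PAPI_stop(eventset, values);

  printf("cycles: %lld, L1 data cache misses: %lld (a=%e)\n",
         values[0], values[1], a);
  return 0;
}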

39.3 Parallel timers in MPI and OpenMP


Many packages have their own timers. For instance for MPI

double MPI_Wtime(void);
double MPI_Wtick(void);

and for OpenMP

double omp_get_wtime()
double omp_get_wtick()

In neither of these packages are the timers synchronized over the parallel processes or threads. To have a synchronized timer you need to do an explicit barrier.
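
For instance, a common MPI idiom is to put a barrier before the timed section, so that all processes start their timer at roughly the same moment, and to report the maximum time over all processes afterwards; a minimal sketch:

#include <stdio.h>
#include <mpi.h>

int main(int argc,char **argv) {
  MPI_Init(&argc,&argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);

  MPI_Barrier(MPI_COMM_WORLD);             /* synchronize before timing */
  double tstart = MPI_Wtime();

  /* ... code to be timed ... */

  MPI_Barrier(MPI_COMM_WORLD);             /* wait until everyone is done */
  double elapsed = MPI_Wtime()-tstart, maxtime;
  MPI_Reduce(&elapsed,&maxtime,1,MPI_DOUBLE,MPI_MAX,0,MPI_COMM_WORLD);
  if (rank==0)
    printf("max time over all processes: %e s\n",maxtime);

  MPI_Finalize();
  return 0;
}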

39.4 Profiling tools


Profiling tools will give you the time spent in various events in the program, typically functions and subroutines, or parts of the code that you have declared as such. The tool will then report how many times the event occurred, total and average time spent, et cetera.

The only tool we mention here is gprof, the profiler of the GNU compiler; its typical use is sketched below. The TAU tool, discussed in section 39.5.1 for the purposes of tracing, also has profiling capabilities, presented in a nice graphical way. Finally, we mention that the PETSc library allows you to define your own timers and events.
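
With gprof the typical workflow, sketched here with a hypothetical program name, is to compile and link with the -pg flag, run the executable (which writes a gmon.out file in the current directory), and then hand both the executable and that file to gprof:

gcc -pg -O2 myprogram.c -o myprogram   # instrumented compile and link
./myprogram                            # the run writes gmon.out
gprof ./myprogram gmon.out > profile.txt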


39.4.1 MPI profiling


The MPI library has been designed to make it easy to profile. Each function such as MPI_Send does no more than call PMPI_Send. This means that a profiling library can redefine MPI_Send (a sketch is given after this list) to

  • initialize a profiling stage,

  • call PMPI_Send ,
  • finish the profiling stage.
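
A minimal sketch of such a wrapper is given below; this is not a real profiling library, only an illustration of the mechanism. The MPI standard guarantees that PMPI_Send exists and has the same signature as MPI_Send.

#include <stdio.h>
#include <mpi.h>

static int send_calls = 0;
static double send_time = 0.;

/* this definition intercepts every call to MPI_Send in the application */
int MPI_Send(const void *buf,int count,MPI_Datatype datatype,
             int dest,int tag,MPI_Comm comm) {
  double t = MPI_Wtime();                 /* start the profiling stage */
  int ierr = PMPI_Send(buf,count,datatype,dest,tag,comm); /* the actual send */
  send_time += MPI_Wtime()-t;             /* finish the profiling stage */
  send_calls++;
  return ierr;
}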

39.5 Tracing


In profiling we are only concerned with aggregate information: how many times a routine was called, and with what total/average/min/max runtime. However, sometimes we want to know about the exact timing of events. This is especially relevant in a parallel context, where we care about load imbalance and idle time.

Tools such as Vampyr can collect trace information about events, in particular messages, and render them in displays such as the timeline diagram shown below.

Figure: A Vampyr timeline diagram of a parallel process.

39.5.1 TAU


The TAU tool [TAU:ijhpca] (see http://www.cs.uoregon.edu/research/tau/home.php for the official documentation) uses instrumentation to profile and trace your code. That is, it adds profiling and trace calls to your code. You can then inspect the output after the run.

There are two ways to instrument your code:

  • You can use dynamic instrumentation , where TAU adds the measurement facility at runtime:

    # original commandline:
    # with TAU dynamic instrumentation:
    
  • You can have the instrumentation added at compile time. For this, you need to let TAU take over the compilation in some sense.

    1. TAU has its own makefiles. The names and locations depend on your installation, but typically it will be something like

      export TAU_MAKEFILE=$TAU_HOME/lib/Makefile.tau-mpi-pdt
      
    2. Now you can invoke the TAU compilers tau_cc.sh, tau_cxx.sh, tau_f90.sh.

When you run your program you need to tell TAU what to do:

export TAU_TRACE=1
export TAU_PROFILE=1
export TRACEDIR=/some/dir
export PROFILEDIR=/some/dir

In order to generate trace plots you need to convert TAU output:

cd /some/dir # where the trace and profile output went
tau_treemerge.pl
tau2slog2 tau.trc tau.edf -o yourrun.slog2

The slog2 file can be displayed with jumpshot.
