Debugging

Experimental html version of downloadable textbook, see https://www.tacc.utexas.edu/~eijkhout/istc/istc.html
\[ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%% %%%% This text file is part of the source of %%%% `Introduction to High-Performance Scientific Computing' %%%% by Victor Eijkhout, copyright 2012-2020 %%%% %%%% mathjax.tex : macros to facility mathjax use in html version %%%% %%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}} \newcommand\macro[1]{$\langle$#1$\rangle$} \newcommand\dtdxx{\frac{\alpha\Delta t}{\Delta x^2}} \] 29.1 : Invoking the debugger
29.2 : Finding errors: where, frame, print
29.2.1 : C programs
29.2.2 : Fortran programs
29.3 : Stepping through a program
29.4 : Inspecting values
29.5 : Breakpoints
29.6 : Memory debugging
29.6.1 : Type of memory errors
29.6.1.1 : Invalid pointers
29.6.1.2 : Out-of-bounds errors
29.6.1.3 : Memory leaks
29.6.2 : Memory tools
29.6.2.1 : Valgrind
29.6.2.2 : Electric fence
29.7 : Parallel debugging
29.8 : Further reading
Back to Table of Contents

29 Debugging

Debugging is like being the detective in a crime movie where you are also the murderer. (Filipe Fortes, 2013)

When a program misbehaves, debugging is the process of finding out why . There are various strategies of finding errors in a program. The crudest one is debugging by print statements. If you have a notion of where in your code the error arises, you can edit your code to insert print statements, recompile, rerun, and see if the output gives you any suggestions. There are several problems with this:

  • The edit/compile/run cycle is time consuming, especially since
  • often the error will be caused by an earlier section of code, requiring you to edit, compile, and rerun repeatedly. Furthermore,
  • the amount of data produced by your program can be too large to display and inspect effectively, and
  • if your program is parallel, you probably need to print out data from all processors, making the inspection process very tedious.

For these reasons, the best way to debug is by the use of an interactive debugger , a program that allows you to monitor and control the behavior of a running program. In this section you will familiarize yourself with gdb and lldb , the open source debuggers of the GNU and clang projects respectively. Other debuggers are proprietary, and typically come with a compiler suite. Another distinction is that gdb is a commandline debugger; there are graphical debuggers such as ddd (a~frontend to gdb) or DDT and TotalView (debuggers for parallel codes). We limit ourselves to gdb, since it incorporates the basic concepts common to all debuggers.

In this tutorial you will debug a number of simple programs with gdb and valgrind. The files can be found in the repository in the directory tutorials/debug_tutorial_files .

TABLE: List of common gdb / lldb commands

\toprule gdb lldb

\midrule Starting a debugger run

\midrule \$ gdb program \$ lldb program
(gdb) run (lldb) run

\midrule Displaying a stack trace

\midrule (gdb) where (lldb) thread backtrace

\midrule Investigate a specific frame

\midrule frame 2 frame select 2

\midrule Run/step

\midrule run / step / continue

\midrule Set a breakpoint at a line

\midrule break foo.c:12 breakpoint set [ -f foo.c ] -l 12
break foo.c:12 if n>0
info breakpoints

\midrule Set a breakpoint for exceptions

\midrule catch throw break set -E C++
\bottomrule

29.1 Invoking the debugger

crumb trail: > debug > Invoking the debugger

There are three ways of using gdb: using it to start a program, attaching it to an already running program, or using it to inspect a core dump . We will only consider the first possibility.

\toprule Starting a debugger run
\midrule gdb lldb
\midrule \$ gdb program \$ lldb program
(gdb) run (lldb) run
\bottomrule

Here is an example of how to start gdb with program that has no arguments (Fortran users, use hello.F ): \codelisting{tutorials/gdb/c/hello.c}

%% cc -g -o hello hello.c
# regular invocation:
%% ./hello
hello world
# invocation from gdb:
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
Copyright 2004 Free Software Foundation, Inc. .... copyright info ....
(gdb) run
Starting program: /home/eijkhout/tutorials/gdb/hello
Reading symbols for shared libraries +. done
hello world


Program exited normally.
(gdb) quit
%%

Important note: the program was compiled with the flag}  -g . This causes the symbol table (that is, the translation from machine address to program variables) and other debug information to be included in the binary. This will make your binary larger than strictly necessary, but it will also make it slower, for instance because the compiler will not perform certain optimizations\footnote{Compiler optimizations are not supposed to change the semantics of a program, but sometimes do. This can lead to the nightmare scenario where a program crashes or gives incorrect results, but magically works correctly with compiled with debug and run in a debugger.}.

To illustrate the presence of the symbol table do

%% cc -g -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list

and compare it with leaving out the -g flag:

%% cc -o hello hello.c
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
(gdb) list

For a program with commandline input we give the arguments to the run command (Fortran users use say.F ): \codelisting{tutorials/gdb/c/say.c}

%% cc -o say -g say.c
%% ./say 2
hello world
hello world
%% gdb say
.... the usual messages ...
(gdb) run 2
Starting program: /home/eijkhout/tutorials/gdb/c/say 2
Reading symbols for shared libraries +. done
hello world
hello world


Program exited normally.

29.2 Finding errors: where, frame, print

crumb trail: > debug > Finding errors: where, frame, print

Let us now consider some programs with errors.

29.2.1 C programs

crumb trail: > debug > Finding errors: where, frame, print > C programs

// square.c
  int nmax,i;
  float *squares,sum;

  fscanf(stdin,"%d",nmax);
  for (i=1; i<=nmax; i++) {
    squares[i] = 1./(i*i); sum += squares[i];
  }
  printf("Sum: %e\n",sum);

%% cc -g -o square square.c
 %% ./square
5000
Segmentation fault

The segmentation fault (other messages are possible too) indicates that we are accessing memory that we are not allowed to, making the program exit. A debugger will quickly tell us where this happens:

%% gdb square
(gdb) run
50000


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000000000eb4a
0x00007fff824295ca in __svfscanf_l ()

Apparently the error occurred in a function __svfscanf_l , which is not one of ours, but a system function. Using the backtrace (or  bt , also where or  w ) command we display the call stack . This usually allows us to find out where the error lies:

\toprule Displaying a stack trace
\midrule gdb lldb
\midrule (gdb) where (lldb) thread backtrace
\bottomrule

{\small

(gdb) backtrace
#0  0x00007fff824295ca in __svfscanf_l ()
#1  0x00007fff8244011b in fscanf ()
#2  0x0000000100000e89 in main (argc=1, argv=0x7fff5fbfc7c0) at square.c:7

}

We inspect the actual problem:

\toprule Investigate a specific frame
\midrule gdbclang
frame 2 frame select 2
\bottomrule

We take a close look at line 7, and see that we need to change nmax to  &nmax .

There is still an error in our program: {\small

(gdb) run
50000


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000000010000f000
0x0000000100000ebe in main (argc=2, argv=0x7fff5fbfc7a8) at square1.c:9
9           squares[i] = 1./(i*i); sum += squares[i];

} We investigate further:

(gdb) print i
$1 = 11237
(gdb) print squares[i]
Cannot access memory at address 0x10000f000
(gdb) print squares
$2 = (float *) 0x0

and we quickly see that we forgot to allocate squares .

Memory errors can also occur if we have a legitimate array, but we access it outside its bounds.

// up.c
  int nlocal = 100,i;
  double s, *array = (double*) malloc(nlocal*sizeof(double));
  for (i=0; i<nlocal; i++) {
    double di = (double)i;
    array[i] = 1/(di*di);
  }
  s = 0.;
  for (i=nlocal-1; i>=0; i++) {
    double di = (double)i;
    s += array[i];
  }

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000100200000
0x0000000100000f43 in main (argc=1, argv=0x7fff5fbfe2c0) at up.c:15
15          s += array[i];
(gdb) print array
$1 = (double *) 0x100104d00
(gdb) print i
$2 = 128608

29.2.2 Fortran programs

crumb trail: > debug > Finding errors: where, frame, print > Fortran programs

Compile and run the following program: \codelisting{tutorials/gdb/f/square.F} It should end prematurely with a message such as `Illegal instruction'. Running the program in gdb quickly tells you where the problem lies:

(gdb) run
Starting program: tutorials/gdb//fsquare
Reading symbols for shared libraries ++++. done


Program received signal EXC_BAD_INSTRUCTION,
Illegal instruction/operand.
0x0000000100000da3 in square () at square.F:7
7                sum = sum + squares(i)

We take a close look at the code and see that we did not allocate squares properly.

29.3 Stepping through a program

crumb trail: > debug > Stepping through a program

Often the error in a program is sufficiently obscure that you need to investigate the program run in detail. Compile the following program \codelisting{tutorials/gdb/c/roots.c} and run it:

%% ./roots
sum: nan

Start it in gdb as before:

%% gdb roots
GNU gdb 6.3.50-20050815
Copyright 2004 Free Software Foundation, Inc.
....

but before you run the program, you set a breakpoint at main . This tells the execution to stop, or `break', in the main program.

(gdb) break main
Breakpoint 1 at 0x100000ea6: file root.c, line 14.

Now the program will stop at the first executable statement in main :

(gdb) run
Starting program: tutorials/gdb/c/roots
Reading symbols for shared libraries +. done


Breakpoint 1, main () at roots.c:14
14        float x=0;

Most of the time you will set a breakpoint at a specific line:

\toprule Set a breakpoint at a line
\midrule gdblldb
break foo.c:12 breakpoint set [ -f foo.c ] -l 12
\bottomrule

If execution is stopped at a breakpoint, you can do various things, such as issuing the step command:

Breakpoint 1, main () at roots.c:14
14        float x=0;
(gdb) step
15        for (i=100; i>-100; i--)
(gdb)
16          x += root(i);
(gdb)

(if you just hit return, the previously issued command is repeated). Do a number of step s in a row by hitting return. What do you notice about the function and the loop?

Switch from doing step to doing next . Now what do you notice about the loop and the function?

Set another breakpoint: break 17 and do cont . What happens?

Rerun the program after you set a breakpoint on the line with the sqrt call. When the execution stops there do where and list .

29.4 Inspecting values

crumb trail: > debug > Inspecting values

Run the previous program again in gdb: set a breakpoint at the line that does the sqrt call before you actually call run . When the program gets to line 8 you can do print n . Do cont . Where does the program stop?

If you want to repair a variable, you can do set var=value . Change the variable n and confirm that the square root of the new value is computed. Which commands do you do?

29.5 Breakpoints

crumb trail: > debug > Breakpoints

If a problem occurs in a loop, it can be tedious keep typing cont and inspecting the variable with print . Instead you can add a condition to an existing breakpoint. First of all, you can make the breakpoint subject to a condition: with

condition 1 if (n<0)

breakpoint 1 will only obeyed if n<0 is true.

You can also have a breakpoint that is only activated by some condition. The statement

break 8 if (n<0)

means that breakpoint 8 becomes (unconditionally) active after the condition n<0 is encountered.

\toprule Set a breakpoint
\midrule gdb lldb
\midrule break foo.c:12 breakpoint set [ -f foo.c ] -l 12
break foo.c:12 if ngt;0
\bottomrule

Remark

You can break on NaN with the following trick:

break foo.c:12 if x!=x

using the fact that NaN is the only number not equal to itself.

Another possibility is to use ignore 1 50 , which will not stop at breakpoint 1 the next 50 times.

Remove the existing breakpoint, redefine it with the condition n<0 and rerun your program. When the program breaks, find for what value of the loop variable it happened. What is the sequence of commands you use?

You can set a breakpoint in various ways:

  • break foo.c to stop when code in a certain file is reached;
  • break 123 to stop at a certain line in the current file;
  • break foo to stop at subprogram foo
  • or various combinations, such as break foo.c:123 .

Information about breakpoints:

  • If you set many breakpoints, you can find out what they are with info breakpoints .
  • You can remove breakpoints with delete n where n is the number of the breakpoint.
  • If you restart your program with run without leaving gdb, the breakpoints stay in effect.
  • If you leave gdb, the breakpoints are cleared but you can save them: save breakpoints <file> . Use source <file> to read them in on the next gdb run.
  • In languages with exceptions , such as  C++ , you can set a catchpoint :

    \toprule Set a breakpoint for exceptions
    \midrule gdbclang\ catch throw break set -E C++
    \bottomrule

Finally, you can execute commands at a breakpoint:

break 45
command
print x
cont
end

This states that at line 45 variable  x is to be printed, and execution should immediately continue.

If you want to run repeated gdb sessions on the same program, you may want to save an reload breakpoints. This can be done with

save-breakpoint filename
source filename

29.6 Memory debugging

crumb trail: > debug > Memory debugging

Many problems in programming stem from memory errors. We start with a sort description of the most common types, and then discuss tools that help you detect them.

29.6.1 Type of memory errors

crumb trail: > debug > Memory debugging > Type of memory errors

29.6.1.1 Invalid pointers

crumb trail: > debug > Memory debugging > Type of memory errors > Invalid pointers

Dereferencing a pointer that does not point to an allocated object can lead to an error. If your pointer points into valid memory anyway, your computation will continue but with incorrect results.

However, it is more likely that your program will probably exit with a segmentation violation or a bus error .

29.6.1.2 Out-of-bounds errors

crumb trail: > debug > Memory debugging > Type of memory errors > Out-of-bounds errors

Addressing outside the bounds of an allocated object is less likely to crash your program and more likely to give incorrect results.

Exceeding bounds by a large enough amount will again give a segmentation violation, but going out of bounds by a small amount may read invalid data, or corrupt data of other variables, giving incorrect results that may go undetected for a long time.

29.6.1.3 Memory leaks

crumb trail: > debug > Memory debugging > Type of memory errors > Memory leaks

We speak of a memory leak if allocated memory becomes unreachable. Example:

if (something) {
  double *x = malloc(10*sizeofdouble);
  // do something with x
}

After the conditional, the allocated memory is not freed, but the pointer that pointed to has gone away.

This last type especially can be hard to find. Memory leaks will only surface in that your program runs out of memory. That in turn is detectable because your allocation will fail. It is a good idea to always check the return result of your malloc or allocate statement!

29.6.2 Memory tools

crumb trail: > debug > Memory debugging > Memory tools

29.6.2.1 Valgrind

crumb trail: > debug > Memory debugging > Memory tools > Valgrind

Insert the following allocation of squares in your program:

squares = (float *) malloc( nmax*sizeof(float) );

Compile and run your program. The output will likely be correct, although the program is not. Can you see the problem?

To find such subtle memory errors you need a different tool: a memory debugging tool. A popular (because open source) one is valgrind purify .

\codelisting{tutorials/gdb/c/square1.c} Compile this program with cc -o square1 square1.c and run it with valgrind square1 (you need to type the input value). You will lots of output, starting with: {\small

%% valgrind square1
==53695== Memcheck, a memory error detector
==53695== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==53695== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==53695== Command: a.out
==53695==
10
==53695== Invalid write of size 4
==53695==    at 0x100000EB0: main (square1.c:10)
==53695==  Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695==    at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695==    by 0x100000E77: main (square1.c:8)
==53695==
==53695== Invalid read of size 4
==53695==    at 0x100000EC1: main (square1.c:11)
==53695==  Address 0x10027e148 is 0 bytes after a block of size 40 alloc'd
==53695==    at 0x1000101EF: malloc (vg_replace_malloc.c:236)
==53695==    by 0x100000E77: main (square1.c:8)

} Valgrind is informative but cryptic, since it works on the bare memory, not on variables. Thus, these error messages take some exegesis. They state that a line 10 writes a 4-byte object immediately after a block of 40 bytes that was allocated. In other words: the code is writing outside the bounds of an allocated array. Do you see what the problem in the code is?

Note that valgrind also reports at the end of the program run how much memory is still in use, meaning not properly free d.

If you fix the array bounds and recompile and rerun the program, valgrind still complains: {\small

==53785== Conditional jump or move depends on uninitialised value(s)
==53785==    at 0x10006FC68: __dtoa (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x10003199F: __vfprintf (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x1000738AA: vfprintf_l (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x1000A1006: printf (in /usr/lib/libSystem.B.dylib)
==53785==    by 0x100000EF3: main (in ./square2)

} Although no line number is given, the mention of printf gives an indication where the problem lies. The reference to an `uninitialized value' is again cryptic: the only value being output is sum , and that is not uninitialized: it has been added to several times. Do you see why valgrind calls it uninitialized all the same?

29.6.2.2 Electric fence

crumb trail: > debug > Memory debugging > Memory tools > Electric fence

The electric fence library is one of a number of tools that supplies a new malloc with debugging support. These are linked instead of the malloc of the standard libc .

cc -o program program.c -L/location/of/efence -lefence

Suppose your program has an out-of-bounds error. Running with gdb, this error may only become apparent if the bounds are exceeded by a large amount. On the other hand, if the code is linked with libefence , the debugger will stop at the very first time the bounds are exceeded.

29.7 Parallel debugging

crumb trail: > debug > Parallel debugging

FIGURE 29.1: Display of 16 processes in the DDT debugger

Debugging in parallel is harder than sequentially, because you will run errors that are only due to interaction of processes such as deadlock ; see section  .

As an example, consider this segment of MPI code:

MPI_Init(0,0);
// set comm, ntids, mytid
for (int it=0; ; it++) {
  double randomnumber = ntids * ( rand() / (double)RAND_MAX );
  printf("[%d] iteration %d, random %e\n",mytid,it,randomnumber);
  if (randomnumber>mytid && randomnumber<mytid+1./(ntids+1))
    MPI_Finalize();
}
MPI_Finalize();

Each process computes random numbers until a certain condition is satisfied, then exits. However, consider introducing a barrier (or something that acts like it, such as a reduction):

for (int it=0; ; it++) {
  double randomnumber = ntids * ( rand() / (double)RAND_MAX );
  printf("[%d] iteration %d, random %e\n",mytid,it,randomnumber);
  if (randomnumber>mytid && randomnumber<mytid+1./(ntids+1))
    MPI_Finalize();
  MPI_Barrier(comm);
}
MPI_Finalize();

Now the execution will hang, and this is not due to any particular process: each process has a code path from init to finalize that does not develop any memory errors or other runtime errors. However as soon as one process reaches the finalize call in the conditional it will stop, and all other processes will be waiting at the barrier.

Figure  shows the main display of the Allinea DDT debugger ( http://www.allinea.com/products/ddt ) at the point where this code stops. Above the source panel you see that there are 16 processes, and that the status is given for process 1. In the bottom display you see that out of 16 processes 15 are calling MPI_Barrier on line 19, while one is at line 18. In the right display you see a listing of the local variables: the value specific to process 1. A rudimentary graph displays the values over the processors: the value of ntids is constant, that of mytid is linearly increasing, and it is constant except for one process.

Exercise

Make and run ring_1a . The program does not terminate and does not crash. In the debugger you can interrupt the execution, and see that all processes are executing a receive statement. This is probably a case of deadlock. Diagnose and fix the error.

Exercise

The author of ring_1c was very confused about how MPI works. Run the program. While it terminates without a problem, the output is wrong. Set a breakpoint at the send and receive statements to figure out what is happening.

29.8 Further reading

crumb trail: > debug > Further reading

A good tutorial: http://www.dirac.org/linux/gdb/ .

Reference manual: http://www.ofb.net/gnu/gdb/gdb_toc.html .

Back to Table of Contents