Polytechnitis: Reading (open source) code

I read lots and lots of code, practically on a daily basis. Only a tiny fraction of that code was written by me or by people I know, so I wanted to share my techniques for effectively reading large, modern free software projects and, if possible, to hear yours as well :)

Almost always, my reason for reading the code is to fix a bug or to introduce a new feature or refactoring, so the following techniques are colored by this point of view (debugging/maintaining code whose structure is unknown).

I will limit the discussion to system software written in C/C++ in this post, but some of the tricks are also applicable to other languages (the list is in no particular order):

Use the revision control system logs. Especially if the project uses a distributed VCS like git or mercurial, the patches that implement each change tend to be small and reasonably well documented (the Linux kernel is a role model of this). git log or its equivalent can also be applied to a single file to see which patches changed it, something that always proves useful.
git bisect can help you locate the patch that introduced an unexpected change in behavior with the fewest possible compiles, even when you have no idea about the internals of the software.
Text and light structure search tools like grep and ctags are fast enough that you can apply them to really large codebases without them getting in the way.
An IDE with a good code browser (like Eclipse CDT) can help a lot through features like the type hierarchy and the call hierarchy. The nice thing about Eclipse is that it can work out much of the preprocessor magic by parsing makefile output, so setting things up is reasonably convenient.
Tracing: enable all logging switches (especially the "debug" stuff) and correlate using grep to figure out which parts of the code are actually relevant to what you want to do.
Tracing 2: use the typical tools like strace/ltrace to get a quick idea of how a program interacts with the outside world (system calls, library calls etc).
Tracing 3: use the recent tracing developments in your OS, like DTrace, SystemTap etc.
Running (select small parts of) the software in an interactive debugger like gdb/ddd or Eclipse CDT's gdb frontend.
Replay debugging: it lets you answer the question "how did we end up here" by running backwards from a failure point and tracing the sequence of events that led to the problem.
Taking advantage of the software being open source / using Google. There are usually presentations, blog posts and news articles about various features of successful software, sometimes accompanied by illustrative diagrams.
Books. For very large software, like the Linux kernel, you can even find books documenting the internals. These are a good starting point for reading the source.
Using your distro's package management software to find reverse dependencies, like the users of a library, and seeing how they call its functions and what the quirks are.
A general knowledge of a good number of algorithms and/or of the domain of the software (e.g., image processing) can help you see the abstract mathematical objects behind the concrete implementation. This is useful if you want to add features, but not necessary if you are trying to solve a trivial bug.
The mailing list / forums of the project, mainly suitable for asking "why do we do things like this in this part of the code" if the comments / commit logs can't tell you. Often the reasons are historical, so only the people who wrote the code can tell you for sure. Many times, searching the archives and correlating them chronologically with the revision control logs can give you all the answers you need, if the development of the project is open.
Performing small or larger changes and/or refactorings in the code and posting your patches for review. This can tempt the maintainers (who may be too busy to answer if you just ask on the list, as in the previous item) to actually care about your questions, if they see that a useful patch depends on the answers ...
Performing changes statically (or at runtime through the debugger) and seeing how the program reacts, and whether you could have predicted this.
Running your program through a profiler and inspecting the output in a tool like kcachegrind. This can show which parts of the code have delays, which are CPU-bound, which are the fast/slow paths etc etc.
GCC can also provide useful profiling info (frequent paths etc) using certain options like -fprofile-arcs.
Using sophisticated static analysis toolchains like Mozilla's Dehydra or LLVM's Clang static analyzer.
Using tools that perform analysis straight on the binaries like dwarves (uses DWARF data) etc.
Use the software's test suite to figure out how select parts of the functionality expect to be invoked.
Using dynamic binary analysis/translation with tools like valgrind. These tools can answer questions related to thread races / locking relatively efficiently.
You can use techniques like process stalking, implemented by tools like PaiMei. In general, correlating the output with the source and diffing the logs from before and after you activate a function of interest can help you "zero in" on the relevant code very effectively.
Dynamic taint analysis + tracing specially crafted input (e.g., put aaaaaaaaaaaaaaaaa as your name in a form and trace this string in memory) can also be effective in telling you how input propagates throughout a program.
Dynamic framework-specific runtime inspection, e.g., for Qt, Gtk or D-Bus.
Input minimization / Delta Debugging techniques, e.g., using a tool like delta.
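
To make the last item a bit more concrete, here is a minimal Python sketch of the idea behind delta debugging: keep cutting the failing input into pieces and throw away whatever is not needed to reproduce the failure. It is a simplified variant of the classic ddmin algorithm and not how the delta tool itself is implemented; the names ddmin and fails are made up for this illustration, and fails stands in for "run the program on this input and check whether the bug still shows up".

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(items: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `items` to a smaller list for which `fails` still returns True.

    `fails` is a hypothetical oracle supplied by the caller, e.g. a function
    that writes the candidate input to a file, runs the program under test
    and reports whether the bug still reproduces.
    """
    current = list(items)
    if not fails(current):
        raise ValueError("the full input must trigger the failure")
    chunks = 2  # number of pieces the input is currently split into
    while len(current) >= 2:
        size = -(-len(current) // chunks)  # ceiling division
        pieces = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False
        for index, piece in enumerate(pieces):
            # The complement is the current input with this one piece removed.
            complement = [x for j, p in enumerate(pieces) if j != index for x in p]
            if fails(piece):
                # A single piece is enough: restart on it with coarse chunks.
                current, chunks, reduced = piece, 2, True
                break
            if fails(complement):
                # This piece was irrelevant: drop it and keep the granularity.
                current, chunks, reduced = complement, max(chunks - 1, 2), True
                break
        if not reduced:
            if chunks >= len(current):
                break  # already at single-element granularity, nothing to drop
            chunks = min(len(current), chunks * 2)  # try finer pieces
    return current
```

For a toy oracle that "fails" whenever both 3 and 6 are present, ddmin(list(range(1, 9)), lambda xs: 3 in xs and 6 in xs) shrinks the eight-element input down to just those two elements. With a real program the items would typically be the lines of an input file, and every call to fails costs one program run, so the number of oracle calls is what you pay for.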

With all the above techniques under your belt, fixing or at least zeroing in on most bugs in your Linux system (at least enough to produce a super-compelling bug report that most experienced maintainers can understand and fix right away) becomes feasible, even when the bugs are in huge software whose source layout you have no idea about (yes LibreOffice, I'm looking at you :P).

Of course, for any given problem you will only need to use a small subset of the above but it can be a different subset depending on the problem, so it is good to be aware of most of them.

It would be great if, based on the above techniques, we had (free software) tools that could give us useful visualizations like UML structure, behavior and interaction diagrams, histograms etc etc, but unfortunately I don't know of any such tools that work for C/C++ and are in a usable state right now. For the moment, a subset of this functionality can be written in an ad hoc fashion using gdb and its scripting abilities (especially the new python scripting hotness), or by having functions that are only meant to be called from the debugger and that produce graphviz or gnuplot output (see the LLVM source for an example of this technique).
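
As a rough illustration of the ad hoc approach, the following Python sketch turns a few recorded call stacks into graphviz DOT text. The function names (call_edges, to_dot) and the sample stacks are invented for this example. Inside gdb's Python scripting, such stacks could presumably be collected by walking gdb.newest_frame() through Frame.older() and reading Frame.name(), but that part is left out here so that the sketch runs without a debugger.

```python
from typing import Iterable, List, Tuple


def call_edges(stack: List[str]) -> List[Tuple[str, str]]:
    """Turn one stack (innermost frame first) into (caller, callee) pairs."""
    return [(stack[i + 1], stack[i]) for i in range(len(stack) - 1)]


def to_dot(stacks: Iterable[List[str]]) -> str:
    """Render the union of call edges from several stacks as Graphviz DOT text."""
    edges: List[Tuple[str, str]] = []
    for stack in stacks:
        for edge in call_edges(stack):
            if edge not in edges:  # keep first-seen order, drop duplicates
                edges.append(edge)
    lines = ["digraph calls {"]
    # Note: function names containing double quotes are not escaped here.
    lines += [f'  "{caller}" -> "{callee}";' for caller, callee in edges]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical stacks, as they might be captured at two breakpoints.
    sample = [["parse", "load", "main"], ["render", "main"]]
    print(to_dot(sample))
```

The printed text can be saved to a file and rendered with graphviz's dot tool to get a small call graph of the paths that were actually exercised.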

There are several more tools and techniques that can be used for debugging / tracing / understanding foreign code (like fault injection) but the above have been the most useful to me so far. If you think I missed an important trick or you have a personal favorite I would be most interested to read about it :)

Happy Hacking!

-Pantelis


Sunday, 3 April 2011
