A few months ago, I started my first real, very adult programming job. As a consequence, I have been spending more hours per week making software than ever before. A big part of that time went into debugging, so I have been thinking about how to make the process more efficient. I have come up with a little hierarchy of steps for investigating bugs.

  1. Prevention
  2. Reproducing and Thinking Hard™
  3. Debugger

This works well for issues like the program crashing or giving an undesired result. I haven’t had to deal with performance or memory issues yet, so I’m not sure how much this generalizes.

Prevention

Prevention takes many forms: writing good code, proactively looking for issues, devoting extra attention to high-risk areas, and using tooling that points out best practices and potential problems. If you have the freedom, choose a programming language that is suitable for the job: for any larger project that is not primarily research-focused, this means a statically typed language like Rust (if your team can handle that), Go or (sigh) Java. I do not have that freedom, so I work on big architectures which are entirely Python-based… I use Pyre to check type hints and pyflakes for linting, and I will add both as Git pre-commit hooks as soon as I figure out how to ignore all the warnings that the existing code base would raise. If you have to work in the JavaScript world, try using TypeScript or something similar. Even if you’re working with a typed language, make sure you write code that is strongly-typed and not stringly-typed.
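To make the strongly-typed versus stringly-typed distinction concrete, here is a minimal Python sketch. The names Color, paint_stringly and paint_typed are made up for illustration and are not from any real code base.

```python
from enum import Enum


class Color(Enum):
    RED = "red"
    GREEN = "green"


def paint_stringly(color: str) -> str:
    # Any string is accepted, so a typo like "gren" goes unnoticed here.
    return f"painting {color}"


def paint_typed(color: Color) -> str:
    # A type checker can reject anything that is not a Color, and a typo
    # such as Color.GREN fails immediately with an AttributeError.
    return f"painting {color.value}"


print(paint_stringly("gren"))    # prints "painting gren", typo and all
print(paint_typed(Color.GREEN))  # prints "painting green"
```

The stringly-typed version only fails (if at all) somewhere far downstream, while the typed version lets the tooling point at the exact line.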

Proactively looking for issues means testing lots and testing early. Ideally your tests execute every line of new code. Ideally your tests are also automated, and it is worth putting in real effort to make that happen. If not, you should have a manual testing system that is as painless as possible. I use the following zsh function to run a command automatically whenever a Python file in my working directory is modified:

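~/.zshrc:
...
# Re-run the given command whenever a .py file under the current directory changes.
function watchpy {
    # fd treats its pattern as a regex by default, so "*.py" is rejected;
    # filtering by extension lists the same files.
    fd --extension py | entr "$@"
}
...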

I then execute

watchpy python -m whatever.im.testing --arg1 val1 ...

to run the program I’m working on every time I save a file, and have a terminal with its output open at all times. While I’m working this way, I’ll usually have slow pieces of code commented out to make the feedback loop as short as possible.

Quick aside on process

I often fall into the trap of planning a task out as follows:

  1. Build X,
  2. verify manually that X works,
  3. build unit tests for X.

Then, X frequently turns out to be more complex than I thought, consisting, say, of bits X_1, X_2, X_3, and I end up doing the following instead:

  1. Build bits X_1, X_2, and X_3.
  2. Verify manually that the whole system works, using a coarse-grained ‘integration test’.
  3. It doesn’t work! Go back to 1.
  4. After potentially hours of debugging, build unit tests.

Which is silly, and instead I should be doing

  1. For i in 1..3,
    1. Build X_i
    2. Verify X_i works using a unit test
  2. Verify that the whole system works.

This saves a lot of time, since I don’t have to spend ages narrowing down which component actually caused the problem. Whenever I follow the better process, any issues that I find in step 2 are usually very easy to understand and are either typos in the ‘glue code’ connecting the components, or a misunderstanding of the API of those components.
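As a rough illustration of the better process, here is a toy sketch in Python. The three components (parse, total, report) and the pipeline that glues them together are invented for this example; the test_ prefix follows the pytest naming convention, but the tests also run as plain functions.

```python
def parse(line: str) -> list[int]:
    # X_1: turn "1,2,3" into [1, 2, 3]
    return [int(part) for part in line.split(",")]


def total(numbers: list[int]) -> int:
    # X_2: add the numbers up
    return sum(numbers)


def report(value: int) -> str:
    # X_3: format the result
    return f"total={value}"


def pipeline(line: str) -> str:
    # Glue code connecting the three components
    return report(total(parse(line)))


# Step 1: one small test per component, written right after building it.
def test_parse():
    assert parse("1,2,3") == [1, 2, 3]


def test_total():
    assert total([1, 2, 3]) == 6


def test_report():
    assert report(6) == "total=6"


# Step 2: only now check the whole system.
def test_pipeline():
    assert pipeline("1,2,3") == "total=6"


if __name__ == "__main__":
    for test in (test_parse, test_total, test_report, test_pipeline):
        test()
    print("all tests passed")
```

If test_pipeline fails while the three component tests pass, the problem is almost certainly in the glue code, which is exactly the narrowing-down that otherwise eats hours.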

Reproducing and Thinking Hard About the Problem

Okay, back to the plot. You’ve built a beautiful piece of code, your linter and type checker are happy, you’ve tested early, and you’ve still encountered a non-trivial bug. The first thing to do is to reproduce it, again focusing on making the feedback loop as short as possible: you press a key, you wait 2 seconds, your program explodes. Lovely. Sometimes the problem will only happen at the end of a 40-minute CI pipeline. If you cannot figure out how to make it happen on a shorter feedback loop, you’re setting yourself up for a sad day.

Once I’ve reproduced the issue, I’m often tempted to jump right into a debugger and start stepping through the code. This tends to be an immense time sink, and at the end I usually find that the issue was not particularly convoluted. This realization prompted me to write this rant (sorry, essay) in the first place.

Instead I’ve started doing something different: I put my hands in my lap and ask myself: What sort of issue would manifest itself in this way? In which component is it most likely to be originating? If you’re a True Believer you might even consider setting yourself a timer.

In about a third of cases this is enough to solve the problem, and it can be anywhere from 2x to 10x+ more efficient than going in with the heavy guns. I’ve also got a hunch that this exercise will train me to understand my code better and find bugs more quickly. And even if I can’t find the problem this way, I can at least generate hypotheses to speed up the next step.

Debugger

If you are here, I have failed you. Time to step through the program line by line and see at what point your expectations diverge from what the program is actually doing. Make sure you’re being nice to yourself (sorry, maximizing your productivity) by using the nicest debugger that’s available for your stack, and setting it up to remove as many annoyances as possible. I like to keep a package (called hackz) in my PYTHONPATH with functions for pretty-printing things, to avoid typing things like print(next(iter(x[0].values()))) over and over again, and a sequential drop-in replacement for multiprocessing.pool.Pool.starmap to avoid going mad.
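For a sense of what such helpers might look like, here is a minimal sketch. It is a guess at the idea, not the actual contents of hackz: the function names first_value, pp_first and serial_starmap are hypothetical.

```python
from pprint import pprint
from typing import Any, Callable, Iterable, Mapping, Sequence


def first_value(mapping: Mapping) -> Any:
    # Shorthand for next(iter(mapping.values())).
    return next(iter(mapping.values()))


def pp_first(items: Sequence[Mapping]) -> None:
    # Pretty-print the first value of the first mapping in a sequence,
    # i.e. the print(next(iter(x[0].values()))) dance in one call.
    pprint(first_value(items[0]))


def serial_starmap(func: Callable, iterable: Iterable[Iterable], chunksize=None) -> list:
    # Same call shape as Pool.starmap(func, iterable, chunksize), but runs
    # everything in the current process, one argument tuple at a time.
    # chunksize is accepted only for compatibility and is ignored.
    return [func(*args) for args in iterable]


print(serial_starmap(pow, [(2, 3), (3, 2)]))  # prints [8, 9]
```

Swapping the pool call for something like serial_starmap while debugging keeps all the work in one process, so breakpoints and tracebacks behave the way you would expect.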

What if my problem still isn’t solved??

Idk, have a hug, I guess? At this point you probably won’t like any of the options that are available. Things to try, based on use case:

  • Using fancy tooling to monitor memory usage & performance, if that’s where the problem is.
  • If your issue has to do with concurrency, or is very algorithmic in nature, and you’re mathsy, try to formally prove (a simplified version of) your program correct; the process might give you ideas.
  • Otherwise try reverting to the last working state, confirm that it is actually working, and add changes incrementally until it breaks :/
  • ???
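The "revert and re-add changes incrementally" option can be sketched as a small loop. This is only an illustration: first_breaking_change, the list of changes and the works check are all hypothetical stand-ins for whatever "apply a change" and "does it still work" mean in your project.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Change = TypeVar("Change")


def first_breaking_change(
    changes: Sequence[Change],
    works: Callable[[List[Change]], bool],
) -> Optional[Change]:
    applied: List[Change] = []
    # Confirm that the "last working state" really is working.
    if not works(applied):
        raise ValueError("the baseline is already broken")
    # Re-add the changes one at a time until the check fails.
    for change in changes:
        applied.append(change)
        if not works(applied):
            return change
    return None  # nothing broke; the bug lives somewhere else


# Toy example: pretend that the "new cache" change introduced the bug.
changes = ["rename", "refactor", "new cache", "cleanup"]
culprit = first_breaking_change(changes, lambda applied: "new cache" not in applied)
print(culprit)  # prints "new cache"
```

This is a linear search, matching the description above. If the changes are commits, git bisect does the same job with a binary search, which needs far fewer checks when the history is long.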