Tuesday, February 12, 2013

"Document Design and Purpose, Not Mechanics"

If you ever write code for scientific computing (chances are you do if you're here), stop what you're doing and spend 8 minutes reading this open-access paper:

Wilson et al. Best Practices for Scientific Computing. arXiv:1210.0530 (2012). (Direct link to PDF).

The paper makes a number of good points regarding software as a tool just like any other lab equipment: it should be built, validated, and used as carefully as any other physical instrumentation. Yet most scientists who write software are self-taught, and haven't been properly trained in fundamental software development skills. 

The paper outlines ten practices every computational biologist should adopt when writing code for research computing. Most of these are the usual suspects that you'd probably guess - using version control, workflow management, writing good documentation, modularizing code into functions, unit testing, agile development, etc. One that particularly jumped out at me was the recommendation to document design and purpose, not mechanics. 

We all know that good comments and documentation is critical for code reproducibility and maintenance, but inline documentation that recapitulates the code is hardly useful. Instead, we should aim to document the underlying ideas, interface, and reasons, not the implementation.

For example, the following commentary is hardly useful:

# Increment the variable "i" by one.
i = i+1

The real recommendation here is that if your code requires such substantial documentation of the actual implementation to be understandable, it's better to spend the time rewriting the code rather than writing a lengthy description of what it does. I'm very guilty of doing this with R code, nesting multiple levels of functions and vector operations:

# It would take a paragraph to explain what this is doing.
# Better to break up into multiple lines of code.
sapply(data.frame(n=sapply(x, function(d) sum(is.na(d)))), function(dd) mean(dd))

It would take much more time to properly document what this is doing than it would take to split the operation into manageable chunks over multiple lines such that the code no longer needs an explanation. We're not playing code golf here - using fewer lines doesn't make you a better programmer.

2 comments:

  1. I'm very guilty of over-long functions as well. Nice to hear someone else confess to this!

    I'm going to have "You're not playing code golf" tattooed on my hand to see if it helps.

    ReplyDelete
  2. Glad to see the paper doesn't harp too much on Agile Development. It really means a lot more than "incremental changes" and was in fact a huge disappointment in the end. Like all software "best practices" it only works in very certain situations with very certain groups of people.

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.