Pop lab coding guidelines

Version control

In general, all code you develop in the lab should be maintained within the lab’s GitLab repository. You may maintain your own fork of this space, but you must also keep the corresponding project in the lab space up to date (commit and push changes each day). Please request access to this repository from Mihai.

Testing

You shouldn’t write any code without also planning how to test it. Every project should include data (real or synthetic) for which your code will produce a predictable output, and the way you run the tests and evaluate the results should be documented. Ideally, you should start by creating the test sets and defining the expected output before you write your code, and this information should be part of the specification of your code. This practice is similar to the scientific method – you start by setting out falsifiable hypotheses, then you test them, not the other way around.

Ideally you should define unit tests (i.e., tests that apply to a single unit of your code, such as a single function/procedure) for each module of your code, and automatically run these tests when you make changes to your code. This can be done using the continuous integration / continuous delivery (CI/CD) frameworks provided by most modern systems, such as GitLab (e.g. GitLab CI) and GitHub (e.g. GitHub Actions).

There are various ways of writing tests and setting up CI frameworks (for most Python projects, pytest works well), but just having something in place can significantly reduce the mental burden of software maintenance. It becomes possible to do things like check — every time you make a change to your code — that the resulting code works on multiple operating systems, multiple versions of Python, etc. This is particularly useful when you are coming back to your code after a few weeks / months / years and forget the details.
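
As a minimal sketch of what such unit tests might look like with pytest: the function reverse_complement and the expected outputs below are hypothetical, chosen only for illustration. pytest collects functions whose names start with test_ and reports any assert that fails.

```python
# Hypothetical module under test; the function name and behavior are
# illustrative assumptions, not part of any lab code base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


# pytest discovers functions named test_* and runs them automatically.
def test_reverse_complement_known_sequence():
    # Expected output worked out by hand before the code was written.
    assert reverse_complement("ATGC") == "GCAT"


def test_reverse_complement_empty_sequence():
    assert reverse_complement("") == ""


def test_reverse_complement_is_its_own_inverse():
    sequence = "GATTACA"
    assert reverse_complement(reverse_complement(sequence)) == sequence
```

If this were saved in a file such as test_sequences.py (a made-up name), running pytest in that directory should pick up and run all three tests.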

Coding style

Different programming languages have different “styles” (e.g. the CamelCase convention in Java). In the lab we don’t prefer any specific one, but try to be consistent within your code even if, for example, you use a different convention than the orthodoxy of a particular language may require. (For instance, Python generally recommends using snake_case rather than CamelCase — but we feel that, as long as you are consistent, it doesn’t really matter.)

Please use spacing and indentation to make your code easy to read. In Python, indentation is part of the semantics of your code, but in other languages you must be more diligent, even if commonly-used IDEs usually handle this.

For example, the code:

for(i=0;i<5;i++){j=run_function(i+2*log(i)^5);k=j+2*i;}

is much easier to read if written as:

for (i = 0; i < 5; i++) {
    j = run_function(i + 2 * log(i)^5);
    k = j + 2 * i;
}

As much as possible, try to avoid chaining multiple operations on one line. Fitting as much code as you can in one line may appear clever, but is hard to understand and maintain. (See this article for some examples in Python.)

As an extreme, look up the “Obfuscated C competition” which has resulted in gems such as:

#include<stdio.h>
char *c[] = { "ENTER", "NEW", "POINT", "FIRST" };
char **cp[] = { c+3, c+2, c+1, c };
char ***cpp = cp;
main()
{
    printf("%s", **++cpp);
    printf("%s ", *--*++cpp+3);
    printf("%s", *cpp[-2]+3);
    printf("%s\n", cpp[-1][-1]+1);
    return 0;
}

Clever, but not something you’d enjoy debugging if it did the wrong thing…

Variable names

Part of making your code readable is assigning variables names that make sense. For example, if you read a line from a file, code such as:

for i in f:
    ...

is less readable than:

for line in input_file:
    ...

If you don’t believe me, try to figure out what printf("%s\n", cpp[-1][-1]+1); means in the obfuscated C code listed above. What does the variable cpp even mean? You can’t figure it out until you work out the pointer arithmetic at the top of the file.

Comments

Your code should contain comments. Lots of them. As with the tests described above under “Testing”, comments should be the first thing you write, not something you add after the code is written. As I mention in my advice on communication, a key part of the writing process, one that comes before writing any text, is figuring out the outline of the story you are trying to tell. Code is no different – you must plan out how your code should be organized, and that outline should be spelled out as comments before you start filling in actual code.

There are several critical points in your code where comments must be placed:

  • At the top of each file – Here you need to carefully describe what that file does. If it is the main part of your code, you should describe clearly what the code is intended to do, and anything else that another developer or user may need to know. Yes, you’ll probably duplicate things you are also saying in a README file or in the main documentation for the software you are developing. If in doubt, err towards saying more rather than less.
  • At the beginning of each function/procedure – Here you need to precisely define what the function does. At the very least, this comment should include:
    • A description of each input parameter – not just their type, but what they actually represent
    • A description of any prerequisites for the parameter (e.g., “the function assumes the array nums to be non-empty and contain integers in strict increasing order”)
    • A description of the value(s) returned by the function
    • A description of the guarantees you are providing about the return value(s) (e.g., “The return value is a non-negative number representing the number of prime numbers in the nums array.”)

      This comment should also include details about the algorithm implemented in the function, unless it’s truly obvious (which is rarely the case).
      • Good: “Prime numbers are identified in the input array using trial division.”
      • Better: “Prime numbers are identified in the input array using trial division. Specifically, each number N in the array is divided by each integer in decreasing order, starting with floor(N/2) and stopping at 2. If any division yields a remainder of 0, N is not prime and execution moves to the next number in the array.”
  • At the beginning of each major code block – You can view such code blocks as “sections” in the story told by the function containing them. A brief summary of what these blocks do is useful at this point.

    If a block is a loop, it is also useful to define the “loop invariant” – a value, mathematical equation, or some other condition that has to be true for all iterations of the loop. Violations of this condition indicate errors in the code, so it’s important to state your expectations explicitly. The best place for a comment about the loop invariant is right next to an assert or if statement that makes your program exit with an error if the invariant is not satisfied. (Note that Python skips assert statements entirely when the interpreter is run with the -O option, so you may want to use something like if not invariant: raise ValueError(...) instead of assert.)
  • At all other places in your code where the code may be hard to understand – If you have to think a bit about how to write a particular line of code, it may be worth adding a comment. If it’s not immediately obvious what your code does when you write it, others (or even yourself at a later date) may not understand it when reading it.
  • Complex if / case statements – It’s worth thinking out loud about the logic that you are trying to execute. Small comments like # if we are here, this means that X is true, which can only happen when Y can make your intent much clearer to the reader.
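
The commenting points above can be sketched in one small function. This is only an illustration: count_primes is a hypothetical function built from the examples quoted in the list, and the docstring layout is one possible convention rather than a required lab format.

```python
def count_primes(nums):
    """Count the prime numbers in a list.

    Parameters:
        nums: list of integers to examine, e.g. candidate values read
            from an input file.

    Prerequisites:
        nums is non-empty and contains integers in strictly increasing
        order.

    Returns:
        A non-negative integer: the number of primes in nums.

    Algorithm:
        Trial division. Each number N is divided by every integer from
        floor(N/2) down to 2; a remainder of 0 means N is not prime.
    """
    if not nums:
        raise ValueError("nums must be non-empty")

    prime_count = 0
    previous = None
    for number in nums:
        # Loop invariant: the values seen so far are strictly increasing.
        # Checked with an explicit raise because assert can be disabled.
        if previous is not None and number <= previous:
            raise ValueError(f"nums not strictly increasing at {number}")
        previous = number

        # Numbers below 2 are not prime by definition.
        if number < 2:
            continue

        # Try every divisor from floor(N/2) down to 2.
        is_prime = True
        for divisor in range(number // 2, 1, -1):
            if number % divisor == 0:
                is_prime = False
                break  # a divisor was found; stop testing this number

        # If we are here with is_prime still True, no integer in
        # [2, floor(N/2)] divides number, so it must be prime.
        if is_prime:
            prime_count += 1

    return prime_count
```

Passing an unsorted list such as [3, 2] raises a ValueError at the point where the invariant is broken, which is usually easier to debug than a silently wrong count.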

Command line processing

Most of the code you will write in the lab is intended to be executed from the UNIX command line. It is, therefore, important to follow current conventions about how parameters are passed to the code. For example, you are probably already familiar with command lines of the type:

$ cool_program --help

or

$ cool_program -i input_file.fa

Rather than figuring out your own way to process the input provided to your code, you should use a library that handles such interactions with the user. In Python, for example, the argparse and Click libraries are a good place to start.
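
A minimal sketch of this using argparse from the Python standard library follows. The program name cool_program and the -i option come from the examples above; the long option names and the --kmer-size option are assumptions added for illustration. argparse generates the --help output automatically.

```python
import argparse


def build_parser():
    """Build the command-line parser for a hypothetical cool_program."""
    parser = argparse.ArgumentParser(
        prog="cool_program",
        description="Illustrative example of command-line handling.",
    )
    # Required input file, accepted as -i or --input.
    parser.add_argument("-i", "--input", required=True,
                        help="path to the input FASTA file")
    # Optional parameter with a documented default value.
    parser.add_argument("-k", "--kmer-size", type=int, default=8,
                        help="k-mer size (default: %(default)s)")
    return parser


# An explicit argument list is parsed here for demonstration; calling
# parse_args() with no arguments would read sys.argv instead.
args = build_parser().parse_args(["-i", "input_file.fa"])
print(f"Reading {args.input} with k = {args.kmer_size}")
```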

Interaction with the system

Most of the code you write will need to interact not just with the users of your code, but also with the system on which your code executes. By “system” I mean a range of factors, including system-level parameters (such as number of processors or memory available) but also the locations of various programs or files.

There are several important principles you should observe:

  1. Do not include any absolute paths in your code – For example, don’t hardcode something like INPUT_FASTA = "/home/bob/assemblyproject/v1/seq.fasta" into your code.

    Why?
    • Your code will be much easier to adapt to/run on other systems if you accept these kinds of paths as input to your code (e.g. as a command-line --fasta parameter, where the user can specify the file path).
    • Hardcoding absolute paths in your code is a potential security risk; it provides potential hackers with information they can exploit either directly or through social engineering (e.g., an IT professional may be more likely to trust someone who knows the exact layout of our computational system when asked to install code).
  2. Assume external programs exist in the path rather than hardcoding their location – For example, if your code tries to run perl code, instead of writing /usr/bin/perl, simply write perl, which will then pick the configured perl interpreter irrespective of where it is located.
    • A common practice nowadays is to encourage users to set up a conda environment; you can provide users with a conda.yml file that describes all the software that your code depends on. If the version of a particular package matters, then make sure to mention it in the yml file, preferably in the format bwa>=2.1.5, indicating that version 2.1.5 or any more recent version is acceptable.
    • It is also often helpful to check at the start of your program’s execution that the various requirements it depends on are installed. This avoids wasting the user’s time. (See this article by Torsten Seemann for some relevant advice.)
  3. Use configuration files – If there are certain parameters that influence the behavior of your program, it is helpful to store these parameters in a central place where they can easily be modified or checked in the future. See the section below for more details.
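
One possible way to check up front that external programs are available on the path is shutil.which from the Python standard library. This is a sketch: the list of required programs is a made-up example, and a real program would probably exit with an error rather than just print.

```python
import shutil


def find_missing_programs(program_names):
    """Return the names in program_names that are not found on the PATH."""
    # shutil.which returns None when no executable with that name exists
    # in any directory listed in the PATH environment variable.
    return [name for name in program_names if shutil.which(name) is None]


# Hypothetical dependency list, for illustration only.
REQUIRED_PROGRAMS = ["perl", "bwa"]

missing = find_missing_programs(REQUIRED_PROGRAMS)
if missing:
    print("Missing required programs: " + ", ".join(missing))
```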

Handling “default” parameters

Code frequently contains various parameters that influence its execution but that are not intended to be changed/specified by the user at each execution. Such parameters should be defined in only one place rather than having their value peppered around the code.

For example, assume you have a piece of code that analyzes k-mers in a genome, and you have decided that a good value for k is 8. Your code should look something like:

KMER_SIZE = 8  # default k-mer size
...
for loc in range(0, KMER_SIZE):  # for each position in k-mer
    ...

instead of:

for i in range(0, 8): # for each position in k-mer

If you ever decide to change the value of k, it’s a lot easier to change it in one place in your code, instead of going through the entire code trying to guess whether each occurrence of “8” represents the k-mer size or some other value that coincidentally happens to be 8.

Parameters that a user may reasonably want to change should either be defined as optional command line parameters or specified through a configuration file. Libraries exist that can help you manage such configuration files – for example configparser for parsing INI-style configuration files (the format familiar from Windows .ini files), or json for parsing JSON files (JSON is one of the most commonly used formats for serializing data structures).
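
As a rough sketch of combining a single default with a configuration file, the example below uses configparser. The section name kmers and option name kmer_size are assumptions made up for this example, and the configuration is read from a string only to keep the example self-contained.

```python
import configparser

# The default lives in exactly one place.
DEFAULT_KMER_SIZE = 8

# Hypothetical configuration text; normally this would be in a file
# that is read with parser.read(path).
EXAMPLE_CONFIG = """
[kmers]
kmer_size = 11
"""


def load_kmer_size(config_text):
    """Return the k-mer size from INI-style text, or the default."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback is used when the section or option is absent.
    return parser.getint("kmers", "kmer_size", fallback=DEFAULT_KMER_SIZE)


print(load_kmer_size(EXAMPLE_CONFIG))  # value taken from the config text
print(load_kmer_size(""))              # falls back to DEFAULT_KMER_SIZE
```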

Anticipate issues and check for errors

Even if your code is perfect, there are plenty of opportunities for errors to occur that prevent your program from running successfully. Most common issues arise from interactions between your program and the operating system. You may call a program that is not available, fail to open a file because it has the wrong permissions, run out of memory, etc. Such errors may occur at random, making debugging difficult (e.g., the amount of memory available to your code may depend on what other processes run on the same machine). If you don’t anticipate and explicitly handle such errors, it’s likely that your code will fail without providing sufficient clues that allow you to figure out the cause.

Thus, whenever your code executes an operation that may fail for reasons outside your control, you should verify that the operation executes successfully and properly address the case when it doesn’t. Many programming languages have a mechanism for handling such situations, e.g., the try-catch statements in Java or try-except in Python. As an example in Python:

try:
    function_I_am_not_sure_about()
except Exception as err:
    print(f"The function didn't work properly: {err}")

This code prints an error message in case the function fails. You can add further code to either mitigate the failure (e.g., run a different function) or ensure that your code fails gracefully. Note that the except clause also allows you to explicitly separate out different types of errors, allowing you to choose different ways of handling potential errors (e.g., do one thing if a file cannot be found, but use a different remedy if you timeout a network connection).

Here are some examples of places in your code where you should typically check that the operation completed successfully:

  • whenever accessing a file
  • whenever you call an external program
  • whenever you access a dictionary where a key may be missing
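
The three cases above might be handled along the following lines. This is a sketch under assumptions: the function names are invented for illustration, and printing a message and returning a fallback value is only one of several reasonable responses to each failure.

```python
import subprocess


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as input_file:
            return input_file.readline().rstrip("\n")
    except FileNotFoundError:
        print(f"ERROR: {path} does not exist")
    except PermissionError:
        print(f"ERROR: no permission to read {path}")
    return None


def run_program(command):
    """Run an external command; return True only if it exited with 0."""
    try:
        subprocess.run(command, check=True, capture_output=True)
    except FileNotFoundError:
        # The executable itself was not found on the PATH.
        print(f"ERROR: program not found: {command[0]}")
        return False
    except subprocess.CalledProcessError as err:
        # The program ran but reported a failure.
        print(f"ERROR: {command[0]} exited with code {err.returncode}")
        return False
    return True


def lookup_count(kmer_counts, kmer):
    """Return the count for kmer, treating a missing key as zero."""
    return kmer_counts.get(kmer, 0)
```

Each except clause names a specific error type, so different failures can be given different remedies instead of being lumped together.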

There are ways to make these kinds of checks easier. For example, if your code relies on reading from a user-specified file, the Python command-line parsing library Click has functionality to only accept input filepaths that correspond to files that actually exist on the system. (It can even check if a file has readable permissions!)

You should also account for the possibility that you have coding mistakes. If there are places in your code where certain conditions must be satisfied, make that explicit by using assert or if statements that check the condition; if the statement fails, you can then output additional detailed information that can aid in debugging.

Use logs

Unless your code is fairly simple, you may find it useful to keep closer track of its execution. A common approach for doing so is by keeping a log in which you record key information at given points in the execution. Typically, you want the log to contain information about the precise time when some event took place. You also want to distinguish between different levels of importance, so that a user may choose how much information should be contained in the log file. Typical levels include:

  • DEBUG – nitty-gritty information that a developer may find useful but that is overkill for the typical use of the code.
  • INFO – just some information about the execution of your code
  • WARNING – indicating something is not quite right, but not serious enough to impact the performance (e.g., WARNING – less than 10% disk space available)
  • ERROR – indicating an error occurred that impacted the execution of your code (e.g., ERROR – could not open file: skipping it)
  • CRITICAL – indicating an error from which your code cannot recover (e.g., CRITICAL – insufficient memory: aborting execution)

Ideally the user should be able to toggle between these levels at runtime. Rather than inventing your own approach for logging, it’s best to rely on existing packages, e.g., the logging library in Python.

A nice side-effect of logging is that it helps a lot with profiling your code (figuring out how long each part of your code is taking). For example, let’s say you have one log statement that goes something like Starting to process file... and another, later, that says ...Finished processing file. You can now compare the times at which these statements were logged to figure out exactly how long the file processing step took. This is extremely useful information to have on hand when optimizing your code or debugging it.
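
A small sketch of timestamped, level-based logging with Python's logging library is shown below. The logger name cool_program, the function names, and the sleep call standing in for real work are all assumptions for illustration.

```python
import logging
import time

logger = logging.getLogger("cool_program")  # hypothetical program name


def configure_logging(level_name):
    """Set up timestamped log output at the requested level."""
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(message)s",
        level=getattr(logging, level_name.upper()),
    )


def process_file(path):
    """Placeholder for real work; logs when it starts and finishes."""
    logger.info("Starting to process file %s...", path)
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for the actual processing
    elapsed = time.perf_counter() - start
    logger.debug("Processing took %.3f seconds", elapsed)
    logger.info("...Finished processing file %s", path)
    return elapsed


configure_logging("INFO")
process_file("input_file.fa")
```

With the level set to INFO, the two INFO messages are emitted with timestamps and the DEBUG message is suppressed. The level name could come from a command-line option so that users can choose it at runtime.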

Use of libraries

As much as possible, use existing software libraries rather than reimplementing your own approach for performing common tasks. That being said, try to keep to a minimum the libraries you use in your code – each dependency you introduce makes your code harder to maintain. In general, try to use the language’s default constructs rather than a third-party library, unless the third-party library provides features that are absolutely necessary.

For example, many mathematical functions are available by default in Python’s built-in math library but are also available in the third-party numpy library. Before you go straight to numpy, think of whether you really need the added complexity of this library for your project. (Reliably installing numpy on users’ systems can be a pain point sometimes, so avoiding it for unimportant use cases can actually end up making your life easier!)

Documentation

Remember the golden rule: Do unto others as you would have them do unto you!

Have you ever been frustrated that a piece of code you downloaded was difficult to install, and lacked clear instructions on how to use it? This should give you the impetus to make sure your code is well documented and easy to use.

For a typical program, here is information that you should have in the documentation (ideally easily found under separate headings):

  • Introduction – description of what the program is intended to do, and other relevant high level information;
  • How to obtain and install the program, including information about licensing terms. Ideally you should also include information on how to run a test to make sure the code runs correctly;
  • How to run the program, usually by showing a simple example command for the most common use-case;
  • Details about all the command-line parameters, indicating which parameters are required, and specifying the default values for the optional parameters. Guidance on which factors determine the choice of parameter values should also be included when known. For example, some choice of value may be better for long sequencing reads than for short ones. Another parameter choice may adjust the tradeoff between runtime, accuracy, and/or memory use. Give the users as much information as you have to help them choose the correct parameter settings for their specific context.
  • Examples of use-cases – Highlight common use-cases and indicate how to run your code to address them.
  • Frequently Asked Questions (FAQ) – As people use your code, you’ll start accumulating questions that you hadn’t anticipated when writing the documentation. You can highlight the answers to the most common ones here.
