Here I provide some general guidance about scientific writing, primarily reflecting my own opinions and biases. This material is therefore most useful for PhD students in my lab, though some of it may be useful more generally.
To start, the most important advice: Write! Write! Write! Writing is a skill that is developed through careful repetition, and you will only get good at it if you keep doing it. Practice as often as you can, whether you are writing scientific text, fiction, or even an email or a text message. As with any other skill, your practice must be deliberate. As a music teacher told my son:
“Practice makes perfect” is inaccurate. “Practice makes permanent”. The correct saying should be “Perfect practice makes perfect.”
Specifically, irrespective of what you write, follow the full process: outline the ideas and the order in which you will present them, flesh out the text, use correct spelling and grammar, then re-read, revise, and re-read some more. Even when writing a text message. Definitely when writing a scientific paper.
Now, coming back to scientific writing. Before you delve into writing your next paper, make sure you have something interesting to say. To help students in my lab, I have put together a checklist that should apply to most types of papers. The goal of this checklist is to help you decide whether your paper will be of interest to a broader audience, i.e., whether your paper is, at first glance, publishable.
Once you have decided that you have something to say, make sure you say it in an effective way. It is not sufficient to express your results and opinions in a way that makes sense to you (and perhaps to others in the lab); you must express yourself in a way that makes sense to the broad audience that will read your paper. It doesn’t matter what you say. It matters what others understand when they read what you said.
This is not the place to give you a full course on writing (technical or otherwise), and I am certainly not the most qualified person to teach such a course. Instead, I recommend a few starting resources. Note the emphasis on starting. You must constantly seek out new tricks and expand your writing toolbox.
- They say, I say. by Gerald Graff and Cathy Birkenstein. A great book about writing that frames the writing process as a dialogue with others (that audience I mentioned above). The book also includes a collection of fill-in-the-blank templates that can help you cut through writer’s block.
- The science of scientific writing. by George Gopen and Judith Swan. American Scientist, 78, pp. 550-558. 1990. This is a great article that works in detail through how a thread of thought is developed from sentence to sentence in a scientific article.
- OWL/The Purdue online writing lab. This is an incredible resource on writing from Purdue University.
- Tandy Warnow’s guide to writing your first paper.
- 10 tips on how to write less badly. by Michael Munger. Published in the Chronicle of Higher Education; hopefully it will remain accessible without a subscription.
Using word processors
First, I should state that I don’t use LaTeX and I strongly discourage my students from using it. My main reasons include:
- LaTeX is a technology that stopped evolving in the 1990s. Modern word processors have evolved to a level where LaTeX is not really needed anymore. In particular, I find the equation editor in OpenOffice/LibreOffice to be quite good, and equation typesetting was one of the main reasons people used to push for LaTeX.
- Everybody knows how to use a modern word processor, at least at a basic level. LaTeX has a relatively steep learning curve. Using a modern word processor thus removes barriers to working across fields and academic levels (e.g., few high school students or biologists know LaTeX).
- Modern word processors have built-in features for checking grammar and spelling. Doing these checks in LaTeX requires more effort. Even if grammar/spell-checkers are imperfect, they are an invaluable sanity check, particularly for inexperienced writers.
- Modern word processors make it easy to include comments and highlight changes to a document. Again, this is difficult to do in LaTeX directly. Overleaf, the buggy “Google Docs” for LaTeX, includes such features, but it is yet another system that you have to log into and where you have to pay for premium features. Writing, particularly in an academic research lab, involves back and forth between co-authors and numerous edits, comments, etc. I have tremendously improved my writing by seeing the changes others made to text that I had written. The ability to easily annotate drafts is, thus, critical for developing one’s writing skills, and this ability is lacking in LaTeX.
Now, onto less opinionated points about the mechanics of writing.
One nice feature of LaTeX is that it structures the text into a series of blocks, and the style of different types of blocks can be globally controlled. This makes it easy to change, for example, the font used in headings, all at once, without having to go through the whole document and fix each heading independently. This capability is now available in all text authoring tools, including word processors such as MS Word, OpenOffice, LibreOffice, etc., as well as in web authoring tools (including the one I’m using now). Yet many students (and faculty) continue to format their text by individually formatting lines of text.
To make things more concrete, here is a screen shot:
In the figure above, assume you are trying to make the text “Heading” be a heading for a section of the document. The incorrect way to do so is to access the menu marked with A in the figure, where you can change the font, size, color, style, etc. Yes, the text will look “the right way”, but you’ll have to remember the combination of settings for all other headings of the same type, and it’s easy to make mistakes, which will make your text look unprofessional and may even confuse the reader. Plus, as I’ve already mentioned, if you change your mind about how the headings are formatted, you have to go through the document, line by line, to change the formatting. Instead, head to the menu marked B in the figure and simply select the heading style you want. The styles are customizable, and you can change all instances of a style at once by simply editing the style.
In LibreOffice, the menu looks different but contains the same functionality:
In short, always use styles to define the structure of the document. It will make your life easier, and it will also make it easier for journals to reformat your paper to fit their styles. It should also make it easier to convert between word processors and LaTeX, should that ever be necessary.
Another feature that is frequently ignored when using word processors is the proper use of cross-references within the document. Most scientific papers contain figures and tables, and all of these building blocks must be properly captioned (more on that below) and referenced from the text. A very common mistake is to simply include text in the paper such as: “See Figure 1”. If you later decide to change the order of figures in the paper, then you must carefully check the paper to make sure you change the numbering in all the places where the figure is referenced, creating yet more opportunities for mistakes to creep into your document.
The better solution is to use the “cross-reference” feature available in most word processors. First, you must create a caption for your figure or table. You can do so by right-clicking on the image and selecting “Insert caption” from the context menu, in both MS Word and OpenOffice/LibreOffice. Then, when you are ready to reference the figure in your text, simply insert a cross-reference to the image/table. This cross-reference will automatically update if you reorganize your text and change the order (and therefore the numbering) of figures or tables.
In OpenOffice/LibreOffice, you can do so from the “Insert” menu, which will then allow you to select what the reference text will indicate. The most common choice is “Category and number”, which will generate the desired “Figure XX” text.
In MS Word, you find the cross-references in the “References” tab. You will need to select “Only label and number” when describing the style of the reference in order to generate the desired “Figure XX” text.
Note that this is just a high-level introduction to using cross-references in word processors. Please play with these features and also read the relevant documentation in order to become more familiar with their use. From now on, you should NEVER type in the number of a figure or table when referencing it in your manuscript.
Since we discussed referencing figures and tables, it is important to briefly discuss the captions you create for these display items in your papers. Too often I see captions that look like:
“Figure 1. Dot-plot of genome similarity.”
To understand what the figure really says, you now need to find a reference to it in the text, and hope the text clarifies it. A better practice is to provide sufficient information in the caption for a reader to interpret the figure without the help of the text. Such a description usually starts with a short title followed by a longer description.
“Figure 1. Comparison of the genome sequences of Genome A and Genome B. The dots represent segments of 500bp that are shared by the two genomes and the x and y coordinates represent the location(s) in the two genomes where the segments are located.”
Better yet, tell the reader what they should get from the figure.
“Figure 1. Comparison of the genome sequences of Genome A and Genome B. The dots represent segments of 500bp that are shared by the two genomes and the x and y coordinates represent the location(s) in the two genomes where the segments are located. The X pattern in the middle indicates a large (~10kbp) genomic inversion.”
Note that in all the examples above, the caption no longer includes a technical description of the type of plot used. Referring to the figure as a “dot plot” is ultimately irrelevant to the message the figure is trying to convey. Dot plots can convey different messages and there are also messages that dot plots cannot convey. Saying the figure is a dot plot is no more useful than saying that the figure was created using colored lines and an Arial font. Accurate but irrelevant.
As in the case of figures and tables, the best way to manage citations is to have the references linked to the bibliography and automatically updated as you add and remove references. Word processors cannot do that by themselves. LaTeX can, through the BibTeX system, but this involves a largely manual process of building paper-specific files of references and remembering the specific identifier assigned to each reference whenever you need to cite it. You can get good at this, but even then, it takes a bit of time. Wouldn’t you rather spend that time shaping the story presented in your paper?
There are a number of citation managers that work with many word processors. Probably the most feature-rich (though not necessarily the best designed) is EndNote, the citation manager I personally use. It’s fairly expensive, particularly for a student, but the cloud version is available for free at the University of Maryland (and perhaps many other universities). EndNote interfaces directly with both MS Word and OpenOffice/LibreOffice, and includes a good set of citation and bibliography templates consistent with many scientific journals. It is also not too bad at managing and searching a large collection of citations, making it easy to maintain an ever-growing collection of papers you have read and cited. As an aside: you should not cite a paper you have not (carefully) read.
Similar in functionality to EndNote are Zotero and Mendeley. The former is free, and the latter comes in both free and paid versions. All three systems can operate directly on files stored on your computer, so you don’t necessarily need to embrace the cloud to use them.
A last citation manager that is becoming popular is Paperpile, which is specifically designed to work with Google Docs, i.e., it only operates in the cloud. Like many other cloud offerings (including Overleaf), it tries to push you into subscription fees for anything beyond basic usage. Furthermore, you can only collaborate with people who also use Paperpile, while other citation managers are better at interoperability. I have seen no evidence that Paperpile can effectively manage citations in a way that is compatible with most journals’ requirements, and thus I strongly discourage students from using this system for any publishable work.
When to use and not to use the cloud
Back in my day, writing a paper collaboratively involved sending around annotated versions of the manuscript from author to author. More and more, writing is done collaboratively through cloud services, like Google Drive. A major advantage is that multiple authors can work together on the text at the same time, speeding up the writing process. Early in the writing process, people may actually work in parallel on the document. In practice, though, authors still take turns editing, and simply having your document in the cloud does not eliminate the need to coordinate carefully between authors.
A second drawback to using the cloud is that, by default, it is not obvious what changes different people make to the text. It’s easy to re-read a 1-page report before submitting it to make sure the multiple people who edited it didn’t introduce errors or inconsistencies. It’s almost impossible to do so for a 12-page technical paper with the associated figures, appendices, supplementary material, etc. Furthermore, unless people highlight how they edited your writing, it will be hard for you to learn how to improve. Thus, even when using the cloud, I recommend that people make their edits in “suggestion” mode so that the changes are explicit to the lead author.
Only one person is responsible for the content of the paper. Period. Irrespective of the number of authors. Who this person is needs to be explicit from the beginning. It could be the PI, or one of the students, or one of the collaborators. Irrespective of who it is, it is their responsibility to read, re-read, and re-re-read the text to make sure formatting, spelling, grammar, etc. are correct and consistent throughout the manuscript. This is the person who, at the end of the writing process, places a “freeze” on the collaborative writing process and becomes the sole owner/gatekeeper of the manuscript. Irrespective of how you started writing, at the end of the process the “owner” of the manuscript will circulate it among the authors as an email copy, then receive and reconcile the edits, and repeat the process until all authors are comfortable with the submission of the manuscript.
I stress: All authors must read and agree with the manuscript being submitted.
To avoid confusion, it is best to version the document files being sent around. A simple way to do so is to include the date in the name. Using the format YYYYMMDD also creates a string that can be sorted lexicographically (i.e., the way most file managers do) in a way that matches the age of the manuscript. Each author making changes to the manuscript can add their initials to differentiate their version from the others circulating around. For example, assume you are writing a paper creatively named MyPaper.docx and you send it out to co-authors on May 12, 2022. You should rename your version: MyPaper20220512.docx. Now, once I read it and make a number of changes, I will return MyPaper20220512-mp.docx. You can now distinguish the file I sent you from the one Donald Knuth edited, which would be called MyPaper20220512-dk.docx.
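The naming scheme above can also be scripted. Here is a minimal Python sketch, assuming a hypothetical helper called versioned_name (it is not part of any tool mentioned on this page), that builds the file names from the example and checks that plain string sorting places the later date last.

```python
from datetime import date


def versioned_name(stem, sent_on, initials=None, suffix=".docx"):
    """Build a file name such as MyPaper20220512-mp.docx (hypothetical helper)."""
    stamp = sent_on.strftime("%Y%m%d")  # YYYYMMDD sorts chronologically as text
    tag = f"-{initials.lower()}" if initials else ""
    return f"{stem}{stamp}{tag}{suffix}"


sent = date(2022, 5, 12)
original = versioned_name("MyPaper", sent)        # version sent to co-authors
from_mp = versioned_name("MyPaper", sent, "mp")   # copy returned with edits
from_dk = versioned_name("MyPaper", sent, "dk")   # copy edited by another author
later = versioned_name("MyPaper", date(2022, 6, 1))  # an illustrative later round

# Plain string sorting, with no date parsing, puts the newest date last.
ordered = sorted([later, from_mp, original, from_dk])
print(ordered)
```

Within a single date, the relative order of the copies depends on how the hyphen and the period compare as characters, so rely only on the date ordering.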
If you are in my lab, please do the same even if you are sending around a PDF compiled from LaTeX. Also, just as when coding, please name the file something more creative than “paper” (or “main”, as I frequently see in the LaTeX world) so that it’s immediately clear from the name what paper we are talking about. As a graduate student, this paper may be the only one in your life, but for your co-authors, it may be one among many. A good practice is to include your name in the file name, e.g., Pop_WABI2022_20220512.docx.
Overall paper structure
Below is a brief discussion of the typical sections that form a paper. I am not going to discuss here the order in which they should appear in the manuscript (which varies by manuscript type and journal), nor is this a full manual on how these sections should be created. Please refer to the resources at the beginning of this page if you are looking for help in crafting your text.
Note that each section in the paper has a specific purpose, but it’s OK to discuss certain things in the “wrong” section as long as it’s done with the goal of helping the reader understand the paper. For example, brief references to methods can occur in the results section even though the full details are also presented in the Methods section. Be careful, however, not to over-do this, and largely keep the points you make in the relevant section.
The abstract is what most people will read from your paper. This is what will sell people on whether they should read the paper in more detail or not. This is also what makes journal editors and reviewers form a first (and perhaps decisive) opinion about your work. Thus, the abstract has to be written very carefully. Here you have to make the case for why your paper is relevant, and outline what people will learn from it. It essentially summarizes the information that you worked through when filling out the checklist before deciding whether it’s worth writing the paper.
The introduction has to set the stage for the reader by defining the impact of the work you are about to present, and has to help the reader appreciate the innovation over the state of the art that your work represents. The introduction can also include some idea about how you’ll frame your arguments, particularly if you are going to make the case for your work in an unusual way (e.g., a new type of experiment or setting).
Importantly, do not end the introduction with a “table of contents” as I too often see: “We start with a background, then describe the algorithm in section 2, then show some results and conclude with a discussion”. Don’t all papers do this? Tell me something surprising.
The background is sometimes included in the introduction itself, and provides the reader with a description of the state of the art as well as with the definitions/vocabulary needed to understand the rest of the paper. Like everything else in the paper, the background has to be carefully crafted to align with the key message of the paper. Thus, when discussing prior research in the field, don’t simply say what this prior work did. Focus on the aspects of the prior work that are relevant to your paper, and highlight the gaps in this research that are targeted by your work. The background should not read like a comprehensive review of a field.
The methods section describes, in detail, the methods used to generate the results presented in the paper. If your paper is mostly about the results of some experiment, the methods section tends to be fairly dry and very specific, as it provides the details necessary for others to reproduce the results of your experiments. When using software, it is important to include version numbers and also specify the parameters used if they differ from default settings. To help the reader, it is useful to use sub-headings aligned with the experiments described in the results section.
If the paper describes a new method, then the methods section gets to be a mini-paper on its own, as you describe the new method in detail but also provide relevant background or highlight experimental results whenever necessary to help the reader to understand your choices. Of course, in addition to your new method you will also have to include the details necessary to reproduce the results presented in the paper.
The results section is where you report the results of the experiments you ran. It’s useful to present these results in a way that supports the story you are trying to tell. You should also include a bit of information about the experimental design and methods used to generate the results, particularly when such information is important for interpreting the results presented. Full experimental design and methods information should still be presented in the Methods section. You can also briefly editorialize, but lengthy discussions should be presented in the Discussion section. By and large, you want the results section to describe the objective picture of what your experiments revealed, and your subjective interpretation gets included in the Discussion.
The way you present results must be precise: include the relevant units and actual measurements of confidence/accuracy (such as p-values, correlation coefficients, etc.). You should not make imprecise statements or qualitative assessments not directly supported by numbers. For example, if you say “we see significantly more widgets of type A than B”, then you had better have the statistics to support the word “significantly”. You should also present your data in a consistent way that allows readers to compare different figures. This means that you need to use consistent scales, colors, styles, etc.
Also, be only as precise as necessary. Saying that a genome is 1,001,347 bp long conveys the same message as saying that it is 1Mbp long, yet the former includes a level of specificity not relevant to the story, and the latter is easier to read and interpret. Numbers should always include thousands separators (and be right-justified in tables), again to make it easier to compare them to each other.
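If you generate your tables with a script, both points are easy to automate. Here is a small Python sketch. The helper names (table_row, in_mbp) and the lengths of Genome B and Genome C are made up for illustration; only the 1,001,347 bp value comes from the example above.

```python
def table_row(name, length_bp, width=12):
    """Left-align the name; right-justify the length with thousands separators."""
    return f"{name:<10}{length_bp:>{width},}"


def in_mbp(length_bp):
    """Round a length in bp to whole Mbp, for use in running text."""
    return f"{length_bp / 1_000_000:.0f}Mbp"


# Hypothetical lengths, except the first, which is the example from the text.
genomes = {"Genome A": 1_001_347, "Genome B": 25_300, "Genome C": 12_480_912}

# Right-justified columns line up the digits, so magnitudes compare at a glance.
for name, length_bp in genomes.items():
    print(table_row(name, length_bp))

# In prose, the rounded value is usually all the story needs.
print(in_mbp(1_001_347))
```

Rounding to whole Mbp only makes sense for genomes in that size range; pick the unit and the number of digits that match the level of detail your story needs.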
The discussion is your chance to interpret what the results say and to shape the main story of the paper. What you say here has to focus on “surprise”: tell the reader something that they may not have known before reading your paper and before seeing your results. Place your results in the context of what people knew before your work (i.e., what’s in the Background section). Here you can also discuss the limitations of your data, methods, experiments, etc., and qualify your message with the appropriate context.
The conclusion is sometimes included with the discussion. It simply summarizes what you want people to get from your paper. Don’t repeat what you just said, or provide another table of contents for the manuscript.
If in doubt, do not include anything in supplementary material. Why? You can read more in this paper:
Use and mis-use of supplementary material in science publications. Pop and Salzberg. BMC Bioinformatics, 16, 2015.
Most importantly, you should not even consider supplementary material until the manuscript is completed, carefully edited, and you are ready to format it for submission to a journal. If a figure, table, or paragraph is relevant to the story you are telling, it should be in the main manuscript. If it is not relevant to the story you are telling, then it shouldn’t be in the manuscript at all, even in supplementary material.
Once you start cutting down the text to fit within a page limit, whether the limit is determined by the journal or by the human attention span, the first thing you should do is delete text/figures, not move them to the supplementary material. Only once all non-essential information has been removed, and only if you are still over the page limit, should you consider moving some segments of your paper to supplementary material. This material must be referenced from the main paper (after all, you decided it is relevant to the main story), and must be referenced as specifically as possible. In other words, “See Section 2.1 in Supplementary Material” is strongly preferred over “See Supplementary Material”.