Learning Rust for bioinformatics, part I

I’m taking a micro-sabbatical (ha!) after dragging two major projects to publication/submission, and as a way to relax, I decided to start learning Rust.

Rust makes a compelling case for use in bioinformatics. Its tooling and language features emphasize good practices for scientific applications:

  • Tooling: Cargo, its package manager and build system, strictly adheres to semantic versioning (unlike Go, which, as far as I can tell, doesn’t care at all). This means you know exactly which versions of your dependencies were used to build a package, and every version of a Rust package (called a crate) is stored permanently in the crates.io online repository. This is essential for reproducibility and makes it easy to distribute Rust source code.
  • Memory safety: The language is built around memory safety and robust error handling, such that a whole class of errors is prevented outright by the compiler. Encouraging more robust code seems like a good idea for the wild west of scientific programming.
  • Data manipulation: Certain language features in Rust also make writing it feel similar to higher-level languages. In particular, its use of iterators and method chaining let you implement lazy data processing through chaining together map() and filter() functions, and reminds me of working with dplyr in R.
  • Documentation: It’s trivial to add documentation to a library in Rust – doc comments are similar to, and as easy as, docstrings in Python – and every crate published to crates.io has its docs automatically rendered as browsable HTML on docs.rs. This to me feels like the best of Python and R+Roxygen2.
  • Promising libraries: rust-bio is a growing set of well-implemented bioinformatics algorithms and format parsers. And nom is a library for building extremely fast parsers, which could be easily leveraged to create efficient parsers for lots of other formats. Some of the things that are missing right now are good equivalents to Python’s scipy and numpy libraries.
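As a small illustration of the iterator style I mentioned, here’s a sketch of my own (the gc_content helper and the sequences are invented for the example, not from any particular crate) that chains map() and filter() the way you’d pipe verbs in dplyr, topped with the kind of doc comment that docs.rs renders:

```rust
/// Fraction of G/C bases in a DNA sequence.
///
/// Doc comments like this one are what docs.rs renders to HTML.
fn gc_content(seq: &str) -> f64 {
    let gc = seq
        .chars()
        .filter(|c| matches!(c, 'G' | 'C' | 'g' | 'c'))
        .count();
    gc as f64 / seq.len() as f64
}

fn main() {
    let seqs = ["GATTACA", "GCGCGC", "ATATAT"];

    // Lazy pipeline: nothing executes until collect() drives the iterator.
    let high_gc: Vec<f64> = seqs
        .iter()
        .map(|s| gc_content(s))
        .filter(|&gc| gc > 0.5)
        .collect();

    println!("{:?}", high_gc); // only GCGCGC survives the filter
}
```

Each adapter returns a new iterator rather than an intermediate collection, so the whole chain is a single pass over the data.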

Rust isn’t a replacement for data analysis in R and Python, but from what I’ve seen, I think it will be a viable replacement for applications that would otherwise require C/C++ in high-performance scientific computing.

The Rust logo is licensed by Mozilla under the CC-BY license.


But then it was too late

What follows is an excerpt from a book called They Thought They Were Free, by Milton Mayer; it was written in the aftermath of WWII and is composed of interviews with people who lived (and sometimes supported) the Nazi regime. Smaller excerpts exist on the internet, but here I’ve transcribed as much of it as I could, because I think this, more than anything, has to be read and understood now of all times:

“What no one seemed to notice,” said a colleague of mine, a philologist, “was the ever widening gap, after 1933, between the government and the people. […] You know, it doesn’t make people close to their government to be told that this is a people’s government, a true democracy, or to be enrolled in a civilian defense, or even to vote. All this has little, really nothing, to do with knowing one is governing.

“What happened here was the gradual habituation of the people, little by little, to being governed by surprise; to receiving decisions deliberated in secret; to believing that the situation was so complicated that the government had to act on information which the people could not understand, or so dangerous that, even if the people could not understand it, it could not be released because of national security [. . .]

“This separation of government from people, this widening of the gap, took place so gradually and so insensibly, each step disguised (perhaps not even intentionally) as a temporary emergency measure or associated with true patriotic allegiance or with real social purposes. And all the crises and reforms (real reforms, too) so occupied the people that they did not see the slow motion underneath, of the whole process of government growing remoter and remoter.

“You will understand me when I say that my Middle High German was my life. It was all I cared about. I was a scholar, a specialist. Then, suddenly, I was plunged into all the new activity, as the university was drawn into the new situation; meetings, conferences, interviews, ceremonies, and, above all, papers to be filled out, reports, bibliographies, lists, questionnaires. And on top of that were the demands in the community, the things in which one had to, was ‘expected to’ participate that had not been or had not been important before. It was all rigamarole, of course, but it consumed all one’s energies, coming on top of the work one really wanted to do. You can see how easy it was, then, not to think about fundamental things. One had no time. […]

“The dictatorship, and the whole process of its coming into being, was above all diverting. It provided an excuse not to think for people who did not want to think anyway. […] Most of us did not want to think about fundamental things and never had. There was no need to. Nazism gave us some dreadful, fundamental things to think about – we were decent people– and kept us so busy with continuous changes and ‘crises’ and so fascinated, yes, fascinated, by the machinations of the ‘national enemies,’ without and within, that we had no time to think about these dreadful things that were growing, little by little, all around us. Unconsciously, I suppose, we were grateful. Who wants to think?

“To live in this process is absolutely not to be able to notice it—please try to believe me—unless one has a much greater degree of political awareness, acuity, than most of us had ever had occasion to develop. Each step was so small, so inconsequential, so well explained or, on occasion, ‘regretted,’ that, unless one were detached from the whole process from the beginning, unless one understood what the whole thing was in principle, what all these ‘little measures’ that no ‘patriotic German’ could resent must some day lead to, one no more saw it developing from day to day than a farmer in his field sees the corn growing. One day it is over his head.

“How is this to be avoided, among ordinary men, even highly educated ordinary men? Frankly, I do not know. I do not see, even now. Many, many times since it all happened I have pondered that pair of great maxims, Principiis obsta and Finem respice– ‘Resist the beginnings’ and ‘Consider the end.’ But one must foresee the end clearly and certainly and how is this to be done, by ordinary men or even by extraordinary men? Things might have changed here before they went as far as they did; they didn’t, but they might have. And everyone counts on that might. […]

“You see,” my colleague went on, “one doesn’t see exactly where or how to move. Believe me, this is true. Each act, each occasion, is worse than the last, but only a little worse. You wait for the next and the next. You wait for one great shocking occasion, thinking that others, when such a shock comes, will join with you in resisting somehow. You don’t want to act, or even talk, alone; you don’t want to ‘go out of your way to make trouble.’ Why not?–Well, you are not in the habit of doing it. And it is not just fear, fear of standing alone, that restrains you; it is also genuine uncertainty.

“Uncertainty is a very important factor, and, instead of decreasing as time goes on, it grows. Outside, in the streets, in the general community, ‘everyone’ is happy. One hears no protest, and certainly sees none. […] In the university community, in your own community, you speak privately to your colleagues, some of whom certainly feel as you do; but what do they say? They say, ‘It’s not so bad’ or ‘You’re seeing things’ or ‘You’re an alarmist.’

“And you are an alarmist. You are saying that this must lead to this, and you can’t prove it. These are the beginnings, yes; but how do you know for sure when you don’t know the end, and how do you know, or even surmise, the end? On the one hand, your enemies, the law, the regime, the Party, intimidate you. On the other, your colleagues pooh-pooh you as pessimistic or even neurotic. You are left with your close friends, who are, naturally, people who have always thought as you have. […]

“So you wait, and you wait. But the one great shocking occasion, when tens or hundreds or thousands will join with you, never comes. That’s the difficulty. If the last and worst act of the whole regime had come immediately after the first and smallest, thousands, yes, millions would have been sufficiently shocked–if, let us say, the gassing of the Jews in ’43 had come immediately after the ‘German Firm’ stickers on the windows of non-Jewish shops in ’33. But of course this isn’t the way it happens. In between come all the hundreds of little steps, some of them imperceptible, each of them preparing you not to be shocked by the next. Step C is not so much worse than Step B, and, if you did not make a stand at Step B, why should you at Step C? And so on to Step D.

“And one day, too late, your principles, if you were ever sensible of them, all rush in upon you. The burden of self-deception has grown too heavy, and some minor incident, in my case my little boy, hardly more than a baby, saying ‘Jew swine,’ collapses it all at once, and you see that everything, everything, has changed and changed completely under your nose. The world you live in–your nation, your people– is not the world you were born in at all. The forms are all there, all untouched, all reassuring […]. But the spirit, which you never noticed because you made the lifelong mistake of identifying it with the forms, is changed. Now you live in a world of hate and fear, and the people who hate and fear do not even know it themselves; when everyone is transformed, no one is transformed. […]

“You have gone almost all the way yourself. Life is a continuing process, a flow, not a succession of acts and events at all. It has flowed to a new level, carrying you with it, without any effort on your part. On this new level you live, you have been living more comfortably every day, with new morals, new principles. You have accepted things you would not have accepted five years ago, a year ago, things that your father, even in Germany, could not have imagined.

“Suddenly it all comes down, all at once. You see what you are, what you have done, or more accurately, what you haven’t done (for that was all that was required of most of us: that we do nothing). You remember those early meetings of your department in the university when, if one had stood, others  would have stood, perhaps, but no one stood. A small matter, a matter of hiring this man or that, and you hired this one rather than that. You remember everything now, and your heart breaks. Too late. You are compromised beyond repair.”

Everybody knows

Everybody knows that the dice are loaded
Everybody rolls with their fingers crossed
Everybody knows the war is over
Everybody knows the good guys lost
Everybody knows the fight was fixed
The poor stay poor, the rich get rich
That’s how it goes
Everybody knows
Everybody knows that the boat is leaking
Everybody knows that the captain lied
Everybody got this broken feeling
Like their father or their dog just died
Everybody talking to their pockets
Everybody wants a box of chocolates
And a long-stem rose
Everybody knows

Rest in peace, Leonard Cohen

R package docs + github.io

Ever wish you had a painless way to generate spiffy online documentation for your R package, a la ggplot2? Now you can, using github’s free project site hosting and hadley’s staticdocs package. Here’s how:

  1. Make some nice documentation for your project, ideally using Roxygen and devtools::document(). Follow the instructions at http://r-pkgs.had.co.nz/man.html.
  2. Create a new git branch for your project named ‘gh-pages’. Github recognizes this as a special branch name, so an index.html file here will be rendered and visible at http://{your_github_username}.github.io/{project_name}.
  3. Check out ‘gh-pages’ and from that branch, run: staticdocs::build_site(site_path='.')
  4. Watch as staticdocs creates a beautiful web page for your project. Add all the files it generates to git, commit, and push (remember: stay on the gh-pages branch!).
  5. Visit your new site at the URL in #2.

The next step is to keep it updated as you develop your package. To facilitate this, I’ve written a small Makefile:

docs:
	Rscript -e "devtools::document(roclets=c('rd', 'collate', 'namespace', 'vignette'))"

gh-pages:
	git checkout gh-pages
	git merge master -X theirs -m "merge master"

site: docs gh-pages
	Rscript -e "staticdocs::build_site(site_path='.', launch=FALSE)"
	git commit -am 'updated docs'
	git checkout master

update: site
	git push
	git checkout gh-pages
	git push
	git checkout master

Now, whenever you make a change to your package that you want reflected in the docs/site, just run make update and it will rebuild the docs, switch to the gh-pages branch, rebuild the site, then push both master and gh-pages branches to Github.

I’m using this for my random collection of R functions here:

http://eclarke.github.io/eclectic

open file limits and bcl2fastq

I was trying to demultiplex a MiSeq run with the Illumina utility `bcl2fastq` today and got the error “too many open files (24)”.

As it turns out, if you have more than about 250 samples in your run, the utility will open more than 1024 file handles while processing (two per sample for R1 and R2, plus possibly the index reads – I’m not sure). 1024 open files is the default per-process limit on many Linux systems; however, you can get around it very easily by running `ulimit -n 4000` (or some other number). You do not need to be root to do this – the limit applies to your current shell session and its children, so it will be reset once you log out.
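A quick sketch of the workaround (exact limits will vary by system; you can only raise the soft limit up to the hard limit without root):

```shell
# Show the current soft limit on open files for this shell
ulimit -n

# Show the hard limit; the soft limit can be raised up to this
# value without being root
ulimit -Hn

# Raise the soft limit for this session before demultiplexing
ulimit -n 4000 || echo "4000 exceeds the hard limit; check ulimit -Hn"
```

Run this in the same shell you launch `bcl2fastq` from – child processes inherit the new limit, but it doesn’t survive logout.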


Keeping a lab notebook with org-mode, git, Papers, and Pandoc (Part II)

I promised a while ago that I’d post the code I used to make org-mode a powerful lab notebook, and here it is. In Step 1, I present the emacs lisp for the lab-notebook minor mode. In Step 2, I show you how to leverage Papers to work with Emacs and Pandoc. In Step 3, I show you how to hook Pandoc up with Emacs.

As I mentioned previously, these are all hacked-together solutions. My requirements and setup will likely be different from yours, and I have probably forgotten a few key steps. For instance, while I use a Mac, the only parts that are specific to Macs are the hooks with Papers, a reference library application. Emacs, Pandoc, and bibtex are all fairly cross-platform solutions, I think.

Step 1: lab-notebook mode in emacs

The code in the gist below provides a minor mode for Emacs that enables checkin-on-save and auto-export from Papers (using the Applescript shown in Step 2). Load this in your .emacs file and enable it with M-x lab-notebook-mode, or add -*- eval: (lab-notebook-mode) -*- to the top of your notebook file.

(defun ensure-in-vc-or-checkin ()
  (interactive)
  (if (file-exists-p (format "%s" (buffer-file-name)))
      (progn (vc-next-action nil) (message "Committed"))
    (ding)
    (message "File not checked in.")))

(defun export-bibtex ()
  "Exports Papers library using a custom applescript."
  (interactive)
  (message "Exporting papers library...")
  (shell-command "osascript ~/notebook/papers_bibtex_export.scpt"))

;;;###autoload
(define-minor-mode lab-notebook-mode
  "Toggle lab notebook mode.
In this mode, org files that are saved are automatically committed by the VC
system in Emacs. Additionally, Papers library export to bibtex is hooked to
the <f5> key.
Future additions may hook a confluence upload or Org export to this mode as well."
  ;; initial value
  nil
  ;; indicator
  :lighter " LN"
  ;; keybindings
  :keymap (let ((map (make-sparse-keymap)))
            (define-key map (kbd "<f5>") 'export-bibtex)
            map)
  ;; body
  (add-hook 'after-save-hook 'ensure-in-vc-or-checkin nil 'make-it-local))

(provide 'lab-notebook-mode)

Step 2: Papers auto-export and citekeys

Papers comes with a Citations tool that allows you to insert a citekey in any text field, including Emacs. A citekey is a short, unique reference to a publication or entry in your reference library, used by tools to automatically format your citations upon printing or export. To use, summon Citations using whatever keybinding you chose in Papers’ preferences and add a citekey like you would in a Word document. In our setup, we’ll be using Pandoc, so be sure to change the citekey type to ‘Pandoc’ (Papers -> Preferences -> Citations -> Format: Pandoc Citation).

Pandoc doesn’t talk to Papers, but Papers can export its library in a format that Pandoc will understand, called bibtex. We will want this to happen automatically before Pandoc attempts to render our notebook. The applescript below exports my Papers library as a bibtex file in the location specified by “path_to_notebook_folder” (replace with whatever you want). Save this file in the same location (or wherever) and update the part of the Gist above that starts (shell-command "osascript ... with the right path and name that matches whatever you saved it as.

tell application "Papers"
    set outFile to "path_to_notebook_folder/library.bib"
    export ((every publication item) as list) to outFile
end tell

The benefit of adding this into Emacs is that you can either press a key (by default, F5) to automatically keep your bibtex file up-to-date, or add a hook so that when you render your notebook using Pandoc, it automatically updates your bibtex file first before trying to build your citations.

Step 3: Use ox-pandoc in emacs to export a pretty version of your notebook.

While org-mode does include its own LaTeX and HTML export commands, I found it easier to use Pandoc and the ox-pandoc Emacs package. The benefit of Pandoc is that it exports to a much wider variety of formats and has a simple, easy citations manager, pandoc-citeproc. Citeproc can use the bibtex library we exported in Step 2 – all we have to do is add the following somewhere in our notebook org file:
#+PANDOC_OPTIONS: bibliography:library.bib
Make sure there is a library.bib file in the same folder as your notebook file, of course. Now, to export to Latex with formatted references, simply use M-x org-pandoc-export-to-latex-pdf and it will produce a beautiful PDF copy of your notebook. Note: in case it wasn’t clear, you will need to install Pandoc and ox-pandoc, since they are not built in. Only recent versions of Pandoc handle org-mode files, so make sure you’re up to date. ox-pandoc is available on the Emacs package repo Marmalade.

Example notebook.org file

An example notebook.org file. Since Github automatically renders org-mode files when they’re displayed, this is a link to the raw source so you can see the markup used.

Pandoc can export to Github-flavored markdown, so here is the same file rendered:

<2014-09-25 Thu>

Reading this (Wang et al. 2010), in particular Supp. report 4. Has to do with bootstrapping population estimates using the different enzymes used to find integration sites.

<2014-09-26 Fri>

Notes on (Wang et al. 2010) :popest:petersen:intsites:

The review (here A. Chao 2001) indicates sample coverage approaches may be most appropriate for models where the capture probabilities can vary both over time and between species (which is certainly the case for estimating populations from longitudinal T cell sequencing data). I'll need to read more (esp. (Tsay and Chao 2001) and (Chao and Tsay 1998)).

Also found the package Rcapture which appears to fit a Poisson regression to incident counts. I've gotten it to work (kind of) but R crashed trying to plot the model so I'll need to come back to this. If memory serves, it was substantially more conservative in its estimates than the Chao2 estimate.

References

Chao, A, and P K Tsay. 1998. “A sample coverage approach to multiple-system estimation with application to census undercount.” Journal of the American Statistical Association 93 (441): 283–93. doi:10.2307/2669624.

Chao, Anne. 2001. “An overview of closed capture-recapture models.” Journal of Agricultural, Biological, and Environmental Statistics 6 (2). Springer-Verlag: 158–75. doi:10.1198/108571101750524670.

Tsay, P K, and A Chao. 2001. “Population size estimation for capture - recapture models with applications to epidemiological data.” Journal of Applied Statistics 28 (1): 25–36. http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=mekentosj&SrcApp=Papers&DestLinkType=FullRecord&DestApp=WOS&KeyUT=000165844500002.

Wang, Gary P, Charles C Berry, Nirav Malani, Philippe Leboulch, Alain Fischer, Salima Hacein-Bey-Abina, Marina Cavazzana-Calvo, and Frederic D Bushman. 2010. “Dynamics of gene-modified progenitor cells analyzed by tracking retroviral integration sites in a human SCID-X1 gene therapy trial.” Blood 115 (22): 4356–66. doi:10.1182/blood-2009-12-257352.

Further reading

Keeping a lab notebook with org-mode, git, Papers, and Pandoc (Part I)

Lab notebooks for computational biologists can be tricky. In my experience, I often don’t work in a linear, experimental fashion that’s conducive to writing down methods and results the way we’ve been taught to do as biology graduate students. My usual workflow involves a lot of reading, a lot of data exploration in R, and by-hand data munging.

With the exception of R and its suite of reproducible research tools such as Sweave and Knitr, it’s very difficult to lay down a cohesive, immutable “paper trail” of work that serves as a record the same way a formal lab notebook does. Moreover, it’s hard to keep track of ideas and thoughts over time. There’s nothing quite as… disheartening? as coming across what you thought was a new approach or idea and realizing you had been there a year ago.

My ideal requirements for a lab notebook would then be as follows:

  1. Permanence. This has two aspects, the first being immutability: I can’t go back and alter what I’ve written without some sort of record and explanation of the change. The second is durability: it should not live only on my local machine, but should ideally be immediately backed up to a server.
  2. Citations. I organize all the papers I’m reading with Papers on the Mac. It’s not a perfect piece of software, but it works well in the areas I need it to. I would need a lab notebook to integrate with Papers so that I can easily insert references to papers and know months later a) what I was talking about and b) how to easily find them in my library.
  3. Ease-of-use. I live in my terminal, like many computational people. I want my lab notebook to be accessible in the terminal, just like a wet-lab scientist would have their notebook by them on the bench. Moreover, I want it to be searchable using the same tools I use to work with data, not some ugly XML or binary format that I cannot parse on the command line.
  4. Discoverability. While at this point it’s unlikely anybody would find something particularly illuminating in my lab notebook, anybody who inherits my projects or wants to continue my work should be able to read some version of my notebook easily and find relevant information quickly. (This person could easily be me after working on some other project for a while.) In this sense, the notebook should have a coherent structure and internal format so that you could produce a table of contents or list of entries and their associated topics. For thesis meetings, it should be able to be converted to a printable, readable document.

The tools that fulfill these requirements all exist already. SVN or Git on a remote server solves the issue of permanence. Papers includes a citekey insertion service to add Latex or Pandoc citekeys to any document that accepts keyboard input, and its internal library format is exportable as a Bibtex library. Emacs is my preferred text editor and is primarily a command-line application (and an awesome one at that), and includes hooks to version control systems. Finally, Emacs includes org-mode, a to-do list and note-taking mode that provides document structure, tags, and easy export to Latex or HTML. What’s even better is that Pandoc, a command-line document converter, can work with org-mode .org files and convert them to a huge number of alternate formats, as well as handle citations (in case you don’t want to export to Latex).

I’ve just finished writing a set of glue scripts and Emacs lisp that makes a manageable lab notebook. As of now, my notebook:

  • Automatically exports my Papers library to a Bibtex file in the same directory as my notebook every time I save (using Papers’ Applescript hooks)
  • Automatically commits and uploads both the notebook and the Bibtex file to SVN on save (but is agnostic to what VC system is being used)
  • Exports to a PDF with citations and a table-of-contents with a single keypress. (using ox-pandoc and pandoc-citeproc)
  • Is entirely contained within plain-text .org files, which are entirely readable outside Emacs (the org format looks a little like mutant Markdown)
  • Can parse and execute inline code blocks in bash, R, and Python (via the org-mode Babel extension), so that I can document what I do outside of the reproducible reports generated with Sweave.
  • Works with tables and inline spreadsheets so I can quickly and easily summarize data in the document itself (another amazing org-mode function).
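The bullet points above can all live in one plain-text entry. Here is an illustrative mock-up (the headline, tags, and numbers are invented, not from my actual notebook) showing a dated headline with tags, a Babel source block, and an org table:

```org
* <2014-10-01 Wed> Read depth summary                      :qc:samples:
Quick per-sample read counts, computed inline with Babel and
summarized in an org table below.

#+BEGIN_SRC R :results table
data.frame(sample = c("A", "B"), reads = c(12000, 9500))
#+END_SRC

| sample | reads |
|--------+-------|
| A      | 12000 |
| B      |  9500 |
```

Everything here is grep-able plain text, which is exactly what requirement 3 asks for.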

In Part II I’ll publish the code I used to accomplish all of this, a lot of which will be pretty inelegant and hackish. However, I think that it’ll be a useful starting point for anybody else who wants to adopt this workflow.

Well-designed things

In no particular order:

  • dplyr: data manipulation package for R. Since becoming familiar with dplyr, hard things are easy and easy things are easy. Tedious things are easy. And everything just looks clean and understandable.
  • aeropress: a weird little coffee maker from the guys that make the spinning disks. It makes a single nice cup of coffee, is self-cleaning, and appears to be virtually indestructible (as opposed to the stovetop espresso makers I’ve owned).
  • this bookstand: made from clear polycarbonate, has held everything from cookbooks, to textbooks, to laptops, for over 6 years now. Simple, adjustable design and, like the aeropress, appears to be extremely robust.
  • arq: a powerful and well-engineered backup solution for the Mac. Does iterative, encrypted, and interruptable backups. I use it to back up all my photos to Amazon Glacier and Google Drive, and all my work files to the lab fileserver. Works silently in the background and encrypts all its backups on the destination. You can set monetary (for AWS) and data (for everything) budgets that it will stay within.