New GEO Query/Analysis module for BioPython

So I’ve been working a lot with NCBI GEO recently for a paper on the Gene Ontology. During the course of this work I wound up implementing about 70% of the famous R package GEOQuery in Python (as I’m much more fluent in Python than R) and decided that it might be worthwhile to submit to the BioPython project. Their existing GEO parser is woefully inadequate and slightly buggy (I don’t believe it can handle the curated GEO Dataset format, it has no programmatic access to NCBI GEO, and offers no way to do any statistical analysis on the resulting microarray data).

My fork, which is available here, revamps the Geo package to provide the following features:

  1. Automatic retrieval and parsing of GEO files, either from NCBI or from the local filesystem
  2. Pretty-printing of metadata, column, and table information
  3. Ability to convert GDS records into a form that provides a Numpy matrix representation of the sample/probe matrix
  4. Rudimentary statistical analysis methods (filtering probes, detecting enriched genes for a subset, binary log transformation of probe values)

I still haven’t written unit tests for it all yet (a persistent failing- one of many, I’ll admit) mostly because it was developed a bit on-the-fly during my work. However, I also know that it works for at least a subsection of uses, and it’s well-documented.

The two modified files are here for the morbidly curious:

Bio/Geo/__init__.py

Bio/Geo/Records.py

Quote

This is me writing Java.

You have no idea the pain I feel when I sit down to program. I’m walking on razor blades and broken glass. You have no idea the contempt I feel for C++, for J2EE, for your favorite XML parser, for the pathetic junk we’re using to perform computations today. There are a few diamonds in the rough, a few glimmers of beauty here and there, but most of what I feel is simply indescribable nausea.

— Steve Yegge, Moore’s Law is Crap

Time traveling through the Gene Ontology Annotations

Thanks to the wonders of version control, Gene Ontology human gene annotations can be found stretching all the way back to 2004:

http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/gene-associations/gene_association.goa_human.gz

That’s quite cool if you wanted to a historical analysis of how these annotations are changing with time (which I do). For instance, if you wanted to see how many terms have been marked as obsolete since 2004, and you’ve downloaded the current gene_ontology_ext.obo file and the goa file from ’04:

To see how many total annotations we’ve got:


$ cat gene_association.goa_human_18 | gawkt '{print $5}' | sort | uniq | wc -l
    3989

(by the way, gawkt=’awk -F”\t” -v OFS=”\t”‘)

To see how many we’ve got sans obsolescence:

$ cat gene_association.goa_human_18 | python filter_obs.py | gawkt '{print $5}' | sort | uniq | wc -l
    3608

That python script (filter_obs.py) is available below.

So we’ve got 381/3989 – 9.5%- that have been retired since ’04. That’s not too shabby, although I imagine the GO hierarchy and overall structure has changed more significantly since then. Still, it makes it plausible to track the gene annotations of the majority of terms over the last 8 years.

https://gist.github.com/2580678.js