So I’ve been working a lot with NCBI GEO recently for a paper on the Gene Ontology. Along the way I wound up implementing about 70% of the well-known R package GEOquery in Python (I’m much more fluent in Python than R) and decided it might be worth submitting to the Biopython project. Their existing GEO parser is woefully inadequate and slightly buggy: as far as I can tell it can’t handle the curated GEO DataSet (GDS) format, it has no programmatic access to NCBI GEO, and it offers no way to do any statistical analysis on the resulting microarray data.
My fork, which is available here, revamps the Geo package to provide the following features:
- Automatic retrieval and parsing of GEO files, either from NCBI or from the local filesystem
- Pretty-printing of metadata, column, and table information
- Conversion of GDS records into a form that provides a NumPy matrix representation of the sample/probe data
- Rudimentary statistical analysis methods (filtering probes, detecting enriched genes for a subset, binary log transformation of probe values)
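To give a flavor of what the GDS-to-matrix conversion, log transform, and probe filtering involve, here’s a minimal standalone sketch. It is not the code from the fork: the function names (`parse_gds_table`, `filter_probes`) and the tiny dataset are made up for illustration, and it assumes the usual GDS SOFT layout of a tab-separated table between `!dataset_table_begin` and `!dataset_table_end`, with `ID_REF` and `IDENTIFIER` columns first and `null` for missing values.

```python
import io

import numpy as np

# Made-up excerpt in the style of a GDS SOFT file (not real GEO data).
SOFT_TEXT = """\
^DATASET = GDS_EXAMPLE
!dataset_title = Hypothetical example
!dataset_table_begin
ID_REF\tIDENTIFIER\tGSM1\tGSM2\tGSM3
probe_a\tGENE1\t8\t16\t32
probe_b\tGENE2\t4\t4\t4
probe_c\tGENE3\t2\tnull\t8
!dataset_table_end
"""


def parse_gds_table(handle):
    """Return (probe_ids, sample_ids, matrix) from a GDS-style SOFT handle.

    The matrix has one row per probe and one column per sample;
    'null' entries become NaN.
    """
    probes, samples, rows = [], [], []
    in_table = False
    seen_header = False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!dataset_table_begin"):
            in_table = True
            continue
        if line.startswith("!dataset_table_end"):
            break
        if not in_table or not line:
            continue
        fields = line.split("\t")
        if not seen_header:
            # Header row: ID_REF, IDENTIFIER, then one column per sample.
            samples = fields[2:]
            seen_header = True
            continue
        probes.append(fields[0])
        rows.append([np.nan if v == "null" else float(v) for v in fields[2:]])
    return probes, samples, np.array(rows, dtype=float)


def filter_probes(probes, matrix, min_variance=0.0):
    """Keep probes with no missing values and variance above min_variance."""
    complete = ~np.isnan(matrix).any(axis=1)
    variance = np.nanvar(matrix, axis=1)
    keep = complete & (variance > min_variance)
    kept_ids = [p for p, k in zip(probes, keep) if k]
    return kept_ids, matrix[keep]


probes, samples, matrix = parse_gds_table(io.StringIO(SOFT_TEXT))
logged = np.log2(matrix)  # binary log transform of the probe values
kept, filtered = filter_probes(probes, logged)
print(kept)  # ['probe_a']
```

Here `probe_b` is dropped because it is flat across samples and `probe_c` because it has a missing value, which is roughly the kind of cleanup you want before looking for enriched genes.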
I haven’t written unit tests for it yet (a persistent failing of mine, one of many, I’ll admit), mostly because it was developed a bit on the fly during my work. That said, I know it works for at least a subset of use cases, and it’s well documented.
The two modified files are here for the morbidly curious: