Time traveling through the Gene Ontology Annotations

Thanks to the wonders of version control, Gene Ontology human gene annotations can be found stretching all the way back to 2004:

http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/gene-associations/gene_association.goa_human.gz

That’s quite cool if you wanted to do a historical analysis of how these annotations are changing with time (which I do). For instance, say you wanted to see how many terms have been marked as obsolete since 2004, and you’ve downloaded the current gene_ontology_ext.obo file and the goa file from ’04.

To see how many distinct GO terms we’ve got annotated in total (the GO ID is column 5):


$ cat gene_association.goa_human_18 | gawkt '{print $5}' | sort | uniq | wc -l
    3989

(by the way, gawkt is an alias: gawkt='awk -F"\t" -v OFS="\t"')

To see how many we’ve got sans obsolescence:

$ cat gene_association.goa_human_18 | python filter_obs.py | gawkt '{print $5}' | sort | uniq | wc -l
    3608

That python script (filter_obs.py) is available below.

So we’ve got 381/3989, or about 9.6%, that have been retired since ’04. That’s not too shabby, although I imagine the GO hierarchy and overall structure have changed more significantly since then. Still, it makes it plausible to track the gene annotations of the majority of terms over the last 8 years.

https://gist.github.com/2580678.js
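For anyone who would rather see the idea in Java, here is a rough sketch of the same kind of filter. It is not a port of the script in the gist: the class name ObsoleteFilter and the toy records in main are made up, and I'm assuming that obsolete terms are marked with an "is_obsolete: true" line in their OBO stanza and that lines starting with "!" in the annotation file are comments that can be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: a sketch of the filtering idea, not filter_obs.py itself.
class ObsoleteFilter {

    // Collect the ids of OBO stanzas that carry "is_obsolete: true".
    static Set<String> obsoleteIds(List<String> oboLines) {
        Set<String> obsolete = new HashSet<>();
        String currentId = null;
        for (String raw : oboLines) {
            String line = raw.trim();
            if (line.startsWith("[")) {
                currentId = null;                      // a new stanza begins
            } else if (line.startsWith("id:")) {
                currentId = line.substring(3).trim();
            } else if (line.equals("is_obsolete: true") && currentId != null) {
                obsolete.add(currentId);
            }
        }
        return obsolete;
    }

    // Keep annotation lines whose GO id (column 5, tab-separated) is not obsolete.
    static List<String> filter(List<String> gafLines, Set<String> obsolete) {
        List<String> kept = new ArrayList<>();
        for (String line : gafLines) {
            if (line.startsWith("!")) continue;        // assumed comment/header line
            String[] cols = line.split("\t");
            if (cols.length >= 5 && !obsolete.contains(cols[4])) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy records with made-up ids, only to show the call pattern.
        List<String> obo = Arrays.asList(
            "[Term]", "id: GO:9999991", "name: toy live term", "",
            "[Term]", "id: GO:9999992", "name: toy retired term", "is_obsolete: true");
        List<String> gaf = Arrays.asList(
            "DB\tID1\tSYM1\t\tGO:9999991",
            "DB\tID2\tSYM2\t\tGO:9999992");
        System.out.println(filter(gaf, obsoleteIds(obo)));
    }
}
```

Lines with fewer than five columns are dropped as well, which seemed like the safest default for a sketch.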

Überwiki

One of the exciting things about our installation of Semantic Mediawiki over at GeneWiki+ is the opportunity to merge in arbitrary datasets and resources. We’ve already brought in the SNP database over at SNPedia and linked the gene pages and SNP pages together with the Disease Ontology.

The Disease Ontology (DO) and Mediawiki’s category system share the same structure (a directed acyclic graph, or DAG), so mapping the DO onto GeneWiki+ was actually fairly painless. It created a series of common “nodes” in our wiki for annotations mined from the gene and SNP text to point to. And because it’s a Semantic Mediawiki, we can transitively associate SNPs, genes, and diseases. Now we can create queries that ask for all the genes and SNPs that are related to various diseases: if a gene->disease link is known, and a SNP->gene link is known, we can then posit that that SNP is also related to that disease.
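The transitive step is easy to picture outside the wiki, too. Here's a minimal Java sketch of the inference, assuming the two kinds of links are held in plain maps; the class name TransitiveLinks and the placeholder identifiers in main are invented for illustration, and this is not how Semantic Mediawiki evaluates its queries.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of SNP -> gene -> disease inference.
class TransitiveLinks {

    // For every SNP, gather the diseases of each gene it is linked to.
    static Map<String, Set<String>> snpToDiseases(Map<String, List<String>> snpToGenes,
                                                  Map<String, List<String>> geneToDiseases) {
        Map<String, Set<String>> inferred = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : snpToGenes.entrySet()) {
            Set<String> diseases = new TreeSet<>();
            for (String gene : entry.getValue()) {
                diseases.addAll(geneToDiseases.getOrDefault(gene, Collections.<String>emptyList()));
            }
            if (!diseases.isEmpty()) {
                inferred.put(entry.getKey(), diseases);   // posit SNP -> disease
            }
        }
        return inferred;
    }

    public static void main(String[] args) {
        // Placeholder names, not real SNPs, genes, or diseases.
        Map<String, List<String>> snpToGenes = new HashMap<>();
        snpToGenes.put("snpA", Arrays.asList("geneX"));
        Map<String, List<String>> geneToDiseases = new HashMap<>();
        geneToDiseases.put("geneX", Arrays.asList("diseaseY"));
        System.out.println(snpToDiseases(snpToGenes, geneToDiseases));
    }
}
```

The result is only a hypothesis about each SNP, of course: the links are posited from two known relationships, not observed directly.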

It’s pretty cool. We wrote a paper about it that should be out in Journal of Biomedical Statistics soon.

While we’re waiting for that paper to go through the review process, we’ve discussed expanding the resources grafted onto GeneWiki+. A low-hanging fruit is the Gene Ontology, a massive collection of terms describing most eukaryotic gene functions. Like the Disease Ontology, it is also a DAG and as such can be mapped directly onto Mediawiki’s categories. Sites like GONUTS have already done it; all we’d be doing is bringing it into a wiki that understands semantic relationships. Once that’s done, all the genes that carry GO annotations can be linked to common GO nodes. If (when?) we start bringing in genes from other organisms, these links can serve as transitive bridges between different species. This sort of linkup was described in the original GO paper, so it’s really nothing too original, just really neat.

A nasty surprise

This has been mentioned before, but it’s good to note: Java’s SimpleDateFormat is not thread-safe. We recently ran into an issue where we were getting absolutely bizarre dates and unreproducible exceptions, and it turned out that the culprit was a static SimpleDateFormat object that was being shared among four threads. Every so often we were lucky enough to get an exception, though it makes me uncomfortable to think about what was parsed incorrectly and logged in the database. Obviously we had to re-run everything once we had synchronized the code that accessed the static SimpleDateFormat.
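For the record, here's roughly what the two usual workarounds look like. This is a minimal sketch and not our production code: the class name SafeDates and the "yyyy-MM-dd" pattern are assumptions picked for the example. The first method serializes access to the shared formatter, which is the kind of fix described above; the second gives every thread its own formatter so nothing is shared at all.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper showing two ways around SimpleDateFormat's lack of thread safety.
class SafeDates {
    private static final String PATTERN = "yyyy-MM-dd";   // assumed pattern

    // Option 1: keep the shared static instance, but lock around every use.
    private static final SimpleDateFormat SHARED = new SimpleDateFormat(PATTERN);

    static Date parseSynchronized(String text) {
        synchronized (SHARED) {
            try {
                return SHARED.parse(text);
            } catch (ParseException e) {
                throw new IllegalArgumentException("not a date: " + text, e);
            }
        }
    }

    // Option 2: one formatter per thread, so no locking is needed.
    private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
        ThreadLocal.withInitial(() -> new SimpleDateFormat(PATTERN));

    static Date parsePerThread(String text) {
        try {
            return PER_THREAD.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a date: " + text, e);
        }
    }

    static String formatPerThread(Date date) {
        return PER_THREAD.get().format(date);
    }

    public static void main(String[] args) {
        Date d = parsePerThread("2012-05-02");
        System.out.println(formatPerThread(d));
    }
}
```

On Java 8 and later, java.time.format.DateTimeFormatter is immutable and thread-safe, so it sidesteps the problem entirely if you can switch APIs.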

This highlights another danger of threading: it sometimes doesn’t matter how carefully you protect your own code, because you may occasionally (frequently) run into libraries that aren’t threadsafe- even ones from major players like Sun/Oracle (who at least note it in the docs, but that makes me wonder- why didn’t they just fix it so that there were synchronized options?). It also drives home the need for copy constructors and minimizing shared data- static objects being prime examples- if you make even a rudimentary attempt at handling threads yourself.

Or you could just use a functional programming language. I’m beginning to see the appeal, to be honest- I would rather have methods that return copies of their object instead of methods that modify their instance in place. Maybe it’s time to start considering Scala?

Update on getting the exit status of a thread

This always winds up happening to me- I think I’ve found some clever way of doing things, and then about twenty minutes after I write about it, I discover a much better, much simpler alternative.

In the case of the previous post, I wanted to get the exit status of a thread- i.e. did it exit normally, or did it fail with an error? I wanted to know specifically so that I could restart it. Annoyingly, and conveniently, there is a much easier way of doing that using the Executors.newSingleThreadScheduledExecutor(ThreadFactory) factory method, which hands back a ScheduledExecutorService. You provide it with a ThreadFactory that’s configured to create the thread for the process you don’t want to die, and the executor will run the process as scheduled, spawning a replacement thread from the ThreadFactory if its worker thread dies. One caveat from the Javadoc: if a run scheduled with scheduleAtFixedRate throws an exception, later runs of that task are suppressed, so the Runnable should catch its own exceptions if it has to keep going.

scheduler = Executors.newSingleThreadScheduledExecutor(myThreadFactory);

// TimeUnit has no WEEKS constant, so two weeks is 14 days; myRunnable is the task to repeat
ScheduledFuture<?> schedule = scheduler.scheduleAtFixedRate(myRunnable, 0, 14, TimeUnit.DAYS);

Bam. Two lines of code. As long as you don’t cancel it and the task doesn’t let an exception escape, your process will be running every two weeks. We can even query the exit status like the last post if we really want to know the post-mortem on our schedule object, but there’s not a lot of point to that.
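One thing worth sketching: the Javadoc for scheduleAtFixedRate says that if any execution of the task throws, subsequent executions are suppressed. So if the job really has to come back at the next tick, one option is to wrap it so that exceptions are caught inside run(). The wrapper below is only an illustration; the names KeepAlive and guarded are made up, and the short 100 ms period is there just so the demo finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper that keeps a periodic task on the schedule after a failure.
class KeepAlive {

    // Returns a Runnable that swallows (and reports) runtime exceptions from the task,
    // so the scheduler never sees them and keeps firing.
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                System.err.println("run failed, will try again at the next tick: " + e);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        Runnable flaky = () -> {
            // Fails on every other run to imitate an unreliable job.
            if (runs.incrementAndGet() % 2 == 0) {
                throw new IllegalStateException("simulated failure");
            }
        };
        scheduler.scheduleAtFixedRate(guarded(flaky), 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(550);
        scheduler.shutdownNow();
        System.out.println("runs so far: " + runs.get());
    }
}
```

The trade-off is that the ScheduledFuture no longer tells you about failures, so anything you want to know about them has to be logged or recorded inside the wrapper.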

Getting the exit status of a thread in Java

Let’s say you have a thread that’s happily chugging along at a task- for example, parsing over 10,000 Wikipedia articles and corresponding medical databases for some heavy-duty automated updating. You’d like to have it so that if the task fails (which is more likely than not) the parent thread will just respawn it, but only if the thread terminated with an exception- i.e., did not exit normally. If it exited normally, well, great, no need to respawn it.

Ordinarily when you spawn threads they run an object that implements the Runnable interface, whose run() method has no return value. You can check if the thread is still alive, of course, and take action if it’s not- but this doesn’t tell you anything about how the thread terminated. You could use the Callable interface, which is basically identical to Runnable, except that you can return a result (maybe like those old-style Unix exit numbers, for instance), but many convenient methods in java.util.concurrent only take Runnables (such as the ScheduledExecutorService.scheduleAtFixedRate method).
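To make the Callable option concrete, here's a small sketch of the exit-number idea. Everything named here (ExitCodes, runAndGetStatus, and the convention that 1 means "threw an exception" and 2 means "interrupted while waiting") is an assumption for the example, not anything java.util.concurrent defines.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: run a Callable on its own thread and report a Unix-style status.
class ExitCodes {

    static int runAndGetStatus(Callable<Integer> job) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> result = pool.submit(job);
            return result.get();                 // the job's own return value
        } catch (ExecutionException e) {
            return 1;                            // assumed code: the job threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            return 2;                            // assumed code: we were interrupted
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndGetStatus(() -> 0));
    }
}
```

This works with submit(), but as noted it doesn't help with scheduleAtFixedRate, which only accepts a Runnable.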

A way to have a runnable method basically give you some clue to its exit status is by querying the ScheduledFuture object you get by executing the scheduleAtFixedRate method:

import java.util.concurrent.*;

ScheduledExecutorService scheduler = 
      Executors.newScheduledThreadPool(1);
ScheduledFuture<?> schedule = 
      scheduler.scheduleAtFixedRate(new RunnableObject(), 
                                      0, 2, TimeUnit.SECONDS);
try {
  Thread.sleep(10000);    // let the task run for ten seconds
  schedule.cancel(true);  // no effect if the task has already failed
  schedule.get();   // always throws here: a periodic task never
                    // completes normally
} catch (CancellationException e) {
  // the task was still healthy when we cancelled it
} catch (ExecutionException e) {
  // the task died from an exception- now we know!
  // e.getCause() is the exception it threw
} catch (InterruptedException e) {
  // this (waiting) thread was interrupted
}

In the block where we see it died from an exception, we can add logic to restart it, if we want (which I do).
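Here's one way that restart logic could look. To keep it short and testable I've swapped the periodic scheduleAtFixedRate call for a one-shot submit(), so this is a variation on the snippet above and not the same thing; the names Respawner and runWithRestarts and the maxAttempts cap are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical supervisor: respawn a task only when it died from an exception.
class Respawner {

    // Returns the attempt number that finished normally, or -1 if every
    // attempt failed or the waiting thread was interrupted.
    static int runWithRestarts(Runnable task, int maxAttempts) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<?> outcome = pool.submit(task);
                try {
                    outcome.get();      // returns only if run() finished without throwing
                    return attempt;     // exited normally, no need to respawn
                } catch (ExecutionException e) {
                    // The task died from an exception: loop around and respawn it.
                    System.err.println("attempt " + attempt + " failed: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
            return -1;                  // gave up after maxAttempts failures
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds, to imitate a flaky long-running job.
        Runnable flaky = () -> {
            if (++calls[0] < 3) throw new IllegalStateException("simulated crash");
        };
        System.out.println("succeeded on attempt " + runWithRestarts(flaky, 5));
    }
}
```

The cap on attempts is there so that a task that always fails doesn't get respawned forever.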

Anyway, as always, I’m sure there are 100 better ways to do this, but this seemed clean and easy. I enjoy using thrown exceptions for more than just printing stacktraces.


Hello world!

Hello world! I’m Erik, and I’m a molecular biology student moonlighting as a bioinformatics/computational biology developer. This blog is just a place to talk about interesting things I’ve found during my work in these fields, especially as they regard transporter proteins, sequence analysis and bioinformatic software development.

Disclaimer: I do not have formal training in software development though I love learning about design patterns and object-oriented programming. I enjoy writing useful software almost as much as I enjoy working on bioinformatics problems and would love to share any useful things I’ve encountered. However, my lack of training makes me wildly naive; if you stumble upon something I’ve written that is particularly misinformed, it would be wonderful if you corrected me.