Time traveling through the Gene Ontology Annotations

Thanks to the wonders of version control, Gene Ontology human gene annotations can be found stretching all the way back to 2004:

http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/gene-associations/gene_association.goa_human.gz

That’s quite cool if you want to do a historical analysis of how these annotations change over time (which I do). For instance, say you want to see how many terms have been marked as obsolete since 2004, and you’ve downloaded the current gene_ontology_ext.obo file and the goa file from ’04.

To see how many distinct GO terms are annotated in the ’04 file (the GO ID is column 5):


$ cat gene_association.goa_human_18 | gawkt '{print $5}' | sort | uniq | wc -l
    3989

(by the way, gawkt is an alias: alias gawkt='awk -F"\t" -v OFS="\t"')

To see how many we’ve got sans obsolescence:

$ cat gene_association.goa_human_18 | python filter_obs.py | gawkt '{print $5}' | sort | uniq | wc -l
    3608

That python script (filter_obs.py) is available below.

So 381 of the 3989 terms (3989 minus 3608), about 9.6%, have been retired since ’04. That’s not too shabby, although I imagine the GO hierarchy and overall structure have changed more significantly since then. Still, it makes it plausible to track the gene annotations of the majority of terms over the last 8 years.

https://gist.github.com/2580678.js
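
The gist above holds the actual script. For anyone who just wants the idea, here is a rough sketch of how a filter like filter_obs.py might work; it is not the original script. It assumes the OBO file flags retired terms with an "is_obsolete: true" line inside a [Term] stanza and that the GO ID sits in column 5 of the annotation file. The function names and the command-line argument are made up for illustration.

```python
import sys


def obsolete_ids(obo_lines):
    """Collect the IDs of [Term] stanzas flagged 'is_obsolete: true'."""
    obsolete = set()
    current_id = None
    in_term = False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas describe GO terms.
            in_term = (line == "[Term]")
            current_id = None
        elif in_term and line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif in_term and line.startswith("is_obsolete:"):
            if line.split(":", 1)[1].strip() == "true" and current_id:
                obsolete.add(current_id)
    return obsolete


def filter_annotations(gaf_lines, obsolete):
    """Yield annotation lines whose GO ID (column 5) is not obsolete.

    Lines with fewer than five columns (e.g. headers) pass through unchanged.
    """
    for line in gaf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[4] not in obsolete:
            yield line


# Hypothetical usage: the OBO path is passed as the first argument (an
# assumption of this sketch) and annotations arrive on stdin.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as obo:
        retired = obsolete_ids(obo)
    sys.stdout.writelines(filter_annotations(sys.stdin, retired))
```

Saved under a hypothetical name such as filter_sketch.py, it would slot into the same pipeline: cat gene_association.goa_human_18 | python filter_sketch.py gene_ontology_ext.obo | gawkt '{print $5}' | sort | uniq | wc -l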

Überwiki

One of the exciting things about our installation of Semantic MediaWiki over at GeneWiki+ is the opportunity to merge in arbitrary datasets and resources. We’ve already brought in the SNP database over at SNPedia and linked the gene pages and SNP pages together with the Disease Ontology.

The Disease Ontology (DO) and MediaWiki’s category system share the same structure (a directed acyclic graph, or DAG), so mapping the DO onto GeneWiki+ was actually fairly painless. It created a series of common “nodes” in our wiki for annotations mined from the gene and SNP text to point to. And because it’s a Semantic MediaWiki, we can transitively associate SNPs, genes, and diseases. Now we can write queries that ask for all the genes and SNPs related to various diseases: if a gene->disease link is known and a SNP->gene link is known, we can posit that the SNP is also related to that disease.
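
To make the transitive step concrete, here is a toy sketch in Python of that SNP->gene->disease inference. It is not how Semantic MediaWiki runs its queries; infer_snp_disease is a hypothetical helper, and the identifiers (rs0000001, GENE_A, and so on) are placeholders rather than real annotations.

```python
def infer_snp_disease(snp_to_genes, gene_to_diseases):
    """Posit SNP->disease links by chaining SNP->gene and gene->disease links."""
    inferred = {}
    for snp, genes in snp_to_genes.items():
        diseases = set()
        for gene in genes:
            # Every disease linked to the gene is posited for the SNP too.
            diseases |= gene_to_diseases.get(gene, set())
        if diseases:
            inferred[snp] = diseases
    return inferred


# Placeholder data, made up for illustration only.
snp_to_genes = {
    "rs0000001": {"GENE_A"},
    "rs0000002": {"GENE_A", "GENE_B"},
    "rs0000003": {"GENE_C"},  # no known gene->disease link for GENE_C
}
gene_to_diseases = {
    "GENE_A": {"disease X"},
    "GENE_B": {"disease Y"},
}

links = infer_snp_disease(snp_to_genes, gene_to_diseases)
for snp in sorted(links):
    print(snp, sorted(links[snp]))
```

With the placeholder data, rs0000001 picks up disease X, rs0000002 picks up both disease X and disease Y, and rs0000003 gets nothing because its gene has no disease link.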

It’s pretty cool. We wrote a paper about it that should be out in the Journal of Biomedical Semantics soon.

While we’re waiting for that paper to go through the review process, we’ve discussed expanding the resources grafted onto GeneWiki+. One piece of low-hanging fruit is the Gene Ontology, a massive collection of terms describing most eukaryotic gene functions. Like the Disease Ontology, it is also a DAG and as such can be mapped directly onto MediaWiki’s categories. Sites like GONUTS have already done it- all we’d be doing is bringing it into a wiki that understands semantic relationships. Then all the genes that have GO annotations can be linked to common GO nodes. If (when?) we start bringing in genes from other organisms, these links can serve as transitive bridges between different species. This sort of linkup was described in the original GO paper, so it’s really nothing too original, just really neat.

Update on getting the exit status of a thread

This always winds up happening to me- I think I’ve found some clever way of doing things, and then about twenty minutes after I write about it, I discover a much better, much simpler alternative.

In the case of the previous post, I wanted to get the exit status of a thread- i.e. did it exit normally, or did it fail with an error? I wanted to know specifically so that I could restart it. Annoyingly, and conveniently, there is a much easier way of doing that using the newSingleThreadScheduledExecutor(ThreadFactory) method in the Executors class, which hands back a ScheduledExecutorService. You provide it with a ThreadFactory that’s configured to create the worker thread for the process you don’t want to die, and the executor will run the process as scheduled, spawning a new thread from the ThreadFactory if the worker thread dies from a failure. One caveat from the scheduleAtFixedRate documentation: if a run of the task itself throws an exception, later runs are suppressed, so the task should catch its own exceptions if it needs to stay on schedule.

scheduler = Executors.newSingleThreadScheduledExecutor(myThreadFactory);

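// RunnableObject is your task (any class implementing Runnable).
// TimeUnit tops out at DAYS, so two weeks is 14 days.
ScheduledFuture<?> schedule = scheduler.scheduleAtFixedRate(new RunnableObject(), 0, 14, TimeUnit.DAYS);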

Bam. Two lines of code. As long as you don’t cancel it (and no run throws an uncaught exception), your process will be running every two weeks. We can even query the exit status like in the last post if we really want the post-mortem on our schedule object, but there’s not a lot of point to that.

Getting the exit status of a thread in Java

Let’s say you have a thread that’s happily chugging along at a task- for example, parsing over 10,000 Wikipedia articles and corresponding medical databases for some heavy-duty automated updating. You’d like to have it so that if the task fails (which is more likely than not) the parent thread will just respawn it, but only if the thread terminated with an exception- i.e., did not exit normally. If it exited normally, well, great, no need to respawn it.

Ordinarily when you spawn threads they implement the Runnable interface, which does not have a return value. You can check if the thread is still alive, of course, and take action if it’s not- but this doesn’t tell you anything about how the thread terminated. You could use the Callable interface, which is basically identical to Runnable, except that you can return a result (maybe like those old-style Unix exit numbers, for instance), but many convenient methods in java.util.concurrent only take a Runnable (such as the ScheduledExecutorService.scheduleAtFixedRate method).

One way to get a Runnable to give you some clue about its exit status is to query the ScheduledFuture object returned by the scheduleAtFixedRate method:

import java.util.concurrent.*;

// RunnableObject is your task class (implements Runnable)
ScheduledExecutorService scheduler = 
      Executors.newScheduledThreadPool(1);
ScheduledFuture<?> schedule = 
      scheduler.scheduleAtFixedRate(new RunnableObject(), 
                                      0, 2, TimeUnit.SECONDS);
try {
  Thread.sleep(10000);    // let the task run for a while
  schedule.cancel(true);  // has no effect if the task already died
  schedule.get();         // a periodic task never returns a value, so
                          // this throws one of the exceptions below
} catch (CancellationException e) {
  // the task was still scheduled and our cancel() stopped it
} catch (ExecutionException e) {
  // a run of the task threw an exception (see e.getCause())- now we know!
} catch (InterruptedException e) {
  // the current (waiting) thread was interrupted
}

In the block where we see it died from an exception, we can add logic to restart it, if we want (which I do).

Anyway, as always, I’m sure there are a hundred better ways to do this, but this seemed clean and easy. I enjoy using thrown exceptions for more than just printing stacktraces.