Using an SSH tunnel and SOCKS proxy to avoid paywalls

Part of my life as somebody who regularly reads academic papers is dealing with the paywalls that most of them are locked behind. I hate them with the passion of a thousand fiery suns when I encounter them; fortunately I don’t often have to since I spend most of my life on campus. But for those times when (god forbid) I have to access imprisoned documents from the luxury of my own home, I can either stare balefully at the paywall screen, fight with the library’s often-misconfigured proxy service, or […drum roll…] pretend I’m on the University’s network by tunneling in over SSH!

This method is for Mac OS X 10.9 “Mavericks”: we’ll point the system at a local SOCKS proxy on port 8080, which forwards your web traffic through an SSH tunnel into a server inside the network you want to appear to be coming from.

First, open System Preferences > Network > Advanced > Proxies. Select the SOCKS proxy option and enter “localhost” for the host and 8080 for the port. It doesn’t matter whether you turn it on at this point; just make sure you click “OK” and then “Apply” to save these settings.

Next, copy these three functions* into your .bashrc file (or wherever else you keep your repository of neat bash functions), making sure to change the SSH login (USERNAME@HOSTNAME.EDU) in toggleproxy and maketunnel:

function toggleproxy {
    # check to see if the SOCKS proxy is currently enabled
    if [[ $(networksetup -getsocksfirewallproxy Wi-Fi | grep '^Enabled') == "Enabled: No" ]]; then
        networksetup -setsocksfirewallproxystate Wi-Fi on
        echo "SOCKS on!"
        # check for an existing SSH tunnel and, if there isn't one, start one
        if [[ -z $(ps aux | grep '[0-9] ssh -D 8080') ]]; then
            echo -ne "Don't see an active ssh tunnel on 8080, starting one now..."
            ssh -D 8080 -f -C -q -N USERNAME@HOSTNAME.EDU # Change this from the defaults!
            [[ $? == 0 ]] && echo " success!" || echo " failed 😦"
        fi
    else
        networksetup -setsocksfirewallproxystate Wi-Fi off
        # only show this message if there's an active SSH tunnel
        if [[ -n $(ps aux | grep '[0-9] ssh -D 8080') ]]; then
            echo "SOCKS off! You may want to kill your existing SSH tunnels with 'killtunnel'."
        else
            echo "SOCKS off!"
        fi
    fi
}

function maketunnel {
    ssh -D 8080 -f -C -q -N USERNAME@HOSTNAME.EDU # Change this from the defaults!
}

## kills all SSH connections with port forwarding to 8080
function killtunnel {
    # plain grep here: the BSD grep that ships with OS X doesn't support -P
    for pid in $(ps -u $USER | grep '[0-9] ssh -D 8080' | awk '{print $2}'); do
        kill $pid
    done
}

Once this is saved and loaded into your profile, just run  

 toggleproxy 

from your command line to automatically switch the SOCKS proxy on or off. If you’re turning it on, the function will attempt to open a new SSH tunnel if it doesn’t detect one already present.

I didn’t have it automatically kill your tunnels because you might have one open for some other reason. Plus, you can disable and re-enable SOCKS repeatedly over the same SSH tunnel, so there’s no need to kill it every time. To kill any tunnels you have open, run

 killtunnel 

 

* I am not a very good Bash hacker, so I am 100% positive there’s a better way to write these functions. But this works for now.

Taking notes with Markdown and LaTeX using IPython notebooks

I don’t know why I haven’t been using this forever, but I recently discovered the following near-ideal ways to take notes in CS or math-type classes on my computer:

  1. Install IPython on your computer
  2. From the folder where you want to save your notes, launch the IPython notebook server:
     $ ipython notebook 
  3. In the resulting browser window, create a new notebook
  4. Change the cell type of the current cell to “Markdown”
  5. Write notes on the fly, wrapping anything you want rendered as LaTeX in $…$ delimiters (see the example just after this list)!
  6. Press Shift+Enter to render the cell. As far as I can tell, it doesn’t matter how your notes are split among cells; cell boundaries matter more when you’re dealing with the results of Python expressions (the intended use of IPython).
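For example (a made-up fragment, not from my actual notes), typing something like this into a Markdown cell and pressing Shift+Enter gives you nicely typeset math:

Ordinary least squares: the hypothesis is $h_\theta(x) = \theta^T x$, and we choose $\theta$ to minimize the cost $$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$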

But wait- it gets better!

Generally, IPython notebooks are best viewed through a notebook server launched from their directory. However, if you choose to make your notebooks available on the web (for instance, by hosting them on GitHub), you can render them live using the IPython Notebook Viewer. This produces a link to the rendered notebook, incorporating all the Markdown and LaTeX formatting you wrote. And you can share it!

As an example, here’s a link to my (somewhat terrible) notes on Regression from the Machine Learning class I’m taking this semester: http://nbviewer.ipython.org/urls/raw.github.com/eclarke/Machine-Learning-Notes/master/Regression.ipynb

Edit: corrected capitalization of IPython.

An excerpt from They Thought They Were Free: The Germans, 1933-45, by Milton Mayer:

Herr Simon [a former Nazi interviewed after WWII] was greatly interested in the mass deportation of Americans of Japanese ancestry from our West Coast in 1942. He had not heard of it before […]

He asked me whether I had known anybody connected with the West Coast deportation. When I said “No,” he asked me what I had done about it. When I said “Nothing,” he said, triumphantly, “There. You learned about all these things openly, through your government and your press. We did not learn through ours. As in your case, nothing was required of us–in our case, not even knowledge. You knew about things you thought were wrong, didn’t you, Herr Professor?” “Yes.” “So. You did nothing. We heard, or guessed, and we did nothing. So it is everywhere.” When I protested that the Japanese-descended Americans had not been treated like the Jews, he said “And if they had been–what then? Do you not see that the idea of doing something or doing nothing is in either case the same?”

 

 

Pandas: or why I need to stop rolling my own tabular data parsers in Python.

I don’t know how many times I’ve written the following Python code:

f = [x.split('\t') for x in open('tab_delimited_data.out')]

to read tab-delimited data. With Pandas, the Python data manipulation library, it’s as easy as R:

import pandas as pd
f = pd.read_csv('tab_data.out', sep='\t', index_col=0)

and now I have a data frame with a billion more useful functions than the 2D list I would usually have had.
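To give a flavor of what that buys you (the column name below is hypothetical, and I’m assuming a reasonably recent pandas), here are a few of the things the resulting data frame can do that my hand-rolled 2D lists never could:

import pandas as pd

f = pd.read_csv('tab_data.out', sep='\t', index_col=0)  # as above
f.head()                                 # peek at the first few rows, nicely formatted
f.describe()                             # per-column summary statistics
f.fillna(0)                              # replace missing values
f['sample_1']                            # pull out a column by name (hypothetical column)
f.T.to_csv('transposed.out', sep='\t')   # transpose and write back out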

High-performance computing with Python’s multiprocessing module

Python is a reasonable choice of language for writing scripts that will run in many-node, many-core clusters as long as one avoids using threads. CPython (the standard version of Python) uses a Global Interpreter Lock when it comes to threads, which prevents concurrent execution of Python bytecode. (Also, I don’t care for threads, having been bitten badly by their easy ability to share state.)

Instead, one can use the multiprocessing module, whose API is very similar to that of the threading module. It spawns new processes instead of threads, and those processes share no state between them. Each process gets its own Python interpreter, which significantly increases the overhead (making processes impractical to spawn for very short tasks), but processes also map efficiently onto individual cores of the machine being used.

For instance, if I’m using an 8-core node, I can simply spawn 7 worker processes (leaving one core for the parent process) and know that, most likely, the underlying OS has assigned each of them to its own core. The multiprocessing library lets me spawn a process for a single function call, instead of having to write standalone scripts that are launched by a shell script, for instance. Here is an example that spawns a process for each chunk of data to be analyzed; each process then saves its results in a SQLite database (though you may want to use a database that handles concurrency better):

https://gist.github.com/2938199
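In case the embedded gist doesn’t load, here is a minimal sketch of the same pattern (the analysis function, chunking, and table layout are placeholders, not the code from the gist): spawn one process per chunk of data, and have each worker write its results into a shared SQLite database.

import multiprocessing
import sqlite3

DB = 'results.db'

def analyze(chunk):
    # Placeholder "analysis": in reality this would be the expensive per-chunk work.
    results = [(x, x * 2) for x in chunk]
    # A generous timeout helps when several workers try to write at once,
    # but as noted above, SQLite is not a great concurrent writer.
    conn = sqlite3.connect(DB, timeout=60)
    with conn:
        conn.executemany('INSERT INTO results VALUES (?, ?)', results)
    conn.close()

if __name__ == '__main__':
    conn = sqlite3.connect(DB)
    with conn:
        conn.execute('CREATE TABLE IF NOT EXISTS results (input INTEGER, output INTEGER)')
    conn.close()

    data = list(range(1000))
    n_workers = max(multiprocessing.cpu_count() - 1, 1)  # leave a core for the parent
    chunk_size = -(-len(data) // n_workers)              # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    workers = [multiprocessing.Process(target=analyze, args=(chunk,)) for chunk in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()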

Of course, this is only per-machine. To really take advantage of HPC, one should be able to spread work across multiple nodes. Without resorting to a library like MPI, you can simply separate the layers of computation. For instance, my scripts split the analysis of many gene sets among one node’s cores, and split the datasets among multiple nodes. Since everything writes back to a common database, there is no real need for interprocess or internode communication, obviating the need for MPI; the raw data in the database can be postprocessed afterwards.

New GEO Query/Analysis module for BioPython

So I’ve been working a lot with NCBI GEO recently for a paper on the Gene Ontology. During the course of this work I wound up implementing about 70% of the well-known R package GEOquery in Python (as I’m much more fluent in Python than R) and decided that it might be worthwhile to submit to the BioPython project. Their existing GEO parser is woefully inadequate and slightly buggy: I don’t believe it can handle the curated GEO DataSet format, it has no programmatic access to NCBI GEO, and it offers no way to do any statistical analysis on the resulting microarray data.
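For context, this is roughly the extent of the stock interface as far as I know: Bio.Geo.parse reads a SOFT-format file you’ve already downloaded yourself and yields record objects, and that’s about it (the file name below is just a placeholder):

from Bio import Geo

# Parse a GEO SOFT file that was downloaded by hand beforehand;
# there's no built-in way to have it fetched from NCBI for you.
with open('GDS1962.soft') as handle:
    for record in Geo.parse(handle):
        print(record.entity_type, record.entity_id)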

My fork, which is available here, revamps the Geo package to provide the following features:

  1. Automatic retrieval and parsing of GEO files, either from NCBI or from the local filesystem
  2. Pretty-printing of metadata, column, and table information
  3. Ability to convert GDS records into a form that provides a Numpy matrix representation of the sample/probe matrix
  4. Rudimentary statistical analysis methods (filtering probes, detecting enriched genes for a subset, binary log transformation of probe values)

I still haven’t written unit tests for it all (a persistent failing, one of many, I’ll admit), mostly because it was developed a bit on the fly during my work. However, I do know that it works for at least a subset of use cases, and it’s well documented.

The two modified files are here for the morbidly curious:

Bio/Geo/__init__.py

Bio/Geo/Records.py

Quote

This is me writing Java.

You have no idea the pain I feel when I sit down to program. I’m walking on razor blades and broken glass. You have no idea the contempt I feel for C++, for J2EE, for your favorite XML parser, for the pathetic junk we’re using to perform computations today. There are a few diamonds in the rough, a few glimmers of beauty here and there, but most of what I feel is simply indescribable nausea.

— Steve Yegge, Moore’s Law is Crap