The opinions expressed here are those of the author, Jonathan Casper, and do not necessarily reflect those of the University of California Santa Cruz or any of its units.
I’m happy to say that we’ve finally released the Genome Browser in a Box (GBiB). GBiB is essentially a virtual machine image of a mirror of the UCSC Genome Browser. Download it, set it up, and voilà – instant mirror. This lets you do cool things like have a mirror of the browser on your own personal computer – more information on how to set it up is available on the help page at http://genome.ucsc.edu/goldenPath/help/gbib.html. Here, however, I’m going to talk about the background of GBiB: how it got started, what kinds of decisions we faced, and what became my favorite feature.
At UCSC, we have known for a long time that it can be hard to set up a mirror server. We even have a separate mailing list devoted to the topic. Before the GBiB project began, one of our developers was working on a script to completely automate the process for a computer running stock Ubuntu Linux. Eventually the developer had an epiphany: why not just create a barebones mirror once on a virtual machine, and then make that available to our users? It wouldn’t solve the problem of allowing people to easily add the assemblies and tracks they wanted, but at least they wouldn’t have to wrestle first with setting up Apache and MySQL. The developer’s suggestion came at an opportune moment – we had just received several mailing list questions about using sensitive data with the UCSC Genome Browser website. We didn’t have a good answer.
The problem is that the UCSC Genome Browser has always been focused on being an academic research tool, not a clinical one. We aren’t designed to provide the kind of data security that HIPAA and Institutional Review Boards call for. The only answer we could give to people who wanted data security was “create your own secure mirror, or use another genome browser”. Knowing how difficult it could be to set up a mirror, that wasn’t much of a choice.
Into that mix, we were suddenly presented with a new option: give everyone a pre-installed mirror with the hardest parts already done. Just place it behind a firewall, load up your sensitive data, and enjoy! I thought it was a great idea, as did many other browser staff members.
From there, the idea quickly snowballed. UCSC already provides a public MySQL server and download site with most of the data from our browser. We realized that we could set up the virtual machine to take advantage of those resources and load our data over the internet. This was a great advantage over normal mirror servers. UCSC provides many terabytes of data. Most mirrors have to pick and choose which assemblies and tracks they make available; there’s far too much data to download and keep synchronized. By using our public internet resources, GBiB could provide all of it.
In practice, we discovered it wasn’t quite that easy. Because of network latency, GBiB ran very slowly for anyone not on the west coast of the United States. Just loading the default view of the human GRCh37/hg19 genome assembly could take over 10 seconds. We had to make a compromise: GBiB wouldn’t require you to download track data before using it, but downloading would still be an available option for users far from our servers.
There is now a new CGI just for this purpose: “Mirror Tracks”. It combs through the list of database tables and files associated with browser tracks and allows you to download the data for any of them. If you’re interested in looking at, say, mRNA alignments in the Painted turtle (chrPic1) genome and GBiB is just too slow, a few clicks in Mirror Tracks will put them all on your own hard drive. If you really want, you can even then put GBiB into full-offline mode. You’ll lose access to any track data that you haven’t downloaded, but you’ll always have those Painted turtle mRNAs.
My favorite feature of GBiB, though, has to be what it does for track hubs. Track hubs are a feature we released in 2011 to allow users to view their own data files in the UCSC Genome Browser alongside our annotation. Unlike custom tracks, where all the data must be sent to our server at once, track hubs only send the data for the region you are looking at. That is much more manageable for something like a VCF file, which can be on the order of 10-100 GB.
There are two problems with track hubs. First, you must have web hosting space for your data to construct a track hub. Not everyone does. There are public hosting solutions like Dropbox, but they don’t always work. Second, once again there is the problem of sensitive data. Even if you are willing to send your sensitive data directly to our servers at UCSC, you may not be willing (or even allowed) to make it publicly available on a web server. GBiB solves both problems beautifully.
GBiB already has a built-in web server that it uses to communicate with your computer. With a few small adjustments, you can take advantage of that and let GBiB also host your data files. This means that you can build and use a track hub with GBiB, and none of the data will ever leave your computer or be accessible to anyone else, unless you grant them access to your GBiB.
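To give a rough sense of what building a hub involves, here is a minimal Python sketch that writes the three plain-text files a one-track hub is made of: hub.txt, genomes.txt, and trackDb.txt. The file layout follows the standard track hub format, but everything else is an assumption made for illustration: the function name, the hub name and labels, the output directory, and especially the data URL, which is a made-up placeholder rather than the address GBiB actually uses. The help page covers where files need to live on a real GBiB.

```python
import tempfile
from pathlib import Path


def write_track_hub(hub_dir, genome, track_name, data_url, email,
                    track_type="vcfTabix"):
    """Write hub.txt, genomes.txt and <genome>/trackDb.txt for a one-track hub.

    All names and labels here are hypothetical; adjust them for your own data.
    Returns the path to hub.txt, which is the file you point the browser at.
    """
    hub_dir = Path(hub_dir)
    genome_dir = hub_dir / genome
    genome_dir.mkdir(parents=True, exist_ok=True)

    # hub.txt: top-level description of the hub; it points at genomes.txt.
    hub_txt = (
        "hub myLocalHub\n"
        "shortLabel My local hub\n"
        "longLabel Example track hub hosted on a local mirror\n"
        "genomesFile genomes.txt\n"
        f"email {email}\n"
    )

    # genomes.txt: one stanza per assembly, each pointing at a trackDb file.
    genomes_txt = f"genome {genome}\ntrackDb {genome}/trackDb.txt\n"

    # trackDb.txt: one stanza per track; bigDataUrl is where the data lives.
    track_db_txt = (
        f"track {track_name}\n"
        f"bigDataUrl {data_url}\n"
        f"shortLabel {track_name}\n"
        f"longLabel {track_name} (locally hosted data)\n"
        f"type {track_type}\n"
    )

    (hub_dir / "hub.txt").write_text(hub_txt)
    (hub_dir / "genomes.txt").write_text(genomes_txt)
    (genome_dir / "trackDb.txt").write_text(track_db_txt)
    return hub_dir / "hub.txt"


if __name__ == "__main__":
    # Demo: write the hub into a throwaway directory and show hub.txt.
    out_dir = tempfile.mkdtemp()
    hub_file = write_track_hub(
        out_dir,
        genome="hg19",
        track_name="myVariants",
        # Placeholder URL: replace with wherever your data file is served.
        data_url="http://localhost/placeholder/myVariants.vcf.gz",
        email="you@example.org",
    )
    print(hub_file.read_text())
```

The data file itself is not touched by this sketch. For a VCF track it would need to be compressed and indexed so the browser can fetch just the region on screen.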
Genome Browser in a Box is available from our web store at https://genome-store.ucsc.edu. It is free for non-commercial use by non-profit organizations, academic institutions, and for personal use. Please see the store website for full terms and conditions.
If after reading this blog post you have any public questions, please email firstname.lastname@example.org. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to email@example.com.