Accessing the Genome Browser Programmatically Part 1 – How to get sequence from the UCSC Genome Browser

Note: We now have an API which can also perform many of these functions.

As the number of bioinformaticians has grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. Although there is no true API developed by UCSC (yet), there are a number of ways to interface with the UCSC Genome Browser, some more efficient than others. The intention of this blog post series is to explain some of the preferred ways to access the commonly requested Genome Browser data and tools and to add a bit of explanation of the architecture of the UCSC Genome Browser in general. The three most common requests are 1) how to download a single stretch of sequence in FASTA format, 2) how to download multiple ranges of sequence, and 3) how to get basic statistics on the nucleotides in a sequence. If you want the in-depth examples and explanation, skip down, but if you’re crunched for time, all you really need to know is the following three Q&As:

Q: How do I extract some sequence?
A: The best choice is to use the twoBitToFa command, available for your system here (Windows 10 users can use the linux.x86_64/ binaries in the Windows Subsystem for Linux). Here’s an example:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

Q: What if I have a list of coordinates?
A: Again use twoBitToFa, this time with the -bed option (also check out the post on coordinate systems):

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Q: How do I count A, C, G, T?
A: twoBitToFa followed by faCount (available from the same location as twoBitToFa):

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0

Run twoBitToFa or faCount with no arguments to get a usage message and view all of their options:

$ faCount
faCount - count base statistics and CpGs in FA files.
...


The most efficient way to get sequence from the UCSC Genome Browser

The most common data request we receive is a request for FASTA sequence or sequences, making it a fitting subject for part 1 of this blog series about programmatic access to the Genome Browser. If you are browsing a region in the genome browser and you want to get a FASTA sequence for just the region you are browsing, using the keyboard shortcut ‘vd’ (v then d for view DNA) is probably the easiest way. But what about when you want to get sequences for a list of regions? What about if you need your web application to download the sequence? You could download sequence interactively with the Table Browser, although the solution is somewhat cumbersome: first you must make a custom track of the region(s) you would like sequence for, and then use the “output format: sequence” option with your custom track selected as the primary track. Fortunately, there is a much easier approach – downloading the 2bit file for your organism of interest and then using the twoBitToFa command on it like so:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ twoBitToFa hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt
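
If a script or web application needs to issue this command, the argument can be assembled in code. The following is a minimal Python sketch, not an official UCSC interface: the function names `region_spec` and `fetch_region` are made up for illustration, and `fetch_region` assumes the twoBitToFa binary is installed and on your PATH.

```python
import subprocess


def region_spec(twobit, chrom, start, end):
    """Build the 'file.2bit:chrom:start-end' argument used in the example above.

    `twobit` may be a local path or a URL to a 2bit file.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return f"{twobit}:{chrom}:{start}-{end}"


def fetch_region(twobit, chrom, start, end):
    """Run twoBitToFa and return its FASTA output as text.

    Assumption: the twoBitToFa binary is available on PATH.
    """
    cmd = ["twoBitToFa", region_spec(twobit, chrom, start, end), "stdout"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

With a downloaded hg38.2bit in the working directory, `fetch_region("hg38.2bit", "chr1", 100100, 100200)` runs the same command as the example above.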

The twoBitToFa command is available from the list of public utilities, in the directory appropriate to your operating system. twoBitToFa even accepts a URL to our downloads server as the 2bit argument, so if you wanted to grab some mm10 sequence, or even a list of sequences, you can just query the downloads server directly like so:

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct
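
When the list of regions lives in a program rather than in a file, the BED input can be generated on the fly. This is a small Python sketch under a few assumptions: the helper names (`bed_lines`, `write_bed`, `bed_command`) are hypothetical, and the fields are written tab-separated, which is the usual BED convention.

```python
def bed_lines(regions):
    """Format (chrom, start, end, name) tuples as tab-separated BED text."""
    return "".join(
        f"{chrom}\t{start}\t{end}\t{name}\n"
        for chrom, start, end, name in regions
    )


def write_bed(regions, path):
    """Write the regions to a BED file and return its path."""
    with open(path, "w") as handle:
        handle.write(bed_lines(regions))
    return path


def bed_command(twobit, bed_path):
    """Argument list for the -bed form of twoBitToFa shown above."""
    return ["twoBitToFa", twobit, f"-bed={bed_path}", "stdout"]


# The two regions from the example above.
REGIONS = [
    ("chr1", 4150100, 4150200, "seq1"),
    ("chr1", 4150300, 4150400, "seq2"),
]
```

The list returned by `bed_command` can be handed to `subprocess.run` once `write_bed(REGIONS, "input.bed")` has created the file.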

Note that “stdout” in the above commands (like the corresponding “stdin”) is a special filename: most UCSC commands accept it in place of a required filename and then read from /dev/stdin or write to /dev/stdout. This makes it easy to chain commands together, as in the following common usage of generating some quick statistics on a region like chr1:100100-100200:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0
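
If the FASTA text is already in memory, similar numbers can be computed without a second tool. The sketch below is a rough pure-Python equivalent for simple cases, not a reimplementation of faCount; the function names are made up for illustration, and faCount may treat unusual characters differently.

```python
from collections import Counter

# FASTA text copied from the twoBitToFa output shown earlier.
EXAMPLE_FASTA = """>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt
"""


def parse_fasta(text):
    """Return {record name: sequence} for FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def base_counts(sequence):
    """Count A, C, G, T, N and CG dinucleotides, ignoring case."""
    seq = sequence.upper()
    counts = Counter(seq)
    return {
        "len": len(seq),
        "A": counts["A"],
        "C": counts["C"],
        "G": counts["G"],
        "T": counts["T"],
        "N": counts["N"],
        "cpg": seq.count("CG"),
    }


example_counts = base_counts(parse_fasta(EXAMPLE_FASTA)["chr1:100100-100200"])
```

For the example region, `example_counts` agrees with the faCount line above: 100 bases, with 37 A, 17 C, 21 G, 25 T, no N and no CpG.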

The twoBitToFa and URL to hgdownload 2bit combo is important because our downloads server is significantly more robust than our DAS CGI, can support more requests, and won’t slow the main site down for other users. We’ve also noticed that our DAS server often receives many requests for the same sequence, so for those of you providing software where the same query will be made multiple times, consider whether it would be more efficient to download an entire 2bit file to your local disk, rather than send the same query thousands of times to our servers.
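
For software that makes the same query repeatedly, one way to follow this advice is to download the 2bit file once and reuse the local copy. Here is a minimal Python sketch of that idea; `ensure_local_2bit` and the cache directory are hypothetical names, and the `fetch` parameter exists only so the download step can be swapped out.

```python
import os
import urllib.request


def ensure_local_2bit(url, cache_dir, fetch=urllib.request.urlretrieve):
    """Download a 2bit file on first use and return the local path afterwards.

    The file is fetched to a temporary name and renamed when complete, so an
    interrupted download is not mistaken for a finished one.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        partial_path = local_path + ".part"
        fetch(url, partial_path)
        os.replace(partial_path, local_path)
    return local_path
```

The returned path can then be passed to twoBitToFa in place of the URL, so repeated queries read from local disk instead of the UCSC servers.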

Summary
twoBitToFa and faCount are two of the hundreds of utilities available, and both are useful for extracting and summarizing sequence data. While not as preferable as working with locally downloaded files, twoBitToFa can also work with URLs to 2bit files, such as those on the UCSC Genome Browser download site. Stay tuned for part 2 of this programmatic access series: Using the Genome Browser public MySQL server and gbdb.


If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

How to share your UCSC screenthoughts

by Robert Kuhn      August 12, 2015

The UCSC Genome Browser is a great tool for visualizing your data alongside a ton of data from all over the place.  Perhaps, at long last, you have loaded up a gene set, the supporting mRNAs and maybe the SNPs from OMIM or dbSNP, and the Conservation track to make a great point.

Now you want to save that thought, or share it with a colleague, or make a slide for a meeting, or publish it in a paper. Saving your screenthought can take two forms: static or dynamic.  You can snap and save a picture of the screen, or you can share a link to an active Genome Browser.  We’ll talk about both approaches here and discuss some of the advantages and pitfalls of each.

Share a static image.    You can always take a screen grab and throw it onto a slide with little effort.  The screen resolution is fine for a slide, because your computer and your slide will both be 72 or 96 dpi.  But, if you try that for a publication, your image will have to be really small (scale down 3x in each dimension to get 300 dpi for print) or it will be unacceptably fuzzy.
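
The arithmetic behind that scaling is easy to check. This short Python sketch uses made-up function names and assumes a 300 dpi print target, as in the paragraph above; the 1200-pixel width is only an example value.

```python
def shrink_factor(screen_dpi, print_dpi=300):
    """How many times smaller a screen grab must be printed to reach print_dpi."""
    return print_dpi / screen_dpi


def print_width_inches(pixel_width, print_dpi=300):
    """Printed width of an image that is pixel_width pixels wide at print_dpi."""
    return pixel_width / print_dpi


factor_96 = shrink_factor(96)             # 3.125, roughly the 3x mentioned above
factor_72 = shrink_factor(72)             # a little over 4x for a 72 dpi screen
width_1200 = print_width_inches(1200)     # a 1200-pixel grab prints 4.0 inches wide
```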

To get high resolution images for publication, use the Browser’s .pdf export function to allow the vector-graphics image to scale to full journal size and resolution. Look for the .pdf output in the “View” pulldown menu at the top of the Browser page.  Both the chromosome ideogram and the main Browser graphic can be saved in this fashion.

Share a dynamic session, but DO NOT copy a URL.  To save a dynamic screen session that would allow you or others to look around, add more data tracks, check out other genes, etc., you might be tempted to simply copy the URL from your Firefox or Chrome web browser.  That might even seem to work OK at first, but it is in fact not a stable link and can lead to weird Browser behavior.  Worse, you may not even be sharing what you think you are, and will never know it.

Let’s break down a URL as copied directly from my Firefox and see how it plays out.

[Image: the copied URL]

This URL contains a parameter, hgsid, which is actually a pointer to a row in a UCSC database identifying your session and keeping the state of all your variables (we borrowed the name “cart”).  If you send this URL to someone, yet keep browsing around, your cart will continue to change as you work, and your friend will see the latest state your Genome Browser is in when she clicks the link. The original state of your cart when you shared the URL is long gone before she sees it.

Your shared URL might even appear to work OK because two of the variables in the URL, db (database) and position, will override values stored in your cart (cart variables are separated by an ampersand).  Your friend will see the right genome assembly (db variable) and location (position variable) and think she’s seeing what you want.  But, if you have turned any data tracks on or off in the interim, or removed a custom track, those changes will also be part of what she sees. The original state is lost.  A different colleague could click the link at some other time and see something different still.

As an experiment, here is that same URL in a form you can click or copy/paste into your web browser:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr8%3A38311140-38327276&hgsid=438231169_c2xrrbHK2bQhTuHqjEIOniXGqenu

Does it look like this?

[Screenshot: the Genome Browser view at the time the URL was shared]

That’s what it looked like when I shared the URL. Your click will show the 5’ end of the FGFR1 gene region on human assembly hg19 (because the URL has explicitly included db and position variables), but who knows what tracks might be turned on or off in the interim? Whatever the last person to click it did to it will rule. Every person who reads this blog and clicks the link can change the track configuration for whoever comes next. Only the db and position are going to persist.
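
To see the pieces of such a URL explicitly, the query string can be pulled apart with a standard URL parser. This is an illustrative Python sketch using only the standard library; the helper name `url_parameters` is made up.

```python
from urllib.parse import parse_qs, urlsplit

# The URL from the experiment above.
COPIED_URL = (
    "http://genome.ucsc.edu/cgi-bin/hgTracks"
    "?db=hg19&position=chr8%3A38311140-38327276"
    "&hgsid=438231169_c2xrrbHK2bQhTuHqjEIOniXGqenu"
)


def url_parameters(url):
    """Return the query parameters of a URL as a {name: value} dictionary."""
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


params = url_parameters(COPIED_URL)
# params["db"] and params["position"] are spelled out in the link itself,
# while params["hgsid"] only points at session state stored on the server.
```

Only the first two values travel with the link; everything else about the view depends on whatever the cart behind the hgsid holds at the moment someone clicks.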

Quick-and-dirty URL hack.    If you really want a quick-and-dirty way to share a link, here are a couple of suggestions.  You could send the link as it is above, then strip a few characters out of the hgsid in the URL in your own browser and refresh.  Because the new long hgsid string will not exist in our database, you will be assigned a new hgsid and the state of the old one will stick – until your friend starts messing with it.  Or you could strip out the hgsid parameter entirely and add in other parameters that define the tracks you want to turn on, e.g.:

&knownGene=pack&snp142=dense

That will better define the tracks you want, but it is neither as stable nor as easy as saving a Session. You can use “hide,” too, to be sure certain tracks are turned off. Read more about configuring your links here.
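
The second suggestion (dropping hgsid and naming track settings explicitly) can be scripted. Below is a hedged Python sketch: `explicit_link` is a hypothetical helper, and the track names are the two from the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

COPIED_URL = (
    "http://genome.ucsc.edu/cgi-bin/hgTracks"
    "?db=hg19&position=chr8%3A38311140-38327276"
    "&hgsid=438231169_c2xrrbHK2bQhTuHqjEIOniXGqenu"
)


def explicit_link(url, track_visibility):
    """Drop hgsid from a Browser URL and append explicit track settings."""
    parts = urlsplit(url)
    # Keep every parameter except the session pointer.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "hgsid"]
    params.extend(track_visibility.items())
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


link = explicit_link(COPIED_URL, {"knownGene": "pack", "snp142": "dense"})
```

Tracks that are not named in the link are left unspecified, which is one reason this approach is less stable than a Saved Session.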

Share a stable dynamic Session.    The best way to save a train of thought in a stable fashion is via the Saved Session tools under the “My Data” pulldown menu. A Saved Session acts as a stable snapshot of all the details of your Browser view.  Saving a thought using this feature requires a login, but it allows you to save the state of a Browser session (semi)-permanently. Anyone viewing your session will be able to further browse around the genome without affecting the session you saved.  After you have saved a session, you will see a “Browser” link that can be copied and shared.

For example, to load the view above as a stable session, try this link (no login is required to view someone else’s Saved Session):

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=sessionGallery&hgS_otherUserSessionName=hg19_watsonKriek

Although anyone with this URL can view this session, no one can change it unless logged in as user “SessionGallery.”
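
A link of this form can also be built from a user name and a session name. The Python sketch below reuses the three parameters visible in the URL above; the function name `saved_session_link` is made up, and the normal route remains copying the “Browser” link from the Saved Session page.

```python
from urllib.parse import urlencode

HGTRACKS = "http://genome.ucsc.edu/cgi-bin/hgTracks"


def saved_session_link(user_name, session_name, base=HGTRACKS):
    """Build a link to another user's Saved Session."""
    query = urlencode(
        {
            "hgS_doOtherUser": "submit",
            "hgS_otherUserName": user_name,
            "hgS_otherUserSessionName": session_name,
        }
    )
    return f"{base}?{query}"


session_link = saved_session_link("sessionGallery", "hg19_watsonKriek")
```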

In the past we endeavored to keep a Session for at least 3-4 months after the last time it was viewed, and custom tracks in sessions were kept for at least 48 hours after the last time they were viewed. Our current policy is not to remove session data unless the user deletes it, and not to remove custom tracks that are part of sessions.  We still encourage people to save their Session cart to a local file using the “Save Settings” feature (and to keep backups of all their custom tracks on a local machine).  That way, you can load your Session settings any time and onto any copy of the Browser (such as the European mirror or a local Genome Browser-in-a-Box) and avoid any possible loss of data due to unforeseen circumstances.  We do the best we can to maintain our servers so that you do not lose your sessions, but computers are only human and they break.

Really stable sessions.    If you are looking to create a permanent link for a publication, you should consider hosting your downloaded Session and any of your own custom data on a server you control (such as in a Track Hub). It will still be loaded onto the UCSC Genome Browser, but you are not at the mercy of California earthquakes, wildfires or crashed servers (except for your own).  You can read more about building links to remotely hosted user information here and on our Session Gallery page here.

On both pages you can learn about the following parameters for forming links to launch sessions from your hub:

hgS_doLoadUrl=submit
hgS_loadUrlName=
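
As a rough illustration of how these two parameters combine into a link, here is a Python sketch. The hosted file location (`http://example.org/sessions/my_session.txt`) is a placeholder, `remote_session_link` is a hypothetical helper, and percent-encoding the file URL is a choice made here for safety; check the pages mentioned above for the exact form the Browser expects.

```python
from urllib.parse import urlencode

HGTRACKS = "http://genome.ucsc.edu/cgi-bin/hgTracks"


def remote_session_link(session_file_url, base=HGTRACKS):
    """Build a link that asks the Browser to load a remotely hosted session file."""
    query = urlencode(
        {
            "hgS_doLoadUrl": "submit",
            # URL of the saved settings file on a server you control.
            "hgS_loadUrlName": session_file_url,
        }
    )
    return f"{base}?{query}"


# Placeholder location; substitute the address of your own hosted file.
remote_link = remote_session_link("http://example.org/sessions/my_session.txt")
```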

We hope we have given you some food for thought about how to make the Genome Browser more useful in your work.  Using a reliable method for saving and sharing sessions is a great way to avoid the frustration of lost data and misleading links.  Stay tuned for more useful Browser tips in future blogs.