Tag Archives: browser

GenArk Hubs Part 2 – Using the data

This blog post is the second of three to discuss the Genome Archive (GenArk) assembly hubs. This second post discusses examples of using the GenArk hubs’ data, with the first post about accessing the data, and the third shares technical infrastructure behind the hubs.  

Before launching into using the new GenArk hubs, let’s go quickly over how the first blog post examined the multiple ways to access the GenArk hubs. The easiest way to find a GenArk hub is by searching the UCSC Genome Browser’s main Gateway page with a name like “hummingbird” and clicking on the GCA/GCF identifier to attach it.  Another is to build direct links to NCBI GCA/GCF assembly accessions when you know them to instantly arrive at the main Browser view, such as https://genome.ucsc.edu/h/GCF_005190385.1 for narwhal. Yet another is searching the UCSC Public Hubs page or going to the main GenArk homepage where you can in turn navigate directly to individual taxonomic group pages, such as for birds.

USING THE DATA 

What can you do with a GenArk Assembly hub?

The new GenArk hubs come with the ability to perform BLAT DNA queries and PCR primer searches, as well as send the genome’s DNA to external tools.

As an example, let’s say you are curious if we have a specific bat genome. The first step would be to go to the Gateway page and search “bat” and discover multiple hits. 

Looking at search results you see your desired specific “little brown bat” assembly and click on it so that hub is now selected, where under “Find Position” on the right there would now be “Mammal assemblies Hub Assembly” attached with “little brown bat” displayed and a specific GCF_000147115.1 NCBI accession. Clicking the “Go” button would bring you to the main Browser display. The same result happens from clicking this short direct GenArk /h/ hub link: https://genome.ucsc.edu/h/GCF_000147115.1

BLAT DNA Search

With this bat genome displaying if you had a short DNA sequence you wanted to search, you could paste it right in the top search box. For instance, after clicking the above link, try pasting on the main browser display CATTAGGCAAATATATGCATATAAGTTCTTTGTTTAATCTCT and hit “go”.  The result, shown after a few seconds, will be sequence matches across the little brown bat genome. You can also go to the top Tools menu and then select “Blat”, and do the same step of pasting DNA sequence, required when searching especially longer strings. The Blat Tool page also allows you to search alternative sequences. 

BLAT Protein Search

On the Tools > Blat page you can put in protein sequences to search. This is especially interesting if you want to find the location of a known protein from another species in your genome of interest. For example, if again you are on the little brown bat genome and you go to the Blat page, try to blat this portion of the human SOD1 protein:, LSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN

You will find a match (again note for protein searches, be sure to go to the Tools > Blat page). When viewing the results you can either click a “browser” link to see the matching spots on the genome. Or if you click a “details” link you will see the side-by-side alignment like this image below.

Besides DNA and protein searches, BLAT also allows translated RNA and translated DNA searches. Also the results from BLAT searches can be saved as custom tracks. This allows you to download and save these annotations, or save them in Sessions making the results more permanent and shareable. See this other blog post about sharing sessions for more information: https://bit.ly/UCSC_blog_sharing 

PCR Primer Search

GenArk also provides PCR primer searches, by going to the Tools menu and selecting PCR. With the same “little brown bat” genome loaded, for instance, go to the Tools menu and select “In-Silico PCR” to arrive on the PCR page. Then enter these two primers, forward primer: AGTCATGGTCTCAGGAACCG and reverse primer:  GTTACTAGGGCTCAGACCTC  (there is no need in this example to click any other settings). 

Then click “submit” to search the “little brown bat” genome for matches.

The results will be two hits, in part because this assembly has 11,654 scaffolds with some identical sequences (to see all the scaffolds click the “view sequences” link on the Gateway page described later). 

Send DNA to External Tools

Another way to use GenArk hubs is to send the current DNA in the viewing window to external sites. By going to the View menu you can select the “In External Tools” option and export the current DNA for processing outside of the UCSC Genome Browser.

In this image a 7,477 bp region will be selected to be sent to external sites where selecting “In External Tools” under the View menu will result in a pop-up of various options.

In this case all of the options are presented as available for this 7,477 bp span, except for RNAfold, which requires the viewer to zoom in to less than 5 kpb, before sending the DNA to that external tool. 

Send DNA to External Tools -Primer Design: Primer-BLAST

If you were interested in PCR Primer design in this region you could use the Primer3Plus or  Primer-BLAST links. The Primer-BLAST link starts a job at NCBI, where after some time the results at NCBI will be optimal PCR Primers for this stretch of DNA. Here are example results sending the 7,477 bp  span of the NW_005878708v1 little brown bat scaffold to NCBI.

With these results, one can return to the UCSC PCR Tool to test each result in order to discover if these primers will have potential off-target results beyond the desired chromosome.

Send DNA to External Tools -Primer Design: Primer3Plus

Another PCR Primer design option in the “In External Tools” menu is Primer3Plus. Here are example results sending the same 7,477 bp span of the NW_005878708v1 little brown bat scaffold to Primer3Plus.

Primer3Plus has the added benefits of a “Return to Genome Browser” button (top left) that if clicked will dynamically generate a custom track of the results to be seen back on the UCSC Genome Browser.

Above the Primer3Plus custom track identifies the input region that was sent (top grey bar), and then the individual left and right matching primer pair locations. At UCSC the primers can then be tested again with the UCSC PCR Tool where a highlight for the Primer3Plus suggested “Primer 5” is highlighted in the above image. 

Send DNA to External Tools for oligo-analysis 

Another tool you can export DNA of interest to is Regulatory Sequence Analysis Tools (RSAT) Metazoa for motif discovery. For instance, when looking at a GenArk assembly for Zebu Cattle, https://genome.ucsc.edu/h/GCF_000247795.1, using the View menu and In External Tools option one could select the RSAT link. RSAT provides a way to analyze the DNA sequence for transcription factor binding sites and over-represented oligo-nucleotides. Because RSAT requests your organism, in this example Bos taurus was used as a relative to zebu cattle, allowing for proceeding to request examination of  the region. The DNA being sent in this example was near a region for the start of a gene predicted by Augustus. One of the RSAT results was a predicted motif, aaacttatagata, just upstream of the transcription start site for the predicted gene.

By going back to the UCSC Genome Browser and clicking into the Short Match track (under the top Mapping section) and pasting in the motif sequence, aaacttatagata, a display in the GenArk hub of where these matches occurred could be visualized.

The Short Match track’s ability to visualize the motif identifies the potential binding sites of transcription factors, predicted by RSAT.  This Browser view of the Zebu Cattle GenArk assembly hub can be viewed with this Public Session link

Can I add custom tracks to a GenArk Assembly hub?

Yes, users can add tracks to their data by going to the My Data menu and then selecting Custom Tracks to paste in information. Simple text-based tracks can be loaded, or more complicated binary-indexed files such as BAMs or VCFs or bigBeds can be loaded as well. 

How do I name my sequences for my custom tracks?

Another special feature of GenArk hubs is that they are loaded with a special chromAlias file allowing for multiple alias names. When building custom tracks the scaffold names for sequences need to match the names in the assembly, but many options exist. For instance, with the Zebu Cattle genome, https://genome.ucsc.edu/h/GCF_000247795.1, if you type “v s” to view sequences, or click the top “Genomes” name and then the “view sequences” button, you will end up on a page where all the scaffolds of a genome are displayed.

Scrolling down you on the resulting page you will see a link titled “GCF_000247795.1.chromAlias.txt” which will have results like this:

# sequenceName    alias names    assembly: GCF_000247795.1_Bos_indicus_1.0
chr1    1    CM003021.1    NC_032650.1
chr10    10    CM003030.1    NC_032659.1
chr11    11    CM003031.1    NC_032660.1
...

What this chromAlias.txt file displays is how “chr1”, or “1”, or “CM003021.1” or “NC_032650.1” can be used to create custom tracks on chromosome one for this assembly (i.e., BED custom tracks “chr1 300 500” = “1 300 500” = “CM003021.1 300 500” = “NC_032650.1 300 500”). 

Can I add a Track Hub to a GenArk Assembly Hub?

Yes, after loading a hub you can user go to the My Data menu and paste in the location of a hub to display on any of the GenArk assembly hubs. The one special detail is that your Track Hub’s genomes.txt genomes line only needs to have the GCA/GCF number such as “genome GCF_001984765.1”. See this example hub.txt file for an idea of how a hub could be loaded on a GenArk hub.  Here is a link that will load that hub on a GenArk hub for American beaver:

https://genome.ucsc.edu/h/GCF_001984765.1?position=NW_017869957v1:1,285,000-1,793,000&hubUrl=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt

Can I share data on a GenArk assembly hub?

Yes, you can make a session and share the URL with others. Even better, publish your session to the Public Session page to make it more discoverable. See this previous blog about sharing data for more information:  https://bit.ly/UCSC_blog_sharing   

The next blog post in this series will provide some technical details about the GenArk hub architecture. The first post focussed on how to discover and access the hubs. 


This entry written by Brian Lee. If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

GenArk Hubs Part 1 – Accessing the data

This blog post is the first of three to discuss the Genome Archive (GenArk) assembly hubs. This first post discusses accessing the GenArk hubs, the second post gives examples of using the data, and the third post describes the technical infrastructure behind the hubs. 

Let’s start with a real-world story: imagine you are a researcher working on zebrafish, but you are using an alternative strain with unique polymorphic properties. You have a desire to do CRISPR on your particular zebrafish and you already have a FASTA file for the genome assembled into chromosomes, but have no annotations or way to visualize the data yet. 

One option to visualize your FASTA would be to independently create a UCSC Assembly Track Hub to work on your zebrafish. Or now that UCSC has developed the Genome Archive (GenArk) system, when you submit your assembly into NCBI’s assembly database, you could contact us directly and request we generate the browser for you behind the scenes. This happened for a specific lab, where they submitted their specific TD5 zebrafish assembly to NCBI, https://www.ncbi.nlm.nih.gov/assembly/GCA_018400075.1, and the result after contacting us was a new assembly hub that could be easily loaded at UCSC with the following link: https://genome.ucsc.edu/h/GCA_018400075.1  In this case, the team at UCSC even helped generate liftOver alignment files between the UCSC zebrafish in this new TD5 zebrafish GenArk Public Hub addition, aiding identification of lifting annotations to the new browser. 

So what are the GenArk Assembly hubs?

GenArk hubs are a collection of data files externally hosted from the main UCSC data website enabling browsing new genomes. GenArk genomes have NCBI Genbank assembly accessions starting with either GCA or GCF and the browsers allow visualizing and attaching laboratory-generated data. New software also enables UCSC to dynamically turn on query servers to search GenArk hubs with DNA sequences or test PCR primer pairs. GenArk hubs are part of the UCSC Public Hubs list where UCSC can update the data files with pipelines. 

ACCESSING THE DATA

How do I access GenArk Assembly Hubs?

There are multiple ways to access the GenArk hubs, including searching the UCSC Genome Browser’s main gateway page, building direct links to NCBI GCA/GCF assembly accessions, searching the UCSC Public Hubs page, and navigating directly to individual taxonomic group pages.

Browser Gateway Page

The easiest way to find GenArk hubs is to search the species name on the Browser Gateway.

On the Gateway page in the top left box you can search a term such as “dog” and find all the genomes both hosted in our internal databases and in external Public Hubs that have dog in the name. In this image, a search for “dog” returns a top “Dog” match (UCSC database) as well as results for several species in Assembly Track Hubs that match on the term “dog” with the specific labrador dog breed selected from the GenArk Mammal Assemblies Hub (GCF_014441545.1).

Direct GCA/GCF Accession Links

In the situation where you may know the GCF/GCA identifier for an assembly, you can also search that term on the Gateway page or build a short link to directly load the hub.  Links to UCSC with a hub (“/h/”) address, such as https://genome.ucsc.edu/h/GCF_000698965.1 will attempt to find and attach a matching final GCF-value,  which originates from the NCBI accession, in this case, for an African ostrich assembly. If you don’t find a match, read more below about contacting us.

Public Hubs Page

Another place to find GenArk hubs is on the Public Hubs page where you can enter various terms, like “ostrich”: https://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=ostrich,  You can expand the “Search details” to examine matching results. To load a desired hub, use a right-click to display an “Open this assembly” pop-up, or an option to configure individual track settings.

Genomes Menu

Another option to gain an overview of all the GenArk hubs is to click the “Genome Archive GenArk” link available under the “Genomes” menu.

This Genomes menu link will open the GenArk homepage. On the GenArk homepage, a variety of links exist including the line,“Please note: text file listing of 1,600 NCBI/VGP genome assembly hubs.” Clicking that link will open a single text file that lists all available hubs allowing a quick overview: https://hgdownload.soe.ucsc.edu/hubs/UCSC_GI.assemblyHubList.txt 

Individual Taxonomic Pages

The GenArk homepage also has links to specific taxonomic groupings hub pages, such as mammals, fishes, or fungi. For instance, a “birds” link, https://hgdownload.soe.ucsc.edu/hubs/birds/index.html, brings you to a webpage with links to launch browsers, along with links to other details for each assembly.

These taxonomic group pages, such as this image of the bird’s page, have links to launch the browser (2nd column: common name and view in browser) and links to the source files (4th column: NCBI assembly). 

Access to these taxonomic group pages is also available from the Public Hubs page.

By going to the Description column on the Public Hubs page you can click a link (Bird genome assemblies) to end up at the related taxonomic grouping page. Also of note that on the Public Hubs page, you can click a  [+] plus button to expand the list of Assemblies and click one of the GCA/GCF accession links to directly load an assembly. 

What if I don’t find my Assembly in the GenArk collection?

If the assembly of interest is not found, please visit our assembly request page. Search that page for your assembly, if there is a “view” link you can launch the existing genome browser. Otherwise, click the “request” button to fill out a form to add your genome of interest.

The assembly request page does require there to already be an existing GCA/GCF identifier. You can also always email us at our public mailing-list genome@ucsc.soe.edu to request we add the assembly to the GenArk collection. This archived mailing-list is searchable from links on our contacts page, http://genome.ucsc.edu/contacts.html. Alternatively, if you don’t want your request to be public, you can email our private internal mailing-list at genome-www@soe.ucsc.edu

What if my assembly doesn’t have a GCA/GCF NCBI accession?

If NCBI does not have a GCA/GCF accession for your assembly then our scripts will not be able to pull the data and generate the GenArk hub. You will need to deposit the assembly at NCBI and notify us once the assembly has become available. You can find directions at NCBI for how to submit new genomes: https://www.ncbi.nlm.nih.gov/assembly/docs/submission/ 

The next blog post in this series will provide examples of using the GenArk hubs, such as the BLAT and PCR tools that are available, or how you can send DNA of any Assembly Hubs to External Tools for processing.  The final post examines the infrastructure behind the hubs.


This entry written by Brian Lee. If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

Accessing the Genome Browser Programmatically Part 2 – Using the Public MySQL Server and gbdb System

If you missed part 1 about obtaining sequence data, you can catch up here. Note: We now have an API which can also perform many of these functions.

The UCSC Genome Browser is a large repository of data from multiple sources, and if you want to query that annotation data, the easiest way to get started is via the Table Browser. Choose the assembly and track of interest and click the “describe table schema” button, which will show the MySQL database name, the primary table name, the fields of the table and their descriptions. If the track is stored not in MySQL but as a binary file (like bigBed or bigWig) in /gbdb, it will show a file name, e.g. "Big Bed File: /gbdb/dm6/ncbiRefSeq/ncbiRefSeqOther.bb". If this is the case, skip directly to the Accessing the gbdb directory system section below. Otherwise, the track data is either a single MySQL table or a set of related tables, which you can either download as gzipped text files from the “Annotation Database” section on our downloads page (here’s the GRCh37/hg19 listing) and work on them locally, or use the public MySQL server and issue MySQL queries remotely. Generally speaking, the format for most of our tables is similar to the formats described here, e.g., in bed (“chrom chromStart chromEnd”) format, and we do not store any sequence or contigs in our databases, which means you’ll need to use the instructions in Part 1 of this blog series in order to get any raw sequence data.

Accessing the public MySQL server
The best way to showcase the public MySQL server is to show some examples — here are a few to get you started:
1. If you want to download some transcripts from the new NCBI RefSeq Genes track, you can use the following command:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select * from ncbiRefSeq limit 2" hg38
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+
| bin | name        | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts                                                         | exonEnds                                                           | score | name2   | cdsStartStat | cdsEndStat | exonFrames                        |
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+
| 585 | NR_046018.2 | chr1  | +      |   11873 | 14409 |    14409 |  14409 |         3 | 11873,12612,13220,                                                 | 12227,12721,14409,                                                 |     0 | DDX11L1 | none         | none       | -1,-1,-1,                         |
| 585 | NR_024540.1 | chr1  | -      |   14361 | 29370 |    29370 |  29370 |        11 | 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, | 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, |     0 | WASH7P  | none         | none       | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, |
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+

2. If you are interested in a particular enhancer region, for instance “chr1:166,167,154-166,167,602”, and want to find the nearest genes within a 10kb range, then the following query will do the job:

$ chrom="chr1"
$ chromStart="166167154"
$ chromEnd="166167602"
$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \
   e.chrom, e.txStart, e.txEnd, e.strand, e.name, j.name as geneSymbol from ncbiRefSeqCurated e,\
   ncbiRefSeqLink j where e.name = j.id AND e.chrom='${chrom}' AND \
      ((e.txStart >= ${chromStart} - 10000 AND e.txStart <= ${chromEnd} + 10000) OR \ (e.txEnd >= ${chromStart} - 10000 AND e.txEnd <= ${chromEnd} + 10000)) \
order by e.txEnd desc " hg38
+-------+-----------+-----------+--------+----------------+------------+
| chrom | txStart   | txEnd     | strand | name           | geneSymbol |
+-------+-----------+-----------+--------+----------------+------------+
| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B     |
| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B     |
| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B     |
+-------+-----------+-----------+--------+----------------+------------+

3. If you need to get gene names and their lengths for RNA-seq read normalization, you can use the following query:

$ mysql -h genome-mysql.soe.ucsc.edu -u genome -A -e “ \
  select l.name, kr.value, psl.qEnd - psl.qStart as length \
  from   refGene r, hgFixed.refLink l, knownToRefSeq kr, knownCanonical kc, refSeqAli psl \
  where  r.name = l.mrnaAcc and r.name = kr.value and kr.name = kc.transcript \
         and r.name = psl.qName group by kr.value limit 3” hg38
+-------+-----------+--------+
| name  | value     | length |
+-------+-----------+--------+
| A2M   | NM_000014 |   4920 |
| NAT2  | NM_000015 |   1317 |
| ACADM | NM_000016 |   2622 |
+-------+-----------+--------+

In addition to our download site and public MySQL server hosted here in California, we have also recently added support for a download site (http://hgdownload-euro.soe.ucsc.edu) and public MySQL server (genome-euro-mysql.soe.ucsc.edu) hosted in Europe, which will speed up downloads for many of our users.

Please follow the Conditions for Use when querying the public MySQL servers.

Many of the command line utilities available on our utilities downloads server are also able to interact with our databases or download files, like mafFetch (as long as your ~/.hg.conf file is present as discussed below):

$ mafFetch xenTro9 multiz11way region.bed stdout
##maf version=1
##maf version=1 scoring=blastz
a score=0.000000
s xenTro9.chr9     15946024 497 +  80437102 ACTAT...
e galGal5.chr14     1678315   0 -  15595052 I
e xenLae2.chr9_10L 13130032 2034 - 117834370 I

a score=2992.000000
s xenTro9.chr9     15946521 145 +  80437102 TCATC...
s xenLae2.chr9_10L 13132066 148 - 117834370 TTATC...

Note: Only the first 5 bases on each line and only the first 10 lines are shown for brevity.

Here we are directly querying the mutliz11way table for the Xenopus tropicalis xenTro9 assembly, no need to download the entire alignment file to the local disk and query manually. Commands of this nature usually require a special private .hg.conf file in the user’s home directory (note the leading dot). This configuration file contains a couple key=value lines that most of our programs can parse and then use to access the public MySQL server. This page contains a sample .hg.conf file that can be used by most of the command line utilities to direct them to access either our US MySQL server or our European MySQL server. That sample .hg.conf is certainly enough to get started, but for more information about the various Genome Browser configuration options, please see the comments in the ex.hg.conf and minimal.hg.conf files.


Accessing the gbdb directory system
The third method of grabbing our data is via the /gbdb/ directory system. This location, browsable here, holds most of the bigBed, bigWig, and other large data files that we do not keep directly in MySQL databases/tables. There are many utilities available for manipulating these files, and most of them are able to work on remote files, for example:

$ bigBedToBed -chrom=chr1 -start=5563837 -end=5564370 http://hgdownload.soe.ucsc.edu/gbdb/hg38/crispr/crispr.bb stdout 
chr1    5563870    5563893        55    +    5563870    5563890    0,200,0    255,255,0    128,128,0    CAAGTGGAATCAGGATGCCT    GGG    55    72% (57)    52% (46)    10    60    MIT Spec. Score: 55, Doench 2016: 72%, Moreno-Mateos: 52%    3345002138
chr1    5563878    5563901        59    +    5563878    5563898    0,200,0    0,200,0    128,128,0    ATCAGGATGCCTGGGATATG    TGG    59    63% (54)    61% (50)    6    63    MIT Spec. Score: 59, Doench 2016: 63%, Moreno-Mateos: 61%    22777603204

Also note that we have all of this data available via rsync as well, so the following command will work to download the crispr.bb file referenced above:

$ rsync -vh hgdownload.soe.ucsc.edu::gbdb/hg38/crispr/crispr.bb
-rw-rw-r--  1466266135 2017/03/30 14:31:48 crispr.bb

sent 33 bytes  received 70 bytes  206.00 bytes/sec
total size is 1.47G  speedup is 14235593.54

If you are interested in say, Human GRCh37/hg19 gbdb data, then all you have to do is change the “hg38” at the end of the template http://hgdwonload.soe.ucsc.edu/gbdb/hg38 url to “hg19”, resulting in http://hgdwonload.soe.ucsc.edu/gbdb/hg19. This holds for all databases at UCSC, like mm10 or bosTau8.

Summary
Just as in part 1, if you are going to continually request parts of the same files or table over and over again, it is best to download the file from our downloads server and operate on it locally. All of our track data, including MySQL tables and bigBed/Wig/BAM files are hosted on our downloads server at http://hgdownload.soe.ucsc.edu. Generally speaking bigBeds/bigWigs/BAMs and other binary files are located in the hgdownload.soe.ucsc.edu/gbdb/ location discussed earlier, while MySQL table data in gzipped plain text format can be found at http://hgdownload.soe.ucsc.edu/goldenPath/$db (where $db is a database name like hg19 or hg38) or via queries against the public MySQL server directly.

Stay tuned for part 3 of this programmatic access series — controlling the Genome Browser image!


If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

Accessing the Genome Browser Programmatically Part 1 – How to get sequence from the UCSC Genome Browser

Note: We now have an API which can also perform many of these functions.

As the number of bioinformaticians have grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. Although there is no true API developed by UCSC (yet), there are a number of ways to interface with the UCSC Genome Browser, some more efficient than others. The intention of this blog post series is to explain some of the preferred ways to access the commonly requested Genome Browser data and tools and to add a bit of explanation of the architecture of the UCSC Genome Browser in general. The three most common requests are 1) how to download a single stretch of sequence in FASTA format, 2) how to download multiple ranges of sequence, and 3) how to get basic statistics on the nucleotides in a sequence. If you want the in-depth examples and explanation, skip down, but if you’re crunched for time, all you really need to know is the following three Q&As:

Q: How do I extract some sequence?
A: The best choice is to use the twoBitToFa command, available for your system here (Windows 10 users can use the linux.x86_64/ binaries in the Windows Subsystem for Linux). Here’s an example:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

Q: What if I have a list of coordinates?
A: Again use twoBitToFa, this time with the -bed option (also check out the post on coordinate systems):

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Q: How do I count A, C, G, T?
A: twoBitToFa followed by faCount (available from the same location as twoBitToFa):

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0

Run twoBitToFa or faCount with no arguments to get a usage message and view all of their options:

$ faCount
faCount - count base statistics and CpGs in FA files.
...


The most efficient way to get sequence from UCSC Genome Browser

The most common data request we receive is a request for FASTA sequence or sequences, making it a fitting subject for part 1 of this blog series about programmatic access to the Genome Browser. If you are browsing a region in the genome browser and you want to get a FASTA sequence for just the region you are browsing, using the keyboard shortcut ‘vd’ (v then d for view DNA) is probably the easiest way. But what about when you want to get sequences for a list of regions? What about if you need your web application to download the sequence? You could download sequence interactively with the Table Browser, although the solution is somewhat cumbersome: first you must make a custom track of the region(s) you would like sequence for, and then use the “output format: sequence” option with your custom track selected as the primary track. Fortunately, there is a much easier approach – downloading the 2bit file for your organism of interest and then using the twoBitToFa command on it like so:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ twoBitToFa hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

The twoBitToFa command is available from the list of public utilities, in the directory appropriate to your operating system. twoBitToFa even accepts a URL to our downloads server as the 2bit argument, so if you wanted to grab some mm10 sequence, or even a list of sequences, you can just query the downloads server directly like so:

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Note that “stdout” in the above commands is a special option (along with the corresponding “stdin”) that tells the majority of UCSC commands to read/write from/to /dev/stdin and /dev/stdout instead of the required filenames, and is exemplified by the following common usage of generating some quick statistics on a region like chr1:100100-100200:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0

The twoBitToFa and URL to hgdownload 2bit combo is important because our downloads server is significantly more robust than our DAS CGI, can support more requests, and won’t slow the main site down for other users. We’ve also noticed that our DAS server often receives many requests for the same sequence, so for those of you providing software where the same query will be made multiple times, consider whether it would be more efficient to download an entire 2bit file to your local disk, rather than send the same query thousands of times to our servers.

Summary
twoBitToFa and faCount are two useful utilities, among the many other hundreds of tools available, that are useful for extracting sequence data. While not as preferable to working with locally downloaded files, twoBitToFa can also work with URLs to 2bit files, such as those on the UCSC Genome Browser download site. Stay tuned for part 2 of this programmatic access series — Using the Genome Browser public MySQL server and gbdb.


If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

How to share your UCSC screenthoughts

by Robert Kuhn      August 12, 2015

The UCSC Genome Browser is great tool for visualizing your data alongside a ton of data from all over the place.  Perhaps, at long last, you have loaded up a gene set, the supporting mRNAs and maybe the SNPs from OMIM or dbSNP, and the Conservation track to make a great point.

Now you want to save that thought, or share it with a colleague, or make a slide for a meeting, or publish it in a paper. Saving your screenthought can take two forms: static or dynamic.  You can snap and save a picture of the screen, or you can share a link to an active Genome Browser.  We’ll talk about both approaches here and discuss some of the advantages and pitfalls of each.

Share a static image.    You can always take a screen grab and throw it onto a slide with little effort.  The screen resolution is fine for  a slide, because your computer and your slide will viewFingerboth be 72 or 96 dpi.  But, if you try that for a publication, your image will have to be really small (scale down 3x in each dimension to get 300 dpi for print) or it will be unacceptably fuzzy.

To get high resolution images for publication, use the Browser’s .pdf export function to allow the vector-graphics image to scale to full journal size and resolution. Look for the .pdf output in the “View” pulldown menu at the top of the Browser page.  Both the chromosome ideogram and the main Browser graphic can be saved in this fashion.

Share a dynamic session, but DO NOT copy a URL.  To save a dynamic screen session that would allow you or others to look around, add more data tracks, check out other genes, etc., you might be tempted to simply copy the URL from your Firefox or Chrome web browser.  That might even seem to work OK at first, but it is in fact not a stable link and can lead to weird Browser behavior.  Worse, you may not even be sharing what you think you are, and will never know it.

Let’s break down a URL as copied directly from my Firefox and see how it plays out.

url2

This URL contains a parameter, hgsid, which is actually a pointer to a row in a UCSC database identifying your session and keeping the state of all your variables (we borrowed the name “cart”).  If you send this URL to someone, yet keep browsing around, your cart will continue to change as you work, and your friend will see the latest state your Genome Browser is in when she clicks the link. The original state of your cart when you shared the URL is long gone before she sees it.

Your shared URL might even appear to work OK because two of the variables in the URL, db (database) and position, will override values stored in your cart (cart variables are separated by an ampersand).  Your friend will see the right genome assembly (db variable) and location (position variable) and think she’s seeing what you want.  But, if you have turned any data tracks on or off in the interim, or removed a custom track, those changes will also be part of what she sees. The original state is lost.  A different colleague could click the link at some other time and see something different still.

As an experiment, here is that same URL in a form you can click or copy/paste into your web browser:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr8%3A38311140-38327276&hgsid=438231169_c2xrrbHK2bQhTuHqjEIOniXGqenu

Does it look like this?

Untitled

That’s what it looked like when I shared the URL. Your click will show the 5’ end of the FGFR1 gene region on human assembly hg19 (because the URL has explicitly included db and position variables), but who knows what tracks might be turned on or off in the interim? Whatever the last person to click it did to it will rule. Every person who reads this blog and clicks the link can change the track configuration for whomever comes next. Only the db and position are going to persist.

Quick-and-dirty URL hack.    If you really want a quick-and-dirty way to share a link, here are a couple of suggestions.  You could send the link as it is above, then strip a few characters out of the hgsid in the URL in your own browser and refresh.  Because the new long hgsid string will not exist in our database, you will be assigned a new hgsid and the state of the old one will stick – until your friend starts messing with it.  Or you could strip out the hgsid parameter entirely and add in other parameters that define the tracks you want to turn on, e.g.:

&knownGene=pack&snp142=dense

That will better define the tracks you want, but it is neither as stable nor as easy as saving a Session. You can use “hide,” too, to be sure certain tracks are turned off. Read more about configuring your links here.

Share a stable dynamic Session.    The best way to save a train of thought in a stable fashion is via the Saved Session tools under the “My Data” pulldown menu. A Saved Session acts as a mydataFingerstable snapshot of all the details of your Browser view.  Saving a thought using this feature requires a login, but it allows you to save the state of a Browser session (semi)-permanently. Anyone viewing your session will be able to further browse around the genome without affecting the session you saved.  After you have saved a session, you will see a “Browser” link that can be copied and shared.

For example, to load the view above as a stable session, try this link (no login is required to view some else’s Saved Session):

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=sessionGallery&hgS_otherUserSessionName=hg19_watsonKriek

Although anyone with this URL can view this session, no one can change it unless logged in as user “SessionGallery.”

In the past we endeavored to save the Session for at least 3-4 months after the last time it was viewed, and custom tracks in sessions were subject to persist for at least 48 hours after the last time they were viewed. We have now moved to not remove session data, unless deleted, and to not remove custom tracks in sessions.  We still encourage people to save their Session cart to a local file using the “Save Settings” feature (and to keep backups of all their custom tracks on a local machine).  That way, you can load your Session settings any time and onto any copy of the Browser (such as to the European mirror or a local Genome Browser-in-a-Box) and avoid any possible loss of data due to unforeseen circumstances.  We do the best we can to maintain our servers so that you do not lose your sessions, but computers are only human and they break.

Really stable sessions.    If you are looking to create a permanent link for a publication, you should consider hosting your downloaded Session and any of your own custom data on a server you control (such as in a Track Hub). It will still be loaded onto the UCSC Genome Browser, but you are not at the mercy of California earthquakes, wildfires or crashed servers (except for your own).  You can read more about building links to remotely hosted user information here and on our Session’s Gallery page here.

On both pages you can learn about the following parameters for forming links to launch sessions from your hub:

hgS_doLoadUrl=submit
hgS_loadUrlName=

We hope we have given you some food for thought about how to make the Genome Browser more useful in your work.  Using a reliable method for saving and sharing sessions is great way to avoid the frustration of lost data and misleading links.  Stay tuned for more useful Browser tips in future blogs.