GenArk Hubs Part 4 – New assembly request page

This blog post adds to an earlier series that discusses the Genome Archive (GenArk) assembly hubs. To help users both to find Genome Archive (GenArk) hubs and to inform us what GCA/GCF accessions to add to the collection, we have created a new genome assembly request page: http://genome.ucsc.edu/assemblyRequest.html

Last year we announced the creation of a new collection of GenArk assembly hubs. GenArk hubs are constructed from NCBI GenBank GCA/GCF accessioned assembly data; for instance, GCF_001984765.1 is the accession for an American beaver assembly. When present in UCSC’s GenArk collection, these genome browsers can be loaded instantly with direct links (e.g., http://genome.ucsc.edu/h/GCF_001984765.1), and they come ready with dynamically invoked BLAT and PCR servers, enabling searching for sequences and primers. The first released GenArk hubs were organized into phylogenetic groups; for example, all bird assemblies were listed here.

Our newly added genome assembly request page displays which assemblies are available for viewing, and presents a single-click “request” button to send an email to UCSC to add any GCA/GCF assembly available at NCBI not yet part of the GenArk collection.

GenArk currently has over 1,700 assembly hubs available for browsing at the click of a “view” button. To view only those in the current collection, use the middle “select assembly type to display” option and uncheck the “Request browser” checkbox so that only completed browsers show. Click “view” to launch a specific genome browser listed on the page. This earliest version of the page has some performance issues, and selections may take some time. A planned improvement is to present an active wait cursor while the page is busy filtering results.

By using the “choose clade to view/hide” option it is possible to subselect groups, such as only displaying plants, when narrowing down which genomes may exist or could be requested.

The first “show/hide columns” option on the page enables displaying additional metadata, such as adding the assembly build date or Biosample number, where links exist back to NCBI for more information.

If NCBI does not have a GCA/GCF accession for your desired assembly then our scripts will not be able to pull the data and generate the GenArk addition. Such new assemblies will need to be submitted to NCBI first, after which you can then notify UCSC. You can find directions at NCBI for how to submit new genomes: https://www.ncbi.nlm.nih.gov/assembly/docs/submission/ Also, please review the UCSC GenArk Blog posts for more information on accessing and using tools on GenArk hubs.


If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

GenArk Hubs Part 3 – Technical details

This blog post is the final of three to discuss the Genome Archive (GenArk) assembly hubs. This third post discusses the technical infrastructure of the GenArk hubs, while the first post was about accessing the data, and the second shared examples of using the data. 

TECHNICAL DETAILS

What are the systems behind GenArk hubs?

In essence, GenArk hubs are assembly hubs that have been added as Public Hubs. 

Anyone can build Track Hubs and Assembly Hubs, and finished hubs can then be requested to be published as Public Hubs. UCSC has a Public Hub Guidelines page to encourage hub developers to document their data fully before submitting for inclusion. This is mainly to ensure that users of Public Hubs know whom to contact about the data and can understand what they are visualizing.

To aid independent groups that do not want to build assembly hubs, when the underlying assembly data is already available at NCBI, our engineers have crafted scripts to build these files automatically.  GenArk scripts pull data from NCBI and then programmatically construct all of the binary-indexed files needed to visualize them on the UCSC Genome Browser. Additional special features have been included, especially the ability to generate and provide BLAT and PCR dynamic servers through the pre-generation of special index files. Our engineers have also optimized other elements of these GenArk Hubs by applying the latest available Track Hub features. 

But what are the internal methods UCSC uses to populate the GenArk hubs?

Here at UCSC, an internal process maintains a local mirror image of the NCBI genome assembly resources. First, data is transferred with an rsync request from rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/ to a matching local hierarchy of directories. These matching directory structures and file naming conventions enable scripted procedures to automatically find the source files and process them into the formats the UCSC Genome Browser recognizes for visualizing data, mainly byte-range accessible binary-indexed versions of the data. A Perl script, doAssemblyHub.pl, manages all the steps of the procedure (https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/doAssemblyHub.pl).
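As a rough illustration of how a matching directory hierarchy lets scripts locate source files, the sketch below derives the directory prefix for an accession. It assumes the NCBI layout in which the nine accession digits are split into three-digit directories under genomes/all/; the function names are hypothetical and are not taken from doAssemblyHub.pl.

```python
import re

# GCA/GCF accession: prefix, nine digits, and a version number.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def ncbi_accession_dir(accession: str) -> str:
    """Return the directory prefix under genomes/all/ for an accession.

    Assumes the nine digits are split into three groups of three,
    e.g. GCF_001984765.1 -> GCF/001/984/765.
    """
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a GCA/GCF accession: {accession!r}")
    prefix, digits, _version = match.groups()
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([prefix, *parts])


def rsync_source(accession: str) -> str:
    """Build an rsync source URL for the directory holding the accession."""
    return ("rsync://ftp.ncbi.nlm.nih.gov/genomes/all/"
            + ncbi_accession_dir(accession) + "/")


if __name__ == "__main__":
    print(rsync_source("GCF_001984765.1"))
```

Because the local mirror uses the same hierarchy, the same prefix could be appended to a local root directory to find the mirrored files.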

So how are gene tracks created for these GenArk hubs?

For the assemblies with GCF accessions, the script uses the NCBI gene annotations supplied with those files to create gene tracks. A specific gene track type called bigGenePred, https://genome.ucsc.edu/goldenPath/help/bigGenePred.html, allows amino acid displays when zoomed in to the base level. Likewise, a bigGenePred track is made to display Xeno RefGene data, which is computed from a selection of the best alignments of RefSeq mRNA sequences from many organisms to the genome, using the BLAT algorithm (http://www.kentinformatics.com/). Another bigGenePred track is made using the Augustus gene prediction software (http://bioinf.uni-greifswald.de/augustus/) from the Stanke lab.

How are the other GenArk annotation tracks made?

GenArk assemblies also have Repeat Masker tracks, which use the data from the NCBI source when it is supplied. Otherwise, the track can be computed with a local installation of the RepeatMasker software (https://www.repeatmasker.org/). The Simple Repeats track is computed with the Tandem Repeats Finder software (https://tandem.bu.edu/trf/trf.submit.options.html) and the Window Masker track is computed with the WindowMasker software included in the NCBI C++ toolkit (https://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/). The CpG Islands are computed with a modification of a program developed by G. Miklem and L. Hillier, and the GC Percent track is computed using the ‘kent’ command hgGcPercent (http://hgdownload.soe.ucsc.edu/admin/exe/). Examining the doAssemblyHub.pl script (https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/doAssemblyHub.pl) will illustrate more details about how individual steps are run (e.g., hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 test ../../\$asmId.2bit | gzip -c > \$asmId.wigVarStep.gz).
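To give a feel for what a windowed GC Percent computation produces, here is a minimal Python sketch that scores non-overlapping 5-base windows and prints them in wiggle variableStep form. It is only an illustration of the idea and not the hgGcPercent implementation: it ignores gaps and N bases, and the function names are hypothetical.

```python
def gc_percent_windows(sequence, window=5):
    """Yield (1-based start, GC percent) for each full, non-overlapping window."""
    seq = sequence.upper()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        yield start + 1, 100.0 * gc_count / window


def to_variable_step(chrom, sequence, window=5):
    """Format the windows as wiggle variableStep text (1-based starts)."""
    lines = [f"variableStep chrom={chrom} span={window}"]
    for start, percent in gc_percent_windows(sequence, window):
        lines.append(f"{start} {percent:g}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy sequence: the first window is 80% GC, the second has none.
    print(to_variable_step("chr1", "GGCCAATATA"))
```

A trailing partial window is simply dropped here; a production tool would need an explicit policy for sequence ends and gaps.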

What if I don’t find my Assembly in the GenArk collection?

If you can’t find the assembly you want in the GenArk hub collection, but you already have the GCA/GCF identifier, you can email our public mailing list at genome@soe.ucsc.edu to request that we add the assembly to the GenArk collection. This archived mailing list is searchable from links on our contacts page, http://genome.ucsc.edu/contacts.html. Alternatively, if you don’t want your request to be public, you can email our private internal mailing list at genome-www@soe.ucsc.edu. Also, since this blog post was originally published, we have created a new assembly request page; you can find details in the 4th GenArk blog post.

What if my assembly doesn’t have a GCA/GCF NCBI accession?

If NCBI does not have a GCA/GCF accession for your assembly then our scripts will not be able to pull the data and generate the GenArk hub. You will need to deposit the assembly at NCBI and notify us once the assembly has become available. You can find directions at NCBI for how to submit new genomes: https://www.ncbi.nlm.nih.gov/assembly/docs/submission/ 

A future manuscript is also in the works to further detail the background of the GenArk hubs.

This was the final blog in a three-part series about GenArk hubs authored by Brian Lee. The first post focused on how to discover and access the hubs, while the second blog post provided tutorial examples of using the GenArk hubs, such as the BLAT and PCR tools that are available, or how you can send DNA of any Assembly Hubs to External Tools for processing.



GenArk Hubs Part 2 – Using the data

This blog post is the second of three to discuss the Genome Archive (GenArk) assembly hubs. This second post gives examples of using the GenArk hubs’ data, while the first post was about accessing the data, and the third shares the technical infrastructure behind the hubs.

Before launching into using the new GenArk hubs, let’s go quickly over how the first blog post examined the multiple ways to access the GenArk hubs. The easiest way to find a GenArk hub is by searching the UCSC Genome Browser’s main Gateway page with a name like “hummingbird” and clicking on the GCA/GCF identifier to attach it.  Another is to build direct links to NCBI GCA/GCF assembly accessions when you know them to instantly arrive at the main Browser view, such as https://genome.ucsc.edu/h/GCF_005190385.1 for narwhal. Yet another is searching the UCSC Public Hubs page or going to the main GenArk homepage where you can in turn navigate directly to individual taxonomic group pages, such as for birds.

USING THE DATA 

What can you do with a GenArk Assembly hub?

The new GenArk hubs come with the ability to perform BLAT DNA queries and PCR primer searches, as well as send the genome’s DNA to external tools.

As an example, let’s say you are curious if we have a specific bat genome. The first step would be to go to the Gateway page and search “bat” and discover multiple hits. 

Looking at search results you see your desired specific “little brown bat” assembly and click on it so that hub is now selected, where under “Find Position” on the right there would now be “Mammal assemblies Hub Assembly” attached with “little brown bat” displayed and a specific GCF_000147115.1 NCBI accession. Clicking the “Go” button would bring you to the main Browser display. The same result happens from clicking this short direct GenArk /h/ hub link: https://genome.ucsc.edu/h/GCF_000147115.1

BLAT DNA Search

With this bat genome displayed, if you have a short DNA sequence you want to search for, you can paste it right into the top search box. For instance, after clicking the above link, try pasting CATTAGGCAAATATATGCATATAAGTTCTTTGTTTAATCTCT into the search box on the main browser display and hit “go”. The result, shown after a few seconds, will be sequence matches across the little brown bat genome. You can also go to the top Tools menu, select “Blat”, and paste the DNA sequence there, which is required when searching longer sequences. The Blat Tool page also allows you to search alternative sequences.

BLAT Protein Search

On the Tools > Blat page you can put in protein sequences to search. This is especially interesting if you want to find the location of a known protein from another species in your genome of interest. For example, if you are again on the little brown bat genome and you go to the Blat page, try to blat this portion of the human SOD1 protein: LSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN

You will find a match (again, note that for protein searches you need to go to the Tools > Blat page). When viewing the results you can either click a “browser” link to see the matching spots on the genome, or click a “details” link to see the side-by-side alignment, as in the image below.

Besides DNA and protein searches, BLAT also allows translated RNA and translated DNA searches. Also the results from BLAT searches can be saved as custom tracks. This allows you to download and save these annotations, or save them in Sessions making the results more permanent and shareable. See this other blog post about sharing sessions for more information: https://bit.ly/UCSC_blog_sharing 

PCR Primer Search

GenArk also provides PCR primer searches, by going to the Tools menu and selecting PCR. With the same “little brown bat” genome loaded, for instance, go to the Tools menu and select “In-Silico PCR” to arrive on the PCR page. Then enter these two primers, forward primer: AGTCATGGTCTCAGGAACCG and reverse primer:  GTTACTAGGGCTCAGACCTC  (there is no need in this example to click any other settings). 

Then click “submit” to search the “little brown bat” genome for matches.

The results will be two hits, in part because this assembly has 11,654 scaffolds with some identical sequences (to see all the scaffolds click the “view sequences” link on the Gateway page described later). 
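To make the idea behind an in-silico PCR search concrete, the sketch below looks for exact matches of a forward primer and the reverse complement of a reverse primer on one strand of a template. It is a simplified stand-in and not the UCSC PCR server's algorithm: it allows no mismatches, checks only one orientation, and the max_size parameter and function names are illustrative assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_amplicons(template, forward, reverse, max_size=4000):
    """Return (start, end) pairs, 0-based half-open, for exact-match products.

    A product starts at a forward primer match and ends at a downstream
    match of the reverse primer's reverse complement.
    """
    template = template.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    hits = []
    start = template.find(forward)
    while start != -1:
        site = template.find(reverse_site, start + len(forward))
        while site != -1:
            end = site + len(reverse_site)
            if end - start > max_size:
                break  # later sites only give longer products
            hits.append((start, end))
            site = template.find(reverse_site, site + 1)
        start = template.find(forward, start + 1)
    return hits


if __name__ == "__main__":
    fwd = "AGTCATGGTCTCAGGAACCG"
    rev = "GTTACTAGGGCTCAGACCTC"
    # Toy template built around the two primers from the example above.
    toy = "TT" + fwd + "ACGTACGT" + reverse_complement(rev) + "AA"
    print(find_amplicons(toy, fwd, rev))
```

Running a sketch like this over every scaffold of an assembly would report one product per matching scaffold, which is why identical sequences on different scaffolds lead to multiple hits.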

Send DNA to External Tools

Another way to use GenArk hubs is to send the current DNA in the viewing window to external sites. By going to the View menu you can select the “In External Tools” option and export the current DNA for processing outside of the UCSC Genome Browser.

In this image a 7,477 bp region will be selected to be sent to external sites where selecting “In External Tools” under the View menu will result in a pop-up of various options.

In this case all of the options are presented as available for this 7,477 bp span, except for RNAfold, which requires the viewer to zoom in to less than 5 kbp before sending the DNA to that external tool.

Send DNA to External Tools - Primer Design: Primer-BLAST

If you were interested in PCR Primer design in this region you could use the Primer3Plus or  Primer-BLAST links. The Primer-BLAST link starts a job at NCBI, where after some time the results at NCBI will be optimal PCR Primers for this stretch of DNA. Here are example results sending the 7,477 bp  span of the NW_005878708v1 little brown bat scaffold to NCBI.

With these results, one can return to the UCSC PCR Tool to test each result in order to discover if these primers will have potential off-target results beyond the desired chromosome.

Send DNA to External Tools - Primer Design: Primer3Plus

Another PCR Primer design option in the “In External Tools” menu is Primer3Plus. Here are example results sending the same 7,477 bp span of the NW_005878708v1 little brown bat scaffold to Primer3Plus.

Primer3Plus has the added benefits of a “Return to Genome Browser” button (top left) that if clicked will dynamically generate a custom track of the results to be seen back on the UCSC Genome Browser.

Above, the Primer3Plus custom track identifies the input region that was sent (top grey bar), and then the individual left and right matching primer pair locations. At UCSC the primers can then be tested again with the UCSC PCR Tool; the Primer3Plus suggested “Primer 5” is highlighted in the above image.

Send DNA to External Tools for oligo-analysis 

Another tool you can export DNA of interest to is Regulatory Sequence Analysis Tools (RSAT) Metazoa for motif discovery. For instance, when looking at a GenArk assembly for Zebu Cattle, https://genome.ucsc.edu/h/GCF_000247795.1, using the View menu and In External Tools option one could select the RSAT link. RSAT provides a way to analyze the DNA sequence for transcription factor binding sites and over-represented oligo-nucleotides. Because RSAT asks for an organism, Bos taurus was used in this example as a relative of zebu cattle, which allowed the request to examine the region to proceed. The DNA sent in this example was near the start of a gene predicted by Augustus. One of the RSAT results was a predicted motif, aaacttatagata, just upstream of the transcription start site for the predicted gene.

By going back to the UCSC Genome Browser and clicking into the Short Match track (under the top Mapping section) and pasting in the motif sequence, aaacttatagata, a display in the GenArk hub of where these matches occurred could be visualized.

The Short Match track’s ability to visualize the motif identifies the potential binding sites of transcription factors, predicted by RSAT.  This Browser view of the Zebu Cattle GenArk assembly hub can be viewed with this Public Session link

Can I add custom tracks to a GenArk Assembly hub?

Yes, users can add tracks to their data by going to the My Data menu and then selecting Custom Tracks to paste in information. Simple text-based tracks can be loaded, or more complicated binary-indexed files such as BAMs or VCFs or bigBeds can be loaded as well. 

How do I name my sequences for my custom tracks?

Another special feature of GenArk hubs is that they are loaded with a special chromAlias file allowing for multiple alias names. When building custom tracks the scaffold names for sequences need to match the names in the assembly, but many options exist. For instance, with the Zebu Cattle genome, https://genome.ucsc.edu/h/GCF_000247795.1, if you type “v s” to view sequences, or click the top “Genomes” name and then the “view sequences” button, you will end up on a page where all the scaffolds of a genome are displayed.

Scrolling down on the resulting page you will see a link titled “GCF_000247795.1.chromAlias.txt” which will have results like this:

# sequenceName    alias names    assembly: GCF_000247795.1_Bos_indicus_1.0
chr1    1    CM003021.1    NC_032650.1
chr10    10    CM003030.1    NC_032659.1
chr11    11    CM003031.1    NC_032660.1
...

What this chromAlias.txt file displays is how “chr1”, or “1”, or “CM003021.1” or “NC_032650.1” can be used to create custom tracks on chromosome one for this assembly (i.e., BED custom tracks “chr1 300 500” = “1 300 500” = “CM003021.1 300 500” = “NC_032650.1 300 500”). 
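The alias equivalence described above can be sketched in a few lines of Python. This example parses the three sample rows shown earlier and rewrites the first column of a BED line to the primary sequence name. It assumes whitespace-separated columns with a leading "#" header line; the function names are hypothetical, and the browser performs this mapping itself, so the code is only meant to show the idea.

```python
# Sample rows copied from the chromAlias.txt excerpt for GCF_000247795.1.
CHROM_ALIAS_TEXT = """\
# sequenceName\talias names\tassembly: GCF_000247795.1_Bos_indicus_1.0
chr1\t1\tCM003021.1\tNC_032650.1
chr10\t10\tCM003030.1\tNC_032659.1
chr11\t11\tCM003031.1\tNC_032660.1
"""


def load_aliases(text):
    """Map every name on a row (including the first) to the first-column name."""
    alias_to_name = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comment
        fields = line.split()
        for alias in fields:
            alias_to_name[alias] = fields[0]
    return alias_to_name


def normalize_bed_line(line, alias_to_name):
    """Rewrite the chromosome column of a BED line to the primary name."""
    fields = line.split()
    fields[0] = alias_to_name[fields[0]]
    return "\t".join(fields)


if __name__ == "__main__":
    aliases = load_aliases(CHROM_ALIAS_TEXT)
    for bed in ("chr1 300 500", "1 300 500",
                "CM003021.1 300 500", "NC_032650.1 300 500"):
        print(normalize_bed_line(bed, aliases))
```

All four input lines print the same normalized record, which mirrors how each alias points at the same chromosome in the assembly.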

Can I add a Track Hub to a GenArk Assembly Hub?

Yes, after loading a GenArk hub you can go to the My Data menu and paste in the location of a Track Hub to display on any of the GenArk assembly hubs. The one special detail is that the genome line in your Track Hub’s genomes.txt only needs to have the GCA/GCF number, such as “genome GCF_001984765.1”. See this example hub.txt file for an idea of how a hub could be loaded on a GenArk hub. Here is a link that will load that hub on a GenArk hub for American beaver:

https://genome.ucsc.edu/h/GCF_001984765.1?position=NW_017869957v1:1,285,000-1,793,000&hubUrl=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt
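A link of this shape can also be assembled programmatically. The following minimal sketch combines an accession, an optional position, and a hubUrl into a /h/ link; the helper name is hypothetical, and the choice of characters left unescaped is an assumption made so that the output matches the hand-written link above.

```python
from urllib.parse import urlencode


def genark_hub_link(accession, hub_url, position=None):
    """Build a /h/ link that loads a Track Hub on top of a GenArk assembly."""
    params = {}
    if position:
        params["position"] = position
    params["hubUrl"] = hub_url
    # Leave ':', '/', and ',' unescaped so the link stays readable.
    query = urlencode(params, safe=":/,")
    return f"https://genome.ucsc.edu/h/{accession}?{query}"


if __name__ == "__main__":
    print(genark_hub_link(
        "GCF_001984765.1",
        "https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt",
        position="NW_017869957v1:1,285,000-1,793,000",
    ))
```

The first parameter follows the question mark and every later one follows an ampersand, which urlencode handles automatically.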

Can I share data on a GenArk assembly hub?

Yes, you can make a session and share the URL with others. Even better, publish your session to the Public Session page to make it more discoverable. See this previous blog about sharing data for more information:  https://bit.ly/UCSC_blog_sharing   

The next blog post in this series will provide some technical details about the GenArk hub architecture. The first post focused on how to discover and access the hubs.


This entry was written by Brian Lee.

GenArk Hubs Part 1 – Accessing the data

This blog post is the first of three to discuss the Genome Archive (GenArk) assembly hubs. This first post discusses accessing the GenArk hubs, the second post gives examples of using the data, and the third post describes the technical infrastructure behind the hubs. 

Let’s start with a real-world story: imagine you are a researcher working on zebrafish, but you are using an alternative strain with unique polymorphic properties. You have a desire to do CRISPR on your particular zebrafish and you already have a FASTA file for the genome assembled into chromosomes, but have no annotations or way to visualize the data yet. 

One option to visualize your FASTA would be to independently create a UCSC Assembly Track Hub to work on your zebrafish. Alternatively, now that UCSC has developed the Genome Archive (GenArk) system, once you submit your assembly to NCBI’s assembly database you could contact us directly and request that we generate the browser for you behind the scenes. This happened for a specific lab: they submitted their TD5 zebrafish assembly to NCBI, https://www.ncbi.nlm.nih.gov/assembly/GCA_018400075.1, and the result after contacting us was a new assembly hub that can be easily loaded at UCSC with the following link: https://genome.ucsc.edu/h/GCA_018400075.1 In this case, the team at UCSC even helped generate liftOver alignment files between the UCSC zebrafish assembly and this new TD5 zebrafish GenArk Public Hub addition, making it easier to lift annotations to the new browser.

So what are the GenArk Assembly hubs?

GenArk hubs are a collection of data files, hosted externally from the main UCSC data website, that enable browsing new genomes. GenArk genomes have NCBI GenBank assembly accessions starting with either GCA or GCF, and the browsers allow visualizing and attaching laboratory-generated data. New software also enables UCSC to dynamically turn on query servers to search GenArk hubs with DNA sequences or test PCR primer pairs. GenArk hubs are part of the UCSC Public Hubs list where UCSC can update the data files with pipelines.

ACCESSING THE DATA

How do I access GenArk Assembly Hubs?

There are multiple ways to access the GenArk hubs, including searching the UCSC Genome Browser’s main gateway page, building direct links to NCBI GCA/GCF assembly accessions, searching the UCSC Public Hubs page, and navigating directly to individual taxonomic group pages.

Browser Gateway Page

The easiest way to find GenArk hubs is to search the species name on the Browser Gateway.

On the Gateway page in the top left box you can search a term such as “dog” and find all the genomes both hosted in our internal databases and in external Public Hubs that have dog in the name. In this image, a search for “dog” returns a top “Dog” match (UCSC database) as well as results for several species in Assembly Track Hubs that match on the term “dog” with the specific labrador dog breed selected from the GenArk Mammal Assemblies Hub (GCF_014441545.1).

Direct GCA/GCF Accession Links

If you know the GCF/GCA identifier for an assembly, you can also search for that term on the Gateway page or build a short link to directly load the hub. Links to UCSC with a hub (“/h/”) address, such as https://genome.ucsc.edu/h/GCF_000698965.1, will attempt to find and attach the hub matching the final GCA/GCF value, which originates from the NCBI accession, in this case for an African ostrich assembly. If you don’t find a match, read more below about contacting us.
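As a small sketch of building such short links, the Python below pulls a GCA/GCF accession out of a string and appends it to the /h/ address. The accession pattern (prefix, nine digits, version) and the function name are assumptions for illustration; the code only constructs the link and does not check whether the hub exists in the GenArk collection.

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, a dot, and a version.
ACCESSION_PATTERN = re.compile(r"GC[AF]_\d{9}\.\d+")


def hub_link(text):
    """Return a /h/ link for the first GCA/GCF accession found in text."""
    match = ACCESSION_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no GCA/GCF accession found in {text!r}")
    return f"https://genome.ucsc.edu/h/{match.group(0)}"


if __name__ == "__main__":
    print(hub_link("GCF_000698965.1"))
    # A longer assembly name also works; only the accession part is kept.
    print(hub_link("GCF_000247795.1_Bos_indicus_1.0"))
```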

Public Hubs Page

Another place to find GenArk hubs is on the Public Hubs page, where you can enter various search terms, like “ostrich” (https://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=ostrich). You can expand the “Search details” to examine matching results. To load a desired hub, use a right-click to display an “Open this assembly” pop-up, or an option to configure individual track settings.

Genomes Menu

Another option to gain an overview of all the GenArk hubs is to click the “Genome Archive GenArk” link available under the “Genomes” menu.

This Genomes menu link will open the GenArk homepage. On the GenArk homepage, a variety of links exist including the line, “Please note: text file listing of 1,600 NCBI/VGP genome assembly hubs.” Clicking that link will open a single text file that lists all available hubs, allowing a quick overview: https://hgdownload.soe.ucsc.edu/hubs/UCSC_GI.assemblyHubList.txt

Individual Taxonomic Pages

The GenArk homepage also has links to specific taxonomic groupings hub pages, such as mammals, fishes, or fungi. For instance, a “birds” link, https://hgdownload.soe.ucsc.edu/hubs/birds/index.html, brings you to a webpage with links to launch browsers, along with links to other details for each assembly.

These taxonomic group pages, such as the birds page in this image, have links to launch the browser (2nd column: common name and view in browser) and links to the source files (4th column: NCBI assembly).

Access to these taxonomic group pages is also available from the Public Hubs page.

By going to the Description column on the Public Hubs page you can click a link (Bird genome assemblies) to end up at the related taxonomic grouping page. Also note that on the Public Hubs page you can click a [+] plus button to expand the list of Assemblies and click one of the GCA/GCF accession links to directly load an assembly.

What if I don’t find my Assembly in the GenArk collection?

If the assembly of interest is not found, please visit our assembly request page. Search that page for your assembly; if there is a “view” link, you can launch the existing genome browser. Otherwise, click the “request” button to fill out a form to add your genome of interest.

The assembly request page does require there to already be an existing GCA/GCF identifier. You can also always email our public mailing list at genome@soe.ucsc.edu to request that we add the assembly to the GenArk collection. This archived mailing list is searchable from links on our contacts page, http://genome.ucsc.edu/contacts.html. Alternatively, if you don’t want your request to be public, you can email our private internal mailing list at genome-www@soe.ucsc.edu.

What if my assembly doesn’t have a GCA/GCF NCBI accession?

If NCBI does not have a GCA/GCF accession for your assembly then our scripts will not be able to pull the data and generate the GenArk hub. You will need to deposit the assembly at NCBI and notify us once the assembly has become available. You can find directions at NCBI for how to submit new genomes: https://www.ncbi.nlm.nih.gov/assembly/docs/submission/ 

The next blog post in this series will provide examples of using the GenArk hubs, such as the BLAT and PCR tools that are available, or how you can send DNA of any Assembly Hubs to External Tools for processing.  The final post examines the infrastructure behind the hubs.


This entry was written by Brian Lee.

Sharing Data with Sessions and URLs

In this blog post, I’ll give an overview of ways to share Genome Browser data views with others.

Visualizing and sharing custom data is one of the most useful features of the UCSC Genome Browser tool. An independent review evaluating various genome browsers (http://tinyurl.com/genome-browsers), emphasized “the local and global exports for sharing sessions” is one of the site’s most “attractive functionalities,” with the report concluding that the UCSC Genome Browser “is the best tool of our evaluation from that point of view.”

Many veteran users are not aware of how easy it is to create and share browser views called sessions, especially using the more recent Public Sessions feature. Few users know that there are ways to modify URLs to share custom data, even to build URL links on top of data or sessions created by others. This blog post will give a wide overview of the many ways to share data on the Browser.

  • TIP: You can watch a great introductory video to Saving and Sharing Sessions, which walks users through the steps to build a session and illustrates the new Public Sessions tool: http://bit.ly/sessionVid

SESSIONS AND PUBLIC SESSIONS

To access Sessions, under the top “My Data” menu there is a “My Sessions” option that leads to the page to create a URL snapshot of the view you are looking at in the Genome Browser. Once a user has created an account, on the Sessions Management page, they can then save a snapshot by giving the current view any “sessionName”.  A link, built from the userName and given sessionName, will be created that can be shared with others: https://genome.ucsc.edu/s/userName/sessionName

Once a session is created users have the option to click a “details” button on the Sessions Management page that leads them to an additional screen where they can enter a description. Newly created sessions are shareable by default, but can be made private (thereby requiring an account login to access), or they can be published to the Public Sessions page, where a search such as on the userName (https://genome.ucsc.edu/cgi-bin/hgPublicSessions?search=userName) will bring up all sessions that the author published.

Public Sessions with descriptions are even more discoverable since matches will be returned on words found in the description. Public Sessions can be accessed under the “My Data” menu and a search term can be entered in the box on the right, or a URL can be built to scan for specific search terms as illustrated above for userName. If you search “protein” you will find all the sessions, for instance, that have mentioned protein in their description. Here’s an example: http://genome.ucsc.edu/cgi-bin/hgPublicSessions?search=protein

  • TIP: When sessions are created with custom data uploaded, the uploaded data becomes “immortalized.” Usually any uploaded custom text-based tracks will be deleted in a few days, but by creating a session any uploaded tracks are marked as belonging to the associated userName account and attempts are made to preserve them. Please keep a local backup of your session contents, however, as the Browser is not a data storage service.

BUILDING URLS TO SET TRACK VISIBILITIES

Sometimes users want to hide all the tracks and only display certain data, and this can be done even without creating sessions. You can control the visibility of tracks from the URL with some of the following parameters:

  • hideTracks=1 – hides all tracks
  • <trackName>=hide|dense|pack|full – sets the specified track or subtrack to the chosen visibility
  • <trackName>.heightPer=<###> – sets a bigWig track’s height to a particular number of pixels (between 20-100)

For example, you can use the following URL to hide every track (hideTracks=1), set the genome database to hg38 (db=hg38), set the mappability track to full visibility (mappability=full), and set the umap track height to 100 pixels (umap24Quantitative.heightPer=100): http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hideTracks=1&mappability=full&umap24Quantitative.heightPer=100
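The parameters above can be combined in code as well. This is a minimal sketch, with a hypothetical helper name, that checks each visibility against the four values listed and each height against the 20-100 pixel range quoted above before assembling the hgTracks URL.

```python
from urllib.parse import urlencode

VISIBILITIES = {"hide", "dense", "pack", "full"}


def track_view_url(db, visibilities, heights=None, hide_others=True):
    """Build an hgTracks URL that sets track visibilities and heights."""
    params = {"db": db}
    if hide_others:
        params["hideTracks"] = "1"  # hide every track first
    for track, visibility in visibilities.items():
        if visibility not in VISIBILITIES:
            raise ValueError(f"unknown visibility for {track}: {visibility!r}")
        params[track] = visibility
    for track, pixels in (heights or {}).items():
        if not 20 <= pixels <= 100:
            raise ValueError(f"height for {track} must be 20-100 pixels")
        params[f"{track}.heightPer"] = str(pixels)
    return "http://genome.ucsc.edu/cgi-bin/hgTracks?" + urlencode(params)


if __name__ == "__main__":
    print(track_view_url(
        "hg38",
        {"mappability": "full"},
        heights={"umap24Quantitative": 100},
    ))
```

With these inputs the function reproduces the example URL above, parameter for parameter.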

BUILDING URLS TO CUSTOM TRACKS

Users can also share data with links without first creating a session by adding a “hgct_customText=” parameter to their base URL. For instance, if a group has data for the human hg38 database in a web-accessible location that meets the criteria for loading as a custom track, they can build URL links in this fashion: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgct_customText=http://location.online/dataFile

That online dataFile can be the track data, or a collection of more URLs to load more custom tracks. For instance, in a recent blog post about building bigBed tracks, https://bit.ly/UCSC_blog_bigBed, there was an example of hosting bigBed data at CyVerse. Since the data only displays in the position range of 1,405,000-1,448,000 on chromosome 5, a URL such as the below will load the hg19 genome (db=hg19) and go to a specific position (position=chr5:1405000-1448000) and then attach the remote file (hgct_customText=): https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr5:1405000-1448000&hgct_customText=https://data.cyverse.org/dav-anon/iplant/home/brianlee/Lab_Primers.bigBed

  • TIP: One advantage of not using sessions is that a user’s preexisting preferences for track displays will not be impacted.  For instance, if they have a collection of clinical tracks displaying, using the hgct_customText= parameter or hubUrl= will add the new remote data to a user’s existing preferred clinical track configurations. Sessions, on the  other hand, would disconnect existing remote data and change the position location as well as reconfigure tracks, to match everything saved when the session was created.

BUILDING URLS TO TRACK HUBS

Once a user has taken the step to build binary-indexed files such as bigBeds or bigWigs, they can go a step further and put their collection of tracks into a Track Hub. Track Hubs provide much more power for loading external data in more complex ways, such as enabling search indexes on uniquely named items in the remote data, or coloring tracks or individual elements.

Track Hubs are similar to the idea of having a text file that points to a collection of remotely hosted custom tracks. To make sharing easy, just one URL, called a hubUrl, is given to the browser to load the Track Hub. All the remotely hosted data, which must be in a binary-indexed format, is then attached so that only the data in the current view is transferred over the Internet. Here is a generic example of a link that would load hg38 track hub data: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hubUrl=http://location/hub.txt

Here is a working example that loads onto the hg19 assembly (db=hg19) around a position (position=chr21:33,030,000-33,043,000) an example hub (hubUrl=): https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr21:33,030,000-33,043,000&hubUrl=http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt

  • TIP: Once you start using URLs to share data instead of sessions, take care that only the first parameter is introduced with the question mark ? and that every later parameter is introduced with the ampersand &, as in “?parameter1=value&parameter2=value&parameter3=value”. If you are having trouble, check that you have not confused the order of & and ? in your link.
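One way to avoid mixing up ? and & is to let a small helper insert the separators. The function below is a sketch, not something the Browser provides: build_url is a hypothetical name, and the parameters passed to it are taken from the working hg19 example above.

```shell
# build_url BASE PARAM...: joins parameters onto BASE, using "?" before
# the first parameter and "&" before every later one.
build_url() {
    url="$1"
    shift
    sep="?"
    for param in "$@"; do
        url="${url}${sep}${param}"
        sep="&"
    done
    printf '%s\n' "$url"
}

link=$(build_url "https://genome.ucsc.edu/cgi-bin/hgTracks" \
    "db=hg19" \
    "position=chr21:33,030,000-33,043,000" \
    "hubUrl=http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt")

echo "$link"
```

Because the separator is chosen by position rather than typed by hand, parameters can be reordered or removed without revisiting the punctuation.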

BUILDING URLS TO TRACK HUBS ON ASSEMBLY HUBS

The UCSC Genome Browser provides a means to attach Track Hubs that can display novel genomes not hosted within the Browser. These are called assembly hubs. If a new assembly is being hosted remotely as an assembly hub, additional hub attachments can also be linked on top of that assembly hub, where the db= parameter is swapped with a genome= parameter as defined in the external assembly hub’s genomes.txt file (or genomes stanza when useOneFile is applied; see below).

In this following conceptual link, a genomeName is defined in an external assemblyHub.txt file that provides the Browser the underlying sequence of a declared genomeName. Then another collection of data, called hub.txt,  is attached to that assembly hub, where that hub.txt is using the same genomeName in its genomes.txt file (or genome stanza). In the URL the very first parameter (genome=genomeName) tells the Browser that in one of these hubs there should be a similarly defined genome in order for the Browser to display the correct underlying sequence: https://genome.ucsc.edu/cgi-bin/hgTracks?genome=genomeName&hubUrl=http://location/assemblyHub.txt&hubUrl=http://location/hub.txt

  • TIP: Note that hubUrl= can be used multiple times to attach multiple hubs, but only the genome=genomeName will inform the Browser which genome to display.  The second hub.txt in this example can piggyback entirely on the first assemblyHub.txt to provide all the novel underlying genomeName sequence data.

Just to illustrate how complex the system can get, a further step could also add custom tracks to the Assembly Hub, which has a Track Hub attached simultaneously: https://genome.ucsc.edu/cgi-bin/hgTracks?genome=genomeName&hubUrl=http://location/assemblyHub.txt&hubUrl=http://location/hub.txt&hgct_customText=http://location.online/dataFile

ASSEMBLY HUB EXAMPLES WITH GenArk HUBS

The new GenArk assemblies come with quick links to load hubs from that collection. An example is https://genome.ucsc.edu/h/GCF_001984765.1, which will load the American beaver assembly (GCF_001984765.1). This short link is the equivalent of pointing to the hgTracks CGI (the main Browser display) and adding hubUrl=https://hgdownload.soe.ucsc.edu/hubs/GCF/001/984/765/GCF_001984765.1/hub.txt and genome=GCF_001984765.1 to the URL. By condensing it all to this new short link format, we’ve attempted to make loading GenArk hubs easier.
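To make the relationship between the short link and the long form concrete, here is a sketch that derives both from an accession. It assumes that every GenArk hub follows the directory layout seen in the beaver example (prefix, then the nine digits split into three groups of three); that layout is inferred from this one URL rather than from a specification, and genark_hub_url is a hypothetical helper name.

```shell
# genark_hub_url ACCESSION: prints the hub.txt URL for a GenArk accession,
# ASSUMING the hubs/GCF/001/984/765/<accession>/hub.txt layout shown above.
genark_hub_url() {
    acc="$1"                  # e.g. GCF_001984765.1
    prefix="${acc%%_*}"       # GCF or GCA
    digits="${acc#*_}"        # 001984765.1
    digits="${digits%%.*}"    # 001984765
    d1=$(printf '%s' "$digits" | cut -c1-3)
    d2=$(printf '%s' "$digits" | cut -c4-6)
    d3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://hgdownload.soe.ucsc.edu/hubs/%s/%s/%s/%s/%s/hub.txt\n' \
        "$prefix" "$d1" "$d2" "$d3" "$acc"
}

acc="GCF_001984765.1"
short_link="https://genome.ucsc.edu/h/${acc}"
long_link="https://genome.ucsc.edu/cgi-bin/hgTracks?hubUrl=$(genark_hub_url "$acc")&genome=${acc}"

echo "$short_link"
echo "$long_link"
```

If a derived hub.txt URL does not load, check the actual location for that assembly rather than trusting the pattern.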

  • TIP: Once you start using URLs to define the Browser view, you will likely wish to reset the view occasionally. You can do this by going to the “Reset All User Settings” under the top “Genome Browser” menu. Another option is to directly point the browser to the cartReset CGI: https://genome.ucsc.edu/cgi-bin/cartReset

These https://genome.ucsc.edu/h/GCF_### short links to GenArk assembly hubs can have additional parameters added to them, such as the following link that loads a custom track onto the GCF_001984765.1 assembly hub.  The remote custom track in this example is a single bigBed hosted at CyVerse, where the URL is  simultaneously setting the position to NW_017869957v1:1,437,578-1,648,889: https://genome.ucsc.edu/h/GCF_001984765.1?position=NW_017869957v1:1,437,578-1,648,889&hgct_customText=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/GCF_001984765.1_C.can_genome_v1.0.cpgIslandExt.bb

A Track Hub can be attached to the Assembly Hub, as seen in this version where the GCF_001984765.1 assembly hub is redirected from the default position to NW_017869957v1:1,285,000-1,793,000 and the hubUrl= defines a CyVerse-hosted hub.txt: https://genome.ucsc.edu/h/GCF_001984765.1?position=NW_017869957v1:1,285,000-1,793,000&hubUrl=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt

  • TIP: Take a moment to look at this example hub.txt (https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt). Note that it only has “genome GCF_001984765.1” for the genomes stanza (since it is using useOneFile on and is also expecting to find a GenArk hub).  It relies entirely on the GenArk assembly hub for the underlying assembly information.

Track Hubs loaded on Assembly Hubs are not limited to GenArk hubs. The GenArk hubs have special privileges because they have short links. If you try to attach any hub with something like “genome GCF_###” the Genome Browser will make an effort to find a match in the existing GenArk collection, and attach it automatically.

To illustrate how hubs can be attached to other assembly hubs outside of GenArk, here is the longer version of the above link. In this case, the first hubUrl= gives the location of the assembly hub, the second hubUrl= loads the second hub, and finally hgct_customText= loads a custom track:

https://genome.ucsc.edu/cgi-bin/hgTracks?hubUrl=https://hgdownload.soe.ucsc.edu/hubs/GCF/001/984/765/GCF_001984765.1/hub.txt&genome=GCF_001984765.1&position=NW_017869957v1:1,285,000-1,793,000&hubUrl=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt&hgct_customText=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/GCF_001984765.1_C.can_genome_v1.0.cpgIslandExt.bb

The point of these rather tortuous examples is that multiple groups can own the sources of the data. Everything after the base URL, https://genome.ucsc.edu/cgi-bin/hgTracks, can point to other places on the Internet with either the hubUrl= or hgct_customText= parameters. This means lab_X might have the assembly data, and lab_Y can generate a hub to view on that assembly, and lab_Z can further attach to those external groups even more custom data.  And all this sharing and interoperability can happen without ever creating session links.

BUILDING URLS ATTACHING TRACK HUBS AND CUSTOM TRACKS TO SESSIONS

Using sessions is powerful since it lets you customize your view of the Genome Browser. Users can create a session (or borrow another from the Public Session page) and use that session’s userName and sessionName to attach their own custom data.

  • Here is a model link for attaching custom tracks: https://genome.ucsc.edu/s/userName/sessionName?hgct_customText=http://location.online/dataFile
  • Here is a model link for attaching track hubs: https://genome.ucsc.edu/s/userName/sessionName?hubUrl=http://location.online/hub.txt

This can have the advantage of creating shorter links or also preconfiguring the browser to a certain position or display.  We recently added the ability to customize the font on the Browser so a session can even be used just as a different way of viewing the same data stylistically, for instance making the display easier for you to read.

Here are some real-world examples borrowing from real Public Sessions. To load a Public Session, go to the “My Data” menu, then choose “Public Sessions”, and then you can click on the image of any session to load it. You can build your own URL from an existing Public Session by noting the Author field (equivalent to the session’s source userName) and the Session Name field, like so: https://genome.ucsc.edu/s/userName/sessionName

  • TIP: Whitespace and other special characters in session names are URL-encoded, so any spaces in the name become %20 (My%20session%20name). This is one reason why using underscores (or camelCase) instead of spaces in your sessionNames makes for cleaner links.
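If you are stuck with a session name that contains spaces, the encoding can be done on the command line. This sketch only replaces spaces with %20; other special characters would need their own handling, and session_link is a hypothetical helper name.

```shell
# session_link USER NAME: prints a session URL, encoding spaces as %20.
# Only spaces are handled here; other special characters are left as-is.
session_link() {
    user="$1"
    name=$(printf '%s' "$2" | sed 's/ /%20/g')
    printf 'https://genome.ucsc.edu/s/%s/%s\n' "$user" "$name"
}

session_link "userName" "My session name"
```

Names written with underscores or camelCase pass through this function unchanged.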

Here’s a session on hg19 that will load and also attach the earlier CyVerse custom track: https://genome.ucsc.edu/s/brianlee/AvantG_Font?position=chr5:1405000-1448000&hgct_customText=https://data.cyverse.org/dav-anon/iplant/home/brianlee/Lab_Primers.bigBed

Here’s one that will load a few hubs on a session that points to hg38 and also opens the display to  the SIRT1 gene using the &singleSearch=knownCanonical&position=SIRT1 parameters: https://genome.ucsc.edu/s/brianlee/Times_Font?hubUrl=http://fantom.gsc.riken.jp/5/datahub/hub.txt&hubUrl=http://expdata.cmmt.ubc.ca/JASPAR/UCSC_tracks/hub.txt&hubUrl=http://remap.univ-amu.fr/storage/public/hubReMap2020UCSC/hub.txt&singleSearch=knownCanonical&position=SIRT1

Again, these complex links are to illustrate that there are multiple ways to view multiple groups of data across the world in the Genome Browser. You can get to the data either through clicks and searches on the website or by building Sessions or Public Sessions and URL links to remotely hosted data.  This blog post could not cover every topic but gives a good introduction to the ways to share data with sessions or complex URLs. To learn more about links, see these documentation pages:

  • TIP: If you love modifying URLs, click on the “example links” in the second #optParams section above to see how you can even add parameters like highlight= to define multiple colored vertical highlights.

Links to guides for Sessions, Track Hubs, Custom Tracks, and videos can be found on our training page:


This entry written by Brian Lee. If after reading this blog post you have any questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.


How to make a bigBed file – Part 1

In this blog post, I’ll walk through a scenario in which a user has an existing text-based custom track that could be made into a more shareable bigBed version.

Let’s say the original track is in the bedDetail format that allows for BED12+ columns using tabs to define additional columns. This original track can  be made into a bigBed track to be put in a Track Hub or to be hosted alone and shared across multiple sessions, where the bigBed could act as a universal custom track.  If it were updated at the bigBed hosted location, all the related sessions that referenced the new bigBed remotely-hosted location of the data would have their representations of the data updated as well.

Let’s begin with the idea that Jerry’s Lab would like to host a primers track and share it between sessions for their lab group. The lab has already created a primers custom track in text files that can be updated and uploaded successfully.

The below steps will take Jerry’s lab from this uploading approach, to putting the data in a  shared online location and using a binary-indexed format of the custom track called bigBed. The bigBed is hosted at an online location defined by a bigDataUrl which allows Jerry’s entire lab to see the updated data as new primers are added.  This way each lab member in Jerry’s lab can use their early sessions, but get new data in their views, provided the bigBed is updated with the new information at the URL shared between all the sessions.

For this example, imagine Jerry’s lab is already using a tab-separated bedDetail custom track text file that might look like this:

track name=Primers type=bedDetail description=Primers visibility=2 color=221,55,118
browser position chr5:1405000-1448000
chr5    1413367    1413387    hDAT32061R    0    .    1413367    1413387    221,55,118    1    20    0    catggagtgggccctttcag
chr5    1414322    1414343    hDAT31086F    0    .    1414322    1414343    221,55,118    1    21    0    cctcaagcccaaatgcagctg
...

A track with type=bedDetail lets you upload a text file that displays BED12 items (http://genome.ucsc.edu/FAQ/FAQformat.html#format1) with an additional 13th column, here holding sequence (making it a bedDetail format: http://genome.ucsc.edu/FAQ/FAQformat.html#format1.7). With bedDetail a user has either the first 4 or 12 columns of data in BED format, and can extend the format with additional fields, such as sequence data here, to enhance the track details pages.

By going to the My Data and Custom Tracks page, the above text can be pasted and will work (provided there are tabs between the columns; some cut-and-paste interfaces will remove tabs).

When this custom track is added to a session as a text file, it is uploaded one time and does not update further unless there is a new upload. If Jerry’s lab wanted to update the Primers tracks in their sessions, a future upload of the text-based track would be required in each individual session. Once created, the original sessions that have uploaded text data are static. To solve this issue for Jerry’s lab, an option is to make a URL-hosted location of the data and  turn the data into a binary-indexed bigBed format.  In this way the new URL-hosted bigBed could act as a universal custom track across many sessions.

Here are the steps to do that.

1. The first step is to edit the file and remove the top track and browser lines. They will be used again at a later step, after the bigBed is created.

chr5    1413367    1413387    hDAT32061R    0    .    1413367    1413387    221,55,118    1    20    0    catggagtgggccctttcag
chr5    1414322    1414343    hDAT31086F    0    .    1414322    1414343    221,55,118    1    21    0    cctcaagcccaaatgcagctg
...

This link is an example of that file for those that want to follow along with the next steps.

curl -O https://data.cyverse.org/dav-anon/iplant/home/brianlee/Lab_Primers.txt
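If you would rather not edit the file by hand, step 1 can also be done on the command line. The sketch below recreates a two-row version of the custom track from earlier and then filters out the header lines; the file names Primers_customTrack.txt and Primers_dataOnly.txt are made up for this illustration.

```shell
# Recreate a small custom track file (hypothetical file name), using the
# track line, browser line, and two data rows shown earlier in this post.
printf 'track name=Primers type=bedDetail description=Primers visibility=2 color=221,55,118\n' > Primers_customTrack.txt
printf 'browser position chr5:1405000-1448000\n' >> Primers_customTrack.txt
printf 'chr5\t1413367\t1413387\thDAT32061R\t0\t.\t1413367\t1413387\t221,55,118\t1\t20\t0\tcatggagtgggccctttcag\n' >> Primers_customTrack.txt
printf 'chr5\t1414322\t1414343\thDAT31086F\t0\t.\t1414322\t1414343\t221,55,118\t1\t21\t0\tcctcaagcccaaatgcagctg\n' >> Primers_customTrack.txt

# Keep every line that does not start with "track " or "browser ".
grep -v -e '^track ' -e '^browser ' Primers_customTrack.txt > Primers_dataOnly.txt

# Print the number of tab-separated fields on each remaining line.
awk -F '\t' '{ print NF }' Primers_dataOnly.txt
```

Each remaining row should report 13 fields: the 12 BED columns plus the sequence column. A lower number usually means tabs were converted to spaces somewhere along the way.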

2. Next, in a command-line environment, you can use the UNIX sort command to sort the data in your file, here called Lab_Primers.txt:

sort -k1,1 -k2,2n Lab_Primers.txt > Lab_Primers_sorted.txt

The command creates a new file Lab_Primers_sorted.txt where all the entries are ordered correctly.
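To see what the sort does, and to confirm that a file really is in order before converting it, you can try it on a tiny input. In this sketch the demo file names and the chr10 row are invented, and setting LC_ALL=C is our own addition (not part of the command above) so that chromosome names are compared byte by byte regardless of your locale.

```shell
# Small made-up input, deliberately out of order (the chr10 row is invented).
printf 'chr5\t1414322\t1414343\thDAT31086F\n'  > demo_unsorted.bed
printf 'chr5\t1413367\t1413387\thDAT32061R\n' >> demo_unsorted.bed
printf 'chr10\t500\t600\texampleItem\n'       >> demo_unsorted.bed

# Same sort keys as above: chromosome name first, then numeric start.
LC_ALL=C sort -k1,1 -k2,2n demo_unsorted.bed > demo_sorted.bed

# sort -c only checks the order; it exits non-zero if a line is out of place.
if LC_ALL=C sort -c -k1,1 -k2,2n demo_sorted.bed; then
    echo "demo_sorted.bed is ordered by chromosome, then start"
fi
```

Note that chr10 sorts before chr5 here, because the names are compared as text rather than as numbers.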

3. Next we will acquire the bedToBigBed utility, assuming you are using a MacBook:

curl -O http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bedToBigBed

4. Then we will make the bedToBigBed utility executable:

chmod 700 bedToBigBed

5. With this utility we will need a definitions file to explain what each column means. We will get an example that will work with these 13 columns, but we could edit this file or make our own.

curl -O https://genome-source.gi.ucsc.edu/gitlist/kent.git/raw/master/src/hg/lib/bed12Source.as

6. With the bedToBigBed utility and the bed12Source.as file, we can now use the tool to build from the Lab_Primers_sorted.txt file a new Lab_Primers.bigBed file for the hg19 genome, using a URL to find the chromosome sizes for the hg19 assembly.

./bedToBigBed -type=bed12+ -as=bed12Source.as -tab Lab_Primers_sorted.txt http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes Lab_Primers.bigBed

With the following three optional steps, we can get another tool called bigBedToBed to check the extraction of data from the file:

curl -O http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bigBedToBed
chmod 700 bigBedToBed
./bigBedToBed -chrom=chr5 -start=1419444 -end=1445682 Lab_Primers.bigBed stdout

7. Now we need to host this data somewhere online so that it can be found by the Browser. One option is CyVerse; you can read more about hosting options at this location: http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html#Hosting

8. Once you have an online location to the bigBed (for example: https://data.cyverse.org/dav-anon/iplant/home/brianlee/Lab_Primers.bigBed) you can add it to your sessions. Go to the custom track page and put in a track like the following, where you can use your track and browser lines again, but change type=bedDetail to type=bigBed and use a bigDataUrl:

browser position chr5:1405000-1448000
track name=Primers type=bigBed description=Primers visibility=2 color=221,55,118 bigDataUrl=https://data.cyverse.org/dav-anon/iplant/home/brianlee/Lab_Primers.bigBed

9. Save a session with this bigBed as a custom track. Example: https://www.genome.ucsc.edu/s/brianlee/Primers

10. Now anytime the file has updates, the session that references this bigDataUrl location of the bigBed data should also update. If CyVerse is used to host the bigBed data file online, this may require deleting and replacing your file, and then forcing your web browser to reload (Control-Shift-R), so that the cached copy expires. Contact CyVerse directly for help.

Finding your own institution to host  your data is often the best solution as you can then work with your system administrators to have the best experience.

Once you have the bigBed, it is not much more work to take it to the next step and put it inside a Track Hub. Once in a Track Hub, many additional features are possible, such as using a searchIndex feature that allows finding unique named items within your custom track on the search bar or ultimately creating a Public Hub to share your data with the wider community.


This entry written by Brian Lee.

Patching up the Genome

From biologists to computer scientists, the human genome has presented a grand puzzle. With regard to UCSC, the story began in 1985 when our chancellor, molecular biologist Robert Sinsheimer, proposed a bold endeavor – sequence the complete human genome. Five years later the International Genome Project was launched. The next chapter took place in 1999 when computer science professor David Haussler was asked to join the project.  Haussler, in turn, enlisted then-graduate student Jim Kent to help with assembling the genome. This collaboration culminated on July 7, 2000, when the first human genome assembly was made available on the UCSC servers. Over 500 GB were downloaded worldwide in 24 hours.  (Hey, back in 2000, that was a lot!)

[Image: UCSCReleaseDownloads]

Total web traffic at the University of California Santa Cruz in 2000. When the genome became available online, all other web activity at the university shrank into the background.

Three months later, the UCSC Genome Browser came online as a resource to distribute and visualize the genome.  The first ten releases, hg1-hg10, were assembled at UCSC, after which the task was taken over by NCBI. As NCBI incremented the official releases and changed the naming scheme, UCSC released browsers at a slower rate, continuing to increment the hg* nomenclature.  By the time NCBI released NCBI33 in 2003, UCSC released it as hg15. After releasing so many browsers in under three years, the pace slowed, with each assembly taking around one year longer than the previous.

Patches: What are they and why are they important?

[Image: Blog_table]

Note: hg38 follows hg19. The UCSC nomenclature was changed to match the Genome Reference Consortium (GRC)’s GRCh release number.

The early genome assemblies were largely aiming to increase the fidelity of the reference. However, with each release, research progress was temporarily hampered as scientists adjusted to sequence changes and shifted coordinates. This has often led to scientists continuing to use an older release as it may be better annotated and established. This is evident in the Genome Browser as a majority of our users continue to work on GRCh37/hg19 in spite of GRCh38/hg38’s release more than 4 years ago.

Looking at the numbers, however, we can see that GRCh38 is the most accurate human genome to date. With these benchmarks in accuracy, the GRC has shifted focus beyond fidelity to inclusion. The GRC  now strives to capture more of the genetic diversity present in the human population. The initial release of GRCh38/hg38 included 261 alternate haplotype sequences, nearly a 30-fold increase over GRCh37/hg19.

UCSC builds a new assembly database for each full release of a genome assembly, but the GRC also releases “patch” updates for genome assemblies. Through patch releases, the GRC adds new alternate haplotype sequences, and also corrected sequences, without changing the sequences or coordinate system of the initial assembly release.

To quote directly from the GRC:

Patches are accessioned scaffold sequences that represent assembly updates. They add information to the assembly without disrupting the chromosome coordinates. Patches are given chromosome context via alignment to the current assembly. Together, the scaffold sequence and alignment define the patch.

These patch sequences are more important now than ever before as the GRC has decided to indefinitely postpone the release of the next coordinate-changing assembly (which would have been GRCh39/hg39), instead opting for additional patches to GRCh38/hg38. There are two kinds of patch sequences:

Novel patches (alternative haplotypes): Chromosomal regions of the genome that exhibit sufficient variability to prevent adequate representation by a single sequence. Also referred to as alternate loci. UCSC labels these haplotype sequences by appending “_alt” to their names.

Fix patches: Error corrections (addressed by approaches such as base changes, component replacements/updates, switch-point updates or tiling-path changes) or assembly improvements, such as the extension of sequence into gaps. UCSC labels these fix sequences by appending “_fix” to their names.

These patch sequences, especially novel patches, have been increasing in number and will continue to do so.

[Image: patches]

The number of human assembly patch sequences is quickly growing. This is primarily due to alternative haplotype (_alt) sequences, though fix sequences (_fix) are also being introduced. The fix patches reset from GRCh37.p13 to GRCh38 as they were integrated into the assembly.

A better approach to patches

Our approach thus far in the Genome Browser has been to make data tracks indicating the locations of these patch releases along the initial assembly chromosomes. While these are useful, they provide little in the way of annotations and are largely underutilized by users. With the increase of these patches and postponement of GRCh39, however, we have decided to switch our approach and add the new sequences, and annotations on the new sequences, to the UCSC hg38 database. This will allow patches to be visualized on the Browser as standalone reference sequences, not unlike a regular chromosome or the alternate haplotype sequences that were included in the initial assembly release. BLAT results may also include alignments to these sequences.

The addition of new genomic sequences to an existing UCSC database is a departure from our longstanding practice of building a new database every time we import a new genome assembly release.  To minimize disruption to pipelines that use our download files, especially those in the bigZips directory, we will leave the original bigZips/hg38.* files unchanged, and add a subdirectory when we incorporate sequences from a patch release; for example, bigZips/p12/ for patch release GRCh38.p12.  We will also add bigZips/latest/ which will link to the most recent patch release subdirectory, so that pipelines may stay up to date with UCSC’s patch sequence annotations if desired. In other words, the bigZips downloads will be “opt-in” for patch sequences.

Changes and improvements to hg38

Currently, we are in the process of adding these sequences to the GRCh38/hg38 genome database with the potential to do the same for GRCh37/hg19 and GRCm38/mm10 at a future date. Changes that users may see are as follows:

  • BLAT/In-Silico PCR – Additional hits on _alt and _fix sequences
  • Position searches in the hg38 browser may lead to _alt and _fix sequences in addition to or instead of initial assembly chromosomes
  • Replacing the ‘GRC Patch Release’ and ‘Alt Map’ tracks with ‘Fix Patches’ and ‘Alt Haplotypes’ tracks which include alignments to alts/fixes with details pages and links to jump between main chromosomes and alts/fixes
  • New subdirectories of bigZips download directory (initial, p12, latest)
  • New sequences/annotations in /gbdb/hg38 download files (same file names, extended contents)
  • SQL queries to genome-mysql.soe.ucsc.edu may include new results on _alt and _fix sequences

It is also worth noting what will not change. Existing sequences, and annotations on existing sequences, will not change. Download files in the bigZips directory, such as bigZips/hg38.2bit and bigZips/hg38.fa.masked.gz, will not change.
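Since the _alt and _fix suffixes are how UCSC labels these sequences, a pipeline that wants to treat them separately can key off the sequence name. The following is a sketch under that naming assumption only; classify_seq is a hypothetical helper, and the two suffixed names in the loop are illustrative examples of the pattern. Keep in mind that the initial hg38 release already contained _alt sequences, so the suffix alone does not tell you whether a sequence arrived with a patch release.

```shell
# classify_seq NAME: prints "alt", "fix", or "other" based on the suffix
# that UCSC appends to alternate haplotype and fix patch sequence names.
classify_seq() {
    case "$1" in
        *_alt) echo "alt" ;;
        *_fix) echo "fix" ;;
        *)     echo "other" ;;
    esac
}

# Illustrative sequence names following the naming pattern described above.
for seq in chr1 chr1_KI270762v1_alt chr1_KN196472v1_fix; do
    printf '%s\t%s\n' "$seq" "$(classify_seq "$seq")"
done
```

The same test can be applied to the first column of BLAT output, a chrom.sizes file, or SQL query results to decide which rows to keep.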

So what kind of annotations can be found on these hg38 patch sequences?

  • Annotations generated by UCSC such as RepeatMasker, CpG Islands, AUGUSTUS, Human mRNAs and Pfam
  • NCBI’s sequence alignments of patch sequences to chromosomes: Fix Patches, Alt Haplotypes
  • External annotation sources such as RefSeq and GENCODE that include annotations on patch sequences (up to this point we have ignored those patch annotations)
  • Select tracks have been lifted from main chromosomes onto the patches using NCBI’s alignments, most notably GTEx Gene and ENCODE Regulation

For additional information on these patch sequences, and a full list of sequences in hg38, you may visit the hg38 Genome Browser Gateway page.

We are always receptive to our users and their needs. If there are any specific track annotations you would like to see on these patches or if you have any questions regarding this implementation and how it may affect you, please write in to our public mailing list (genome@soe.ucsc.edu) or our private mailing list if your message includes sensitive data (genome-www@soe.ucsc.edu).

Accessing the Genome Browser Programmatically Part 3 – Controlling the Genome Browser Image

The previous parts of this series (part 1 and part 2) focused on how to use the Genome Browser to obtain data, and for this third and final post we’re gonna divert from that theme and talk about how to control the track image itself. Note: We now have an API which can also perform many of these functions.

Standard procedure for obtaining images of the browser is to configure the view exactly as you want, and then use the “View->PDF/PS” option in the menu bar in order to download a PDF or PostScript of your image. In addition to this method, you can generate PNG images on the fly with the following hgRenderTracks template:
http://genome.ucsc.edu/cgi-bin/hgRenderTracks?parameters

Parameters should be replaced by the URL key-value pairs that the main track display, hgTracks, understands, like ‘db=hg19’ or ‘knownGene=pack’. For example, to compare the transcripts provided by NCBI to UCSC’s own alignments of the transcripts at the ABO locus, you can use the following URL and the cURL program to download a PNG file:

curl 'http://genome.ucsc.edu/cgi-bin/hgRenderTracks?db=hg19&position=chr9:136130563-136150630&hideTracks=1&refSeqComposite=pack&ncbiRefSeqCurated_sel=1&ucscRefSeqView=pack&refGene_sel=1&pubs=pack' > example.png

Opening example.png in your favorite image viewer will display the following image:
[Image: refSeqAndPubs]
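The same template lends itself to scripting when you need images for several regions. The sketch below prints one curl command per region rather than running it, so you can inspect the links first. The second position is borrowed from an unrelated hg19 example purely for illustration, the track settings are a shortened version of those used above, and the output file naming is our own convention.

```shell
db="hg19"
# A shortened set of the track settings used in the example above.
tracks="hideTracks=1&refSeqComposite=pack&pubs=pack"

# Print (not run) one curl command per region; drop the echo to execute them.
for pos in chr9:136130563-136150630 chr5:1405000-1448000; do
    url="http://genome.ucsc.edu/cgi-bin/hgRenderTracks?db=${db}&position=${pos}&${tracks}"
    # Name each PNG after its region, replacing ":" with "_".
    out="$(printf '%s' "$pos" | tr ':' '_').png"
    echo "curl '${url}' > ${out}"
done
```

If you do run the commands in a loop, consider pausing between requests so as not to put undue load on the server.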

There are many additional parameters described on the Sharing your custom track section of the custom tracks help page. Using these parameters you can configure hgRenderTracks to display any combination of tracks and, along with the hgt.customText parameter, also show your custom tracks with them.

To illustrate, if I have the following custom track:

browser hide all
browser gold=pack
browser gap=pack
browser visibility=pack
chr1 1000 2000
chr1 2100 3000
chr1 3100 4000
…

Hosted on the web at http://genome-test.soe.ucsc.edu/~chmalee/exCustomTrack.bed, then I can tell hgRenderTracks to load this file with the hgt.customText parameter like so:

http://genome.ucsc.edu/cgi-bin/hgRenderTracks?db=hg38&position=chr1:1-100000&hgt.customText=http://genome-test.soe.ucsc.edu/~chmalee/exCustomTrack.bed
[Image: customTrackAndGoldGap]

The “browser” lines at the beginning of the custom track indicate which native tracks to turn on, along with their visibilities, while the “hide all” line turns all the other native tracks off. In addition to these basic instructions there are many more examples on the UCSC Genome Browser Wiki.

What about when you want to view a genome and annotations not hosted on our site? If you have a FASTA file of your genome available, you can use faToTwoBit to convert your genome into a 2bit file, then make an assembly hub out of your data. Once you’ve created your hub, you can view the hub with the hubUrl setting. As an example, I have hosted an assembly hub for Arabidopsis thaliana here, and I can view the hub via a single URL like so:

https://genome.ucsc.edu/cgi-bin/hgTracks?genome=araTha1&hubUrl=https://genome-test.gi.ucsc.edu/~chmalee/araTha1/plantAraTha1/hub.txt
[Image: assemblyHubViaUrl]

If your data needs to stay behind your local firewall, then you can use the GBiB and GBiC products so you can set up your own “copy” of the UCSC Genome Browser that meets your privacy needs.


Accessing the Genome Browser Programmatically Part 2 – Using the Public MySQL Server and gbdb System

If you missed part 1 about obtaining sequence data, you can catch up here. Note: We now have an API which can also perform many of these functions.

The UCSC Genome Browser is a large repository of data from multiple sources, and if you want to query that annotation data, the easiest way to get started is via the Table Browser. Choose the assembly and track of interest and click the “describe table schema” button, which will show the MySQL database name, the primary table name, the fields of the table and their descriptions. If the track is stored not in MySQL but as a binary file (like bigBed or bigWig) in /gbdb, it will show a file name, e.g. "Big Bed File: /gbdb/dm6/ncbiRefSeq/ncbiRefSeqOther.bb". If this is the case, skip directly to the Accessing the gbdb directory system section below. Otherwise, the track data is either a single MySQL table or a set of related tables, which you can either download as gzipped text files from the “Annotation Database” section on our downloads page (here’s the GRCh37/hg19 listing) and work on them locally, or use the public MySQL server and issue MySQL queries remotely. Generally speaking, the format for most of our tables is similar to the formats described here, e.g., in bed (“chrom chromStart chromEnd”) format, and we do not store any sequence or contigs in our databases, which means you’ll need to use the instructions in Part 1 of this blog series in order to get any raw sequence data.

Accessing the public MySQL server
The best way to showcase the public MySQL server is to show some examples — here are a few to get you started:
1. If you want to download some transcripts from the new NCBI RefSeq Genes track, you can use the following command:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select * from ncbiRefSeq limit 2" hg38
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+
| bin | name        | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts                                                         | exonEnds                                                           | score | name2   | cdsStartStat | cdsEndStat | exonFrames                        |
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+
| 585 | NR_046018.2 | chr1  | +      |   11873 | 14409 |    14409 |  14409 |         3 | 11873,12612,13220,                                                 | 12227,12721,14409,                                                 |     0 | DDX11L1 | none         | none       | -1,-1,-1,                         |
| 585 | NR_024540.1 | chr1  | -      |   14361 | 29370 |    29370 |  29370 |        11 | 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, | 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, |     0 | WASH7P  | none         | none       | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, |
+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+

2. If you are interested in a particular enhancer region, for instance “chr1:166,167,154-166,167,602”, and want to find the genes whose transcript start or end lies within 10 kb of it, then the following query will do the job:

$ chrom="chr1"
$ chromStart="166167154"
$ chromEnd="166167602"
$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \
   e.chrom, e.txStart, e.txEnd, e.strand, e.name, j.name as geneSymbol from ncbiRefSeqCurated e,\
   ncbiRefSeqLink j where e.name = j.id AND e.chrom='${chrom}' AND \
      ((e.txStart >= ${chromStart} - 10000 AND e.txStart <= ${chromEnd} + 10000) OR \
       (e.txEnd >= ${chromStart} - 10000 AND e.txEnd <= ${chromEnd} + 10000)) \
order by e.txEnd desc " hg38
+-------+-----------+-----------+--------+----------------+------------+
| chrom | txStart   | txEnd     | strand | name           | geneSymbol |
+-------+-----------+-----------+--------+----------------+------------+
| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B     |
| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B     |
| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B     |
+-------+-----------+-----------+--------+----------------+------------+

3. If you need to get gene names and their lengths for RNA-seq read normalization, you can use the following query:

$ mysql -h genome-mysql.soe.ucsc.edu -u genome -A -e " \
  select l.name, kr.value, psl.qEnd - psl.qStart as length \
  from   refGene r, hgFixed.refLink l, knownToRefSeq kr, knownCanonical kc, refSeqAli psl \
  where  r.name = l.mrnaAcc and r.name = kr.value and kr.name = kc.transcript \
         and r.name = psl.qName group by kr.value limit 3" hg38
+-------+-----------+--------+
| name  | value     | length |
+-------+-----------+--------+
| A2M   | NM_000014 |   4920 |
| NAT2  | NM_000015 |   1317 |
| ACADM | NM_000016 |   2622 |
+-------+-----------+--------+
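
The query above takes each transcript's length from its alignment (qEnd - qStart in refSeqAli). As a possible alternative, the gene tables shown in example 1 carry exonStarts and exonEnds columns, and summing the exon sizes gives the spliced transcript length. The sketch below is only an illustration: the helper name exon_length_sum is hypothetical, and it assumes comma-separated lists with a trailing comma, as in the ncbiRefSeq output above.

```shell
# Hypothetical helper (not a UCSC utility): sum exon sizes for one transcript.
# $1 = exonStarts, $2 = exonEnds, both comma-separated with a trailing comma.
exon_length_sum() {
  awk -v starts="$1" -v ends="$2" 'BEGIN {
    n = split(starts, s, ",")
    split(ends, e, ",")
    total = 0
    for (i = 1; i <= n; i++) {
      # The trailing comma produces one empty element; skip it.
      if (s[i] != "") total += e[i] - s[i]
    }
    print total
  }'
}

# NR_046018.2 (DDX11L1) from example 1: 354 + 109 + 1189
exon_length_sum "11873,12612,13220," "12227,12721,14409,"   # prints 1652
```

Whether the aligned length or the summed exon length is the better denominator depends on your normalization method, so treat this as a starting point and not as a recommendation.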

In addition to our download site and public MySQL server hosted here in California, we have also recently added support for a download site (http://hgdownload-euro.soe.ucsc.edu) and public MySQL server (genome-euro-mysql.soe.ucsc.edu) hosted in Europe, which will speed up downloads for many of our users.

Please follow the Conditions for Use when querying the public MySQL servers.
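
The boxed tables printed by the examples above are easy to read but awkward to feed into other programs. The mysql client's -N (skip column names) and -B (batch, tab-separated) flags produce plain rows instead. The wrapper below is a minimal sketch: the function name ucsc_query and the UCSC_MYSQL_HOST variable are made up for this illustration, and the host defaults to the US server named in this post.

```shell
# Hypothetical wrapper around the mysql client for script-friendly output.
# $1 = database (e.g. hg38), $2 = SQL statement.
# UCSC_MYSQL_HOST is an assumed variable name; set it to
# genome-euro-mysql.soe.ucsc.edu to use the European server.
ucsc_query() {
  mysql -h "${UCSC_MYSQL_HOST:-genome-mysql.soe.ucsc.edu}" -ugenome -A -N -B -e "$2" "$1"
}

# Usage (needs network access and the mysql client installed):
#   ucsc_query hg38 "select chrom, txStart, txEnd, name from ncbiRefSeq limit 2" > transcripts.bed
```

Because the selected columns are in chrom, start, end, name order, the output of the usage line can be used directly as a four-column BED file.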

Many of the command line utilities available on our utilities downloads server are also able to interact with our databases or download files, like mafFetch (as long as your ~/.hg.conf file is present as discussed below):

$ mafFetch xenTro9 multiz11way region.bed stdout
##maf version=1
##maf version=1 scoring=blastz
a score=0.000000
s xenTro9.chr9     15946024 497 +  80437102 ACTAT...
e galGal5.chr14     1678315   0 -  15595052 I
e xenLae2.chr9_10L 13130032 2034 - 117834370 I

a score=2992.000000
s xenTro9.chr9     15946521 145 +  80437102 TCATC...
s xenLae2.chr9_10L 13132066 148 - 117834370 TTATC...

Note: Only the first 5 bases on each line and only the first 10 lines are shown for brevity.

Here we are directly querying the multiz11way table for the Xenopus tropicalis xenTro9 assembly, with no need to download the entire alignment file to the local disk and query it manually. Commands of this nature usually require a special private .hg.conf file in the user’s home directory (note the leading dot). This configuration file contains a couple of key=value lines that most of our programs can parse and then use to access the public MySQL server. This page contains a sample .hg.conf file that can be used by most of the command line utilities to direct them to access either our US MySQL server or our European MySQL server. That sample .hg.conf is certainly enough to get started, but for more information about the various Genome Browser configuration options, please see the comments in the ex.hg.conf and minimal.hg.conf files.
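
To make the key=value layout concrete, here is a sketch of creating such a file. The values are assumptions based on the publicly documented read-only account and should be checked against the sample page mentioned above; the function name write_hg_conf is hypothetical. The file is made private (chmod 600) because the utilities are understood to reject a configuration file that other users can read.

```shell
# Hypothetical helper: write a minimal .hg.conf and make it private.
# $1 = target path (defaults to ~/.hg.conf). Refuses to overwrite an existing file.
write_hg_conf() {
  conf_path=${1:-$HOME/.hg.conf}
  if [ -e "$conf_path" ]; then
    echo "refusing to overwrite $conf_path" >&2
    return 1
  fi
  # Assumed values; use genome-euro-mysql.soe.ucsc.edu as db.host for the European server.
  cat > "$conf_path" <<'EOF'
db.host=genome-mysql.soe.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
EOF
  chmod 600 "$conf_path"
}

# Usage:
#   write_hg_conf            # creates ~/.hg.conf if it does not exist yet
```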


Accessing the gbdb directory system
The third method of grabbing our data is via the /gbdb/ directory system. This location, browsable here, holds most of the bigBed, bigWig, and other large data files that we do not keep directly in MySQL databases/tables. There are many utilities available for manipulating these files, and most of them are able to work on remote files, for example:

$ bigBedToBed -chrom=chr1 -start=5563837 -end=5564370 http://hgdownload.soe.ucsc.edu/gbdb/hg38/crispr/crispr.bb stdout 
chr1    5563870    5563893        55    +    5563870    5563890    0,200,0    255,255,0    128,128,0    CAAGTGGAATCAGGATGCCT    GGG    55    72% (57)    52% (46)    10    60    MIT Spec. Score: 55, Doench 2016: 72%, Moreno-Mateos: 52%    3345002138
chr1    5563878    5563901        59    +    5563878    5563898    0,200,0    0,200,0    128,128,0    ATCAGGATGCCTGGGATATG    TGG    59    63% (54)    61% (50)    6    63    MIT Spec. Score: 59, Doench 2016: 63%, Moreno-Mateos: 61%    22777603204

Also note that all of this data is available via rsync as well. The following command lists the crispr.bb file referenced above; add a destination such as ./ as a final argument to download it:

$ rsync -vh hgdownload.soe.ucsc.edu::gbdb/hg38/crispr/crispr.bb
-rw-rw-r--  1466266135 2017/03/30 14:31:48 crispr.bb

sent 33 bytes  received 70 bytes  206.00 bytes/sec
total size is 1.47G  speedup is 14235593.54

If you are interested in, say, Human GRCh37/hg19 gbdb data, then all you have to do is change the “hg38” at the end of the template http://hgdownload.soe.ucsc.edu/gbdb/hg38 URL to “hg19”, resulting in http://hgdownload.soe.ucsc.edu/gbdb/hg19. This holds for all databases at UCSC, like mm10 or bosTau8.
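
Since only the database name changes, the template is easy to script. This is a small sketch with a hypothetical helper name, gbdb_url, which does nothing more than fill in the template described above:

```shell
# Hypothetical helper: build a gbdb URL on the downloads server.
# $1 = database name (e.g. hg19), $2 = optional path below that database.
gbdb_url() {
  printf 'http://hgdownload.soe.ucsc.edu/gbdb/%s/%s\n' "$1" "${2:-}"
}

gbdb_url hg19                    # prints http://hgdownload.soe.ucsc.edu/gbdb/hg19/
gbdb_url hg38 crispr/crispr.bb   # prints http://hgdownload.soe.ucsc.edu/gbdb/hg38/crispr/crispr.bb
```

The second form can be passed straight to a tool that accepts remote files, such as the bigBedToBed command shown earlier.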

Summary
Just as in part 1, if you are going to continually request parts of the same files or tables over and over again, it is best to download the file from our downloads server and operate on it locally. All of our track data, including MySQL tables and bigBed/Wig/BAM files, are hosted on our downloads server at http://hgdownload.soe.ucsc.edu. Generally speaking, bigBeds/bigWigs/BAMs and other binary files are located in the hgdownload.soe.ucsc.edu/gbdb/ location discussed earlier, while MySQL table data in gzipped plain text format can be found at http://hgdownload.soe.ucsc.edu/goldenPath/$db (where $db is a database name like hg19 or hg38) or via queries against the public MySQL server directly.

Stay tuned for part 3 of this programmatic access series — controlling the Genome Browser image!



Accessing the Genome Browser Programmatically Part 1 – How to get sequence from the UCSC Genome Browser

Note: We now have an API which can also perform many of these functions.

As the number of bioinformaticians has grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. Although there is no true API developed by UCSC (yet), there are a number of ways to interface with the UCSC Genome Browser, some more efficient than others. The intention of this blog post series is to explain some of the preferred ways to access the commonly requested Genome Browser data and tools and to add a bit of explanation of the architecture of the UCSC Genome Browser in general. The three most common requests are 1) how to download a single stretch of sequence in FASTA format, 2) how to download multiple ranges of sequence, and 3) how to get basic statistics on the nucleotides in a sequence. If you want the in-depth examples and explanation, skip down, but if you’re crunched for time, all you really need to know is the following three Q&As:

Q: How do I extract some sequence?
A: The best choice is to use the twoBitToFa command, available for your system here (Windows 10 users can use the linux.x86_64/ binaries in the Windows Subsystem for Linux). Here’s an example:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

Q: What if I have a list of coordinates?
A: Again use twoBitToFa, this time with the -bed option (also check out the post on coordinate systems):

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Q: How do I count A, C, G, T?
A: twoBitToFa followed by faCount (available from the same location as twoBitToFa):

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0

Run twoBitToFa or faCount with no arguments to get a usage message and view all of their options:

$ faCount
faCount - count base statistics and CpGs in FA files.
...
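
One detail worth spelling out when building a BED file for the -bed option: positions copied from the Genome Browser, such as chr1:166,167,154-166,167,602, are 1-based and contain thousands separators, while BED starts are 0-based and BED ends are exclusive. A minimal sketch of the conversion follows; the function name position_to_bed is hypothetical.

```shell
# Hypothetical helper: convert a browser position string to one BED line.
# $1 = position such as chr1:166,167,154-166,167,602, $2 = name for the BED record.
position_to_bed() (
  pos=$(printf '%s' "$1" | tr -d ',')   # drop thousands separators
  chrom=${pos%%:*}
  range=${pos#*:}
  start=${range%-*}
  end=${range#*-}
  # 1-based inclusive start becomes 0-based; the end is unchanged.
  printf '%s\t%s\t%s\t%s\n' "$chrom" "$((start - 1))" "$end" "$2"
)

position_to_bed "chr1:166,167,154-166,167,602" enhancer
# prints: chr1  166167153  166167602  enhancer  (tab-separated)
```

Redirecting the output to a file gives an input suitable for twoBitToFa -bed=FILE as in the example above.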


The most efficient way to get sequence from the UCSC Genome Browser

The most common data request we receive is a request for FASTA sequence or sequences, making it a fitting subject for part 1 of this blog series about programmatic access to the Genome Browser. If you are browsing a region in the genome browser and you want to get a FASTA sequence for just the region you are browsing, using the keyboard shortcut ‘vd’ (v then d for view DNA) is probably the easiest way. But what about when you want to get sequences for a list of regions? What about if you need your web application to download the sequence? You could download sequence interactively with the Table Browser, although the solution is somewhat cumbersome: first you must make a custom track of the region(s) you would like sequence for, and then use the “output format: sequence” option with your custom track selected as the primary track. Fortunately, there is a much easier approach – downloading the 2bit file for your organism of interest and then using the twoBitToFa command on it like so:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ twoBitToFa hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt

The twoBitToFa command is available from the list of public utilities, in the directory appropriate to your operating system. twoBitToFa even accepts a URL to our downloads server as the 2bit argument, so if you wanted to grab some mm10 sequence, or even a list of sequences, you can just query the downloads server directly like so:

$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct

Note that “stdout” in the above commands (along with the corresponding “stdin”) is a special argument that tells the majority of UCSC commands to write to /dev/stdout (or read from /dev/stdin) instead of the required filenames. A common use is piping one command into another to generate some quick statistics on a region like chr1:100100-100200:

$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0
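
Because faCount writes a plain whitespace-separated table, the pipeline can be extended with standard tools. As an illustration, the sketch below derives GC content from the “total” row; the function name gc_percent is hypothetical and the column positions are assumed from the header shown above (len, A, C, G, T, N, cpg).

```shell
# Hypothetical helper: read faCount output on stdin and print GC percent
# for the "total" row. Columns assumed: #seq len A C G T N cpg.
gc_percent() {
  awk '$1 == "total" && $2 > 0 { printf "%.1f\n", 100 * ($4 + $5) / $2 }'
}

# Offline demonstration using the faCount output shown above.
gc_percent <<'EOF'
#seq    len     A       C       G       T       N       cpg
chr1:100100-100200      100     37      17      21      25      0       0
total   100     37      17      21      25      0       0
EOF
# prints 38.0

# With the UCSC utilities installed, the same helper can end the pipeline:
#   twoBitToFa <2bit file or URL>:chr1:100100-100200 stdout | faCount stdin | gc_percent
```

Note that len includes any N bases, so regions with many Ns will report a lower GC percentage with this simple formula.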

The twoBitToFa and URL to hgdownload 2bit combo is important because our downloads server is significantly more robust than our DAS CGI, can support more requests, and won’t slow the main site down for other users. We’ve also noticed that our DAS server often receives many requests for the same sequence, so for those of you providing software where the same query will be made multiple times, consider whether it would be more efficient to download an entire 2bit file to your local disk, rather than send the same query thousands of times to our servers.
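
For software that makes the same kind of query many times, one way to follow this advice is to download the 2bit file once and reuse the local copy. The sketch below is only one possible approach: the function name ensure_local_2bit and the default cache directory are assumptions, and the URL pattern is taken from the hg38 and mm10 examples in this post.

```shell
# Hypothetical helper: print the path of a local <db>.2bit, downloading it only if missing.
# $1 = database name (e.g. hg38), $2 = cache directory (assumed default: ~/.cache/2bit).
ensure_local_2bit() (
  db=$1
  cache_dir=${2:-$HOME/.cache/2bit}
  path="$cache_dir/$db.2bit"
  if [ ! -s "$path" ]; then
    mkdir -p "$cache_dir" || exit 1
    # Download to a temporary name so an interrupted transfer is never mistaken for a full file.
    wget -q -O "$path.tmp" "http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/$db.2bit" || exit 1
    mv "$path.tmp" "$path" || exit 1
  fi
  printf '%s\n' "$path"
)

# Usage (first call needs network access; twoBitToFa must be installed):
#   twoBitToFa "$(ensure_local_2bit hg38)":chr1:100100-100200 stdout
```

Later calls return immediately with the cached path, so repeated queries never touch the UCSC servers.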

Summary
twoBitToFa and faCount are two of the hundreds of utilities available, and both are useful for working with sequence data. Although working with locally downloaded files is preferable, twoBitToFa can also work with URLs to 2bit files, such as those on the UCSC Genome Browser download site. Stay tuned for part 2 of this programmatic access series: Using the Genome Browser public MySQL server and gbdb.

