Author Archives: Brian Lee

GenArk Hubs Part 4 – New assembly request page

This blog post adds to an earlier series that discusses the Genome Archive (GenArk) assembly hubs. To help users both to find Genome Archive (GenArk) hubs and to inform us what GCA/GCF accessions to add to the collection, we have created a new genome assembly request page: http://genome.ucsc.edu/assemblyRequest.html

Last year we announced the creation of a new collection of GenArk assembly hubs. GenArk hubs are constructed from NCBI Genbank GCA/GCF accessioned assembly data, for instance where GCF_001984765.1 is the accession for an American beaver assembly. When present in UCSC’s GenArk collection, these genome browsers can be loaded instantly with direct links (e.g., http://genome.ucsc.edu/h/GCF_001984765.1), and they come ready with dynamically invoked BLAT and PCR servers, enabling searching for sequences and primers. The first released GenArk hubs were organized into phylogenetic groups, for example, all bird assemblies were listed here.

Our newly added genome assembly request page displays which assemblies are available for viewing, and presents a single-click “request” button to send an email to UCSC to add any GCA/GCF assembly available at NCBI not yet part of the GenArk collection.

GenArk currently has over 1,700 assembly hubs available for browsing at the click of a “view” button. To view only those in the current collection, you can use the middle “select assembly type to display” option and remove the “Request browser” checkbox, and only completed browsers will show. Upon first visiting the page shows only the first 500 assemblies. Use the “select assembly type to display” to “show all” assemblies if you do not see your assembly. Click “view” to launch a specific genome browser listed on the page. This earliest version of the page has some performance issues, and selections may take some time, a future improvement planned is to present an active wait cursor when the page is busy filtering results.

By using the “choose clade to view/hide” option it is possible to subselect groups, such as only displaying plants, when narrowing down which genomes may exist or could be requested.

The first “show/hide columns” option on the page enables displaying additional metadata, such as adding the assembly build date or Biosample number, where links exist back to NCBI for more information.

If NCBI does not have a GCA/GCF accession for your desired assembly then our scripts will not be able to pull the data and generate the GenArk addition. Such new assemblies will need to be submitted to NCBI first, after which you can then notify UCSC. You can find directions at NCBI for how to submit new genomes: https://www.ncbi.nlm.nih.gov/assembly/docs/submission/ Also, please review the UCSC GenArk Blog posts for more information on accessing and using tools on GenArk hubs.

If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

GenArk Hubs Part 3 – Technical details

This blog post is the final of three to discuss the Genome Archive (GenArk) assembly hubs. This third post discusses the technical infrastructure of the GenArk hubs, while the first post was about accessing the data, and the second shared examples of using the data.

TECHNICAL DETAILS

What are the systems behind GenArk hubs?

In essence, GenArk hubs are assembly hubs that have been added as Public Hubs.

Anyone can build Track Hubs and Assembly Hubs, where finished hubs can then be requested to be published as Public Hubs. UCSC does have a Public Hub Guidelines page to encourage hub developers to document their data fully before submitting for inclusion, this is mainly to ensure users of Public Hubs know who to contact for their data and can understand what they are visualizing.

To aid independent groups that do not want to build assembly hubs, when the underlying assembly data is already available at NCBI, our engineers have crafted scripts to build these files automatically. GenArk scripts pull data from NCBI and then programmatically construct all of the binary-indexed files needed to visualize them on the UCSC Genome Browser. Additional special features have been included, especially the ability to generate and provide BLAT and PCR dynamic servers through the pre-generation of special index files. Our engineers have also optimized other elements of these GenArk Hubs by applying the latest available Track Hub features.

But what are the internal methods UCSC uses to populate the GenArk hubs?

Here at UCSC, an internal process maintains a local mirror image of the NCBI genome assembly resources. In essence, there is first a transfer of data with a rsync request, rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/ to a matching local hierarchy of directories. These matching directory structures and naming conventions for files enable scripting procedures to automatically find and process the source files into the formats the UCSC Genome Browser recognizes to visualize data, mainly byte-range accessible binary-indexed versions of the data. A Perl script, doAssemblyHub.pl, manages all the steps of the procedure (https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/doAssemblyHub.pl).

So how are gene tracks created for these GenArk hubs?

For the assemblies with GCF accessions, the script uses the supplied data of the NCBI gene annotations to create gene tracks that are provided with those files. A specific gene track type called bigGenePred, https://genome.ucsc.edu/goldenPath/help/bigGenePred.html, allows amino acid displays when zoomed-in at the base level. Likewise, a bigGenePred track is made to display Xeno RefGene data which is computed from a selection of best alignments of RefSeq mRNA sequences from many organisms to the genome, using the BLAT (http://www.kentinformatics.com/) algorithm (blat -noHead -q=rnax -t=dnax -mask=lower target.fa.gz query.fa.gz results.psl). Another bigGenePred track is made using the Augustus gene prediction software (http://bioinf.uni-greifswald.de/augustus/) from the Stanke lab.

How are the other GenArk annotation tracks made?

GenArk assemblies also have Repeat Masker tracks, which use the data when supplied from NCBI source. Otherwise, the track can be computed with a local installation of the Repeat Masker software (https://www.repeatmasker.org/). The Simple Repeats track is computed with the Tandem Repeats Finder software (https://tandem.bu.edu/trf/trf.submit.options.html) and the Window Masker track is computed with the WindowMasker software included in the NCBI C++ toolkit (https://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/). The CpG Islands are computed with a modification of a program developed by G. Miklem and L. Hillier and the GC Percent track is computed using the ‘kent’ command hgGcPercent (http://hgdownload.soe.ucsc.edu/admin/exe/). Examining the doAssemblyHub.pl script (https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/utils/automation/doAssemblyHub.pl), will illustrate more details about how individual steps are run (i.e., hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 test ../../\$asmId.2bit | gzip -c > \$asmId.wigVarStep.gz).

What if I don’t find my Assembly in the GenArk collection?

If you can’t find the assembly you want in the GenArk hub collection, but you do already have the GCA/GCF identifier you can email us at our public mailing-list genome@ucsc.soe.edu to request we add the assembly to the GenArk collection. This archived mailing-list is searchable from links on our contacts page, http://genome.ucsc.edu/contacts.html. Alternatively, if you don’t want your request to be public, you can email our private internal mailing-list at genome-www@soe.ucsc.edu. Also, since this original blog post, we created a new assembly request page, you can find details in this 4th GenArk blog post.

What if my assembly doesn’t have a GCA/GCF NCBI accession?

If NCBI does not have a GCA/GCF accession for your assembly then our scripts will not be able to pull the data and generate the GenArk hub. You will need to deposit the assembly at NCBI and notify us once the assembly has become available. You can find directions at NCBI for how to submit new genomes: https://www.ncbi.nlm.nih.gov/assembly/docs/submission/

A future manuscript is also in the works to further detail the background of the GenArk hubs.

This was the final blog in a three-part series about GenArk hubs authored by Brian Lee. The first post focused on how to discover and access the hubs, while the second blog post provided tutorial examples of using the GenArk hubs, such as the BLAT and PCR tools that are available, or how you can send DNA of any Assembly Hubs to External Tools for processing.

GenArk Hubs Part 2 – Using the data

This blog post is the second of three to discuss the Genome Archive (GenArk) assembly hubs. This second post discusses examples of using the GenArk hubs’ data, with the first post about accessing the data, and the third shares technical infrastructure behind the hubs.

Before launching into using the new GenArk hubs, let’s go quickly over how the first blog post examined the multiple ways to access the GenArk hubs. The easiest way to find a GenArk hub is by searching the UCSC Genome Browser’s main Gateway page with a name like “hummingbird” and clicking on the GCA/GCF identifier to attach it. Another is to build direct links to NCBI GCA/GCF assembly accessions when you know them to instantly arrive at the main Browser view, such as https://genome.ucsc.edu/h/GCF_005190385.1 for narwhal. Yet another is searching the UCSC Public Hubs page or going to the main GenArk homepage where you can in turn navigate directly to individual taxonomic group pages, such as for birds.

USING THE DATA

What can you do with a GenArk Assembly hub?

The new GenArk hubs come with the ability to perform BLAT DNA queries and PCR primer searches, as well as send the genome’s DNA to external tools.

As an example, let’s say you are curious if we have a specific bat genome. The first step would be to go to the Gateway page and search “bat” and discover multiple hits.

Looking at search results you see your desired specific “little brown bat” assembly and click on it so that hub is now selected, where under “Find Position” on the right there would now be “Mammal assemblies Hub Assembly” attached with “little brown bat” displayed and a specific GCF_000147115.1 NCBI accession. Clicking the “Go” button would bring you to the main Browser display. The same result happens from clicking this short direct GenArk /h/ hub link: https://genome.ucsc.edu/h/GCF_000147115.1

BLAT DNA Search

With this bat genome displaying if you had a short DNA sequence you wanted to search, you could paste it right in the top search box. For instance, after clicking the above link, try pasting on the main browser display CATTAGGCAAATATATGCATATAAGTTCTTTGTTTAATCTCT and hit “go”. The result, shown after a few seconds, will be sequence matches across the little brown bat genome. You can also go to the top Tools menu and then select “Blat”, and do the same step of pasting DNA sequence, required when searching especially longer strings. The Blat Tool page also allows you to search alternative sequences.

BLAT Protein Search

On the Tools > Blat page you can put in protein sequences to search. This is especially interesting if you want to find the location of a known protein from another species in your genome of interest. For example, if again you are on the little brown bat genome and you go to the Blat page, try to blat this portion of the human SOD1 protein:, LSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN

You will find a match (again note for protein searches, be sure to go to the Tools > Blat page). When viewing the results you can either click a “browser” link to see the matching spots on the genome. Or if you click a “details” link you will see the side-by-side alignment like this image below.

Besides DNA and protein searches, BLAT also allows translated RNA and translated DNA searches. Also the results from BLAT searches can be saved as custom tracks. This allows you to download and save these annotations, or save them in Sessions making the results more permanent and shareable. See this other blog post about sharing sessions for more information: https://bit.ly/UCSC_blog_sharing

PCR Primer Search

GenArk also provides PCR primer searches, by going to the Tools menu and selecting PCR. With the same “little brown bat” genome loaded, for instance, go to the Tools menu and select “In-Silico PCR” to arrive on the PCR page. Then enter these two primers, forward primer: AGTCATGGTCTCAGGAACCG and reverse primer: GTTACTAGGGCTCAGACCTC (there is no need in this example to click any other settings).

Then click “submit” to search the “little brown bat” genome for matches.

The results will be two hits, in part because this assembly has 11,654 scaffolds with some identical sequences (to see all the scaffolds click the “view sequences” link on the Gateway page described later).

Send DNA to External Tools

Another way to use GenArk hubs is to send the current DNA in the viewing window to external sites. By going to the View menu you can select the “In External Tools” option and export the current DNA for processing outside of the UCSC Genome Browser.

In this image a 7,477 bp region will be selected to be sent to external sites where selecting “In External Tools” under the View menu will result in a pop-up of various options.

In this case all of the options are presented as available for this 7,477 bp span, except for RNAfold, which requires the viewer to zoom in to less than 5 kpb, before sending the DNA to that external tool.

Send DNA to External Tools -Primer Design: Primer-BLAST

If you were interested in PCR Primer design in this region you could use the Primer3Plus or Primer-BLAST links. The Primer-BLAST link starts a job at NCBI, where after some time the results at NCBI will be optimal PCR Primers for this stretch of DNA. Here are example results sending the 7,477 bp span of the NW_005878708v1 little brown bat scaffold to NCBI.

With these results, one can return to the UCSC PCR Tool to test each result in order to discover if these primers will have potential off-target results beyond the desired chromosome.

Send DNA to External Tools -Primer Design: Primer3Plus

Another PCR Primer design option in the “In External Tools” menu is Primer3Plus. Here are example results sending the same 7,477 bp span of the NW_005878708v1 little brown bat scaffold to Primer3Plus.

Primer3Plus has the added benefits of a “Return to Genome Browser” button (top left) that if clicked will dynamically generate a custom track of the results to be seen back on the UCSC Genome Browser.

Above the Primer3Plus custom track identifies the input region that was sent (top grey bar), and then the individual left and right matching primer pair locations. At UCSC the primers can then be tested again with the UCSC PCR Tool where a highlight for the Primer3Plus suggested “Primer 5” is highlighted in the above image.

Send DNA to External Tools for oligo-analysis

Another tool you can export DNA of interest to is Regulatory Sequence Analysis Tools (RSAT) Metazoa for motif discovery. For instance, when looking at a GenArk assembly for Zebu Cattle, https://genome.ucsc.edu/h/GCF_000247795.1, using the View menu and In External Tools option one could select the RSAT link. RSAT provides a way to analyze the DNA sequence for transcription factor binding sites and over-represented oligo-nucleotides. Because RSAT requests your organism, in this example Bos taurus was used as a relative to zebu cattle, allowing for proceeding to request examination of the region. The DNA being sent in this example was near a region for the start of a gene predicted by Augustus. One of the RSAT results was a predicted motif, aaacttatagata, just upstream of the transcription start site for the predicted gene.

By going back to the UCSC Genome Browser and clicking into the Short Match track (under the top Mapping section) and pasting in the motif sequence, aaacttatagata, a display in the GenArk hub of where these matches occurred could be visualized.

The Short Match track’s ability to visualize the motif identifies the potential binding sites of transcription factors, predicted by RSAT. This Browser view of the Zebu Cattle GenArk assembly hub can be viewed with this Public Session link.

Can I add custom tracks to a GenArk Assembly hub?

Yes, users can add tracks to their data by going to the My Data menu and then selecting Custom Tracks to paste in information. Simple text-based tracks can be loaded, or more complicated binary-indexed files such as BAMs or VCFs or bigBeds can be loaded as well.

How do I name my sequences for my custom tracks?

Another special feature of GenArk hubs is that they are loaded with a special chromAlias file allowing for multiple alias names. When building custom tracks the scaffold names for sequences need to match the names in the assembly, but many options exist. For instance, with the Zebu Cattle genome, https://genome.ucsc.edu/h/GCF_000247795.1, if you type “v s” to view sequences, or click the top “Genomes” name and then the “view sequences” button, you will end up on a page where all the scaffolds of a genome are displayed.

Scrolling down you on the resulting page you will see a link titled “GCF_000247795.1.chromAlias.txt” which will have results like this:

# sequenceName    alias names    assembly: GCF_000247795.1_Bos_indicus_1.0
chr1    1    CM003021.1    NC_032650.1
chr10    10    CM003030.1    NC_032659.1
chr11    11    CM003031.1    NC_032660.1
...

What this chromAlias.txt file displays is how “chr1”, or “1”, or “CM003021.1” or “NC_032650.1” can be used to create custom tracks on chromosome one for this assembly (i.e., BED custom tracks “chr1 300 500” = “1 300 500” = “CM003021.1 300 500” = “NC_032650.1 300 500”).

Can I add a Track Hub to a GenArk Assembly Hub?

Yes, after loading a hub you can user go to the My Data menu and paste in the location of a hub to display on any of the GenArk assembly hubs. The one special detail is that your Track Hub’s genomes.txt genomes line only needs to have the GCA/GCF number such as “genome GCF_001984765.1”. See this example hub.txt file for an idea of how a hub could be loaded on a GenArk hub. Here is a link that will load that hub on a GenArk hub for American beaver:

https://genome.ucsc.edu/h/GCF_001984765.1?position=NW_017869957v1:1,285,000-1,793,000&hubUrl=https://data.cyverse.org/dav-anon/iplant/home/brianlee/examples/hub.txt

Can I share data on a GenArk assembly hub?

Yes, you can make a session and share the URL with others. Even better, publish your session to the Public Session page to make it more discoverable. See this previous blog about sharing data for more information: https://bit.ly/UCSC_blog_sharing

The next blog post in this series will provide some technical details about the GenArk hub architecture. The first post focussed on how to discover and access the hubs.

This entry written by Brian Lee. If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

GenArk Hubs Part 1 – Accessing the data

This blog post is the first of three to discuss the Genome Archive (GenArk) assembly hubs. This first post discusses accessing the GenArk hubs, the second post gives examples of using the data, and the third post describes the technical infrastructure behind the hubs.

Let’s start with a real-world story: imagine you are a researcher working on zebrafish, but you are using an alternative strain with unique polymorphic properties. You have a desire to do CRISPR on your particular zebrafish and you already have a FASTA file for the genome assembled into chromosomes, but have no annotations or way to visualize the data yet.

One option to visualize your FASTA would be to independently create a UCSC Assembly Track Hub to work on your zebrafish. Or now that UCSC has developed the Genome Archive (GenArk) system, when you submit your assembly into NCBI’s assembly database, you could contact us directly and request we generate the browser for you behind the scenes. This happened for a specific lab, where they submitted their specific TD5 zebrafish assembly to NCBI, https://www.ncbi.nlm.nih.gov/assembly/GCA_018400075.1, and the result after contacting us was a new assembly hub that could be easily loaded at UCSC with the following link: https://genome.ucsc.edu/h/GCA_018400075.1 In this case, the team at UCSC even helped generate liftOver alignment files between the UCSC zebrafish in this new TD5 zebrafish GenArk Public Hub addition, aiding identification of lifting annotations to the new browser.

So what are the GenArk Assembly hubs?

GenArk hubs are a collection of data files externally hosted from the main UCSC data website enabling browsing new genomes. GenArk genomes have NCBI Genbank assembly accessions starting with either GCA or GCF and the browsers allow visualizing and attaching laboratory-generated data. New software also enables UCSC to dynamically turn on query servers to search GenArk hubs with DNA sequences or test PCR primer pairs. GenArk hubs are part of the UCSC Public Hubs list where UCSC can update the data files with pipelines.

ACCESSING THE DATA

How do I access GenArk Assembly Hubs?

There are multiple ways to access the GenArk hubs, including searching the UCSC Genome Browser’s main gateway page, building direct links to NCBI GCA/GCF assembly accessions, searching the UCSC Public Hubs page, and navigating directly to individual taxonomic group pages.

Browser Gateway Page

The easiest way to find GenArk hubs is to search the species name on the Browser Gateway.

On the Gateway page in the top left box you can search a term such as “dog” and find all the genomes both hosted in our internal databases and in external Public Hubs that have dog in the name. In this image, a search for “dog” returns a top “Dog” match (UCSC database) as well as results for several species in Assembly Track Hubs that match on the term “dog” with the specific labrador dog breed selected from the GenArk Mammal Assemblies Hub (GCF_014441545.1).

Direct GCA/GCF Accession Links

In the situation where you may know the GCF/GCA identifier for an assembly, you can also search that term on the Gateway page or build a short link to directly load the hub. Links to UCSC with a hub (“/h/”) address, such as https://genome.ucsc.edu/h/GCF_000698965.1 will attempt to find and attach a matching final GCF-value, which originates from the NCBI accession, in this case, for an African ostrich assembly. If you don’t find a match, read more below about contacting us.

Public Hubs Page

Another place to find GenArk hubs is on the Public Hubs page where you can enter various terms, like “ostrich”: https://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=ostrich, You can expand the “Search details” to examine matching results. To load a desired hub, use a right-click to display an “Open this assembly” pop-up, or an option to configure individual track settings.

Genomes Menu

Another option to gain an overview of all the GenArk hubs is to click the “Genome Archive GenArk” link available under the “Genomes” menu.

This Genomes menu link will open the GenArk homepage. On the GenArk homepage, a variety of links exist including the line,“Please note: text file listing of 1,600 NCBI/VGP genome assembly hubs.” Clicking that link will open a single text file that lists all available hubs allowing a quick overview: https://hgdownload.soe.ucsc.edu/hubs/UCSC_GI.assemblyHubList.txt

Individual Taxonomic Pages

The GenArk homepage also has links to specific taxonomic groupings hub pages, such as mammals, fishes, or fungi. For instance, a “birds” link, https://hgdownload.soe.ucsc.edu/hubs/birds/index.html, brings you to a webpage with links to launch browsers, along with links to other details for each assembly.

These taxonomic group pages, such as this image of the bird’s page, have links to launch the browser (2nd column: common name and view in browser) and links to the source files (4th column: NCBI assembly).

Access to these taxonomic group pages is also available from the Public Hubs page.

By going to the Description column on the Public Hubs page you can click a link (Bird genome assemblies) to end up at the related taxonomic grouping page. Also of note that on the Public Hubs page, you can click a [+] plus button to expand the list of Assemblies and click one of the GCA/GCF accession links to directly load an assembly.

What if I don’t find my Assembly in the GenArk collection?

If the assembly of interest is not found, please visit our assembly request page. Search that page for your assembly, if there is a “view” link you can launch the existing genome browser. Otherwise, click the “request” button to fill out a form to add your genome of interest.

The assembly request page does require there to already be an existing GCA/GCF identifier. You can also always email us at our public mailing-list genome@ucsc.soe.edu to request we add the assembly to the GenArk collection. This archived mailing-list is searchable from links on our contacts page, http://genome.ucsc.edu/contacts.html. Alternatively, if you don’t want your request to be public, you can email our private internal mailing-list at genome-www@soe.ucsc.edu.

What if my assembly doesn’t have a GCA/GCF NCBI accession?

The next blog post in this series will provide examples of using the GenArk hubs, such as the BLAT and PCR tools that are available, or how you can send DNA of any Assembly Hubs to External Tools for processing. The final post examines the infrastructure behind the hubs.