Monthly Archives: February 2019

Patching up the Genome

From biologists to computer scientists, the human genome has presented a grand puzzle. With regards to UCSC, the story began in 1985 when our chancellor, molecular biologist Robert Sinsheimer, proposed a bold endeavor – sequence the complete human genome. 5 years later the International Genome Project was launched. The next chapter took place in 1999 when computer science professor David Haussler was asked to join the project.  Haussler, in turn, enlisted then graduate student Jim Kent to help with assembling the genome. This collaboration culminated on July 7, 2000, when the first human genome assembly was made available on the UCSC servers. Over 500 GB were downloaded worldwide in 24 hours.  (Hey, back in 2000, that was a lot!)

UCSCReleaseDownloads

Total web traffic at the University of California Santa Cruz in 2000. When the genome becomes available online, all other web activity at the university shrank to the background.

Three months later, the UCSC Genome Browser came online as a resource to distribute and visualize the genome.  The first ten releases, hg1-hg10 were assembled at UCSC, after which the task was taken over by NCBI. As NCBI incremented the official releases and changed the naming scheme, UCSC released browsers at a slower rate, continuing to increment the hg* nomenclature.  By the time NCBI released NCBI33 in 2003, UCSC released it as hg15. After releasing so many browsers in under three years, the pace slowed, with each assembly taking around one year longer than the previous.

Patches: What are they and why are they important?

Blog_table

Note: hg38 follows hg19. The UCSC nomenclature was changed to match the Genome Reference Consortium (GRC)’s GRCh release number.

The early genome assemblies were largely aiming to increase the fidelity of the reference. However, with each release, research progress was temporarily hampered as scientists adjusted to sequence changes and shifted coordinates. This has often led to scientists continuing to use an older release as it may be better annotated and established. This is evident in the Genome Browser as a majority of our users continue to work on GRCh37/hg19 in spite of GRCh38/hg38’s release more than 4 years ago.

Looking at the numbers, however, we can see that GRCh38 is the most accurate human genome to date. With these benchmarks in accuracy, the GRC has shifted focus beyond fidelity to inclusion. The GRC  now strives to capture more of the genetic diversity present in the human population. The initial release of GRCh38/hg38 included 261 alternate haplotype sequences, nearly a 30-fold increase over GRCh37/hg19.

UCSC builds a new assembly database for each full release of a genome assembly, but the GRC also releases “patch” updates for genome assemblies. Through patch releases, the GRC adds new alternate haplotype sequences, and also corrected sequences, without changing the sequences or coordinate system of the initial assembly release.

To quote directly from the GRC:

Patches are accessioned scaffold sequences that represent assembly updates. They add information to the assembly without disrupting the chromosome coordinates. Patches are given chromosome context via alignment to the current assembly. Together, the scaffold sequence and alignment define the patch.

These patch sequences are more important now than ever before as the GRC has decided to indefinitely postpone the release of the next coordinate-changing assembly (which would have been GRCh39/hg39), instead opting for additional patches to GRCh38/hg38. There are two kinds of patch sequences:

Novel patches (alternative haplotypes): Chromosomal regions of the genome that exhibit sufficient variability to prevent adequate representation by a single sequence. Also referred to as alternate loci. UCSC labels these haplotype sequences by appending “_alt” to their names.

Fix patches: Error corrections (addressed by approaches such as base changes, component replacements/updates, switch-point updates or tiling-path changes) or assembly improvements, such as the extension of sequence into gaps. UCSC labels these fix sequences by appending “_fix” to their names.

These patch sequences, especially novel patches, have been increasing in number and will continue to do so.

patches

The number of human assembly patch sequences is quickly growing. This is primarily due to alternative haplotypes (_alt) sequences, though fix sequences (_fix) are also being introduced. The fix patches reset from GRCh37.p13 to GRCh38 as they were integrated into the assembly.

A better approach to patches

Our approach thus far in the Genome Browser has been to make data tracks indicating the locations of these patch releases along the initial assembly chromosomes. While these are useful, they provide little in the way of annotations and are largely underutilized by users. With the increase of these patches and postponement of GRCh39, however, we have decided to switch our approach and add the new sequences, and annotations on the new sequences, to the UCSC hg38 database. This will allow patches to be visualized on the Browser as standalone reference sequences, not unlike a regular chromosome or the alternate haplotype sequences that were included in the initial assembly release. BLAT results may also include alignments to these sequences.

The addition of new genomic sequences to an existing UCSC database is a departure from our longstanding practice of building a new database every time we import a new genome assembly release.  To minimize disruption to pipelines that use our download files, especially those in the bigZips directory, we will leave the original bigZips/hg38.* files unchanged, and add a subdirectory when we incorporate sequences from a patch release; for example, bigZips/p12/ for patch release GRCh38.p12.  We will also add bigZips/latest/ which will link to the most recent patch release subdirectory, so that pipelines may stay up to date with UCSC’s patch sequence annotations if desired. In other words, the bigZips downloads will be “opt-in” for patch sequences.

Changes and improvements to hg38

Currently, we are in the process of adding these sequences to the GRCh38/hg38 genome database with the potential to do the same for GRCh37/hg19 and GRCm38/mm10 at a future date. Changes that users may see are as follows:

  • BLAT/In-Silico PCR – Additional hits on _alt and _fix sequences
  • Position searches in the hg38 browser may lead to _alt and _fix sequences in addition to or instead of initial assembly chromosomes
  • Replacing the ‘GRC Patch Release’ and ‘Alt Map’ tracks with ‘Fix Patches’ and ‘Alt Haplotypes’ tracks which include alignments to alts/fixes with details pages and links to jump between main chromosomes and alts/fixes
  • New subdirectories of bigZips download directory (initial, p12, latest)
  • New sequences/annotations in /gbdb/hg38 download files (same file names, extended contents)
  • SQL queries to genome-mysql.soe.ucsc.edu may include new results on _alt and _fix sequences

It is also worth noting what will not change. Existing sequences, and annotations on existing sequences, will not change. Download files in the bigZips directory, such as bigZips/hg38.2bit and bigZips/hg38.fa.masked.gz, will not change.

So what kind of annotations can be found on these hg38 patch sequences?

  • Annotations generated by UCSC such as RepeatMasker, CpG Islands, AUGUSTUS, Human mRNAs and Pfam
  • NCBI’s sequence alignments of patch sequences to chromosomes: Fix Patches, Alt Haplotypes
  • External annotation sources such as RefSeq and GENCODE that include annotations on patch sequences (up to this point we have ignored those patch annotations)
  • Select tracks have been lifted from main chromosomes onto the patches using NCBI’s alignments, most notably GTEx Gene and ENCODE Regulation

For additional information on these patch sequences, and a full list of sequences in hg38, you may visit the hg38 Genome Browser Gateway page.

We are always receptive to our users and their needs. If there are any specific track annotations you would like to see on these patches or if you have any questions regarding this implementation and how it may affect you, please write into our public mailing list (genome@soe.ucsc.edu) or our private mailing list if your message includes sensitive data (genome-www@soe.ucsc.edu).