Annotating millions of private variants with vai.pl

For almost 4 years, Genome Browser users have been able to use the Variant Annotation Integrator (VAI) to predict the functional effects of their variants of interest. The VAI takes a variety of inputs (pgSnp/VCF custom track or hub, dbSNP rsID, HGVS terms) and annotates each variant with its functional effect in Sequence Ontology terms (e.g. synonymous_variant, missense_variant, frameshift_variant). The VAI returns predictions in Variant Effect Predictor (VEP) format, which is described here.

The VAI is quite flexible, and offers the option to choose any gene set or gene prediction track in the chosen genome database for functional annotation. All human genome databases include UCSC Genes (based on GENCODE V24 in GRCh38/hg38), RefSeq Genes, GENCODE/Ensembl genes as well as gene predictions produced by tools such as Augustus. Gene/transcript annotations used as the basis for functional effect prediction should be chosen carefully since they have a large effect on results (McCarthy et al.). For the GRCh37/hg19 and GRCh38/hg38 assemblies in particular, regulatory regions from ENCODE summary datasets can be used to identify variants that may have a regulatory effect, and disease or pathogenicity information from the Database of Non-Synonymous Functional Predictions (dbNSFP) can help distinguish between protein changes that are likely to be very disruptive to function, versus those that are likely to have little functional effect. Conservation scores may also be added to the output.

To reduce the volume of output and narrow in on the variants that are most likely to damage genes, filters can be added to restrict the output to specific functional effects (such as missense, frameshift, etc.) and/or variants overlapping conserved elements predicted from multi-species alignments.

As a web tool, however, the VAI has some limitations. It can annotate only 100,000 variants at a time, which rules out annotating the variants from a whole genome sequencing experiment. Also, for clinical users, privacy restrictions may prevent the uploading of a patient’s variant data to the UCSC Genome Browser.

Now our new vai.pl program provides a way around these restrictions. This program is intended to be run on a Genome Browser in a Box (GBiB) or server hosting a mirror of the Genome Browser. vai.pl forms an interface to the VAI program running on your GBiB or mirror (so private data stays local), and is able to bypass the variant limit imposed by the web-based VAI. The script has many of the same configuration options as the web-based VAI, including filtering via functional effect term, position filters, and dbSNP rsID annotation. The script even includes a “--dry-run” option, so power users can further configure the VAI to better suit their needs.

Example Usage

For example, say you have a VCF file with a couple thousand variants and you want to check to see if there are any dbSNP rs IDs associated with your variants. Use the --rsId option:

$ vai.pl hg19 --rsId gatkUG.vcf.gz

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:40:35
## Connected to UCSC database hg19
## Variants: from file or URL (/hive/users/chmalee/hgVaiScriptTesting/gatkUG.vcf.gz)
## Transcripts: RefSeq Genes (hg19.refGene)
## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 149) (/gbdb/hg19/vai/snp149.bed4.bb)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
chr20_10000117_C/T chr20:10000117 T SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs4816203 DISTANCE=4343
chr20_10000211_C/T chr20:10000211 T SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs4813908 DISTANCE=4249
chr20_10000439_T/G chr20:10000439 G SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs4816204 DISTANCE=4021
chr20_10000598_T/A chr20:10000598 A SNAP25-AS1 NR_040710 Transcript downstream_gene_variant - - rs6057087 DISTANCE=3862
...
...
...

What if your colleague gave you a list of rs IDs, and you want to know what genes they fall in and what changes they might cause? Just pass vai.pl your list of rs IDs as an input file, and it will do the rest!

$ vai.pl hg19 listOfRsIDs.txt

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:45:46
## Connected to UCSC database hg19
## Variants: Variant Identifiers (/data/tmp/hgv/hg19_bd61d73837d586acba8b9a674d8bf351.vcf)
## Transcripts: RefSeq Genes (hg19.refGene)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs762221666 chr1:36228116 T CLSPN NM_001330490 Transcript intron_variant - - - - - INTRON=4/24
rs762221666 chr1:36228116 T CLSPN NM_001190481 Transcript intron_variant - - - - - INTRON=4/23
rs762221666 chr1:36228116 T CLSPN NM_022111 Transcript intron_variant - - - - - INTRON=4/24
rs528917690 chr1:229013556 G - - - intergenic_variant - - - - - - -
rs558192635 chr10:25615277 C GPR158 NM_020752 Transcript intron_variant - - - - - INTRON=2/10
rs769006799 chr10:26225804 T LOC101929073 NR_120650 Transcript upstream_gene_variant - - - DISTANCE=3165
...
...
...

Note that rs528917690 has no gene name because it is an intergenic variant.
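
To get a quick overview of the kinds of effects in a result file, the Consequence column can be tallied. The following is only a minimal sketch and not part of vai.pl: the function name tally_consequences and the file name rsIds.vai are hypothetical. It assumes that the first seven columns are never empty and contain no spaces (as in the rows shown above), so that awk's default whitespace splitting puts the Consequence term in field 7. Note that it counts output rows (one per variant and transcript), not unique variants.

```shell
# Hypothetical helper (not part of vai.pl): count output rows per
# Consequence term in a VAI/VEP-format file.
# Assumption: columns 1-7 are non-empty and contain no spaces, so the
# Consequence term is whitespace-separated field 7.
tally_consequences() {
    awk '!/^#/ && !/^Uploaded/ && NF >= 7 { count[$7]++ }
         END { for (c in count) print count[c] "\t" c }' "$1" |
        sort -k1,1nr -k2,2    # most frequent first, ties by name
}

# Example usage (rsIds.vai is a hypothetical file name):
#   vai.pl hg19 listOfRsIDs.txt > rsIds.vai
#   tally_consequences rsIds.vai
```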

What if you only care about the variants that fall on chr22? vai.pl supports a --position option built just for that:

$ vai.pl hg19 --rsId --position=chr22 chr22.1000GenomesPhase3.vcf.gz

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:36:32
## Connected to UCSC database hg19
## Variants: from file or URL (/hive/users/chmalee/hgVaiScriptTesting/chr221000GenomesPhase3.vcf.gz)
## Transcripts: RefSeq Genes (hg19.refGene)
## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 149) (/gbdb/hg19/vai/snp149.bed4.bb)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs587697622 chr22:16050075 G - - - intergenic_variant - - - - - rs587697622 -
rs587755077 chr22:16050115 A - - - intergenic_variant - - - - - rs587755077 -
rs587654921 chr22:16050213 T - - - intergenic_variant - - - - - rs587654921 -
rs587712275 chr22:16050319 T - - - intergenic_variant - - - - - rs587712275 -
rs587769434 chr22:16050527 A - - - intergenic_variant - - - - - rs587769434 -
...
...
...

Ok great, but the web-based VAI lets me restrict the annotation to specific kinds of variants, and I don’t care about intronic variants, upstream/downstream variants, or intergenic variants, only those that fall within exons as annotated by the GENCODE V24 track. Luckily, vai.pl supports annotation via specific gene tracks with the --geneTrack option, and can include or exclude different functional types with the --include_* options:

$ vai.pl hg38 --include_intron=off --include_upDownstream=off --include_intergenic=off \
--geneTrack=wgEncodeGencodeCompV24 listOfRsIDs.txt

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-05 13:53:57
## Connected to UCSC database hg38
## Variants: Variant Identifiers (/data/tmp/hgv/hg38_bd61d73837d586acba8b9a674d8bf351.vcf)
## Transcripts: Comprehensive Gene Annotation Set from GENCODE Version 24 (Ensembl 83) (hg38.wgEncodeGencodeCompV24)
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs371031144 chr13:112864559 A ATP11A ENST00000471555.5 Transcript NMD_transcript_variant - - - INTRON=8/12
rs750654524 chr16:57996479 T ZNF319 ENST00000299237.2 Transcript 3_prime_UTR_variant 2410 - - EXON=2/2
rs575863935 chr16:83575109 A CDH13 ENST00000539548.6 Transcript NMD_transcript_variant - - - INTRON=6/12
rs78060447 chr4:83316348 C HPSE ENST00000507150.5 Transcript NMD_transcript_variant - - - INTRON=4/11
...
...
...

By default, vai.pl includes all functional types; use the matching --include_TYPE=off switch (for example --include_intron=off) to turn a type off. vai.pl also limits output to 10,000 variants, but you can override this with the --variantLimit option. The following example compares the number of annotated variants with default settings and with the --variantLimit option (note that the grep command at the end gives only a rough approximation of the number of unique variants):

$ vai.pl hg19 ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.vcf.gz > HRC.vai
$ grep -v ^# HRC.vai | grep -v ^Uploaded | awk '{print $2 ":" $1;}' | uniq | wc -l
9992
$ vai.pl hg19 --variantLimit=10000000 ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.vcf.gz > HRC.vai
$ grep -v ^# HRC.vai | grep -v ^Uploaded | awk '{print $2 ":" $1;}' | uniq | wc -l
9992195
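
One reason the pipeline above is only approximate is that uniq collapses duplicate lines only when they are adjacent. A slightly more careful count can be sketched with an awk array that remembers every (Location, Uploaded Variation) pair it has seen. The function name count_unique_variants is hypothetical and not part of vai.pl, and for millions of variants the array itself needs a fair amount of memory.

```shell
# Hypothetical helper (not part of vai.pl): count distinct
# (Location, Uploaded Variation) pairs in VAI output.
# Comment lines (#...) and the "Uploaded Variation ..." header are skipped.
count_unique_variants() {
    awk '!/^#/ && !/^Uploaded/ && !seen[$2 ":" $1]++ { n++ }
         END { print n + 0 }' "$1"
}

# Example usage:
#   count_unique_variants HRC.vai
```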

Unfortunately the script will not annotate more than approximately 10,000,000 variants due to the amount of memory needed (it caps its usage at 6GB; this may change in the future), so setting the --variantLimit option any higher than 10,000,000 will not work. Instead you will need to split up your VCF file. For a full list of all the options outlined here as well as others, run vai.pl with no arguments to get the usage message.
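
Splitting a large VCF can be done in several ways. The following is one possible sketch, assuming an uncompressed VCF whose header lines all start with "#". The function name split_vcf, the chunk size, and the file names are hypothetical and not part of vai.pl. Each chunk repeats the full header so that it remains a valid VCF on its own.

```shell
# Hypothetical helper (not part of vai.pl): split a VCF into chunks of at
# most MAX variant records, copying the header lines into every chunk.
# Usage: split_vcf INPUT.vcf MAX PREFIX   -> PREFIX.001.vcf, PREFIX.002.vcf, ...
split_vcf() {
    awk -v max="$2" -v prefix="$3" '
        /^#/ { header = header $0 "\n"; next }    # collect header lines
        (n % max) == 0 {                          # time to start a new chunk
            if (out != "") close(out)
            out = sprintf("%s.%03d.vcf", prefix, ++chunk)
            printf "%s", header > out
        }
        { print > out; n++ }                      # write the variant record
    ' "$1"
}

# Example usage (file names are hypothetical; compress the chunks again
# if your setup requires it):
#   gzip -dc big.vcf.gz > big.vcf
#   split_vcf big.vcf 5000000 big.part
#   for chunk in big.part.*.vcf; do
#       vai.pl hg19 --variantLimit=10000000 "$chunk" > "$chunk.vai"
#   done
```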

The script has some other drawbacks as well. For one, as previously mentioned, the script can only be run on a GBiB, a mirror site (installed via the Genome Browser in the Cloud (GBiC) script or a manual installation), or another machine running our CGIs. This is because under the hood the script uses the existing web-based VAI executable to run, which in turn requires either manual compilation of our source code, or a precompiled binary from our downloads server. Furthermore, VAI is tightly coupled to our genome databases and files.

Secondly, users will need to have a .hg.conf file in their home directory in order to run the program. The .hg.conf file is a file that a majority of UCSC-specific utilities use to set various configuration options. This file specifies options like which MySQL server to point to (a local server with private data), a fallback MySQL server if one isn’t available (UCSC public MySQL server), the primary location of bigData files, etc. Our source tree includes a very minimal hg.conf file that should allow basic usage of the script.

Different system setups will require different options in each user’s hg.conf settings. If you are running a full mirror, the CGIs require their own hg.conf, separate from that of each user who may be running vai.pl! GBiB users should already have a functioning .hg.conf file, so for the script to work out of the box, you should only need to change the udc.cacheDir setting from:
udc.cacheDir=/data/trash/udcCache

to a user-writable directory such as:
udc.cacheDir=./udcCache

Mirror users won’t have a .hg.conf file by default, but can create one and add the line:
include /usr/local/apache/cgi-bin/hg.conf

This will take care of most of the work outside of fine-tuning a few settings like the udc.cacheDir mentioned previously. For more information about hg.conf parameters and settings, please see the example hg.conf file here. Any questions about fine-tuning these parameters should be sent to our public support forum mentioned below.
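
For mirror users, the setup described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name write_user_hg_conf and the cache directory are hypothetical, and it assumes that a udc.cacheDir line placed after the include line takes precedence over the value in the included file, which you should verify on your own mirror. The chmod is a precaution in case database credentials are later added to the file.

```shell
# Hypothetical helper: write a minimal per-user hg.conf for a mirror.
# Usage: write_user_hg_conf CONF_PATH CGI_HG_CONF CACHE_DIR
# Assumption: a setting that follows the include line overrides the
# value from the included file (verify this on your own system).
write_user_hg_conf() {
    mkdir -p "$3"                      # user-writable udc cache directory
    {
        echo "include $2"
        echo "udc.cacheDir=$3"
    } > "$1"
    chmod 600 "$1"                     # keep the file private to the user
}

# Example usage:
#   write_user_hg_conf "$HOME/.hg.conf" /usr/local/apache/cgi-bin/hg.conf "$HOME/udcCache"
```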

Download

vai.pl is available from the UCSC Genome Browser Store via download of the GBiB (use the gbibAddTools command), GBiC (use the browserSetup.sh addTools command), or full source. vai.pl is free for non-commercial use. If you encounter issues or have any questions while running vai.pl, please send your questions to our public mailing list at genome@soe.ucsc.edu, or if your question involves private data to genome-www@soe.ucsc.edu.

References

Choice of transcripts and software has a large effect on variant annotation.
McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P.
Genome Med. 2014 Mar 31;6(3):26. doi: 10.1186/gm543.

UCSC Data Integrator and Variant Annotation Integrator.
Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR, Zweig AS, Haussler D, Kent WJ.
Bioinformatics. 2016 May 1;32(9):1430-2. doi: 10.1093/bioinformatics/btv766.


If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.
