Obtaining dbSNP identifiers for CA IDs¶
Requirements¶
- Requirements
	- Unix / Linux 
	- We are using CentOS 6.9
 
- Ruby 1.8.7+
	- Required for extracting HGVS expressions from input, but the logic could be ported to another language
 
- C++ complier (gcc 4.8.2+)
	- Required for building bcftools
- gcc --version
	- gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
 
 
- bcftools and htslib
	- Versions
	- bcftools 1.9-40-g44192ea
- htslib 1.9-21-gb7e5a51
 
- bcftools htslib install notes
 
- Versions
	
- Download request with payload script (we'll use the shell script for this example, but there is also a ruby and python version)
	- http://reg.clinicalgenome.org/doc/scripts/
	- Make executable
	- chmod 755 request_with_payload.sh
 
- request_with_payload.sh
 
- Make executable
	
 
- http://reg.clinicalgenome.org/doc/scripts/
	
 
- Unix / Linux 
	
Download dbSNP data¶
- Download latest dbSNP data set
	- mkdir 10-13-21-dbSNP
- cd 10-13-21-dbSNP
- time wget -r –level=0 -E –ignore-length -x -k -p -erobots=off -np --no-check-certificate -N https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
	- 21m (47G)
 
 
- Verify checksums
	- cd ftp.ncbi.nih.gov/snp/latest_release/VCF/
- md5sum -c GCF_000001405.25.gz.md5 #GRCh37
	- GCF_000001405.25.gz: OK
 
- md5sum -c GCF_000001405.39.gz.md5 #GRCh38
	- GCF_000001405.39.gz: OK
 
 
Extract variants¶
- Extract variants using bcftools
	- time bcftools norm -m -any GCF_000001405.25.gz | bcftools view -V snps > GCF_000001405.25.gz.indels.vcf
	- 167m
 
- time bcftools norm -m -any GCF_000001405.39.gz | bcftools view -V snps > GCF_000001405.39.gz.indels.vcf
	- 182m
 
 
- time bcftools norm -m -any GCF_000001405.25.gz | bcftools view -V snps > GCF_000001405.25.gz.indels.vcf
	
- Observe line counts
	- GCF_000001405.25.gz.indels.vcf
	- Total lines: 131,924,564
- Variants: 131,924,224
 
- GCF_000001405.39.gz.indels.vcf
	- Total lines: 135,985,520
- Variants: 135,984,844
 
 
- GCF_000001405.25.gz.indels.vcf
	
Extract HGVS¶
- Extract HGVS from input
	- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.25.gz.indels.vcf
	- 99m
	- 131,924,224 lines
 
 
- 99m
	
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.39.gz.indels.vcf
	- 100m
	- 135,984,844 lines
 
 
- 100m
	
 
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.25.gz.indels.vcf
	
Split input files into 1M chunks¶
- Split input files into 1M chunks 
	- time split -a 3 -d -l 1000000 GCF_000001405.25.gz.indels.vcf.hgvs GCF_000001405.25.gz.indels.vcf.hgvs.
- time split -a 3 -d -l 1000000 GCF_000001405.39.gz.indels.vcf.hgvs GCF_000001405.39.gz.indels.vcf.hgvs.
 
Extract HGVS only for Allele Registry POST query and combine input / output into mapping file¶
Note: This is for (1) 1M chunk of GRCh38. The same procedure would be used for the remaining files in GRCh38/37.
- Extract HGVS only from the input
	- cut -f1 GCF_000001405.39.gz.indels.vcf.hgvs.000 > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only
 
- Search dbSNP variants by HGVS
	- Get CA ID only
	- time sh ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id+end" GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json
	- 45s
 
 
- time sh ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id+end" GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json
	
 
- Get CA ID only
	
- Combine input and output
	- HGVS  RSID  CAID
	- paste GCF_000001405.39.gz.indels.vcf.hgvs.000 <(grep -o CA[0-9]* GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json) > GCF_000001405.39.gz.indels.vcf.hgvs.000.combined.txt
 
 
- HGVS  RSID  CAID
	
NC_000001.11:g.10013_10014delinsT 1639538231 CA1147837201 NC_000001.11:g.10019_10020delinsT 775809821 CA16716391 NC_000001.11:g.10024delinsCT 1639541758 CA1148213494 NC_000001.11:g.10031_10032delinsT 1639542364 CA1148307565 NC_000001.11:g.10036_10037delinsC 1639542429 CA1148307570 NC_000001.11:g.10039delinsAC 1639542564 CA1139532654 NC_000001.11:g.10042delinsCT 1639543135 CA1148307582 NC_000001.11:g.10045delinsAC 1639543279 CA1148307591 NC_000001.11:g.10049_10050delinsT 1639543682 CA1148307601 NC_000001.11:g.10051delinsAC 1326880612 CA884477203 ...