Analysis of human female personal genome

From Biojournal.org

Jump to: navigation, search
 
 Automatic analysis of the first publically available female human genome by next generation sequencers as a part of PGP (Personal Genome Project).

 

Authors:

Jong Bhak, Taehyung Kim, Jongsun Park, Sunghoon Lee, Sungwoong Jho, Suan Jho, Minseob Lee, Rosalynn Gill, George Church, and Byungchul Kim

 

Affiliations:

1. Genome Research Foundation, Suwon, 443-270, South Korea.

2. Harvard Medical School, Boston, MA, USA

3. Diagnomics, Inc. San Diego, CA, USA.

Contacts:

bckim00@gmail.com, jongbhak@yahoo.com, gmc@harvard.edu

Unique Paper Identifier: 2011.11.05.01.Bhak


Abstract

A Caucasian female human genome analyses are reported. It was sequenced using Illumina GA2 next generation sequencers and 90GB of short DNA fragments of ~90 base pairs were obtained. Overall coverage on to the reference human geome (HG19) was 99.2% with sequencing depth of 33 folds. 3.5 million SNPs, 200,000 short indels, 10,000 long structural variations, and 125,000 copy number variations were found. We found 10,000 non-synonymous mutations resulting in 100 genes affected by amino acid changes. Out of 100, we found 2 presumably critical mutations that resulted in some structural changes. The whole analysis was carried out by a bioinformatics pipeline called 'Genome Engine' which finished the whole analysis process in about 2 weeks, semi-automatically. The donor of the genome is Rosalynn Gill who donated all the rights and information associated with the data to the public as a part of PGP (Personal Genome Project). Gill's genome was the first publically available female human genome and the data were available from http://bioftp.org since April 2010 with analyses results. Here we present the second stage comparative analyses of her genome.

 

Introduction

NGS (next generation sequencers) produced many personal genomes in various research projects such as PGP (Personal Genome Project), 1000 genome project, YH genome project of China, Korean Genome Project (KGP) by Genome Research Foundation in Korea, Japanese genome project, and Indian genome project. By 2011, the number of genomes has increased from 10s in 2010 to hundreds. However, relatively few female human genomes have been competely publicized. In 2008, Leiden university announced that they had sequenced the first female human genome. Whatever the reasons behind, claimed to be  the first female human genome data were not publicized nor its research report was announced by Nov. 2011. In the mean time, PGP has publicized the first female genome, Rosalynn Gill's, on the web in April 2010 with extensive analyses that were similar to the previous genomes such as James Watson, YH, and SJK, the first Korean genome. Gill was PGP number 9 and many of phenotypic and clinical data points were also public at the time of data publication.

The data have been used to be compared with various existing genomes and 10s of Korean genomes that have been available through KPGP (Korean Personal Genome Project). Here, we report the second stage analyses of PGP9 using automated genome pipeline called "Genome Engine". Genome Engine was developed to analyse the first Korean human genome in 2008. It is a personal genome analyzer with common functions such as detecting SNPs, Indels, SVs, CNVs, chart drawing functions, and genome browsing. The main idea of the second stage female genome analysis was to apply an already common genome analysis package to annotate it as much as possible with minimum effort.

 

Results

 

Whole-genome sequencing results for PGP9
 
Read length (bp)
Number of reads
Total bases (Gb)
Rate of mapped reads by BWA
Rate of unmapped reads
 
 
 
 
 

 

Illumina sequence data result by using short read mapping program.

Read length (bp)
Numbers of read (M)
Total bases(Gb)
Numbers of aligned reads(M)
Aligned bases (Gb)
Percentage of genome covered
Coverage depth
 
 
 
 
 
 
 

  
매핑되지 않은 readde novo로 보고 Velvet으로 assembly 분석. 이때 de novoquality score20 이상인 것만 사용한다.
 
리드 시퀀싱 에러 짐작하기
시퀀싱 에러 율을 짐작하기 위해 chromosomes Xpredominant한건 true로 나머진 error로 본다.
 
시퀀스 quality score를 계산한다. r=
  
염기 간 치환 에러 비율
CA간, GT 등을 Illumina Genome Analyzer로 관찰한다.
 
 
Result of de novo assembly.

Assembler
Contigs
n50(bp)
Minimum Length (bp)
Total Length (bp)
Scafolds
Reads Used
 
 
 
 
 
 
 

 
Genotyping result
 
Comparation of X chromosome SNP calls made from sequence vs. genotype data

 
Genotyped
n
Samtools
GATK
n
%
n
%
covered by sequence
Homozygote
 
 
 
 
 
Heterozygote
 
 
 
 
 
All
 
 
 
 
 
Concordant calls
Homozygote
 
 
 
 
 
Heterozygote
 
 
 
 
 
All
 
 
 
 
 
All disagreements
GT>Seq
 
 
 
 
 
Seq>GT
 
 
 
 
 
Other discordances
 
 
 
 
 

Genotyping of the PGP9 sample was carried out using the Illumina HumanHap BeadChip. GT>Seq denotes a heterozygous genotyping SNP call where there is a homozygous sequencing SNP call, Seq>GT denotes the converse, I.e. a heterozygous sequencing SNP call where there is a homozygous genotyping call. Other discordances are differences in the two SNP calls that cannot be accounted for by one allele being missing from one call.

 

Short indels identificaion in PGP9

 
 
Diploid
 
 
Total
Novel
non-exonic
Homozygous
 
 
Heterozyous
 
 
Subtotal
 
 
UTR
Homozygous
 
 
Heterozyous
 
 
Subtotal
 
 
CDS
Homozygous
 
 
Heterozyous
 
 
Subtotal
 
 
Sum
Total
 
 

 
 
Statistics of SNPs identified in PGP9 genome sequences through application x Compared with dbSNP

 
Total
Homozygous
Heterozygous
Total SNPs
 
 
 
Autosomal SNPs
 
 
 
CDS nsSNPs
 
 
 
Novel
 
 
 
non-exonic
 
 
 
UTR
 
 
 
CDS
 
 
 

 

List of nsSNPs detected in the PGP9 genome.

 

Distribution of size of PGP 9 indels.
Figure. X: Length of indel (bp) Y: Number of cases
 
Analysis of short indel calls in human genome data for PGP9.

Total number of calls and fraction that match previous entries in dbSNP.

 
Number of short indels
%dbSNP
Heterozygotes
 
 
Homozygoues
 
 
All indels
 
 

 

Homozagous deletion figure -

Heterozagous deletion figure -

Homozagous x kb inversion figure -

 

 

Annotation

 

 

Out of 90GB short DNA reads, 6% were not matched to the human reference. These were subject to de novo assembly program XXXX. We found 2 genes in the unassigned salvaged fragments. We assigned 25,xxx ORFs to PGP9. There were YYY genes mutated non-synonymously and they are listed in Table 1. 

What are the mutated genes? 

Any gene that is causing any clinical symptom to Gill?

  
By using Trait-o-matric we found and classified phenotypic correlations for variations in whole genomes.
We found x SNPs that were potentially associated with clinical phenotypes were identified. Of these, x were relatively common SNPs previously associated with risk of complex disorders or traits. For example, the genome of PGP9 contained x SNPs that have shown associations with susceptibility to various cancers, x SNPs with type 2 diabetes mellitus, The genome of PGP9 also contained x non-synonymous SNPs in genes associated with complex or Mendelian disorders or traits. Of these, x were stop codons and x were homozygous. Among Mendelian traits, PGP9 was homozygous for a variant conferring wet earwax that has a high allele frequency in caucasians. score of 4 or more)

 

PGP9 genome statistics with numbers of variants found (Polyphen 2 prediction, PharmGKB or HuGENet, GeneTests, nonsense and frameshift mutations, priority

Individual
Number of nonsyn. with
probably
damaging
Polyphen 2
prediction
Number of vars in
PharmGKB or
HuGENet
# of nonsyn
vars in genes
with clinical
testing
(GeneTests)
# of nonsense
and frameshift
mutations
# with
priority
score of
4 or
more
PGP9
 
 
 
 
 

 

SNPs Potestially associated with clinical phenotypes derived from the database of human gene mutation data (HGMD), OMIM, SNPedia and other hypotheses. DM.

Gene Symbol
HGMD
Phenotype
Disease Realationship
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 Result of homozygous and heterozygous OMIM HGMD. 

 
Gene Symbol
OMIM ID
Phenotype
Disease Realationship
homozygous
ABCF1
 
 
 
 
 
 
 
 
 
 
 
heterozygous
 
 
 
 
 
 
 
 
 
 
 
 

 

  

Ancestry

The ancestry of PGP9 was European maternally (XXX type) and paternally (XXX), Fig. 1. a.  

Geographic map showing the region of ancestry. MT type, mitochomdrial haplogroup. Figure -
Autosomal phylogenic tree using allele sharing distanve (ASD) and Neighbor Joining (NJ) methods Figure-
 
 
 
 
Comparative genomics

 We compared PGP9 with 10  other genomes.

How many unique SNPs in her genome? (compared to dbSNP and other genomes)

Compare Gill with Maryana genome.

 

Result of blast analysis of de novo contigs

Category
Velvet
 
Number of Contigs
Length (bp)
Total
 
 
Hs build 36.3
0
0
Hs GRCh 37
 
 
Hs Alt
 
 
Hs Other
 
 
Chimpanzee
 
 
Orangutan
 
 
Rhesus Macaque
 
 
Marmoset
 
 
Herpesvirus 4
 
 
Bos taurus
 
 
Remaning
 
 

 

Number of SNPs in the genome and the sequenced exome containing regions

Individual
Genomic SNPs
Novel SNPs
Coding exon SNPs
PGP9
 
 
 
NB1
 
 
 
MD8
 
 
 
TK1
 
 
 
ABT SOLID
 
 
 
ABT exome
 
 
 
NA 18507
 
 
 
NA 19240
 
 
 
J. Watson
 
 
 
J.C. Venter
 
 
 
NA12891
 
 
 
NA 12892
 
 
 
Chinese
 
 
 
Korean (SJK)
 
 
 

 

 
Number of SNPs identified for pair of mitochodrial sequences.
"CRS"= Cambridge Reference Sequence.

 

Conclusion

 

 

Methods.
DNA extraction, library preparation, and massively parallel sequencing (MPS)

Genomic DNA (gDNA) was extracted from whole blood with a QIAamp DNA blood kit according to the manufacturer's instructions (Qiagen). Libraries were prepared according to the manufacturer's instructions (Illumina). Briefly, 5 ㎍ of gDNA in 200 ㎕ nuclease free water was fragmented by bioruptor (Diagenode) at high power for 30 min. (30 sec ON and 30 sec OFF). Overhangs of fragmented gDNA were converted to blunt ends using T4 DNA ligase and Klenow enzyme. Subsequently, an ‘A’ base was added to the ends of double-stranded DNA using Klenow exo (3' to 5' exo minus).

The paired end adaptor (Illumina) with a single ‘T’ base overhang at the 3' end was ligated to the above products. The PE adaptor ligated products were separated on a 2% agarose gel and excised from the gel at positions approximately for span size ranges (100bp, 200bp, 300bp, and 400bp). Size-selected DNA fragments were enriched by PCR with PE primers 1.1 and 2.1 (Illumina). The concentration of libraries was measured by both nanodrop (Thermo) and Qubit IT (Invitrogen). Finally, the libraries were validated by Experion (BIO-RAD).
The gDNA library was sequenced using the Illumina 1G genome analyzer according to the manufacturer’s instructions.

Short read alignment
We used a fast short-read alignment program, BWA (ver. XXXX) with default parameters. BWA utilizes the read-pair information of paired-end reads. Using the read-pair
information, BWA corrects wrong alignments, adds confidence to correct alignments, and
accurately maps reads to repetitive sequences if their mate pairs are confidently aligned.


Calling SNPs
Using MAQ, we aligned short reads to the NCBI reference genome and produced a
consensus genotype sequence from the alignment. The options used for SNP calling were:
minimum four read depth (-d 4), maximum depth (-D 100) to filter out randomly placed
repetitive hits, consensus quality (Q20), adjacent sequence quality (Q20), and no SNP call
if any indel occurred in the 3bp flanking region.

De novo assembly of unmapped reads and mapping assembled contigs
Velvet version 0.7.27 (Zerbino and Birney 2008) was used to assemble unmapped reads into contigs with hash length 235, coverage cutoff 23, and minimum contigs length 100. Among 41.2 million unmapped 75bp reads, we selected 24.6 million high quality reads. We filtered out low quality reads with total scores below 1500 or reads with ‘N’.
Assembled contigs were aligned against the unanchored scaffolds of NCBI build 36 to find homologous sequence regions using the Blat alignment algorithm (Kent 2002). Before the alignment, the repeat sequences of contigs were masked by RepeatMasker (http://www.repeatmasker.org/). To select significantly homologous alignments, thresholds >95% identity and >80% coverage of contigs were used as thresholds. We performed a Blastx alignment against the NCBI refseq protein database of human
(NCBI build 36, 37,742 proteins), chimpanzee (NCBI build 2, 51,947 proteins), and mouse (NCBI build 37, 34,966 proteins). Hits with E-value <0.0001, percent identity >40%, and protein coverage >40% were used as significance thresholds.


Detection of short indels
We called short indels (from 29 bp deletion to 14 bp insertion) by using the MAQ pairedend indel detection method. Short indels were confirmed when they were identified in both strands with a minimum of three reads. When more than one indel candidates co-occured in a window size of 20bp, they were filtered out because alignment errors were found to represent most cases.

Detection of structural variants (SV)
We detected SVs based on span size and orientation information of each paired-end read. Paired-end reads with an anomalously long span size (more than double the average span size of each DNA library) were identified as SV candidates (deletion and inversion), especially when they had a minimum of three reads in the region, maximum 100 read depth and mapping quality (Q20). SV candidates either found in repeat regions of the genome or having more than 100kb of genomic deletions were filtered out. For insertion detection larger than the short indels (-29 - +14 bp), the longest 300bp span size of our paired-end libraries was used. Thus, we could fill 175~250bp insert gaps between short inserts and large inserts. The criteria used for detecting these insertions absent from the reference genome in the range of 175~250bp were: minimum four read depth, maximum 60 read depth to filter out randomly placed hits in repetitive structure region, and mapping quality (Q20).


Genetic ancestry
The chromosome Y haplotype data from Y Chromosomal Consortium (YCC) were used for the paternal lineage analysis. The mitochondrial DNA sequence was aligned with rCRS (Revised Cambridge Reference Sequence of the human mitochondrial DNA) (Andrews, Kubacka et al. 1999), and identified variations were mapped to a MitoVariome set (http://variome.kobic.re.kr/MitoVariome), which is an extended set of Mitomap (Brandon, Lott et al. 2005). The haplogroup-diagnostic nucleotide sequence variants were assembled from reference information (Kong, Bandelt et al. 2006). To make an autosomal phylogenic tree, Allele Sharing Distance (ASD) and Neighbor Joining (NJ) methods were used (Saitou and Nei 1987; Bowcock, Ruiz-Linares et al. 1994; Mountain and Cavalli-Sforza 1997). Four personal genomes, Watson, HuRef (Venter), YH (Han Chinese), and SJK, were used with HapMap Phase III samples of four populations, YRI (Yoruba), CHB (Chinese), CEU (Caucasian), and JPT (Japanese). HapMap samples had no relationship within them. Eight randomly chosen Koreans and the donor’s mother were genotyped with a Illumina 1M bead array. To calculate ASD, the proportion (Pi) of shared alleles between individuals was estimated by  --- 1) Where the number of shared alleles S is added over all loci u, distance between individuals
(Di ) is estimated by
Di = 1 - Pi --- 2)
Phylogenic analyses were conducted in MEGA4 (Tamura, Dudley et al. 2007).

 

 

Acknowledgement

JB thanks TheragenEtex for generously supporting genomics research.

 

References

 

 

 

Personal tools