Enlightenment is an integrated database that brings together genomics data and NGS-based DNA sequence profiles of individuals on a single platform. It provides a simple, user-friendly interface and a customizable download feature for accessing genomic data annotations and NGS-based whole-genome sequence data, and for examining their correlation with cancer. Enlightenment computes, annotates, and stores several novel copy number variants (CNVs) of individuals, organ-specific significant biomarkers associated with cancer, and drug-associated biomarkers. In addition, Enlightenment addresses the challenge of storing and processing NGS-based whole-genome DNA sequences by deploying the database on HBase, a distributed NoSQL database running on top of the Hadoop Distributed File System (HDFS), which provides a flexible architecture for integrating data from different sources. This scalable architecture leaves room to incorporate other omics data, especially proteomics, alongside the genomics data annotated in the current version of the database.
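To make the storage idea concrete, the following is a minimal sketch of how per-coordinate data could be keyed in a store such as HBase, which keeps rows sorted lexicographically by row key. The key layout, the function names make_row_key and scan_range, and the sample identifier are all hypothetical and are not taken from Enlightenment's actual schema.

```python
# Hypothetical row-key layout for per-coordinate data in a sorted key-value
# store such as HBase. This is an illustration, not Enlightenment's schema.
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]


def make_row_key(sample_id: str, chrom: str, pos: int) -> bytes:
    """Build a key whose byte order matches genomic order within one sample."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    chrom_idx = CHROM_ORDER.index(chrom)  # ValueError for unknown contigs
    # Zero-padding makes lexicographic order agree with numeric order.
    return f"{sample_id}|{chrom_idx:02d}|{pos:010d}".encode("ascii")


def scan_range(sample_id: str, chrom: str, start: int, end: int):
    """Return (start_key, stop_key) bounding the positions in [start, end)."""
    return make_row_key(sample_id, chrom, start), make_row_key(sample_id, chrom, end)


if __name__ == "__main__":
    keys = [
        make_row_key("sampleA", "chr2", 100),
        make_row_key("sampleA", "chr10", 1),
        make_row_key("sampleA", "chr2", 5),
    ]
    for key in sorted(keys):
        print(key.decode("ascii"))
```

With a layout of this kind, all positions of one sample and chromosome are contiguous, so a genomic region maps to a single range scan between the two keys returned by scan_range. The sketch assumes sample identifiers do not contain the "|" separator.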
The integrated database Enlightenment has been annotated from several perspectives to give insight into genomic/gene-level analyses, nucleotide-level analyses, cancer-specific disease analyses, and ethnicity-based analyses. The source databases fall into the following categories:
1)DNA Alterations and Variants:
The databases included under this category are the Database of Genomic Variants (DGV) (release: 2020-02-25) and Copy Number Variation in Disease (CNVD) (©2012). The contents of DGV correspond to three genome assemblies: NCBI Build 36 (hg18), GRCh37 (hg19), and GRCh38 (hg38). These two databases were chosen to give a comprehensive view of disease-related CNVs, as provided by CNVD, and of common CNVs found in healthy individuals, as provided by DGV.
From DGV, 214100 CNV records corresponding to the NCBI Build 36 assembly have been integrated. Of these, 29453 deletion events, 19549 duplication events, 37200 gain events, 25399 insertion events, and 81727 loss events have been ingested into the database. From CNVD, 52746 disease-related CNVs of H. sapiens have been retrieved. Among these, 2586 copy gain, 6276 copy loss, 6681 gain, 719 duplication, 15179 gain plus loss, 7843 deletion, and 7877 insertion events have been retrieved and stored in the database.
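As an illustration of how the two CNV sources could be used together, the sketch below labels a CNV called in a sample as "common", "disease-related", or "novel" by interval overlap against two reference lists. The 50% reciprocal-overlap threshold, the CNV record layout, the function names, and the toy coordinates are assumptions made for this example; they are not the annotation procedure or data of Enlightenment, DGV, or CNVD.

```python
from typing import List, NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive
    kind: str   # e.g. "gain", "loss", "deletion"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Smaller of the two overlap fractions; 0.0 when the intervals are disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start + 1), shared / (b.end - b.start + 1))


def annotate(call: CNV, common: List[CNV], disease: List[CNV],
             min_overlap: float = 0.5) -> List[str]:
    """Label a call by overlap with common and disease-related reference CNVs."""
    labels = []
    if any(reciprocal_overlap(call, ref) >= min_overlap for ref in common):
        labels.append("common")
    if any(reciprocal_overlap(call, ref) >= min_overlap for ref in disease):
        labels.append("disease-related")
    return labels or ["novel"]


if __name__ == "__main__":
    # Toy reference intervals, invented for the example.
    common_refs = [CNV("1", 1000, 2000, "loss")]
    disease_refs = [CNV("1", 5000, 8000, "gain")]
    for call in (CNV("1", 1200, 2100, "loss"),
                 CNV("1", 6000, 8000, "gain"),
                 CNV("2", 100, 200, "gain")):
        print(call, annotate(call, common_refs, disease_refs))
```

This linear scan is only practical for small lists; at the scale of the record counts given above, an interval index would be the more realistic choice. The sketch also ignores whether the event types of the call and the reference agree.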
2)Cancer Associated and PharmaGenomic Biomarkers: The databases included under this category are the Tumor-Associated Gene database (TAG) and the Tumor Suppressor Gene database (TsGene 2.0). A total of 252 tumor suppressor genes and 241 oncogenes have been curated from TAG, and a further 771 tumor suppressor genes have been integrated from TsGene.
The database ChimerDB (4.0) catalogues genes in which fusions have occurred. In total, 3138 fusion genes have been integrated into the database. The PharmaGenomic biomarkers and their associated drugs are also annotated, giving a total of 519 drug-biomarker associations.
3)Genes and Genetics: Analysing mutations in different human genes requires knowledge of all the genes present in the human genome. The HUGO Gene Nomenclature Committee provides details on all approved human gene nomenclature, which can be found in the HGNC (release: 01/10/2020) database. It assigns approximately 33000 unique symbols and names to human loci, including all protein-coding genes, non-coding RNA genes, and pseudogenes.
4)Whole Genome nucleotide level alterations of Human samples: Enlightenment integrates NGS raw-read alignment information for the full genomes of individuals belonging to multiple ethnic groups. At present, two H. sapiens family trios, of European and African ancestry, sequenced at deep coverage (20x to 60x per genome), are considered. The binary alignment (.bam) files of the individual samples are processed, and the annotated information is stored. The BWA-MEM algorithm is used for alignment, and the alignments are further processed to obtain the depth of coverage at each genomic coordinate, which is subsequently stored in the database. Statistical inference is then performed on the depth-of-coverage data, and algorithms are applied to obtain genome-wide copy number variations (CNVs).
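The text does not specify which statistical methods or CNV algorithms are applied, so the following is only a generic read-depth sketch of the pipeline's shape: per-base depth is derived from aligned read intervals, averaged in fixed windows, and each window is labelled by its log2 ratio against the median window depth. The function names, the window size, and the gain/loss thresholds (0.4 and -0.6) are assumptions chosen for illustration, and the input is toy data rather than output parsed from a .bam file.

```python
import math
from statistics import median


def depth_from_alignments(alignments, length):
    """Per-base depth over [0, length) from half-open (start, end) read intervals."""
    diff = [0] * (length + 1)
    for start, end in alignments:
        s = min(max(start, 0), length)
        e = min(max(end, 0), length)
        if e > s:
            diff[s] += 1
            diff[e] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def window_means(depth, size):
    """Mean depth of consecutive non-overlapping windows (trailing remainder dropped)."""
    return [sum(depth[i:i + size]) / size
            for i in range(0, len(depth) - size + 1, size)]


def call_cnv_windows(depth, size, gain=0.4, loss=-0.6):
    """Label each window as gain, loss, or neutral by log2(mean / median of means)."""
    means = window_means(depth, size)
    baseline = median(means)
    if baseline <= 0:
        raise ValueError("median window depth must be positive")
    calls = []
    for idx, mean in enumerate(means):
        ratio = math.log2(mean / baseline) if mean > 0 else float("-inf")
        if ratio >= gain:
            state = "gain"
        elif ratio <= loss:
            state = "loss"
        else:
            state = "neutral"
        calls.append((idx * size, (idx + 1) * size, state))
    return calls


if __name__ == "__main__":
    # Toy depth profile: 30x baseline with one doubled and one halved segment.
    toy_depth = [30] * 40 + [60] * 10 + [15] * 10 + [30] * 40
    for window in call_cnv_windows(toy_depth, size=10):
        print(window)
```

A production read-depth caller would additionally correct for GC content and mappability and merge adjacent windows into segments; none of that is attempted here.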