India, with over 1.47 billion people, represents one of the most genetically complex regions in the world. Yet it remains significantly underrepresented in global genomic databases. This disparity limits the transferability of risk prediction models and clinical tools, and constrains the development of precision medicine strategies. India’s population structure, driven by migration, complex admixture, and widespread endogamy, provides a powerful framework for understanding how demographic events and socio-cultural history influence genetic diversity.
Recognizing these challenges, the GenomeIndia initiative was conceptualized as a national effort to generate a comprehensive and representative genomic dataset that captures the breadth of Indian diversity. A population-scale genomic survey of India is therefore essential to improve the resolution, accuracy, and equity of global human genomics data.
A total of 20,195 individuals representing 83 diverse populations were enrolled in the study. Whole-genome sequencing (WGS) at ~30× coverage was performed on 10,074 individuals, including trios, of whom 9,768 passed stringent quality control filters and met inclusion criteria for downstream analyses. The study was implemented through coordinated efforts of the 20 partner institutions across the nation, following uniform protocols for recruitment, sample collection, metadata annotation, and phenotypic measurements.
The 9,768 individuals retained for analysis represent 83 anthropologically defined, endogamous populations spanning the ethnolinguistic and biogeographic spectrum of India. The study design ensured representation from all major linguistic families, including Indo-European, Dravidian, Austroasiatic, and Tibeto-Burman groups, and incorporated both tribal and non-tribal populations.
In our view, the GenomeIndia dataset represents a significant advancement in cataloguing genetic variation. We have identified 129.93 million high-confidence biallelic variants, 44.03 million of which are previously unreported in global databases. This highlights that a substantial fraction of human variation remains uncatalogued, residing within previously unsampled populations.
The allele frequency spectrum was dominated by rare variation, with a long tail of variants often confined to specific endogamous populations. This rare variation is critical for deciphering the genetic basis of Mendelian and complex diseases. In conjunction with genetic data, individuals are annotated with phenotypic and other metadata, including socio-demographic information, self-reported health status, anthropometric measures, and biochemical assays relevant to metabolic and organ health.
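To make the idea of a spectrum dominated by rare variation concrete, the following minimal sketch tallies a folded allele frequency spectrum from a small genotype matrix. The function names, the toy genotypes, and the rarity threshold are illustrative assumptions and do not describe the GenomeIndia pipeline.

```python
from collections import Counter


def minor_allele_counts(genotypes):
    """Minor allele count per biallelic variant.

    `genotypes` is a list of variants; each variant is a list holding one
    alternate-allele count (0, 1 or 2) per diploid individual.
    """
    counts = []
    for variant in genotypes:
        alt = sum(variant)
        total = 2 * len(variant)  # two chromosomes per individual
        counts.append(min(alt, total - alt))  # fold: keep the rarer allele
    return counts


def folded_spectrum(genotypes):
    """Map each minor allele count to the number of variants that have it."""
    return Counter(minor_allele_counts(genotypes))


def rare_fraction(genotypes, max_maf):
    """Share of polymorphic variants with minor allele frequency <= max_maf."""
    n_chrom = 2 * len(genotypes[0])
    macs = [c for c in minor_allele_counts(genotypes) if c > 0]
    rare = [c for c in macs if c / n_chrom <= max_maf]
    return len(rare) / len(macs)


# Hypothetical toy data: 5 variants (rows) typed in 5 individuals (columns).
toy = [
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 2, 1],
    [2, 2, 2, 2, 1],
    [0, 0, 1, 1, 0],
]

print(sorted(folded_spectrum(toy).items()))
print(rare_fraction(toy, max_maf=0.1))
```

In a cohort of thousands of genomes, the same tally would typically show most variants concentrated in the lowest count classes, which is the pattern the paragraph above describes.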
Collectively, the GenomeIndia dataset is one of the largest and most richly annotated genomic resources from South Asia, providing a foundation for genomics-informed public health and precision medicine.
The project has provided high-resolution insights into the genetic structure of Indian populations. Geography, language, and biogeography shape genetic structure across India. Individuals cluster according to ethnolinguistic groups, with significant differentiation observed across populations.
The analyses revealed substantial allele sharing within linguistic families, while tribal populations exhibited strong genetic isolation. Regression modelling demonstrated a strong relationship between genetics and geography, reinforcing the role of spatial and historical factors in shaping genetic diversity.
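The summary does not specify the regression model used. One common way to relate genetics to geography, assumed here purely for illustration, is to regress pairwise genetic distance (for example FST) on great-circle distance between sampling locations. The population labels, coordinates, and FST values below are hypothetical.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def ols(x, y):
    """Ordinary least squares fit of y on x: (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)


# Hypothetical sampling locations (latitude, longitude) and pairwise FST.
coords = {
    "pop_A": (28.6, 77.2),
    "pop_B": (13.1, 80.3),
    "pop_C": (26.1, 91.7),
    "pop_D": (19.1, 72.9),
}
fst = {
    ("pop_A", "pop_B"): 0.012, ("pop_A", "pop_C"): 0.010,
    ("pop_A", "pop_D"): 0.007, ("pop_B", "pop_C"): 0.014,
    ("pop_B", "pop_D"): 0.008, ("pop_C", "pop_D"): 0.015,
}

pairs = list(combinations(sorted(coords), 2))
geo = [haversine_km(*coords[a], *coords[b]) for a, b in pairs]
gen = [fst[(a, b)] for a, b in pairs]
slope, intercept, r2 = ols(geo, gen)
print(f"slope per km: {slope:.3e}, R^2: {r2:.2f}")
```

Because pairwise distances share populations and are not independent, significance for this kind of fit is usually assessed with a permutation approach such as a Mantel test rather than the ordinary regression p-value.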
Four major ancestry components consistent with previous findings were identified: Ancestral North Indian, Ancestral South Indian, Ancestral Austroasiatic, and Ancestral Tibeto-Burman. Additional population-specific drift components were also observed, particularly among isolated groups.
The GenomeIndia dataset provides important insights into tribal populations, which remain underrepresented in global studies. We observe low effective population sizes, significant genetic drift, and profound homozygosity in small tribal groups, likely shaped by antiquity, isolation, and endogamy.
Rarefaction analyses revealed that tribal groups harbored a large number of high-frequency, population-specific novel variants. These findings justify the sampling design that combines large cohorts from diverse populations with targeted sampling of isolated groups.
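A rarefaction curve asks how many variants would be expected if only a subsample of the cohort had been sequenced. The sketch below uses the standard hypergeometric expectation: a variant carried on k of N chromosomes is missed by a random subsample of n chromosomes with probability C(N-k, n) / C(N, n). The allele counts are hypothetical, and the method actually used by the project is not described in this summary.

```python
from math import comb


def expected_variants(allele_counts, total_chromosomes, n):
    """Expected number of variants seen in n chromosomes drawn without replacement.

    `allele_counts` holds, per variant, the number of chromosomes (out of
    `total_chromosomes`) that carry the alternate allele.
    """
    denom = comb(total_chromosomes, n)
    expected = 0.0
    for k in allele_counts:
        # comb returns 0 when n exceeds the non-carrier count, so the variant
        # is then certain to be observed.
        p_missed = comb(total_chromosomes - k, n) / denom
        expected += 1.0 - p_missed
    return expected


# Hypothetical allele counts for 6 variants in 20 chromosomes (10 individuals).
counts = [1, 1, 2, 3, 8, 15]
curve = [expected_variants(counts, 20, n) for n in range(1, 21)]
for n, value in zip(range(1, 21), curve):
    print(n, round(value, 2))
```

A curve that is still climbing at the full sample size indicates that further sampling of that population would keep yielding new variants, which is the argument for targeted sampling of isolated groups.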
Several tribal populations showed persistently low effective population sizes, reflecting isolation, stagnancy, and genetic drift. Comparison with global benchmarks revealed substantially higher homozygosity in the GenomeIndia cohort.
The analysis of runs of homozygosity has indicated that a large number of populations exhibit signatures of strong endogamy and founder effects. These patterns underscore the impact of socio-cultural practices on genetic structure and have important implications for disease risk.
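As a rough illustration of what a run of homozygosity is, the sketch below scans genotypes along one chromosome and reports uninterrupted homozygous stretches that exceed a minimum SNP count and physical length. The thresholds and toy data are assumptions for demonstration only; established tools such as PLINK (`--homozyg`) and `bcftools roh` use more tolerant windowed or model-based approaches that allow for occasional heterozygous or missing calls.

```python
def find_roh(positions, genotypes, min_snps=3, min_length_bp=1000):
    """Return (start_bp, end_bp, n_snps) for each qualifying homozygous run.

    `positions` are sorted base-pair coordinates on one chromosome and
    `genotypes` the matching alternate-allele counts (0, 1, 2, or None for
    a missing call). Heterozygous and missing calls both end a run here.
    """
    runs = []

    def close(first, last):
        n_snps = last - first + 1
        length = positions[last] - positions[first]
        if n_snps >= min_snps and length >= min_length_bp:
            runs.append((positions[first], positions[last], n_snps))

    start = None  # index of the first SNP in the run being extended
    for i, g in enumerate(genotypes):
        if g in (0, 2):
            if start is None:
                start = i
        elif start is not None:
            close(start, i - 1)
            start = None
    if start is not None:
        close(start, len(genotypes) - 1)
    return runs


def f_roh(runs, covered_bp):
    """Fraction of the covered sequence that lies inside reported runs."""
    return sum(end - begin for begin, end, _ in runs) / covered_bp


# Hypothetical toy chromosome: 8 SNPs, one heterozygous call at 2,500 bp.
positions = [100, 600, 1500, 2500, 3000, 3400, 5200, 6000]
genotypes = [0, 2, 0, 1, 2, 2, 0, 2]

runs = find_roh(positions, genotypes)
print(runs)
print(f_roh(runs, covered_bp=positions[-1] - positions[0]))
```

Summing run lengths into a genome-wide fraction, as `f_roh` does, gives a simple per-individual measure that can be compared across populations.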
The project has identified approximately 1.5 million protein-coding variants, including a large number of deleterious missense mutations and high-confidence loss-of-function variants. Isolation and endogamy have driven these variants to higher frequencies locally, highlighting the importance of studying underrepresented populations for disease risk profiling.
Long-term endogamy and isolation in South Asian populations can elevate globally rare deleterious variants to high local frequencies, reshaping disease risk landscapes that are invisible in global reference datasets. The divergence of allele frequencies in populations underscores the necessity of diverse and inclusive variant catalogs for accurate assessment of disease risk.
Our analysis has revealed extensive pharmacogenomic diversity, capturing a large number of clinically actionable variants across populations. Individuals carry multiple actionable pharmacogenomic variants affecting drug response in oncology, cardiovascular diseases, and psychiatric conditions.
The observed variability in pharmacogenomic markers emphasizes the need for population-specific guidelines to optimize drug efficacy and safety. These findings provide a strong foundation for integrating genomics into clinical decision-making and public health strategies.
Significant allele frequency differences limit the transferability of European-derived risk estimates, as discordant variant prevalence alters both effect contributions and statistical power. Predictive accuracy was highest in European populations and declined with increasing genetic distance.
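The effect of allele frequency differences on a polygenic score can be illustrated with a simple additive model. Under Hardy-Weinberg proportions, a variant with effect size β and allele frequency p contributes 2pβ to the expected score and 2p(1-p)β² to its variance, so identical weights behave differently in populations with different frequencies. The weights and frequencies below are hypothetical, and this is a sketch of the general principle rather than the analysis performed in the project.

```python
def polygenic_score(dosages, weights):
    """Additive score for one individual: sum of dosage times effect size."""
    return sum(d * w for d, w in zip(dosages, weights))


def expected_score(freqs, weights):
    """Population mean of the score, given allele frequencies."""
    return sum(2 * p * w for p, w in zip(freqs, weights))


def variance_contributions(freqs, weights):
    """Per-variant contribution to score variance under Hardy-Weinberg."""
    return [2 * p * (1 - p) * w ** 2 for p, w in zip(freqs, weights)]


# Hypothetical effect sizes estimated in a discovery population.
weights = [0.30, -0.20, 0.15]
# Hypothetical effect-allele frequencies in discovery and target populations.
freqs_discovery = [0.40, 0.25, 0.10]
freqs_target = [0.05, 0.60, 0.10]

print(expected_score(freqs_discovery, weights))
print(expected_score(freqs_target, weights))
print(variance_contributions(freqs_discovery, weights))
print(variance_contributions(freqs_target, weights))
```

In this toy case the first variant carries most of the score variance in the discovery population but very little in the target population, which shows how a variant that is informative in one group can contribute almost nothing in another.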
This non-transferability emphasizes the need for population-informed genomic resources to ensure equitable medical interpretation and risk prediction.
The project has developed a high-resolution Indian imputation panel that significantly improves South Asian genotype inference. Comparative analyses demonstrated superior performance over widely used reference panels, particularly for rare variants.
Validation studies showed high concordance, sensitivity, and specificity, confirming the robustness of the imputation panel. This resource provides a critical foundation for large-scale genomic studies and the development of India-specific genotyping platforms.
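The exact definitions used in the validation are not given in this summary. A common convention, assumed here, compares imputed genotypes against sequenced genotypes at the same sites: concordance is the share of matching calls, sensitivity is the share of true non-reference genotypes imputed as non-reference, and specificity is the share of true homozygous-reference genotypes imputed as such. The genotype vectors are hypothetical.

```python
def imputation_metrics(truth, imputed):
    """Return (concordance, sensitivity, specificity) for paired genotypes.

    Genotypes are alternate-allele counts (0, 1 or 2); pairs where either
    call is None (missing) are skipped.
    """
    pairs = [(t, i) for t, i in zip(truth, imputed)
             if t is not None and i is not None]
    matches = sum(1 for t, i in pairs if t == i)

    non_ref = [(t, i) for t, i in pairs if t > 0]   # truly carries the variant
    hom_ref = [(t, i) for t, i in pairs if t == 0]  # truly homozygous reference
    detected = sum(1 for _, i in non_ref if i > 0)
    kept_ref = sum(1 for _, i in hom_ref if i == 0)

    concordance = matches / len(pairs)
    sensitivity = detected / len(non_ref) if non_ref else float("nan")
    specificity = kept_ref / len(hom_ref) if hom_ref else float("nan")
    return concordance, sensitivity, specificity


# Hypothetical genotypes at 8 sites for one individual.
truth = [0, 0, 1, 2, 0, 1, 0, 2]
imputed = [0, 0, 1, 2, 1, 0, 0, 2]

print(imputation_metrics(truth, imputed))
```

For rare variants, overall concordance is dominated by homozygous-reference calls, so sensitivity computed within minor allele frequency bins is usually the more informative figure when comparing reference panels.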
The project has led to the development of GI-DB, a user-friendly database built from the 9,768 GenomeIndia genomes, which enables efficient exploration of genetic variation across diverse Indian populations through an interactive dashboard. GI-DB, developed by CSIR-IGIB, supports flexible querying by gene, single variant, genomic region, and rsID, making it convenient for both exploratory and targeted analyses. Each variant is annotated with direct links to external resources such as dbSNP, gnomAD, ClinVar, Ensembl, UCSC Genome Browser, and ClinPGx, providing immediate functional and pharmacogenomic context. The database includes allele frequency information across linguistic groups and biogeographic regions, along with frequencies from global resources such as the 1000 Genomes Project and gnomAD, enabling quick comparison with global population patterns. To help assess variant impact, GI-DB incorporates pathogenicity prediction scores such as REVEL and CADD, as well as conservation metrics such as phastCons and phyloP. Each variant is also accompanied by detailed quality parameters such as allele depth, allele fraction, and genotype quality, allowing users to evaluate data reliability. Additionally, GI-DB provides an API for programmatic access, well-structured documentation, and an intuitive interface, facilitating data retrieval, analysis, and integration into bioinformatics workflows.
The project has also led to the development of a dashboard at IBDC for visualizing phenotype distributions and summary statistics. This dashboard allows users without access to the underlying data, or without experience in handling large datasets, to immediately view patterns in any phenotype of interest across GenomeIndia or within a population of interest. It also allows users to view correlations between phenotypes of interest, as well as the geographical distribution of phenotypes across the country. The dashboard is secure and is maintained by IBDC and CBR.