International HapMap Project

from Wikipedia

The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available for research.

The International HapMap Project was a collaboration among researchers at academic centers, non-profit biomedical research groups and private companies in Canada, China (including Hong Kong), Japan, Nigeria, the United Kingdom, and the United States. It officially started with a meeting held from October 27 to 29, 2002, and was expected to take about three years. It comprised three phases. The complete data obtained in Phase I were published on 27 October 2005.[1] The analysis of the Phase II dataset was published in October 2007.[2] The Phase III dataset was released in spring 2009, and the publication presenting the final results appeared in September 2010.[3]

Background

Unlike with the rarer Mendelian diseases, combinations of different genes and the environment play a role in the development and progression of common diseases (such as diabetes, cancer, heart disease, stroke, depression, and asthma), or in the individual response to pharmacological agents.[4] To find the genetic factors involved in these diseases, one could in principle do a genome-wide association study: obtain the complete genetic sequence of several individuals, some with the disease and some without, and then search for differences between the two sets of genomes. At the time, this approach was not feasible because of the cost of full genome sequencing. The HapMap project proposed a shortcut.

Although any two unrelated people share about 99.5% of their DNA sequence, their genomes differ at specific nucleotide locations. Such sites are known as single nucleotide polymorphisms (SNPs), and each of the possible resulting gene forms is called an allele.[5] The HapMap project focuses only on common SNPs, those where each allele occurs in at least 1% of the population.

Each person has two copies of all chromosomes, except the sex chromosomes in males. For each SNP, the combination of alleles a person has is called a genotype. Genotyping refers to uncovering what genotype a person has at a particular site. The HapMap project chose a sample of 269 individuals and selected several million well-defined SNPs, genotyped the individuals for these SNPs, and published the results.[6]
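The idea of a "common" SNP can be made concrete with a small calculation. The sketch below is illustrative only: it assumes genotypes are given as two-letter strings (one letter per chromosome copy), uses made-up data, and applies the 1% frequency cutoff mentioned above. The function name `minor_allele_frequency` is hypothetical and not part of any HapMap tool.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for one biallelic SNP.

    `genotypes` is a list of two-character strings such as "AG", one per
    individual; each character is the allele carried on one chromosome copy.
    """
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) == 1:
        # Only one allele observed: the site is monomorphic in this sample.
        return None, 0.0
    if len(counts) != 2:
        raise ValueError("expected a biallelic SNP")
    total = sum(counts.values())
    # The minor allele is the less frequent one (ties broken alphabetically).
    allele, n = min(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return allele, n / total

# Made-up genotypes for ten individuals at a single SNP.
genotypes = ["AA", "AG", "GG", "AA", "AG", "AA", "AA", "AG", "AA", "AA"]
allele, maf = minor_allele_frequency(genotypes)
is_common = maf >= 0.01  # the 1% threshold for a "common" SNP
```

Here the G allele appears on 5 of 20 chromosome copies, so the SNP would count as common under the 1% criterion.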

The alleles of nearby SNPs on a single chromosome are correlated. Specifically, if the allele of one SNP for a given individual is known, the alleles of nearby SNPs can often be predicted, a process known as genotype imputation.[7] This is because each SNP arose in evolutionary history as a single point mutation, and was then passed down on the chromosome surrounded by other, earlier, point mutations. SNPs that are separated by a large distance on the chromosome are typically not very well correlated, because recombination occurs in each generation and mixes the allele sequences of the two chromosomes. A sequence of consecutive alleles on a particular chromosome is known as a haplotype.[8]

To find the genetic factors involved in a particular disease, one can proceed as follows. First a certain region of interest in the genome is identified, possibly from earlier inheritance studies. In this region one locates a set of tag SNPs from the HapMap data; these are SNPs that are very well correlated with all the other SNPs in the region. Using these, genotype imputation can be used to determine (impute) the other SNPs and thus the entire haplotype with high confidence. Next, one determines the genotype for these tag SNPs in several individuals, some with the disease and some without. By comparing the two groups, one determines the likely locations and haplotypes that are involved in the disease.
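One simple way to pick tag SNPs for a region can be sketched as a greedy search. This is a minimal illustration under stated assumptions, not the project's actual tagging tool: genotypes are coded as allele dosages (0, 1 or 2 copies of one allele), correlation is measured as the squared Pearson correlation of dosages, and the 0.8 threshold is an assumed, commonly used cutoff. The SNP names and data are invented.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two allele-dosage vectors (0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov * cov / (vx * vy)

def greedy_tag_snps(dosages, threshold=0.8):
    """Pick tag SNPs so that every SNP has r^2 >= threshold with some tag."""
    names = list(dosages)
    # For each SNP, the set of SNPs it can stand in for (always itself).
    covers = {
        a: {b for b in names
            if a == b or r_squared(dosages[a], dosages[b]) >= threshold}
        for a in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP that covers the most still-untagged SNPs.
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Invented dosages for five SNPs in six individuals.
dosages = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1
    "snp3": [2, 1, 0, 2, 1, 0],  # mirror image of snp1, so r^2 = 1
    "snp4": [0, 0, 1, 2, 2, 1],  # uncorrelated with snp1
    "snp5": [0, 0, 1, 2, 2, 1],  # identical to snp4
}
tags = greedy_tag_snps(dosages)
```

In this toy region two tags suffice: one for the first three SNPs and one for the remaining two, so only two of the five sites would need to be genotyped in a follow-up study.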

Samples used

Haplotypes are generally shared between populations, but their frequency can differ widely. Four populations were selected for inclusion in the HapMap: 30 adult-and-both-parents Yoruba trios from Ibadan, Nigeria (YRI), 30 trios of Utah residents of northern and western European ancestry (CEU), 44 unrelated Japanese individuals from Tokyo, Japan (JPT) and 45 unrelated Han Chinese individuals from Beijing, China (CHB). Although the haplotypes revealed from these populations should be useful for studying many other populations, parallel studies are currently examining the usefulness of including additional populations in the project.

All samples were collected through a community engagement process with appropriate informed consent. The community engagement process was designed to identify and attempt to respond to culturally specific concerns and give participating communities input into the informed consent and sample collection processes.[9]

In phase III, 11 global ancestry groups have been assembled: ASW (African ancestry in Southwest USA); CEU (Utah residents with Northern and Western European ancestry from the CEPH collection); CHB (Han Chinese in Beijing, China); CHD (Chinese in Metropolitan Denver, Colorado); GIH (Gujarati Indians in Houston, Texas); JPT (Japanese in Tokyo, Japan); LWK (Luhya in Webuye, Kenya); MEX (Mexican ancestry in Los Angeles, California); MKK (Maasai in Kinyawa, Kenya); TSI (Tuscans in Italy); YRI (Yoruba in Ibadan, Nigeria).[10]

| Phase | ID | Place | Population |
|---|---|---|---|
| I/II | CEU | United States | Utah residents with Northern and Western European ancestry from the CEPH collection |
| I/II | CHB | China | Han Chinese in Beijing, China |
| I/II | JPT | Japan | Japanese in Tokyo, Japan |
| I/II | YRI | Nigeria | Yoruba in Ibadan, Nigeria |
| III | ASW | United States | African ancestry in the Southwest USA |
| III | CHD | United States | Chinese in metropolitan Denver, CO, United States |
| III | GIH | United States | Gujarati Indians in Houston, TX, United States |
| III | LWK | Kenya | Luhya in Webuye, Kenya |
| III | MKK | Kenya | Maasai in Kinyawa, Kenya |
| III | MXL | United States | Mexican ancestry in Los Angeles, CA, United States |
| III | TSI | Italy | Tuscans in Italy |

Three combined panels have also been created, which allow better identification of SNPs in groups outside the nine homogenous samples: CEU+TSI (Combined panel of Utah residents with Northern and Western European ancestry from the CEPH collection and Tuscans in Italy); JPT+CHB (Combined panel of Japanese in Tokyo, Japan and Han Chinese in Beijing, China) and JPT+CHB+CHD (Combined panel of Japanese in Tokyo, Japan, Han Chinese in Beijing, China and Chinese in Metropolitan Denver, Colorado). CEU+TSI, for instance, is a better model of UK British individuals than is CEU alone.[10]

Scientific strategy

In the 1990s it was expensive to sequence patients' whole genomes, so the National Institutes of Health embraced a "shortcut": look only at sites on the genome where many people have a variant DNA unit. The theory behind the shortcut was that, since the major diseases are common, so too would be the genetic variants that caused them. Natural selection keeps the human genome free of variants that damage health before children are grown, the theory held, but fails against variants that strike later in life, allowing them to become quite common. In 2002 the National Institutes of Health started a $138 million project called the HapMap to catalog the common variants in European, East Asian and African genomes.[11]

In Phase I, one common SNP was genotyped every 5,000 bases. Overall, more than one million SNPs were genotyped. The genotyping was carried out by 10 centres using five different genotyping technologies. Genotyping quality was assessed by using duplicate or related samples and by periodic quality checks in which the centres had to genotype common sets of SNPs.

The Canadian team was led by Thomas J. Hudson at McGill University in Montreal and focused on chromosomes 2 and 4p. The Chinese team was led by Huanming Yang in Beijing and Shanghai and by Lap-Chee Tsui in Hong Kong, and focused on chromosomes 3, 8p and 21. The Japanese team was led by Yusuke Nakamura at the University of Tokyo and focused on chromosomes 5, 11, 14, 15, 16, 17 and 19. The British team was led by David R. Bentley at the Sanger Institute and focused on chromosomes 1, 6, 10, 13 and 20. There were four United States genotyping centres: a team led by Mark Chee and Arnold Oliphant at Illumina Inc. in San Diego (chromosomes 8q, 9, 18q, 22 and X), a team led by David Altshuler and Mark Daly at the Broad Institute in Cambridge, Massachusetts (chromosomes 4q, 7q, 18p, Y and the mitochondrial genome), a team led by Richard Gibbs at the Baylor College of Medicine in Houston (chromosome 12), and a team led by Pui-Yan Kwok at the University of California, San Francisco (chromosome 7p).

To obtain enough SNPs to create the Map, the Consortium funded a large re-sequencing project to discover millions of additional SNPs. These were submitted to the public dbSNP database. As a result, by August 2006, the database included more than ten million SNPs, and more than 40% of them were known to be polymorphic. By comparison, at the start of the project, fewer than 3 million SNPs were identified, and no more than 10% of them were known to be polymorphic.

During Phase II, more than two million additional SNPs were genotyped throughout the genome by David R. Cox, Kelly A. Frazer and others at Perlegen Sciences and 500,000 by the company Affymetrix.

Data access

All of the data generated by the project, including SNP frequencies, genotypes and haplotypes, were placed in the public domain and are available for download.[12] The project website also contains a genome browser that lets users find SNPs in any region of interest, together with their allele frequencies and their association with nearby SNPs. A tool that can determine tag SNPs for a given region of interest is also provided. These data can also be accessed directly from the widely used Haploview program.

from Grokipedia
The International HapMap Project was an international collaboration launched in October 2002 to create a comprehensive haplotype map, or HapMap, of the human genome, cataloging common patterns of DNA sequence variation to enable researchers to identify genetic factors influencing health, disease susceptibility, and responses to drugs and environmental factors.[1][2] A haplotype refers to a set of DNA variations, such as single nucleotide polymorphisms (SNPs), that are inherited together on the same chromosome, allowing the project to focus on "tag SNPs" that efficiently represent larger blocks of genetic variation without genotyping every variant.[1] The initiative involved researchers from the United States, United Kingdom, Canada, Japan, China, and Nigeria, who genotyped DNA samples from 270 individuals of diverse ancestries, including Yoruba from Nigeria, Han Chinese, Japanese, and Utah residents of Northern and Western European descent.[1] The project unfolded in three phases, with Phase I completing in 2005 by identifying over 1.1 million SNPs across the genome, providing the initial framework for understanding haplotype structures in the sampled populations.[1] Phase II, released in 2007, expanded the map to more than 3.1 million SNPs by densely genotyping the same 270 individuals, enhancing resolution for association studies and revealing finer details of linkage disequilibrium patterns. 
Phase III, published in 2010, broadened the scope by including 1,301 samples from 11 global populations, adding groups from Africa, Asia, Europe, and the Americas, to capture a wider spectrum of human genetic diversity. It combined genotyping of approximately 1.6 million SNPs in 1,184 individuals with targeted resequencing in 692 individuals from these populations, enabling imputation of additional variants.[3] All data from the HapMap were made publicly available without restrictions, serving as a foundational resource for genome-wide association studies (GWAS) and accelerating discoveries in complex diseases like diabetes, cancer, and heart conditions.[1] The project's emphasis on ethical considerations, including informed consent, privacy protections, and avoiding stigmatization of population groups, set standards for large-scale genomic research, while its outputs have informed subsequent efforts like the 1000 Genomes Project and precision medicine initiatives.[1] By showing that a relatively small number of tag SNPs (around 250,000 to 500,000) could capture most of the roughly 10 million common SNPs in the human genome, the HapMap dramatically reduced the cost and complexity of genetic studies.[1]

Introduction and Background

Project Overview

The International HapMap Project was a multi-phase international collaboration spanning 2002 to 2010, involving researchers from academic institutions, non-profit biomedical organizations, and private companies in Canada, China, Japan, Nigeria, the United Kingdom, and the United States to develop a public resource cataloging common genetic variations across the human genome.[2] The effort was coordinated by the International HapMap Consortium, a partnership of scientists and funding agencies from these nations, with organizational oversight provided by multiple genotyping centers and a central data coordination center to manage high-throughput analysis and public data release.[4] Initial funding came primarily from public sources, including the National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) and the Wellcome Trust, supporting the project's infrastructure and genotyping activities.[5][2] At its core, the project aimed to delineate haplotypes—contiguous blocks of DNA inherited together due to low recombination rates, containing clusters of single nucleotide polymorphisms (SNPs)—and to map these SNPs as a foundational tool for genome-wide association studies (GWAS).[2] This resource was designed to accelerate research into genetic contributions to disease susceptibility, variability in drug response, and patterns of human evolution by enabling indirect association testing, where common variants could be tagged efficiently without genotyping every possible SNP.[6][2] The HapMap characterized over 3.1 million SNPs in Phase II across four populations and an additional 1.6 million SNPs in Phase III across 11 global populations, establishing a high-resolution map of human genetic variation that reduced the number of markers needed for comprehensive association studies from about 10 million common SNPs to 250,000–500,000 tag SNPs.[7][3][1] This scale provided unprecedented insight into haplotype structures and supported subsequent 
expansions in genomic research.[7]

Objectives and Significance

The primary objectives of the International HapMap Project were to catalog common patterns of human DNA sequence variation across the genome and to make this information freely available as a public resource. This involved characterizing the types of variants, their frequencies, and the correlations among them—particularly through haplotypes—in samples from diverse populations in Africa, Asia, and Europe. By doing so, the project aimed to provide a foundational framework for identifying genes that contribute to complex traits and diseases via indirect association studies, which link genetic markers to phenotypic outcomes without sequencing entire genomes. Additionally, it sought to enable cost-effective SNP selection for association studies by pinpointing tag SNPs that could efficiently capture the majority of common genetic variation within haplotype blocks.[2] The significance of the HapMap Project lay in its ability to overcome limitations of the Human Genome Project, which had produced a single reference sequence representing 99.9% of the genome but offered limited insight into the 0.1% variation that drives individual differences and disease susceptibility. By mapping these variations and exploiting patterns of linkage disequilibrium, the project accelerated genome-wide association studies (GWAS) through the use of tag SNPs, which represent haplotype blocks and thereby reduce the genotyping burden from millions of SNPs to a more manageable set of hundreds of thousands. This approach not only lowered costs and increased feasibility for large-scale genetic research but also enabled broader exploration of genetic contributions to multifactorial conditions like diabetes and heart disease.[2][6] Beyond its scientific advancements, the project exemplified international collaboration, uniting researchers from the United States, United Kingdom, Canada, China, Japan, and Nigeria to pool resources and expertise in haplotype mapping. 
It also set precedents for ethical data sharing in genomics by implementing rapid, open-access release policies that balanced scientific utility with protections for participant anonymity and equitable benefits across populations. These efforts laid crucial groundwork for personalized medicine, facilitating the translation of genetic insights into tailored diagnostics, therapies, and preventive strategies based on individual variation.[8][6]

History and Phases

Inception and Phase I

The International HapMap Project originated from discussions in the early 2000s aimed at accelerating the identification of genetic variants associated with common diseases. In 2001, an international working group proposed the development of a haplotype map of the human genome to catalog patterns of DNA sequence variation, following initial meetings such as the one held on July 18–19 in Washington, DC.[2] This proposal built on prior efforts like the International SNP Map Working Group and sought to address limitations in understanding linkage disequilibrium across diverse populations.[2] The project was officially launched on October 27–29, 2002, during a meeting in Washington, DC, as a multinational collaboration involving approximately 13 research groups from academic, non-profit, and private institutions across Canada, China, Japan, Nigeria, the United Kingdom, and the United States.[9] Funding totaled about $100 million over three years, provided by public sources including the U.S. National Institutes of Health (NIH), the Wellcome Trust, Genome Canada, and the Chinese Academy of Sciences, alongside private contributions.[9] To validate methods and address challenges such as variable linkage disequilibrium patterns and genotyping accuracy, pilot studies were conducted from 2002 to 2003, focusing on small genomic regions in samples from four populations: Yoruba in Ibadan, Nigeria; Han Chinese in Beijing, China; Japanese in Tokyo, Japan; and CEPH Utah residents with Northern and Western European ancestry.[2] Phase I of the project expanded beyond the pilots to genotype over 1.1 million single nucleotide polymorphisms (SNPs) across the entire euchromatic genome in 269 lymphoblastoid cell line samples from the same four populations, enabling the construction of a comprehensive haplotype map. 
Initial efforts prioritized dense coverage in select regions—approximately 10 segments of 500 kb each for method testing—before scaling to genome-wide analysis using tag SNPs to capture common variation efficiently.[2] Key milestones included a partial data release in December 2004, covering pilot and early genome-wide SNPs, followed by the full Phase I dataset release in October 2005.[10] The phase culminated in a seminal publication in Nature on October 27, 2005, detailing the haplotype map's structure, which revealed blocks of correlated variants and facilitated association studies for complex traits.

Phase II

Phase II of the International HapMap Project was launched in 2005 following the completion of Phase I, building on the same samples from four reference populations to achieve greater genomic resolution.[11] This phase expanded the haplotype map by genotyping an additional 2.1 million single nucleotide polymorphisms (SNPs), resulting in a total of over 3.1 million SNPs across the euchromatic regions of the human genome.[12] The effort focused on increasing SNP density, with an average spacing of 875 base pairs and 98.6% of the genome lying within 5 kb of at least one genotyped polymorphic SNP, enabling finer-scale mapping of haplotype blocks and linkage disequilibrium patterns.[12] Key advancements in Phase II included enhanced accuracy for imputing genotypes at ungenotyped SNPs, which improved the power of association studies by allowing researchers to infer missing data based on the denser haplotype reference. For instance, imputation accuracy reached a mean maximum r² of 0.86 for common variants (minor allele frequency ≥ 0.2) in certain populations using commercial genotyping arrays.[12] Additionally, the phase incorporated initial data on copy number variations (CNVs) in select genomic regions, derived from the same samples, providing a first-generation map of structural variants that complemented the SNP data and revealed insights into genome architecture.[13] These developments were detailed in a major publication in Nature on October 18, 2007, which analyzed the full Phase II dataset and highlighted its utility for detecting signals of natural selection and recombination hotspots.[7] The Phase II data were released to the public in July 2007 through the HapMap Data Coordination Center and integrated into databases such as NCBI's dbSNP, facilitating immediate application in genome-wide association studies (GWAS).[14] This resource enabled rapid progress in identifying genetic risk factors for common diseases, such as type 2 diabetes, where the denser map 
supported more precise locus fine-mapping and variant prioritization.

Phase III and Conclusion

Phase III of the International HapMap Project, initiated in 2008, aimed to broaden the representation of global human genetic diversity by genotyping 1.6 million common single nucleotide polymorphisms (SNPs) across 1,301 samples, including the original 270 from earlier phases.[3][15] This phase incorporated samples from 11 populations, encompassing African ancestries (Yoruba in Ibadan, Nigeria; Luhya in Webuye, Kenya; and Maasai in Kinyawa, Kenya), East Asian (Han Chinese in Beijing, Chinese in Metropolitan Denver, and Japanese in Tokyo), South Asian (Gujarati Indians in Houston), European (Toscani in Italy and Utah residents with Northern and Western European ancestry), and admixed American groups (individuals of African ancestry in the southwestern United States and Mexican ancestry in Los Angeles).[3] By expanding beyond the initial focus on high-frequency variants, Phase III emphasized low-frequency alleles (minor allele frequency ≤5%) and structural variations, including over 11,000 copy number polymorphisms, to improve imputation accuracy and support studies of rare genetic contributions to traits and diseases.[3]

The genotyping efforts, conducted using high-density platforms like the Illumina Human1M-Duo BeadChip and Affymetrix Genome-Wide Human SNP Array 6.0, resulted in a dataset that integrated with Phases I and II, yielding haplotype information for more than 1.15 million SNPs with minor allele frequencies above 5% across the expanded sample set.[3] This resource was publicly released in stages starting in 2009, with the final dataset made available in 2010 alongside a publication in Nature detailing the findings and their implications for population genetics and genome-wide association studies.[3][16]

With the completion of Phase III, the International HapMap Project achieved its core objectives of cataloging common patterns of human DNA sequence variation and providing a foundational public database for genetic research.[3] The consortium formally concluded active development in 2010, transitioning genotyping and analysis resources to successor initiatives such as the 1000 Genomes Project, which built upon HapMap data for deeper sequencing of rare variants.[11] A final data freeze was established to ensure stability, and while the project's website was retired around 2016, the full datasets, including all phases, remain archived and accessible through repositories like the NCBI and EBI for ongoing scientific use.[17]

Scientific Approach

Haplotype Mapping Concept

A haplotype is defined as a set of alleles at multiple linked loci on a chromosome that are inherited together from a single parent due to low rates of recombination between them.[18] These haplotypes often form discrete blocks in the genome, known as haplotype blocks, which are regions of strong linkage disequilibrium (LD) characterized by limited diversity in haplotype configurations and separated by recombination hotspots where LD breaks down more readily.[18]

The rationale for haplotype mapping lies in its ability to simplify the study of human genetic variation, which is dominated by approximately 10 million common single nucleotide polymorphisms (SNPs) with minor allele frequencies greater than 5%.[18] By leveraging the non-random associations within haplotypes, researchers can identify a much smaller set of tag SNPs, on the order of hundreds of thousands, that efficiently captures the information from the full set of common SNPs, thereby reducing genotyping costs and complexity in association studies.[18] Linkage disequilibrium quantifies the strength of these associations: it measures the correlation between alleles at different loci that arises from shared ancestry, in contrast to the independence expected between unlinked loci. The r² metric, the squared correlation coefficient between two loci, is commonly used; values approaching 1 indicate strong LD and efficient tagging (e.g., a tag SNP with r² ≥ 0.8 can proxy nearby variants with high confidence).[18]

In the International HapMap Project, this concept was applied to construct a genome-wide map by genotyping over 1 million SNPs across diverse populations, identifying numerous haplotype blocks that span much of the euchromatic genome.[18] These blocks facilitated the prioritization of SNPs for genetic association studies by focusing on tag SNPs within high-LD regions, enabling researchers to infer ungenotyped variants indirectly.[18] Notably, the decay of LD, and thus block length and tagging efficiency, varies across populations. For instance, LD extends over longer distances in European (CEU) and East Asian (CHB+JPT) samples due to historical population bottlenecks, allowing fewer tag SNPs to cover variation, whereas it decays more rapidly in Yoruba (YRI) African samples, requiring more tags to achieve similar coverage (e.g., about 1 in 5 SNPs lacks a perfect proxy in CEU compared to 2 in 5 in YRI).[18]
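The r² and D' measures of linkage disequilibrium can be computed directly from phased haplotypes. The following is a minimal sketch using the standard textbook definitions, D = p_AB − p_A·p_B and r² = D² / (p_A(1 − p_A)·p_B(1 − p_B)), with D' obtained by dividing |D| by its maximum possible value for the observed allele frequencies. The function name `ld_stats`, the 0/1 allele coding, and the example haplotypes are assumptions made for illustration.

```python
def ld_stats(haplotypes):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per chromosome, where a and b
    are 0/1 allele codes at the first and second SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    r2 = d * d / denom
    # D' rescales |D| by the largest value it could take given p_a and p_b.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2

# Ten invented chromosomes: mostly 1-1 and 0-0, with two recombinant types.
sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
d, d_prime, r2 = ld_stats(sample)

# A pair in perfect LD: the two SNPs always carry matching alleles.
_, d_prime_perfect, r2_perfect = ld_stats([(1, 1)] * 3 + [(0, 0)] * 7)
```

For the first sample r² works out to 0.36, well below the 0.8 tagging threshold mentioned above, so neither SNP would serve as a reliable proxy for the other; in the second sample both measures equal 1.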

Genotyping and Data Analysis Methods

The International HapMap Project employed a variety of high-throughput genotyping technologies to assay single nucleotide polymorphisms (SNPs) across its phases, enabling the dense mapping of genetic variation in diverse populations. In Phase I, genotyping of over 1 million SNPs in 269 samples utilized multiple platforms, including Illumina BeadArrays, Sequenom MassARRAY, Affymetrix oligonucleotide arrays, and others such as Third Wave Invader assays and ParAllele molecular inversion probes, achieving an average accuracy of 99.7% and completeness of 99.3%.[18] Phase II expanded to 3.1 million SNPs in the same 270 individuals using Affymetrix GeneChip 500K arrays, Illumina HumanHap300 platforms, and Perlegen's amplicon-based resequencing, with per-genotype accuracy exceeding 99.5%.[19] Phase III further increased coverage to 1.6 million SNPs in 1,301 samples from 11 populations, primarily leveraging Illumina Infinium arrays for both common and rare variants.[3] These technologies allowed for scalable, cost-effective genotyping while minimizing errors through platform-specific clustering algorithms and validation. SNP discovery in the project relied on a combination of existing databases and targeted sequencing efforts, with whole-genome resequencing playing a supplementary role in validation rather than primary discovery. 
SNPs were primarily selected from the dbSNP database, prioritizing those with minor allele frequency (MAF) ≥5% and spacing of approximately 5 kb in Phase I, validated through PCR-based resequencing of pilot regions to confirm polymorphism rates and reduce false positives (estimated at 17% in dbSNP).[18] In early pilot phases, targeted sequencing of euchromatic regions identified novel SNPs for inclusion, while Phase III integrated targeted Sanger sequencing of ten 100-kb regions in 692 individuals to capture rare variants (MAF ≤5%), enhancing the map's resolution for imputation.[3] This approach ensured comprehensive coverage of common variation without exhaustive resequencing of all samples. The data analysis pipeline involved rigorous quality control (QC), haplotype phasing, and linkage disequilibrium (LD) estimation to construct the haplotype map. QC thresholds included SNP call rates >80%, fewer than one Mendelian error per SNP in parent-offspring trios, Hardy-Weinberg equilibrium P > 0.001, and MAF >1% in at least one population, filtering out low-quality variants to maintain dataset integrity. Phasing of unphased genotypes into haplotypes was performed using the PHASE software, a Bayesian coalescent-based algorithm that inferred haplotypes from trio data and unrelated individuals, achieving switch error rates of 1 per 8 Mb in European samples and lower in Asian samples. LD patterns were calculated using tools like Haploview to compute metrics such as D' and r², identifying haplotype blocks and recombination hotspots, with average maximum r² for common SNPs reaching 0.90-0.96 across populations. These steps produced consensus haplotype datasets released progressively, facilitating downstream applications. Statistical approaches emphasized imputation and structural variant detection to maximize the utility of the genotyped data. 
Genotype imputation employed hidden Markov models (HMMs) via software like MACH, which inferred genotypes at untyped SNPs by leveraging phased haplotypes and reference panels, reaching good accuracy for common variants (r² ≈0.86 in African samples for MAF ≥0.2) and reducing the need for direct genotyping. Genotyping error rates were kept below 0.3% through duplicate checks and platform calibrations, with allele-flipping errors estimated at 500-2,000 SNPs genome-wide in Phase II. Copy number variation (CNV) detection integrated intensity signals from array data using tools like QuantiSNP, identifying 541 candidate deletions in Phase I (150 common) and over 1,000 CNPs in Phase III, validated to confirm impacts on coding regions.[3] These methods collectively enabled the project's output of haplotype blocks, providing a foundational resource for genetic association studies.
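The per-SNP quality-control thresholds described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the consortium's pipeline: it applies the call-rate (>80%), Hardy-Weinberg (P > 0.001) and MAF (>1%) cutoffs within a single population, uses a one-degree-of-freedom chi-square approximation for the Hardy-Weinberg test (the project may have used a different test), and omits the Mendelian-error check, which needs trio data. Function names and genotype data are hypothetical.

```python
import math

def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0                    # monomorphic: no departure to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(genotypes, min_call_rate=0.80, min_hwe_p=0.001, min_maf=0.01):
    """Apply call-rate, MAF and HWE filters to one SNP in one population.

    `genotypes` holds 0/1/2 counts of allele B per individual; None marks
    a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) <= min_call_rate:
        return False
    n_aa, n_ab, n_bb = (called.count(k) for k in (0, 1, 2))
    freq_b = (n_ab + 2 * n_bb) / (2 * len(called))
    maf = min(freq_b, 1 - freq_b)
    return maf > min_maf and hwe_p_value(n_aa, n_ab, n_bb) > min_hwe_p

# Invented examples: a well-behaved SNP, one with no heterozygotes at all,
# and one where most genotype calls are missing.
good = [0] * 36 + [1] * 48 + [2] * 16
no_hets = [0] * 50 + [2] * 50
low_call = [0, 1, 2, 1] + [None] * 6
results = [passes_qc(s) for s in (good, no_hets, low_call)]
```

A SNP with a complete absence of heterozygotes, as in the second example, is a typical signature of a genotyping artifact and is rejected by the Hardy-Weinberg filter.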

Samples and Populations

Population Selection Criteria

The International HapMap Project selected populations based on criteria emphasizing unrelated individuals from ancestrally distinct groups to effectively capture patterns of genetic variation, including haplotypes and linkage disequilibrium structures, across major human ancestries.[2] Priority was given to populations with established reference panels, such as the Centre d'Etude du Polymorphisme Humain (CEPH) collection for individuals of northern and western European ancestry, to leverage existing high-quality genomic data and ensure comparability.[20] The selection aimed for balanced representation of African, European, and East Asian ancestries, recognizing that these groups encompass the majority of global human genetic diversity while minimizing initial complexity by avoiding highly admixed populations.[2]

A key rationale for including African populations was their expected higher levels of genetic diversity, stemming from the out-of-Africa migration model, where non-African populations represent subsets of African variation due to historical bottlenecks.[2] This approach ensured the HapMap could identify common variants (minor allele frequency ≥5%) with high power for association studies worldwide, as pilot data demonstrated substantial haplotype similarity across continents despite frequency differences.[18] Ethical considerations, including community engagement and informed consent, were integrated into population choices to promote equitable participation.[20]

In Phase I, the project focused on four populations: 90 Yoruba individuals from Ibadan, Nigeria (YRI; 30 parent-offspring trios); 90 CEPH-derived Utah residents with northern and western European ancestry (CEU; 30 trios); 45 Han Chinese from Beijing (CHB); and 45 Japanese from Tokyo (JPT), totaling 270 samples.[18] This selection provided a foundational dataset for genotyping over 1 million single nucleotide polymorphisms (SNPs).

By Phase III, the project expanded to 11 populations to enhance resolution of less common variants and broader diversity, adding the Luhya from Webuye, Kenya (LWK); Maasai from Kinyawa, Kenya (MKK); Mexican ancestry from Los Angeles, California (MXL); Gujarati Indians from Houston, Texas (GIH); Chinese from Metropolitan Denver, Colorado (CHD); African ancestry from Southwest USA (ASW); and Toscani from Italy (TSI), resulting in 1,301 samples overall.[3] This evolution reflected a commitment to increasing global representativeness while maintaining focus on distinct ancestral groups for accurate haplotype inference.[15]

Sample Collection and Ethical Guidelines

The International HapMap Project obtained DNA samples primarily through lymphoblastoid cell lines (LCLs) derived from peripheral blood, which were immortalized using Epstein-Barr virus transformation and stored at the Coriell Institute for Medical Research.[4] These cell lines provided a renewable source of high-quality DNA for genotyping, with initial collections focusing on four populations: approximately 90 individuals from the Yoruba in Ibadan, Nigeria (comprising 30 parent-offspring trios); 45 unrelated individuals from Tokyo, Japan; 45 unrelated individuals from Beijing, China (Han Chinese); and 90 individuals from Utah residents with northern and western European ancestry (30 CEPH trios).[21][4] Blood samples for the new collections in Nigeria, Japan, and China were gathered under local oversight, while the CEPH samples were re-consented from existing repositories.[8]

Ethical guidelines for the project were developed to ensure respect for participants, particularly in diverse global contexts, adhering to international standards such as those from the Council for International Organizations of Medical Sciences (CIOMS) and UNESCO.[4] Informed consent processes were culturally tailored and emphasized that the resulting data would be publicly available for research without any possibility of re-identifying donors or linking results to individuals; no medical or phenotypic information was collected alongside samples.[20][4] All collections required approvals from local Institutional Review Boards (IRBs) or ethics committees, which held final authority, and donors were explicitly informed that no individual research results would be returned to them or their communities.[4] Additionally, Coriell Institute policies prohibited commercialization of the samples, limiting their use to non-profit scientific research.[4]

To address ethical complexities, the project established an Ethics and Community Working Group, initially known as the Populations/ELSI (Ethical, Legal, and Social Implications) Group, in 2002. It was co-chaired by experts including Ellen Wright Clayton and Bartha M. Knoppers, with members from participating countries, and brought ethicists, social scientists, and geneticists into decision-making.[8] Community engagement was prioritized in non-Western populations through public consultations, advisory groups, and liaison with local leaders; for instance, Community Advisory Groups were formed at collection sites to maintain ongoing dialogue with the Coriell Institute.[4] Challenges included navigating cultural sensitivities around blood donation and genetic research in Nigeria, where Yoruba communities required extended discussions on benefits and risks, and in China, where the SARS epidemic in 2003 compressed engagement timelines, although all planned activities were completed.[8] In Nigeria, securing IRB approvals delayed community outreach by over six months, underscoring the need for prolonged trust-building in such contexts.[4]

Data Management and Access

Data Generation and Quality Control

Data generation for the International HapMap Project began with SNP discovery through targeted resequencing of genomic regions in diverse populations, identifying millions of candidate single nucleotide polymorphisms (SNPs) that were then validated and incorporated into public databases like dbSNP.[18] High-throughput genotyping followed, using platforms such as Illumina and Affymetrix arrays to assay these SNPs across hundreds of samples from the project's reference populations, aiming for dense coverage with at least one common SNP (minor allele frequency ≥0.05) every 5 kilobases in early phases.[18] Computational methods, including statistical phasing algorithms like PHASE, were applied to infer haplotypes from the genotyped data, with family trios improving the accuracy of reconstructed linkage disequilibrium patterns.[18] In later phases, imputation techniques addressed missing genotypes by exploiting linkage disequilibrium structure, enabling the expansion to over 3.1 million SNPs in Phase II and an additional 1.6 million in Phase III, resulting in datasets exceeding 4 million SNPs per release.[7][3]

Quality control was integral to each phase, with multi-lab concordance checks achieving greater than 99% agreement in genotype calls across genotyping centers.[18] Mendelian inheritance error rates were assessed in parent-offspring trios and kept below 1% by excluding SNPs whose discrepancies exceeded this threshold, which helped validate familial consistency and reduce false positives.[7] Additional filters included minimum call rates in the range of 80-95% for completeness, Hardy-Weinberg equilibrium tests (SNPs were retained at P > 0.001) to detect genotyping artifacts, and minor allele frequency thresholds to focus on informative variants, with population stratification adjustments applied to mitigate ancestry-related biases in haplotype inference.[18] All QC metrics, including per-SNP error rates and panel-specific polymorphism rates, were documented in release notes and supplementary materials, supporting transparency and reproducibility in downstream analyses.[7][3]

These processes evolved across phases: Phase I prioritized validation in 269 samples yielding 1,007,329 SNPs post-QC, while subsequent expansions incorporated advanced imputation models to handle missing data rates below 1% and integrated rare variant sequencing for enhanced resolution in diverse populations.[18][7] Overall, the workflow and controls produced datasets with over 99.3% genotyping completeness and accuracy exceeding 99.7%, establishing a robust foundation for genetic variation mapping.[18]
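The per-SNP filters described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the consortium's pipeline: genotypes are coded as counts of one allele (0, 1, 2) with `None` for a missing call, the Hardy-Weinberg test uses a one-degree-of-freedom chi-square approximation (an exact test may have been used in practice), and the default thresholds in `passes_qc` are illustrative values chosen to match the figures quoted in the text.

```python
from math import erfc, sqrt


def call_rate(genotypes):
    """Fraction of non-missing calls; None marks a missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chi2_pvalue(genotypes):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 1.0
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (2 * observed[2] + observed[1]) / (2 * n)  # frequency of counted allele
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele per parent."""
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}
    return any(a + b == child for a in gametes[father] for b in gametes[mother])


def passes_qc(genotypes, min_call_rate=0.8, min_maf=0.05, min_hwe_p=0.001):
    """Apply call-rate, MAF and HWE filters (thresholds are assumptions)."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf
            and hwe_chi2_pvalue(genotypes) > min_hwe_p)


# Hypothetical SNP with genotype counts in exact HWE proportions.
snp = [0] * 25 + [1] * 50 + [2] * 25
print(passes_qc(snp), is_mendelian_consistent(0, 0, 1))
```

A trio-level Mendelian error rate for a SNP would then be the fraction of trios for which `is_mendelian_consistent` returns `False`.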

Release Policies and Public Availability

The International HapMap Project employed a freeze-and-release model to ensure rapid public dissemination of its data without any proprietary holds or delays for intellectual property claims. Under this approach, genotyping results were periodically frozen at defined stages and released progressively as they were generated and quality-controlled. Phase I data, encompassing genotypes for over one million single nucleotide polymorphisms (SNPs) across 269 samples, were publicly released in October 2005. Phase II expanded coverage to more than 3.1 million SNPs and was released in October 2007. Phase III, which included data from 1,301 samples and focused on diverse populations with additional sequencing, was released in 2009, with the final analysis published in 2010.

The project's data were initially made available at no cost under the International HapMap Project Public Access License, a permissive framework that prohibited patenting of the data itself or restrictions on its further use and distribution. This policy explicitly allowed unrestricted commercial applications, provided users agreed not to encumber the data with intellectual property claims that could limit access. In 2004, the consortium removed all remaining click-wrap licensing requirements, fully placing the data in the public domain to maximize its utility as a community resource. Publications using the data were required to include proper acknowledgments and citations to the HapMap consortium.

HapMap data were hosted primarily on the NCBI dbSNP database for comprehensive SNP cataloging and retrieval, with integration into the Ensembl genome browser for comparative analysis and annotation. The dedicated HapMap website (hapmap.org, archived at hapmap.ncbi.nlm.nih.gov) served as the central portal, offering interactive data browsers for visualizing haplotypes and linkage disequilibrium patterns, bulk download options in various formats, and APIs for programmatic querying and integration into custom analyses. The data were also incorporated into the UCSC Genome Browser, enabling users to overlay HapMap genotypes on reference assemblies.

Impact and Legacy

Contributions to Genetic Research

The International HapMap Project revealed fundamental patterns in human genetic variation, demonstrating that haplotype block structures (regions of high linkage disequilibrium) vary significantly by ancestry. For instance, in European-descent populations (CEU), 87% of the genome sequence fell within blocks containing at least four SNPs, compared to only 67% in Yoruba from West Africa (YRI), reflecting differences in historical recombination rates and population histories.[18] Phase I of the project genotyped and analyzed over 1 million common single nucleotide polymorphisms (SNPs) across 269 individuals from four populations, providing the first comprehensive catalog of haplotype diversity for common variants with minor allele frequencies greater than 5%.[18] Subsequent phases expanded this to over 3.1 million SNPs in Phase II and integrated rare variants in Phase III, enhancing resolution for population-specific variation.[7][3] The HapMap data also uncovered early signatures of natural selection, including strong evidence at the LCT locus for lactase persistence in Europeans, where a long-range haplotype analysis yielded a highly significant P-value of 1.3 × 10⁻⁹, indicating a recent selective sweep favoring adult milk digestion.[18]

By identifying tag SNPs that efficiently capture common haplotype diversity, the HapMap enabled cost-effective genome-wide association studies (GWAS) for complex traits, reducing the need to genotype millions of markers. These tag SNPs were incorporated into commercial arrays and used in hundreds of GWAS by 2010, substantially boosting statistical power and replication success in mapping disease susceptibility.[22] For example, a landmark 2006 GWAS used HapMap-derived tag SNPs on the Illumina HumanHap300 array to identify common variants in IL23R as key risk factors for Crohn's disease, with the protective allele conferring an odds ratio of 0.26 (95% CI 0.15–0.43).[23]

The project's core publications (a 2005 Nature paper on Phase I, a 2007 Nature paper on Phase II, and a 2010 Nature paper on Phase III) have profoundly influenced genetic research, with the Phase I report alone garnering over 7,000 citations as of recent counts, underscoring its role as a foundational resource.[18][7][3]
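The tag SNP idea mentioned above can be illustrated with a minimal Python sketch: compute pairwise linkage disequilibrium (r²) from phased haplotypes, then greedily pick SNPs until every SNP is either a tag or in high LD with one. The SNP identifiers, the toy haplotypes, and the r² ≥ 0.8 threshold are assumptions made for this example; the sketch is a generic greedy pairwise tagger rather than a reproduction of any specific tool used by the project.

```python
def ld_r2(hap_a, hap_b):
    """r² between two biallelic SNPs, given alleles (0/1) on the same
    set of phased chromosomes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tags so every SNP has r² >= threshold with some tag.

    haplotypes maps a SNP id to its list of 0/1 alleles across chromosomes.
    """
    snps = list(haplotypes)
    # For each SNP, the set of SNPs it would capture if chosen as a tag.
    captures = {
        s: {t for t in snps
            if s == t or ld_r2(haplotypes[s], haplotypes[t]) >= threshold}
        for s in snps
    }
    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs.
        best = max(snps, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical haplotypes on six chromosomes: rs1, rs2 and rs3 are in
# perfect LD with each other, while rs4 is only weakly correlated.
toy = {
    "rs1": [0, 0, 1, 1, 0, 1],
    "rs2": [0, 0, 1, 1, 0, 1],
    "rs3": [1, 1, 0, 0, 1, 0],
    "rs4": [0, 1, 0, 1, 0, 1],
}
print(greedy_tag_snps(toy))
```

Here two tags stand in for four SNPs, which is the economy that made array-based GWAS practical: genotype the tags and infer the rest from LD.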

Influence on Subsequent Projects and Applications

The International HapMap Project directly informed the design and implementation of successor initiatives, such as the 1000 Genomes Project launched in 2008, which expanded on HapMap's catalog of common genetic variants by focusing on rarer alleles through whole-genome sequencing of over 2,500 individuals from multiple populations.[24] HapMap's haplotype data served as a foundational reference panel for imputation in the 1000 Genomes pilot phases, enabling the identification of variants not captured in earlier genotyping efforts.[25] Similarly, HapMap contributed to the ENCODE Project by providing comprehensive genotyping of selected genomic regions, which facilitated the integration of haplotype information with functional annotation of non-coding elements.[26] For the Genotype-Tissue Expression (GTEx) Project, HapMap genotyping data were used to train models for predicting gene expression from genotypes, supporting the cataloging of expression quantitative trait loci across human tissues.[27]

In applications beyond foundational mapping, HapMap data advanced personalized genomics, particularly in pharmacogenomics, by enabling the identification of haplotypes associated with drug response variations; for instance, the Pharmacogenomics Knowledge Base (PharmGKB) incorporates HapMap-derived variants to annotate genetic influences on medication efficacy and adverse effects.[28] The project's haplotype structure also supported evolutionary studies, including admixture mapping, where patterns of linkage disequilibrium from HapMap helped trace ancestral contributions in admixed populations and infer historical migration events.[29] Furthermore, HapMap's early adoption of unrestricted public data release in 2004 set a precedent for open-access policies in genomics, influencing subsequent consortia to prioritize rapid, barrier-free sharing to accelerate global research collaboration.[30]

By 2025, HapMap's legacy endures through integration into modern genomic resources like the Genome Aggregation Database (gnomAD), where its variant data inform quality control and training for variant calling pipelines across diverse cohorts.[31] The project enabled thousands of genome-wide association studies (GWAS) by providing a high-density SNP map for efficient genotyping, with its data cited in over 5,000 publications by the early 2020s that advanced understanding of complex traits.[32] Critiques of HapMap's initial Eurocentric bias, which stemmed from its focus on limited population samples, have been addressed in later efforts like the 1000 Genomes Project, which incorporated broader global diversity to mitigate underrepresentation of non-European variants.[33]

References
