# Genetic ancestry

### **An overview of the ancestry estimation process**

Global ancestry estimation (GAE) was applied to infer the mixture of populations from which an individual’s genome was inherited. It uses the genome-wide proportions of an individual’s genome that are genetically similar to populations or regions represented on a reference panel. A GAE workflow was developed by Genomics Ltd based on principal component analysis (PCA) and non-negative least squares (NNLS).

#### **What initial investigations were performed?**

An initial PCA was performed with UK Biobank whole genome sequencing (WGS) data to assess how well variants from the Our Future Health custom array (OurFutureHealthv1) can detect ancestry differences, particularly at the sub-continental level. PCA was performed with a randomly selected set of European participants from UK Biobank ([UK Biobank Field 23374 (external link)](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23374)), with the WGS data filtered to retain only the Our Future Health array variants. The resulting PCA showed that it was essential to expand the Our Future Health array by imputation to identify genetic variation at the sub-continental level (Figure 1). HapMap3 variants and the UK Biobank variants used for PCA (UKB-PCA variants) were included as these were likely to be well imputed.

Figure 1 - PCA performed on randomly selected European UK Biobank samples. F, Finland; E/N/W/S, East/North/West/South Europe; OFH, Our Future Health.

#### **How was a training dataset created to assess performance of the ancestry estimation method?**

Training data was prepared using GraphTyper population-level WGS variants from UK Biobank (UKB Field 23374). Variants were filtered for those on the Our Future Health array C2 manifest in addition to HapMap3 variants and the UK Biobank variants used in the initial PCA above (UKB-PCA variants). Samples were filtered to exclude those in the initial 200k WGS release to avoid overlap with the imputation reference panel. Further QC and variant processing measures were applied:

* Retained high-quality variants based on a GraphTyper AAscore > 0.5 and a “PASS” FILTER column (Halldorsson et al, 2022 ¹)
* Split multi-allelic variants into biallelic variants
* Excluded variants with a MAF < 1%
* Excluded variants and samples with a high level of missingness. In practice only variants were excluded, as no samples were found to have missingness > 2.5%
* Retained only unrelated samples
* Imputed missing genotypes using Beagle with default parameters and no reference
* Excluded high-LD regions defined by Genomics Ltd.
* Retained variants that were well imputed (dosage R² greater than or equal to 0.85). Imputation was performed on a set of test samples of diverse ancestry backgrounds according to the [Genomics Ltd workflow](/our-future-health/data-types/genetic-data/imputed-genotype-data.md).

#### **How were regions of the reference panel defined?**

A reference panel was derived from UK Biobank WGS data. The following sources were considered when defining regions and the samples they represent:

* The geographic location of the country and its sample size
* PC plots to determine if samples in a certain region were genetically distinct
* The availability of self-reported ethnicity data (defined in UK Biobank fields 1647 and 20115). For example, only samples with “Irish” self-reported ethnicity were allowed to form the IRELAND region in the reference panel.
* Published results in the literature

As the number of Indigenous/Native American samples is limited in UK Biobank, there is only a single Indigenous/Native American region, labelled N\_C\_S\_AMERICA for “North, Central and South America”. These samples were identified via an initial continental-level ancestry inference step, based on 4 principal components trained using 1000 Genomes data. The remaining regions comprise 5 within-UK regions, 7 European regions, 4 African regions and 8 Asian regions, giving 25 sub-continental regions in total.

#### How was the reference panel built and refined?

UK Biobank samples from the defined regions that had suitable self-reported ethnicity data were extracted. As different regions of the world are not evenly represented in UK Biobank, over-represented regions were randomly down-sampled to avoid effects of uneven sampling on the PCA projection (McVean 2009 ²) (Table 1).

**Table 1 - Countries and regions under each geographical label, including sample sizes of the regions**

| Label / Region | Countries or areas | Size (pre-filtering) | Size (post-filtering) |
| --- | --- | --- | --- |
| N_AFRICA (North Africa) | Algeria, Egypt, Libya, Morocco, Tunisia | 199 | 172 |
| E_AFRICA (East Africa) | Eritrea, Ethiopia, Sudan, Somalia | 103 | 90 |
| W_AFRICA (West Africa) | Ghana, Liberia, Sierra Leone, Nigeria, Gambia, Guinea, Togo, Senegal, Côte d’Ivoire | 250 | 238 |
| C_S_AFRICA (Central and Southern Africa) | Angola, Congo, Kenya, Cameroon, Zambia, South Africa, Uganda, Zimbabwe, United Republic of Tanzania, Central African Republic, Burundi, Rwanda | 250 | 211 |
| CE_ASIA (Central East Asia) | China, Hong Kong Special Administrative Region, Macao Special Administrative Region, Taiwan | 200 | 158 |
| JAPAN_KOREA | Japan, Republic of Korea, Democratic People’s Republic of Korea | 140 | 133 |
| SE_ASIA (Southeast Asia) | Viet Nam, Thailand, Myanmar, Malaysia, Lao People’s Democratic Republic, Indonesia, Cambodia, Philippines, Singapore | 200 | 173 |
| N_C_S_AMERICA (North, Central and South America) | Belize, Bolivia (Plurinational State of), Chile, Colombia, Ecuador, Mexico, Peru | 60 | 50 |
| INDIA_PAKISTAN | India, Pakistan | 200 | 169 |
| BANGLADESH | Bangladesh | 116 | 109 |
| SRI_LANKA | Sri Lanka | 154 | 146 |
| C_ASIA (Central Asia) | Afghanistan, Uzbekistan, Kazakhstan, Kyrgyzstan, Tajikistan | 58 | 49 |
| M_EAST_W_ASIA (Middle East and West Asia) | Jordan, Lebanon, Syrian Arab Republic, Bahrain, Kuwait, Oman, Saudi Arabia, Yemen, Türkiye, Cyprus, Armenia, Azerbaijan, Georgia, Iran (Islamic Republic of), Iraq, Israel | 200 | 178 |
| NW_WALES (North-West Wales) | Isle of Anglesey, Gwynedd, Conwy | 487 | 450 |
| SW_WALES (South-West Wales) | Ceredigion, Pembrokeshire, Carmarthenshire | 657 | 544 |
| N_ENG_S_SCOT (Northern England and Southern Scotland) | Dumfries and Galloway, Scottish Borders, Cumbria, Northumberland, County Durham, Sunderland, Gateshead, South Tyneside, Newcastle upon Tyne, North Tyneside | 800 | 676 |
| NI_N_SCOT (Northern Ireland and Scotland) | Northern Ireland, Scotland less regions in N_ENG_S_SCOT | 800 | 703 |
| C_S_UK (Central and South UK) | UK regions not in NW_WALES, SW_WALES, N_ENG_S_SCOT, and NI_N_SCOT | 800 | 747 |
| FINLAND | Finland | 87 | 81 |
| IRELAND | Ireland | 800 | 727 |
| PORTUGAL_SPAIN | Spain, Portugal | 324 | 297 |
| SE_EUROPE (South-Eastern Europe) | Italy, Romania, Greece, Bulgaria, Albania, Bosnia and Herzegovina, Croatia | 594 | 522 |
| N_EUROPE | Norway, Sweden, Denmark, Iceland | 310 | 295 |
| E_EUROPE | Poland, Lithuania, Ukraine, Czechia, Slovakia, Russian Federation, Latvia, Hungary | 663 | 619 |
| CW_EUROPE (Central-Western Europe) | France, Netherlands (Kingdom of the), Belgium, Germany, Switzerland, Austria | 800 | 730 |
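
The random down-sampling of over-represented regions described before Table 1 can be illustrated with a short sketch. This is not the Genomics Ltd implementation: the function name, the single `cap` parameter and the fixed seed are hypothetical, and Table 1 suggests the actual caps differed between regions.

```python
import random
from collections import defaultdict


def downsample(sample_regions, cap=800, seed=0):
    """Randomly cap the number of samples kept per region.

    sample_regions: dict mapping sample ID -> region label.
    cap: hypothetical maximum number of samples per region.
    Returns the list of retained sample IDs.
    """
    by_region = defaultdict(list)
    for sample_id, region in sample_regions.items():
        by_region[region].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    kept = []
    for region in sorted(by_region):
        ids = sorted(by_region[region])
        if len(ids) > cap:
            # Over-represented region: draw `cap` samples without replacement
            ids = rng.sample(ids, cap)
        kept.extend(ids)
    return kept
```

Regions at or below the cap are kept whole, so only the over-represented regions are affected.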

Following QC and processing of the UK Biobank WGS genotype data, the following steps were applied:

1. Genotype data for samples in the initial panel were extracted and converted to an N-by-S genotype matrix, where N is the number of samples and S is the number of SNPs/variants. Each cell of the matrix took the value 0, 1 or 2 for the number of copies of the ALT allele in the sample’s genotype.
2. Variants were LD pruned in PLINK with a window size of 1000 variants, a step size of 80 variants and an r² threshold of 0.1 (Bycroft et al, 2018 ³)
3. Variants with a MAF less than 0.005 were removed
4. PCA was performed and samples were projected onto the top 40 PCs
5. The centroid of the PC projections of the samples was computed for each region. The median of the projections on a PC was taken as the position of the centroid along that PC.
6. For each sample, the distance to the centroid of its region of origin was calculated
7. For each region, the mean and standard deviation of the distance to the regional centroid were calculated using all samples from that region. Samples with a distance greater than the mean + SD were excluded.

This resulted in a refined reference panel with outliers removed. The refined panel was used to perform PCA and obtain the final PCs and the projections of the samples.

#### How were informative PCs selected?

PCs were selected based on their ability to distinguish between the regions represented on the reference panel. For each PC axis, the Kruskal-Wallis test was used to compare the distribution of projections among the regions in the panel. A PC was defined as informative if its Kruskal-Wallis p-value was < 1 × 10⁻²⁰ and if, for at least one region, the interquartile range of the projections on that PC did not include 0. The mean within-region SD was calculated from the projections of the samples along each informative PC separately, and was used to compute downstream likelihood ratio statistics.

#### How were ancestry proportions estimated?
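
The estimation step takes two inputs from the preceding sections: a refined panel with outliers removed, and the subset of informative PCs. The following is a minimal sketch of how those inputs could be derived, not the Genomics Ltd implementation. It assumes the PC projections are already available as a NumPy array. The function names, the use of Euclidean distance and the population SD (NumPy's default) are illustrative assumptions that the description above does not specify.

```python
import numpy as np
from scipy.stats import kruskal


def remove_outliers(proj, regions):
    """Return a boolean mask of samples to keep.

    proj: (N, K) array of PC projections; regions: length-N region labels.
    A sample is dropped if its distance to its region's median centroid
    exceeds that region's mean distance plus one SD.
    """
    proj = np.asarray(proj, dtype=float)
    regions = np.asarray(regions)
    keep = np.zeros(len(regions), dtype=bool)
    for region in np.unique(regions):
        idx = np.where(regions == region)[0]
        centroid = np.median(proj[idx], axis=0)  # per-PC median
        dist = np.linalg.norm(proj[idx] - centroid, axis=1)  # assumed Euclidean
        keep[idx] = dist <= dist.mean() + dist.std()
    return keep


def informative_pcs(proj, regions, p_threshold=1e-20):
    """Return indices of PCs that separate the regions.

    A PC qualifies if the Kruskal-Wallis p-value across regions is below
    p_threshold and at least one region's interquartile range excludes 0.
    """
    proj = np.asarray(proj, dtype=float)
    regions = np.asarray(regions)
    labels = np.unique(regions)
    selected = []
    for k in range(proj.shape[1]):
        groups = [proj[regions == r, k] for r in labels]
        p_value = kruskal(*groups).pvalue
        iqr_excludes_zero = any(
            np.percentile(g, 25) > 0 or np.percentile(g, 75) < 0 for g in groups
        )
        if p_value < p_threshold and iqr_excludes_zero:
            selected.append(k)
    return selected
```

In this sketch `remove_outliers` would be applied to the initial panel, PCA would then be re-run on the retained samples, and `informative_pcs` would be applied to the resulting projections.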
Ancestry proportions were estimated for new samples by applying non-negative least squares (NNLS) strategies to informative PC projections. Two methods were used:

1. NNLS-admixture: estimates fractional ancestry proportions that are constrained to be non-negative and to sum to 1
2. NNLS-hard-calling: assigns a sample entirely to the closest regional centroid

#### How were the NNLS methods validated?

Samples in the refined reference panel were split into a training set (80%) and a testing set (20%). The training set was used to build a new panel according to the steps described previously, and ancestry proportions were estimated for samples in the testing set. Precision and recall were computed, treating the region of birth of the testing samples as truth. Precision was defined as the proportion of samples called to a region or ancestry group that are correctly called, and recall as the proportion of samples that belong to a region or ancestry group that are correctly called. For NNLS-admixture, a sample was assigned to a region if the estimated contribution of that region was greater than 50%.

Both NNLS-hard-calling and NNLS-admixture showed a high level of accuracy at the continental level including the UK (Table 2). However, accuracy was lower across the 25 sub-continental regions of the refined panel, for example when both NNLS methods were applied to the central and south UK region (C\_S\_UK) (Table 3). A likely explanation is that the genetic makeup of such a region is partly shared with other regions in the panel.

**Table 2 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the continental level plus the UK.**

| Region | NNLS-hard-calling recall | NNLS-hard-calling precision | NNLS-admixture recall | NNLS-admixture precision | Fraction called |
| --- | --- | --- | --- | --- | --- |
| AFR | 0.96 | 0.99 | 0.96 | 0.99 | 1.00 |
| N_C_S_AMERICA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ASIA | 0.98 | 0.99 | 0.96 | 0.99 | 0.98 |
| EUR (non-UK) | 0.95 | 0.93 | 0.96 | 0.86 | 1.00 |
| UK | 0.94 | 0.95 | 0.83 | 0.95 | 0.99 |
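
As an illustration of the two estimators described above, the sketch below fits a sample's informative PC projections as a non-negative mixture of regional centroids. It is a sketch under stated assumptions rather than the Genomics Ltd method: the sum-to-1 constraint is approximated by adding a heavily weighted row to the least-squares system and then renormalising, hard-calling uses unscaled Euclidean distance, and the function names and the `weight` parameter are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def nnls_admixture(centroids, sample, weight=1e3):
    """Estimate non-negative ancestry proportions that sum to 1.

    centroids: (R, K) array, one regional centroid per row on K informative PCs.
    sample: length-K projection of the sample on the same PCs.
    weight: hypothetical penalty enforcing the sum-to-1 constraint.
    """
    centroids = np.asarray(centroids, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n_regions = centroids.shape[0]
    # Solve min ||A x - b|| with x >= 0, where the extra row pushes sum(x) to 1
    A = np.vstack([centroids.T, weight * np.ones((1, n_regions))])
    b = np.concatenate([sample, [weight]])
    proportions, _ = nnls(A, b)
    return proportions / proportions.sum()


def nnls_hard_call(centroids, sample):
    """Return the index of the regional centroid closest to the sample."""
    centroids = np.asarray(centroids, dtype=float)
    sample = np.asarray(sample, dtype=float)
    distances = np.linalg.norm(centroids - sample, axis=1)
    return int(np.argmin(distances))
```

For a sample lying exactly on a mixture of centroids, `nnls_admixture` recovers the mixing fractions, while `nnls_hard_call` reports only the nearest region.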

**Table 3 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the sub-continental level.**

| Region | NNLS-hard-calling recall | NNLS-hard-calling precision | NNLS-admixture recall | NNLS-admixture precision | Fraction called |
| --- | --- | --- | --- | --- | --- |
| BANGLADESH | 1.00 | 0.85 | 0.91 | 1.00 | 0.95 |
| CE_ASIA | 1.00 | 0.86 | 1.00 | 0.82 | 1.00 |
| CW_EUROPE | 0.66 | 0.84 | 0.14 | 0.88 | 0.45 |
| C_ASIA | 0.80 | 0.62 | 0.70 | 0.70 | 0.90 |
| C_S_AFRICA | 0.98 | 1.00 | 0.98 | 1.00 | 1.00 |
| C_S_UK | 0.65 | 0.49 | 0.23 | 0.55 | 0.64 |
| E_AFRICA | 1.00 | 1.00 | 0.94 | 1.00 | 0.94 |
| E_EUROPE | 0.94 | 0.98 | 0.94 | 0.98 | 0.96 |
| FINLAND | 0.94 | 1.00 | 0.94 | 1.00 | 0.94 |
| INDIA_PAKISTAN | 0.68 | 0.96 | 0.50 | 0.94 | 0.79 |
| IRELAND | 0.88 | 0.82 | 0.88 | 0.78 | 0.94 |
| JAPAN_KOREA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M_EAST_W_ASIA | 0.89 | 0.94 | 0.81 | 0.94 | 0.89 |
| NI_N_SCOT | 0.60 | 0.70 | 0.40 | 0.71 | 0.72 |
| NW_WALES | 0.64 | 0.98 | 0.63 | 0.98 | 0.76 |
| N_AFRICA | 0.83 | 0.97 | 0.80 | 0.97 | 0.97 |
| N_C_S_AMERICA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| N_ENG_S_SCOT | 0.55 | 0.55 | 0.52 | 0.52 | 0.76 |
| N_EUROPE | 0.93 | 0.69 | 0.95 | 0.63 | 0.98 |
| PORTUGAL_SPAIN | 0.98 | 0.83 | 0.98 | 0.92 | 0.98 |
| SE_ASIA | 0.83 | 1.00 | 0.74 | 1.00 | 0.94 |
| SE_EUROPE | 0.99 | 0.93 | 0.95 | 0.92 | 0.95 |
| SRI_LANKA | 1.00 | 0.91 | 1.00 | 0.79 | 1.00 |
| SW_WALES | 0.81 | 0.87 | 0.82 | 0.88 | 0.86 |
| W_AFRICA | 1.00 | 0.98 | 1.00 | 0.98 | 1.00 |
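
The precision and recall definitions used for Tables 2 and 3 can be sketched as follows. The function names are hypothetical, and one detail is an assumption: samples left uncalled by the 50% rule are counted against recall here, which the text does not state explicitly.

```python
def call_from_admixture(proportions, threshold=0.5):
    """Assign a sample to the region whose estimated contribution exceeds
    the threshold, or return None if no region does.

    proportions: dict mapping region label -> estimated ancestry proportion.
    """
    region, value = max(proportions.items(), key=lambda item: item[1])
    return region if value > threshold else None


def precision_recall(truth, calls, region):
    """Compute precision and recall for one region.

    truth: region of birth per sample; calls: called region per sample
    (None for an uncalled sample). Uncalled samples lower recall (assumption).
    """
    called_truths = [t for t, c in zip(truth, calls) if c == region]
    n_members = sum(1 for t in truth if t == region)
    true_positives = sum(1 for t in called_truths if t == region)
    precision = true_positives / len(called_truths) if called_truths else float("nan")
    recall = true_positives / n_members if n_members else float("nan")
    return precision, recall
```

For NNLS-hard-calling, every sample receives a call, so `calls` would contain no `None` entries.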

**Validation with the Coriell Our Future Health array dataset**

A subset of the Coriell 1,000 Genomes samples (1,998) was genotyped on the Our Future Health array, then phased and imputed using [Genomics Ltd's bespoke pipeline](/our-future-health/data-types/genetic-data/imputed-genotype-data.md). The places of birth of the samples were mapped to the regions of the refined panel, and their ancestry proportions were estimated using both NNLS approaches.

As an initial check, the African ancestry of the African Ancestry in Southwest US (ASW) samples inferred with the NNLS-admixture approach was compared with that inferred by Genomics Ltd's chromosome painting pipeline using FLARE ⁴. A high degree of concordance was found between the two approaches, suggesting that the NNLS-admixture method is well calibrated to handle admixture (Figure 2). Additional Coriell samples were analysed using both NNLS methods, and the majority of regions showed good performance.

Figure 2 - Comparison of African ancestry inferred for ASW Coriell samples by chromosome painting (FLARE) and NNLS-admixture

#### Further validation with The People of the British Isles (PoBI) samples

The People of the British Isles (PoBI) ⁵ collection from the Coriell dataset was genotyped on the Illumina Human 1.2M-Duo genotyping chip. Genotype data were phased and imputed using the Genomics Ltd bespoke pipeline. After applying QC filters, a total of 1,935 samples were retained. The regions of birth of the samples were mapped to the regions of the refined reference panel as closely as possible.

NNLS-admixture and NNLS-hard-calling were applied to the PoBI samples, and high recall rates were observed for both approaches. Due to the method of sample collection, admixture is unlikely to be a major contributor to the data patterns. This is reflected in the NNLS-hard-calling recall rates (Table 4) being slightly higher than those of the NNLS-admixture approach (Table 5).

**Table 4 - The average NNLS-hard-calling estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.**

| Region | NI_N_SCOT | NW_WALES (some C_S_UK) | N_ENG_S_SCOT | SW_WALES (some C_S_UK) | C_S_UK |
| --- | --- | --- | --- | --- | --- |
| BANGLADESH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CW_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 |
| C_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_UK | 0.03 | 0.00 | 0.18 | 0.00 | 0.80 |
| E_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| E_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| FINLAND | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| INDIA_PAKISTAN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IRELAND | 0.05 | 0.00 | 0.01 | 0.00 | 0.00 |
| JAPAN_KOREA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| M_EAST_W_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| NI_N_SCOT | 0.61 | 0.00 | 0.11 | 0.00 | 0.01 |
| NW_WALES | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 |
| N_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_C_S_AMERICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_ENG_S_SCOT | 0.31 | 0.00 | 0.69 | 0.00 | 0.15 |
| N_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| PORTUGAL_SPAIN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SE_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SRI_LANKA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SW_WALES | 0.00 | 0.01 | 0.00 | 1.00 | 0.01 |
| W_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

**Table 5 - The average NNLS-admixture estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.**

| Region | NI_N_SCOT | NW_WALES (some UK_OTHER) | N_ENG_S_SCOT | SW_WALES (some UK_OTHER) | UK_OTHER |
| --- | --- | --- | --- | --- | --- |
| BANGLADESH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CW_EUROPE | 0.00 | 0.00 | 0.02 | 0.00 | 0.09 |
| C_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_UK | 0.03 | 0.00 | 0.08 | 0.00 | 0.30 |
| E_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| E_EUROPE | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| FINLAND | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 |
| INDIA_PAKISTAN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IRELAND | 0.16 | 0.00 | 0.05 | 0.01 | 0.02 |
| JAPAN_KOREA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| M_EAST_W_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| NI_N_SCOT | 0.47 | 0.00 | 0.15 | 0.00 | 0.01 |
| NW_WALES | 0.03 | 0.99 | 0.02 | 0.01 | 0.03 |
| N_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_C_S_AMERICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_ENG_S_SCOT | 0.17 | 0.00 | 0.53 | 0.01 | 0.26 |
| N_EUROPE | 0.08 | 0.00 | 0.09 | 0.01 | 0.20 |
| PORTUGAL_SPAIN | 0.01 | 0.00 | 0.01 | 0.00 | 0.03 |
| SE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SE_EUROPE | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 |
| SRI_LANKA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SW_WALES | 0.02 | 0.01 | 0.02 | 0.96 | 0.04 |
| W_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
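
The column means reported in Tables 4 and 5 can be reproduced in principle by averaging per-sample estimates within each region of birth. The sketch below assumes, as an interpretation not stated in the text, that each NNLS-hard-calling estimate is encoded as a one-hot vector (1 for the called region, 0 elsewhere), so that its mean is the fraction of samples called to each region. The function name is hypothetical.

```python
import numpy as np


def mean_estimates_by_birth_region(estimates, birth_regions):
    """Average per-sample ancestry estimates within each region of birth.

    estimates: (N, R) array, one row of estimates per sample over R panel
        regions (proportions for NNLS-admixture, one-hot rows for hard calls).
    birth_regions: length-N labels giving each sample's region of birth.
    Returns a dict mapping birth region -> length-R vector of mean estimates.
    """
    estimates = np.asarray(estimates, dtype=float)
    birth_regions = np.asarray(birth_regions)
    return {
        str(region): estimates[birth_regions == region].mean(axis=0)
        for region in np.unique(birth_regions)
    }
```

Each returned vector corresponds to one column of the tables above, with one entry per panel region.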

#### How was genetic ancestry inferred with Our Future Health samples?

Genetic ancestry was inferred for Our Future Health participants using a refined reference panel based on UK Biobank as described above. Genotype data for Our Future Health samples were generated with the custom OurFutureHealthv1 beadchip array assay and imputed with phased UK Biobank 200k WGS data as [previously described](/our-future-health/data-types/genetic-data/imputed-genotype-data.md).

Our Future Health data were securely transferred to a private project space shared with Genomics Ltd in groups of 2,000 to 8,000 samples. Groups consisted of multiple batches that had been genotyped on the same date. Batches from more than one genotyping date were combined if a group contained fewer than 2,000 samples. Our Future Health data were pseudo-anonymised by replacing the sample IDs provided by the genotyping laboratory with a participant ID (PID) specific to Genomics Ltd. Original genotyping batch IDs were not made available.

Ancestry estimation was performed for samples in the same groups to which they were assigned for imputation of their genotype data. Both the NNLS-admixture and NNLS-hard-calling approaches were applied. For each sample, an NNLS-admixture proportion was returned for each of the 25 sub-continental regions. In addition, the NNLS-hard-calling approach was used to assign each sample a single ancestry label from the 25 sub-continental regions.

***

#### **References**

1. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
2. McVean, G. A Genealogical Interpretation of Principal Components Analysis. PLOS Genet. 5, e1000686 (2009).
3. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
4. Browning, S. R., Waples, R. K. & Browning, B. L. Fast, accurate local ancestry inference with FLARE. Am. J. Hum. Genet. 110, 326–335 (2023).
5. Leslie, S. et al. The fine-scale genetic structure of the British population. Nature 519, 309–314 (2015).