Genetic ancestry
Information about the ancestry data in the Our Future Health Resource. This documentation includes the scope and structure of the data, and the process of ancestry estimation.
An overview of the ancestry estimation process
Global ancestry estimation (GAE) was applied to infer the mixture of populations from which an individual’s genome was inherited. GAE estimates the genome-wide proportions of an individual’s genome that are genetically similar to the populations or regions represented on a reference panel. A GAE workflow was developed by Genomics Ltd based on principal component analysis (PCA) and non-negative least squares (NNLS).
What initial investigations were performed?
An initial PCA was performed with UK Biobank whole genome sequencing (WGS) data to assess how well variants from the Our Future Health custom array (OurFutureHealthv1) can detect ancestry differences, particularly at the sub-continental level. The PCA used a randomly selected set of European participants from UK Biobank (UK Biobank Field 23374), with the WGS data filtered to retain only the Our Future Health array variants. The resulting PCA showed that it was essential to expand the Our Future Health array variant set by imputation to identify genetic variation at the sub-continental level (Figure 1). HapMap3 variants and the UK Biobank variants used for PCA (UKB-PCA variants) were included as these were likely to be well imputed.

How was a training dataset created to assess performance of the ancestry estimation method?
Training data was prepared using GraphTyper population-level WGS variants from UK Biobank (UKB Field 23374). Variants were restricted to those on the Our Future Health array C2 manifest, together with HapMap3 variants and the UK Biobank variants used in the initial PCA above (UKB-PCA variants). Samples in the initial 200k WGS release were excluded to avoid overlap with the imputation reference panel.
Further QC and variant processing measures were applied:
Retained high-quality variants based on a GraphTyper AAscore > 0.5 and a “PASS” FILTER column (Halldorsson et al., 2022) [1]
Multi-allelic variants were split into biallelic
Excluded variants with a MAF < 1%
Excluded variants and samples with a high level of missingness. This was only applied to variants as no samples were found to have missingness > 2.5%
Retained only unrelated samples
Imputed missing genotypes using Beagle with default parameters and no reference panel
Excluded high-LD regions defined by Genomics Ltd.
Retained variants that were well imputed (dosage R² greater than or equal to 0.85). Imputation was performed on a set of test samples of diverse ancestry backgrounds according to the Genomics Ltd workflow.
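As an illustration of the MAF and missingness filters in the list above, the following minimal Python sketch applies both to a small sample-by-variant genotype matrix. The function names, the use of None for a missing genotype and the data layout are assumptions made for illustration. The 1% MAF threshold comes from the list above. The source quotes 2.5% missingness for samples only, so applying the same value to variants here is also an assumption. This sketch does not reproduce the actual workflow.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 ALT allele copies; None marks a missing call


def variant_passes_qc(
    calls: Sequence[Genotype],
    min_maf: float = 0.01,
    max_missing: float = 0.025,  # assumed: the source quotes 2.5% for samples only
) -> bool:
    """Return True if one variant (a column of genotypes) passes both filters."""
    observed = [g for g in calls if g is not None]
    missing_rate = 1.0 - len(observed) / len(calls)
    if not observed or missing_rate > max_missing:
        return False
    # ALT allele frequency: each diploid sample carries two alleles.
    alt_freq = sum(observed) / (2 * len(observed))
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf >= min_maf


def filter_variants(matrix: Sequence[Sequence[Genotype]], **thresholds) -> List[int]:
    """Return the column indices (variants) of a sample-by-variant matrix that pass QC."""
    n_variants = len(matrix[0])
    return [
        j for j in range(n_variants)
        if variant_passes_qc([row[j] for row in matrix], **thresholds)
    ]
```

Both thresholds are keyword arguments, so a call such as filter_variants(matrix, max_missing=0.05) can be used to explore how sensitive the retained variant set is to the chosen cut-offs.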
How were regions of the reference panel defined?
A reference panel was derived from UK Biobank WGS data. The following sources were considered when defining regions and the samples they represent:
The geographic location of the country and its sample size
PC plots to determine if samples in a certain region were genetically distinct
The availability of self-reported ethnicity data (defined in UK Biobank fields 1647 and 20115). For example, only samples with “Irish” self-reported ethnicity were allowed to form the IRELAND region in the reference panel.
Published results in the literature
As the number of Indigenous/Native American samples is limited in UK Biobank, there is only a single Indigenous/Native American region, labelled N_C_S_AMERICA for “North, Central and South America”. These samples were identified via an initial continental-level ancestry inference step, based on 4 principal components trained using 1000 Genomes data. Additional regions include 5 within-UK regions, 7 European regions, 4 African regions and 8 Asian regions, giving 25 sub-continental regions in total.
How was the reference panel built and refined?
UK Biobank samples from the defined regions that had suitable self-reported ethnicity data were extracted. As different regions of the world are not evenly represented in UK Biobank, over-represented regions were randomly down-sampled to avoid effects of uneven sampling on the PCA projection (McVean, 2009) [2] (Table 1).
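The random down-sampling of over-represented regions can be sketched as follows. This is a minimal illustration only: the function name, the max_per_region cap and the fixed seed are hypothetical, and the source does not state how the cap for each region was chosen.

```python
import random
from collections import defaultdict


def downsample_regions(sample_regions, max_per_region, seed=0):
    """Randomly keep at most `max_per_region` samples from each region.

    sample_regions: dict mapping sample ID -> region label.
    Returns a dict of the retained sample IDs and their region labels.
    """
    by_region = defaultdict(list)
    for sample_id, region in sample_regions.items():
        by_region[region].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    kept = {}
    for region, ids in by_region.items():
        ids = sorted(ids)  # stable order before sampling
        if len(ids) > max_per_region:
            ids = rng.sample(ids, max_per_region)
        for sample_id in ids:
            kept[sample_id] = region
    return kept
```

Regions that are already at or below the cap are kept whole, so only the over-represented regions are thinned.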
Table 1 - Countries and regions under each geographical label, including sample sizes of the regions
N_AFRICA (North Africa) | Algeria, Egypt, Libya, Morocco, Tunisia | 199 | 172
E_AFRICA (East Africa) | Eritrea, Ethiopia, Sudan, Somalia | 103 | 90
W_AFRICA (West Africa) | Ghana, Liberia, Sierra Leone, Nigeria, Gambia, Guinea, Togo, Senegal, Côte d’Ivoire | 250 | 238
C_S_AFRICA (Central and Southern Africa) | Angola, Congo, Kenya, Cameroon, Zambia, South Africa, Uganda, Zimbabwe, United Republic of Tanzania, Central African Republic, Burundi, Rwanda | 250 | 211
CE_ASIA (Central East Asia) | China, Hong Kong Special Administrative Region, Macao Special Administrative Region, Taiwan | 200 | 158
JAPAN_KOREA | Japan, Republic of Korea, Democratic People’s Republic of Korea | 140 | 133
SE_ASIA (Southeast Asia) | Viet Nam, Thailand, Myanmar, Malaysia, Lao People’s Democratic Republic, Indonesia, Cambodia, Philippines, Singapore | 200 | 173
N_C_S_AMERICA (North, Central and South America) | Belize, Bolivia (Plurinational State of), Chile, Colombia, Ecuador, Mexico, Peru | 60 | 50
INDIA_PAKISTAN | India, Pakistan | 200 | 169
BANGLADESH | Bangladesh | 116 | 109
SRI_LANKA | Sri Lanka | 154 | 146
C_ASIA (Central Asia) | Afghanistan, Uzbekistan, Kazakhstan, Kyrgyzstan, Tajikistan | 58 | 49
M_EAST_W_ASIA (Middle East and West Asia) | Jordan, Lebanon, Syrian Arab Republic, Bahrain, Kuwait, Oman, Saudi Arabia, Yemen, Türkiye, Cyprus, Armenia, Azerbaijan, Georgia, Iran (Islamic Republic of), Iraq, Israel | 200 | 178
NW_WALES (North-West Wales) | Isle of Anglesey, Gwynedd, Conwy | 487 | 450
SW_WALES (South-West Wales) | Ceredigion, Pembrokeshire, Carmarthenshire | 657 | 544
N_ENG_S_SCOT (Northern England and Southern Scotland) | Dumfries and Galloway, Scottish Borders, Cumbria, Northumberland, County Durham, Sunderland, Gateshead, South Tyneside, Newcastle upon Tyne, North Tyneside | 800 | 676
NI_N_SCOT (Northern Ireland and Scotland) | Northern Ireland, Scotland less regions in N_ENG_S_SCOT | 800 | 703
C_S_UK (Central and South UK) | UK regions not in NW_WALES, SW_WALES, N_ENG_S_SCOT, and NI_N_SCOT | 800 | 747
FINLAND | Finland | 87 | 81
IRELAND | Ireland | 800 | 727
PORTUGAL_SPAIN | Spain, Portugal | 324 | 297
SE_EUROPE (South-Eastern Europe) | Italy, Romania, Greece, Bulgaria, Albania, Bosnia and Herzegovina, Croatia | 594 | 522
N_EUROPE | Norway, Sweden, Denmark, Iceland | 310 | 295
E_EUROPE | Poland, Lithuania, Ukraine, Czechia, Slovakia, Russian Federation, Latvia, Hungary | 663 | 619
CW_EUROPE (Central-Western Europe) | France, Netherlands (Kingdom of the), Belgium, Germany, Switzerland, Austria | 800 | 730
Following QC and processing of the UK Biobank WGS genotype data, the following steps were applied:
Genotype data for samples in the initial panel were extracted and converted to an N-by-S genotype matrix, where N is the number of samples and S is the number of SNPs/variants. Each cell of the matrix took the value 0, 1 or 2, the number of copies of the ALT allele in the sample’s genotype.
Variants were LD pruned in PLINK with a window size of 1000 variants, a step size of 80 variants and an r² threshold of 0.1 (Bycroft et al., 2018) [3]
Variants with a MAF less than 0.005 were removed
PCA was performed and samples were projected onto the top 40 PCs
The centroid of the PC projections of the samples was computed for each region. The median of the projections on a PC was taken as the position of the centroid along that PC.
For each sample, the distance to the centroid of its region of origin was calculated
For each region, the mean and standard deviation of the distance to the regional centroid was calculated using all samples from that region. Samples with a distance greater than the mean + SD were excluded.
This resulted in a refined reference panel with outliers removed. The refined panel was used to perform PCA and obtain final PCs and the projection of samples.
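The centroid and outlier-removal steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name and data layout are hypothetical, Euclidean distance is assumed because the source only says “distance”, and the sample standard deviation is assumed because the source does not say which estimator was used.

```python
import math
import statistics


def refine_panel(projections, regions):
    """Drop samples lying further than mean + 1 SD from their regional centroid.

    projections: dict sample ID -> list of PC coordinates.
    regions: dict sample ID -> region label.
    Returns (list of kept sample IDs, dict region -> centroid).
    """
    members = {}
    for sample_id, region in regions.items():
        members.setdefault(region, []).append(sample_id)

    kept, centroids = [], {}
    for region, ids in members.items():
        coords = [projections[s] for s in ids]
        # Centroid: the per-PC median of the region's projections.
        centroid = [statistics.median(axis) for axis in zip(*coords)]
        centroids[region] = centroid
        # Euclidean distance is an assumption; the source only says "distance".
        distances = {s: math.dist(projections[s], centroid) for s in ids}
        values = list(distances.values())
        mean_d = statistics.mean(values)
        # Sample SD is an assumption; the estimator is not stated in the source.
        sd_d = statistics.stdev(values) if len(values) > 1 else 0.0
        kept.extend(s for s in ids if distances[s] <= mean_d + sd_d)
    return kept, centroids
```

Using the median rather than the mean for the centroid makes its position less sensitive to the very outliers that the distance threshold is meant to remove.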
How were informative PCs selected?
PCs were selected based on their ability to distinguish between the regions represented on the reference panel. For each PC axis, the Kruskal-Wallis test was used to compare the distribution of projections among the regions in the panel. Informative PCs were defined as those with a Kruskal-Wallis p-value < 1 × 10⁻²⁰ and for which at least one region’s interquartile range of projections on the PC did not include 0. The mean within-region SD was calculated from the projections of the samples along each informative PC separately, and was used to compute downstream likelihood ratio statistics.
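The two selection criteria can be illustrated with a short, standard-library-only Python sketch. The function names and the data layout are assumptions made for illustration. The Kruskal-Wallis p-value is computed with the usual tie-corrected chi-square approximation, and quartiles come from statistics.quantiles with its default method. The source does not say which implementation or quartile convention was used, and in practice a library routine such as scipy.stats.kruskal would normally replace the hand-written test.

```python
import math
import statistics
from itertools import groupby


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, via closed-form series."""
    y = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0  # i = 0 term of sum_{i < df/2} y**i / i!
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    total = math.erfc(math.sqrt(y))
    term = math.exp(-y) * math.sqrt(y) / math.gamma(1.5)  # i = 1 term
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (i + 0.5)
    return total


def kruskal_wallis_p(groups):
    """Kruskal-Wallis p-value (tie-corrected, chi-square approximation)."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term, position = 0, 0
    for _, run in groupby(pooled, key=lambda pair: pair[0]):
        run = list(run)
        t = len(run)
        average_rank = position + (t + 1) / 2.0  # mean of ranks position+1 .. position+t
        for _, g in run:
            rank_sums[g] += average_rank
        tie_term += t ** 3 - t
        position += t
    weighted = sum(r * r / len(v) for r, v in zip(rank_sums, groups))
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0:
        return 1.0  # all values identical: no evidence of a difference
    return chi2_sf(max(h / correction, 0.0), len(groups) - 1)


def informative_pcs(projections_by_region, p_threshold=1e-20):
    """Return the indices of PCs that pass both selection criteria.

    projections_by_region: dict region -> list of per-sample PC coordinate lists.
    """
    regions = list(projections_by_region.values())
    selected = []
    for pc in range(len(regions[0][0])):
        groups = [[sample[pc] for sample in region] for region in regions]
        iqr_excludes_zero = False
        for values in groups:
            q1, _, q3 = statistics.quantiles(values, n=4)
            if q1 > 0 or q3 < 0:
                iqr_excludes_zero = True
        if kruskal_wallis_p(groups) < p_threshold and iqr_excludes_zero:
            selected.append(pc)
    return selected
```

With a threshold as strict as 1 × 10⁻²⁰, a PC can only qualify when the panel is large and the regions are strongly separated along that axis, so small toy datasets will typically select no PCs at all.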
How were ancestry proportions estimated?
Ancestry proportions were estimated for new samples by applying non-negative least squares (NNLS) strategies to informative PC projections. These methods included:
NNLS-admixture: gives an estimate of fractional ancestry proportions that are constrained to be non-negative and sum up to 1
NNLS-hard-calling: assigns a sample entirely to the closest regional centroid
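To make the two strategies concrete, the sketch below fits a sample’s informative-PC projection as a mixture of regional centroids. It is a minimal sketch, not the Genomics Ltd implementation: the solver used in the workflow is not described in the source, so the constrained least-squares problem is solved here by projected gradient descent onto the set of non-negative weights that sum to 1. The function names, the iteration count and the use of unscaled Euclidean distance for hard calling are all assumptions.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    theta, cumulative = 0.0, 0.0
    for j, u_j in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def admixture_proportions(sample, centroids, n_iter=2000):
    """Least-squares mixture weights of regional centroids, non-negative and summing to 1.

    sample: list of informative-PC coordinates for one sample.
    centroids: dict region -> list of centroid coordinates (same length as sample).
    """
    labels = list(centroids)
    cols = [centroids[r] for r in labels]
    # Step size 1 / (sum of squared centroid coordinates), a safe bound on the
    # Lipschitz constant of the gradient.
    trace = sum(c_d * c_d for c in cols for c_d in c)
    step = 1.0 / trace if trace > 0 else 1.0
    w = [1.0 / len(labels)] * len(labels)
    for _ in range(n_iter):
        fitted = [sum(w_j * c[d] for w_j, c in zip(w, cols)) for d in range(len(sample))]
        residual = [f - x for f, x in zip(fitted, sample)]
        gradient = [sum(c_d * r_d for c_d, r_d in zip(c, residual)) for c in cols]
        w = project_to_simplex([w_j - step * g for w_j, g in zip(w, gradient)])
    return dict(zip(labels, w))


def hard_call(sample, centroids):
    """Assign the sample entirely to the region with the nearest centroid."""
    return min(centroids, key=lambda region: math.dist(sample, centroids[region]))
```

For example, with hypothetical centroids A = (1, 0), B = (0, 1) and C = (-1, -1), a sample projected at (0.7, 0.3) is fitted as roughly 70% A and 30% B by admixture_proportions, while hard_call assigns it wholly to A.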
How were the NNLS methods validated?
Samples in the refined reference panel were split into a training set (80%) and a testing set (20%), where the training set was used to build a new panel according to the steps described previously. Ancestry proportions were estimated for samples in the testing set. Precision and recall were computed, where the region of birth of the testing samples was treated as truth. In doing so, precision was defined as the proportion of samples called to a region or ancestry group that are correctly called, and recall as the proportion of samples that belong to a region or ancestry group that are correctly called.
Precision and recall were computed for NNLS-admixture by assigning a sample to a region if the estimated contribution of the region was greater than 50%. Both the NNLS-hard-calling and NNLS-admixture methods showed a high level of accuracy at the continental level, with the UK treated separately (Table 2). However, accuracy was lower across the 25 sub-continental regions of the refined panel, for example when both NNLS methods were applied to the central and south UK region (C_S_UK) (Table 3). One possible explanation is that the genetic makeup of such a region is partly accounted for by other regions in the panel.
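The precision and recall definitions, together with the 50% rule used to turn NNLS-admixture proportions into a single call, can be written out as a short Python sketch. The function names and the use of None for a sample with no call are assumptions made for illustration.

```python
def call_from_admixture(proportions, threshold=0.5):
    """Return the region whose estimated contribution exceeds the threshold, else None."""
    region, share = max(proportions.items(), key=lambda item: item[1])
    return region if share > threshold else None


def precision_recall(truth, calls):
    """Per-region precision and recall.

    truth: list of true region labels (region of birth).
    calls: list of called labels, aligned with truth (None = no call).
    Returns dict region -> (precision, recall); precision is None if nothing
    was called to that region.
    """
    result = {}
    for region in sorted(set(truth)):
        called = sum(1 for c in calls if c == region)
        belong = sum(1 for t in truth if t == region)
        correct = sum(1 for t, c in zip(truth, calls) if t == c == region)
        precision = correct / called if called else None
        recall = correct / belong
        result[region] = (precision, recall)
    return result
```

A sample whose largest estimated contribution is 50% or less receives no call under this rule, which lowers recall for its true region without affecting the precision of any region.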
Table 2 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the continental level plus the UK.
AFR | 0.96 | 0.99 | 0.96 | 0.99 | 1.00
N_C_S_AMERICA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
ASIA | 0.98 | 0.99 | 0.96 | 0.99 | 0.98
EUR (non-UK) | 0.95 | 0.93 | 0.96 | 0.86 | 1.00
UK | 0.94 | 0.95 | 0.83 | 0.95 | 0.99
Table 3 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the sub-continental level.
BANGLADESH | 1.00 | 0.85 | 0.91 | 1.00 | 0.95
CE_ASIA | 1.00 | 0.86 | 1.00 | 0.82 | 1.00
CW_EUROPE | 0.66 | 0.84 | 0.14 | 0.88 | 0.45
C_ASIA | 0.80 | 0.62 | 0.70 | 0.70 | 0.90
C_S_AFRICA | 0.98 | 1.00 | 0.98 | 1.00 | 1.00
C_S_UK | 0.65 | 0.49 | 0.23 | 0.55 | 0.64
E_AFRICA | 1.00 | 1.00 | 0.94 | 1.00 | 0.94
E_EUROPE | 0.94 | 0.98 | 0.94 | 0.98 | 0.96
FINLAND | 0.94 | 1.00 | 0.94 | 1.00 | 0.94
INDIA_PAKISTAN | 0.68 | 0.96 | 0.50 | 0.94 | 0.79
IRELAND | 0.88 | 0.82 | 0.88 | 0.78 | 0.94
JAPAN_KOREA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M_EAST_W_ASIA | 0.89 | 0.94 | 0.81 | 0.94 | 0.89
NI_N_SCOT | 0.60 | 0.70 | 0.40 | 0.71 | 0.72
NW_WALES | 0.64 | 0.98 | 0.63 | 0.98 | 0.76
N_AFRICA | 0.83 | 0.97 | 0.80 | 0.97 | 0.97
N_C_S_AMERICA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
N_ENG_S_SCOT | 0.55 | 0.55 | 0.52 | 0.52 | 0.76
N_EUROPE | 0.93 | 0.69 | 0.95 | 0.63 | 0.98
PORTUGAL_SPAIN | 0.98 | 0.83 | 0.98 | 0.92 | 0.98
SE_ASIA | 0.83 | 1.00 | 0.74 | 1.00 | 0.94
SE_EUROPE | 0.99 | 0.93 | 0.95 | 0.92 | 0.95
SRI_LANKA | 1.00 | 0.91 | 1.00 | 0.79 | 1.00
SW_WALES | 0.81 | 0.87 | 0.82 | 0.88 | 0.86
W_AFRICA | 1.00 | 0.98 | 1.00 | 0.98 | 1.00
Validation with the Coriell Our Future Health array dataset
A subset of 1,998 Coriell 1000 Genomes samples was genotyped on the Our Future Health array, then phased and imputed using the Genomics Ltd bespoke pipeline. The places of birth of the samples were mapped to the regions of the refined panel, and their ancestry proportions were estimated using both NNLS approaches.
As an initial check, the African ancestry of the African Ancestry in Southwest US (ASW) samples inferred with the NNLS-admixture approach was compared with that inferred by the Genomics Ltd chromosome painting pipeline using FLARE [4]. A high degree of concordance was found between the two approaches, suggesting that the NNLS-admixture method is well calibrated to handle admixture (Figure 2). Additional Coriell samples were analysed using both NNLS methods, and the majority of regions showed good performance.

Further validation with The People of the British Isles (PoBI) samples
Samples from the People of the British Isles (PoBI) collection [5] in the Coriell dataset were genotyped on the Illumina Human 1.2M-Duo genotyping chip. Genotype data were phased and imputed using the Genomics Ltd bespoke pipeline. After applying QC filters, a total of 1,935 samples were retained. The regions of birth of the samples were mapped to the regions of the refined reference panel as closely as possible. NNLS-admixture and NNLS-hard-calling were applied to the PoBI samples, and high recall rates were observed for both approaches. Because of the method of sample collection, admixture is unlikely to be a major contributor to the patterns in the data. This is reflected in the NNLS-hard-calling recall rates (Table 4) being slightly higher than those of the NNLS-admixture approach (Table 5).
Table 4 - The average NNLS-hard-calling estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.
BANGLADESH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
CE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
CW_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.03
C_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
C_S_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
C_S_UK | 0.03 | 0.00 | 0.18 | 0.00 | 0.80
E_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
E_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
FINLAND | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
INDIA_PAKISTAN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
IRELAND | 0.05 | 0.00 | 0.01 | 0.00 | 0.00
JAPAN_KOREA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
M_EAST_W_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
NI_N_SCOT | 0.61 | 0.00 | 0.11 | 0.00 | 0.01
NW_WALES | 0.00 | 0.99 | 0.00 | 0.00 | 0.00
N_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
N_C_S_AMERICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
N_ENG_S_SCOT | 0.31 | 0.00 | 0.69 | 0.00 | 0.15
N_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.01
PORTUGAL_SPAIN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
SE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
SE_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
SRI_LANKA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
SW_WALES | 0.00 | 0.01 | 0.00 | 1.00 | 0.01
W_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Table 5 - The average NNLS-admixture estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.
BANGLADESH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
CE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
CW_EUROPE | 0.00 | 0.00 | 0.02 | 0.00 | 0.09
C_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
C_S_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
C_S_UK | 0.03 | 0.00 | 0.08 | 0.00 | 0.30
E_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
E_EUROPE | 0.01 | 0.00 | 0.00 | 0.00 | 0.00
FINLAND | 0.01 | 0.00 | 0.01 | 0.00 | 0.00
INDIA_PAKISTAN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
IRELAND | 0.16 | 0.00 | 0.05 | 0.01 | 0.02
JAPAN_KOREA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
M_EAST_W_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
NI_N_SCOT | 0.47 | 0.00 | 0.15 | 0.00 | 0.01
NW_WALES | 0.03 | 0.99 | 0.02 | 0.01 | 0.03
N_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
N_C_S_AMERICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
N_ENG_S_SCOT | 0.17 | 0.00 | 0.53 | 0.01 | 0.26
N_EUROPE | 0.08 | 0.00 | 0.09 | 0.01 | 0.20
PORTUGAL_SPAIN | 0.01 | 0.00 | 0.01 | 0.00 | 0.03
SE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
SE_EUROPE | 0.00 | 0.00 | 0.01 | 0.00 | 0.01
SRI_LANKA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
SW_WALES | 0.02 | 0.01 | 0.02 | 0.96 | 0.04
W_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
How was genetic ancestry inferred with Our Future Health samples?
Genetic ancestry was inferred for Our Future Health participants using a refined reference panel based on UK Biobank as described above. Genotype data for Our Future Health samples were generated with the custom OurFutureHealthv1 beadchip array assay and imputed with phased UK Biobank 200k WGS data as previously described. Our Future Health data were securely transferred to a private project space shared with Genomics Ltd in groups of 2,000 to 8,000 samples. Groups consisted of multiple batches that had been genotyped on the same date. Batches from more than one genotyping date were combined if a group contained fewer than 2,000 samples. Our Future Health data were pseudo-anonymised by replacing the sample IDs provided by the genotyping laboratory with a participant ID (PID) specific to Genomics Ltd. Original genotyping batch IDs were not made available.
Ancestry estimation was performed on samples in the same groups to which they had been assigned for imputation of their genotype data. Both the NNLS-admixture and NNLS-hard-calling approaches were applied. For each sample, the NNLS-admixture approach returned a proportion for each of the 25 sub-continental regions, and the NNLS-hard-calling approach assigned a single ancestry label chosen from those 25 regions.
References
[1] Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
[2] McVean, G. A Genealogical Interpretation of Principal Components Analysis. PLOS Genet. 5, e1000686 (2009).
[3] Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
[4] Browning, S. R., Waples, R. K. & Browning, B. L. Fast, accurate local ancestry inference with FLARE. Am. J. Hum. Genet. 110, 326–335 (2023).
[5] Leslie, S. et al. The fine-scale genetic structure of the British population. Nature 519, 309–314 (2015).