Genetic ancestry

Information about the ancestry data in the Our Future Health Resource. This documentation includes the scope and structure of the data, and the process of ancestry estimation.

An overview of the ancestry estimation process

Global ancestry estimation (GAE) was applied to infer the mixture of populations from which an individual’s genome was inherited. GAE estimates the genome-wide proportions of an individual’s genome that are genetically similar to the populations or regions represented in a reference panel. The GAE workflow was developed by Genomics Ltd and is based on principal component analysis (PCA) and non-negative least squares (NNLS).

What initial investigations were performed?

An initial PCA was performed with UK Biobank whole genome sequencing (WGS) data to assess how well variants on the Our Future Health custom array (OurFutureHealthv1) can detect ancestry differences, particularly at the sub-continental level. PCA was performed on a randomly selected set of European participants from UK Biobank (UK Biobank Field 23374), with the WGS data filtered to retain only the Our Future Health array variants. The resulting PCA showed that the array variants needed to be expanded by imputation to identify genetic variation at the sub-continental level (Figure 1). HapMap3 variants and the UK Biobank variants used for PCA (UKB-PCA variants) were included as these were likely to be well imputed.

Figure 1 - PCA performed on randomly selected European UK Biobank samples. F, Finland; E/N/W/S, East/North/West/South Europe; OFH, Our Future Health.

How was a training dataset created to assess performance of the ancestry estimation method?

Training data was prepared using GraphTyper population level WGS variants from UK Biobank (UKB Field 23374). Variants were filtered for those on the Our Future Health array C2 manifest in addition to HapMap3 variants and UK Biobank variants used in the initial PCA analysis above (UKB-PCA variants). Samples were filtered to exclude those in the initial 200k WGS release to avoid overlap with the imputation reference panel.

Further QC and variant processing measures were applied:

  • Retained high quality variants based on a GraphTyper AAscore > 0.5 and a “PASS” FILTER column (Halldorsson et al, 2022 1)

  • Multi-allelic variants were split into biallelic

  • Excluded variants with a MAF < 1%

  • Excluded variants and samples with a high level of missingness. This was only applied to variants as no samples were found to have missingness > 2.5%

  • Retained only unrelated samples

  • Imputed missing genotypes using Beagle with default parameters and no reference panel

  • Excluded high-LD regions defined by Genomics Ltd.

  • Retained variants that were well imputed (dosage R2 greater than or equal to 0.85). Imputation was performed on a set of test samples of diverse ancestry backgrounds according to the Genomics Ltd workflow.
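
The threshold-based filters in the list above can be sketched as a single predicate. This is a minimal illustration, not the Genomics Ltd workflow: the `Variant` record, its field names and the example values are hypothetical, and the 2.5% missingness cut-off is borrowed from the figure quoted for samples, so applying it to variants is an assumption. Relatedness filtering, Beagle imputation and high-LD region exclusion are not covered.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant summary record (field names are illustrative)."""
    variant_id: str
    aa_score: float      # GraphTyper AAscore
    filter_flag: str     # value of the VCF FILTER column
    maf: float           # minor allele frequency
    missing_rate: float  # fraction of samples with a missing genotype
    dosage_r2: float     # imputation dosage R2 from the test imputation


MIN_AA_SCORE = 0.5    # retain AAscore > 0.5
MIN_MAF = 0.01        # exclude MAF < 1%
MAX_MISSING = 0.025   # assumption: the text gives 2.5% for samples only
MIN_DOSAGE_R2 = 0.85  # retain dosage R2 >= 0.85


def passes_qc(v: Variant) -> bool:
    """Return True if the variant survives every threshold-based filter."""
    return (
        v.aa_score > MIN_AA_SCORE
        and v.filter_flag == "PASS"
        and v.maf >= MIN_MAF
        and v.missing_rate <= MAX_MISSING
        and v.dosage_r2 >= MIN_DOSAGE_R2
    )


variants = [
    Variant("rs1", 0.9, "PASS", 0.12, 0.001, 0.97),
    Variant("rs2", 0.4, "PASS", 0.20, 0.000, 0.99),     # fails AAscore
    Variant("rs3", 0.8, "LowQual", 0.30, 0.000, 0.95),  # fails FILTER
    Variant("rs4", 0.8, "PASS", 0.004, 0.000, 0.95),    # fails MAF
    Variant("rs5", 0.8, "PASS", 0.05, 0.000, 0.80),     # fails dosage R2
    Variant("rs6", 0.7, "PASS", 0.01, 0.025, 0.85),     # sits on the boundaries
]
kept = [v.variant_id for v in variants if passes_qc(v)]
print(kept)
```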

How were regions of the reference panel defined?

A reference panel was derived from UK Biobank WGS data. The following sources were considered when defining regions and the samples they represent:

  • The geographic location of the country and its sample size

  • PC plots to determine if samples in a certain region were genetically distinct

  • The availability of self-reported ethnicity data (defined in UK Biobank fields 1647 and 20115). For example, only samples with “Irish” self-reported ethnicity were allowed to form the IRELAND region in the reference panel.

  • Published results in the literature

As the number of Indigenous/Native American samples in UK Biobank is limited, there is only a single Indigenous/Native American region, labelled N_C_S_AMERICA for “North, Central and South America”. These samples were identified in an initial continental-level ancestry inference step based on four principal components trained on 1000 Genomes data. The remaining regions comprise 5 within-UK regions, 7 European regions, 4 African regions and 8 Asian regions, giving 25 sub-continental regions in total.

How was the reference panel built and refined?

UK Biobank samples from the defined regions that had suitable self-reported ethnicity data were extracted. As different regions of the world are not evenly represented in UK Biobank, over-represented regions were randomly down-sampled to avoid effects of uneven sampling on the PCA projection (McVean 2009 2) (Table 1).
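
Random down-sampling of over-represented regions can be sketched as follows. The cap of 800 is an assumption suggested by the largest pre-filtering sizes in Table 1, not a value stated in the text, and the function name, sample IDs and seed are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample_by_region(sample_regions, cap, seed=0):
    """Keep at most `cap` randomly chosen samples per region.

    sample_regions maps a sample ID to its region label.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_region = defaultdict(list)
    for sample_id, region in sample_regions.items():
        by_region[region].append(sample_id)

    kept = []
    for region in sorted(by_region):
        ids = sorted(by_region[region])
        if len(ids) > cap:
            ids = rng.sample(ids, cap)  # sampling without replacement
        kept.extend(ids)
    return kept


# Toy input: one over-represented region and one small region.
samples = {f"cw{i}": "CW_EUROPE" for i in range(1000)}
samples.update({f"fi{i}": "FINLAND" for i in range(87)})

kept = downsample_by_region(samples, cap=800)
counts = Counter(samples[s] for s in kept)
print(counts)
```

Regions below the cap are left untouched, so only the over-represented regions lose samples.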

Table 1 - Countries and regions under each geographical label, including sample sizes of the regions

| Label / Region | Countries or areas | Size (pre-filtering) | Size (post-filtering) |
| --- | --- | --- | --- |
| N_AFRICA (North Africa) | Algeria, Egypt, Libya, Morocco, Tunisia | 199 | 172 |
| E_AFRICA (East Africa) | Eritrea, Ethiopia, Sudan, Somalia | 103 | 90 |
| W_AFRICA (West Africa) | Ghana, Liberia, Sierra Leone, Nigeria, Gambia, Guinea, Togo, Senegal, Côte d’Ivoire | 250 | 238 |
| C_S_AFRICA (Central and Southern Africa) | Angola, Congo, Kenya, Cameroon, Zambia, South Africa, Uganda, Zimbabwe, United Republic of Tanzania, Central African Republic, Burundi, Rwanda | 250 | 211 |
| CE_ASIA (Central East Asia) | China, Hong Kong Special Administrative Region, Macao Special Administrative Region, Taiwan | 200 | 158 |
| JAPAN_KOREA | Japan, Republic of Korea, Democratic People’s Republic of Korea | 140 | 133 |
| SE_ASIA (Southeast Asia) | Viet Nam, Thailand, Myanmar, Malaysia, Lao People’s Democratic Republic, Indonesia, Cambodia, Philippines, Singapore | 200 | 173 |
| N_C_S_AMERICA (North, Central and South America) | Belize, Bolivia (Plurinational State of), Chile, Colombia, Ecuador, Mexico, Peru | 60 | 50 |
| INDIA_PAKISTAN | India, Pakistan | 200 | 169 |
| BANGLADESH | Bangladesh | 116 | 109 |
| SRI_LANKA | Sri Lanka | 154 | 146 |
| C_ASIA (Central Asia) | Afghanistan, Uzbekistan, Kazakhstan, Kyrgyzstan, Tajikistan | 58 | 49 |
| M_EAST_W_ASIA (Middle East and West Asia) | Jordan, Lebanon, Syrian Arab Republic, Bahrain, Kuwait, Oman, Saudi Arabia, Yemen, Türkiye, Cyprus, Armenia, Azerbaijan, Georgia, Iran (Islamic Republic of), Iraq, Israel | 200 | 178 |
| NW_WALES (North-West Wales) | Isle of Anglesey, Gwynedd, Conwy | 487 | 450 |
| SW_WALES (South-West Wales) | Ceredigion, Pembrokeshire, Carmarthenshire | 657 | 544 |
| N_ENG_S_SCOT (Northern England and Southern Scotland) | Dumfries and Galloway, Scottish Borders, Cumbria, Northumberland, County Durham, Sunderland, Gateshead, South Tyneside, Newcastle upon Tyne, North Tyneside | 800 | 676 |
| NI_N_SCOT (Northern Ireland and Scotland) | Northern Ireland, Scotland less regions in N_ENG_S_SCOT | 800 | 703 |
| C_S_UK (Central and South UK) | UK regions not in NW_WALES, SW_WALES, N_ENG_S_SCOT, and NI_N_SCOT | 800 | 747 |
| FINLAND | Finland | 87 | 81 |
| IRELAND | Ireland | 800 | 727 |
| PORTUGAL_SPAIN | Spain, Portugal | 324 | 297 |
| SE_EUROPE (South-Eastern Europe) | Italy, Romania, Greece, Bulgaria, Albania, Bosnia and Herzegovina, Croatia | 594 | 522 |
| N_EUROPE | Norway, Sweden, Denmark, Iceland | 310 | 295 |
| E_EUROPE | Poland, Lithuania, Ukraine, Czechia, Slovakia, Russian Federation, Latvia, Hungary | 663 | 619 |
| CW_EUROPE (Central-Western Europe) | France, Netherlands (Kingdom of the), Belgium, Germany, Switzerland, Austria | 800 | 730 |

Following QC and processing of the UK Biobank WGS genotype data, the following steps were applied:

  1. Genotype data for samples in the initial panel were extracted and converted to an N-by-S genotype matrix, where N is the number of samples and S is the number of SNPs/variants. Each cell of the matrix took the value 0, 1 or 2, the number of copies of the ALT allele in the sample’s genotype.

  2. Variants were LD pruned in PLINK with a window size of 1000 variants, a step size of 80 variants and an r² threshold of 0.1 (Bycroft et al, 2018 3)

  3. Variants with a MAF less than 0.005 were removed

  4. PCA was performed and samples were projected onto the top 40 PCs

  5. The centroid of the PC projections of the samples was computed for each region. The median of the projections on a PC was taken as the position of the centroid along that PC.

  6. For each sample, the distance to the centroid of its region of origin was calculated

  7. For each region, the mean and standard deviation (SD) of the distance to the regional centroid were calculated using all samples from that region. Samples with a distance greater than the mean + SD were excluded.

This resulted in a refined reference panel with outliers removed. The refined panel was used to perform PCA and obtain final PCs and the projection of samples.
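
Steps 5 to 7 above (median centroid, distance to centroid, exclusion beyond mean + SD) can be sketched as below. The text does not name the distance metric or the SD estimator, so Euclidean distance and the population SD are assumptions here, and the function name and toy coordinates are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def remove_outliers(projections, regions):
    """Return the IDs of samples retained after centroid-based filtering.

    projections maps a sample ID to its list of PC coordinates.
    regions maps a sample ID to its region label.
    """
    by_region = defaultdict(list)
    for sample_id, region in regions.items():
        by_region[region].append(sample_id)

    kept = set()
    for ids in by_region.values():
        n_pcs = len(projections[ids[0]])
        # Step 5: the centroid is the per-PC median of the region's projections.
        centroid = [
            statistics.median(projections[s][k] for s in ids)
            for k in range(n_pcs)
        ]
        # Step 6: distance of each sample to its own region's centroid
        # (Euclidean distance is an assumption).
        dist = {s: math.dist(projections[s], centroid) for s in ids}
        # Step 7: exclude samples further away than mean + SD
        # (population SD is an assumption).
        mean_d = statistics.mean(dist.values())
        sd_d = statistics.pstdev(dist.values())
        kept.update(s for s in ids if dist[s] <= mean_d + sd_d)
    return kept


# Toy example with two PCs: five tightly clustered samples and one outlier.
projections = {
    "a": [0.0, 0.0], "b": [0.1, 0.0], "c": [-0.1, 0.0],
    "d": [0.0, 0.1], "e": [0.0, -0.1], "far": [5.0, 5.0],
}
regions = {s: "REGION_X" for s in projections}
kept = remove_outliers(projections, regions)
print(sorted(kept))
```

Using the median rather than the mean for the centroid keeps the centroid itself from being pulled towards the outliers it is meant to detect.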

How were informative PCs selected?

PCs were selected based on their ability to distinguish between the regions represented in the reference panel. For each PC, the Kruskal-Wallis test was used to compare the distribution of projections among the regions in the panel. Informative PCs were defined as those with a Kruskal-Wallis p-value < 1 × 10⁻²⁰ for which the interquartile range of the projections of at least one region did not include 0. The mean within-region SD was calculated from the projections of the samples along each informative PC separately and was used to compute downstream likelihood ratio statistics.
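
The two selection criteria can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the production code: the Kruskal-Wallis p-value uses the usual chi-square approximation with tie correction, quartiles use the default method of `statistics.quantiles`, and all function names and toy data are hypothetical.

```python
import math
import statistics


def chi2_sf(h, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    x = h / 2.0
    if x <= 0.0:
        return 1.0
    # Closed-form start values, then the recurrence
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups, using average ranks for ties."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:  # every value identical
        return 0.0, 1.0
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3.0 * (n + 1)
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


def is_informative(by_region, p_threshold=1e-20):
    """Apply both criteria to one PC; by_region maps region -> projections."""
    _, p = kruskal_wallis(list(by_region.values()))

    def iqr_excludes_zero(values):
        q1, _, q3 = statistics.quantiles(values, n=4)
        return q1 > 0.0 or q3 < 0.0

    return p < p_threshold and any(
        iqr_excludes_zero(v) for v in by_region.values()
    )


# A PC that separates three regions, and one on which they overlap fully.
pc_a = {
    "R1": [-5.0 + 0.01 * i for i in range(100)],
    "R2": [-0.5 + 0.01 * i for i in range(100)],
    "R3": [4.0 + 0.01 * i for i in range(100)],
}
shared = [-0.5 + 0.01 * i for i in range(100)]
pc_b = {"R1": list(shared), "R2": list(shared), "R3": list(shared)}
print(is_informative(pc_a), is_informative(pc_b))
```

With a threshold as small as 1 × 10⁻²⁰, each region needs a reasonably large sample before any PC can pass, which the reference panel sizes in Table 1 provide.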

How were ancestry proportions estimated?

Ancestry proportions were estimated for new samples by applying non-negative least squares (NNLS) strategies to informative PC projections. These methods included:

  1. NNLS-admixture: estimates fractional ancestry proportions that are constrained to be non-negative and to sum to 1

  2. NNLS-hard-calling: assigns a sample entirely to the closest regional centroid
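
The two strategies can be illustrated on toy centroids. The sketch below is not the Genomics Ltd solver: the constrained least squares problem is solved here by projected gradient descent onto the probability simplex, a swapped-in technique chosen because it needs only the standard library, and the centroid values, function names and iteration count are hypothetical. Hard-calling is taken to mean the nearest centroid by Euclidean distance, which is also an assumption.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # last index satisfying the condition gives the shift
    return [max(x - theta, 0.0) for x in v]


def admixture_proportions(sample, centroids, n_iter=5000):
    """Fit `sample` as a mixture of regional centroids in PC space.

    Minimises the squared distance between the sample and the weighted
    sum of centroids, with weights non-negative and summing to 1.
    """
    labels = list(centroids)
    k = len(labels)
    # Step size from an upper bound on the gradient's Lipschitz constant.
    bound = 2.0 * sum(c * c for lab in labels for c in centroids[lab])
    step = 1.0 / (bound + 1e-12)
    w = [1.0 / k] * k
    for _ in range(n_iter):
        fitted = [
            sum(w[j] * centroids[labels[j]][d] for j in range(k))
            for d in range(len(sample))
        ]
        resid = [f - s for f, s in zip(fitted, sample)]
        grad = [
            2.0 * sum(r * c for r, c in zip(resid, centroids[lab]))
            for lab in labels
        ]
        w = project_to_simplex([wi - step * g for wi, g in zip(w, grad)])
    return dict(zip(labels, w))


def hard_call(sample, centroids):
    """Assign the sample entirely to the closest regional centroid."""
    return min(centroids, key=lambda lab: math.dist(sample, centroids[lab]))


# Toy centroids on two informative PCs (values are illustrative only).
centroids = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0]}
sample = [0.4, 0.0]  # lies 40% of the way from A to B

props = admixture_proportions(sample, centroids)
label = hard_call(sample, centroids)
print({k: round(v, 3) for k, v in props.items()}, label)
```

The example shows how the two outputs differ for the same sample: the admixture fit splits it between A and B, while the hard call assigns it wholly to A.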

How were the NNLS methods validated?

Samples in the refined reference panel were split into a training set (80%) and a testing set (20%), where the training set was used to build a new panel according to the steps described previously. Ancestry proportions were estimated for samples in the testing set. Precision and recall were computed, where the region of birth of the testing samples was treated as truth. In doing so, precision was defined as the proportion of samples called to a region or ancestry group that are correctly called, and recall as the proportion of samples that belong to a region or ancestry group that are correctly called.

Precision and recall were computed for NNLS-admixture by assigning a sample to a region if the estimated contribution of that region was greater than 50%. Both NNLS-hard-calling and NNLS-admixture were highly accurate at the continental level, with the UK treated as a separate group (Table 2). Accuracy was lower across the 25 sub-continental regions of the refined panel, for example when either NNLS method was applied to the central and south UK region (C_S_UK) (Table 3). A possible explanation is that the genetic makeup of this region is partly captured by other regions in the panel.
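
The 50% calling rule and the precision and recall definitions given above can be written out as follows. Function names and the toy data are hypothetical. Counting uncalled samples in the recall denominator follows the wording of the definition, but whether the published figures treat them this way is an assumption.

```python
def admixture_call(proportions, threshold=0.5):
    """Return the region whose proportion exceeds the threshold, else None."""
    region, value = max(proportions.items(), key=lambda kv: kv[1])
    return region if value > threshold else None


def precision_recall(truth, calls, region):
    """Precision and recall for one region.

    truth and calls are parallel lists of labels; calls may contain None
    for samples that were not assigned to any region.
    """
    called_as = sum(1 for c in calls if c == region)
    belong = sum(1 for t in truth if t == region)
    correct = sum(1 for t, c in zip(truth, calls) if t == c == region)
    precision = correct / called_as if called_as else float("nan")
    recall = correct / belong if belong else float("nan")
    return precision, recall


def fraction_called(calls):
    """Share of samples that received a call."""
    return sum(1 for c in calls if c is not None) / len(calls)


# Toy test set: region of birth treated as truth.
truth = ["C_S_UK", "C_S_UK", "C_S_UK", "IRELAND", "IRELAND"]
estimates = [
    {"C_S_UK": 0.70, "IRELAND": 0.30},
    {"C_S_UK": 0.45, "IRELAND": 0.35, "N_EUROPE": 0.20},  # no region > 50%
    {"C_S_UK": 0.20, "IRELAND": 0.80},
    {"C_S_UK": 0.10, "IRELAND": 0.90},
    {"C_S_UK": 0.60, "IRELAND": 0.40},
]
calls = [admixture_call(p) for p in estimates]
uk_precision, uk_recall = precision_recall(truth, calls, "C_S_UK")
print(calls, uk_precision, uk_recall, fraction_called(calls))
```

A sample with no region above 50% lowers recall for its true region without affecting precision, which is one way a region can show low admixture recall alongside a low fraction called.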

Table 2 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the continental level plus the UK.

| Region | NNLS-hard-calling recall | NNLS-hard-calling precision | NNLS-admixture recall | NNLS-admixture precision | Fraction called |
| --- | --- | --- | --- | --- | --- |
| AFR | 0.96 | 0.99 | 0.96 | 0.99 | 1.00 |
| N_C_S_AMERICA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ASIA | 0.98 | 0.99 | 0.96 | 0.99 | 0.98 |
| EUR (non-UK) | 0.95 | 0.93 | 0.96 | 0.86 | 1.00 |
| UK | 0.94 | 0.95 | 0.83 | 0.95 | 0.99 |

Table 3 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the sub-continental level.

| Region | Hard-call recall | Hard-call precision | Admixture recall | Admixture precision | Fraction called |
| --- | --- | --- | --- | --- | --- |
| BANGLADESH | 1.00 | 0.85 | 0.91 | 1.00 | 0.95 |
| CE_ASIA | 1.00 | 0.86 | 1.00 | 0.82 | 1.00 |
| CW_EUROPE | 0.66 | 0.84 | 0.14 | 0.88 | 0.45 |
| C_ASIA | 0.80 | 0.62 | 0.70 | 0.70 | 0.90 |
| C_S_AFRICA | 0.98 | 1.00 | 0.98 | 1.00 | 1.00 |
| C_S_UK | 0.65 | 0.49 | 0.23 | 0.55 | 0.64 |
| E_AFRICA | 1.00 | 1.00 | 0.94 | 1.00 | 0.94 |
| E_EUROPE | 0.94 | 0.98 | 0.94 | 0.98 | 0.96 |
| FINLAND | 0.94 | 1.00 | 0.94 | 1.00 | 0.94 |
| INDIA_PAKISTAN | 0.68 | 0.96 | 0.50 | 0.94 | 0.79 |
| IRELAND | 0.88 | 0.82 | 0.88 | 0.78 | 0.94 |
| JAPAN_KOREA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M_EAST_W_ASIA | 0.89 | 0.94 | 0.81 | 0.94 | 0.89 |
| NI_N_SCOT | 0.60 | 0.70 | 0.40 | 0.71 | 0.72 |
| NW_WALES | 0.64 | 0.98 | 0.63 | 0.98 | 0.76 |
| N_AFRICA | 0.83 | 0.97 | 0.80 | 0.97 | 0.97 |
| N_C_S_AMERICA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| N_ENG_S_SCOT | 0.55 | 0.55 | 0.52 | 0.52 | 0.76 |
| N_EUROPE | 0.93 | 0.69 | 0.95 | 0.63 | 0.98 |
| PORTUGAL_SPAIN | 0.98 | 0.83 | 0.98 | 0.92 | 0.98 |
| SE_ASIA | 0.83 | 1.00 | 0.74 | 1.00 | 0.94 |
| SE_EUROPE | 0.99 | 0.93 | 0.95 | 0.92 | 0.95 |
| SRI_LANKA | 1.00 | 0.91 | 1.00 | 0.79 | 1.00 |
| SW_WALES | 0.81 | 0.87 | 0.82 | 0.88 | 0.86 |
| W_AFRICA | 1.00 | 0.98 | 1.00 | 0.98 | 1.00 |

Validation with the Coriell Our Future Health array dataset

A subset of the Coriell 1000 Genomes samples (1,998 samples) was genotyped on the Our Future Health array, then phased and imputed using the Genomics Ltd bespoke pipeline. The places of birth of the samples were mapped to the regions of the refined panel, and their ancestry proportions were estimated using both NNLS approaches.

An initial check compared the African ancestry of the African Ancestry in Southwest US (ASW) samples as inferred by the NNLS-admixture approach and by the Genomics Ltd chromosome painting pipeline using FLARE (Browning et al, 2023 4). A high degree of concordance was found between the two approaches, suggesting that the NNLS-admixture method is well calibrated to handle admixture (Figure 2). Additional Coriell samples were analysed using both NNLS methods, and the majority of regions showed good performance.

Figure 2 - Comparison of African ancestry inferred for ASW Coriell samples by chromosome painting (FLARE) and NNLS-admixture

Further validation with The People of the British Isles (PoBI) samples

Samples in the People of the British Isles (PoBI) collection (Leslie et al, 2015 5) from the Coriell dataset were genotyped on the Illumina Human 1.2M-Duo genotyping chip. Genotype data were phased and imputed using the Genomics Ltd bespoke pipeline. After applying QC filters, a total of 1,935 samples were retained. The regions of birth of the samples were mapped to the regions of the refined reference panel as closely as possible. NNLS-admixture and NNLS-hard-calling were applied to the PoBI samples, and high recall rates were observed for both approaches. Because of the way the samples were collected, admixture is unlikely to be a major contributor to the patterns in the data. This is reflected in the NNLS-hard-calling recall rates (Table 4) being slightly higher than those of the NNLS-admixture approach (Table 5).
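
One plausible reading of how the per-region averages in Tables 4 and 5 are formed is sketched below: estimates are grouped by the sample's mapped region of birth and averaged per reference panel region. The function name and toy data are hypothetical. A hard call is represented as a proportion of 1.0 for the assigned region, so its column mean is the fraction of samples assigned there.

```python
from collections import defaultdict


def mean_estimates_by_birth_region(estimates, birth_region):
    """Average per-sample estimates within each region of birth.

    estimates maps sample ID -> {panel region: proportion}.
    birth_region maps sample ID -> mapped region of birth.
    Returns {region of birth: {panel region: mean proportion}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for sample_id, props in estimates.items():
        column = birth_region[sample_id]
        counts[column] += 1
        for panel_region, p in props.items():
            sums[column][panel_region] += p
    return {
        column: {r: total / counts[column] for r, total in rows.items()}
        for column, rows in sums.items()
    }


# Toy hard calls for four samples born in the same region.
hard_calls = {
    "s1": {"NW_WALES": 1.0},
    "s2": {"NW_WALES": 1.0},
    "s3": {"NW_WALES": 1.0},
    "s4": {"C_S_UK": 1.0},
}
born_in = {s: "NW_WALES" for s in hard_calls}
table = mean_estimates_by_birth_region(hard_calls, born_in)
print(table)
```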

Table 4 - The average NNLS-hard-calling estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.

| Region | NI_N_SCOT | NW_WALES (some C_S_UK) | N_ENG_S_SCOT | SW_WALES (some C_S_UK) | C_S_UK |
| --- | --- | --- | --- | --- | --- |
| BANGLADESH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CW_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 |
| C_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_UK | 0.03 | 0.00 | 0.18 | 0.00 | 0.80 |
| E_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| E_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| FINLAND | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| INDIA_PAKISTAN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IRELAND | 0.05 | 0.00 | 0.01 | 0.00 | 0.00 |
| JAPAN_KOREA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| M_EAST_W_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| NI_N_SCOT | 0.61 | 0.00 | 0.11 | 0.00 | 0.01 |
| NW_WALES | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 |
| N_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_C_S_AMERICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_ENG_S_SCOT | 0.31 | 0.00 | 0.69 | 0.00 | 0.15 |
| N_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| PORTUGAL_SPAIN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SE_EUROPE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SRI_LANKA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SW_WALES | 0.00 | 0.01 | 0.00 | 1.00 | 0.01 |
| W_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

Table 5 - The average NNLS-admixture estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.

| Region | NI_N_SCOT | NW_WALES (some UK_OTHER) | N_ENG_S_SCOT | SW_WALES (some UK_OTHER) | UK_OTHER |
| --- | --- | --- | --- | --- | --- |
| BANGLADESH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| CW_EUROPE | 0.00 | 0.00 | 0.02 | 0.00 | 0.09 |
| C_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| C_S_UK | 0.03 | 0.00 | 0.08 | 0.00 | 0.30 |
| E_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| E_EUROPE | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| FINLAND | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 |
| INDIA_PAKISTAN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IRELAND | 0.16 | 0.00 | 0.05 | 0.01 | 0.02 |
| JAPAN_KOREA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| M_EAST_W_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| NI_N_SCOT | 0.47 | 0.00 | 0.15 | 0.00 | 0.01 |
| NW_WALES | 0.03 | 0.99 | 0.02 | 0.01 | 0.03 |
| N_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_C_S_AMERICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| N_ENG_S_SCOT | 0.17 | 0.00 | 0.53 | 0.01 | 0.26 |
| N_EUROPE | 0.08 | 0.00 | 0.09 | 0.01 | 0.20 |
| PORTUGAL_SPAIN | 0.01 | 0.00 | 0.01 | 0.00 | 0.03 |
| SE_ASIA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SE_EUROPE | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 |
| SRI_LANKA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SW_WALES | 0.02 | 0.01 | 0.02 | 0.96 | 0.04 |
| W_AFRICA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

How was genetic ancestry inferred with Our Future Health samples?

Genetic ancestry was inferred for Our Future Health participants using a refined reference panel based on UK Biobank as described above. Genotype data for Our Future Health samples were generated with the custom OurFutureHealthv1 beadchip array assay and imputed with phased UK Biobank 200k WGS data as previously described. Our Future Health data were securely transferred to a private project space shared with Genomics Ltd in groups of 2,000 to 8,000 samples. Groups consisted of multiple batches that had been genotyped on the same date. Batches from more than one genotyping date were combined if a group contained fewer than 2,000 samples. Our Future Health data were pseudonymised by replacing the sample IDs provided by the genotyping laboratory with a participant ID (PID) specific to Genomics Ltd. Original genotyping batch IDs were not made available.

Ancestry estimation was performed on samples in the same groups to which they had been assigned for imputation of their genotype data. Both the NNLS-admixture and NNLS-hard-calling approaches were applied. For each sample, NNLS-admixture returned a proportion for each of the 25 sub-continental regions. In addition, the NNLS-hard-calling approach assigned each sample a single ancestry label from the 25 sub-continental regions.


References

  1. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

  2. McVean, G. A Genealogical Interpretation of Principal Components Analysis. PLOS Genet. 5, e1000686 (2009).

  3. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  4. Browning, S. R., Waples, R. K. & Browning, B. L. Fast, accurate local ancestry inference with FLARE. Am. J. Hum. Genet. 110, 326–335 (2023).

  5. Leslie, S. et al. The fine-scale genetic structure of the British population. Nature 519, 309–314 (2015).
