# Genetic ancestry

### **An overview of the ancestry estimation process**

Global ancestry estimation (GAE) was applied to infer the mixture of populations from which an individual’s genome was inherited. GAE estimates the genome-wide proportions of an individual’s genome that are genetically similar to populations or regions represented on a reference panel. The GAE workflow was developed by Genomics Ltd and is based on principal component analysis (PCA) and non-negative least squares (NNLS).

#### **What initial investigations were performed?**

An initial PCA was performed with UK Biobank whole genome sequencing (WGS) data to assess how well variants from the Our Future Health custom array (OurFutureHealthv1) can detect ancestry differences, particularly at the sub-continental level. PCA was performed with a randomly selected set of European participants from UK Biobank ([UK Biobank Field 23374 (external link)](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23374)), with the WGS data filtered to retain only the Our Future Health array variants. The resulting PCA showed that expanding the Our Future Health array variants by imputation was essential to identify genetic variation at the sub-continental level (Figure 1). HapMap3 variants and the UK Biobank variants used for PCA (UKB-PCA variants) were included as these were likely to be well imputed.

<figure><img src="/files/cSXVlb61B2sxfgDQqus9" alt=""><figcaption><p>Figure 1 - PCA performed on randomly selected European UK Biobank samples. F, Finland; E/N/S/W, East/North/South/West Europe; OFH, Our Future Health.</p></figcaption></figure>

#### **How was a training dataset created to assess performance of the ancestry estimation method?**

Training data was prepared using GraphTyper population level WGS variants from UK Biobank (UKB Field 23374). Variants were filtered for those on the Our Future Health array C2 manifest in addition to HapMap3 variants and UK Biobank variants used in the initial PCA analysis above (UKB-PCA variants). Samples were filtered to exclude those in the initial 200k WGS release to avoid overlap with the imputation reference panel.

Further QC and variant processing measures were applied:

* Retained high quality variants based on a GraphTyper AAscore > 0.5 and a “PASS” FILTER column (Halldorsson et al, 2022 <sup>1</sup>)
* Multi-allelic variants were split into biallelic
* Excluded variants with a MAF < 1%
* Excluded variants and samples with a high level of missingness. In practice only variants were removed, as no samples had missingness > 2.5%
* Retained only unrelated samples
* Imputed missing genotypes using Beagle with default parameters and no reference panel
* Excluded high-LD regions defined by Genomics Ltd.
* Retained variants that were well imputed (dosage R<sup>2</sup> greater than or equal to 0.85). Imputation was performed on a set of test samples of diverse ancestry backgrounds according to the [Genomics Ltd workflow](/our-future-health/data-types/genetic-data/imputed-genotype-data.md).
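
The MAF and missingness filters in the list above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Genomics Ltd implementation: the function names and the toy genotypes are hypothetical, and the variant-level missingness cut-off is an assumption (the 2.5% value is only stated for samples in the text).

```python
def maf(genotypes):
    """Minor allele frequency from ALT-allele counts (0/1/2); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def missingness(genotypes):
    """Fraction of samples with a missing genotype at this variant."""
    return sum(g is None for g in genotypes) / len(genotypes)


def keep_variant(genotypes, min_maf=0.01, max_missing=0.025):
    # min_maf follows the 1% threshold in the text;
    # max_missing is an assumed value for illustration.
    return maf(genotypes) >= min_maf and missingness(genotypes) <= max_missing


# Toy data: one list of genotypes per (hypothetical) variant.
variants = {
    "common": [0, 1, 2, 1, 0, 1, 2, 0],        # MAF = 7/16
    "rare": [0] * 199 + [1],                    # MAF = 1/400
    "sparse": [1, None, 0, 2, None, 1, 0, 1],   # 25% missing
}
kept = [name for name, g in variants.items() if keep_variant(g)]
```

Here only the `common` variant passes both filters.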

#### **How were regions of the reference panel defined?**

A reference panel was derived from UK Biobank WGS data. The following sources were considered when defining regions and the samples they represent:

* The geographic location of the country and its sample size
* PC plots to determine if samples in a certain region were genetically distinct
* The availability of self-reported ethnicity data (defined in UK Biobank fields 1647 and 20115). For example, only samples with “Irish” self-reported ethnicity were allowed to form the IRELAND region in the reference panel.
* Published results in the literature

As the number of Indigenous/Native American samples is limited in UK Biobank, there is only a single Indigenous/Native American region, labelled N\_C\_S\_AMERICA for “North, Central and South America”. These samples were determined via an initial continental level ancestry inference step, based on 4 principal components trained using 1000 Genomes data. Additional regions include 5 within-UK regions, 7 European regions, 4 African regions and 8 Asian regions, giving 25 sub-continental regions in total.

#### How was the reference panel built and refined?

UK Biobank samples from the defined regions that had suitable self-reported ethnicity data were extracted. As different regions of the world are not evenly represented in UK Biobank, over-represented regions were randomly down-sampled to avoid effects of uneven sampling on the PCA projection (McVean 2009 <sup>2</sup>) (Table 1).
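
As a rough illustration of the down-sampling step, the sketch below caps each region at a maximum number of randomly chosen samples. The function name, the sample IDs, the input counts and the cap value are all hypothetical; the actual caps used for each region are not stated here beyond the sizes reported in Table 1.

```python
import random


def downsample_regions(samples_by_region, max_per_region, seed=0):
    """Randomly cap the number of samples per region (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    panel = {}
    for region, samples in samples_by_region.items():
        if len(samples) > max_per_region:
            panel[region] = rng.sample(samples, max_per_region)
        else:
            panel[region] = list(samples)
    return panel


# Hypothetical sample IDs; an over-represented region is reduced,
# a small region is kept whole.
samples_by_region = {
    "IRELAND": [f"ie_{i}" for i in range(1000)],
    "FINLAND": [f"fi_{i}" for i in range(87)],
}
panel = downsample_regions(samples_by_region, max_per_region=800)
```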

**Table 1 - Countries and regions under each geographical label, including sample sizes of the regions**

<table data-first-column-sticky><thead><tr><th>Label / Region</th><th width="292">Countries or Areas</th><th width="136">Size (pre‑filtering)</th><th width="147">Size (post‑filtering)</th></tr></thead><tbody><tr><td>N_AFRICA (North Africa)</td><td>Algeria, Egypt, Libya, Morocco, Tunisia</td><td>199</td><td>172</td></tr><tr><td>E_AFRICA (East Africa)</td><td>Eritrea, Ethiopia, Sudan, Somalia</td><td>103</td><td>90</td></tr><tr><td>W_AFRICA (West Africa)</td><td>Ghana, Liberia, Sierra Leone, Nigeria, Gambia, Guinea, Togo, Senegal, Côte d’Ivoire</td><td>250</td><td>238</td></tr><tr><td>C_S_AFRICA (Central and Southern Africa)</td><td>Angola, Congo, Kenya, Cameroon, Zambia, South Africa, Uganda, Zimbabwe, United Republic of Tanzania, Central African Republic, Burundi, Rwanda</td><td>250</td><td>211</td></tr><tr><td>CE_ASIA (Central East Asia)</td><td>China, Hong Kong Special Administrative Region, Macao Special Administrative Region, Taiwan</td><td>200</td><td>158</td></tr><tr><td>JAPAN_KOREA</td><td>Japan, Republic of Korea, Democratic People’s Republic of Korea</td><td>140</td><td>133</td></tr><tr><td>SE_ASIA (Southeast Asia)</td><td>Viet Nam, Thailand, Myanmar, Malaysia, Lao People’s Democratic Republic, Indonesia, Cambodia, Philippines, Singapore</td><td>200</td><td>173</td></tr><tr><td>N_C_S_AMERICA (North, Central and South America)</td><td>Belize, Bolivia (Plurinational State of), Chile, Colombia, Ecuador, Mexico, Peru</td><td>60</td><td>50</td></tr><tr><td>INDIA_PAKISTAN</td><td>India, Pakistan</td><td>200</td><td>169</td></tr><tr><td>BANGLADESH</td><td>Bangladesh</td><td>116</td><td>109</td></tr><tr><td>SRI_LANKA</td><td>Sri Lanka</td><td>154</td><td>146</td></tr><tr><td>C_ASIA (Central Asia)</td><td>Afghanistan, Uzbekistan, Kazakhstan, Kyrgyzstan, Tajikistan</td><td>58</td><td>49</td></tr><tr><td>M_EAST_W_ASIA (Middle East and West Asia)</td><td>Jordan, Lebanon, Syrian Arab Republic, Bahrain, Kuwait, Oman, Saudi Arabia, Yemen, Türkiye, Cyprus, Armenia, 
Azerbaijan, Georgia, Iran (Islamic Republic of), Iraq, Israel</td><td>200</td><td>178</td></tr><tr><td>NW_WALES (North-West Wales)</td><td>Isle of Anglesey, Gwynedd, Conwy</td><td>487</td><td>450</td></tr><tr><td>SW_WALES (South-West Wales)</td><td>Ceredigion, Pembrokeshire, Carmarthenshire</td><td>657</td><td>544</td></tr><tr><td>N_ENG_S_SCOT (Northern England and Southern Scotland)</td><td>Dumfries and Galloway, Scottish Borders, Cumbria, Northumberland, County Durham, Sunderland, Gateshead, South Tyneside, Newcastle upon Tyne, North Tyneside</td><td>800</td><td>676</td></tr><tr><td>NI_N_SCOT (Northern Ireland and Scotland)</td><td>Northern Ireland, Scotland less regions in N_ENG_S_SCOT</td><td>800</td><td>703</td></tr><tr><td>C_S_UK (Central and South UK)</td><td>UK regions not in NW_WALES, SW_WALES, N_ENG_S_SCOT, and NI_N_SCOT</td><td>800</td><td>747</td></tr><tr><td>FINLAND</td><td>Finland</td><td>87</td><td>81</td></tr><tr><td>IRELAND</td><td>Ireland</td><td>800</td><td>727</td></tr><tr><td>PORTUGAL_SPAIN</td><td>Spain, Portugal</td><td>324</td><td>297</td></tr><tr><td>SE_EUROPE (South-Eastern Europe)</td><td>Italy, Romania, Greece, Bulgaria, Albania, Bosnia and Herzegovina, Croatia</td><td>594</td><td>522</td></tr><tr><td>N_EUROPE</td><td>Norway, Sweden, Denmark, Iceland</td><td>310</td><td>295</td></tr><tr><td>E_EUROPE</td><td>Poland, Lithuania, Ukraine, Czechia, Slovakia, Russian Federation, Latvia, Hungary</td><td>663</td><td>619</td></tr><tr><td>CW_EUROPE (Central-Western Europe)</td><td>France, Netherlands (Kingdom of the), Belgium, Germany, Switzerland, Austria</td><td>800</td><td>730</td></tr></tbody></table>

Following QC and processing of the UK Biobank WGS genotype data, the following steps were applied:

1. Genotype data for samples in the initial panel were extracted and converted to an N-by-S genotype matrix, where N is the number of samples and S is the number of SNPs/variants. Each cell of the matrix took the value 0, 1 or 2 for the number of copies of the ALT allele in the sample’s genotype.
2. Variants were LD pruned in PLINK with a window size of 1000 variants, a step size of 80 variants and an r<sup>2</sup> threshold of 0.1 (Bycroft et al, 2018 <sup>3</sup>)
3. Variants with a MAF less than 0.005 were removed
4. PCA was performed and samples were projected onto the top 40 PCs
5. The centroid of the PC projections of the samples was computed for each region. The median of the projections on a PC was taken as the position of the centroid along that PC.
6. For each sample, the distance to the centroid of its region of origin was calculated
7. For each region, the mean and standard deviation of the distance to the regional centroid was calculated using all samples from that region. Samples with a distance greater than the mean + SD were excluded.

This resulted in a refined reference panel with outliers removed. The refined panel was used to perform PCA and obtain final PCs and the projection of samples.
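
Steps 5 to 7 above can be sketched for a single region as follows. This is a minimal sketch under stated assumptions: Euclidean distance and the sample standard deviation are assumed (the text does not specify either), and the function name and toy projections are hypothetical.

```python
from math import dist
from statistics import mean, median, stdev


def refine_region(projections):
    """projections: list of per-sample PC coordinate tuples for one region.

    Returns the median-based centroid, the distance cut-off and the
    indices of the samples that are kept.
    """
    n_pcs = len(projections[0])
    # Step 5: centroid position along each PC is the median projection.
    centroid = tuple(median(p[k] for p in projections) for k in range(n_pcs))
    # Step 6: distance of every sample to its regional centroid
    # (Euclidean distance is an assumption).
    distances = [dist(p, centroid) for p in projections]
    # Step 7: exclude samples further away than mean + SD.
    cutoff = mean(distances) + stdev(distances)
    kept = [i for i, d in enumerate(distances) if d <= cutoff]
    return centroid, cutoff, kept


# Toy region with two PCs; the last sample is an obvious outlier.
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
centroid, cutoff, kept = refine_region(toy)
```

Using the median for the centroid keeps it close to the bulk of the region even when outliers are present, which is why the outlying sample in the toy data is the only one removed.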

#### How were informative PCs selected?

PCs were selected based on their ability to distinguish between the regions represented on the reference panel. For each PC axis, the Kruskal-Wallis test was used to compare the distribution of projections among the regions in the panel. Informative PCs were defined as those with a Kruskal-Wallis p-value < 1x10<sup>-20</sup> for which at least one region’s interquartile range of projections on the PC did not include 0. For each informative PC separately, the mean within-region SD of the sample projections was calculated and used to compute downstream likelihood ratio statistics.
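
The two selection criteria can be illustrated with `scipy.stats.kruskal`. The sketch below is not the production code: the function name, the array layout and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import kruskal


def informative_pcs(projections, regions, p_threshold=1e-20):
    """projections: (n_samples, n_pcs) array; regions: one label per sample."""
    regions = np.asarray(regions)
    labels = np.unique(regions)
    selected = []
    for k in range(projections.shape[1]):
        groups = [projections[regions == r, k] for r in labels]
        # Criterion 1: projections differ between regions.
        p_value = kruskal(*groups).pvalue
        # Criterion 2: at least one region's interquartile range excludes 0.
        iqr_excludes_zero = any(
            np.percentile(g, 25) > 0 or np.percentile(g, 75) < 0 for g in groups
        )
        if p_value < p_threshold and iqr_excludes_zero:
            selected.append(k)
    return selected


# Synthetic example: PC 0 separates three regions, PC 1 is pure noise.
rng = np.random.default_rng(0)
regions = np.repeat(["A", "B", "C"], 200)
pc0 = np.concatenate([rng.normal(m, 1.0, 200) for m in (-5.0, 0.0, 5.0)])
pc1 = rng.normal(0.0, 1.0, 600)
selected = informative_pcs(np.column_stack([pc0, pc1]), regions)
```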

#### How were ancestry proportions estimated?

Ancestry proportions were estimated for new samples by applying non-negative least squares (NNLS) strategies to informative PC projections. These methods included:

1. NNLS-admixture: gives an estimate of fractional ancestry proportions that are constrained to be non-negative and sum up to 1
2. NNLS-hard-calling: assigns a sample entirely to the closest regional centroid
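
The two strategies can be sketched with `scipy.optimize.nnls`, fitting a sample's PC projection as a non-negative combination of regional centroids. How the sum-to-one constraint is imposed in the actual workflow is not described here; the heavily weighted extra row used below is one common device and is an assumption, as are the function names and the toy centroids.

```python
import numpy as np
from scipy.optimize import nnls


def nnls_admixture(sample, centroids, weight=1e3):
    """sample: (n_pcs,) projection; centroids: (n_regions, n_pcs).

    Returns non-negative ancestry proportions that sum to 1.
    """
    n_regions = centroids.shape[0]
    # Extra weighted row pushes the coefficients towards summing to 1
    # (an assumed way of imposing the constraint).
    A = np.vstack([centroids.T, weight * np.ones((1, n_regions))])
    b = np.concatenate([sample, [weight]])
    coeffs, _ = nnls(A, b)
    return coeffs / coeffs.sum()


def nnls_hard_call(sample, centroids):
    """Index of the closest regional centroid (Euclidean distance assumed)."""
    return int(np.argmin(np.linalg.norm(centroids - sample, axis=1)))


# Toy panel with three regions in a two-PC space.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sample = 0.6 * centroids[1] + 0.4 * centroids[2]  # a 60/40 mixture
proportions = nnls_admixture(sample, centroids)
hard_call = nnls_hard_call(sample, centroids)
```

For this toy sample the admixture estimate recovers the 60/40 mixture of the second and third regions, while hard-calling assigns the sample entirely to the second region.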

#### How were the NNLS methods validated?

Samples in the refined reference panel were split into a training set (80%) and a testing set (20%), where the training set was used to build a new panel according to the steps described previously. Ancestry proportions were estimated for samples in the testing set. Precision and recall were computed, where the region of birth of the testing samples was treated as truth. In doing so, precision was defined as the proportion of samples called to a region or ancestry group that are correctly called, and recall as the proportion of samples that belong to a region or ancestry group that are correctly called.

Precision and recall were computed for NNLS-admixture by assigning a sample to a region if the estimated contribution of the region was greater than 50%. This resulted in a high level of accuracy for both NNLS-hard-calling and NNLS-admixture methods at the continental level including the UK (Table 2). However, accuracy was found to be lower when looking across the 25 sub-continental regions of the refined panel, for example when both NNLS methods were applied to the central and south UK region (C\_S\_UK) (Table 3). A possible explanation is that the genetic makeup of such a region is partly shared with other regions in the panel.
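
The precision and recall definitions, together with the 50% calling rule for NNLS-admixture, can be written out as a short sketch. The function names and toy data are hypothetical, and treating uncalled samples as counting against recall is an assumption about how the reported figures were computed.

```python
def call_region(proportions, threshold=0.5):
    """Return the region whose estimated contribution exceeds the threshold, else None."""
    region, value = max(proportions.items(), key=lambda kv: kv[1])
    return region if value > threshold else None


def precision_recall(truth, calls, region):
    """Precision and recall for one region, with region of birth treated as truth."""
    correct = sum(t == region and c == region for t, c in zip(truth, calls))
    called = sum(c == region for c in calls)
    actual = sum(t == region for t in truth)
    precision = correct / called if called else float("nan")
    recall = correct / actual if actual else float("nan")
    return precision, recall


# Toy test set of five samples with two regions, A and B.
truth = ["A", "A", "A", "B", "B"]
estimates = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.7, "B": 0.3},
    {"A": 0.5, "B": 0.5},   # no region above 50%, so the sample is not called
    {"A": 0.6, "B": 0.4},   # called to the wrong region
    {"A": 0.2, "B": 0.8},
]
calls = [call_region(e) for e in estimates]
fraction_called = sum(c is not None for c in calls) / len(calls)
precision_a, recall_a = precision_recall(truth, calls, "A")
precision_b, recall_b = precision_recall(truth, calls, "B")
```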

**Table 2 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the continental level plus the UK.**

<table><thead><tr><th width="156.03125" valign="top">Region</th><th width="128" valign="top">NNLS‑ hard‑calling Recall</th><th width="126.3046875" valign="top">NNLS‑ hard‑calling Precision</th><th width="121.03125" valign="top">NNLS‑ admixture Recall</th><th width="120.6171875" valign="top">NNLS‑ admixture Precision</th><th width="142.625" valign="top">Fraction called</th></tr></thead><tbody><tr><td valign="top">AFR</td><td valign="top">0.96</td><td valign="top">0.99</td><td valign="top">0.96</td><td valign="top">0.99</td><td valign="top">1.00</td></tr><tr><td valign="top">N_C_S_AMERICA</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td></tr><tr><td valign="top">ASIA</td><td valign="top">0.98</td><td valign="top">0.99</td><td valign="top">0.96</td><td valign="top">0.99</td><td valign="top">0.98</td></tr><tr><td valign="top">EUR (non‑UK)</td><td valign="top">0.95</td><td valign="top">0.93</td><td valign="top">0.96</td><td valign="top">0.86</td><td valign="top">1.00</td></tr><tr><td valign="top">UK</td><td valign="top">0.94</td><td valign="top">0.95</td><td valign="top">0.83</td><td valign="top">0.95</td><td valign="top">0.99</td></tr></tbody></table>

**Table 3 - Precision and recall rates for NNLS-hard-calling and NNLS-admixture at the sub-continental level.**

<table><thead><tr><th width="175.11328125" valign="top">Region</th><th width="107.96484375" valign="top">Hard‑call Recall</th><th width="112.42578125" valign="top">Hard‑call Precision</th><th width="123.76171875" valign="top">Admixture Recall</th><th width="118.3984375" valign="top">Admixture Precision</th><th valign="top">Fraction called</th></tr></thead><tbody><tr><td valign="top">BANGLADESH</td><td valign="top">1.00</td><td valign="top">0.85</td><td valign="top">0.91</td><td valign="top">1.00</td><td valign="top">0.95</td></tr><tr><td valign="top">CE_ASIA</td><td valign="top">1.00</td><td valign="top">0.86</td><td valign="top">1.00</td><td valign="top">0.82</td><td valign="top">1.00</td></tr><tr><td valign="top">CW_EUROPE</td><td valign="top">0.66</td><td valign="top">0.84</td><td valign="top">0.14</td><td valign="top">0.88</td><td valign="top">0.45</td></tr><tr><td valign="top">C_ASIA</td><td valign="top">0.80</td><td valign="top">0.62</td><td valign="top">0.70</td><td valign="top">0.70</td><td valign="top">0.90</td></tr><tr><td valign="top">C_S_AFRICA</td><td valign="top">0.98</td><td valign="top">1.00</td><td valign="top">0.98</td><td valign="top">1.00</td><td valign="top">1.00</td></tr><tr><td valign="top">C_S_UK</td><td valign="top">0.65</td><td valign="top">0.49</td><td valign="top">0.23</td><td valign="top">0.55</td><td valign="top">0.64</td></tr><tr><td valign="top">E_AFRICA</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">0.94</td><td valign="top">1.00</td><td valign="top">0.94</td></tr><tr><td valign="top">E_EUROPE</td><td valign="top">0.94</td><td valign="top">0.98</td><td valign="top">0.94</td><td valign="top">0.98</td><td valign="top">0.96</td></tr><tr><td valign="top">FINLAND</td><td valign="top">0.94</td><td valign="top">1.00</td><td valign="top">0.94</td><td valign="top">1.00</td><td valign="top">0.94</td></tr><tr><td valign="top">INDIA_PAKISTAN</td><td valign="top">0.68</td><td valign="top">0.96</td><td 
valign="top">0.50</td><td valign="top">0.94</td><td valign="top">0.79</td></tr><tr><td valign="top">IRELAND</td><td valign="top">0.88</td><td valign="top">0.82</td><td valign="top">0.88</td><td valign="top">0.78</td><td valign="top">0.94</td></tr><tr><td valign="top">JAPAN_KOREA</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td></tr><tr><td valign="top">M_EAST_W_ASIA</td><td valign="top">0.89</td><td valign="top">0.94</td><td valign="top">0.81</td><td valign="top">0.94</td><td valign="top">0.89</td></tr><tr><td valign="top">NI_N_SCOT</td><td valign="top">0.60</td><td valign="top">0.70</td><td valign="top">0.40</td><td valign="top">0.71</td><td valign="top">0.72</td></tr><tr><td valign="top">NW_WALES</td><td valign="top">0.64</td><td valign="top">0.98</td><td valign="top">0.63</td><td valign="top">0.98</td><td valign="top">0.76</td></tr><tr><td valign="top">N_AFRICA</td><td valign="top">0.83</td><td valign="top">0.97</td><td valign="top">0.80</td><td valign="top">0.97</td><td valign="top">0.97</td></tr><tr><td valign="top">N_C_S_AMERICA</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td><td valign="top">1.00</td></tr><tr><td valign="top">N_ENG_S_SCOT</td><td valign="top">0.55</td><td valign="top">0.55</td><td valign="top">0.52</td><td valign="top">0.52</td><td valign="top">0.76</td></tr><tr><td valign="top">N_EUROPE</td><td valign="top">0.93</td><td valign="top">0.69</td><td valign="top">0.95</td><td valign="top">0.63</td><td valign="top">0.98</td></tr><tr><td valign="top">PORTUGAL_SPAIN</td><td valign="top">0.98</td><td valign="top">0.83</td><td valign="top">0.98</td><td valign="top">0.92</td><td valign="top">0.98</td></tr><tr><td valign="top">SE_ASIA</td><td valign="top">0.83</td><td valign="top">1.00</td><td valign="top">0.74</td><td valign="top">1.00</td><td valign="top">0.94</td></tr><tr><td valign="top">SE_EUROPE</td><td 
valign="top">0.99</td><td valign="top">0.93</td><td valign="top">0.95</td><td valign="top">0.92</td><td valign="top">0.95</td></tr><tr><td valign="top">SRI_LANKA</td><td valign="top">1.00</td><td valign="top">0.91</td><td valign="top">1.00</td><td valign="top">0.79</td><td valign="top">1.00</td></tr><tr><td valign="top">SW_WALES</td><td valign="top">0.81</td><td valign="top">0.87</td><td valign="top">0.82</td><td valign="top">0.88</td><td valign="top">0.86</td></tr><tr><td valign="top">W_AFRICA</td><td valign="top">1.00</td><td valign="top">0.98</td><td valign="top">1.00</td><td valign="top">0.98</td><td valign="top">1.00</td></tr></tbody></table>

#### Validation with the Coriell Our Future Health array dataset

A subset of the Coriell 1000 Genomes samples (n = 1,998) was genotyped on the Our Future Health array, then phased and imputed using [Genomics Ltd's bespoke pipeline](/our-future-health/data-types/genetic-data/imputed-genotype-data.md). The places of birth of the samples were mapped to the regions of the refined panel, and their ancestry proportions were estimated using both NNLS approaches.

An initial check was performed on the African Ancestry in Southwest US (ASW) samples by comparing the African ancestry inferred with the NNLS-admixture approach against that inferred with the Genomics Ltd chromosome painting pipeline using FLARE <sup>4</sup>. A high degree of concordance was found between the two approaches, suggesting that the NNLS-admixture method is well calibrated to handle admixture (Figure 2). Additional Coriell samples were analysed using both NNLS methods, and the majority of regions showed good performance.

<figure><img src="/files/ahxHUFb0eCbXelIngGPg" alt=""><figcaption><p>Figure 2 - Comparison of African ancestry inferred for ASW Coriell samples by chromosome painting (FLARE) and NNLS-admixture</p></figcaption></figure>

#### Further validation with The People of the British Isles (PoBI) samples

The People of the British Isles (PoBI) <sup>5</sup> collection from the Coriell dataset was genotyped on the Illumina Human 1.2M-Duo genotyping chip. Genotype data were phased and imputed using the Genomics Ltd bespoke pipeline. After applying QC filters, a total of 1,935 samples were retained. The regions of birth of the samples were mapped to the regions of the refined reference panel as closely as possible. NNLS-admixture and NNLS-hard-calling were applied to the PoBI samples, and high recall rates were observed for both approaches. Due to the method of sample collection, admixture is unlikely to be a major contributor to the data patterns. This is reflected in the NNLS-hard-calling recall rates (Table 4) being slightly higher than those of the NNLS-admixture approach (Table 5).

**Table 4 - The average NNLS-hard-calling estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.**

<table><thead><tr><th width="167.40625">Region</th><th width="123.953125">NI_N_SCOT</th><th width="124.7890625">NW_WALES (some C_S_UK)</th><th width="152.296875">N_ENG_S_SCOT</th><th width="124.0546875">SW_WALES (some C_S_UK)</th><th>C_S_UK</th></tr></thead><tbody><tr><td>BANGLADESH</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>CE_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>CW_EUROPE</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.03</td></tr><tr><td>C_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>C_S_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>C_S_UK</td><td>0.03</td><td>0.00</td><td>0.18</td><td>0.00</td><td>0.80</td></tr><tr><td>E_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>E_EUROPE</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>FINLAND</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>INDIA_PAKISTAN</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>IRELAND</td><td>0.05</td><td>0.00</td><td>0.01</td><td>0.00</td><td>0.00</td></tr><tr><td>JAPAN_KOREA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>M_EAST_W_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>NI_N_SCOT</td><td>0.61</td><td>0.00</td><td>0.11</td><td>0.00</td><td>0.01</td></tr><tr><td>NW_WALES</td><td>0.00</td><td>0.99</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>N_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>N_C_S_AMERICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>N_ENG_S_SCOT</td><td>0.31</td><td>0.00</td><td>0.69</td><td>0.00</td><td>0.15</td></tr><tr><td>N_EUROPE</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.01</td></tr><tr><td>PORTUGAL_SPAIN</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>SE_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>SE_EUROPE</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>SRI_LANKA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>SW_WALES</td><td>0.00</td><td>0.01</td><td>0.00</td><td>1.00</td><td>0.01</td></tr><tr><td>W_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr></tbody></table>

**Table 5 - The average NNLS-admixture estimates of the PoBI samples. Each column shows the mean estimate over all the samples from a region.**

<table><thead><tr><th width="166.3125">Region</th><th>NI_N_SCOT</th><th>NW_WALES (some UK_OTHER)</th><th width="153.55078125">N_ENG_S_SCOT</th><th width="124.86328125">SW_WALES (some UK_OTHER)</th><th>UK_OTHER</th></tr></thead><tbody><tr><td>BANGLADESH</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>CE_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>CW_EUROPE</td><td>0.00</td><td>0.00</td><td>0.02</td><td>0.00</td><td>0.09</td></tr><tr><td>C_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>C_S_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>C_S_UK</td><td>0.03</td><td>0.00</td><td>0.08</td><td>0.00</td><td>0.30</td></tr><tr><td>E_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>E_EUROPE</td><td>0.01</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>FINLAND</td><td>0.01</td><td>0.00</td><td>0.01</td><td>0.00</td><td>0.00</td></tr><tr><td>INDIA_PAKISTAN</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>IRELAND</td><td>0.16</td><td>0.00</td><td>0.05</td><td>0.01</td><td>0.02</td></tr><tr><td>JAPAN_KOREA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>M_EAST_W_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>NI_N_SCOT</td><td>0.47</td><td>0.00</td><td>0.15</td><td>0.00</td><td>0.01</td></tr><tr><td>NW_WALES</td><td>0.03</td><td>0.99</td><td>0.02</td><td>0.01</td><td>0.03</td></tr><tr><td>N_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>N_C_S_AMERICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>N_ENG_S_SCOT</td><td>0.17</td><td>0.00</td><td>0.53</td><td>0.01</td><td>0.26</td></tr><tr><td>N_EUROPE</td><td>0.08</td><td>0.00</td><td>0.09</td><td>0.01</td><td>0.20</td></tr><tr><td>PORTUGAL_SPAIN</td><td>0.01</td><td>0.00</td><td>0.01</td><td>0.00</td><td>0.03</td></tr><tr><td>SE_ASIA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>SE_EUROPE</td><td>0.00</td><td>0.00</td><td>0.01</td><td>0.00</td><td>0.01</td></tr><tr><td>SRI_LANKA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr><tr><td>SW_WALES</td><td>0.02</td><td>0.01</td><td>0.02</td><td>0.96</td><td>0.04</td></tr><tr><td>W_AFRICA</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td></tr></tbody></table>

#### How was genetic ancestry inferred with Our Future Health samples?

Genetic ancestry was inferred for Our Future Health participants using a refined reference panel based on UK Biobank as described above. Genotype data for Our Future Health samples were generated with the custom OurFutureHealthv1 beadchip array assay and imputed with phased UK Biobank 200k WGS data as [previously described](/our-future-health/data-types/genetic-data/imputed-genotype-data.md). Our Future Health data were securely transferred to a private project space shared with Genomics Ltd in groups of 2,000 to 8,000 samples. Groups consisted of multiple batches that had been genotyped on the same date. Batches from more than one genotyping date were combined if a group contained fewer than 2,000 samples. Our Future Health data were pseudo-anonymised by replacing the sample IDs provided by the genotyping laboratory with a participant ID (PID) specific to Genomics Ltd. Original genotyping batch IDs were not made available.

Ancestry estimation was performed on samples in the same groups to which they had been assigned for imputation of their genotype data. Both the NNLS-admixture and NNLS-hard-calling approaches were applied. For each sample, an NNLS-admixture proportion was returned for each of the 25 sub-continental regions. In addition, the NNLS-hard-calling approach was used to assign each sample a single ancestry label from the 25 sub-continental regions.

***

#### **References**

1. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
2. McVean, G. A Genealogical Interpretation of Principal Components Analysis. PLOS Genet. 5, e1000686 (2009).
3. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
4. Browning, S. R., Waples, R. K. & Browning, B. L. Fast, accurate local ancestry inference with FLARE. Am. J. Hum. Genet. 110, 326–335 (2023).
5. Leslie, S. et al. The fine-scale genetic structure of the British population. Nature 519, 309–314 (2015).

