Genetic kinship
The Our Future Health cohort includes individuals from a wide range of ancestral backgrounds. Because population structure can influence kinship estimates, ancestry must be accounted for in kinship analysis. We followed the methodology outlined in Bycroft et al. (2018). To account for the diverse backgrounds of our samples, kinship was estimated using the SNPs that contribute least to ancestry, as determined from a first Principal Component Analysis (PCA-1).
Kinship analysis was implemented using the KING software.
The workflow was as follows:

The above plot can be summarized as:
Kinship-1 → PCA-1 → Kinship-2 → PCA-2
Detailed steps for kinship and ancestry analysis:
1. Select high-quality SNPs:
   Non-autosomal variants are excluded
   Variants with missingness >2% are excluded
   Variants in extended high-LD regions are excluded
   Variants are pruned for LD using an r2 threshold of 0.1
   Rare variants with MAF <0.01 are excluded
   Variants that are multi-allelic, mismapped or subject to any other known issue are excluded
   Variants deviating from Hardy-Weinberg equilibrium (HWE; threshold 1e-6) are excluded
2. Run Kinship-1 and define a set of unrelated individuals (related more distantly than third degree) based on KING.
3. Run PCA-1 using the unrelated samples obtained at step 2 above.
4. Using the first 3 PCs, extract the SNPs whose loadings contribute least to the ancestry estimation. We retained 50% of the SNPs.
5. Project the related samples (those excluded following step 2 above) onto the PCs calculated from the unrelated samples.
6. Calculate PC-adjusted heterozygosity.
7. Exclude samples based on PC-adjusted heterozygosity and missingness.
8. Using the SNPs and samples selected in the steps above, run Kinship-2.
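Steps 1 and 4 above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the dictionary keys (`chrom`, `missing_rate`, `maf`, `hwe_p`), the function names, and the reading of the HWE threshold as "exclude when p < 1e-6" are assumptions made for illustration. The source also does not state how a SNP's contribution across the first 3 PCs is scored, so the sum of squared loadings used here is an assumption as well.

```python
# Illustrative sketch only; key names and the scoring rule are assumptions.
AUTOSOMES = {str(c) for c in range(1, 23)}


def passes_basic_qc(snp):
    """Apply the per-variant thresholds from step 1 to one SNP record.

    `snp` is a dict with hypothetical keys: 'chrom', 'missing_rate',
    'maf' and 'hwe_p'. LD pruning and region-based exclusions are not
    covered here because they need genotype-level data.
    """
    return (
        snp["chrom"] in AUTOSOMES          # autosomes only
        and snp["missing_rate"] <= 0.02    # exclude missingness > 2%
        and snp["maf"] >= 0.01             # exclude MAF < 0.01
        and snp["hwe_p"] >= 1e-6           # assumed: exclude HWE p < 1e-6
    )


def select_low_loading_snps(loadings, keep_fraction=0.5, n_pcs=3):
    """Return the SNP IDs that contribute least to the first `n_pcs` PCs.

    `loadings` maps SNP ID -> list of per-PC loadings. The score is the
    sum of squared loadings over the first `n_pcs` PCs (an assumption).
    """
    scores = {
        snp_id: sum(v * v for v in values[:n_pcs])
        for snp_id, values in loadings.items()
    }
    # Sort by score, then by ID so that ties are resolved deterministically.
    ranked = sorted(scores, key=lambda snp_id: (scores[snp_id], snp_id))
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep]
```

With `keep_fraction=0.5`, the second function keeps the half of the SNPs with the smallest combined loadings, mirroring the 50% figure quoted in step 4.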
PCA
PCA implementation
The fastPCA implementation in PLINK 2 is used for PCA calculations. This was run in parallel with the kinship analysis. The details of the algorithm are as follows:
Using the unrelated samples obtained from step 8 above, calculate PCA-2:
   40 PCs are calculated for unrelated individuals using PLINK 2
   PCs for the excluded related individuals are projected onto the PCA matrix
   Projected PCs are rescaled to match the calculated PCs
See the Genetic Ancestry section for the PCA plots.
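The rescaling of projected PCs can be pictured with the sketch below. The source does not say how the rescaling was done, so this shows one possible approach under a stated assumption: the unrelated samples are also projected, and a per-PC scale factor is fitted by least squares (through the origin) between their projected and calculated scores. The function names are hypothetical.

```python
# Illustrative sketch only; the rescaling rule is an assumption.
def fit_scale(calculated, projected):
    """Least-squares scale factor s minimising sum((c - s * p) ** 2).

    `calculated` and `projected` are scores on one PC for the same
    unrelated samples, in the same order.
    """
    denominator = sum(p * p for p in projected)
    if denominator == 0:
        raise ValueError("projected scores are all zero")
    numerator = sum(c * p for c, p in zip(calculated, projected))
    return numerator / denominator


def rescale(projected, scale):
    """Put projected scores on the same scale as the calculated PCs."""
    return [p * scale for p in projected]
```

The factor fitted on the unrelated samples would then be applied to the projected scores of the related samples, one PC at a time.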
Kinship plot

Kinship file
The kinship file contains the following columns:

| Field | Type | Description |
|---|---|---|
| ID1 | string | Sample_id for individual 1 in related pair |
| ID2 | string | Sample_id for individual 2 in related pair |
| HetHet | numeric | Fraction of markers for which the pair both have a heterozygous genotype (output from KING software) |
| IBS0 | numeric | Fraction of markers for which the pair shares zero alleles (output from KING software) |
| Kinship | numeric | Estimate of the kinship coefficient for this pair based on the set of markers used in the kinship inference (output from KING software). The set of markers is indicated by the field: used.in.kinship.inference. |
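As a rough illustration of how these columns might be used, the sketch below reads a kinship table and assigns a degree of relatedness to each pair using the kinship-coefficient cut-offs commonly cited for KING (0.354, 0.177, 0.0884 and 0.0442). The tab delimiter and the function names are assumptions; the release's actual file layout may differ.

```python
# Illustrative sketch only; delimiter and function names are assumptions.
import csv


def degree_from_kinship(kinship):
    """Map a KING kinship coefficient to a relatedness category."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "first-degree"
    if kinship > 0.0884:
        return "second-degree"
    if kinship > 0.0442:
        return "third-degree"
    return "unrelated"


def read_kinship(handle, delimiter="\t"):
    """Parse kinship rows from an open text handle into typed dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        rows.append(
            {
                "ID1": row["ID1"],
                "ID2": row["ID2"],
                "HetHet": float(row["HetHet"]),
                "IBS0": float(row["IBS0"]),
                "Kinship": float(row["Kinship"]),
                # Derived column, not part of the release file.
                "degree": degree_from_kinship(float(row["Kinship"])),
            }
        )
    return rows
```

Passing any open text handle (for example from `open(path, newline="")`) to `read_kinship` returns one dictionary per related pair, with a derived `degree` entry added.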
What should I be aware of when working with the kinship data in this release?
A small number of samples (78) were estimated to have an implausibly large number of third-degree (or closer) relatives. We also identified 129 sample pairs with IBS0 = 0, a pattern typically consistent with a parent–offspring relationship, whose kinship coefficients were below the first-degree threshold and fell within the second-degree range. These pairs were therefore classified as second-degree relatives in the genotype data release. Researchers should investigate these cases further and apply any sample exclusions based on genetic relatedness as part of their own statistical QC.
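Researchers who want to locate these cases in their own QC could start from something like the sketch below. It counts third-degree-or-closer relatives per sample and lists pairs with IBS0 = 0 whose kinship falls in the second-degree range. The cut-offs are the commonly cited KING values; the function names are hypothetical, and the source does not state which relative count was treated as implausible, so that choice is left to the user.

```python
# Illustrative sketch only; function names are hypothetical.
from collections import Counter

FIRST_DEGREE_MIN = 0.177
SECOND_DEGREE_MIN = 0.0884
THIRD_DEGREE_MIN = 0.0442


def relatives_per_sample(pairs):
    """Count third-degree-or-closer relatives for every sample.

    `pairs` is an iterable of dicts with 'ID1', 'ID2' and 'Kinship'.
    """
    counts = Counter()
    for pair in pairs:
        if pair["Kinship"] > THIRD_DEGREE_MIN:
            counts[pair["ID1"]] += 1
            counts[pair["ID2"]] += 1
    return counts


def ibs0_zero_second_degree(pairs):
    """Return pairs with IBS0 == 0 and kinship in the second-degree range."""
    return [
        pair
        for pair in pairs
        if pair["IBS0"] == 0
        and SECOND_DEGREE_MIN < pair["Kinship"] <= FIRST_DEGREE_MIN
    ]
```

Sorting the output of `relatives_per_sample` with `most_common()` gives a quick view of the samples with the largest number of relatives.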
While no significant batch effects were observed in PCA plots, an analysis of the observed versus expected frequency of third-degree related pairs as a function of cohort size showed an unexpected deviation. Briefly, genotyping data were sorted either by consent date or by genotyping date and aggregated in bins of 52,000 samples each. The number of related pairs in each bin was counted and compared against a distribution obtained by randomly shuffling the sample order (see below). While the frequency of first- and second-degree related pairs closely matched the random distribution, the frequency of third-degree related pairs shifted from the baseline, suggesting an elevated number of third-degree related pairs in the first two and the last analysis bins. This shift could result from several factors, including a bias from diverse recruitment, insufficient filtering of rare ancestral SNPs in an ethnically diverse cohort, or a genotyping batch effect. We will continue to investigate this effect and provide an update with the next data release.
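The comparison against a shuffled baseline can be sketched as follows. This is a simplified illustration, not the analysis code: it assumes a related pair is counted in a bin when both members fall in that bin, which the source does not specify, and the function names, seed and number of shuffles are hypothetical.

```python
# Illustrative sketch only; the pair-to-bin rule is an assumption.
import random


def pairs_per_bin(order, pairs, bin_size):
    """Count related pairs whose two members fall in the same bin.

    `order` is the list of sample IDs sorted by consent or genotyping
    date; `pairs` is a list of (id1, id2) tuples.
    """
    bin_of = {sample: index // bin_size for index, sample in enumerate(order)}
    n_bins = (len(order) + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for id1, id2 in pairs:
        if id1 in bin_of and id2 in bin_of and bin_of[id1] == bin_of[id2]:
            counts[bin_of[id1]] += 1
    return counts


def shuffled_baseline(order, pairs, bin_size, n_shuffles=100, seed=0):
    """Mean per-bin pair count after randomly shuffling the sample order."""
    rng = random.Random(seed)
    shuffled = list(order)
    n_bins = (len(order) + bin_size - 1) // bin_size
    totals = [0] * n_bins
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for index, count in enumerate(pairs_per_bin(shuffled, pairs, bin_size)):
            totals[index] += count
    return [total / n_shuffles for total in totals]
```

Subtracting the shuffled baseline from the observed per-bin counts gives an excess per bin; running the comparison separately for each degree of relatedness would show whether any deviation is specific to third-degree pairs.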

Derivative counts of excess relatedness per bin
