Genetic kinship

The Our Future Health cohort includes individuals from a wide range of ancestral backgrounds. Because population structure can influence kinship estimates, ancestry must be accounted for in kinship analysis. We followed the methodology outlined in Bycroft et al. (2018). To account for the diverse backgrounds of our samples, kinship was estimated using the SNPs that contribute least to ancestry, as identified from the first Principal Component Analysis (PCA-1).

Kinship analysis was implemented using the KING software.

The workflow was as follows:

The above plot can be summarized as:

Kinship-1 → PCA-1 → Kinship-2 → PCA-2

Detailed steps for kinship and ancestry analysis:

  1. Select high-quality SNPs:

  • Non-autosomal variants are excluded

  • Variants with missingness >2% are excluded

  • Variants in extended high-LD regions are excluded

  • Variants are pruned for LD using an r2 threshold of 0.1

  • Rare variants with MAF <0.01 are excluded

  • Variants that are multi-allelic, mismapped or subject to any other known issue are excluded

  • Variants failing the Hardy-Weinberg equilibrium (HWE) test at a threshold of 1e-6 are excluded

  2. Run Kinship-1 and define a set of unrelated individuals (more distant than third degree) based on KING.

  3. Run PCA-1 using the unrelated samples obtained at step 2 above.

  4. Using the first 3 PCs, extract the SNPs whose loadings contribute least to the ancestry estimation. We used 50% of the SNPs.

  5. Project the related samples (those excluded following step 2 above) onto the PCs calculated from the unrelated samples.

  6. Calculate PC-adjusted heterozygosity.

  7. Exclude samples based on PC-adjusted heterozygosity and missingness.

  8. Using the SNPs and samples selected in the steps above, run Kinship-2.
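The SNP selection based on PC loadings can be illustrated with a minimal sketch. It assumes that a SNP's contribution to ancestry is scored as its largest absolute loading across the first 3 PCs; the text above does not define the score, so this scoring rule, the function name `select_low_loading_snps` and the toy SNP IDs are all hypothetical.

```python
def select_low_loading_snps(loadings, n_pcs=3, keep_fraction=0.5):
    """Return the SNP IDs that contribute least to the leading PCs.

    loadings: dict mapping SNP ID -> list of per-PC loadings (PC1 first).
    Assumption: a SNP's contribution is its largest absolute loading
    across the first `n_pcs` PCs; the source does not define the score.
    """
    scores = {
        snp: max(abs(value) for value in values[:n_pcs])
        for snp, values in loadings.items()
    }
    n_keep = int(len(scores) * keep_fraction)
    # Sort by score, breaking ties by SNP ID so the result is deterministic.
    ranked = sorted(scores, key=lambda snp: (scores[snp], snp))
    return ranked[:n_keep]


# Toy loadings for four hypothetical SNPs on PC1..PC3.
example_loadings = {
    "rs1": [0.90, 0.10, 0.00],
    "rs2": [0.01, 0.02, -0.03],
    "rs3": [0.20, -0.50, 0.10],
    "rs4": [0.05, 0.00, 0.04],
}
selected = select_low_loading_snps(example_loadings)
print(selected)  # ['rs2', 'rs4']
```

Other scoring rules, such as the sum of squared loadings, would fit the same description; the choice here is only for illustration.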

PCA

PCA implementation

The fastPCA implementation in Plink2 is used for PCA calculations. This was run in parallel with kinship. The details of the algorithm are as follows:

  1. Using unrelated samples obtained from step 8 above, calculate PCA-2:

    1. 40 PCs are calculated for unrelated individuals using plink2

    2. PCs for the excluded related individuals are projected onto the PCA matrix

    3. Projected PCs are rescaled to match the calculated PCs
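The rescaling step can be pictured with a small sketch. It assumes that the reference (unrelated) samples have also been projected, and that a single multiplicative factor per PC, fitted by least squares through the origin, maps projected scores onto the scale of the calculated scores. This is one possible approach and not necessarily the procedure used for this release; the function name `rescale_projected_pc` is hypothetical.

```python
def rescale_projected_pc(ref_computed, ref_projected, target_projected):
    """Rescale projected scores for one PC onto the scale of the computed PC.

    ref_computed:     scores of reference samples from the PCA itself
    ref_projected:    scores of the same samples obtained by projection
    target_projected: projected scores of the related (held-out) samples
    Assumption: one multiplicative factor per PC, fitted by least squares
    through the origin. Illustrative only, not the documented method.
    """
    if len(ref_computed) != len(ref_projected):
        raise ValueError("reference score lists must have equal length")
    denominator = sum(p * p for p in ref_projected)
    if denominator == 0:
        raise ValueError("projected reference scores are all zero")
    numerator = sum(c * p for c, p in zip(ref_computed, ref_projected))
    factor = numerator / denominator
    return [factor * value for value in target_projected], factor


# Toy example: projected scores are exactly half the computed scores.
rescaled, factor = rescale_projected_pc(
    ref_computed=[1.0, -2.0, 3.0],
    ref_projected=[0.5, -1.0, 1.5],
    target_projected=[0.25, -0.75],
)
print(factor, rescaled)  # 2.0 [0.5, -1.5]
```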

See Genetic Ancestry section for the PCA plots

Kinship plot

Kinship file

The kinship file contains the following columns:

| Field | Type | Description |
|---|---|---|
| ID1 | string | Sample_id for individual 1 in related pair |
| ID2 | string | Sample_id for individual 2 in related pair |
| HetHet | numeric | Fraction of markers for which the pair both have a heterozygous genotype (output from KING software) |
| IBS0 | numeric | Fraction of markers for which the pair shares zero alleles (output from KING software) |
| Kinship | numeric | Estimate of the kinship coefficient for this pair based on the set of markers used in the kinship inference (output from KING software). The set of markers is indicated by the field: used.in.kinship.inference. |
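As a hedged illustration of working with these columns, the sketch below reads a kinship table and assigns each pair a degree of relatedness. It uses the kinship-coefficient ranges given in the KING documentation (>0.354 duplicate/MZ twin, 0.177 to 0.354 first degree, 0.0884 to 0.177 second degree, 0.0442 to 0.0884 third degree). Whether this release applied exactly these cut-offs, and the tab delimiter, are assumptions; the sample IDs and values are made up.

```python
import csv
import io

# Lower bounds of the kinship-coefficient ranges from the KING documentation.
KING_THRESHOLDS = [
    (0.354, "duplicate/MZ twin"),
    (0.177, "1st-degree"),
    (0.0884, "2nd-degree"),
    (0.0442, "3rd-degree"),
]


def classify_kinship(kinship):
    """Map a kinship coefficient to a relatedness label."""
    for lower_bound, label in KING_THRESHOLDS:
        if kinship > lower_bound:
            return label
    return "unrelated"


def read_kinship(handle, delimiter="\t"):
    """Parse a kinship table (delimiter is an assumption) and add a degree."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        kinship = float(row["Kinship"])
        rows.append({
            "ID1": row["ID1"],
            "ID2": row["ID2"],
            "HetHet": float(row["HetHet"]),
            "IBS0": float(row["IBS0"]),
            "Kinship": kinship,
            "degree": classify_kinship(kinship),
        })
    return rows


# Made-up rows in the column layout described above.
example = (
    "ID1\tID2\tHetHet\tIBS0\tKinship\n"
    "S1\tS2\t0.150\t0.0000\t0.2500\n"
    "S3\tS4\t0.110\t0.0120\t0.1200\n"
    "S5\tS6\t0.090\t0.0200\t0.0600\n"
)
pairs = read_kinship(io.StringIO(example))
print([p["degree"] for p in pairs])  # ['1st-degree', '2nd-degree', '3rd-degree']
```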

What should I be aware of when working with the kinship data in this release?

  • A small number of samples (78) were estimated to have an implausibly large number of third-degree (or closer) relatives. We also identified 129 sample pairs with IBS0 = 0, a pattern typically consistent with a parent–offspring relationship, whose kinship coefficients were nevertheless below the first-degree threshold and fell within the second-degree range. These pairs were therefore classified as second-degree relatives in the genotype data release. Researchers should investigate these cases further and apply any sample exclusions based on calculated genetic relatedness as appropriate when conducting their own statistical QC.

  • While no significant batch effects were observed in the PCA plots, an analysis of the observed versus expected frequency of third-degree related pairs as a function of cohort size showed an unexpected deviation. Briefly, genotyping data were sorted by either consent date or genotyping date and aggregated into bins of 52,000 samples each. The number of related pairs in each bin was counted and compared against a distribution obtained by random shuffling (see below). While the frequency of first- and second-degree related pairs closely matched the random distribution, the frequency of third-degree related pairs shifted from the baseline, suggesting an elevated number of third-degree related pairs in the first two bins and the last bin. This shift could result from several factors, including a bias from diverse recruitment, insufficient filtering of rare ancestral SNPs in an ethnically diverse cohort, or a genotyping batch effect. We will continue to investigate this effect and provide an update with the next data release.

Derivative counts of excess relatedness per bin
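The binning comparison described above can be sketched roughly as follows. The sketch assumes that a related pair is counted in a bin when both members fall in that bin, and that the baseline is the mean count over repeated random shufflings of the sample order; neither detail is stated in the text, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def pairs_per_bin(sample_order, pairs, bin_size):
    """Count related pairs whose two members fall in the same bin.

    Assumption: bins are consecutive blocks of `bin_size` samples in
    the given order (e.g. sorted by consent or genotyping date).
    """
    bin_of = {sample: i // bin_size for i, sample in enumerate(sample_order)}
    counts = Counter()
    for a, b in pairs:
        if a in bin_of and b in bin_of and bin_of[a] == bin_of[b]:
            counts[bin_of[a]] += 1
    n_bins = (len(sample_order) + bin_size - 1) // bin_size
    return [counts.get(i, 0) for i in range(n_bins)]


def shuffled_baseline(sample_order, pairs, bin_size, n_shuffles=100, seed=0):
    """Mean per-bin pair count over random shufflings of the sample order."""
    rng = random.Random(seed)
    order = list(sample_order)
    n_bins = (len(order) + bin_size - 1) // bin_size
    totals = [0] * n_bins
    for _ in range(n_shuffles):
        rng.shuffle(order)
        for i, count in enumerate(pairs_per_bin(order, pairs, bin_size)):
            totals[i] += count
    return [total / n_shuffles for total in totals]


# Toy data: six samples in date order, three related pairs, bins of 3.
samples = ["s0", "s1", "s2", "s3", "s4", "s5"]
related = [("s0", "s1"), ("s2", "s5"), ("s4", "s5")]
observed = pairs_per_bin(samples, related, bin_size=3)
baseline = shuffled_baseline(samples, related, bin_size=3)
print(observed)  # [1, 1]
```

The excess per bin is then the difference between `observed` and `baseline`. For the release, the bin size would be 52,000 and the pairs would be restricted to one degree of relatedness at a time.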
