# Genetic kinship

The Our Future Health cohort includes individuals from a wide range of ancestral backgrounds. Because population structure can influence kinship estimates, ancestry must be incorporated into kinship analysis. We followed the methodology outlined in [Bycroft et al. (2018) (PDF)](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-018-0579-z/MediaObjects/41586_2018_579_MOESM1_ESM.pdf). To account for the diverse backgrounds of our samples, we used the SNPs that contribute least to ancestry, as derived from a first Principal Component Analysis (PCA-1). Kinship analysis was implemented using the KING software. The workflow was as follows:

The above plot can be summarized as:

Kinship-1 → PCA-1 → Kinship-2 → PCA-2

Detailed steps for kinship and ancestry analysis:

1. Select high-quality SNPs:
   * Non-autosomal variants are excluded
   * Variants with missingness >2% are excluded
   * Variants in extended high-LD regions are excluded
   * Variants are pruned for LD using an r² threshold of 0.1
   * Rare variants with MAF <0.01 are excluded
   * Variants that are multi-allelic, mismapped or subject to any other known issue are excluded
   * Variants failing HWE (1e-6) are excluded
2. Run Kinship-1 and define a set of unrelated (>3rd degree) individuals based on KING.
3. Run PCA-1 using the unrelated samples obtained in step 2.
4. Using the first 3 PCs, extract the SNPs whose loadings contribute least to the ancestry estimation. We used 50% of the SNPs.
5. Project the related samples (those excluded following step 2) onto the PCs calculated from the unrelated samples.
6. Calculate PC-adjusted heterozygosity.
7. Exclude samples based on PC-adjusted heterozygosity and missingness.
8. Using the SNPs and samples selected in the steps above, run Kinship-2.

### PCA

#### PCA implementation

The *fastPCA* implementation in *Plink2* is used for PCA calculations. This has been run in parallel with kinship. The details of the algorithm are as follows:

1. Using unrelated samples obtained from step 8 above, calculate PCA-2:
   1. 40 PCs are calculated for unrelated individuals using plink2
   2. PCs for the excluded related individuals are projected onto the PCA matrix
   3. Projected PCs are rescaled to match the calculated PCs

See the **Genetic Ancestry** section for the PCA plots.

### Kinship plot
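Returning to step 4 of the kinship and ancestry workflow (keeping the SNPs that contribute least to ancestry), the selection could be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the function name `select_low_loading_snps`, the toy loadings, and the choice to score each SNP by its sum of squared loadings over the first 3 PCs are all assumptions. The source states only that the first 3 PCs were used and that 50% of the SNPs were retained.

```python
def select_low_loading_snps(loadings, n_pcs=3, keep_fraction=0.5):
    """Return sorted indices of the SNPs with the smallest combined loading.

    loadings: list of rows, one per SNP, each holding that SNP's loadings
    on the principal components from PCA-1.
    The combined score (sum of squared loadings over the first n_pcs PCs)
    is an assumption made for illustration only.
    """
    scores = [sum(v * v for v in row[:n_pcs]) for row in loadings]
    n_keep = int(keep_fraction * len(scores))
    # Sort SNP indices by ascending score so the least ancestry-informative come first.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


# Toy loadings for 4 hypothetical SNPs on 4 PCs (only the first 3 are used).
toy_loadings = [
    [0.9, 0.1, 0.0, 0.5],
    [0.1, 0.0, 0.1, 0.9],
    [0.0, 0.8, 0.2, 0.0],
    [0.2, 0.1, 0.1, 0.0],
]
kept = select_low_loading_snps(toy_loadings)
print(kept)
```

Other ways of combining the three loadings (for example the maximum absolute loading) would be equally plausible readings of the step.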

### Kinship file

The kinship file contains the following columns:


| Field | Type | Description |
| --- | --- | --- |
| ID1 | string | Sample_id for individual 1 in related pair |
| ID2 | string | Sample_id for individual 2 in related pair |
| HetHet | numeric | Fraction of markers for which the pair both have a heterozygous genotype (output from KING software) |
| IBS0 | numeric | Fraction of markers for which the pair shares zero alleles (output from KING software) |
| Kinship | numeric | Estimate of the kinship coefficient for this pair based on the set of markers used in the kinship inference (output from KING software). The set of markers is indicated by the field: used.in.kinship.inference. |
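
As a rough guide to interpreting the `Kinship` and `IBS0` columns, the sketch below assigns a degree of relatedness using the inference cut-offs given in the KING documentation (0.354, 0.177, 0.0884 and 0.0442). Whether this release applied exactly these cut-offs, and how values falling exactly on a boundary were handled, are assumptions. The helper names `classify_degree` and `ibs0_zero_second_degree` and the sample IDs are hypothetical.

```python
# Lower bounds on the kinship coefficient, taken from the KING documentation.
# Assumed here for illustration; not confirmed as the cut-offs used in this release.
KING_CUTOFFS = [
    (0.354, "duplicate/MZ twin"),
    (0.177, "1st degree"),
    (0.0884, "2nd degree"),
    (0.0442, "3rd degree"),
]


def classify_degree(kinship):
    """Map a kinship coefficient to a relatedness label."""
    for lower_bound, label in KING_CUTOFFS:
        if kinship > lower_bound:
            return label
    return "unrelated"


def ibs0_zero_second_degree(rows):
    """Return (ID1, ID2) for pairs with IBS0 == 0 but a second-degree kinship value."""
    return [
        (row["ID1"], row["ID2"])
        for row in rows
        if float(row["IBS0"]) == 0.0
        and classify_degree(float(row["Kinship"])) == "2nd degree"
    ]


# Hypothetical rows using the column names of the kinship file.
example_rows = [
    {"ID1": "S1", "ID2": "S2", "HetHet": 0.16, "IBS0": 0.0, "Kinship": 0.25},
    {"ID1": "S3", "ID2": "S4", "HetHet": 0.12, "IBS0": 0.0, "Kinship": 0.12},
    {"ID1": "S5", "ID2": "S6", "HetHet": 0.10, "IBS0": 0.02, "Kinship": 0.05},
]
flagged = ibs0_zero_second_degree(example_rows)
print(flagged)
```

A filter of this kind is one way to locate pairs whose IBS0 value and kinship coefficient point to different relationship classes.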

### What should I be aware of when working with the kinship data in this release?

* A small number of samples (78) were estimated to have an implausibly large number of third-degree (or closer) relatives. We further identified 129 sample pairs with IBS0 = 0, a pattern typically compatible with a parent–offspring relationship, but whose kinship coefficients were below the first-degree threshold, falling instead within the second-degree range. These pairs were therefore classified as second-degree relatives in the genotype data release. Researchers should investigate these cases further and apply any sample exclusions based on calculated genetic relatedness as appropriate when conducting their own statistical QC.
* While no significant batch effects were observed in PCA plots, an analysis of the observed versus expected frequency of third-degree related pairs as a function of cohort size showed an unexpected deviation. Briefly, genotyping data were sorted either by consent date or by genotyping date and aggregated in bins of 52,000 samples each. The number of related pairs in each bin was counted and compared against a distribution obtained by random shuffling (see below). While the frequency of first- and second-degree related pairs closely matched the random distribution, the frequency of third-degree related pairs showed a shift from the baseline, suggesting an elevated number of third-degree related pairs in the first two and the last analysis bins. This shift could result from several factors, including a bias from diverse recruitment and insufficient filtering of rare ancestral SNPs in an ethnically diverse cohort, or a genotyping batch effect. We will continue to investigate this effect and provide an update with the next data release.
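
The binning-and-shuffling comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: a pair is counted in a bin only when both of its members fall in that same bin, the baseline is the mean count over random shuffles of the sample order, the function names are hypothetical, and the toy data use a bin size of 3 rather than 52,000.

```python
import random


def count_pairs_per_bin(sample_order, pairs, bin_size):
    """Count related pairs whose two members fall in the same bin (an assumed definition)."""
    bin_of = {sample: idx // bin_size for idx, sample in enumerate(sample_order)}
    n_bins = (len(sample_order) + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for id1, id2 in pairs:
        if bin_of[id1] == bin_of[id2]:
            counts[bin_of[id1]] += 1
    return counts


def shuffled_baseline(sample_order, pairs, bin_size, n_shuffles=200, seed=0):
    """Mean per-bin pair count after randomly shuffling the sample order."""
    rng = random.Random(seed)
    shuffled = list(sample_order)
    n_bins = (len(shuffled) + bin_size - 1) // bin_size
    totals = [0] * n_bins
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for b, c in enumerate(count_pairs_per_bin(shuffled, pairs, bin_size)):
            totals[b] += c
    return [t / n_shuffles for t in totals]


# Hypothetical samples, already sorted by consent or genotyping date.
samples = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
# Hypothetical third-degree related pairs.
pairs = [("S1", "S2"), ("S2", "S3"), ("S8", "S9")]

observed = count_pairs_per_bin(samples, pairs, bin_size=3)
expected = shuffled_baseline(samples, pairs, bin_size=3)
excess = [o - e for o, e in zip(observed, expected)]
print(observed, expected, excess)
```

Positive values in `excess` correspond to bins holding more related pairs than the shuffled baseline would suggest.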

#### Derivative counts of excess relatedness per bin
