Genotype array data

Information about the genotype array data in the Our Future Health resource, including the scope and structure of these data and how the data were generated and processed.

The genotyping process

How were participant blood samples processed for genotyping?

Each participant was asked to donate a blood sample, which was collected in 2 x 6 ml EDTA tubes during their clinic appointment. Each of these blood sample tubes was labelled with a unique barcoded sample ID. Samples were receipted on arrival at the sample processing laboratory, operated by UK Biocentre, at which point package and sample quality were recorded and the primary and secondary sample tubes were assigned based on volume and condition. Samples which were reported as ‘Damaged’, ‘Leaking’, ‘Low Volume (<2ml)’, ‘Empty’ or ‘Haemolysed’ were excluded from further processing.

Each blood sample tube was centrifuged and fractionated into 1 x 500 µl buffy coat aliquot and up to 3 x 850 µl plasma aliquots. All sample plasma aliquots and the secondary tube buffy coat aliquot were placed in our biobank (also operated by UK Biocentre) at −80°C for long-term storage.

The primary tube buffy coat proceeded to DNA extraction using the Chemagic or Kingfisher magnetic bead technologies. Sample DNA was quantified using the DropSense or Lunatic platform.

DNA sample aliquots were then normalised for genotyping. Sample DNA concentration was required to be above 30 ng/µl for selection for genotyping. Samples with a concentration between 30 ng/µl and 100 ng/µl were transferred neat into the normalisation plate and samples >100 ng/µl were diluted to 100 ng/µl with nuclease-free water. DNA normalisation was recorded in the blood sample processing Laboratory Information Management System (LIMS).
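The normalisation rule above can be sketched as a small function. This is an illustration only, not the laboratory's LIMS logic: the function name, the `final_volume_ul` parameter and its default are hypothetical, and the text does not state unambiguously whether a sample at exactly 30 ng/µl qualifies, so the sketch assumes it does. The dilution volumes follow from C1·V1 = C2·V2.

```python
from dataclasses import dataclass
from typing import Optional

MIN_CONC = 30.0      # ng/µl, selection threshold given in the text
TARGET_CONC = 100.0  # ng/µl, dilution target given in the text


@dataclass
class NormalisationPlan:
    action: str                       # "exclude", "neat" or "dilute"
    dna_volume_ul: Optional[float]    # stock DNA volume to transfer
    water_volume_ul: Optional[float]  # nuclease-free water to add


def plan_normalisation(conc_ng_per_ul: float,
                       final_volume_ul: float = 50.0) -> NormalisationPlan:
    """Decide how a DNA sample would be normalised.

    final_volume_ul is a hypothetical working volume; the source does not
    state the volume used in the normalisation plate.
    """
    if conc_ng_per_ul < MIN_CONC:
        # Too dilute to be selected for genotyping.
        return NormalisationPlan("exclude", None, None)
    if conc_ng_per_ul <= TARGET_CONC:
        # Transferred neat: no water added.
        return NormalisationPlan("neat", final_volume_ul, 0.0)
    # C1 * V1 = C2 * V2  ->  V1 = V2 * C2 / C1
    dna = final_volume_ul * TARGET_CONC / conc_ng_per_ul
    return NormalisationPlan("dilute", dna, final_volume_ul - dna)
```

For example, under these assumptions a 200 ng/µl sample in a 50 µl working volume would need 25 µl of DNA and 25 µl of water.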

Normalised samples were added to 96-well plates. Each plate contained samples from 94 Our Future Health participants, with one participant plated twice to provide a technical replicate. For the technical replicate samples, well positions were physically separated on the plate and selected so that the two replicates were genotyped on separate 48-sample beadchip arrays. The final well (H12) was left empty for a control sample (to be added by the genotyping laboratory). The control sample was selected based on the availability of whole-genome sequence data for genotype concordance calculations, and of DNA stock sufficient to ensure a continued, uninterrupted supply of the sample. Sample NA12878 from the 1000 Genomes Project (a female sample of European ancestry) was selected, with stock DNA provided by The Coriell Institute.

A pseudonymised sample manifest was provided to Our Future Health and the genotyping laboratory at the time of DNA sample handover from the blood processing laboratory to the genotyping laboratory.

What genotype array was used?

Genotyping was conducted using a custom Illumina Infinium Excalibur beadchip array assay, designed by Our Future Health in collaboration with Illumina, and designated ‘OurFutureHealthv1’. Each beadchip array assays approximately 700,000 variants for each of 48 DNA samples. The assay includes both whole-genome amplification (WGA) and a PCR amplification step for pharmacogenomic target preparation, called targeted gene amplification (TGA).

In 2019, Our Future Health set up the Genotype Assay Design Working Group to discuss, decide and source the genetic variants that the genotyping assay should cover. In 2022, Our Future Health awarded Illumina a contract to develop a custom 48-sample beadchip array to efficiently and accurately genotype approximately 700,000 variants for each Our Future Health participant. During 2022 and early 2023, Our Future Health worked closely with Illumina, the Working Group, Board of Trustees, Founders Board, and Charity Affiliates to further develop the genotype variant custom content requirements of the programme. This included: broad and up-to-date sets of variants associated with disease or other phenotypes, or which are clinically relevant; pharmacogenetic variants including the Illumina ePGx product; and variants for blood type predictions. All these sets of variants, and the genotype array backbone, included considerations of health equity to ensure that the genotype array is both comprehensive and relevant to the diversity of the UK population.

From the proposed custom content, variants to be included on the array were selected by first including high priority variants flagged for direct assay by the Working Group, Founders Board or Charity Affiliates. Direct assays were also used for PGx content, blood type prediction content, and clinically-relevant content. All of this content, and the backbone variants, were considered “fixed” for subsequent selection of additional custom content.

All custom content variants submitted as tag-eligible (i.e., could be imputed rather than directly assayed) then underwent direct assay probe design to enable subsequent design decisions. Variants which were non-designable were excluded from further consideration. Variants were also examined for accuracy of imputation using in silico imputation experiments. These imputation experiments took the fixed content of the array as ‘known’ and imputed to a reference panel based on whole genome sequences from the 1000 Genomes Project. Direct assays were prioritised for variants which imputed more poorly (by non-reference concordance or imputation R-squared statistic) in samples from one or more of the three most common UK continental ancestries (African, European, and South Asian superpopulations). Among poorly imputed variants, those which were monomorphic in fewer continental ancestries were given the highest priority. Variants which were non-imputable (i.e., could not be imputed at all due to absence from the reference panel) were included on the array unless they were ultra-rare. Specifically, we removed variants which were not present in dbSNP, or that were either not present or had a minor allele frequency <0.0001 in both TOPMed and gnomAD.
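The ultra-rare filter for non-imputable variants can be read as a simple predicate. The sketch below is one interpretation of the sentence above, not the actual design pipeline: a variant is dropped if it is absent from dbSNP, or if it is absent or below the frequency threshold in TOPMed and also absent or below the threshold in gnomAD. The function and parameter names are hypothetical, and `None` is used here to mean "not present in that resource".

```python
from typing import Optional

MAF_THRESHOLD = 0.0001  # minor allele frequency cut-off given in the text


def is_rare_or_absent(maf: Optional[float]) -> bool:
    """True if the variant is missing from a resource (None) or ultra-rare in it."""
    return maf is None or maf < MAF_THRESHOLD


def keep_non_imputable_variant(in_dbsnp: bool,
                               topmed_maf: Optional[float],
                               gnomad_maf: Optional[float]) -> bool:
    """Return True if a non-imputable variant would stay on the array."""
    if not in_dbsnp:
        return False
    # Dropped only when BOTH resources show it as absent or ultra-rare.
    if is_rare_or_absent(topmed_maf) and is_rare_or_absent(gnomad_maf):
        return False
    return True
```

Under this reading, a variant seen at an appreciable frequency in either TOPMed or gnomAD is retained even if it is missing from the other resource.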

How were the samples genotyped on the array?

Normalised DNA samples in 96-well plates were transferred from the blood sample processing laboratory to the genotyping laboratory, which is operated by Eurofins. The acceptance procedure for transfer of samples from the blood sample processing laboratory to the genotyping laboratory included visual inspection for plate seal integrity and damage, and validation checks of the digital manifest.

Each 96-well plate of participant samples was genotyped using two 48-sample beadchip arrays, with a fixed plate-to-array mapping. The genotyping laboratory was not informed which positions were selected for the technical replicate samples on each plate. The well positions of the two technical replicate samples were physically separated on the plate (one on an edge and the other among the interior wells) and were placed to ensure that each was genotyped on a different 48-sample beadchip array.
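The replicate placement constraints described above (one well on an edge, one in the interior, and each on a different array) can be expressed as a check on a pair of well positions. This is a sketch under a loudly stated assumption: the actual plate-to-array mapping is not given in the text, so the `array_index` function below uses a hypothetical mapping in which columns 1 to 6 go to one array and columns 7 to 12 to the other.

```python
from typing import Tuple

ROWS = "ABCDEFGH"      # standard 96-well plate rows
N_COLS = 12            # standard 96-well plate columns
CONTROL_WELL = "H12"   # reserved for the control sample, per the text


def parse_well(well: str) -> Tuple[int, int]:
    """Convert a well label such as 'A3' to (row index, column number)."""
    row = ROWS.index(well[0].upper())  # raises ValueError for an invalid row
    col = int(well[1:])
    if not 1 <= col <= N_COLS:
        raise ValueError(f"invalid column in well {well!r}")
    return row, col


def is_edge(well: str) -> bool:
    """True if the well lies in the outermost row or column of the plate."""
    row, col = parse_well(well)
    return row in (0, len(ROWS) - 1) or col in (1, N_COLS)


def array_index(well: str) -> int:
    """HYPOTHETICAL mapping: columns 1-6 -> array 0, columns 7-12 -> array 1."""
    _, col = parse_well(well)
    return 0 if col <= N_COLS // 2 else 1


def replicate_placement_ok(well_a: str, well_b: str) -> bool:
    """Check the two replicate wells against the constraints in the text."""
    if CONTROL_WELL in (well_a.upper(), well_b.upper()):
        return False
    different_arrays = array_index(well_a) != array_index(well_b)
    one_edge_one_interior = is_edge(well_a) != is_edge(well_b)
    return different_arrays and one_edge_one_interior
```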

Genotyping laboratory operating procedures followed the laboratory protocol specified by the array manufacturer, Illumina. Samples were logged and tracked throughout the genotyping process via a Laboratory Information Management System (LIMS).

How were the genotypes called from the raw genotype data?

Genotype calling was conducted at the genotyping laboratory using the Illumina Array Analysis CLI (ACLI) v2.1.0 software and the default settings for genotype calling specified by Illumina.

How was call rate calculated?

Call rate was calculated during genotype calling using the Illumina ACLI v2.1.0 software and based on the number of called genetic variants divided by the number of total genetic variants across all chromosomes.
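As an illustration of the ratio described above, the following sketch computes a call rate from a list of genotype calls. It is not the Illumina ACLI implementation; representing a no-call as `None` or as a VCF-style genotype containing `.` is an assumption made for the example.

```python
from typing import Iterable, Optional


def call_rate(genotypes: Iterable[Optional[str]]) -> float:
    """Fraction of variants with a called genotype.

    A genotype counts as uncalled if it is None or contains '.'
    (VCF-style missing allele); this encoding is an assumption.
    """
    total = 0
    called = 0
    for gt in genotypes:
        total += 1
        if gt is not None and "." not in gt:
            called += 1
    if total == 0:
        raise ValueError("no variants supplied")
    return called / total
```

A sample with three called genotypes out of four would have a call rate of 0.75, well below the 97% threshold used for release.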

How was genetic sex estimated?

Genetic sex was estimated during genotype calling using the sex calling algorithm included in the Illumina ACLI v2.1.0 software. The algorithm used intensity data from X and Y chromosome variants. Samples where the median Y intensity was high were estimated ‘Male’. If the median Y intensity was low and the median X intensity was high then the sample was estimated as ‘Female’. If these criteria were not met then the sample was estimated as ‘Unknown’.
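The decision rule described above can be sketched as follows. This is not the ACLI algorithm itself: the text does not give the intensity cut-offs, so the three threshold parameters are hypothetical and must be supplied by the caller. The autosomal call rate check reflects the ACLI behaviour noted elsewhere in this document (genetic sex is called ‘Unknown’ if the autosomal call rate is <97%).

```python
def estimate_genetic_sex(median_x: float,
                         median_y: float,
                         autosomal_call_rate: float,
                         *,
                         y_high: float,
                         y_low: float,
                         x_high: float,
                         min_call_rate: float = 0.97) -> str:
    """Return 'Male', 'Female' or 'Unknown'.

    y_high, y_low and x_high are HYPOTHETICAL intensity thresholds;
    the real values used by the software are not stated in the source.
    """
    if autosomal_call_rate < min_call_rate:
        return "Unknown"
    if median_y >= y_high:
        # High Y intensity is sufficient for a 'Male' estimate.
        return "Male"
    if median_y <= y_low and median_x >= x_high:
        # Low Y together with high X gives a 'Female' estimate.
        return "Female"
    return "Unknown"
```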

Genotype data processing

How did we process the data for this release?

Upon receipt of called genotype data from the genotyping laboratory, sample-level data were accepted for release unless any of the following exclusion criteria were met:

if the call rate calculated using all variants was <97%
if self-reported sex registered at birth was missing, due to not submitting a questionnaire or responding ‘Prefer not to answer’
if self-reported sex registered at birth and genetic sex were discordant (except for participants who reported ‘Intersex’, which was not considered discordant with any genetic sex)
if the targeted gene amplification (TGA) control probe values were outside the manufacturer's recommended range (indicating possible failure of the PCR amplification for pharmacogenomic content)
if, on the same plate as the sample, the technical replicate sample pair genotype concordance was <99% and the control sample genotype concordance to whole genome sequence data was <99%
if, on the same plate as the sample, >4% of samples were discordant in self-reported sex registered at birth and genetic sex, among those which were neither missing self-reported sex registered at birth nor called as ‘Unknown’ genetic sex
if, on the same plate as the sample, ≥90 samples (out of 96) were excluded due to call rate, TGA or sex discordance checks
if the sample was the 1000 Genomes Project control sample
if the sample was the member of a plate's technical replicate pair with the lower call rate or, if the call rates were identical, the member closest to the edge of the plate
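Several of the criteria above can be sketched as a per-sample check plus a per-plate discordance rate. This is an illustrative reading of the list, not the production pipeline: the `Sample` class and its field names are hypothetical, a missing self-reported sex is represented as `None`, and the plate-level concordance and replicate-selection rules are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    call_rate: float                  # whole-genome call rate, 0..1
    self_reported_sex: Optional[str]  # "Male", "Female", "Intersex" or None
    genetic_sex: str                  # "Male", "Female" or "Unknown"
    tga_in_range: bool                # TGA control probes within range
    is_control: bool = False          # 1000 Genomes Project control sample


def sex_discordant(s: Sample) -> bool:
    """'Intersex' is never discordant; a missing self-report is handled separately."""
    if s.self_reported_sex is None or s.self_reported_sex == "Intersex":
        return False
    return s.self_reported_sex != s.genetic_sex


def sample_level_exclusions(s: Sample) -> List[str]:
    """Return the reasons a sample would be quarantined (empty list if none)."""
    if s.is_control:
        return ["control sample"]
    reasons = []
    if s.call_rate < 0.97:
        reasons.append("call rate <97%")
    if s.self_reported_sex is None:
        reasons.append("self-reported sex missing")
    if sex_discordant(s):
        reasons.append("sex discordant")
    if not s.tga_in_range:
        reasons.append("TGA control out of range")
    return reasons


def plate_sex_discordance_rate(samples: List[Sample]) -> float:
    """Discordance rate among samples with a self-report and a known genetic sex."""
    eligible = [s for s in samples
                if s.self_reported_sex is not None and s.genetic_sex != "Unknown"]
    if not eligible:
        return 0.0
    return sum(sex_discordant(s) for s in eligible) / len(eligible)
```

In this sketch a plate would be flagged when `plate_sex_discordance_rate` exceeds 0.04, matching the >4% criterion.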

Note: the Illumina ACLI v2.1.0 software calls genetic sex as ‘Unknown’ if the autosomal call rate is <97%. Because samples with ‘Unknown’ genetic sex results are quarantined unless the participant also self-reported ‘Intersex’ as their sex registered at birth, almost all samples in the current data release also have an autosomal call rate of ≥97%, in addition to the ≥97% whole-genome call rate noted above.

After quarantine of samples failing any of the above exclusion criteria, individual sample-level SNV VCF files were merged into a single population VCF (pVCF) using bcftools v1.17. Multi-allelic variants were split into separate biallelic variants and the variant IDs were changed to CHR:POS:REF:ALT (chromosome, position, reference allele and alternate allele, or CPRA) format IDs for the new biallelic records. The pVCF file was converted to BGEN file format and both file types were then split by chromosome to create 25 separate files for each file type (22 autosomes, the sex chromosomes 'X' and 'Y', and mitochondrial 'MT'). Prior to release, each file set was further split by region within each chromosome into 160 separate files. The chromosome and genomic coordinates for the variants within each pVCF and BGEN file can be found in the regional index BED file.
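To illustrate the CPRA identifier scheme, the sketch below splits the site-level fields of a multi-allelic record into one biallelic record per ALT allele and builds the CHR:POS:REF:ALT ID for each. It is a simplified illustration with a hypothetical function name, not the command used in the pipeline: it does not recode sample genotypes or normalise allele representation, both of which a real splitting tool has to handle.

```python
from typing import List, Tuple

BiallelicSite = Tuple[str, int, str, str, str]  # (chrom, pos, id, ref, alt)


def split_multiallelic(chrom: str, pos: int, ref: str, alts: str) -> List[BiallelicSite]:
    """Split a comma-separated ALT field into biallelic sites with CPRA IDs."""
    sites = []
    for alt in alts.split(","):
        cpra_id = f"{chrom}:{pos}:{ref}:{alt}"
        sites.append((chrom, pos, cpra_id, ref, alt))
    return sites
```

For example, a site on chromosome 1 at position 12345 with REF `A` and ALT `C,T` would yield two records with IDs `1:12345:A:C` and `1:12345:A:T`.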

After constructing the pVCF of genotype data for release, we conducted several targeted statistical checks on the data. This involved assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. The objective of these checks was to ensure there were no substantial deviations from expected patterns or large sets of outlier samples which would prevent release of the data. Researchers should identify or exclude outlier samples and variants as part of their own comprehensive statistical QC, tailored to their research question.
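As an example of the kind of variant-level check a researcher might run in their own QC, the sketch below computes a Hardy-Weinberg chi-square statistic from genotype counts. The text does not say which Hardy-Weinberg test was used for the release checks; the one-degree-of-freedom chi-square test shown here is one common choice (exact tests are often preferred for rare variants), and the function name is hypothetical.

```python
import math
from typing import Tuple


def hwe_chi_square(n_hom_ref: int, n_het: int, n_hom_alt: int) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) for a biallelic variant.

    The p-value uses the chi-square distribution with 1 degree of freedom,
    whose survival function is erfc(sqrt(x / 2)).
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip genotype classes with zero expectation (monomorphic variants).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Genotype counts of 25, 50 and 25 are exactly at Hardy-Weinberg proportions and give a statistic of 0; counts of 50, 0 and 50 show a complete absence of heterozygotes and give a statistic of 100.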

How did we de-identify the data to minimise risks of identifying participants?

The original, pseudonymised laboratory ID of each genotyped sample was replaced in the pVCF, BGEN and sample files with the participant ID (PID) for the participant who gave the sample. These PIDs are randomly generated to de-identify participants, and are the primary IDs used to link the genotype data with the data in the participant and questionnaire tables.

What exclusions were applied to the data?

Primary exclusions based on call rate, genetic sex and targeted gene amplification (TGA) control probe values are described under ‘How did we process the data for this release?’ No additional exclusions or transformations were applied to the genotype data at the sample level or variant level. In particular, variants have not been filtered for allele frequency, missingness, or other QC statistics. Likewise, samples have not been excluded for heterozygosity, patterns of genetic relatedness to other samples, or other metrics.

How do I interpret the genotype file structure?

The pVCF files are bgzip-compressed and follow the standard VCF 4.1 file specification, with genotypes present in the GT sub-field of the FORMAT field. This is described in the genotype tab of the Our Future Health data dictionary, which can be found on the Data and cohort page of our website. Variants are listed with numeric chromosome notation, except for X, Y and MT, which use character notation. The variant IDs in the pVCF ID field are given in CHR:POS:REF:ALT (chromosome, position, reference allele and alternate allele) format and follow human genome reference assembly GRCh38.
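A minimal sketch of reading genotypes from such a file with only the Python standard library is shown below. It relies on bgzip output being readable as ordinary gzip; the helper names are hypothetical, and for real analyses a dedicated VCF library or bcftools is likely to be faster and more robust.

```python
import gzip
from typing import Iterator, List, Optional, Tuple


def parse_record(line: str) -> Tuple[str, List[str]]:
    """Return (variant ID, list of GT strings) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    variant_id = fields[2]                    # CHR:POS:REF:ALT in these files
    gt_index = fields[8].split(":").index("GT")
    genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
    return variant_id, genotypes


def alt_allele_count(gt: str) -> Optional[int]:
    """Count non-reference alleles in a GT string; None if any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(allele != "0" for allele in alleles)


def iter_records(path: str) -> Iterator[Tuple[str, List[str]]]:
    """Iterate over data lines of a bgzip-compressed VCF, skipping headers."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield parse_record(line)
```

Because the released records are biallelic, `alt_allele_count` gives the alternate allele dosage (0, 1 or 2 for diploid calls) directly.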

The BGEN files (version 1.2 format) are binary files for storing compressed genotype information and are not human-readable. The BGEN-associated sample files (.sample) are tabular, human-readable files that contain the estimated genetic sex for each sample and follow the standard BGEN-associated sample file specification. BGEN index files (.bgi) are provided for faster and more efficient processing of genetic variants in the BGEN files. The BGEN file specification is described at www.well.ox.ac.uk/~gav/bgen_format/.
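The .sample files can be read without special tooling. The sketch below assumes the usual layout of this file type: whitespace-delimited text in which the first line holds column names, the second line holds column type codes, and each later line describes one sample. The column names in the usage example are placeholders; check the data dictionary for the names used in the release.

```python
from typing import Dict, List, Tuple


def parse_sample_file(text: str) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Parse .sample file content into (column names, column types, rows)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a names line and a types line")
    names = lines[0].split()
    types = lines[1].split()
    rows = [dict(zip(names, ln.split())) for ln in lines[2:]]
    return names, types, rows


# Usage with made-up content; IDs and the 'sex' column name are placeholders.
example = "ID_1 ID_2 missing sex\n0 0 0 D\nP001 P001 0 1\nP002 P002 0 2\n"
names, types, rows = parse_sample_file(example)
```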

What metadata is available to help document the genetic data release?

The genotype tab of the Our Future Health data dictionary summarises the fields in the sample QC and pVCF files. The CPRA variant list provides a list of the genetic variants included in the data set in CHR:POS:REF:ALT (chromosome, position, reference allele and alternate allele) format. Both files can be found on the Data and cohort page of our website. The sample QC metrics file of the release contains information on which batch each sample was genotyped in. This file is available within the Our Future Health TRE when downloading genotype data files into an approved project.
