Release 5
Information about the data released on 12 December 2023
This release includes 704,088 participants with completed baseline health questionnaires and 66,524 participants with initial genotype array data.
What data is included in Release 5?
Release 5 includes data collected from our baseline health questionnaire and an indicator showing whether we have received a blood sample from each participant.
All participants have completed and submitted the baseline questionnaire. For 66,524 of these individuals we have generated genotype array data, which is part of the genotyping data release.
Participant data
The participant table includes information from 704,088 participants who have registered and consented to join the Our Future Health programme, on or before 3 November 2023. Each record in the participant table corresponds to exactly one record in the questionnaire data.
Questionnaire data
The questionnaire table includes data from those participants who have completed either version 1 or version 2 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilot in 2021 and after the main recruitment period began in the summer of 2022. Participants who started the questionnaire on or after 20 November 2022 will have completed version 2 of the questionnaire. A questionnaire is considered complete when a participant has answered all sections and has submitted their responses.
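The version rule above can be sketched as a small helper. This is an illustration only: it assumes a questionnaire start date is available for each record (the field holding it is not named here), and that a start date before 20 November 2022 implies version 1. The function name `questionnaire_version` is hypothetical.

```python
from datetime import date

# Cut-over date stated above: questionnaires started on or after this
# date are version 2.
V2_START = date(2022, 11, 20)


def questionnaire_version(start_date: date) -> int:
    """Return the expected questionnaire version (1 or 2) for a start date.

    Assumption: any start date before the cut-over implies version 1.
    """
    return 2 if start_date >= V2_START else 1
```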
Genotype data
The initial genotyping data release contains information on 700,138 variants for 66,524 participants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina.
Each participant is represented by a single sample in each file in the genotyping data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables. There are three categories of files included in the current release: two sets of files containing participant genotypes and one file containing sample-level information. Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. The two sets of genotype files, pVCF and BGEN files, each contain the same genotypes for the same participants and genetic variants. The pVCF contains additional genotyping metadata that is not present in the BGEN file. We provide both types of files for convenience and to improve the experience of researchers using the data. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex and call rate.
SNV pVCF | pVCF | VCF 4.1 | snv_pvcf | ofh_snv.v1.vcf.gz | pVCF containing SNV genotypes and metadata
SNV pVCF | pVCF index file | - | snv_pvcf | ofh_snv.v1.vcf.gz.tbi | pVCF-associated index file
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v1.bgen | BGEN file containing SNV genotypes
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v1.sample | BGEN-associated sample file
BGEN | BGEN index file | - | snv_bgen | ofh_snv.v1.bgen.bgi | BGEN-associated index file
Sample QC | QC metrics | - | sample_qc_metrics | ofh_sample_qc_metrics.v1.tsv | Plain-text tabular file with sample-level information
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents (“snv” or “sample_qc_metrics”)
the version number (“.v1”) of the genotyping data release, to be incremented with each release
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values.
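The naming scheme above can be turned into a small parser, which may help when scripting over several releases. This is a sketch: the regular expression is assembled from the components listed above, and the function name `parse_release_file_name` is hypothetical rather than part of any Our Future Health tooling.

```python
import re

# Pattern built from the listed components: "ofh_" prefix, contents
# indicator, ".v<N>" version number, then the file-type suffix.
FILE_NAME = re.compile(
    r"^ofh_(?P<contents>snv|sample_qc_metrics)"
    r"\.v(?P<version>\d+)"
    r"\.(?P<suffix>.+)$"
)


def parse_release_file_name(name: str) -> dict:
    """Split a release file name into contents, version and suffix."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])  # "1" -> 1
    return parts
```

For example, `parse_release_file_name("ofh_snv.v1.vcf.gz")` yields the contents indicator `snv`, version `1` and suffix `vcf.gz`.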
What information do the genotype files contain?
Both the pVCF and BGEN files contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. The pVCF file metadata includes GenCall score, Log R Ratio, and B-allele Frequency, all of which are available in the FORMAT field. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file.
For more information on the fields present in each genotype file, refer to the genotyping data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list. Both these files can be found on the Data and cohort page of our website.
What information does the sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary, located on the Data and cohort page of our website.
Participant and Questionnaire Data
Are there any known data quality issues?
Technical data loss
Due to a suspected system issue that occurred in October 2022, we are aware that some questionnaires submitted around that time have missing data for some questions. We are analysing the impact and will provide further information in future releases.
Incorrect branching logic
Sex-specific questions and answer options:
Some questions or answer options are only shown to participants who answer the question “What sex were you registered with at birth?” with Female (2) or Intersex (1) or Prefer not to say (-3), but not to participants who select Male (1). Similarly, other questions and answer options are only shown to participants who select Male (1) or Intersex (1) or Prefer not to say (-3), but not Female (2). A small number of participants (<100) who selected sex as Male were prompted to provide responses to female-specific questions. Both versions of the questionnaire are affected.
Walking related questions:
If participants respond with Unable to walk (-2) to the question “Thinking about the last 4 weeks, in a typical WEEK, on how many days did you walk for at least 10 minutes at a time?” (ACTIVITY_WALK_DAYS_1_1 and ACTIVITY_WALK_DAYS_2_1), they should skip the following fields: ACTIVITY_WALK_MINS_1_1/ACTIVITY_WALK_MINS_2_1, ACTIVITY_WALK_PACE_1_1, HEALTH_RESP_SHORT_1_1, HEALTH_PAIN_LEG_1_1. However, this logic is currently not working as intended.
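A possible consistency check for this issue is sketched below. The notes do not say how the logic fails, so the sketch assumes one plausible symptom: follow-up fields that hold a value even though the participant answered Unable to walk (-2). It also assumes each record is a dict keyed by field name with unanswered fields stored as `None`; the function name `unexpected_walk_answers` is hypothetical.

```python
UNABLE_TO_WALK = -2

# Version 1 and version 2 variants of the walking-days question.
WALK_DAYS_FIELDS = ("ACTIVITY_WALK_DAYS_1_1", "ACTIVITY_WALK_DAYS_2_1")

# Fields that should be skipped after an "Unable to walk" response.
SHOULD_BE_SKIPPED = (
    "ACTIVITY_WALK_MINS_1_1",
    "ACTIVITY_WALK_MINS_2_1",
    "ACTIVITY_WALK_PACE_1_1",
    "HEALTH_RESP_SHORT_1_1",
    "HEALTH_PAIN_LEG_1_1",
)


def unexpected_walk_answers(record: dict) -> list:
    """Return follow-up fields answered despite an 'Unable to walk' response."""
    unable = any(record.get(f) == UNABLE_TO_WALK for f in WALK_DAYS_FIELDS)
    if not unable:
        return []
    # Assumption: a skipped (unanswered) field is stored as None.
    return [f for f in SHOULD_BE_SKIPPED if record.get(f) is not None]
```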
Contraceptive methods:
Participants can provide multiple responses to the question “What have you used for contraception?” (GYN_CONTRACEPT_METHODS_1_M). If a participant selects Combined Pill (1) or Mini Pill (5), they are supposed to be asked about the age at which they first and last took the contraceptive pill (fields GYN_CONTRACEPT_PILL_FIRST_AGE_1_1 and GYN_CONTRACEPT_PILL_LAST_AGE_1_1, respectively). This logic should apply to all participants who select option 1 or 5, even if combined with other contraceptive methods. However, when they select options 1 or 5 in combination with any other contraceptive method, participants were not asked these age-related follow-up questions.
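The affected response pattern can be flagged with a short helper. This is a sketch under two assumptions: the multi-response field GYN_CONTRACEPT_METHODS_1_M has already been parsed into a collection of integer codes, and “any other contraceptive method” means a code other than 1 or 5 (so selecting only 1 and 5 together is treated as unaffected). The function name `pill_follow_up_missed` is hypothetical.

```python
PILL_CODES = {1, 5}  # Combined Pill (1), Mini Pill (5)


def pill_follow_up_missed(methods) -> bool:
    """True if the pill age questions were likely not asked.

    `methods` is the set of codes selected in GYN_CONTRACEPT_METHODS_1_M.
    """
    selected = set(methods)
    has_pill = bool(selected & PILL_CODES)
    # Assumption: "other method" means any code outside {1, 5}.
    has_other = bool(selected - PILL_CODES)
    return has_pill and has_other
```

For flagged participants, empty values in the two pill age fields reflect the branching error rather than a choice not to answer.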
Work related questions:
Participants can provide multiple responses to the question “Which of the following describes your current situation?” (WORK_STATUS_2_M). Choosing any of the options In paid employment or self-employed (1), Looking after home and/or family (3), Doing unpaid or voluntary work (6), or On paid leave (e.g. parental leave, long term sick leave, furlough) (8) will prompt additional questions about their work (including the following fields: WORK_YRS_1_1, WORK_WK_HRS_1_1, WORK_WK_TRAVEL_1_1, WORK_TRANSPORT_1_M, WORK_DISTANCE_1_1, WORK_WALK_STAND_1_1, WORK_MANUAL_LABOUR_1_1, WORK_SHIFTS_1_1). Due to an error in branching logic, if a participant selects any of the answer values 1, 3, 6 or 8 in combination with a response that includes skip logic (meaning they would not be asked about their work situation), for example Retired (2) or Full or part-time student (7), all work-related fields mentioned above were skipped. This issue is specific to version 2 of the questionnaire.
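Records affected by this issue can be identified with a check along the following lines. It is a sketch, assuming WORK_STATUS_2_M has been parsed into a collection of integer codes and that the questionnaire version is known for each record. Only two skip-logic responses are named above (2 and 7), so the default `skip_codes` is incomplete by construction and should be extended from the data dictionary. The function name is hypothetical.

```python
WORK_PROMPT_CODES = {1, 3, 6, 8}  # responses that should prompt work questions

# Only the two examples named in the text; the full list is not given.
EXAMPLE_SKIP_CODES = {2, 7}  # Retired (2), Full or part-time student (7)


def work_fields_wrongly_skipped(statuses, version, skip_codes=EXAMPLE_SKIP_CODES) -> bool:
    """True if the work-related fields were likely skipped in error."""
    if version != 2:  # the issue is specific to questionnaire version 2
        return False
    selected = set(statuses)
    return bool(selected & WORK_PROMPT_CODES) and bool(selected & set(skip_codes))
```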
What has changed since previous releases?
Changes introduced in data release 5
Field names:
Up to data release 4, the questionnaire data contained two fields whose names could be read as referring to an incorrect unit of measurement. The field EDUCATION_YEARS contains responses to the question “At what age did you complete your continuous full time education?”, where response values refer to age. The field ALCOHOL_10YRS_AGE contains responses to the question “Compared to 10 years ago, do you drink?”, where responses are ordinal categorical options and do not refer to age or years specifically. In data release 5, most field names were updated. See ‘How do I interpret the structured field names?’ for a description of how field names should be interpreted.
Changes introduced in data release 4
Extreme values:
In previous releases (up to release version 3), there were instances of extreme values reported for self-reported weight in kilograms. These outliers were the result of an error in the logical ranges for minimum and maximum values permitted in the questionnaire. This issue was resolved within the questionnaire on 18 January 2023. However, the historical incorrect records persisted in the previously released datasets. Starting from release 4, self-reported weight values are constrained within the range of 20 to 400 kilograms, and values outside this range have been removed (converted to null). We are also aware of extreme values in other variables, such as age at first child, age at last child, and number of children (see above). Various checks are applied to the data, but we cannot exclude reporting errors in questionnaire responses due to poor recall, mistakes in data entry, participant biases, or other input of incorrect information.
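Researchers working with releases 1 to 3 may wish to apply the same constraint themselves. The sketch below mirrors the rule described above; treating the bounds 20 and 400 as inclusive is an assumption, and the function name `constrain_weight` is hypothetical.

```python
WEIGHT_MIN_KG = 20
WEIGHT_MAX_KG = 400


def constrain_weight(weight_kg):
    """Return the weight if within range, otherwise None (null).

    Assumption: both bounds are inclusive.
    """
    if weight_kg is None:
        return None
    if WEIGHT_MIN_KG <= weight_kg <= WEIGHT_MAX_KG:
        return weight_kg
    return None
```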
Missing Data:
Due to an error, participants who completed version 1 of the questionnaire were not shown the question “How old were you when you last smoked cigarettes on most days?”. In previous data releases, this question corresponded to the column SMOKING_CIG_AGE_1_1, which was always empty due to the error mentioned. This field has now been removed and, as of data release 4, will no longer form part of the participant and questionnaire datasets.
Genotype Data
Are there any known data quality issues?
This initial data release does not include sample QC results other than limited outputs from the genotype calling process including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the initial data release which we aim to address in a future release:
Estimated call rate is based on all chromosomes, including both the X and Y chromosomes. Median call rates for females may be lower than for males due to Y chromosome missingness.
The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. Calling of some samples occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at genotyping stage.
We have identified an issue with genotype data files in which both PLINK2 (Oct 3rd 2023 version) and QCTOOL (v2.0.8) return errors for certain commands. We are still investigating the extent and root cause and any necessary steps required to resolve the issue. Meanwhile, researchers are advised to analyse autosomes and sex chromosomes separately or individually as a work-around for most cases. We aim to resolve these issues in future releases and will update this documentation accordingly.
Our intention for this initial genotype data release was that all participants with genotype data would also have participant and questionnaire data available. We have, however, identified that records for a small number (<10) of participants with genotype data were removed from the Participant and Questionnaire tables during the process of generating this data release. We were unable to remove the corresponding genotype data for these participants before the genotype data release files were created. Instead, for these participants the genotype data files have participant IDs (PIDs) replaced with non-meaningful values: negative numbers for BGEN-related files, and strings starting with “w” in the pVCF-related files. In the sample-level files, all other fields have also been replaced by non-meaningful values (e.g. “0”). The genotypes themselves have not been replaced or removed in the pVCF (‘ofh_snv.v1.vcf.gz’) or BGEN (‘ofh_snv.v1.bgen’) genotype files. This issue will be corrected in the next data release. Researchers should exclude the participants described above from all analyses to ensure the integrity and consistency of findings.
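One way to build the exclusion list is to filter sample identifiers by the two placeholder patterns described above. This is a sketch: it assumes identifiers have been read as strings from the sample-level files, and that genuine PIDs are never negative and never start with “w”. The helper names are hypothetical.

```python
def is_placeholder_pid(pid) -> bool:
    """True for the non-meaningful PIDs described above."""
    text = str(pid).strip()
    if text.startswith("w"):  # pVCF-related files
        return True
    try:
        return int(text) < 0  # BGEN-related files use negative numbers
    except ValueError:
        return False  # non-numeric and not starting with "w": keep


def keep_real_pids(pids) -> list:
    """Drop placeholder PIDs, preserving the original order."""
    return [pid for pid in pids if not is_placeholder_pid(pid)]
```

The retained identifiers can then be passed to whichever sample-filtering option the analysis tool provides.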
Will future releases be compatible with this data release?
Our Future Health is aiming to recruit millions of participants for genotyping. Reformatting the genetic data releases may be required to provide data at this scale. For example, in future genetic data releases we may split files by chromosome or region within chromosome. Future genetic data releases may also be provided in additional or different file formats.
Several data enhancements are also planned. We will add imputed genotypes, greatly expanding the number of genotypes available in the data release. We will also expand the QC information available for both samples and variants.
Participant and Questionnaire data
We expect to make changes in future releases. For example, we:
have made every effort to avoid errors in the data production process, but will aim to address any identified issues in the next release
will provide additional information on questionnaire journeys with regards to branching logic
will make improvements to how the smoking section is currently presented