Release 6
Information about the data released on 19 March 2024
This release includes 802,998 participants with completed baseline health questionnaires and 66,520 participants with initial genotype array data.
What data is included in Release 6?
All participants included in this release have completed and submitted the baseline questionnaire, and for 66,520 of these individuals we have generated genotype array data.
Participant data
The participant table includes information from 802,998 participants who have registered and consented to join the Our Future Health programme, on or before 17 January 2024. Each record in the participant table corresponds to exactly one record in the questionnaire data.
Questionnaire data
The questionnaire table includes data from those participants who have completed either v1, v2 or v2.1 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilot in 2021 and after the main recruitment period began in the summer of 2022.
Participants who started the questionnaire on or after 20 November 2022 will have completed v2 of the questionnaire.
Participants who started the questionnaire on or after 21 December 2023 will have completed v2.1 of the questionnaire.
A questionnaire is considered complete when a participant has answered all sections and submitted their responses.
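The cut-off dates above can be expressed as a small lookup. The following is only a sketch: the function name and the idea of deriving the version from a start date are illustrative assumptions, and it assumes that any start date before the v2 cut-off corresponds to v1. Where a release table records the questionnaire version directly, that value should be preferred.

```python
from datetime import date

# Cut-off dates as stated in the text above.
V2_START = date(2022, 11, 20)
V2_1_START = date(2023, 12, 21)


def questionnaire_version(started_on: date) -> str:
    """Return the questionnaire version implied by the start date.

    Assumption: every start date before the v2 cut-off maps to v1.
    """
    if started_on >= V2_1_START:
        return "v2.1"
    if started_on >= V2_START:
        return "v2"
    return "v1"


# Hypothetical start dates, one either side of the v2 cut-off.
day_before_v2 = questionnaire_version(date(2022, 11, 19))
first_day_v2 = questionnaire_version(date(2022, 11, 20))
```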
Genotype data
The genotyping data release contains information on 700,138 variants for 66,520 participants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina.
Participant and Questionnaire Data
For information on how we:
de-identify data
manage re-identification risk
version control
tailor questionnaire journeys
store the data in the TRE
name our variables
see our Participant data and Questionnaire data pages.
What are the inclusion criteria for this release?
For the current release, all participants must have submitted a complete questionnaire, on or before 17 January 2024.
What has changed since previous releases?
Minor versioning
In previous releases, we discovered several data quality issues resulting from errors in dynamic logic, as described in the known data quality issues section of this page. These issues have been addressed and incorporated into a new minor version as of release 6 (v2.1). For more information on what constitutes a minor version, please see the section on Version changes and developments in the Questionnaire data page.
Handling questions with multi-select answers
In earlier releases, we stored the responses to questions allowing multi-select answers (multiple different answers) in two distinct ways. First, we stored all response values in a single array column. Second, we 'exploded' each response array so that each answer in the sequence had its own separate column. However, the answers were exploded out in the order they were selected, meaning a single column could still contain multiple different answers.
We have now removed the exploded-out columns and have opted to keep only the array column for storing multiple responses. This streamlining ensures a more concise and efficient data structure, simplifying the handling of multiple responses within the dataset.
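As an illustration of working with the remaining array column, the sketch below expands one multi-select response into per-code indicator values, so that the order in which answers were selected no longer matters. It assumes the array has already been read into a Python list of integer codes; how the array is serialised in the TRE is not specified here, and the function name is hypothetical.

```python
def one_hot(responses, codes):
    """Expand one multi-select response array into {code: 0 or 1} indicators.

    `responses` is the list of selected answer codes for one participant
    (None or an empty list when nothing was selected).
    """
    selected = set(responses or [])
    return {code: int(code in selected) for code in codes}


# Hypothetical record: answer codes for WORK_STATUS_2_M in selection order.
row = {"WORK_STATUS_2_M": [8, 1]}
indicators = one_hot(row["WORK_STATUS_2_M"], codes=[1, 2, 3, 6, 7, 8])
```

Because the indicators are keyed by answer code rather than by selection order, each derived column holds exactly one answer option.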
Please refer to the documentation from the previous releases to review the incremental changes made between each release.
Information on conditions for dynamic questions
For v2.1 we have created logic files detailing the conditions necessary for participants to answer each question. This is especially useful for dynamic questions, which are only shown to a subset of participants. The logic files are also currently the only location that contains a 1:1 mapping of field names to human-readable questionnaire identifiers. Both the codebooks and human-readable versions of the questionnaire can be found in the Do all participants respond to every question? section of the Questionnaire data page.
Are there any known data quality issues for the participant and questionnaire data?
Invalid age and year combinations
For some participants, responses relating to an age, year, or number of years may be invalid when compared with their age (derived from the month and year, with an imputed day retained for security reasons) or year of birth. This discrepancy can occur because participants may modify their date of birth after submitting a complete questionnaire. Currently, there is no post hoc validation in place to recheck age and year-related questions or to prompt participants to provide updated information when their responses become invalid. While this issue affects only a negligible number of cases, we intend to address it in a future release.
Multiple submissions by the same participants
We currently do not have any procedures in place to prevent individuals from registering multiple times using different contact details. We are aware of a small number of instances where the same individual may have submitted multiple questionnaires. This has not been fully investigated, but would likely affect less than 0.1% of submissions.
Errors in questionnaire logic
Questions about sex-specific conditions
A small number of records exhibit inconsistent combinations of self-reported sex, and sex-specific questions (where participants are dynamically shown certain questions, dependent on their sex), and sex-specific answer options (where questions are shown to all participants, but certain answer options are specific to each sex).
This inconsistency may arise when participants modify their response to the question "What sex were you registered with at birth?" (DEMOG_SEX_1_1 or DEMOG_SEX_2_1) after already providing answers to sex-specific questions. Consequently, the erroneous responses to questions intended for the opposite sex are retained, instead of being removed in accordance with the updated questionnaire journey. The discrepancy affects less than 0.1% of questionnaire submissions across all versions.
Questions about walking
For the question "Thinking about the last 4 weeks, in a typical WEEK, on how many days did you walk for at least 10 minutes at a time?" (ACTIVITY_WALK_DAYS_1_1 and ACTIVITY_WALK_DAYS_2_1), if participants respond with the answer "Unable to walk" (-2), they should skip the following questions:
"How many minutes did you usually spend walking on a typical DAY?" (ACTIVITY_WALK_MINS_1_1 or ACTIVITY_WALK_MINS_2_1)
"How would you describe your usual walking pace?" (ACTIVITY_WALK_PACE_1_1)
"Do you get short of breath walking with people of your own age on level ground?" (HEALTH_RESP_SHORT_1_1)
"Do you get a pain in either leg on walking?" (HEALTH_PAIN_LEG_1_1)
However, this logic is not currently working as intended: participants who are unable to walk are progressing to the questions listed above. This issue affects all versions of the questionnaire.
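Researchers may want to identify records affected by this issue before analysing the walking-related fields. The sketch below only flags such records and leaves the handling decision to the analyst. It assumes each record is available as a Python dict keyed by field name, with unanswered fields stored as None; both the representation and the function name are illustrative assumptions.

```python
UNABLE_TO_WALK = -2

# Field names as listed in the text above.
WALK_DAYS_FIELDS = ("ACTIVITY_WALK_DAYS_1_1", "ACTIVITY_WALK_DAYS_2_1")
FOLLOW_UP_FIELDS = (
    "ACTIVITY_WALK_MINS_1_1",
    "ACTIVITY_WALK_MINS_2_1",
    "ACTIVITY_WALK_PACE_1_1",
    "HEALTH_RESP_SHORT_1_1",
    "HEALTH_PAIN_LEG_1_1",
)


def has_unintended_walk_answers(record: dict) -> bool:
    """True if the participant answered "Unable to walk" (-2) but still
    has a response to any follow-up question that should have been skipped."""
    unable = any(record.get(f) == UNABLE_TO_WALK for f in WALK_DAYS_FIELDS)
    answered = any(record.get(f) is not None for f in FOLLOW_UP_FIELDS)
    return unable and answered


# Hypothetical record for illustration only.
example = {"ACTIVITY_WALK_DAYS_2_1": -2, "ACTIVITY_WALK_PACE_1_1": 1}
flagged = has_unintended_walk_answers(example)
```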
Regular but inconsistent smoking behaviours
Participants who respond to the question "Do you smoke cigarettes now?" (SMOKE_STATUS_2_1) with "Yes, some days" (2), "Yes, but rarely" (3) or "No, not at all" (0), and respond to the question "Did you ever smoke cigarettes on most or all days?" (SMOKE_PREV_REG_2_1) with "No" (0) or "Prefer not to answer" (-3) should proceed to the following questions:
"Compared to 10 years ago do you smoke..." (SMOKE_CHG_2_1)
"In the time that you have smoked, have you ever stopped for more than 6 months?" (SMOKE_CHG_ABST_2_1)
"When you stopped smoking for more than 6 months, why did you stop?" (SMOKE_CHG_ABST_REASON_1_M)
However, due to an error in logic, the majority of participants who meet the above criteria fail to progress appropriately, resulting in significant data loss. This issue affects v2 and v2.1.
All issues listed below have been fixed in the latest version of the questionnaire (v2.1).
All v2.1 questionnaire submissions will now contain the correct logic.
Affected questionnaires submitted via v1 and v2 will persist with the incorrect logic.
Questions about work status
Participants can provide multiple responses to the question "Which of the following describes your current situation?" (WORK_STATUS_2_M). Choosing any of the answer options "In paid employment or self-employed" (1), "Looking after home and/or family" (3), "Doing unpaid or voluntary work" (6) or "On paid leave (e.g. parental leave, long term sick leave, furlough)" (8) will prompt additional questions about their work:
"How many years have you worked in your current job?" (WORK_YRS_1_1)
"In a typical WEEK, how many hours do you spend at work?" (WORK_WK_HRS_1_1)
"How many times a WEEK do you travel from home to your main work?" (WORK_WK_TRAVEL_1_1)
"What types of transport do you use to get to and from work?" (WORK_TRANSPORT_1_M)
"About how many miles is it between your home and your work?" (WORK_DISTANCE_1_1)
"Does your work involve walking or standing for most of the time?" (WORK_WALK_STAND_1_1)
"Does your work involve heavy manual or physical work?" (WORK_MANUAL_LABOUR_1_1)
"Does your work involve shift work?" (WORK_SHIFTS_1_1)
Due to an error in dynamic logic, if a participant selected any of the answer values 1, 3, 6 or 8 in combination with a response that triggers skip logic (meaning that participants are not asked about their work situation), for example "Retired" (2) or "Full or part-time student" (7), all work-related fields mentioned above were skipped. This issue is specific to v2.
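The condition described above can be sketched as a simple check on the multi-select array. This is an illustration under stated assumptions: only codes 2 and 7 are named in the text as examples of responses with skip logic, so the skip set below is incomplete by construction; the version label "v2" and the function name are also hypothetical.

```python
# Answer codes that should prompt the work-related questions (from the text).
WORK_PROMPT_CODES = {1, 3, 6, 8}

# Assumption: only the two skip-logic examples named in the text are listed;
# the full set of skip-logic codes is not given here.
SKIP_EXAMPLE_CODES = {2, 7}


def may_have_skipped_work_questions(version: str, work_status) -> bool:
    """True if a v2 submission combines a work-prompting answer with one of
    the example skip-logic answers in WORK_STATUS_2_M."""
    selected = set(work_status or [])
    return (
        version == "v2"
        and bool(selected & WORK_PROMPT_CODES)
        and bool(selected & SKIP_EXAMPLE_CODES)
    )


# Hypothetical submission: in paid employment (1) and a student (7).
affected = may_have_skipped_work_questions("v2", [1, 7])
```

For records flagged this way, missing values in the work-related fields may reflect the logic error rather than a genuine absence of work activity.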
Contraceptive methods
Participants can provide multiple responses to the question "What have you used for contraception?" (GYN_CONTRACEPT_METHODS_1_M). If a participant selects "Combined Pill" (1) and/or "Mini Pill" (5), they are supposed to be subsequently shown questions about the ages at which they first and last took the contraceptive pill:
"About how old were you when you first went on the contraceptive pill?" (GYN_CONTRACEPT_PILL_FIRST_AGE_1_1)
"How old were you when you last used the contraceptive pill?" (GYN_CONTRACEPT_PILL_LAST_AGE_1_1)
Due to an error in dynamic logic, if a participant selected either or both of the answer values 1 and 5 in combination with a response that triggers skip logic (i.e. any other response), the age-related questions above were skipped. This issue is specific to v2 of the questionnaire.
Duplication of follow-up questions on reasons for change in smoking
Participants who reported that their smoking habits had not decreased in the last 10 years (selecting "More nowadays" (1) or "About the same" (2) in response to the question "Compared to 10 years ago do you smoke..." (SMOKE_CHG_2_1)) were erroneously asked the following independent pair of questions:
"Why did you reduce your smoking?" (SMOKE_CHG_REDUCE_REASON_2_M)
"In the time that you have smoked, have you ever stopped for more than 6 months?" (SMOKE_CHG_REDUCE_ABST_1_1)
This resulted in a duplication of data. The bug did not result in data loss, since the relevant participants were also correctly presented with the required pair of follow-up questions:
"In the time that you have smoked, have you ever stopped for more than 6 months?" (SMOKE_CHG_ABST_2_1)
"When you stopped smoking for more than 6 months, why did you stop?" (SMOKE_CHG_ABST_REASON_1_M)
Note that although the question text is the same for one of these questions, the data is contextually different and is stored independently in the data release.
This issue is specific to questionnaire v2, but was fixed in v2.1.
Questions about vaping
Participants who
selected a combination of the options "Cigarettes" (0) and "Electronic delivery devices that can be vaped, such as e-cigarettes (e.g. UWELL, Vype, Vuse, Vapouriz, WizMix)" (1) in response to the question "Have you ever REGULARLY used any of these tobacco products?" (SMOKE_REG_1_M)
AND answered "No" (0) for the question "In the time that you have smoked, have you ever stopped for more than 6 months?" (SMOKE_CHG_ABST_2_1)
erroneously skipped some follow-up questions in the smoking section, including the following vaping-related questions:
"How often, on average, did you use e-cigarettes (vaping) during the past 12 months?" (SMOKE_VAPE_AVG_1_1)
"What type of e-liquids/cartridges do you or did you use in your e-cigarettes?" (SMOKE_VAPE_TYPE_1_M)
This issue is specific to questionnaire v2, but was fixed in v2.1.
Genotype data
There are three categories of files included in the current release: two sets of files containing participant genotypes and one file containing sample-level information. Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. Each participant is represented by a single sample in each file in the genotyping data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.
The two sets of genotype files, pVCF and BGEN files, contain the same genotypes for the same participants and genetic variants. Each file set is split by chromosome, across 25 separate files (22 autosomal chromosomes, two sex chromosomes 'X' and 'Y', and mitochondrial 'MT'). Each chromosome-specific file has its own pVCF index file (.tbi), or BGEN index file (.bgi) and accompanying BGEN sample file (.sample). The pVCF contains additional genotyping metadata that is not present in the BGEN file. We provide both types of files for convenience and to improve the experience of researchers using the data. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex and call rate.
File set | File type | Format | Entity | File name | Number of files | Description
SNV pVCF | pVCF | VCF 4.1 | snv_pvcf | ofh_snv.v2.chrZ.vcf.gz | 25 | pVCF containing SNV genotypes and metadata
SNV pVCF | pVCF index file | - | snv_pvcf | ofh_snv.v2.chrZ.vcf.gz.tbi | 25 | pVCF-associated index file
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v2.chrZ.bgen | 25 | BGEN file containing SNV genotypes
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v2.chrZ.sample | 25 | BGEN-associated sample file
BGEN | BGEN index file | - | snv_bgen | ofh_snv.v2.chrZ.bgen.bgi | 25 | BGEN-associated index file
Sample QC | QC metrics | - | sample_qc_metrics | ofh_sample_qc_metrics.v2.tsv | 1 | Plain-text tabular file with sample-level information
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents (“snv” or “sample_qc_metrics”)
the version number (e.g. “.v2”) of the genotyping data release, to be incremented with each release
the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, 'X', 'Y' or 'MT'
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values.
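The naming components above can be combined programmatically, for example to enumerate the 25 per-chromosome files of a file set. The sketch below follows the file name pattern shown for this release (ofh_snv.v2.chrZ followed by a suffix); the function name and its default version argument are illustrative assumptions.

```python
# Chromosome labels used in file names: 1-22, X, Y and MT.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def genotype_file_name(chromosome: str, suffix: str, version: int = 2) -> str:
    """Build a per-chromosome genotype file name, e.g. suffix 'vcf.gz' or 'bgen'."""
    if chromosome not in CHROMOSOMES:
        raise ValueError(f"Unknown chromosome label: {chromosome!r}")
    return f"ofh_snv.v{version}.chr{chromosome}.{suffix}"


# All pVCF file names for the release, one per chromosome.
pvcf_files = [genotype_file_name(c, "vcf.gz") for c in CHROMOSOMES]
```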
What information do the genotype files contain?
Both the pVCF and BGEN file sets contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. The pVCF file metadata includes GenCall score, Log R Ratio, and B-allele Frequency, all of which are available in the FORMAT field. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file.
For more information on the fields present in each genotype file, please refer to the genotyping data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list. Both these files can be found on the Data and cohort page of our website.
What information does the sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on the Data and cohort page (external link) of our website.
Are there any known data quality issues for the genotype data?
This data release does not include sample QC results, other than limited outputs from the genotype calling process including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the current data release which we aim to address in a future release:
Estimated call rate is based on all chromosomes, including both the X and Y chromosomes. Median call rates for females may be lower than for males due to Y chromosome missingness.
The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. For some samples, genotype calling from the intensity data files occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at the genotyping stage.
The presence of non-haploid genotypes ('0/1', '0/0' and '1/1') was observed for Y and MT chromosome variants, affecting ~0.25% of Y chromosome genotype calls and a smaller proportion for MT, for both female and male samples, arising from low or noisy probe intensities for some genetic variants. These non-haploid genotype calls for haploid chromosomes should be treated as missing (no call). In future releases, non-haploid calls for Y and MT genetic variants will be set to missing in both the pVCF and BGEN files prior to data release.
A small number of genetic variants (~200) were found to have been incorrectly reported in the pVCF and BGEN files, resulting from multi-mapping probes (where a probe sequence maps to multiple locations in the genome) or multi-base SNP targets which were misaligned to the reference genome during genotype calling. These genetic variants should be excluded from analysis. We provide a list of these variants by way of an indicator column "inaccurate annotation" in the CPRA variant list file to facilitate their exclusion. You can download this file from the Data and cohort page on our website (external link). We aim to resolve this issue in future releases of genotype data.
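The last two recommendations above (treating non-haploid Y and MT calls as missing, and excluding variants flagged in the CPRA variant list) can be sketched as two small helpers. Several details are assumptions made for illustration: chromosome labels are taken to be "Y" and "MT" as in the file names, a missing call is written as ".", each CPRA variant list row is available as a dict keyed by column name, and the "inaccurate annotation" column is assumed to mark flagged variants with a truthy value such as "1", "true" or "yes". The actual encoding should be checked against the data dictionary.

```python
HAPLOID_CHROMOSOMES = {"Y", "MT"}            # assumed chromosome labels
NON_HAPLOID_CALLS = {"0/0", "0/1", "1/1"}    # calls named in the text above
MISSING_CALL = "."                           # assumed missing-call marker


def mask_non_haploid(chromosome: str, genotype: str) -> str:
    """Return a missing call for non-haploid genotypes on Y or MT;
    leave every other genotype unchanged."""
    if chromosome in HAPLOID_CHROMOSOMES and genotype in NON_HAPLOID_CALLS:
        return MISSING_CALL
    return genotype


def keep_variant(row: dict) -> bool:
    """False if the variant is flagged in the 'inaccurate annotation' column.

    Assumption: flagged rows carry a truthy marker; unflagged rows are empty.
    """
    flag = str(row.get("inaccurate annotation") or "").strip().lower()
    return flag not in {"1", "true", "yes", "y"}


# Hypothetical inputs for illustration only.
masked = mask_non_haploid("Y", "0/1")
kept = keep_variant({"inaccurate annotation": ""})
```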
What metadata is available to help document the data release?
On the Data and cohort page of our website (external link) we provide a:
data dictionary – which defines the raw data fields and metadata information, such as labels, descriptions and units of measurements
coding file – which contains the granular details of categorical or raw coded values
CPRA variant list – which contains a list of genetic variant IDs which map to the genetic variants available in our genotype files
If using Microsoft Excel to browse these files, for an optimal viewing experience, ensure the encoding settings are set to UTF-8.
On the What type of questions did we ask? section of the Questionnaire data page we also provide:
human-readable versions of both version 1 and version 2 of the questionnaire – which are text copies of the baseline health questionnaire
a questionnaire logic workbook – which represents dynamic logic implemented for v2.1 of the baseline health questionnaire and can be used in conjunction with v2 of the human-readable questionnaire
Will future releases be compatible with this data release?
Participant and Questionnaire data
We expect to make changes in future releases. For example, we will:
aim to address any identified issues in the next release (we have made every effort to avoid errors in the data production process)
provide additional information on questionnaire journeys
make improvements to how the smoking section is currently presented
consider one-hot encoding for questions with multiple answers that are currently only stored as an array
Genetic data
Reformatting the genetic data releases may be required to provide data at this scale. For example, in future genetic data releases we may split files by region within chromosome. Future genetic data releases may also be provided in additional or different file formats.
We will add imputed genotypes, greatly expanding the number of genotypes available in the data release. We will also expand the QC information available for both samples and variants.