


The Era of Health Big Data: How the UK Biobank is Transforming Medical Research

The following essay is from Leicester Grammar School (Leicester, LE8 9FL, UK). Author: Cunyi George Xu


Introduction to the UK Biobank

The UK Biobank (UKB), established in 2006, is a large-scale health research database supported by organizations including the Wellcome Trust, the Medical Research Council, Cancer Research UK, and the British Heart Foundation. What sets it apart is that it not only holds detailed health data from 500,000 volunteers but also links these data to a wide range of health outcomes, which makes them readily usable by scientists. Because its data are accessible, broad in coverage, and rich in depth, the UK Biobank has become one of the world’s most important resources for health research. The project aims to advance medicine and deepen our understanding of serious diseases such as cancer, heart disease, and stroke, ultimately improving prevention, diagnosis, and treatment so that more people benefit.

Since 2006, UK Biobank has recruited 500,000 volunteers aged 40 to 69 and is following their health for 30 years. It has collected detailed data from these volunteers through biological samples, questionnaires, and other means, and uploads them to the Research Analysis Platform, which is updated regularly. Baseline data, biomarkers, online questionnaire responses, imaging data, genetic data, and health outcomes are all uploaded on a planned schedule. The UK Biobank website lists the content and dates of upcoming releases, and registered researchers receive an email notification when new data are released. A brief overview of these data follows.


Data Types in the UK Biobank

Baseline Data

Between 2006 and 2010, UK Biobank recruited 500,000 participants across the UK who agreed to long-term health follow-up. This data collection has given scientists worldwide a valuable resource and has supported advances in the prevention, diagnosis, and treatment of many diseases. Baseline assessments were conducted at 22 centers across Scotland, England, and Wales. The assessment was divided into five parts and took each participant 2 to 3 hours to complete. It included signing a written consent form, completing a touchscreen questionnaire, a face-to-face interview with a nurse, measurements of grip strength, lung function, and bone density, and the collection of blood, urine, and saliva samples.

Biomarker Data

UK Biobank measured a range of biochemical markers in red blood cell, serum, and urine samples from the 500,000 participants, as well as from the 20,000 participants who underwent a repeat assessment. A total of 34 biomarkers were selected for testing on the basis of their scientific value for research into a variety of diseases. They include established disease risk factors (such as blood lipids for vascular disease and sex hormones for cancer), diagnostic measures (such as glycated hemoglobin for diabetes and rheumatoid factor for arthritis), and markers that characterize phenotypes such as kidney and liver function.

Online Questionnaire Data

UK Biobank regularly sends online questionnaires, usually twice a year, to the approximately 330,000 participants whose email addresses are on record, in order to gather more detailed health information. Participants without an email address can access the questionnaires through a designated website. The questionnaires are designed to collect new or more in-depth data, particularly on longitudinal follow-up and on health outcomes such as chronic pain and mental health, which are often difficult to obtain from electronic health records. They cover topics including dietary habits, cognitive function, occupational history, mental health, and pain.

Imaging Data

In 2014, UK Biobank launched the world’s largest multimodal imaging study, which aims to perform magnetic resonance imaging (MRI) of the brain, heart, and abdomen, dual-energy X-ray absorptiometry (DXA), and carotid ultrasound on 100,000 participants. This study, unprecedented in scale, will help researchers explore how lifestyle and genetic factors relate to the structure and function of the body. For example, brain scans provide data on white matter hyperintensities, abdominal scans measure visceral fat, and heart scans yield measures such as left ventricular ejection fraction. These data help clarify how lifestyle and genetic factors influence disease risk through biological mechanisms.

Genetic Data

UK Biobank’s whole-genome sequencing data now cover all 500,000 participants and are available to approved researchers, making this the largest whole-genome dataset in the world. These data will transform how scientists study the genetic factors underlying health and disease, and they complement the existing genotyping and exome sequencing data. Researchers can now access and process the whole-genome data for all 500,000 participants directly on the UK Biobank Research Analysis Platform.


The Features and Advantages of UK Biobank

UK Biobank plays a key role in understanding the relationship between health and disease and in improving public health. By the end of 2023, it had received research applications from more than 90 countries and regions and had more than 38,000 registered researchers, 84% of them from outside the UK. Research using its data had produced approximately 10,000 published papers with around 2.5 million citations, and the number of publications is growing rapidly each year. So why is UK Biobank attracting increasing attention from researchers? Compared with other databases, it has several distinctive advantages that make it valuable for studying the relationships between genetics, lifestyle, and health, for deepening our understanding of the causes of disease, and for improving quality of life. Three advantages stand out.

First, UK Biobank has a very large sample and extensive data resources. It recruited 500,000 volunteers aged 40 to 69 from England, Scotland, and Wales, equivalent to about 0.73% of the UK population. The data collected from these volunteers include genetic data, multimodal imaging data, biomarker data, and health-related data. This large and comprehensive dataset provides essential support for the development of precision medicine and plays an indispensable role in translating research findings into clinical practice. UK Biobank’s data have had a significant impact on research into the causes, prevention, treatment, and diagnosis of many diseases, such as colorectal cancer and diabetes.

For example, in 2014 UK Biobank launched an imaging study that aims to perform multimodal brain MRI scans on 100,000 participants. In 2018, with funding from Dementias Platform UK, repeat imaging was carried out for 3,000 participants. Combined with phenotypic and genetic data, these multimodal imaging data provide an invaluable resource for research in neuroscience and neurological disease, advancing our understanding of brain health and neurological conditions.

Second, UK Biobank’s data collection is ongoing, which offers distinct advantages for research. Take the online questionnaires as an example. Since 2011, UK Biobank has gradually expanded their content: it began with dietary habits, added cognitive function in 2014, added occupational history in 2015 to explore the links between work environment and health, and turned to mental health in 2016. By 2023, the questionnaires had expanded further to cover sleep quality, social interactions, attention, visualization, and memory. In 2024, UK Biobank introduced a new questionnaire on chronic pain to better understand its types, duration, and severity. Data collection does not end once a volunteer has joined; UK Biobank continues to develop new research content and update its data. This prospective epidemiological study began in 2006 and is expected to run until 2036, following participants’ health and medical conditions over the long term. The continuity of the data brings clear benefits: it supports long-term follow-up and trend analysis, helps researchers observe how health and lifestyle change over time, and can help reveal potential causal relationships. Ongoing collection also improves the completeness and reliability of the data, enabling in-depth multivariable analysis of how different factors interact.

Finally, and most importantly, the quality of UK Biobank’s data is very high. This is the result of rigorous standards for participant selection and data collection, multi-layered data sources, frequent long-term follow-up, and strict data quality control. UK Biobank recruited mainly people aged 40 to 69, and its data collection procedures are standardized to ensure accuracy and consistency. Data are gathered through multiple routes, including biological samples, genome sequencing, imaging, and questionnaires, and cover genetics, environment, lifestyle, and more. This multi-dimensional collection greatly increases the richness and completeness of the data, giving scientists a solid foundation for studying the complex relationships between health and disease.


Steps for Conducting Scientific Research Using the UK Biobank

UK Biobank, as a large public health database, is currently open to all researchers who meet its ethical and scientific standards. Applying for access to UK Biobank data involves three main steps: first, researchers register for an account; next, they submit a data access application describing their research purpose and the types of data they need; finally, they sign a data use agreement and pay the associated fees. The whole process typically takes around 3 to 4 months. The details are as follows.

First, go to the UK Biobank website and click “Researcher login” at the top of the page. Select “Sign up to access UK Biobank resources,” then enter your name and email address. After receiving an activation email, continue filling out your personal information, including your institution, email address, and publication record. Once submitted, your registration is reviewed, which typically takes around two weeks.

Data applications are based on specific research projects. Log in to the AMS system, click “Applications,” and select “Start New Application.” Complete the research proposal as required, including the project title, research question, brief background and methodology, data types needed, expected research value, keywords, and project summary. After submission, UK Biobank’s review process usually takes about two months.

Once approved, applicants will receive an email notification and will need to sign a data use agreement and pay the applicable fees. UK Biobank currently offers three pricing tiers: £3,000, £6,000, and £9,000 for a three-year data access period. The first tier includes questionnaire data, physical measurements, and health outcome phenotypes; the second tier adds experimental data, such as biochemistry and hematology; and the third tier includes imaging data, whole-genome sequencing, and whole-exome sequencing.
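The tier structure described above can be sketched as a small lookup. This is only an illustration: the category labels and the `required_tier` helper are hypothetical names invented for this example, not part of any UK Biobank system, and the sketch assumes the tiers are cumulative (each higher tier includes everything in the tiers below it). Fees are the figures quoted in this essay and may change.

```python
# Fee tiers as described in the essay (GBP, for one three-year access period).
# The category labels are illustrative names chosen for this sketch,
# not identifiers used by UK Biobank.
TIERS = [
    (1, 3000, {"questionnaire", "physical_measurements", "health_outcomes"}),
    (2, 6000, {"biochemistry", "hematology"}),
    (3, 9000, {"imaging", "whole_genome_sequencing", "whole_exome_sequencing"}),
]


def required_tier(requested):
    """Return (tier, fee_gbp) for the lowest tier covering every requested category.

    Assumes tiers are cumulative: a higher tier includes everything below it.
    """
    requested = set(requested)
    known = set().union(*(categories for _, _, categories in TIERS))
    unknown = requested - known
    if unknown:
        raise ValueError(f"unknown data categories: {sorted(unknown)}")

    tier, fee = TIERS[0][0], TIERS[0][1]  # tier 1 is the minimum
    for number, price, categories in TIERS:
        # Tiers are listed in ascending order, so the last match is the highest.
        if requested & categories:
            tier, fee = number, price
    return tier, fee


if __name__ == "__main__":
    # A project needing questionnaire and biochemistry data falls in tier 2.
    print(required_tier(["questionnaire", "biochemistry"]))  # (2, 6000)
```

Under these assumptions, a project that needs any imaging or sequencing data lands in the third tier regardless of what else it requests.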


Final Remarks

By combining genetic information, biological characteristics, and health records in a large-scale population study and opening the data to researchers, UK Biobank has created enormous scientific and social value. It offers valuable experience for future population studies and demonstrates the potential of collaborative projects on a national scale. UK Biobank’s success is a model that other countries can learn from and adapt.
