Genome-wide association studies of brain imaging phenotypes in UK Biobank

Science writer / Name / 2025-06-25 04:29

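　　The UK Biobank brain imaging protocol comprises six distinct modalities covering structural, diffusion and functional imaging, summarized in Supplementary Table 1. For this study, we primarily used the February 2017 release of imaging data on ~10,000 participants (with the January 2018 release of a further ~5,000 subjects providing a larger replication sample).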

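　　The raw data from these six modalities have been processed for UK Biobank to create a set of imaging-derived phenotypes (IDPs)4,5. These are available from UK Biobank, and it is the IDPs from the 2017–2018 data releases that we used in this study.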

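　　In addition to the IDPs available directly from UK Biobank, we created two further sets of IDPs. First, we used FreeSurfer v6.0.037,38 (https://surfer.nmr.mgh.harvard.edu) to model the cortical surface (the inner and outer 2D surfaces of cortical grey matter) and to model several subcortical structures. We used both the T1 and T2-FLAIR images as inputs to the FreeSurfer modelling (or the T1 alone when the T2-FLAIR was not available). FreeSurfer estimates a large number of structural phenotypes, including volumes of subcortical structures, the surface area of parcels identified on the cortical surface, and grey matter cortical thickness within these areas. The areas are defined by mapping an atlas containing a canonical cortical parcellation onto an individual subject's cortical surface model, thus achieving a parcellation of that surface. Here we used two atlases in common use with FreeSurfer: the Desikan–Killiany–Tourville atlas (denoted DKT39) and the Destrieux atlas (denoted a2009s40). The DKT parcellation is gyral-based, whereas Destrieux aims to model both gyri and sulci on the basis of the curvature of the surface. Cortical thickness is averaged across each parcel from each atlas, and the cortical area of each parcel is estimated, to create two IDPs for each parcel. Finally, subcortical volumes are estimated, to create a set of volumetric IDPs.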

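　　Second, we applied a dimension reduction approach to the large number of functional connectivity IDPs. Functional connectivity IDPs represent the network edges between many distinct pairs of brain regions, comprising in total 1,695 distinct region-pair brain connections (http://www.fmrib.ox.ac.uk/ukbiobank/). In addition to this being a very large number of IDPs from which to interpret association results, these individual IDPs tend to be noisier than most of the other, more structural, IDPs. Hence, although we did carry out GWASs for each of these 1,695 connectivity IDPs, we also reduced the full set of connectivity IDPs to just six new summary IDPs using data-driven feature identification. This dimensionality reduction was carried out by applying independent component analysis (ICA)41 to all functional connectivity IDPs from all subjects, to find linear combinations of IDPs. The ICA feature estimation was carried out with no use of the genetic data, and maximized independence between component IDP weights (as opposed to subject weights). Split-half reproducibility (across subjects) was used to optimize both the initial dimensionality reduction (14 eigenvectors from a singular value decomposition was found to be optimal) and the final number of ICA components (six ICA components was optimal, with reproducibility of ICA weight vectors greater than r = 0.9). The resulting six ICA features were then treated as new IDPs, representing six independent sets (or, more accurately, linear combinations) of the original functional connectivity IDPs. These six new IDPs were added to the GWAS analyses. The six ICA features explain 4.9% of the total variance in the full set of network connection features, and are visualized in Supplementary Fig. 18. More details of the ICA analysis of the resting-state data, together with a browsing facility for the highlighted brain regions, can be found on the FMRIB UK Biobank resource web page (http://www.fmrib.ox.ac.uk/ukbiobank/).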

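　　We organized all 3,144 IDPs into nine groups (Supplementary Table 12), each having a distinct pattern of missing values (not all subjects have usable, high-quality data from all modalities4). For the GWASs in this study, we did not try to impute missing IDPs, owing to the low levels of correlation observed across groups.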

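　　IDP values vary widely in their distributions across phenotype classes, with some phenotypes exhibiting substantial skew (Supplementary Fig. 19), which would probably invalidate the assumptions of the linear regression used to test for association. To ameliorate this, we quantile normalized each of the IDPs before association testing. This transformation also helps to avoid undue influence of outlier values. We also (separately) tested an alternative process in which outlier removal was applied to the untransformed IDPs. This gave very similar results for almost all association tests, but was found to reduce the significance of a small number of associations. This alternative method for IDP pre-processing was therefore not followed through (data not shown).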
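
　　As an illustration of the quantile normalization step, the following minimal sketch applies a rank-based inverse-normal transform to a single IDP. The function name, the (rank − 0.5)/n offset and the average-rank handling of ties are assumptions made for this example; the text does not specify the exact variant used in the study.

```python
from statistics import NormalDist


def quantile_normalize(values):
    """Rank-based inverse-normal transform of one IDP.

    `values` is a list of floats; None marks a missing value and is
    passed through unchanged. Tied values share their average rank.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    out = [None] * len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pos = 0
    while pos < n:
        # Find the run of values tied with observed[pos].
        end = pos
        while end + 1 < n and observed[end + 1][0] == observed[pos][0]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # 1-based average rank of the run
        # Assumed offset: map rank r to the quantile (r - 0.5) / n.
        z = std_normal.inv_cdf((mean_rank - 0.5) / n)
        for _, idx in observed[pos:end + 1]:
            out[idx] = z
        pos = end + 1
    return out


skewed_idp = [1.0, 2.0, 4.0, 8.0, 1000.0]
print(quantile_normalize(skewed_idp))
```

　　Because only the ranks enter the transform, the extreme value 1000.0 ends up no further from the centre than the smallest value, which is the sense in which the transform limits the influence of outliers.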

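　　No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.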

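　　We used the imputed genetic dataset made available by UK Biobank in its July 2017 release6. This consists of >92 million autosomal variants imputed from the Haplotype Reference Consortium (HRC) reference panel43 and a merged UK10K + 1000 Genomes reference panel. We first identified a set of 12,623 participants who had also been imaged by UK Biobank. We then applied filters to remove variants with a minor allele frequency (MAF) below 0.1% and with an imputation information score below 0.3, which reduced the number of SNPs to 18,174,817. We then kept only those samples (subjects) estimated to have recent British ancestry, using the sample quality control information provided centrally by UK Biobank6 (the variable in.white.British.ancestry.subset in the file ukb_sqc_v2.txt); population structure can be a serious confound in genetic association studies44, and this type of sample filtering is standard. This reduced the number of samples to 8,522. The UK Biobank dataset contains a number of close relatives (third cousins or closer). We therefore created a subset of 8,428 nominally unrelated subjects, following procedures similar to those described previously. After running GWASs on all of the (SNP) variants in the 8,428 samples, we applied three variant filters: we removed variants with a Hardy–Weinberg equilibrium P value <10−7, removed variants with MAF <0.1% and kept only those variants in the HRC reference panel. This resulted in a dataset with 11,734,353 SNPs.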
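
　　The two rounds of variant filtering can be sketched as simple predicates over per-variant summary values. This is an illustration only: the `Variant` record, its field names and the two function names are hypothetical, and in practice these quantities would be read from the imputation and quality-control files rather than constructed by hand.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    rsid: str
    maf: float     # minor allele frequency, as a fraction (0.001 = 0.1%)
    info: float    # imputation information score
    hwe_p: float   # Hardy-Weinberg equilibrium P value
    in_hrc: bool   # True if the variant is in the HRC reference panel


def pre_gwas_filter(variants, min_maf=0.001, min_info=0.3):
    """Drop variants with MAF below 0.1% or information score below 0.3."""
    return [v for v in variants if v.maf >= min_maf and v.info >= min_info]


def post_gwas_filter(variants, min_maf=0.001, min_hwe_p=1e-7):
    """Drop variants with HWE P < 1e-7 or MAF < 0.1%, and keep HRC variants only."""
    return [
        v for v in variants
        if v.hwe_p >= min_hwe_p and v.maf >= min_maf and v.in_hrc
    ]


toy = [
    Variant("rs_a", 0.20, 0.95, 0.5, True),
    Variant("rs_b", 0.0005, 0.99, 0.5, True),   # too rare
    Variant("rs_c", 0.05, 0.20, 0.5, True),     # poorly imputed
    Variant("rs_d", 0.05, 0.90, 1e-9, True),    # fails HWE
    Variant("rs_e", 0.05, 0.90, 0.3, False),    # not in HRC
]
survivors = post_gwas_filter(pre_gwas_filter(toy))
print([v.rsid for v in survivors])
```

　　Note that in the study the second MAF filter is applied after sample filtering, so the MAF passed to the second function would be the one computed in the 8,428 retained samples.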

　　We used two separate datasets to replicate the associated variants found in this study. The first set of 930 subjects was a subset of the 1,279 subjects with imaging data that we did not use for the main GWAS, who had primarily been excluded because they were not in the recent British ancestry subset. An examination of these samples according to the genetic principal components (PCs) revealed that many of those samples are mostly of European ancestry (Supplementary Fig. 20). We selected 930 samples with a first genetic PC <14 from Supplementary Fig. 20 and these constituted the replication sample. In January 2018 a further tranche of 4,588 samples with imaging data was released by UK Biobank. Of these subjects, we selected 3,956 subjects that both had genetic data available and also had been imaged in the same imaging centre as the discovery sample. We applied the same pre-processing pipeline as for the discovery set. We then restricted this to 3,456 subjects that were of recent British ancestry and replication tests were then conducted on these 3,456 subjects.

　　There are a number of potential confounding variables when carrying out GWASs of brain IDPs. We used three sets of covariates in our analyses, relating to (a) imaging confounds, (b) measures of genetic ancestry and (c) non-brain imaging body measures.

　　We identified a set of variables that were likely to represent imaging confounds, for example those associated with biases in noise or signal level, corruption of data by head motion or overall head size changes. For many of these we generated various versions (for example, using quantile normalization and also outlier removal, to generate two versions of a given variable, as well as including the squares of these to help model nonlinear effects of the potential confounds). This was done in order to generate a rich set of covariates and hence reduce as much as possible potential confounding effects on analyses such as the GWAS, which are particularly of concern when the subject numbers are so high4,45.

　　Age and sex can be variables of biological interest, but can also be sources of imaging confounds, and here were included in the confound regressors. Head motion is summarized from resting and task-based fMRI as the mean displacement (in mm) between one time point and the next, averaged over all time points and across the brain. Head motion can be a confounding factor for all modalities and not just those comprising timeseries of volumes, but is readily estimable only from the timeseries modalities. Nevertheless, the amount of head motion is expected to be reasonably similar across all modalities (for example, correlation between head motion in resting and task fMRI is r = 0.52) and so it is worth using fMRI-derived head motion estimates as confound regressors for all modalities.

　　The exact location of the head and the radio-frequency receiver coil in the scanner can affect data quality and IDPs. To help to account for variations in position in different scanned participants, several variables have been generated that describe aspects of the positioning (see http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25756, http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25757, http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25758, and http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25759). The intention is that these can be useful as ‘confound variables’; for example, these might be regressed out of brain IDPs before carrying out correlations between IDPs and non-imaging variables. TablePosition is the Z-position of the coil (and the scanner table on which the coil sits) within the scanner (the Z axis points down the centre of the magnet). BrainCoGZ is somewhat similar, being the Z-position of the centre of the brain within the scanner (derived from the brain mask estimated from the T1-weighted structural image). BrainCoGX is the X-position (left–right) of the centre of the brain mask within the scanner. BrainBackY is the Y-position (front–back relative to the head) of the back of brain mask within the scanner.

　　UK Biobank brain imaging aims to maintain as fixed an acquisition protocol as possible during the 5–6 years that the scanning of 100,000 participants will take. There have been a number of minor software upgrades (the imaging study seeks to minimize any major hardware or software changes). Detailed descriptions of every protocol change, along with thorough investigations of the effects of these on the resulting data, will be the subject of a future paper. Here, we attempted to model any long-term (over scan date) changes or drifts in the imaging protocol or software or hardware performance, by generating a number of data-driven confounds. The first step was to form a temporary working version of the full subjects × IDPs matrix with outliers limited (see below) and no missing data, using a variant of low-rank matrix imputation with soft thresholding on the eigenvalues46. Next, the data were temporally regularized (approximate scale factor of several months with respect to scan date, see https://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=53, Instance 2) with spline-based smoothing. We then applied PCA and kept the top 10 components, to generate a basis set that reflects the primary modes of slowly changing drifts in the data.

　　To describe the full set of imaging confounds we use a notation where subscript i indicates quantile normalization of variables, and m indicates median-based outlier removal (discarding values greater than five times the median absolute deviation from the overall median). If no subscript is included, no normalization or outlier removal was carried out. Certain combinations of normalization and powers were not included, either because of very high redundancy with existing combinations, or because a particular combination was not well-behaved. The full set of variables used to create the confounds matrix are: a, age at time of scanning, demeaned (cross-subject mean subtracted); s, sex, demeaned; q, four confounds relating to the position of the radio-frequency coil and the head in the scanner (see above), all demeaned; d, ten drift confounds (see above); m, two measures of head motion (one from resting fMRI, one from task-based fMRI); and h, volumetric scaling factor needed to normalize for head size47.
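
　　The median-based outlier removal described above can be sketched as follows. The function name and the use of None to mark discarded or missing values are choices made for this example.

```python
from statistics import median


def remove_outliers(values, n_mads=5.0):
    """Discard values lying more than n_mads median absolute deviations
    from the overall median. Discarded and missing values are None."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    centre = median(observed)
    mad = median(abs(v - centre) for v in observed)
    limit = n_mads * mad
    return [
        v if v is not None and abs(v - centre) <= limit else None
        for v in values
    ]


print(remove_outliers([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]))
# The median is 3.5 and the MAD is 1.5, so the limit is 7.5 and only 100.0 is dropped.
```

　　One consequence of this rule is that a variable whose MAD is zero (more than half of the values identical) keeps only the values equal to the median; such a case would need separate handling.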

　　The full matrix of imaging confounds is then:

　　Any missing values in this matrix are set to zero after all columns have had their mean subtracted. This results in a full-rank matrix of 53 columns (ratio of maximum to minimum eigenvalues is 42.6). Additional discussion on the dangers and interpretation of imaging confounds in big imaging data studies, particularly in the context of disease studies, has been published45.
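
　　The treatment of missing confound values amounts to demeaning each column over its observed entries and then writing zeros into the gaps. A minimal sketch for one column, with a hypothetical function name and None standing for a missing entry:

```python
def demean_and_zero_fill(column):
    """Subtract the mean of the observed entries, then set missing entries to 0."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [0.0 if v is None else v - mean for v in column]


print(demean_and_zero_fill([1.0, None, 3.0]))  # [-1.0, 0.0, 1.0]
```

　　Because the column has already been centred, filling with zero is equivalent to imputing the column mean, so a subject with a missing confound value does not pull the regression fit for that confound in either direction.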

　　Genetic ancestry is a well-known potential confound in GWAS. We ameliorated this by filtering out samples that were not of recent British ancestry. However, a set of 40 genetic principal components (PCs) has been provided by UK Biobank6, and we used these PCs as covariates in all of our analyses. The matrix of imaging confounds, together with a matrix of 40 genetic principal components, was regressed out of each IDP before the analyses reported here.
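
　　Regressing confounds out of an IDP means replacing the IDP by the residuals of a least-squares fit on the confound columns. The sketch below does this in plain Python by orthogonalizing the columns with Gram–Schmidt; it is an illustration under the assumption of complete data and an added intercept column, not the study's MATLAB code, and the function name is hypothetical.

```python
def residualize(y, confounds):
    """Return y minus its least-squares projection onto the confound columns.

    `confounds` is a list of columns, each the same length as y. An intercept
    column is added, and linearly dependent columns are skipped.
    """
    n = len(y)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    basis = []  # orthonormal basis of the confound space
    for col in [[1.0] * n] + [list(c) for c in confounds]:
        scale = dot(col, col) ** 0.5
        v = list(col)
        for q in basis:
            coef = dot(q, v)
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-10 * scale:  # relative tolerance; also skips all-zero columns
            basis.append([vi / norm for vi in v])

    resid = list(y)
    for q in basis:
        coef = dot(q, resid)
        resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return resid


age = [-2.0, -1.0, 0.0, 1.0, 2.0]
signal = [1.0, -2.0, 2.0, -2.0, 1.0]           # orthogonal to intercept and age
idp = [3.0 + 2.0 * a + s for a, s in zip(age, signal)]
print(residualize(idp, [age]))                  # approximately equal to `signal`
```

　　In the study the confound set is the 53-column imaging matrix together with the 40 genetic principal components; the same projection applies, only with more columns.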

　　There exist a number of substantial correlations between IDPs and non-genetic variables collected on the UK Biobank subjects4. We therefore also carried out some analyses involving variables relating to blood pressure (diastolic and systolic), height, weight, head bone mineral density, head bone mineral content and two principal components from the broader set of bone mineral variables available (https://biobank.ctsu.ox.ac.uk/crystal/docs/DXA_explan_doc.pdf). Supplementary Fig. 21 shows the association of these eight variables against the IDPs and shows significant associations. These are variables that are likely to have a genetic basis, at least in part. Genetic variants associated with these variables might then produce false positive associations for IDPs. To investigate this possibility, we ran GWASs for these eight traits (conditioned on the imaging confounds and genetic PCs) (Supplementary Fig. 22). We also ran a parallel set of IDP GWASs with these ‘body confounds’ regressed out of the IDPs.

　　We used a linear mixed model implemented in the SBAT (sparse Bayesian association test) software (https://jmarchini.org/sbat/) to calculate additive genetic heritabilities for the P = 3,144 traits. To estimate genetic correlations we used a multi-trait mixed model. If Y is an N × P matrix of P phenotypes (columns) measured on N individuals (rows) then we use the model:

　　$$Y = U + \varepsilon$$

　　where U is an N × P matrix of random effects and ε is an N × P matrix of residuals, and these are modelled using matrix normal distributions as follows:

　　$$U \sim \mathcal{MN}(0, K, B), \qquad \varepsilon \sim \mathcal{MN}(0, I_N, E)$$

　　In this model, K is the N × N kinship matrix between individuals, B is the P × P matrix of genetic covariances between phenotypes and E is the P × P matrix of residual covariances between phenotypes. We estimate the covariance matrices B and E using a new C++ implementation of an EM algorithm48 included in the SBAT software (https://jmarchini.org/sbat/).

　　For the marginal heritabilities and genetic correlation analysis we used a realized relationship matrix (RRM) for the kinship matrix (K). This RRM was calculated from the 8,428 nominally unrelated individuals using fastLMM (https://github.com/MicrosoftGenomics/FaST-LMM). We used the subset of imputed SNPs that were both assayed by the genotyping chips and included in the HRC reference panel, and so will essentially be hard-called genotypes. In addition, all SNPs with duplicate rsids (reference SNP cluster IDs) were removed. Plink (http://www.cog-genomics.org/plink/2.0/) was used for file conversion before input into fastLMM.

　　To estimate genetic correlations, we fit the model to several of the groupings of IDPs detailed in Supplementary Table 12. The estimated covariance matrices B and E were used to estimate the genetic correlation of pairs of IDPs. The genetic correlation between the ith and jth IDPs in a jointly analysed group of IDPs is estimated as

　　$$r_{ij} = \frac{B_{ij}}{\sqrt{B_{ii}\,B_{jj}}}$$
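
　　A minimal sketch of this normalization, assuming the conventional definition in which the genetic covariance of two traits is divided by the square root of the product of their genetic variances. The function name is hypothetical and the matrix values are invented for illustration.

```python
import math


def genetic_correlation(B):
    """Convert a P x P genetic covariance matrix into a correlation matrix."""
    p = len(B)
    return [
        [B[i][j] / math.sqrt(B[i][i] * B[j][j]) for j in range(p)]
        for i in range(p)
    ]


# Toy 2 x 2 genetic covariance matrix (made-up numbers).
B_hat = [[4.0, 1.2],
         [1.2, 1.0]]
print(genetic_correlation(B_hat)[0][1])  # 1.2 / sqrt(4.0 * 1.0) = 0.6
```

　　The ratio is only stable when both diagonal entries are clearly greater than zero, that is, when both traits are themselves heritable.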

　　We used a multi-trait mixed model to test each SNP for association with different groupings of traits (Supplementary Table 7). The model has the form Y = Gα + U + ε, where G is an N × 1 vector of SNP dosages and α is a 1 × P vector of effect sizes. We fit the model using estimates of B and E from the ‘null’ model with α = 0 and a leave one chromosome out (LOCO) approach for RRM calculation. We ran this test on the main set of 8,428 samples and on the replication samples. For the replication analysis we used the estimates of B and E from the main set of 8,428 samples. This test was implemented in SBAT software.

　　We used BGENIE v1.2 (https://jmarchini.org/bgenie/) to carry out GWASs of imputed variants against each of the processed IDPs. This program was designed to carry out the large number of IDP GWASs required in this analysis. It avoids repeated reading of the genetic data file for each IDP and uses efficient linear algebra libraries and threading to achieve good performance. The program has already been used by several studies to analyse genetic data from the UK Biobank49,50. We fit an additive model of association at each variant, using expected genotype count (dosage) from the imputed genetic data. We ran association tests on the main set of 8,428 samples and the replication samples.

　　Most GWASs analyse only one or a few phenotypes, and often uncover just a handful of associated genetic loci, which can be interrogated in detail. Owing to the large number of associations uncovered in this study, we developed an automated method to identify, distinguish and count individual associated loci from the 3,144 GWASs (one GWAS for each IDP). For each GWAS we first identified all variants with −log10(P) > 7.5. We applied an iterative process that starts by identifying the most strongly associated variant, storing it as a lead variant, and then removing it and all variants within 0.25 cM of it (equivalent to approximately 250 kb in physical distance) from the list of variants. The process was then repeated until the list of variants was empty. We applied this process to each GWAS using two filters on MAF: (a) MAF > 0.1% and (b) MAF > 1%. We then grouped associated lead SNPs across phenotypes into clusters. This process first grouped SNPs lying close to one another, which mostly produced sensible clusters, but some hand curation was used to merge or split clusters on the basis of visual inspection of cluster plots and levels of linkage disequilibrium between SNPs. For some clusters in Extended Data Table 1, we report SNPs that were found to be in high linkage disequilibrium with the lead SNPs.
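
　　The iterative lead-variant procedure can be sketched as a greedy loop. The tuple layout and function name are hypothetical, and restricting the 0.25 cM window to variants on the same chromosome is an assumption of this example.

```python
def find_lead_variants(variants, threshold=7.5, window_cm=0.25):
    """Greedy selection of lead variants for one GWAS.

    `variants` is a list of (variant_id, chromosome, position_cm, neg_log10_p).
    """
    remaining = [v for v in variants if v[3] > threshold]
    leads = []
    while remaining:
        lead = max(remaining, key=lambda v: v[3])  # most strongly associated
        leads.append(lead)
        # Drop the lead and every variant within the window on its chromosome.
        remaining = [
            v for v in remaining
            if v[1] != lead[1] or abs(v[2] - lead[2]) > window_cm
        ]
    return leads


gwas = [
    ("rs1", "1", 10.0, 12.0),
    ("rs2", "1", 10.1, 9.0),   # within 0.25 cM of rs1, absorbed by it
    ("rs3", "1", 10.6, 8.0),   # same chromosome, but outside the window
    ("rs4", "2", 10.0, 8.5),   # different chromosome
    ("rs5", "1", 50.0, 7.0),   # below the significance threshold
]
print([v[0] for v in find_lead_variants(gwas)])  # ['rs1', 'rs4', 'rs3']
```

　　Applying the two MAF filters mentioned in the text would simply mean calling the function twice on differently pre-filtered variant lists.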

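　　We adjusted the genome-wide significance threshold (−log10(P) > 7.5) by a Bonferroni factor (log10(3,144) ≈ 3.5) that accounts for the number of IDPs tested, giving a threshold of −log10(P) > 11. This (incorrectly) assumes that the IDPs are independent and so is likely to be conservative, but we preferred to err on the side of caution.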
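
　　On the −log10 scale a Bonferroni correction is an addition: dividing the P value threshold by the number of tests adds log10 of that number to the threshold. A short check of the arithmetic, with a hypothetical function name:

```python
import math


def bonferroni_threshold(base_neg_log10_p=7.5, n_tests=3144):
    """Return the -log10(P) threshold after correcting for n_tests phenotypes."""
    return base_neg_log10_p + math.log10(n_tests)


print(round(bonferroni_threshold(), 2))  # 7.5 + log10(3144), which rounds to 11.0
```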

　　We used linkage disequilibrium (LD) score regression51 to estimate the genetic correlation between the IDPs studied in our analysis and ten disease, personality or brain-related traits. We gathered summary statistics from studies of neuroticism personality trait (https://www.thessgac.org/data), autism (https://www.med.unc.edu/pgc/) and sleep duration (http://www.t2diabetesgenes.org/data/), and from studies of attention deficit hyperactivity disorder, schizophrenia, major depressive disorder and bipolar disorder (https://www.med.unc.edu/pgc/), Alzheimer's disease (http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php), stroke (http://cerebrovascularportal.org/informational/downloads) and amyotrophic lateral sclerosis (http://databrowser.projectmine.com/). The number of samples in each study and the DOIs of the corresponding studies are provided in Supplementary Table 13.

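　　For each IDP–trait pair, we used the LDSCORE regression software (v1.0.0; https://github.com/bulik/ldsc) to compute the genetic correlation between the IDP and the trait, using LD measurements taken from the 1000 Genomes Project (provided by the maintainers of LDSCORE regression). We filtered the SNPs to include only those with imputation information ≥0.9 and MAF ≥0.1%. Only the major depressive disorder, schizophrenia and attention deficit hyperactivity disorder source studies provided information scores, so for these three analyses we applied the information threshold to SNPs from both our study and the source study. For the remaining six studies, the information filter was applied to SNPs in our own study. Because of the low heritability of the functional edge IDPs, these were all removed from this analysis. Because calculating the genetic correlation between traits only really makes sense when both traits are themselves heritable, we used only those IDPs whose Z-score for significantly nonzero heritability was greater than 4. In total, 897 IDPs were used. To account for the correlation between IDPs, we used the raw phenotype correlation matrix to simulate Z-scores (and associated tail probabilities) using samples from a multivariate normal distribution with the same correlation matrix.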
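
　　Drawing correlated Z-scores from a multivariate normal distribution with a given correlation matrix can be done by multiplying independent standard normal draws by a Cholesky factor of that matrix. The sketch below shows only this sampling step, with hypothetical function names and a toy 2 × 2 matrix; how the simulated values were then used in the study is not detailed in the text.

```python
import math
import random
from statistics import NormalDist


def cholesky(corr):
    """Lower-triangular L with L L^T = corr (corr must be positive definite)."""
    n = len(corr)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(corr[i][i] - s)
            else:
                L[i][j] = (corr[i][j] - s) / L[j][j]
    return L


def simulate_z_scores(corr, n_draws, seed=0):
    """Return n_draws vectors of Z-scores whose correlation matrix is corr."""
    rng = random.Random(seed)
    L = cholesky(corr)
    n = len(corr)
    draws = []
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append([sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(n)])
    return draws


def two_sided_tail(z):
    """Two-sided tail probability of a standard normal Z-score."""
    return 2.0 * NormalDist().cdf(-abs(z))


phenotype_corr = [[1.0, 0.8],
                  [0.8, 1.0]]
first_draw = simulate_z_scores(phenotype_corr, 1000, seed=1)[0]
print(first_draw, [two_sided_tail(z) for z in first_draw])
```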

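　　We used the LDSCORE regression software (https://github.com/bulik/ldsc) to carry out heritability enrichment analysis partitioned into different functional categories. We used 24 functional categories: coding, UTR, promoter, intron, histone marks H3K4me1, H3K4me3, H3K9ac5 and two versions of H3K27ac, open chromatin DNase I hypersensitivity site (DHS) regions, combined chromHMM/Segway predictions, regions conserved in mammals, super-enhancers and active enhancers from the FANTOM5 panel of samples. For each IDP, the enrichment of each functional category was summarized as the proportion of heritability (h2) attributed to the category divided by the proportion of common variants in that category. For each IDP and each annotation, we used the two-sided enrichment P value reported by the LDSCORE regression software. We labelled these P values as either enriched or depleted, depending on whether the enrichment estimate was greater than or less than 1. We grouped these P values according to 23 groups of IDPs.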
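
　　The enrichment statistic is a ratio of two proportions, and the enriched or depleted label follows from whether that ratio is above or below 1. A minimal sketch with hypothetical function names and invented numbers (the heritability partition itself comes from the LD score regression software and is not reproduced here):

```python
def enrichment(h2_category, h2_total, snps_category, snps_total):
    """Proportion of heritability in a category divided by its share of common variants."""
    return (h2_category / h2_total) / (snps_category / snps_total)


def enrichment_label(estimate):
    """Label an enrichment estimate relative to the null value of 1."""
    if estimate > 1.0:
        return "enriched"
    if estimate < 1.0:
        return "depleted"
    return "neither"


# Toy numbers: a category holding 1% of common variants and 20% of h2.
estimate = enrichment(h2_category=0.06, h2_total=0.30,
                      snps_category=10_000, snps_total=1_000_000)
print(estimate, enrichment_label(estimate))
```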

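　　Most of the software and code used in this study is publicly available, including the custom MATLAB scripts used to prepare the IDPs for the GWAS (http://www.fmrib.ox.ac.uk/ukbiobank/gwaspaper/). Pre-compiled binaries of the latest versions of BGENIE and SBAT are available at https://jmarchini.org/software/. The software is currently freely available for use by researchers at academic institutions. Commercial organizations wishing to use these packages must enquire about a licence from the University of Oxford. Brain image processing was largely carried out using FSL (FMRIB Software Library, https://fsl.fmrib.ox.ac.uk/fsl/fslwiki) and FSLNets (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/fslnets).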

　　Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
