癌症跨11种肿瘤类型的癌症过渡期间的表观遗传调节

商业作者 / 姓名 / 2025-06-25 02:15
"
  所有用于MM,OV,BRCA,PDAC,UCEC,CRC,CESC/AD,SKCM和HNSCC的样品,以及GBM的2个NAT和CCRCC的1个NAT,并在与机构审查委

  所有用于MM,OV,BRCA,PDAC,UCEC,CRC,CESC/AD,SKCM和HNSCC的样品,以及GBM的2个NAT和CCRCC的1个NAT,并在与机构审查委员会(IRB)批准的华盛顿大学批准的所有样本一起收集了与知情人士的同意。IRB协议如下:201105374,201108117,201407156,202106166,201911095,201102270,201103136和20112312。在手术切除过程中收集了肿瘤样品,并通过标准病理进行了验证。源自NCI CPTAC的GBM和CCRCC样品是先前研究的一部分12,53。

  从每个样品中检索了大约20–30 mg的闪光或冷冻管或200μm的OCT组织,并将等分试样用于铬制备铬的下一个Gem Single Cell Multiome ATAC +基因表达测序方案,以同时分析同一表观远景和基因表达在同一单个核心中。The samples were resuspended in lysis buffer (10 mM Tris-HCl (pH 7.4) (Thermo Fisher Scientific, 15567027), 10 mM NaCl (Thermo Fisher Scientific, AM9759), 3 mM MgCl2 (Thermo Fisher Scientific, AM9530G), NP-40 substitute (Sigma-Aldrich, 74385-1L), 1 M DTT(Sigma-Aldrich,646563),10%库存BSA溶液(MAC,130-091-376),无核酸酶的水(Invitrogen,AM9937),加上0.1Uμl-1 RNase RNase抑制剂),然后使用锅炉和40μmmmed(然后通过锅炉)恢复和均匀地将其恢复并均匀地进行过滤。(2%BSA + 1×PBS + RNase抑制剂)。收集滤液,然后在4°C下以500克离心6分钟。然后,用RNase抑制剂重悬于BSA洗涤缓冲液中,用7AAD染色,并使用荧光激活的细胞分选(FACS)对核进行染色,并单细胞进行纯化。After counting and microscopy inspection of nucleus quality and concentration, nucleus suspensions were incubated in a transposition mix that included a transposase, while adapter sequences were simultaneously added to the ends of the DNA fragments and the preparation was diluted to 3,000–8,000 nuclei per μl to be used as one reaction for downstream preparation of both ATAC and gene expression preparation.使用下一个GEM单细胞多组ATAC +基因表达试剂盒(10x基因组学)和乳液中的凝胶珠(GEMS)使用铬铬铬芯下一个GEM J芯片J单细胞套件,16 rxns(PN-1000230),使用了约20,000个核进行分析。邮寄后的GEM-RT清理后, 进行了预扩增步骤,并将预放大的产物用作ATAC库构建的输入和基因表达文库结构的cDNA扩增。在逆转录(RT)反应期间,用16个核苷酸条形码和10个核苷酸分子标识符进行cDNA扩增/标记。放大后,将样品分开并用作两个单独的步骤的输入:40μl样品用于ATAC库的构造,并将35μl样品用于cDNA扩增。总cDNA中只有25%用于生成SnRNA-Seq的GEX库。根据制造商的SNATAC协议,使用10倍基因组单索引N集对库进行测序;而且,对于SNRNA,根据制造商的协议,使用了10倍基因组双索引TT集A。用于图书馆准备。

  根据颗粒的大小,分别用DRAQ5或7AAD染色了洗涤缓冲液(2%BSA + 1×PBS + RNase抑制剂)中100-500μl的核悬浮液,分别用于RNA或ATAC测序。具体而言,将SNRNA-SEQ核用每300μL样品染色1μlDRAQ5,并用每500μL样品用1μL的7AAD染色SNATAC-SEQ核。分类门是基于大小,粒度和染料染色信号的。

  骨髓单核细胞等分试样在解冻后以300g的含量离心5分钟,直至颗粒细胞。所有上清液均已去除。为了制备使用Miltenyi死细胞去除试剂盒处理的细胞,将细胞重悬于100μl的珠子中,并在室温下孵育15分钟。然后使用Automacs Pro分离器通过耗尽选择。通过以450克离心5分钟,将负数(活细胞)固定。最终将细胞重悬于冰冷的磷酸盐缓冲盐水(PBS)和0.5%BSA中,并加载到10倍基因组铬控制器上。使用10倍基因组铬加载样品,下一个宝石单细胞3',库和凝胶珠套件V2。然后将条形码的文库汇总并在具有相关流动池的Illumina Novaseq 6000系统上进行测序。

  首先,将15-25毫克的粉状组织放入冰上的5 mL eppendorf管中。使用宽孔移液尖端(Rainin),将根据细胞核异化方案(10X基因组学)和超级糖抑制剂(Invitrogen)制备的裂解缓冲液(Invitrogen)添加到管中。将组织溶液轻轻移动,直到裂解液变成略多云的颜色(移液迭代的数量取决于特定的组织)。然后将组织匀浆通过40μM滤网(Pluriselect)过滤,并用BSA洗涤缓冲液(2%BSA+1×PBS+RNase抑制剂)洗涤。收集滤液,在4°C下以500克离心6分钟,并用BSA洗涤缓冲液重悬于。然后,将100μl的细胞裂解溶液列出为未染色的参考,其余的用7AAD或DRAQ5染色,具体取决于ATAC或RNA方案。核经过FACS和分类门是基于大小,粒度和染料染色信号的。将最终悬浮液在4°C下以500克离心6分钟,并用BSA洗涤缓冲液重悬于。有关RNA协议的更多具体详细信息,请访问protocols.io(https://doi.org/10.17504/protocols.io.14egn7w6zv5d/v1(RNA协议);对于ATAC协议,7AAD代替了染料)。

  使用10倍基因组铬仪器在油滴中分离核和条形码珠。对单核悬浮液进行计数,并使用血细胞计数仪调节每µL 500至1,800个核。随后进行了逆转录以结合细胞和转录特异性条形码。所有SnRNA-Seq样品均使用铬的下一个宝石单细胞3'库和凝胶珠套件v3.1(10x基因组学)运行。对于SNATAC-SEQ,铬的下一个宝石单细胞ATAC库和凝胶珠盒V1.1 Prep(10x基因组学)用于一部分样品。对于多组盒,使用了铬的下一个宝石单细胞多组ATAC +基因表达试剂盒。然后将条形码的文库汇总并在Illumina novaseq 6000系统上进行测序,并带有相关的流动池。

  肿瘤组织和相应的正常邻近组织是从外科手术切除的样品中获得的,在去除一块以进行新鲜的单细胞制备后,其余样品在液氮中被捕获,并在-80°C下储存。在批量DNA提取之前,将样品冷冻污染(Covaris)并进行等分用于大量提取方法。使用Dneasy血液和组织试剂盒(Qiagen,69504)或Qiaamp DNA Mini Kit(Qiagen,51304)从组织样品中提取基因组DNA。根据制造商的说明(Qiagen),使用Qiaamp DNA Mini试剂盒(Qiagen,51304)从冷冻保存的外周血单核细胞中纯化基因组种系DNA。根据制造商的说明(Thermo Fisher Scientific),使用量子DSDNA HS分析(Q32854)通过荧光测定法(Q32854)评估DNA数量。

  总共将100–250 ng的基因组DNA碎裂在靶向250 bp插入物的Covaris LE220仪器上。使用Sciclone NGS平台(Perkin Elmer)上的KAPA Hyper Library Prep套件(Roche)构建了自动双索引库。在以5 µg库池为目标之前,将多达十个文库按等摩尔比以等摩尔比的比率合并。使用XGEN Exome Research面板V1.0试剂(IDT Technologies)杂交图书馆池,该研究小组跨越了人类基因组的39 MB靶区域(19,396个基因)。将库在65°C杂交16-18小时,然后进行严格的洗涤,以去除虚假的杂交库碎片。洗脱富集的文库片段,并进行PCR周期优化,以防止过度放大。在测序之前,使用Kapa Hifi Master Mix(Roche)扩增了富集的文库。根据制造商协议(Roche),通过定量PCR(QPCR)确定每个捕获的库池的浓度,以产生适合Illumina Novaseq 6000仪器的群集计数。生成了2×150 bp配对的末端读数,以靶向12 GB的序列,以达到每个库的100倍覆盖范围。在225个SNATAC-SEQ样品中,生成了195个匹配的WES数据。在这195个样品中,可以使用并用于生成173个WES文库,与SNATAC-SEQ样品的176个相对应。

  CAKI-1细胞系购自ATCC(ATCC,htb-46,https://www.atcc.org/products/htb-46),并使用ATCC的Short-Tandem-repeat(str)分析进行了认证。MCF7单元线是从ATCC(ATCC,htb-22,https://www.atcc.org/products/htb-22)购买的,并通过ATCC进行了str分析来验证。U251细胞系是从先前的研究54获得的,并通过Str分析验证。本文中没有使用的单元线在国际细胞线认证委员会(ICLAC)维护的普遍误识的细胞系数据库中列出了任何细胞系。这里使用的所有细胞系都使用mycoalert(Lonza,LT07-118)测试了支原体污染的阴性。

  根据有关美国类型文化收藏(ATCC)网站(https://www.atcc.org/)的信息,在指定条件下培养了CAKI-1,MCF7和U251细胞系。当细胞达到所需的汇合和数字时,根据制造商的协议应用了切割和运行套件(14-1048,Epicypher)和剪切和运行库预备套件(14-1002,Epicyher)。简而言之,清洗缓冲液,细胞透化缓冲液和抗体缓冲液在第1天准备新鲜。在这些步骤之后,为每种反应收集了500,000个细胞,然后在洗涤缓冲液中重悬于,并与活化的珠子混合。在室温下孵育10分钟后,将管子放在8株磁铁上,直到清除浆液。除去上清液,并将冷抗体缓冲液添加到每个反应中。首先将Snap-Cutana K-Metstat面板添加到设计用于正(H3K4ME3)和阴性(IgG)对照抗体的反应中。然后,将0.5μg指定的抗体添加到每个反应中并孵育过夜。第二天,用细胞通透缓冲液洗涤抗体结合的组蛋白PTM或染色质相互作用蛋白。接下来,将PAG-MNase添加到裂解靶-DNA复合物中。然后通过添加氯化钙,大肠杆菌尖峰 - 内DNA和停止缓冲液主混合物来消化靶向染色质并释放。根据Cutana Cut&Run Library Prep套件,将DNA纯化,并最多可用于5 ng剪切和富含的DNA进行进一步的图书馆构造。使用贴纸分析了库片段尺寸,并测序了库。

  在切割和运行分析中使用了以下抗体:抗NRF1(46743,细胞信号技术)和抗CTCF(3418,细胞信号技术)。

  将以下小鼠菌株(Mus musculus)用作本研究的一部分:p48-cre小鼠(C57BL/6J背景; S。hingorani实验室);LSL-KRASG12D小鼠(C57BL/6J背景;杰克逊实验室,008179);Trp53flox小鼠(C57BL/6J背景;杰克逊实验室,008462)。所有动物研究均根据NIH-AALAC标准完成,并与华盛顿大学医学学院IACUC法规一致(协议,22-0233),研究得到了华盛顿大学医学学院机构动物研究委员会的批准。所有动物均在12 h – 12 h的浅色周期内饲养在屏障设施中,每个笼子为1-5只小鼠。

  为了对小鼠PDAC进行MPIHC分析,将嵌入的组织切成6μM切片,并将其加载到键rxm(Leica Biosystems)中进行一系列染色,包括使用针对GATA6(Invitrogen,PA11-104)和CK19的抗体(Cell Signaling Technology,12434)。根据抗体宿主物种,使用了默认的制造商方案(Intenser和聚合物精炼),包括柠檬酸盐缓冲液,山羊血清和过氧化物块的抗原检索;原代抗体孵育;后孵化;使用AEC底物(ABCAM)的成色可视化。在染色的每两个循环之间,用甲莫妥蛋白和曙红手动染色,然后使用Axio Scan.z1(Zeiss)系统进行扫描。然后,载玻片被乙醇的梯度加上2%盐酸盐洗涤,并用额外的Avidin/biotin(载体实验室)和Fab碎片块(Jackson Laboratory)阻塞。在每个染色周期之前进行基于柠檬酸盐的抗原检索。使用Zen(Zeiss)将同一标本但使用不同污渍的图像裁剪成多个片段。然后将每个片段对单个污渍进行反驳(Deonvolution,V.1.0.4; Indica Labs),并使用Halo Software(Zeiss)与默认制造商的设置进行融合。使用HALO软件中的高PLEX FL模块对感兴趣的标记进行了伪化学和量化。

  使用Trimgalore v.0.6.7(具有参数-Length 36,以及设置为默认值的所有其他参数; https://github.com/felixkrueger/trimgalore)对FASTQ文件进行预处理。然后,使用BWA -MEM55 V.0.7.17与GADC文件对齐与GDC的GRCH38人参考基因组(GRCH38.D1.VD1),带有参数-M,所有其他参数设置为默认值。使用SAMTools(https://github.com/samtools/samtools; v.1.14)视图将输出SAM文件转换为BAM,并使用参数-SHB以及所有其他参数设置为默认。对BAM文件进行排序,并使用PICARD v.2.6.26 SOLTAM工具标记了重复项,其中具有以下参数:create_index = true,sort_order = coordication,verialation_stringency = strict = strict,所有其他设置为默认为;并使用参数remove_duplicates = true进行标记绘制,所有其他设置为默认值。然后使用SamTools v.1.14索引将最终的BAM文件索引,并将所有参数设置为默认值。

  使用SomaticWrapper Pipeline v.1.6(https://github.com/ding-lab/somaticwrapper)从WES调用体突变,其中包括四个不同的呼叫者,即strelka(v.2.9.10)56,56,stutect(v.1.1.1.1.7)57,varscan(varscan(varscan(v.2.2.3.8)58和pindel(v.1.1.7)58和pindel(V.0)。我们将任何两个呼叫者在突变v.1.1.7,Varscan v.2.3.8和Strelka V.2.9.10中调用的外显子单核苷酸变体(SNV),以及Varscan v.2.3.8中的任何两个呼叫者,Strelka V.2.9.9.10和Pindel V.0.2.2.5。对于合并后的SNV和Indels,我们分别应用了14×和8×最小的覆盖范围,分别为正常。我们还通过肿瘤中的0.05的最小变体等位基因馏分(VAF)过滤了SNV和INIDEL,而正常样品中最大VAF为0.02。我们还过滤了在同一肿瘤样品中发现的Indel 10 bp之内的任何SNV。最后,我们根据已建立的基因共识列表49,在CCRCC驱动基因中以0.015范围内的VAF挽救了罕见的突变。在下游步骤中,SomaticWrapper使用Cocoon(https://github.com/ding-lab/cocoons)将相邻的SNV结合到双核苷酸多态性(DNPS)中:作为输入,Cocoon从标准的呼叫管道中获取MAF文件。首先,它作为DNP候选集合在2 bp窗口内提取变体。接下来,如果可用用于变体调用的相应BAM文件,它将提取每个变体集中所有候选DNP位置的读取(表示为NT),并计算所有共发生变体(表示为NC)的读数数量,以计算共恒流率(RC = NC/NT)。如果RC≥0.8,则将附近的SNV合并到DNP中,Cocoon将根据成绩单基于成绩单和MAF文件中的信息来更新来自同一密码子的DNP的注释。最后,我们用[0.015,0.05)的VAF营救了罕见的突变。 在基于上述基因共识列表的癌症驱动基因中49。进一步分析的重点是先前出版物中报道的癌症驱动基因12,49,53,60,61,62,63。

  对于不可用的配对正常样品的样品,使用Mutect2(来自gatk v.4.1.2.0的工具)称为tumour的体细胞变体,somaticwrapper pipeline(https://github.com/github.com/somationwrab/somaticwrab/somationwrapper/tree/tree/tree/tree/troper/troper/tonly.v1.0)(https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files; gatk4_mutect2_4136_pon.vcf.tar)。通过仅保留≥20×覆盖范围和> 3替代等位基因读取的变体位点,以≥0.1替代等位基因VAF来过滤误报。使用茧再次推断DNP。

  我们使用BAM ReadCount来确定和手动验证KRAS热点Gly12,Gly13和Gln61的散装WES中的KRAS突变状态。对于每种情况,我们首先应用BAM ReadCount来生成这些基因座的九个基础中的每个基础中的每个底数(3个密码子乘以3个基部),然后根据每个位置的参考和替代基础读数计算所有KRAS热点的VAF值。SomaticWrapper管道尚未确定变体的唯一实例是在PDAC中。由于PDAC中KRAS热点突变的高率(> 90%),因此在基因分型过程中在PDAC中检测到的任何此类突变都会自动报道64。

  为了识别KRAS热点突变状态,我们应用了内部工具SCVARSCAN,可以通过追踪每个SNRNA BAM文件中的分子条形码信息来识别支持参考和变体等位基因的读取。对于映射,我们使用了纪念斯隆kettering癌症中心热点网站(https://www.cancerhotspots.org)来获取Gly12,gly13和gln61上最常见的KRAS Hotspot突变,然后使用SCVARSCAN,然后使用SCVARSCAN来检测每个样品中潜在的少数KRAS突变。对于非PDAC样品,然后将命中过滤至高质量的突变等位基因> 5。

  使用gatk(v.4.1.9.0)65调用体拷贝数变体。具体而言,将HG38人参考基因组(NCI GDC数据门户网站)用于使用预处理互动函数中的目标间隔,并将BIN长度设置为1,000 bp,并使用重叠式的Interval-Merg-rule-lule-the。然后使用每个正常样本作为输入生成一组正常的(PON),而GATK函数收集了带有参数的ReadCounts-Interval-Mergy-Merging-rule-lule ryplapping_only,然后使用createAdcountpanelofnorofnormals进行参数-Minimim-Minterterval-Midterval-Median-Median-Percentile 5.0。对于肿瘤样品,使用GATK函数收集扣来计数目标间隔重叠的读物。然后使用gatk函数denoisereadCounts进行标准化和滴定肿瘤读数,并用 - 符号符号符号指定的PON。根据GATK的最佳实践(变体进一步筛选为0.2> af> 0.01),使用GATK功能收集的肿瘤中存在的肿瘤等位基计数。然后,使用GATK函数模型进行模型对段进行建模,并将其用作输入的DeNOCORIED拷贝比和肿瘤等位基因计数。然后使用GATK函数CallCopyRatioSegments在段区域调用片段的副本比率。

  BedTools66交叉点用于将拷贝数从段与基因映射,并分配称为扩增或缺失。对于重叠多个片段的基因,使用自定义的Python脚本将其称为基因,以放大,中性或删除为基于加权拷贝数比率,该基因是根据每个段重叠的副本比率计算得出的,重叠的长度,Z分数阈值由callCopopyPopyRatiaSosemgments使用。如果所得的z得分截止位于callcopyratiosegments使用的默认z得分阈值范围内(v.0.9,1.1),则使用默认z得分阈值的边界(复制callcopyrotatiosegments函数的逻辑)。

  为了将副本比率从片段到染色体臂映射,根据相同的方法使用了另一个脚本,然后将其称为染色体臂的放大,中性或删除。由于与CCRCC和GBM相关的读取深度增加,因此将用于GBM和CCRCC样品的PON专门由这些癌症的所有正常样品组成。用于所有其他癌症类型的PON都是从所有正常样品中汇编的,这些样本中的所有正常样品中的剩余癌症类型。

  为了处理测序的SC/SNRNA-seq样品,使用了10倍基因组学(具有计数功能)的细胞游侠(v.6.0.2)将读取与预构建的GRCH38基因组参考V.2020-A(Revdata-Gex-Gex-Gex-Gex-Gex-Gex-Gex-Gex-GERCH38-2020-A)对齐。R包Seurat(V.4.0.5)67使用了由此产生的基因唯一分子标识符(UMI)计数矩阵进行后续处理。配对样品必须来自与SNATAC-SEQ样品的同一组织,它们是由单核产生的,除了MM样品,这些样品是由单个细胞产生的。选择处理后的SC/SNRNA-SEQ样品,只要它们符合下面详细介绍的过滤标准即可。然后仔细评估每个样品中的Cellranger报告,我们包括没有严重错误或警告的样品。排除样品的错误示例如下:“错误:低分数读取自信地映射到转录组”或“错误:GEX读取映射到转录组为低”。此外,每个细胞的中位基因的中位基因少于700个(在某些情况下检测到大量细胞)的排除。

  将质量过滤器应用于数据以去除属于以下任何类别的条形码:表达过多的基因的可能碎片(<200) and too few UMIs (<1,000), possible more than one cell with too many genes expressed (>10,000) and too many UMIs (>80,000),可能的死细胞或细胞应激和凋亡的迹象,在总转录本计数(> 10%)中,线粒体基因表达的比例过高(> 10%),而细胞预测为双scrublet,如下所述。这些过滤器的截止值是基于Seurat软件包文档中的建议。

  使用Seurat的“ sctransform”函数(带有参数:vars.to.regress = c(“ ncount_rna”,“ percen.mito”),return.only.only.var.genes = f,以及所有其他设置为默认值)。由此,使用seurat runpca函数计算主组件。使用基于图的聚类(Seurat的默认)方法将细胞聚类。首先,我们根据主成分分析(PCA)空间中的欧几里得距离,将Seurat函数findneighbors用于嵌入细胞中的细胞,并在具有相似表达模式的细胞之间绘制边缘。我们将先前定义的第30个主要组件(PC)用作函数的输入,而其他参数则为默认值。对于群集单元,我们将模块化优化技术(使用默认的Louvain算法从Seurat函数Findclusters使用)一起迭代组细胞,以优化标准模块化函数。我们将分辨率参数设置为0.5,而其他参数则为默认值。

  为了删除双重组,使用Python封装scrublet V.0.2.3来识别从过滤的CellRanger基因by-by-cell umi计数矩阵中识别二倍体。使用scrublet函数(参数:Exceart_doublet_rate = 0.15),从0到1,000,000之间的随机整数设置为0和1,000,000的随机整数)。使用SCR.SCRUB_DOUBLETS函数为每个单独的样本分配了单个样本的单元格分数(参数:min_counts = 2,log_transform = true)。在最后的迭代中,scublet对象的随机_state设置为0。每次迭代后,使用kmeans对象(带有参数:n_clusters = 2,init = 2,k-means ++',n_init = 10,and max_iter = 10000)和fit_predict the pyn thepn pys s sklein vython s sklear v。然后将所有迭代后的双重和单线簇之间的边界进行平均,以确定最终的Doublet截止值,然后由scrublet call_doublets函数使用,以预测单元格的双重表格。

  使用seurat函数合并,在质量控制后合并了sc/snRNA-seq样品对象,生成了癌症 - 果实水平的对象。在合并之前,从所有样品中除去了注释为双重的条形码。合并后,使用seurat sctransform函数将对象进行标准化,该函数具有标准化单个对象时相同的参数。然后使用Findneighbors使用前50个PCA尺寸聚类,并使用FindNeighbors和FindClusters函数,分辨率= 0.5。为了生成合并的泛伴对象(包含所有细胞类型,肿瘤和选定的正常细胞类型;扩展数据图2C,F),从癌症队列对象开始采取相同的步骤。为了使所有细胞类型的合并对象为每个队列级群集随机采样600个单元,并将所得的单元组用于合并。所得合并对象的归一化矩阵用于后续分析。

  我们从文献中策划了众所周知的标记清单(补充表1E和补充图2C – F)。我们一次使用每种癌症类型的所有细胞的集成SC/SNRNA-SEQ数据,我们将标记基因过滤到至少一个至少一个簇中的5%中表达的标记基因。然后,我们通过检查所有群集中所有过滤标记基因表达的表达值和表达式(使用Seurat软件包的点函数)来标记每个群集。最后,我们还使用肌电表现结果验证了癌细胞类型的注释(补充图2B)。还使用已知标记的表达进行详细的正常上皮细胞注释(在补充表1E中,请参见正常胰腺细胞,正常结肠细胞和正常肾细胞标记)。首先,我们从PDAC,CRC和CCRCC癌症类型中分离出正常的上皮细胞,然后评估了跨簇(使用Seurat包装的点功能)的标记基因的表达。

  以前据报道,基底和非基质亚型的BRCA样品往往具有不同的染色质可及性景观6。为了用基础和非基质亚型注释样品,我们使用了两种方法。首先,我们使用SNRNA-SEQ数据检查了每个样品的PAM50基因68的表达(补充图7)。我们使用Seurat包装的点函数检查了这些标记的每样本表达,并在样品中找到了PAM50签名的特定表达式。接下来,我们基于SNATAC-SEQ数据的TF序列可访问性得分进行了相关分析。为此,我们使用了在每个样品的癌细胞中平均的TF评分(补充图8)。我们观察到两个样品簇,与基底和非基质亚型样品相对应。这两种正交方法产生了相似的结果,我们使用所得的注释将BRCA样品分为基础和非基质组。

  为了使用SC/SNRNA-SEQ数据检测大规模染色体CNV,将斜纹v.0.99.7用于10倍基因组学数据建议的默认参数。地渗出量是在样本级别运行的,并且仅使用原始计数矩阵使用后质量对照后的滤波数据进行运行。为了在所有样品上运行校正,有必要在bash环境中设置“ Ulimit -s无限”,然后在R中定义选项(表达式= 500000),对于特定的样品,当隐藏的Markov模型不融合这些更改时,必须使用suckcnv v.1.11.2。完成后,从隐藏的马尔可夫模型输出中收集了单元格中所有基因的复制比率。对于SNATAC-SEQ USWNV,使用细胞矩阵过滤的基因可访问性以与SC/SNRNA-SEQ调用相同的方式运行。

  我们应用了一个名为SCVARSCAN的内部工具,该工具可以通过在SC/SNRNA-SEQ BAM文件中追踪细胞和分子条形码信息来识别支持参考等位基因和变体等位基因,覆盖每个单个单元中的变体位点。该工具可在GitHub(https://github.com/ding-lab/10xmapping)上免费获得。对于映射,我们使用了WES数据中的高信心躯体突变。

  为了执行差异表达分析,我们使用默认的Wilcoxon Rank-sum测试中的Seurat软件包中的Findmarkers函数。对于所有DEG分析,我们使用了一个包含所有癌症的正常和癌细胞的合并对象。首先,为了鉴定组织和癌细胞特异性DEG,我们将每种肿瘤类型的癌细胞与所有其他肿瘤的癌细胞组合组合进行了比较。我们指定了以下参数:min.pct = 0.1,min.diff.pct = 0,logfc.threshold = 0 and唯一的pos = t。接下来,为了鉴定癌细胞特异性DEG,我们对癌细胞与它们最接近的正常细胞类型(图1C)进行了比较。为了进行此分析,我们仅使用了原发性肿瘤的癌细胞。此外,我们指定了以下参数:min.pct = 0.05,min.diff.pct = 0,logfc.threshold = 0 and唯一。pos= f。最后,为了鉴定与转移相关的DEG,我们比较了原发性肿瘤的肿瘤细胞与分析中四个同类群的转移性肿瘤的肿瘤细胞(CRC,PDAC,SKCM和UCEC)。指定了以下参数:min.pct = 0.1,min.diff.pct = 0,logfc.threshold = 0 and唯一的pos = f。对于所有DEG分析,使用每次比较中的所有基因都应用Bonferroni校正进行P值调整,如果它们具有调整后的P,则认为DEG被认为是显着的< 0.05.

  To infer gene regulatory networks, we used the SCENIC pipeline pySCENIC command line interface version (v.0.11.2)35. We ran SCENIC on an SCT-normalized assay of sampled sc/snRNA-seq merged object, 200 cells sampled randomly per cell type of each sample. For the first step we used the GRNBoost2 method, as it is suggested for large scale datasets. For the input, we provided the list of unique TFs that are present in JASPAR2020 db69. Steps 2 and 3 of regulon prediction were run with the default parameters using the RcisTarget hg38_refseq-r80 v.9 gene-motif ranking databases (10 kb around the TSS, and 500 bp around the TSS). As the first step of the SCENIC pipeline is using a stochastic gradient boosting algorithm, it is suggested by the developers to run it multiple times and to filter TFs and their targets to those that were reported across multiple iterations70. Consequently, we ran the SCENIC pipeline ten times and filtered TFs and their targets to those that appeared in at least 80% of SCENIC runs. We then recalculated AUC scores for the resulting regulons using the AUCell (v.1.19.1) R package. Finally, we filtered regulons to those that contain at least 20 target genes. By using this approach, we were able to obtain a more stable set of regulons that are active in our dataset.

  To prioritize regulons for Fig. 3a, we first conducted a differential analysis comparing neoplastic cells from each cancer versus neoplastic cells from all other cancers (primary tumours only), using regulons’ AUC scores. For this, we used a two-sided Wilcoxon test, and the resulting P values were adjusted using the Benjamini–Hochberg FDR method. We selected regulons that were both significant (FDR < 0.05) and that also met the following criteria: fold change between the two groups of greater than 1.5 and the mean score in cell group 1 exceeding the median of mean scores across all cell groups. The top 10 such regulons with the highest fold change were selected in each cancer type. If there were less than 10 regulons that passed these criteria, then all regulons were taken for that cancer type. We also added the following regulon–cancer pairs that were supported by comparing TF scores for the same cells’ group analysis using snATAC-seq data: KLF6 in PDAC, NRF1 in GBM, RARA in BRCA (non-basal), MXI1 in ccRCC, E2F7 in GBM and ELF3 in PDAC. Next, to annotate regulons as tissue- or cancer-cell-specific, we performed differential regulon analysis between tumour cells and their CNCs for each cancer type. For this, we used a two-sided Wilcoxon test, and the resulting P values were adjusted using the Benjamini–Hochberg FDR method. The regulon was annotated as cancer-cell-specific if FDR < 0.05, the difference in scores between these groups was >0.01 and log2[FC] >0.1。

  为了分析跨调节基因靶标的途径活性(扩展数据图6E),我们使用了每个组织和癌症特异性调节剂的基因组,这些基因也是同一癌症中的DEGS(补充表2B)。然后,我们使用JACCARD指数计算了途径活性分数,从HALLMARK MSIGDB71途径中产生的调节靶标和基因集。我们使用FGSEA R软件包的高几何测试进一步进行了过度代表分析。使用Benjamini – Hochberg FDR校正调整P值。

  为了处理测序的SNATAC-SEQ和SNMUTIOME-SEQ数据,我们分别使用了CellRanger-ATAC计数(V.2.0,10x Genomics)和CellRanger-ARC计数(V.2.0,10x Genomics)管道。这些管道过滤和映射SNATAC-SEQ读取并识别转座酶切割位点,并且CellRanger-Arc管道还执行SNRNA-Seq读取的过滤和对齐。GRCH38人参考用于读取映射(refdata-cellranger-arc-arc-grch38-2020-A-2.0.0)。由于低snRNA-seq质量,某些Snmultiome-seq样品的SNATAC-SEQ部分由修改版本的CellRanger-ATAC v.2.0分别运行,该版本的ATAC细胞条形码替换为Snmultiome-Seq-Seq Barodes。特别是,将snmultiome-seq条形码文件cellranger-arc-2.0.0/lib/python/python/atac/barcodes/737K-ARC-V1.TXT复制到cellranger-atac目录目录cellranger-atac-atac-atac-atac-atac-atac-atac-atac-2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0. and lib/python/barcodes/barcodes/rent-7373737373。仔细评估了每个样品中的Cellranger报告,我们排除了几乎没有错误的样品,除了“单元格数量过高”错误,同时保留没有错误或仅警告的样品。我们删除样品的错误示例如下:“细胞中的ATAC高质量片段很低”,“ ATAC TSS富集很低”和“峰值中的ATAC片段较低”。

  为了在SNATAC-SEQ数据(来自常规的SNATAC-SEQ和SNMULTIOME-SEQ)上调用峰值,我们通过MACS2工具(v.2.2.7.1)72通过Signac软件包的CallPeaks功能(v.1.3.0,https://github.com/timoast/signac)。我们进一步从Y染色体以及包含“ N”的基因组区域的重叠基因组区域中取出了峰。所有峰都调整为以MacS2定义的峰顶峰值峰值的501 bp。接下来,我们执行了先前描述的迭代去除程序,以获取一组非重叠峰。简而言之,我们从MacS2峰得分(-log10 [Q])保留最显着的峰值开始,从而消除了与其直接重叠的所有峰。我们重复此过程的剩余峰,直到拥有一组非重叠峰为止。所得的样品峰集用于使用Signac软件包中的特征ematrix来计算峰值计数矩阵,该软件包也用于下游分析。

  SNATAC-SEQ数据集的质量控制过滤是使用Signac软件包中的功能进行的。用于电池调用的过滤器包括:1,000< number of fragments in peaks < 20,000; percentage of reads in peaks > 15; ENCODE blacklist regions percentage < 0.05 (https://www.encodeproject.org/annotations/ENCSR636HFF/); nucleosome banding pattern score < 5; and enrichment-score for Tn5-integration events at transcriptional start sites > 2. Open chromatin regions were annotated with the R package ChIPseeker (v.1.26.2)73 using transcript database TxDb.Hsapiens.UCSC.hg38.knownGene. The promoter region was specified (−1000,100) relative to the TSS.

  The filtered peak-count matrix was normalized using term frequency-inverse document frequency (TF-IDF) normalization implemented in the Signac package. This procedure normalizes across cells, accounting for differences in coverage across them and across peaks, giving higher values to the rarer peaks. All peaks were used as features for dimensional reduction. We used the RunSVD Signac function to perform singular value decomposition on the normalized TF-IDF matrix, a method that is also known as latent semantic indexing (LSI) dimension reduction. The resulting 2:30 LSI components were used for nonlinear dimensionality reduction using the RunUMAP function from the Seurat package. The nuclei were clustered using a graph-based clustering approach implemented in Seurat. First, we used the Seurat function FindNeighbors to construct a shared nearest neighbour graph using the 2:30 LSI components. We next used the FindClusters function to iteratively group nuclei together while optimizing modularity using the Louvain algorithm.

  For snMultiome-seq data containing profiles of both snRNA- and snATAC-seq data, we first performed separate processing and filtering of cells using the same steps as were described for the processing of separate sc/snRNA-seq and snATAC-seq assays. To obtain the final list of barcodes, we retained the cells that passed the quality control filters in both the snRNA- and snATAC-seq assays. In the result, we obtained filtered gene- and peak-count matrices for the same set of cells. We then performed TF-IDF normalization of the peak-count matrix, followed by LSI dimensionality reduction using the RunTFIDF and RunSVD Signac functions. For normalization and dimensionality reduction of the gene-count matrix, we used the SCTransform and RunPCA functions of Seurat with the same parameters as used for regular sc/snRNA-seq data processing.

  We next computed the weighted nearest neighbour (WNN) graph with the FindMultiModalNeighbors function using both data modalities. We used 1:30 PCA components from snRNA-seq and 2:30 LSI components from snATAC-seq for this analysis. We performed nonlinear dimensionality reduction of the resulting WNN graph using the RunUMAP function of Seurat. Finally, we obtained clusters with the FindClusters function using the WNN graph, setting the argument algorithm = 3 (SLM).

  To identify doublets in snATAC-seq data, we used the Python package Scrublet v.0.2.3 on the filtered cellranger peak-by-cell UMI count matrix. The processing steps were the same as for doublet identification in sc/snRNA-seq. To assign doublets for snMultiome-seq barcodes, we performed doublet identification separately on the filtered CellRanger peak-by-cell and gene-by-cell UMI count matrices. We annotated a barcode as a doublet if it was identified as a doublet by using both assays.

  To create snATAC-seq cohort-level merged objects, functions from the Signac and Seurat packages were used. To normalize peak significance scores across samples and cancers, we converted MACS2 peak scores (−log10[q]) to a score per million as described previously6. To get the set of peaks for merging, we first combined peaks from all of the samples for each cohort separately. For overlapping peaks in each cohort, we performed an iterative removal procedure, the same as was used for creating individual sample peak sets, using normalized peak scores as described above. Using this procedure, we obtained the cancer-type-level peak sets. To gain the pan-cancer set of non-overlapping peaks, we renormalized peak scores using the score per million normalization procedure described above and performed the same iterative removal procedure for the combined cohort-level peak set from all 11 cancer types. The resulting list of pan-cancer peaks was quantified in each cohort using the FeatureMatrix Signac function, so that the resulting peak-cell matrices had the same set of features in all of the samples processed.

  To merge snATAC-seq datasets, the merge function of the Seurat package was used. We next performed TF-IDF normalization and LSI-dimensionality reduction using the RunSVD function from the Signac package. Non-linear dimensionality reduction was performed using the RunUMAP function with 2:50 LSI components. For analysis involving CNC, we also created two additional merged objects (HNSCC-CESC/AD and UCEC/OV), so that they contain CNC for the HNSCC and OV cohorts, respectively.

  For the analysis involving comparisons between cancers, we aimed to create a pan-cancer-level merged object. To reduce the computational complexity, we subsetted tumour and selected normal cell types for each cohort that can be the putative cell-of-origin: luminal mature and luminal/basal progenitors for BRCA; oligodendrocytes, OPC and astrocytes for GBM; acinar and islet for PDAC; ciliated and secretory endometrial epithelial cells from UCEC; and other normal epithelial cells from all cohorts where they were available, and normal B cells from the MM cohort (Extended Data Fig. 2b,e). We further randomly sampled 1,000 cells for each cohort-level cluster for this. We next used a merge procedure, followed by TF-IDF normalization and LSI dimensionality reduction using Seurat and Signac package functions. For nonlinear dimension reduction with the RunUMAP function, we used 2:150 LSI components. The resulting merged object normalized peak by cell matrix was used in the pan-cancer analysis and in the analysis of TF motif accessibility differences (Extended Data Fig. 2b,e). We also made another merged object to compare chromatin-accessibility profiles across broad cell groups. To make this object, 600 cells were randomly sampled for each cohort-level cluster, and the resulting set of cells was used for merging, applying the same processing steps as described for the processing of the first pan-cancer merged object (Extended Data Fig. 2a,d).

  For snMultiome-seq samples, cell labels were taken directly from snRNA-seq sample annotations. For regular snATAC-seq, the cell types of samples were first annotated with cell type label transfer using functions from Signac and Seurat. First, we quantified chromatin accessibility associated with each gene by summing the reads overlapping the gene body and its upstream region of 2 kb, therefore creating the gene by cell matrix. Coordinates for the genes were used from the Ensembl database v.86 (EnsDb.Hsapiens.v86 package). We next performed log-normalization of the resulting matrices using the NormalizeData function. The integration of paired snATAC-seq and sc/snRNA-seq datasets was performed using the FindTransferAnchors function with the canonical correlation analysis option for the dimensionality reduction. We then used the TransferData function to transfer cell type labels from the sc/snRNA-seq dataset to the snATAC-seq dataset using the obtained set of anchors from the previous step. The cell types were then re-evaluated at the cancer-type-merged object level, where, for each cluster, the cell type label was assigned by the most abundant cell type in that cluster. Cancer cell type annotation was also validated using inferCNV results (Supplementary Fig. 2b). Detailed normal epithelial cell type annotation was performed in sc/snRNA-seq space first. Then, for snMultiome-seq samples, cell labels were directly taken from snRNA-seq annotation and, for regular snATAC-seq samples, cell types were annotated with cell type label transfer using functions from Signac and Seurat.

  We set out to determine the CNCs for 10 cancer types (all except MM due to the lack of resolution, see below) that contained sufficient numbers of cells from 2 to 7 normal tissue cell types per cancer type. We divided the BRCA cohort samples based on basal versus non-basal subtypes as the two subtypes were reported to have different cells of origin74,75. To determine a CNC for each cancer, we did the following. We used a combined set of tissue- and cancer-specific DACRs (Supplementary Table 2a) or DEGs (Supplementary Table 2b) for snATAC-seq- and snRNA-seq-based calculations, respectively. For the resulting sets of genes and open chromatin regions, we calculated the average expression or accessibility for the pooled cells from the selected normal (potential cell of origin) cell types (Fig. 1c and Extended Data Fig. 4a), and cancer cells from each sample separately. We next calculated the Pearson correlation coefficient between each tumour and selected normal cell types from its tissue. Both data types produced similar patterns (Fig. 1c and Extended Data Fig. 4a), and we considered the cell type to be a CNC if it had a higher median of correlation coefficients across tumour samples based on snATAC-seq data (Supplementary Table 2d).

  On the basis of the above analysis, we defined CNCs for 10 cancers, and they were consistent with those reported in the previous studies: luminal mature cells for BRCA of non-basal subtypes74; luminal progenitor cells for BRCA of basal subtype74,75; ductal-like-2 cells for PDAC13,76,77,78,79; distal stem cells for CRC80; secretory endometrial epithelial cells for UCEC81 and OV; normal squamous cells for HNSCC and CESC; melanocytes for SKCM; proximal tubule cells for ccRCC82,83; and OPCs for GBM84,85. For MM, we used normal B cells as the CNC17. B cells are believed to acquire the initial CNV and structural variants during the class-switch recombination and somatic hypermutation process in a germinal centre. These abnormal B cells are believed to further differentiate into plasma cells and give rise to MM17. We have significantly more B cells than normal plasma cells in our dataset and, as the initial tumorigenic events seem to occur in B cells, they were used as the CNC.

  To perform analysis of differentially DACRs, we used the FindMarkers function of the Signac package (v.1.3) with logistic regression and the fraction of fragments in peaks used as a latent variable to reduce the effect of different sequencing depths across cells. P-value adjustment was performed using Bonferroni correction using all peaks in the dataset. We used the same groups of cells that were used for the identification of DEGs in the respective comparisons. To calculate the fold change for all DACRs, we used an improved version of the FoldChange function in the Signac package (v.1.8).

  First, to identify tissue- and cancer-cell-specific DACRs, we compared cancer cells from each tumour type to the combined set of cancer cells from all other tumours, using the merged pan-cancer object with cancer and selected normal cells. The following additional parameters were specified for the FindMarkers function: min.pct=0.1, min.diff.pct=0, logfc.threshold=0 and only.pos=T. ACRs that had inconsistent fold change direction between Signac v.1.3 and v.1.8 (n = 38) were removed.

  Next, to identify cancer cell-specific DACRs (Fig. 1d and Extended Data Fig. 4b), we compared cancer cells from primary tumours with their closest normal cell types (CNCs; Fig. 1c) for each cancer using cohort-level merged objects for 9 out of 11 cancers for which CNCs were available. For HNSCC, we used the CESC/AD-HNSCC merged object that has normal squamous cells for this comparison (CNC for HNSCC) and, for OV cancer, we used the merged UCEC-OV object that had secretory endometrial epithelial cells (CNC for OV). Furthermore, we specified the following parameters for the FindMarkers function: min.pct=0.05, min.diff.pct=0, logfc.threshold=0 and only.pos=F. Furthermore, we wanted to exclude DACRs that were probably affected by CNVs. For this, we annotated DACRs with their closest genes (using the ChIPseeker package, as described above) and then calculated CNV scores for those genes using inferCNV results. CNV scores were calculated as the fraction of cancer cells per cohort that had that gene amplified or deleted (AMP and DEL scores, respectively). We then filtered DACRs using the following criteria: with log2[FC] >0如果AMP> 0.25,对于具有Log2 [FC]的DACR< 0 if DEL >0.25。

  最后,为了鉴定与转移相关的DACRS,我们使用队列合并的对象比较了来自4个队列的转移性肿瘤(CRC,PDAC,SKCM和UCEC)的转移性肿瘤的癌细胞。指定了以下参数:min.pct = 0.01,min.diff.pct = 0,logfc.threshold = 0 and唯一的pos = f。为了选择用于绘图的DACR(扩展数据图8A),我们还计算了来自每个转移性肿瘤的癌细胞之间的样品级折叠变化,并从同一癌症的所有原发性肿瘤中汇集了癌细胞。我们进一步优先考虑前200个DACR,首先是带正变化的转移样品的最高分数,然后是在样品中的平均折叠变化。

  为了计算DACRS的途径活性(图4D和扩展数据图4F),我们使用了来自MSIGDB71的标志性基因集与癌症相关的途径。我们使用了来自FGSEA软件包(V.1.24.0)的过度代表分析函数FORA,以作为检测到的所有统一峰进行宇宙集进行超几何测试。选择了每种癌症类型的倍数变化的DACR,然后进一步过滤以在癌症类型中至少50%的样本中具有正倍变化。为了确保癌症类型之间的平衡比较,在图4F中,分析中使用的DACR数量为每种癌症的前1,000个,而图4D中每种癌症的前5,000个限制在图4D中,根据折叠变化。用每个基因集测试了来自癌症类型的DACR的每个列表,并计算了FDR以进行多测试校正。在相应的气泡图中报道了每个测试的FDR和测试中与DACR相关的基因数量。在条形图中报道了与癌症类型基因中的任何基因相关的DACR的总数。

  对于SNATAC-SEQ覆盖范围,我们使用了Signac软件包中的CoveragePlot函数。

  为了评估SNATAC-SEQ数据中的TF结合可访问性配置文件,我们使用了Chromvar工具(V.1.12.0)86,该工具计算了相对于平均单元格的每个TF基序的增益或访问性的损失,该工具计算出偏置校正偏差(TF MOTIF分数)。我们使用默认参数和JASPAR2020数据库的包装器包装函数运行Chromvar。使用MotifMatchr r软件包对TF图案进行映射到DACR。为了识别SNATAC-SEQ数据细胞组之间具有差异活性的TF,我们在测定中为整个TFS使用了双面Wilcoxon Rank-sum测试,随后将FDR校正应用于所得的P值。

  我们使用Chromvar评分进行了以下比较对差异可访问的TF基序(DAM)进行了分析:来自每个癌症队列的原代癌细胞与所有其他肿瘤的汇集的原代癌细胞(补充表4D),原发性癌细胞与各自的CNC(补充表4E)(补充表4E)和与原发性癌细胞相比。对于所有大坝分析,我们使用了在包含癌症和选定正常细胞的合并泛伴随物体上计算出的分数。为了执行大坝分析,我们在相应组之间使用了两侧Wilcoxon Rank-sum测试,随后将FDR校正应用于所得的P值。对于转移特异性TF,我们还使用了从Scesic(基于SC/SNRNA-SEQ数据)获得的差异调节的结果。使用相同细胞组之间的双面Wilcoxon秩和测试计算差异调节,然后应用FDR校正。我们仅选择了那些重要的TF(FDR< 0.05) in both the DAM and the regulon analysis, and also required them to have a score change in the same direction between the same cell groups. Finally, we calculated the expression score as the absolute value of expression log2[FC] between the metastatic and primary cancer cells using per sample average values, also requiring the same fold change direction as the direction of the score difference for the same TF (Fig. 4a).

  Open chromatin regions were annotated with cis-regulatory elements from the geneHancer Regulatory Elements Elite list for hg3887 from the Genome USCS browser (last updated version, 2 September 2018). Genomic regions of geneHancer enhancers and promoter/enhancers were overlapped with a minimum overlap of 400 bp using the findOverlaps function from the IRanges R package. We also downloaded scEnhancer enhancers from all tissues and overlapped them with open chromatin regions the same way. Moreover, we downloaded interactions between GeneHancer regulatory elements and target genes from geneHancer Interactions Double Elite list (last updated version, 15 January 2019). Region-to-gene links were then annotated by presence in the geneHancer Interactions Double Elite list.

  First, we identified each JASPAR2020 TF motif in every open chromatin region using the function matchMotifs from the motifmatchr (v.1.12.0) R-package. We next downloaded ENCODE ChIP–seq hg38 bed files for all available TFs (download date, 28 January 2022). We then overlapped TF-binding ChIP–seq regions with the corresponding TF motif coordinates in our chromatin regions set using the findOverlaps function from the IRanges R package with minimum overlap equal to the length of the motif. If a given TF motif fully overlapped with a ChIP–seq-based binding region of the same TF, then we labelled this motif as being supported by ChIP–seq data.

  We collected the sc/snATAC-seq or bulk ATAC-seq studies with relevant differentially expressed TF analysis from published literature, including BRCA88, MM89, ccRCC90, PDAC91, pan-organ chromatin accessibility8 and the bulk ATAC-seq study in human cancers6. For BRCA, epithelial cells were compared to endothelial cells, fibroblast or immune cells, and the upregulated TFs were identified. In MM, fold change expression of TFs between myeloma and plasma cells was calculated to identify upregulated TFs in myeloma cells. For ccRCC and PDAC, the fold change expression of cancer-specific TFs between cancer cells and normal cells was calculated to identify upregulated TFs in either cancer cells (cancer-specific TF) or normal cells (tissue-specific TF).

  Moreover, the sc/snATAC-seq dataset from the pan-organ chromatin accessibility study was used to confirm cancer and tissue-specific TFs. The relevant cell types were annotated with their relevant disease code as the CNCs of the cancer. For example, mammary luminal epithelial cells to non-basal BRCA, mammary basal epithelial to basal BRCA, keratinocyte to HNSCC, colon epithelial cells to CRC, enterocyte to CRC, colon goblet to CRC, SI goblet to CRC, melanocyte to SKCM, acinar to PDAC, ductal cell to PDAC, astrocyte to GBM, oligodendrocyte to GBM, oligo precursor to GBM and plasma cells to MM. The tissue and cancer cell-specific TFs highly expressed in the corresponding cell populations were then identified.

  Finally, the bulk ATAC-seq dataset from human cancers6 was used to determine whether the tissue and cancer-specific TFs are the markers for the corresponding chromatin-accessibility-driven clusters. The clusters were annotated as follows: Cluster 1 to ccRCC, cluster 2 to CRC, cluster 3 to BRCA, cluster 5 to GBM, cluster 7 to SKCM, cluster 8 to CESC, cluster 14 to BRCA, and cluster 15 to UCEC.

  To process the CUT&RUN reads, we first performed quality control using FastQC to assess read quality (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). We next used Trimmomatic92. The resulting trimmed reads were subsequently mapped to the human reference genome (GRCh38.d1.vd1.fa.tar.gz) using Bowtie293 with the dovetail setting. Finally, to eliminate any duplicated reads, the aligned reads were processed for duplicate removal using Picard (http://broadinstitute.github.io/picard/). To call peaks, we used MACS2 using an IgG BAM file as a control. Then, for resulting peaks, we applied the same filtering steps as for the peak calling on snATAC-seq data.

  To comprehensively analyse the TF-specific ChIP–seq datasets from ENCODE, we used SCENIC to obtain the list of target genes for 53 tissue- and cancer-specific TFs with ChIP–seq datasets available. Subsequently, we extracted and aggregated the ChIP–seq peaks from multiple biosamples (Supplementary Table 5b) using the readPeakFile function from ChIPseeker and determined overlaps with the promoter regions of the target genes (5 kb upstream and downstream of TSSs) using the makeBioRegionFromGranges and getTagMatrix functions from ChIPseeker (Supplementary Table 5c). To annotate all of the peaks and regions, we used TxDb.Hsapiens.UCSC.hg38.knownGene in Ensembl style. We also overlapped the snATAC-seq peaks and CUT&RUN peaks with target genes using the aforementioned methods. Finally, we visualized the average ChIP–seq, snATAC-seq and CUT&RUN signals around the TSSs of the target genes using the plotAvgProf function from ChIPseeker, allowing for a comprehensive understanding of the regulatory landscape of the TFs across various tissues, cell lines and cancer types.

  We compared the peak coordinates from bulk ATAC-seq6 with the peak coordinates from our snATAC-seq data in eight cancer types, namely UCEC, ccRCC, GBM, BRCA, CESC/AD, CRC, SKCM and HNSCC. For each cancer cohort, common open chromatin regions were defined as the ones that had overlaps of at least 50 bp (overlap >10%)从批量ATAC-SEQ研究和所有其他区域进行任何开放染色质区域(无重叠或重叠 < 10%) were identified as snATAC-seq-specific open chromatin regions.

  To generate the bar plot showing the snATAC-seq specific peaks that were found in multiple cell types, we first used the AccessiblePeaks function from the Signac package94 to identify accessible regions in each cell type of the eight cancer types. We next categorized the identified peaks into two groups on the basis of whether they appeared in only one cell type, such as tumour cells, or whether they appeared in more than one cell type. We then used the BEDtools66 intersect function to compare the peak coordinates of snATAC-seq-specific peaks with these two cell type groups of peaks. This analysis enabled us to identify the snATAC-seq-specific peaks that were accessible in multiple cell types, which we then used to generate the bar plot.

  To overlap snATAC-seq and ChIP–seq peaks, we used the BEDtools66 intersect function to compare the peak coordinates obtained from ChIP–seq95 with our snATAC-seq-specific, bulk-ATAC-seq-specific and bulk/snATAC-seq overlapping peaks. This analysis was conducted separately for each of the eight cancer types to determine the extent to which the identified snATAC-seq peaks represented true signals rather than noise.

  To investigate whether our snATAC-seq-specific peaks were recurrently observed in other datasets, we compared our peak sets with the cell atlas of fetal chromatin accessibility9. We downloaded the master list of sites (GSE149683_File_S1.Master_list_of_sites.txt) and converted the genomic coordinates from hg19 to hg38 using the UCSC liftOver tool. We then converted the bed files to GRanges objects and used the findOverlaps function of GRanges to determine the overlaps between our snATAC-seq-specific peaks and the fetal chromatin accessibility peaks.

  We applied the linkPeaks function from the Signac R package (v.1.8.0)94,96 on tumour cells with snMultiome-seq data (snRNA-seq and snATAC-seq measured in the same cell). only open chromatin regions located within 500 kb of a gene TSS were considered. links were considered to be significant with a correlation value r >0.05和p< 0.05. Furthermore, we followed an established procedure6 to account for diffuse correlations. Diffuse correlations occur in genomic regions in which chromatin accessibility is generally high, whereby the gene expression is increased. It does not necessarily relate to an increased accessibility of cis-regulatory elements. To account for diffuse correlations, we divided each chromosome into 100 kb windows, quantified accessibility of these regions and correlated this accessibility with expression of genes of which the TSS is within 500 kb. As diffuse 100 kb windows are significantly larger than peaks (501 bp), they have accessibility coverage in more cells and, consequently, receive higher correlation values on average. To mitigate these differences, we z-scored correlation values for both open chromatin regions and 100 kb windows. We then compared z-scored correlation values and retained only those predicted region-to-gene links that had higher z-scored correlation values than the 100 kb window that they belong to. Moreover, we reasoned that copy-number changes are strong drivers of gene expression patterns, so it is important to account for them. We used the results of inferCNV on RNA-assay for snMultiome-seq samples and on ATAC-assay for regular snATAC-seq samples to quantify the number of cells with a gain or amplification of each gene and excluded those genes that are frequently amplified (in >25%的癌细胞和> 2,000个细胞)。例如,根据这些阈值,ELF3基因在PDAC中被排除,但保留在BRCA,CRC,HNSCC,OV和UCEC中。

  为了确定与癌症从正常细胞到原发性肿瘤的过渡有关的联系(图2D和扩展数据图5G),我们需要以下内容:(1)链接中的ACR要明显更易于访问(Log2 [fc]> 0.5,Wilcoxon Rank-SUM Test FDR FDR FDR。< 0.05) and (2) the gene to be significantly upregulated (log2[FC] >0.25,在原代癌细胞中与CNC相比,Wilcoxon rank-sum测试FDR <0.05)。

  我们根据以下想法计算了与含有TF基序相关的基因的富集,即如果一组基因确实受TF调节,那么我们应该能够识别含TF-MOTIF的基因组区域,该基因组的访问与这些基因的表达相关。工作模型基于TF活性的生物学。也就是说,TF与靶基因附近的结合位点结合,表明该区域的可及性,并随后刺激该基因的表达。可访问性与基因表达之间的这种关系应通过与我们上面所示的相似方式来检测。为了测试通过风景分析鉴定的TF靶基因是否显着丰富了与该TF基序的可访问性相关的基因,我们首先试图通过执行500个随机挑选基因的500个调节的基因– ACR连锁的随机背景速率来表征n是n是n是构型基因的数量knot crib crib segif in norks norks knem knot knem knot的基因链接的发生率。随机基因仅限于每种癌症类型中表达的基因,因此每种癌症类型都不同。零假设被视为对与TF基序相关的基因数量的期望,即平均值。假定基因– Acr计数是正态分布的,然后我们使用了一个高斯曲线μ=平均值(k)和σ= s.d.(k)进行测试每个相应的调节,计算z =(m-μ)/σ,其中m是观察到的目标基因的靶基因链接到TF基序的数量,并将其转换为一个单点p值。出于可视化目的,我们计算了折叠更改。使用MotifMatchr r软件包中的createMotifMatrix函数鉴定了ACR中的主题。

  为了找到癌症驱动器基因中的ACR到基因链接,我们通过以前的研究过滤了Oncogenes的链接49。为了可视化EGFR在BRCA基础,CESC和HNSCC癌症中的可及性和基因表达,我们仅使用GATK Pipeline将WES的EGFR Copy-number中性调用包括了样本(请参见上文)。除一个HNSCC样本外,所有样本还基于SNRNA-SEQ数据结果,还具有中性的推断EGFR的Copy-number呼叫。该HNSCC样本具有CASE ID P5514的样本,与基于WES的EGFR-COPY-NUMBER中性病例没有显着的EGFR表达差异(Log2 [Fc] = 0.08,Wilcoxon Rank-SUM测试P值= 0.09)。然而,与EGFR拷贝数中性病例相比,所有基于WES的EGFR放大病例均显示EGFR表达的显着上调(p5504,log2 [fc] = 0.13,p = 3.7×10-6; p5216; p5216; p5216,log2 [fc] = 2.2,p = 2.2,p = 1.7×10-39; p5379; p5379,log2 [fc],p5576,log2 [fc] = 3.9,p = 9.3×10-61)。由于GATK管道在P5514中没有称为EGFR复制数增益,在这种情况下我们没有观察到EGFR表达的上调,因此我们将其包括在图5C中。

  为了检测样品中的HPV读取,我们遵循了一系列步骤。首先,我们为已知的HPV基因型构建了一个基因组数据库。接下来,我们从与人类基因组不符的snRNA-seq bam文件中提取了未映射的读物。然后,我们使用BWA55与构造的病毒基因组数据库对齐这些未映射的读数。最后,我们从对齐结果中确定了HPV读取。可以在GitHub存储库(https://github.com/ding-lab/virusscan/tree/simplified)上找到此过程的详细源代码。

  通过CBIOPORTAL(https://www.cbioportal.org/)获得TCGA样品的RNA-SEQ表达数据,以及TCGA Pan-Cancer临床数据资源(TCGA-CDR)的临床信息97。本研究产生的调节(请参见“使用风景秀丽的基因调节网络分析”)根据TCGA同类(HNSCC,GBM,GBM,GRED,READ,COAD和PAAD)的样品的批量RNA-seq表达数据来计算调节活性。

  根据调节活性评分进行样本分组:那些比中位数高于“高组”的分数,而得分≤Median作为“低组”。然后使用生存(v.3.2.7)和生存者(V.0.4.9)r包计算无进展生存/总生存期和Kaplan-Meier曲线的存活概率。我们还进行了COX比例危害模型,以辨别最显着和独立影响患者生存的调节。在调整年龄,性别和HPV状态后,从Kaplan – Meier曲线中鉴定出的重要规范以确定其对生存的独特贡献。

  为了评估HPV+和HPV-HNSCC样品之间的调节活性差异,我们使用了Wilcoxon rank-sum测试进行比较来识别与HPV-STATUS相关的调节型变化。我们使用计算出的调节分数进一步验证了TCGA-HNSCC队列中与HPV-STATUS相关的调节(补充表9a)。

  在我们的队列中,我们患有UCEC和CRC癌症中的9例患者病例和转移性肿瘤。为了进行EMT分析,我们创建了病例级的SNATAC-SEQ对象,包括来自主要样本的癌细胞,来自转移性样品的癌细胞和主要样本的正常上皮细胞(如果有)。这些细胞被重新归一化并聚集,类似于上述方法(使用runumap和Findneighbors功能的2:30 LSI组件)。调整群集分辨率,以防止过度集群。对于UCEC案例CPT1541DU和CPT704DU,该分辨率为0.2,对于CPT2373DU和CPT4096DU,它设置为0.1,对于CPT4427DU,为0.3。对于CM1563C和CM663C的CRC病例,该分辨率为0.1,对于CM268C和CM618C,分辨率为0.2。很少有病例显示出少量的细胞簇,具有免疫标记的可及性增加,这表明它们可能是具有免疫细胞的双重动物。因此,它们被标记为其他,不包括在下游分析中。

  为了对九个配对的初级/转移性SNATAC-seq样品进行轨迹分析,我们使用了弹弓r包装(V.2.5.1),该弹弓示出了在大型轨迹推理基准98,99中实现了最高的性能轨迹推理方法。弹弓需要两个输入:维度降低数据和单元格的聚类。对于细胞的聚类,我们在所有情况下使用细胞类型注释(正常,原发性肿瘤和转移性肿瘤)作为输入,如果样品中没有正常细胞,则将正常簇指定为起始簇或原发性肿瘤。对于降低维度,我们使用了一种被监督的方法,称为群集分析(BCA),该方法除了基础数据外,还使用了细胞类型信息,以更好地保留细胞类型之间的关系,以更适合轨迹推断来减少数据。具体地说,让我们表示一个SNATAC-SEQ测量值,其中n是细胞的数量,P是峰的数量和细胞聚集到K细胞类型中。令我们表示群集CK的质心,并定义群间方差Vb为AS,其中称为群集散射之间。BCA的目的是找到一组r = k -1的正交轴,该轴最能保留群集方差之间。具体而言,BCA解决了以下优化问题。解决方案(最佳W*)由SB(X,C)的最大R特征向量给出,相应的BCA嵌入对应于YBCA = XW*。对于每种情况,我们都将弹弓作为BCA的前两个组成部分作为输入(实际上,我们发现这与使用所有k -1组件相反,这是最佳性能执行的)。由于集群间方差是一个有监督的统计量(需要细胞类型聚类的知识),因此它保留了细胞类型之间的关系 这对于轨迹推断是理想的。请注意,BCA仅适用于具有三个或多个簇的数据集,而在想要产生轨迹推理方法的可视化或输入。

  为了控制总读数的混杂作用,我们将其输入了BCA 2:50 LSI组件以及细胞类型注释(正常,原发性或转移性癌细胞)或情况级别的簇(扩展数据图8B)。当样品中存在正常细胞时,我们将BCA作为簇注释(k = 3)。否则,我们使用了案例级别的群集。此外,CASE CM1563C和CASE CPT2373DU仅具有两个细胞型簇(原发性和转移)和两个或更少的病例级别簇。因此,我们将弹弓作为CM1563C和CPT2373DU的2:50 LSI组件的输入,因为BCA仅返回这两个样品的一维嵌入。

  弹弓为每个细胞输出一个假频率(对基本生物进展的实数)。对于每种情况,我们使用Pearson的系数与从我们以前的分析获得的663 TF基序分数中的每个分数相关联,并使用Benjamini -Hochberg方法调整了每个相关的相应P值,以控制多个测试。我们使用了0.05的FDR阈值截止值(补充表7)。

  对于九个配对的原发性 - 转移样品(4个CRC和5个UCEC),我们确定了重要的途径,这些途径以跨原代和转移性癌细胞的DACR为特征。我们确定了两组不同的原发性 - 转移性DACR:与TF分数和区域显着相关的区域,这些区域与与仅包含初级和转移性细胞的轨迹分析中鉴定出的假频率显着相关。这里的理由是,假期和TF基序评分都与原发性肿瘤细胞的转移性进展有关。因此,与两种特征显着相关的区域可能在肿瘤细胞的转移中起作用。我们使用R Glmnet Package100中实现的LASSO回归来确定哪些峰与TF基序分数或假频率显着相关。也就是说,我们将假频率或TF基序分数用作响应变量,并将所有染色质区域分配为协变量。我们选择拉索的原因有两个,而不是线性回归或脊回归。首先,峰值比细胞多得多。因此,线性回归不能应用。其次,我们希望一组稀疏的峰更可能成为转移过程的一部分。我们认为,如果峰具有非零套索回归系数,则它们是显着相关的。我们通过使用cv.glment进行十倍的交叉验证来选择Lasso正则化参数lambda的价值,随后选择了所有样品中所有样本中的lambda,除了在CPT2373DU中,除了经验上更好地选择了最终的lambda sequeart lambda sequeart lambda lamb lamb lamb lamb lamb lamb lamb lamb lamb。

  使用Hallmark MSIGDB途径的数据库71收集了与假频率DACR相关的基因,以进行基因集过度代表分析(扩展数据图9H)。通过使用ClusterProfiler列出具有不同FDR范围的途径的高几何测试获得了重要的途径。我们对与转移涉及的TF的活性相关的DACR进行了相同的分析(补充图5)。

  有关研究设计的更多信息可在与本文有关的自然投资组合报告摘要中获得。

分享到
声明:本文为用户投稿或编译自英文资料,不代表本站观点和立场,转载时请务必注明文章作者和来源,不尊重原创的行为将受到本站的追责;转载稿件或作者投稿可能会经编辑修改或者补充,有异议可投诉至本站。

热文导读