赞
踩
TCGAbiolinks
的使用,但是由于
TCGA
的改版,该包也做出了相应的更新,所以我再重新介绍一下新版
TCGAbiolinks
的使用。
TCGAbiolinks
是一个利用 GDC API
接口来查询、下载和分析 TCGA
数据库的数据的 R
包
TCGAbiolinks
包的功能主要可以分为三大块:
该包可以从 Bioconductor
上安装稳定版本
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
或者从 GitHub
上安装开发版本
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
导入包
library(TCGAbiolinks) # version 2.30.4
TCGAbiolinks
提供了一些函数用于查询和下载 GDC
中的数据,包括:
Harmonized
:这部分数据都比较新,使用的是 GRCh38
(hg38
) 基因组版本,使用的是 GDC pipeline
来处理数据Legacy
:这部分的数据应该是较早之前测的,使用的是 GRCh37
(hg19
) 基因组版本使用 GDCquery
函数来查询 GDC
的数据,该函数的参数为:
GDCquery(
project,
data.category,
data.type,
workflow.type,
access,
platform,
barcode,
data.format,
experimental.strategy,
sample.type
)
project
:该参数的取值非常多,可以使用如下命令来查询所有可用的项目> TCGAbiolinks:::getGDCprojects()$project_id [1] "HCMI-CMDC" "TCGA-BRCA" "TARGET-ALL-P3" [4] "EXCEPTIONAL_RESPONDERS-ER" "CGCI-HTMCP-LC" "CPTAC-2" [7] "CMI-MBC" "TARGET-ALL-P2" "OHSU-CNL" [10] "TARGET-ALL-P1" "MMRF-COMMPASS" "ORGANOID-PANCREATIC" [13] "NCICCR-DLBCL" "TCGA-SARC" "TCGA-ACC" [16] "WCDT-MCRPC" "TCGA-UCEC" "MP2PRT-ALL" [19] "TCGA-KIRC" "CGCI-HTMCP-CC" "CMI-ASC" [22] "CGCI-HTMCP-DLBCL" "BEATAML1.0-CRENOLANIB" "CDDP_EAGLE-1" [25] "APOLLO-LUAD" "CMI-MPC" "FM-AD" [28] "MATCH-Z1D" "MATCH-Y" "MATCH-N" [31] "MATCH-Q" "MP2PRT-WT" "TCGA-LAML" [34] "VAREPOP-APOLLO" "TCGA-SKCM" "TRIO-CRU" [37] "TCGA-PAAD" "TCGA-TGCT" "TCGA-CESC" [40] "TCGA-ESCA" "TCGA-THCA" "TCGA-LIHC" [43] "TCGA-PRAD" "TCGA-READ" "MATCH-I" [46] "MATCH-W" "MATCH-B" "MATCH-H" [49] "TCGA-OV" "TCGA-UVM" "MATCH-Z1A" [52] "MATCH-U" "BEATAML1.0-COHORT" "TCGA-BLCA" [55] "CGCI-BLGSP" "CTSP-DLBCL1" "MATCH-S1" [58] "MATCH-R" "MATCH-Z1I" "CPTAC-3" [61] "TCGA-CHOL" "TCGA-GBM" "MATCH-S2" [64] "TCGA-UCS" "TCGA-PCPG" "TCGA-MESO" [67] "TARGET-CCSK" "TARGET-WT" "TARGET-RT" [70] "TCGA-DLBC" "TARGET-OS" "TCGA-COAD" [73] "REBC-THYR" "TCGA-STAD" "TCGA-KIRP" [76] "TCGA-THYM" "TCGA-KICH" "TCGA-LGG" [79] "TARGET-AML" "TCGA-LUSC" "TCGA-LUAD" [82] "TCGA-HNSC" "TARGET-NBL"
data.category
:可以使用如下方式来查询 TCGA-BRCA
项目的可用的分类数据> TCGAbiolinks:::getProjectSummary("TCGA-BRCA") $file_count [1] 61173 $data_categories file_count case_count data_category 1 17337 1098 Simple Nucleotide Variation 2 9281 1098 Sequencing Reads 3 5316 1098 Biospecimen 4 2288 1098 Clinical 5 12292 1098 Copy Number Variation 6 4876 1097 Transcriptome Profiling 7 3714 1097 DNA Methylation 8 919 881 Proteome Profiling 9 226 101 Somatic Structural Variation 10 4924 1095 Structural Variation $case_count [1] 1098 $file_size [1] 6.245362e+14
新版本只能获取 harmonized
类型的数据,主要包含 7
个分类:
- Biospecimen
- Clinical
- Copy Number Variation
- DNA Methylation
- Sequencing Reads
- Simple Nucleotide Variation
- Transcriptome Profiling
注意
legacy
类型的数据在这个版本中已经不支持了,因为GDC
很块将会把该类型数据关闭。
sample.type
:样本类型,取值可以是三列中的任意一个值至于其他参数,对于 harmonized
数据,其参数取值有
参数列表的详细取值,可以从下面的网址中下载得到:https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=2046985454
在这个例子中,我们在 harmonized
数据库中搜索复发性 GBM
样本的甲基化数据
query <- GDCquery(
project = c("TCGA-GBM"),
data.category = "DNA Methylation",
platform = c("Illumina Human Methylation 450"),
sample.type = "Recurrent Tumor"
)
data <- getResults(query)
getResults()
用于获取查询结果,可以设置 rows
和 cols
参数来设置返回的行和列
如果是使用 RStudio
,可以使用 View
(data
) 来查看数据,也可以使用 DT
包的 datatable
来展示数据
返回结果如下
从数据库中分别获取所有包含甲基化数据(450k
)和表达数据的乳腺癌患者信息
query.met <- GDCquery( project = "TCGA-BRCA", data.category = "DNA Methylation", platform = c("Illumina Human Methylation 450") ) query.exp <- GDCquery( project = "TCGA-BRCA", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts" ) # 获取既检测甲基化又检测表达的样本 common.patients <- intersect( substr(getResults(query.met, cols = "cases"), 1, 12), substr(getResults(query.exp, cols = "cases"), 1, 12) )
> length(common.patients)
[1] 789
共有 789
个样本,可以设置 barcode
参数,只选择部分样本的数据
selected.met <- GDCquery(
project = "TCGA-BRCA",
data.category = "DNA Methylation",
platform = c("Illumina Human Methylation 450"),
barcode = common.patients[1:5]
)
或者直接使用 matchedMetExp
获取同时检测两种类型数据的样本编号
> samples <- matchedMetExp("TCGA-BRCA", n = 10)
> samples
[1] "TCGA-AQ-A0Y5-01" "TCGA-C8-A274-01" "TCGA-B6-A1KC-01" "TCGA-AC-A62V-01" "TCGA-A2-A0YM-01"
[6] "TCGA-AO-A03N-01" "TCGA-AO-A1KQ-01" "TCGA-E2-A1LI-01" "TCGA-B6-A0WV-01" "TCGA-E2-A1LE-01"
从数据库中查询乳腺癌原始测序数据(bam
),这部分样本访问受限,一般人用不了。
query <- GDCquery(
project = "TCGA-BRCA",
data.category = "Sequencing Reads",
data.type = "Aligned Reads",
data.format = "bam",
workflow.type = "STAR 2-Pass Transcriptome"
# workflow.type = "STAR 2-Pass Genome"
# workflow.type = "STAR 2-Pass Chimeric"
# workflow.type = "BWA with Mark Duplicates and BQSR"
# workflow.type = "BWA-aln"
)
> getResults(query, rows = 1:6, cols = c("file_name","cases"))
file_name
1 0361ec21-e83f-40a9-b825-d92487b1239a.rna_seq.transcriptome.gdc_realn.bam
2 0a9e33db-2527-4cc3-8669-a7c10fed7a7f.rna_seq.transcriptome.gdc_realn.bam
3 65c868ba-94ed-43f2-bfd1-814979ee9486.rna_seq.transcriptome.gdc_realn.bam
4 09f906ba-fa4b-487f-9631-fbd97d9db8f9.rna_seq.transcriptome.gdc_realn.bam
5 8c92ecf0-5d81-48c3-ad9f-9adf51860962.rna_seq.transcriptome.gdc_realn.bam
6 a8e0466b-d106-44b7-b846-c4fb666715f5.rna_seq.transcriptome.gdc_realn.bam
cases
1 TCGA-C8-A1HO-01A-11R-A13Q-07
2 TCGA-A7-A0CD-01A-11R-A00Z-07
3 TCGA-AO-A0J5-01A-11R-A034-07
4 TCGA-B6-A0IK-01A-12R-A056-07
5 TCGA-A7-A4SB-01A-21R-A266-07
6 TCGA-E9-A1R4-01A-21R-A14D-07
我们可以获取查询到数据的 manifest
文件列表,设置 save = TRUE
可以将结果输出到文件中,用其他软件下载对应的数据
getManifest(query, save = FALSE)
getResults(TCGAbiolinks:::GDCquery_ATAC_seq())[,c("file_name","file_size")] # A tibble: 47 × 2 file_name file_size <chr> <dbl> 1 TCGA-ATAC_PanCancer_PeakSet.txt 37522221 2 TCGA-ATAC_DataS1_DonorsAndStats_v4.xlsx 251795 3 MESO_bigWigs.tgz 1507969705 4 TCGA-ATAC_DataS5_GWAS_v2.xlsx 999661 5 COAD_bigWigs.tgz 8939070313 6 TCGA-ATAC_PanCan_Raw_Counts.txt 1241263873 7 TGCT_bigWigs.tgz 2058854449 8 TCGA-ATAC_DataS6_Footprinting_v1.xlsx 19544445 9 TCGA-ATAC_DataS10_WGS-ATAC_v2_Open.xlsx 893110 10 TCGA-ATAC_Cancer_Type-specific_Count_Matrices_log2norm_counts.zip 632769396 # ℹ 37 more rows # ℹ Use `print(n = ...)` to see more rows
TCGAbiolinks
提供了一些函数,用于下载和处理来自 GDC
数据库的数据,并用于后续的分析
GDCdownload
下载 GDC
数据的方法有两种:
client
:该方法会创建 MANIFEST
文件并使用 GDC Data Transfer Tool
来下载数据,但是速度上会比使用 API
更慢api
:该方法使用 GDC API
接口来下载数据,也会创建 MANIFEST
文件并将数据保存为 tar.gz
文件,如果数据过大,会将数据拆分为小块分别下载,files.per.chunk
参数可以控制同时下载的文件个数,减少下载错误GDCprepare
:使用 GDCprepare
处理数据后,会返回 SummarizedExperiment
或 data.frame
对象,主要包含三个主要的矩阵,可以使用 SummarizedExperiment
包来获取对应的信息:
colData(data)
来获取样本信息assay(data)
来获取分子数据rowRanges(data)
来获取特征元数据信息,如基因的信息参数列表
我们从 harmonized
(参考基因组为 hg38
)数据库中下载基因表达数据
# Gene expression aligned against hg38
query <- GDCquery(
project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
)
GDCdownload(query = query)
data <- GDCprepare(query = query)
返回的对象为 SummarizedExperiment
类型
class(data)
# [1] "RangedSummarizedExperiment"
# attr(,"package")
# [1] "SummarizedExperiment"
获取样本信息
# 先导入包
library(SummarizedExperiment)
colData(data)[,1:4]
# DataFrame with 2 rows and 4 columns
# barcode patient sample shortLetterCode
# <character> <character> <character> <character>
# TCGA-14-0736-02A-01R-2005-01 TCGA-14-0736-02A-01R.. TCGA-14-0736 TCGA-14-0736-02A TR
# TCGA-06-0211-02A-02R-2005-01 TCGA-06-0211-02A-02R.. TCGA-06-0211 TCGA-06-0211-02A TR
获取表达数据
assay(data)[1:6,]
# TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
# ENSG00000000003.15 4239 3825
# ENSG00000000005.6 34 7
# ENSG00000000419.13 891 1775
# ENSG00000000457.14 251 455
# ENSG00000000460.17 123 333
# ENSG00000000938.13 284 677
获取基因信息
rowRanges(data)[1:3, 1:3]
# GRanges object with 3 ranges and 3 metadata columns:
# seqnames ranges strand | source type score
# <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
# ENSG00000000003.15 chrX 100627108-100639991 - | HAVANA gene NA
# ENSG00000000005.6 chrX 100584936-100599885 + | HAVANA gene NA
# ENSG00000000419.13 chr20 50934867-50958555 - | HAVANA gene NA
# -------
# seqinfo: 25 sequences from an unspecified genome; no seqlengths
GDCprepare
函数的功能还在开发中,并不是所有情况下都能正常工作,目前支持的数据处理包括下表中列出的情况
下载 Copy Number Segment
类型的数据
query <- GDCquery(
project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Copy Number Segment",
barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01")
)
GDCdownload(query)
data <- GDCprepare(query)
下载 Gene Level Copy Number
类型的数据
query <- GDCquery(
project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number",
access = "open"
)
GDCdownload(query)
data <- GDCprepare(query)
下载 Allele-specific Copy Number Segment
类型的数据
query <- GDCquery(
project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Allele-specific Copy Number Segment",
access = "open"
)
GDCdownload(query)
data <- GDCprepare(query)
下载 Masked Copy Number Segment
类型的数据
query <- GDCquery(
project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Masked Copy Number Segment",
access = "open"
)
GDCdownload(query)
data <- GDCprepare(query)
下载 mRNA
表达数据
# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
query.exp.hg38 <- GDCquery(
project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
)
GDCdownload(query.exp.hg38)
expdat <- GDCprepare(
query = query.exp.hg38,
save = TRUE,
save.filename = "exp.rda"
)
下载 miRNA
表达数据
query.mirna <- GDCquery(
project = "TARGET-AML",
experimental.strategy = "miRNA-Seq",
data.category = "Transcriptome Profiling",
barcode = c("TARGET-20-PATDNN","TARGET-20-PAPUNR"),
data.type = "miRNA Expression Quantification"
)
GDCdownload(query.mirna)
mirna <- GDCprepare(
query = query.mirna,
save = TRUE,
save.filename = "mirna.rda"
)
下载 isoform
(剪切异构体)表达数据
query.isoform <- GDCquery(
project = "TARGET-AML",
experimental.strategy = "miRNA-Seq",
data.category = "Transcriptome Profiling",
barcode = c("TARGET-20-PATDNN","TARGET-20-PAPUNR"),
data.type = "Isoform Expression Quantification"
)
GDCdownload(query.isoform)
isoform <- GDCprepare(
query = query.isoform,
save = TRUE,
save.filename = "mirna-isoform.rda"
)
下载甲基化 Beta
值数据
# 27K 数据 query_met.hg38 <- GDCquery( project = "TCGA-BRCA", data.category = "DNA Methylation", data.type = "Methylation Beta Value", platform = "Illumina Human Methylation 27", barcode = c("TCGA-B6-A0IM") ) GDCdownload(query_met.hg38) data.hg38 <- GDCprepare(query_met.hg38) # 450K 数据 query_met.hg38 <- GDCquery( project= "TCGA-LGG", data.category = "DNA Methylation", data.type = "Methylation Beta Value", platform = "Illumina Human Methylation 450", barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05") ) GDCdownload(query_met.hg38) data.hg38 <- GDCprepare(query_met.hg38) # Epic query_met.hg38 <- GDCquery( project= "HCMI-CMDC", data.category = "DNA Methylation", data.type = "Methylation Beta Value", platform = "Illumina Methylation Epic", barcode = c("HCM-BROD-0045") ) GDCdownload(query_met.hg38) data.hg38 <- GDCprepare(query_met.hg38)
下载 IDAT
文件
query <- GDCquery( project = "TCGA-BRCA", data.category = "DNA Methylation", data.type = "Masked Intensities", platform = "Illumina Human Methylation 27" ) GDCdownload(query, files.per.chunk=10) betas <- GDCprepare(query) query <- GDCquery( project = "HCMI-CMDC", data.category = "DNA Methylation", data.type = "Masked Intensities", platform = "Illumina Methylation Epic" ) GDCdownload(query, files.per.chunk = 10) betas <- GDCprepare(query)
下载蛋白质芯片数据
query.rppa <- GDCquery(
project = "TCGA-ESCA",
data.category = "Proteome Profiling",
data.type = "Protein Expression Quantification"
)
GDCdownload(query.rppa)
rppa <- GDCprepare(query.rppa)
下载样本的临床信息
# XML 格式 query <- GDCquery( project = "TCGA-COAD", data.category = "Clinical", data.type = "Clinical Supplement", data.format = "BCR XML", barcode = "TCGA-A6-5664" ) GDCdownload(query) drug <- GDCprepare_clinic(query,"drug") query <- GDCquery( project = "TCGA-COAD", data.category = "Clinical", data.type = "Clinical Supplement", data.format = "BCR OMF XML", barcode = "TCGA-AD-6964" ) GDCdownload(query) query <- GDCquery( project = "TCGA-ACC", data.category = "Clinical", data.type = "Clinical Supplement", data.format = "BCR Biotab" ) GDCdownload(query) clinical.BCRtab.all <- GDCprepare(query) names(clinical.BCRtab.all) # 放疗信息 query <- GDCquery( project = "TCGA-ACC", data.category = "Clinical", data.type = "Clinical Supplement", data.format = "BCR Biotab", file.type = "radiation" ) GDCdownload(query) clinical.BCRtab.radiation <- GDCprepare(query)
下载单核苷酸变异信息
query <- GDCquery(
project = "TCGA-HNSC",
data.category = "Simple Nucleotide Variation",
data.type = "Masked Somatic Mutation",
access = "open"
)
GDCdownload(query)
maf <- GDCprepare(query)
也可以下载 MC3 MAF
文件
maf <- getMC3MAF()
然后搭配 maftools
来可视化突变数据。例如
library(maftools)
library(dplyr)
query <- GDCquery(
project = "TCGA-CHOL",
data.category = "Simple Nucleotide Variation",
access = "open",
data.type = "Masked Somatic Mutation",
workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query)
maf <- GDCprepare(query)
maf <- read.maf(maf)
绘制简单的汇总结果图
plotmafSummary(maf = maf, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE)
绘制突变瀑布图
oncoplot(maf = maf, top = 10, removeNonMutated = TRUE)
绘制 ti
(transitions
) 和 tv
(transversions
) 的比例,以及样本中 6
个碱基替换的占比
titv <- titv(maf = maf, plot = FALSE, useSyn = TRUE)
plotTiTv(res = titv)
PanCancerAtlas_subtypes
函数可访问从 synapse
检索到的信息(可能包含最新的分子亚型)
subtypes <- PanCancerAtlas_subtypes()
subtypes[1:4, 1:4]
# # A tibble: 4 × 4
# pan.samplesID cancer.type Subtype_mRNA Subtype_DNAmeth
# <chr> <chr> <chr> <chr>
# 1 TCGA-OR-A5J1 ACC steroid-phenotype-high+proliferation CIMP-high
# 2 TCGA-OR-A5J2 ACC steroid-phenotype-high+proliferation CIMP-low
# 3 TCGA-OR-A5J3 ACC steroid-phenotype-high CIMP-intermediate
# 4 TCGA-OR-A5J4 ACC NA CIMP-high
而 TCGAquery_subtype
函数则包含从 TCGA
论文中检索到的样本信息的完整表。
TCGAquery_subtype(tumor = "acc")[1:3, 1:4]
# acc subtype information from:doi:10.1016/j.ccell.2016.04.002
# patient Histology C1A.C1B mRNA_K4
# 1 TCGA-OR-A5J1 Usual Type C1A steroid-phenotype-high+proliferation
# 2 TCGA-OR-A5J2 Usual Type C1A steroid-phenotype-high+proliferation
# 3 TCGA-OR-A5J3 Usual Type C1A steroid-phenotype-high
获取单细胞转录组数据
query.sc.analysis <- GDCquery(
project = "CPTAC-3",
data.category = "Transcriptome Profiling",
access = "open",
data.type = "Single Cell Analysis",
data.format = "TSV"
)
GDCdownload(query.sc.analysis)
Single.Cell.Analysis.list <- GDCprepare(query.sc.analysis)
10x Raw Counts
数据
query.raw.counts <- GDCquery(
project = "CPTAC-3",
data.category = "Transcriptome Profiling",
access = "open",
data.type = "Gene Expression Quantification",
barcode = c("CPT0167860015","CPT0206880004"),
workflow.type = "CellRanger - 10x Raw Counts"
)
GDCdownload(query.raw.counts)
raw.counts.list <- GDCprepare(query.raw.counts)
过滤后的 10x Raw Counts
数据
query.filtered.counts <- GDCquery(
project = "CPTAC-3",
data.category = "Transcriptome Profiling",
access = "open",
data.type = "Gene Expression Quantification",
barcode = c("CPT0167860015","CPT0206880004"),
workflow.type = "CellRanger - 10x Filtered Counts"
)
GDCdownload(query.filtered.counts)
filtered.counts.list <- GDCprepare(query.filtered.counts)
10x Chromium
差异表达数据
query.sc.dea <- GDCquery(
project = "CPTAC-3",
data.category = "Transcriptome Profiling",
access = "open",
data.type = "Differential Gene Expression",
barcode = c("CPT0167860015","CPT0206880004"),
workflow.type = "Seurat - 10x Chromium"
)
GDCdownload(query.sc.dea)
sc.dea.list <- GDCprepare(query.sc.dea)
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。