当前位置:   article > 正文

TCGA 数据下载 —— TCGAbiolinks 简单使用

tcgabiolinks

前言


前面我们曾介绍过 TCGAbiolinks 的使用,但是由于 TCGA 的改版,该包也做出了相应的更新,所以我再重新介绍一下新版 TCGAbiolinks 的使用。

TCGAbiolinks 是一个利用 GDC API 接口来查询、下载和分析 TCGA 数据库的数据的 R

TCGAbiolinks 包的功能主要可以分为三大块:

  • 数据查询和下载
  • 数据的常规分析
  • 可视化

该包可以从 Bioconductor 上安装稳定版本

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
  • 1
  • 2
  • 3

或者从 GitHub 上安装开发版本

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
  • 1
  • 2
  • 3
  • 4

导入包

library(TCGAbiolinks)  # version 2.30.4
  • 1

数据查询

TCGAbiolinks 提供了一些函数用于查询和下载 GDC 中的数据,包括:

  • Harmonized:这部分数据都比较新,使用的是 GRCh38 (hg38) 基因组版本,使用的是 GDC pipeline 来处理数据
  • Legacy:这部分的数据应该是较早之前测的,使用的是 GRCh37 (hg19) 基因组版本

使用 GDCquery 函数来查询 GDC 的数据,该函数的参数为:

GDCquery(
  project,
  data.category,
  data.type,
  workflow.type,
  access,
  platform,
  barcode,
  data.format,
  experimental.strategy,
  sample.type
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • project:该参数的取值非常多,可以使用如下命令来查询所有可用的项目
> TCGAbiolinks:::getGDCprojects()$project_id
[1] "HCMI-CMDC"                 "TCGA-BRCA"                 "TARGET-ALL-P3"            
 [4] "EXCEPTIONAL_RESPONDERS-ER" "CGCI-HTMCP-LC"             "CPTAC-2"                  
 [7] "CMI-MBC"                   "TARGET-ALL-P2"             "OHSU-CNL"                 
[10] "TARGET-ALL-P1"             "MMRF-COMMPASS"             "ORGANOID-PANCREATIC"      
[13] "NCICCR-DLBCL"              "TCGA-SARC"                 "TCGA-ACC"                 
[16] "WCDT-MCRPC"                "TCGA-UCEC"                 "MP2PRT-ALL"               
[19] "TCGA-KIRC"                 "CGCI-HTMCP-CC"             "CMI-ASC"                  
[22] "CGCI-HTMCP-DLBCL"          "BEATAML1.0-CRENOLANIB"     "CDDP_EAGLE-1"             
[25] "APOLLO-LUAD"               "CMI-MPC"                   "FM-AD"                    
[28] "MATCH-Z1D"                 "MATCH-Y"                   "MATCH-N"                  
[31] "MATCH-Q"                   "MP2PRT-WT"                 "TCGA-LAML"                
[34] "VAREPOP-APOLLO"            "TCGA-SKCM"                 "TRIO-CRU"                 
[37] "TCGA-PAAD"                 "TCGA-TGCT"                 "TCGA-CESC"                
[40] "TCGA-ESCA"                 "TCGA-THCA"                 "TCGA-LIHC"                
[43] "TCGA-PRAD"                 "TCGA-READ"                 "MATCH-I"                  
[46] "MATCH-W"                   "MATCH-B"                   "MATCH-H"                  
[49] "TCGA-OV"                   "TCGA-UVM"                  "MATCH-Z1A"                
[52] "MATCH-U"                   "BEATAML1.0-COHORT"         "TCGA-BLCA"                
[55] "CGCI-BLGSP"                "CTSP-DLBCL1"               "MATCH-S1"                 
[58] "MATCH-R"                   "MATCH-Z1I"                 "CPTAC-3"                  
[61] "TCGA-CHOL"                 "TCGA-GBM"                  "MATCH-S2"                 
[64] "TCGA-UCS"                  "TCGA-PCPG"                 "TCGA-MESO"                
[67] "TARGET-CCSK"               "TARGET-WT"                 "TARGET-RT"                
[70] "TCGA-DLBC"                 "TARGET-OS"                 "TCGA-COAD"                
[73] "REBC-THYR"                 "TCGA-STAD"                 "TCGA-KIRP"                
[76] "TCGA-THYM"                 "TCGA-KICH"                 "TCGA-LGG"                 
[79] "TARGET-AML"                "TCGA-LUSC"                 "TCGA-LUAD"                
[82] "TCGA-HNSC"                 "TARGET-NBL"
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • data.category:可以使用如下方式来查询 TCGA-BRCA 项目的可用的分类数据
> TCGAbiolinks:::getProjectSummary("TCGA-BRCA")
$file_count
[1] 61173

$data_categories
   file_count case_count                data_category
1       17337       1098  Simple Nucleotide Variation
2        9281       1098             Sequencing Reads
3        5316       1098                  Biospecimen
4        2288       1098                     Clinical
5       12292       1098        Copy Number Variation
6        4876       1097      Transcriptome Profiling
7        3714       1097              DNA Methylation
8         919        881           Proteome Profiling
9         226        101 Somatic Structural Variation
10       4924       1095         Structural Variation

$case_count
[1] 1098

$file_size
[1] 6.245362e+14
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22

新版本只能获取 harmonized 类型的数据,主要包含 7 个分类:

- Biospecimen
- Clinical
- Copy Number Variation
- DNA Methylation
- Sequencing Reads
- Simple Nucleotide Variation
- Transcriptome Profiling
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

注意

legacy 类型的数据在这个版本中已经不支持了,因为 GDC 很块将会把该类型数据关闭。

  • sample.type:样本类型,取值可以是三列中的任意一个值

至于其他参数,对于 harmonized 数据,其参数取值有

参数列表的详细取值,可以从下面的网址中下载得到:https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=2046985454

harmonized 数据库

1. 甲基化数据:复发肿瘤样本

在这个例子中,我们在 harmonized 数据库中搜索复发性 GBM 样本的甲基化数据

query <- GDCquery(
  project = c("TCGA-GBM"),
  data.category = "DNA Methylation",
  platform = c("Illumina Human Methylation 450"),
  sample.type = "Recurrent Tumor"
)

data <- getResults(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

getResults() 用于获取查询结果,可以设置 rowscols 参数来设置返回的行和列

如果是使用 RStudio,可以使用 View(data) 来查看数据,也可以使用 DT 包的 datatable 来展示数据

返回结果如下

2. 查询甲基化和表达数据

从数据库中分别获取所有包含甲基化数据(450k)和表达数据的乳腺癌患者信息

query.met <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "DNA Methylation",
  platform = c("Illumina Human Methylation 450")
)
query.exp <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification", 
  workflow.type = "STAR - Counts"
)
# 获取既检测甲基化又检测表达的样本
common.patients <- intersect(
  substr(getResults(query.met, cols = "cases"), 1, 12),
  substr(getResults(query.exp, cols = "cases"), 1, 12)
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
> length(common.patients)
[1] 789
  • 1
  • 2

共有 789 个样本,可以设置 barcode 参数,只选择部分样本的数据

selected.met <- GDCquery(
    project = "TCGA-BRCA",
    data.category = "DNA Methylation",
    platform = c("Illumina Human Methylation 450"),
    barcode = common.patients[1:5]
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

或者直接使用 matchedMetExp 获取同时检测两种类型数据的样本编号

> samples <- matchedMetExp("TCGA-BRCA", n = 10)
> samples
 [1] "TCGA-AQ-A0Y5-01" "TCGA-C8-A274-01" "TCGA-B6-A1KC-01" "TCGA-AC-A62V-01" "TCGA-A2-A0YM-01"
 [6] "TCGA-AO-A03N-01" "TCGA-AO-A1KQ-01" "TCGA-E2-A1LI-01" "TCGA-B6-A0WV-01" "TCGA-E2-A1LE-01"
  • 1
  • 2
  • 3
  • 4
3. 查询原始数据

从数据库中查询乳腺癌原始测序数据(bam),这部分样本访问受限,一般人用不了。

query <- GDCquery(
    project = "TCGA-BRCA", 
    data.category = "Sequencing Reads",
    data.type = "Aligned Reads", 
    data.format = "bam",
    workflow.type = "STAR 2-Pass Transcriptome"
    # workflow.type = "STAR 2-Pass Genome"
    # workflow.type = "STAR 2-Pass Chimeric"
    # workflow.type = "BWA with Mark Duplicates and BQSR"
    # workflow.type = "BWA-aln"
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
> getResults(query, rows = 1:6, cols = c("file_name","cases"))
                                                                  file_name
1  0361ec21-e83f-40a9-b825-d92487b1239a.rna_seq.transcriptome.gdc_realn.bam
2  0a9e33db-2527-4cc3-8669-a7c10fed7a7f.rna_seq.transcriptome.gdc_realn.bam
3  65c868ba-94ed-43f2-bfd1-814979ee9486.rna_seq.transcriptome.gdc_realn.bam
4  09f906ba-fa4b-487f-9631-fbd97d9db8f9.rna_seq.transcriptome.gdc_realn.bam
5  8c92ecf0-5d81-48c3-ad9f-9adf51860962.rna_seq.transcriptome.gdc_realn.bam
6  a8e0466b-d106-44b7-b846-c4fb666715f5.rna_seq.transcriptome.gdc_realn.bam
                          cases
1  TCGA-C8-A1HO-01A-11R-A13Q-07
2  TCGA-A7-A0CD-01A-11R-A00Z-07
3  TCGA-AO-A0J5-01A-11R-A034-07
4  TCGA-B6-A0IK-01A-12R-A056-07
5  TCGA-A7-A4SB-01A-21R-A266-07
6  TCGA-E9-A1R4-01A-21R-A14D-07
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15

获取 manifest 文件

我们可以获取查询到数据的 manifest 文件列表,设置 save = TRUE 可以将结果输出到文件中,用其他软件下载对应的数据

getManifest(query, save = FALSE) 
  • 1

获取 ATAC-seq 数据

getResults(TCGAbiolinks:::GDCquery_ATAC_seq())[,c("file_name","file_size")]
# A tibble: 47 × 2
   file_name                                                          file_size
   <chr>                                                                  <dbl>
 1 TCGA-ATAC_PanCancer_PeakSet.txt                                     37522221
 2 TCGA-ATAC_DataS1_DonorsAndStats_v4.xlsx                               251795
 3 MESO_bigWigs.tgz                                                  1507969705
 4 TCGA-ATAC_DataS5_GWAS_v2.xlsx                                         999661
 5 COAD_bigWigs.tgz                                                  8939070313
 6 TCGA-ATAC_PanCan_Raw_Counts.txt                                   1241263873
 7 TGCT_bigWigs.tgz                                                  2058854449
 8 TCGA-ATAC_DataS6_Footprinting_v1.xlsx                               19544445
 9 TCGA-ATAC_DataS10_WGS-ATAC_v2_Open.xlsx                               893110
10 TCGA-ATAC_Cancer_Type-specific_Count_Matrices_log2norm_counts.zip  632769396
# ℹ 37 more rows
# ℹ Use `print(n = ...)` to see more rows
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16

数据下载与处理

TCGAbiolinks 提供了一些函数,用于下载和处理来自 GDC 数据库的数据,并用于后续的分析

  1. 数据下载 GDCdownload

下载 GDC 数据的方法有两种:

  • client:该方法会创建 MANIFEST 文件并使用 GDC Data Transfer Tool 来下载数据,但是速度上会比使用 API 更慢
  • api:该方法使用 GDC API 接口来下载数据,也会创建 MANIFEST 文件并将数据保存为 tar.gz 文件,如果数据过大,会将数据拆分为小块分别下载,files.per.chunk 参数可以控制同时下载的文件个数,减少下载错误

  1. 数据处理 GDCprepare

使用 GDCprepare 处理数据后,会返回 SummarizedExperimentdata.frame 对象,主要包含三个主要的矩阵,可以使用 SummarizedExperiment 包来获取对应的信息:

  • 样本信息矩阵,使用 colData(data) 来获取样本信息
  • 检测矩阵信息,使用 assay(data) 来获取分子数据
  • 特征矩阵信息,使用 rowRanges(data) 来获取特征元数据信息,如基因的信息

参数列表

查询和下载数据

我们从 harmonized(参考基因组为 hg38)数据库中下载基因表达数据

# Gene expression aligned against hg38
query <- GDCquery(
    project = "TCGA-GBM",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts",
    barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
)
GDCdownload(query = query)
data <- GDCprepare(query = query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10

返回的对象为 SummarizedExperiment 类型

class(data)
# [1] "RangedSummarizedExperiment"
# attr(,"package")
# [1] "SummarizedExperiment"
  • 1
  • 2
  • 3
  • 4

获取样本信息

# 先导入包
library(SummarizedExperiment)
colData(data)[,1:4]
# DataFrame with 2 rows and 4 columns
#                                             barcode      patient           sample shortLetterCode
#                                         <character>  <character>      <character>     <character>
# TCGA-14-0736-02A-01R-2005-01 TCGA-14-0736-02A-01R.. TCGA-14-0736 TCGA-14-0736-02A              TR
# TCGA-06-0211-02A-02R-2005-01 TCGA-06-0211-02A-02R.. TCGA-06-0211 TCGA-06-0211-02A              TR
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

获取表达数据

assay(data)[1:6,]
# TCGA-14-0736-02A-01R-2005-01 TCGA-06-0211-02A-02R-2005-01
# ENSG00000000003.15                         4239                         3825
# ENSG00000000005.6                            34                            7
# ENSG00000000419.13                          891                         1775
# ENSG00000000457.14                          251                          455
# ENSG00000000460.17                          123                          333
# ENSG00000000938.13                          284                          677
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

获取基因信息

rowRanges(data)[1:3, 1:3]
# GRanges object with 3 ranges and 3 metadata columns:
#     seqnames              ranges strand |   source     type     score
#     <Rle>              <IRanges>  <Rle> | <factor> <factor> <numeric>
# ENSG00000000003.15     chrX 100627108-100639991      - |   HAVANA     gene        NA
# ENSG00000000005.6     chrX 100584936-100599885      + |   HAVANA     gene        NA
# ENSG00000000419.13    chr20   50934867-50958555      - |   HAVANA     gene        NA
# -------
#     seqinfo: 25 sequences from an unspecified genome; no seqlengths
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

GDCprepare

GDCprepare 函数的功能还在开发中,并不是所有情况下都能正常工作,目前支持的数据处理包括下表中列出的情况

下载 CNV 数据

下载 Copy Number Segment 类型的数据

query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Copy Number Variation",
    data.type = "Copy Number Segment",
    barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01")
)
GDCdownload(query)
data <- GDCprepare(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

下载 Gene Level Copy Number 类型的数据

query <- GDCquery(
    project = "TCGA-ACC",
    data.category = "Copy Number Variation",
    data.type = "Gene Level Copy Number",              
    access = "open"
)
GDCdownload(query)
data <- GDCprepare(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

下载 Allele-specific Copy Number Segment 类型的数据

query <- GDCquery(
    project = "TCGA-ACC",
    data.category = "Copy Number Variation",
    data.type = "Allele-specific Copy Number Segment",              
    access = "open"
)
GDCdownload(query)
data <- GDCprepare(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

下载 Masked Copy Number Segment 类型的数据

query <- GDCquery(
    project = "TCGA-ACC",
    data.category = "Copy Number Variation",
    data.type = "Masked Copy Number Segment",              
    access = "open"
)
GDCdownload(query)
data <- GDCprepare(query)

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
下载转录组数据

下载 mRNA 表达数据

# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
query.exp.hg38 <- GDCquery(
    project = "TCGA-GBM", 
    data.category = "Transcriptome Profiling", 
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts",
    barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
)
GDCdownload(query.exp.hg38)
expdat <- GDCprepare(
    query = query.exp.hg38,
    save = TRUE, 
    save.filename = "exp.rda"
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

下载 miRNA 表达数据

query.mirna <- GDCquery(
    project = "TARGET-AML", 
    experimental.strategy = "miRNA-Seq",
    data.category = "Transcriptome Profiling", 
    barcode = c("TARGET-20-PATDNN","TARGET-20-PAPUNR"),
    data.type = "miRNA Expression Quantification"
)
GDCdownload(query.mirna)
mirna <- GDCprepare(
    query = query.mirna,
    save = TRUE, 
    save.filename = "mirna.rda"
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13

下载 isoform(剪切异构体)表达数据

query.isoform <- GDCquery(
    project = "TARGET-AML", 
    experimental.strategy = "miRNA-Seq",
    data.category = "Transcriptome Profiling", 
    barcode = c("TARGET-20-PATDNN","TARGET-20-PAPUNR"),
    data.type = "Isoform Expression Quantification"
)
GDCdownload(query.isoform)

isoform <- GDCprepare(
    query = query.isoform,
    save = TRUE, 
    save.filename = "mirna-isoform.rda"
)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
下载 DNA 甲基化数据

下载甲基化 Beta 值数据

# 27K 数据
query_met.hg38 <- GDCquery(
    project = "TCGA-BRCA", 
    data.category = "DNA Methylation", 
    data.type = "Methylation Beta Value",
    platform = "Illumina Human Methylation 27", 
    barcode = c("TCGA-B6-A0IM")
)
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)
# 450K 数据
query_met.hg38 <- GDCquery(
    project= "TCGA-LGG", 
    data.category = "DNA Methylation", 
    data.type = "Methylation Beta Value",
    platform = "Illumina Human Methylation 450", 
    barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05")
)
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)

# Epic
query_met.hg38 <- GDCquery(
    project= "HCMI-CMDC", 
    data.category = "DNA Methylation", 
    data.type = "Methylation Beta Value",
    platform = "Illumina Methylation Epic", 
    barcode = c("HCM-BROD-0045")
)
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31

下载 IDAT 文件

query <- GDCquery(
    project = "TCGA-BRCA",
    data.category = "DNA Methylation",
    data.type = "Masked Intensities",
    platform = "Illumina Human Methylation 27"
)
GDCdownload(query, files.per.chunk=10)
betas <- GDCprepare(query)

query <- GDCquery(
    project = "HCMI-CMDC",
    data.category = "DNA Methylation",
    data.type = "Masked Intensities",
    platform = "Illumina Methylation Epic"
)
GDCdownload(query, files.per.chunk = 10)
betas <- GDCprepare(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
下载蛋白质组数据

下载蛋白质芯片数据

query.rppa <- GDCquery(
    project = "TCGA-ESCA", 
    data.category = "Proteome Profiling",
    data.type = "Protein Expression Quantification"
)
GDCdownload(query.rppa) 
rppa <- GDCprepare(query.rppa) 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
下载临床数据

下载样本的临床信息

# XML 格式
query <- GDCquery(
    project = "TCGA-COAD",
    data.category = "Clinical",
    data.type = "Clinical Supplement",
    data.format = "BCR XML",
    barcode = "TCGA-A6-5664"
)
GDCdownload(query)
drug <- GDCprepare_clinic(query,"drug")

query <- GDCquery(
    project = "TCGA-COAD",
    data.category = "Clinical",
    data.type = "Clinical Supplement",
    data.format = "BCR OMF XML",
    barcode = "TCGA-AD-6964"
)
GDCdownload(query)


query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Clinical",
    data.type = "Clinical Supplement", 
    data.format = "BCR Biotab"
)
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
names(clinical.BCRtab.all)
# 放疗信息
query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Clinical",
    data.type = "Clinical Supplement", 
    data.format = "BCR Biotab",
    file.type = "radiation"
)
GDCdownload(query)
clinical.BCRtab.radiation <- GDCprepare(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
下载 SNV 数据

下载单核苷酸变异信息

query <- GDCquery(
    project = "TCGA-HNSC", 
    data.category = "Simple Nucleotide Variation",
    data.type = "Masked Somatic Mutation",
    access = "open"
)
GDCdownload(query)
maf <- GDCprepare(query)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

也可以下载 MC3 MAF 文件

maf <- getMC3MAF()
  • 1

然后搭配 maftools 来可视化突变数据。例如

library(maftools)
library(dplyr)

query <- GDCquery(
    project = "TCGA-CHOL", 
    data.category = "Simple Nucleotide Variation", 
    access = "open",
    data.type = "Masked Somatic Mutation", 
    workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query)
maf <- GDCprepare(query)

maf <- read.maf(maf)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

绘制简单的汇总结果图

plotmafSummary(maf = maf, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE)
  • 1


绘制突变瀑布图

oncoplot(maf = maf, top = 10, removeNonMutated = TRUE)
  • 1

绘制 titransitions) 和 tvtransversions) 的比例,以及样本中 6 个碱基替换的占比

titv <- titv(maf = maf, plot = FALSE, useSyn = TRUE)
plotTiTv(res = titv)
  • 1
  • 2

获取分子亚型

PanCancerAtlas_subtypes 函数可访问从 synapse 检索到的信息(可能包含最新的分子亚型)

subtypes <- PanCancerAtlas_subtypes()
subtypes[1:4, 1:4]
# # A tibble: 4 × 4
#   pan.samplesID cancer.type Subtype_mRNA                         Subtype_DNAmeth  
#   <chr>         <chr>       <chr>                                <chr>            
# 1 TCGA-OR-A5J1  ACC         steroid-phenotype-high+proliferation CIMP-high        
# 2 TCGA-OR-A5J2  ACC         steroid-phenotype-high+proliferation CIMP-low         
# 3 TCGA-OR-A5J3  ACC         steroid-phenotype-high               CIMP-intermediate
# 4 TCGA-OR-A5J4  ACC         NA                                   CIMP-high 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

TCGAquery_subtype 函数则包含从 TCGA 论文中检索到的样本信息的完整表。

TCGAquery_subtype(tumor = "acc")[1:3, 1:4]
# acc subtype information from:doi:10.1016/j.ccell.2016.04.002
#        patient  Histology C1A.C1B                              mRNA_K4
# 1 TCGA-OR-A5J1 Usual Type     C1A steroid-phenotype-high+proliferation
# 2 TCGA-OR-A5J2 Usual Type     C1A steroid-phenotype-high+proliferation
# 3 TCGA-OR-A5J3 Usual Type     C1A               steroid-phenotype-high
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
下载单细胞数据

获取单细胞转录组数据

query.sc.analysis <- GDCquery(
    project = "CPTAC-3", 
    data.category = "Transcriptome Profiling",
    access = "open",
    data.type = "Single Cell Analysis",
    data.format =  "TSV"
)  
GDCdownload(query.sc.analysis)
Single.Cell.Analysis.list <- GDCprepare(query.sc.analysis)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

10x Raw Counts 数据

query.raw.counts <- GDCquery(
    project = "CPTAC-3", 
    data.category = "Transcriptome Profiling",
    access = "open",
    data.type = "Gene Expression Quantification",
    barcode = c("CPT0167860015","CPT0206880004"),
    workflow.type = "CellRanger - 10x Raw Counts"
)  
GDCdownload(query.raw.counts)
raw.counts.list <- GDCprepare(query.raw.counts)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10

过滤后的 10x Raw Counts 数据

query.filtered.counts <- GDCquery(
    project = "CPTAC-3", 
    data.category = "Transcriptome Profiling",
    access = "open",
    data.type = "Gene Expression Quantification",
    barcode = c("CPT0167860015","CPT0206880004"),
    workflow.type = "CellRanger - 10x Filtered Counts"
)  
GDCdownload(query.filtered.counts)
filtered.counts.list <- GDCprepare(query.filtered.counts)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10

10x Chromium 差异表达数据

query.sc.dea <- GDCquery(
    project = "CPTAC-3", 
    data.category = "Transcriptome Profiling",
    access = "open",
    data.type = "Differential Gene Expression",
    barcode = c("CPT0167860015","CPT0206880004"),
    workflow.type = "Seurat - 10x Chromium"
)  
GDCdownload(query.sc.dea)
sc.dea.list <- GDCprepare(query.sc.dea)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/我家小花儿/article/detail/915017
推荐阅读
相关标签
  

闽ICP备14008679号