赞
踩
(i)
investigate how gene-regulatory networks or gene ontologies are gained or lost over time(这个绝对可以)(ii)
stratify selected gene sets (e.g. surface markers) by the order in which they appear(按基因出现的顺序对选定的基因集(例如表面标记)进行分层 ) (iii)
identify key differences in the gene expression changes in cell transitions that bifurcate over time(分化转变基因,这个最为关键)。GeneSwitches 的目标是以单细胞分辨率发现细胞状态转换期间基因表达和功能事件的顺序。 它适用于任何单细胞轨迹或细胞的伪时间排序,以发现充当细胞状态之间开/关开关的基因,重要的是这些开关发生的顺序。
Users may use following codes to check and install all the required packages.
- list.of.packages <- c("SingleCellExperiment", "Biobase", "fastglm", "ggplot2", "monocle",
- "plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "devtools",
- "mixtools")
-
- ## for package "fastglm", "ggplot2", "plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "mixtools"
- new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
- if(length(new.packages)) install.packages(new.packages)
-
- ## for package "SingleCellExperiment", "Biobase"
- if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
- new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
- if(length(new.packages)) BiocManager::install(new.packages)
-
The source code of GeneSwitches can be installed from GitHub with:
- devtools::install_github("SGDDNB/GeneSwitches")
-
GeneSwitches 需要两个输入,即基因表达矩阵和每个细胞的相应伪时间排序。 我们将这些输入数据集转换为一个 SingleCellExperiment
对象(Lun and Risso 2017),在下面你会发现一个完整的“从头到尾”的工作流程来实现这个分析的潜力。
- ## load libraries
- library(GeneSwitches)
- library(SingleCellExperiment)
-
这里的实例代码,我们将使用已发布的单细胞 RNA-seq 数据集,这些数据从人类胚胎干细胞 (hESC) 到心肌细胞 (CM) 的分化(Friedman 等人,2018 年)。 运用monocle2进行轨迹分析。 选择这个数据集的部分原因是它显示了心脏 hESC 分化的分叉细胞命运,它产生了明确的心肌细胞 (Path1) 或非收缩性心脏衍生物 (Path2),允许应用 GeneSwitches 的所有方面。
- ## Download example files to current directory
- get_example_inputData()
- ## Load input data log-normalized gene expression
- load("./logexpdata.RData")
- ## Load Monocle2 object with pseudo-time and dimensionality reduction
- load("./cardiac_monocle2.RData")
-
Users can input the gene expression (logexpdata
; recommend for log-normalized expression), pseudo-time (cell_pseudotime
) and dimensionality reductions (rd_PCA
; optional and only for gene expression plots) into SingleCellExperiment object as follows.
- ### create SingleCellExperiment object with log-normalized single cell data
- #sce <- SingleCellExperiment(assays = List(expdata = logexpdata))
- ### add pseudo-time information
- #colData(sce)$Pseudotime <- cell_pseudotime
- ### add dimensionality reductions, e.g. PCA, UMAP, tSNE
- #pca <- prcomp(t(assays(sce)$expdata), scale. = FALSE)
- #rd_PCA <- pca$x[,1:2]
- #reducedDims(sce) <- SimpleList(PCA = rd_PCA)
-
Alternatively, GeneSwitches provides functions to convert Monocle2 or Slingshot results into SingleCellExperiment object directly. For Monocle2 trajectory, users need to indicate the states of the desired path, which can be checked by plotting the trajectory using Monocle2 function plot_cell_trajectory
or the following function.
- ## plot Monocle2 trajectory colored by State
- # monocle:::plot_cell_trajectory(cardiac_monocle2, color_by = "State")
- plot_monocle_State(cardiac_monocle2)
-
Based on the marker genes, the pseudo-time trajectory starts from State 3, which are hESC cells. Definitive CM cells are in State 1 and non-contractile cardiac derivatives are in State 5. Therefore, we focus on Path1 with cells in states 3, 2, 1 and Path2 with cells in states 3, 2, 5, and extract these two paths from Monocle2 object.
- ## Input log-normalized gene expression, Monocle2 pseudo-time and dimensionality reduction
- ## Path1 containing cells in states 3,2,1
- sce_p1 <- convert_monocle2(monocle2_obj = cardiac_monocle2,
- states = c(3,2,1), expdata = logexpdata)
- ## Path2 containing cells in states 3,2,5
- sce_p2 <- convert_monocle2(monocle2_obj = cardiac_monocle2,
- states = c(3,2,5), expdata = logexpdata)
-
If we are only interested in the trajectory within a certain range of pseudotime, function subset_pseudotime
can be used to subset the SingleCellExperiment object accordingly, followed by filtering out lowly expressed genes.
- ### Subset cells to pseudotime range from 10 to 25
- #sce_p1_subset <- subset_pseudotime(sce_p1, min_time = 10, max_time = 25, minexp = 0, mincells = 10)
-
In Part I, we will apply GeneSwitches on a single trajectory, Path1, to demonstrate the general workflow and functions. Comparison of GeneSwitches results from two trajectories (Path1 & 2) will be shown in Part II.
由于我们关注的是打开或关闭的基因,因此我们首先将基因表达数据二值化为 1(on) 或 0(off) 状态。 为了实现这一点,对于每个基因,我们将两个高斯分布的混合模型拟合到输入基因表达中,以计算用于二值化的基因特异性阈值。 在拟合之前,我们在基因表达中添加了均值为零和标准差为 0.1 的高斯噪声,这确保了基因表达拟合的数值稳定性。 然后去除不具有明显双峰“开-关”分布的基因。 对于使用 3 个内核的 2000 个细胞,此步骤可能需要 2 分钟。
- ### binarize gene expression using gene-specific thresholds
- sce_p1 <- binarize_exp(sce_p1, ncores = 3)
-
Alternatively, we can use a global threshold for fast binarization. We plot a histogram of expression of all the genes in all cells and look for a break between the zero and expressed distributions to identify the global threshold.
- ### check the threshold for binarization
- #h <- hist(assays(sce_p1)$expdata, breaks = 200, plot = FALSE)
- #{plot(h, freq = FALSE, xlim = c(0,2), ylim = c(0,1), main = "Histogram of gene expression",
- #xlab = "Gene expression", col = "darkgoldenrod2", border = "grey")
- #abline(v=0.2, col="blue")}
-
- ###In this example, we choose 0.2 (blue line, also set as default) as the threshold.
- # bn_cutoff <- 0.2
- # sce_p1 <- binarize_exp(sce_p1, fix_cutoff = TRUE, binarize_cutoff = bn_cutoff)
-
Logistic regression is applied to model the binary states (on or off) of gene expression. Then the switching pseudo-time point is determined by the time at which the fitted line crosses the probability threshold 0.5. We use random downsampling of zero expressions (downsample = TRUE) to rescue the prediction of switching time for genes with high zero inflation.
- ## fit logistic regression and find the switching pseudo-time point for each gene
- ## with downsampling. This step takes less than 1 mins
- sce_p1 <- find_switch_logistic_fastglm(sce_p1, downsample = TRUE, show_warning = FALSE)
-
First, we filter poorly fitted genes based on zero-expression percentage (>90%), FDR (>0.05) and McFadden’s Pseudo R^2 (<0.03). We can then the number of top best fitting (high McFadden’s Pseudo R^2) genes to plot. One can also extract specific gene type(s) to plot, with provided gene type lists containing surface proteins (downloaded from here) and transcription factors (TFs, downloaded from here). Users are allowed to pass their own gene type lists as a data frame to parameter genelists, with rows as genes (non-duplicated) and two columns with name genenames
and genetypes
.
- ## filter top 15 best fitting switching genes among all the genes
- sg_allgenes <- filter_switchgenes(sce_p1, allgenes = TRUE, topnum = 15)
- ## filter top 15 best fitting switching genes among surface proteins and TFs only
- sg_gtypes <- filter_switchgenes(sce_p1, allgenes = FALSE, topnum = 20,
- genelists = gs_genelists, genetype = c("Surface proteins", "TFs"))
- ## combine switching genes and remove duplicated genes from sg_allgenes
- sg_vis <- rbind(sg_gtypes, sg_allgenes[setdiff(rownames(sg_allgenes), rownames(sg_gtypes)),])
-
Finally, plot the selected genes along the pseudo-timeline. Genes that are switched on are plotted above the line, while those switching off are below the line.
- plot_timeline_ggplot(sg_vis, timedata = sce_p1$Pseudotime, txtsize = 3)
-
It is possible to use the dimensionality reduction provided from the user to visualise the gene expression and logistic regression fitting plots if needed.
- plot_gene_exp(sce_p1, gene = "VIM", reduction = "monocleRD", downsample = F)
-
GeneSwitches can be used to order pathways or genesets as well. We include the pathways provided by MSigDB hallmark (Liberzon,A. et al., 2015), C2 curated and C5 gene ontology geneset collections. A Hypergeometric test is first applied to extract the pathways that are significantly overrepresented amongst those that are changing along the trajectory. The Switching time of the pathway is then determined by the median switching time of genes in that pathway.
- ## filter genes for pathway analysis using r^2 cutoff 0.1
- sg_pw <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.1)
- ## apply hypergeometric test and determine the switching time
- switch_pw <- find_switch_pathway(rowData(sce_p1), sig_FDR = 0.05,
- pathways = msigdb_h_c2_c5, sg_pw)
- ## remove redundant pathways
- switch_pw_reduce <- reduce_pathways(switch_pw, pathways = msigdb_h_c2_c5,
- redundant_pw_rate = 0.8)
-
To better visualise the functional changes ridge plots of pathways genes show the density of switching genes along the pseudo-time. Top 10 significantly changed pathways are plotted here, ordered by the switching time.
- plot_pathway_density(switch_pw_reduce[1:10,], sg_pw, orderbytime = TRUE)
- #> Picking joint bandwidth of 2.49
-
We can also select specific pathway(s) to plot the switching genes in it. Among top 10 significantly changed pathways, we plot genes related to myogenesis and cardiac muscle tissue development.
- sg_vis <- filter_switchgenes(sce_p1, topnum = 50, pathway_name = c("HALLMARK_MYOGENESIS",
- "GO_CARDIAC_MUSCLE_TISSUE_DEVELOPMENT"))
- plot_timeline_ggplot(sg_vis, timedata=sce_p1$Pseudotime, txtsize=3)
-
“Multiple” lables the genes in more than one pathways.
Before comparison, we need to apply same steps in I-1 and I-2 on the cells from Path2 to identify switching pseudo-time point for each gene.
- sce_p2 <- binarize_exp(sce_p2)
- sce_p2 <- find_switch_logistic_fastglm(sce_p2, downsample = TRUE, show_warnings = FALSE)
-
And we filter out poorly fitted genes for both paths using the same cutoff.
- sg_p1 <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.03)
- sg_p2 <- filter_switchgenes(sce_p2, allgenes = TRUE, r2cutoff = 0.03)
-
We then plot common switching genes between two paths to compare their ordering.
- sg_com <- common_genes(sg_p1, sg_p2, r2cutoff = 0.4,
- path1name = "Definitive CM", path2name = "non-contractile")
- common_genes_plot(sg_com, timedata = sce_p1$Pseudotime)
-
More importantly, we can plot the distinct switching genes of the two paths.
- sg_disgs <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
- path1name = "Definitive CM", path2name = "non-contractile",
- path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime)
- plot_timeline_ggplot(sg_disgs, timedata = sce_p1$Pseudotime, color_by = "Paths",
- iffulltml = FALSE, txtsize = 3)
-
We can also scale the timelines to be the same length (default number of bins is 100) so that differences are based on percentage of the trajectory covered rather than pseudo-time.
- sg_disgs_scale <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
- path1name = "Definitive CM", path2name = "non-contractile",
- path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime,
- scale_timeline = T, bin = 100)
- # timedata need to be 1 to (number of bins)
- plot_timeline_ggplot(sg_disgs_scale, timedata = 1:100, color_by = "Paths",
- iffulltml = FALSE, txtsize = 3)
-
这两个不同转换基因的图只显示了一系列发生转换事件的伪时间线。 这个范围实际上是在轨迹的末端,而共同基因大多处于早期(共同基因图)。
Similarly, we can check the gene expression plots for the two paths.
- gn <- "DCN"
- p <- plot_gene_exp(sce_p1, gene = gn, reduction = "monocleRD",
- downsample = FALSE, fitting = TRUE)
- #> Warning: glm.fit: algorithm did not converge
-
- p <- plot_gene_exp(sce_p2, gene = gn, reduction = "monocleRD",
- downsample = FALSE, fitting = TRUE)
-
生活很好,等你超越
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。