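GEOquery, a Bioconductor package, provides a few commonly used functions for downloading GEO data. Sometimes, though, I want to download directly from the terminal instead of opening RStudio and running a script, so I wrote shell scripts that provide a simple wrapper around two GEOquery download functions, getGEO() and getGEOSuppFiles().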
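Installation
Use the clone command:
git clone https://github.com/ShixiangWang/mytoolkit/
Alternatively, click the "Clone or download" button at the top right of the repository page.
Prerequisites and help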
On Linux, R must be installed. If the GEOquery package is not installed, the scripts detect this and download and install it automatically.
View each script's help:
./getGEOSuppFiles.sh -h
./getGEO.sh -h
./bulkGEO.sh -h
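The automatic GEOquery check described above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function names `geoquery_bootstrap_r` and `ensure_geoquery` and the `RSCRIPT` variable are hypothetical, and the sketch installs through BiocManager (the current Bioconductor installer), whereas the session log later in this post shows the older biocLite route.

```shell
# Hypothetical helper: print R code that loads GEOquery,
# installing it first when it is missing.
geoquery_bootstrap_r() {
    cat <<'EOF'
if (!requireNamespace("GEOquery", quietly = TRUE)) {
    message("Package GEOquery not available. Attempting to install it.")
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager", repos = "https://cloud.r-project.org")
    }
    BiocManager::install("GEOquery", update = FALSE, ask = FALSE)
}
suppressPackageStartupMessages(library(GEOquery))
EOF
}

# Write the R code to a temporary file and run it.
# RSCRIPT is an assumed override hook; it defaults to Rscript on PATH.
ensure_geoquery() {
    local script status
    script=$(mktemp) || return 1
    geoquery_bootstrap_r > "$script"
    "${RSCRIPT:-Rscript}" "$script"
    status=$?
    rm -f "$script"
    return "$status"
}
```

Calling `ensure_geoquery` once at the top of a wrapper script would be enough; nothing runs until the function is called.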
Downloading GEO supplementary files
GEO supplementary files are usually the raw microarray data.
Usage:
Usage: ./getGEOSuppFiles.sh -n GEO -d directory
GEO: a GEO accession, e.g. GPL1073 or GSM1137
directory: the directory to download into; defaults to your current directory.
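To illustrate how such a wrapper can map the two options onto GEOquery, here is a hedged sketch. The function names `supp_r_call` and `download_supp` and the `RSCRIPT` variable are made up for this example; the real getGEOSuppFiles.sh may be organised differently. It relies only on `getGEOSuppFiles()` and its `baseDir` argument, and it does not escape quotes inside the arguments.

```shell
# Hypothetical sketch: turn -n/-d options into a getGEOSuppFiles() call.
# Prints the R expression instead of running it, so it is easy to inspect.
supp_r_call() {
    local OPTIND=1 opt geo="" dir="."
    while getopts "n:d:" opt; do
        case "$opt" in
            n) geo=$OPTARG ;;      # GEO accession
            d) dir=$OPTARG ;;      # download directory, default: current
            *) return 2 ;;
        esac
    done
    if [ -z "$geo" ]; then
        echo "Usage: supp_r_call -n GEO [-d directory]" >&2
        return 2
    fi
    # Note: quotes inside $geo or $dir are not escaped in this sketch.
    printf 'GEOquery::getGEOSuppFiles("%s", baseDir = "%s")\n' "$geo" "$dir"
}

# Build the call and hand it to Rscript (RSCRIPT is an assumed override).
download_supp() {
    local call
    call=$(supp_r_call "$@") || return
    "${RSCRIPT:-Rscript}" -e "$call"
}
```

For example, `download_supp -n GSM1137 -d ./raw` would ask GEOquery to place the supplementary files under `./raw`.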
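Downloading GEO expression matrix files
This is the most commonly used feature: it downloads the expression matrix file for an array dataset. The data have already been preprocessed by the submitting researchers and can be used directly.
Usage: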
Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL
Detail of Options
==================
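-n GEO: a character string naming the GEO object (e.g. 'GDS505', 'GSE2', 'GSM2', 'GPL96')
-d destdir: the destination directory for downloads; defaults to the current directory.
-M TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.
-A TRUE or FALSE: whether to use the Annotation GPL file (it will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they do not exist for every platform. Defaults to TRUE.
-P TRUE or FALSE: whether to download the GPL information along with the GSEMatrix file. If you know you will use Bioconductor annotation packages instead, you can set it to FALSE. Defaults to TRUE.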
Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
Like
./getGEO.sh -n GEO
change the 'GEO' above to name of GSE you want to download
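The option handling described above, including the TRUE defaults, could look roughly like this. It is a sketch under assumptions: `is_r_logical` and `getgeo_r_call` are hypothetical names, and the validation of TRUE/FALSE is an addition of this example rather than something the post says the script does. The R arguments used (`destdir`, `GSEMatrix`, `AnnotGPL`, `getGPL`) are those of GEOquery's `getGEO()`.

```shell
# Accept only the literal R logicals TRUE or FALSE.
is_r_logical() {
    case "$1" in
        TRUE|FALSE) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical sketch: turn -n/-d/-M/-A/-P into a getGEO() call and print it.
getgeo_r_call() {
    local OPTIND=1 opt flag geo="" destdir="."
    local matrix=TRUE annot=TRUE gpl=TRUE   # defaults as documented above
    while getopts "n:d:M:A:P:" opt; do
        case "$opt" in
            n) geo=$OPTARG ;;
            d) destdir=$OPTARG ;;
            M) matrix=$OPTARG ;;
            A) annot=$OPTARG ;;
            P) gpl=$OPTARG ;;
            *) return 2 ;;
        esac
    done
    [ -n "$geo" ] || { echo "missing -n GEO" >&2; return 2; }
    for flag in "$matrix" "$annot" "$gpl"; do
        is_r_logical "$flag" || {
            echo "expected TRUE or FALSE, got: $flag" >&2
            return 2
        }
    done
    printf 'GEOquery::getGEO("%s", destdir = "%s", GSEMatrix = %s, AnnotGPL = %s, getGPL = %s)\n' \
        "$geo" "$destdir" "$matrix" "$annot" "$gpl"
}
```

The printed expression can then be passed to `Rscript -e`. Rejecting anything other than TRUE or FALSE early avoids handing a malformed expression to R.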
Bulk download of expression matrix files and raw files
This feature builds on the previous two scripts by calling them in a loop.
Usage:
Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp
Detail of Options
==================
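-n GEO: a character string naming the GEO objects (e.g. 'GDS505', 'GSE2', 'GSM2', 'GPL96'); separate multiple names with spaces.
-d destdir: the destination directory for downloads; defaults to the current directory.
-M TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.
-A TRUE or FALSE: whether to use the Annotation GPL file (it will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they do not exist for every platform. Defaults to TRUE.
-f filename: a file listing the names of the GEO objects to download. Note: if you use -f, do not also set -n, otherwise it will be overridden.
-s supp: TRUE or FALSE: whether to also download the raw supplementary files. Defaults to FALSE.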
Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
Like
./bulkGEO.sh -n 'GEO1 GEO2 GEO3'
change the 'GEO' above to name of GSE you want to download
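The looping idea behind the bulk script can be sketched as below. The names `bulk_download` and `DOWNLOAD_CMD` are hypothetical, and the sketch assumes that a names file takes precedence over names passed directly; the post does not spell out which of -f and -n wins.

```shell
# Hypothetical sketch of the bulk loop.
#   $1: file with accession names (pass "" to skip)
#   $2: space-separated accession names
# DOWNLOAD_CMD is an assumed hook; it defaults to the single-download script.
bulk_download() {
    local file=$1 names=$2 geo failed=0
    if [ -n "$file" ]; then
        # Assumption: names read from the file replace those in $2.
        names=$(cat "$file") || return 1
    fi
    for geo in $names; do   # unquoted on purpose: split on whitespace
        if ${DOWNLOAD_CMD:-./getGEO.sh} -n "$geo"; then
            echo "The files of $geo download successfully!"
        else
            echo "Download of $geo failed" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

A failed accession does not stop the loop here; the function keeps going and returns a non-zero status at the end, which seems friendlier for long batches.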
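I wrote this code yesterday to save myself what felt like a tedious download process. I am not yet very proficient with Linux shell scripting, so the scripts may have problems. Basic downloads work without errors; I have tested them. If you run into problems or would like other features, feel free to ask and I will try to address them.
Thanks for reading~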
------------------------------------------------------------------------------------------------------------
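Today I happened to be downloading GEO data on a new machine with only a few basic R packages installed, so here is what it looks like in practice.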
[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -h
Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp
Detail of Options
==================
-n GEO: A character string representing GEO objects for download and parsing. (eg., 'GDS505','GSE2','GSM2','GPL96'), you can use space to seperate multiple objects. Or you can use the -f option to locate the file where you put names of GEO object.
-d destdir: The destination directory for any downloads. Defaults to the current directory. You may want to specify a different directory if you want to save the file for later use. Doing so is a good idea if you have a slow connection, as some of the GEO files are HUGE!
-M A boolean telling GEOquery whether or not to use GSE Series Matrix files from GEO. The parsing of these files can be many orders-of-magnitude faster than parsing the GSE SOFT format files. Defaults to TRUE, meaning that the SOFT format parsing will not occur; set to FALSE if you for some reason need other columns from the GSE records.
-A A boolean defaulting to TRUE as to whether or not to use the Annotation GPL information. These files are nice to use because they contain up-to-date information remapped from Entrez Gene on a regular basis. However, they do not exist for all GPLs; in general, they are only available for GPLs referenced by a GDS
-f filename: a character string specify the filename where GEO names stored.
-s supp: A boolean defaulting to FALSE as to whether or not to download supplementary files.
Minimal Use Method
==================
If you do not know how to use these options, just set -n option is OK
Like
./bulkGEO.sh -n 'GEO1 GEO2 GEO3'
change the 'GEO*' above to name of GSE you want to download
[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -d ~/workspace/GEO_data/igcc_cnv/ -f ~/workspace/GEO_data/igcc_cnv/geo_names.txt
Package GEOquery not available. Atempting to install it.
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).
Installing package(s) ‘GEOquery’
also installing the dependencies ‘BiocGenerics’, ‘Biobase’
trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/BiocGenerics_0.24.0.tar.gz'
Content type 'application/x-gzip' length 43393 bytes (42 KB)
==================================================
downloaded 42 KB
trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/Biobase_2.38.0.tar.gz'
Content type 'application/x-gzip' length 1656734 bytes (1.6 MB)
==================================================
downloaded 1.6 MB
trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/GEOquery_2.46.13.tar.gz'
Content type 'application/x-gzip' length 13745245 bytes (13.1 MB)
==================================================
downloaded 13.1 MB
* installing *source* package ‘BiocGenerics’ ...
** R
** inst
** preparing package for lazy loading
Creating a new generic function for ‘append’ in package ‘BiocGenerics’
Creating a new generic function for ‘as.data.frame’ in package ‘BiocGenerics’
Creating a new generic function for ‘cbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘rbind’ in package ‘BiocGenerics’
Creating a new generic function for ‘do.call’ in package ‘BiocGenerics’
Creating a new generic function for ‘duplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘anyDuplicated’ in package ‘BiocGenerics’
Creating a new generic function for ‘eval’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmax.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘pmin.int’ in package ‘BiocGenerics’
Creating a new generic function for ‘Reduce’ in package ‘BiocGenerics’
Creating a new generic function for ‘Filter’ in package ‘BiocGenerics’
Creating a new generic function for ‘Find’ in package ‘BiocGenerics’
Creating a new generic function for ‘Map’ in package ‘BiocGenerics’
Creating a new generic function for ‘Position’ in package ‘BiocGenerics’
Creating a new generic function for ‘get’ in package ‘BiocGenerics’
Creating a new generic function for ‘mget’ in package ‘BiocGenerics’
Creating a new generic function for ‘grep’ in package ‘BiocGenerics’
Creating a new generic function for ‘grepl’ in package ‘BiocGenerics’
Creating a new generic function for ‘is.unsorted’ in package ‘BiocGenerics’
Creating a new generic function for ‘lapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘sapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘lengths’ in package ‘BiocGenerics’
Creating a new generic function for ‘mapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘match’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘colSums’ in package ‘BiocGenerics’
Creating a new generic function for ‘rowMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘colMeans’ in package ‘BiocGenerics’
Creating a new generic function for ‘order’ in package ‘BiocGenerics’
Creating a new generic function for ‘paste’ in package ‘BiocGenerics’
Creating a new generic function for ‘rank’ in package ‘BiocGenerics’
Creating a new generic function for ‘rownames’ in package ‘BiocGenerics’
Creating a new generic function for ‘colnames’ in package ‘BiocGenerics’
Creating a new generic function for ‘union’ in package ‘BiocGenerics’
Creating a new generic function for ‘intersect’ in package ‘BiocGenerics’
Creating a new generic function for ‘setdiff’ in package ‘BiocGenerics’
Creating a new generic function for ‘sort’ in package ‘BiocGenerics’
Creating a new generic function for ‘table’ in package ‘BiocGenerics’
Creating a new generic function for ‘tapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘unique’ in package ‘BiocGenerics’
Creating a new generic function for ‘unsplit’ in package ‘BiocGenerics’
Creating a new generic function for ‘var’ in package ‘BiocGenerics’
Creating a new generic function for ‘sd’ in package ‘BiocGenerics’
Creating a new generic function for ‘which’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.max’ in package ‘BiocGenerics’
Creating a new generic function for ‘which.min’ in package ‘BiocGenerics’
Creating a new generic function for ‘IQR’ in package ‘BiocGenerics’
Creating a new generic function for ‘mad’ in package ‘BiocGenerics’
Creating a new generic function for ‘xtabs’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterCall’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterApplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterEvalQ’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterExport’ in package ‘BiocGenerics’
Creating a new generic function for ‘clusterMap’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parApply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parRapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parCapply’ in package ‘BiocGenerics’
Creating a new generic function for ‘parLapplyLB’ in package ‘BiocGenerics’
Creating a new generic function for ‘parSapplyLB’ in package ‘BiocGenerics’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (BiocGenerics)
* installing *source* package ‘Biobase’ ...
** libs
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c Rinit.c -o Rinit.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c anyMissing.c -o anyMissing.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c envir.c -o envir.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c matchpt.c -o matchpt.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c rowMedians.c -o rowMedians.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c sublist_extract.c -o sublist_extract.o
/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -shared -L/public/home/wangshx/anaconda3/lib/R/lib -Wl,-O2,--sort-common,--as-needed,-z,relro,-z,now -L/public/home/wangshx/anaconda3/lib -o Biobase.so Rinit.o anyMissing.o envir.o matchpt.o rowMedians.o sublist_extract.o -L/public/home/wangshx/anaconda3/lib/R/lib -lR
installing to /public/home/wangshx/anaconda3/lib/R/library/Biobase/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (Biobase)
* installing *source* package ‘GEOquery’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (GEOquery)
The downloaded source packages are in
‘/tmp/Rtmptc9bgw/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'hms', 'limma', 'Rcpp', 'tibble', 'xml2'
GEO: GSE76730
destdir: /public/home/wangshx/workspace/GEO_data/igcc_cnv/
GSEMatrix: TRUE
AnnotGPL: TRUE
getGPL: TRUE
Found 1 file(s)
GSE76730_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz'
Content type 'application/x-gzip' length 262447098 bytes (250.3 MB)
==================================================
downloaded 250.3 MB
Parsed with column specification:
cols(
.default = col_double(),
ID_REF = col_character()
)
See spec(...) for full column specifications.
Annotation GPL not available, so will use submitter GPL instead
File stored at:
/public/home/wangshx/workspace/GEO_data/igcc_cnv//GPL3718.soft
$GSE76730_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 261981 features, 190 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM2036728 GSM2036729 ... GSM2036917 (190 total)
varLabels: title geo_accession ... who performance status:ch1 (61
total)
varMetadata: labelDescription
featureData
featureNames: SNP_A-1780270 SNP_A-1780272 ... SNP_A-4241299 (261981
total)
fvarLabels: ID Affy SNP ID ... SPOT_ID (27 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL3718
Warning message:
In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL3nnn/GPL3718/annot/GPL3718.annot.gz': HTTP status was '404 Not Found'
The files of GSE76730 download successfully!
The files of GSE76730 download successfully!