Batch-downloading all files of a GEO dataset with wget

Downloading GEO data on Linux

URL format:

ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/
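The path pattern above can be derived from the accession itself. The sketch below is inferred from the two URLs used in this article (GSE171524 and GSE129785): the last three digits of the accession are replaced with "nnn" to get the parent directory. The helper name `geo_suppl_url` is made up for illustration, and the handling of accessions with fewer than four digits is an assumption.

```shell
# geo_suppl_url is a hypothetical helper, not an NCBI tool.
# It builds the supplementary-file directory URL for a GEO series accession.
geo_suppl_url() {
    local acc="$1"
    local stub
    # Replace the last one to three digits with "nnn": GSE171524 -> GSE171nnn.
    # (Assumption: accessions below 1000 map to "GSEnnn".)
    stub=$(printf '%s\n' "$acc" | sed -E 's/[0-9]{1,3}$/nnn/')
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/\n' "$stub" "$acc"
}

geo_suppl_url GSE171524
# prints https://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/
```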

Batch-downloading all files of a GEO dataset with wget - Zhihu

wget: unable to resolve host address ‘ftp.ncbi.nlm.nih.gov’

Ubuntu: fixing "wget: unable to resolve host address" (坎幽黑尔弥?'s blog, CSDN)

Fixing "wget: unable to resolve host address"

Fix for "wget: unable to resolve host address" on Ubuntu (CodingForBug's blog, CSDN)
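Before changing any configuration, it can help to confirm that the error really is a name-resolution problem on the machine. This is only a diagnostic sketch: `can_resolve` is a hypothetical helper name, and pointing at `/etc/resolv.conf` assumes the usual cause (a missing or unreachable nameserver), which the linked posts may or may not match.

```shell
# can_resolve is a hypothetical helper: it succeeds if the name resolves
# through the system resolver (getent consults /etc/hosts and DNS).
can_resolve() {
    getent hosts "$1" > /dev/null 2>&1
}

# check_ncbi_dns is also a made-up name; it only prints a hint.
check_ncbi_dns() {
    local host=ftp.ncbi.nlm.nih.gov
    if can_resolve "$host"; then
        echo "$host resolves; wget should be able to find it"
    else
        echo "$host does not resolve; check the nameserver lines in /etc/resolv.conf"
    fi
}

# Usage (needs network access):
# check_ncbi_dns
```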

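How to upload or download files and folders with sFTP on Linux

How to upload and download files with the ftp command on the command line

Downloading files from an FTP server with wget on Linux - Tencent Cloud Developer Community

Directory listing of all files on the GEO FTP server; anonymous FTP login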

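GEO data, such as the metadata and the processed expression matrices, is actually stored on an FTP server. For example, the directory "Index of /geo/series/GSE171nnn/GSE171524/suppl" at https://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/ holds the supplementary files of GSE171524. To download this data from a Linux server, you can use the Linux ftp client and log in to the FTP server anonymously.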
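As an alternative to an interactive ftp session, a directory can be listed non-interactively with curl, which logs in anonymously when an ftp:// URL carries no credentials. This is a minimal sketch: the helper names `ensure_trailing_slash` and `list_ftp_dir` are made up here, and it assumes the server still accepts plain FTP connections from your network.

```shell
# ensure_trailing_slash and list_ftp_dir are hypothetical helper names.
# A directory URL needs a trailing slash so curl lists it rather than
# trying to fetch it as a file.
ensure_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# Print the file names in an FTP directory (--list-only: names only).
list_ftp_dir() {
    curl --silent --show-error --list-only "$(ensure_trailing_slash "$1")"
}

# Usage (needs network access):
# list_ftp_dir ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl
```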

NCBI/NLM/NIH :: Public FTP: https://www.ncbi.nlm.nih.gov/projects/faspftp/

How to download data from the NCBI FTP site (毒鸡蛋's blog, CSDN)

NCBI FTP FAQ: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

  1. What is the best protocol to use to download large data sets?

    We recommend using the rsync file transfer program from a Unix command line to download large data files because it is much more efficient than older protocols. The next best options for downloading multiple files are to use the HTTPS protocol, or the even older FTP protocol, using a command line tool such as wget or curl. Web browsers are very convenient options for downloading single files even though they will use the FTP protocol because of how our URLs are constructed. Other FTP clients are also widely available but do not all correctly handle the symbolic links used widely on the genomes FTP site (see below).

    To use rsync

    Replace the "ftp:" at the beginning of the FTP path with "rsync:". E.g. If the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following rsync command:

    rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1 my_dir/

    A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using the following rsync command:

    rsync --copy-links --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz my_dir/

    To use HTTPS

    Replace the "ftp:" at the beginning of the FTP path with "https:". Also append a '/' to the path if it is a directory. E.g. If the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:

    wget --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/ -P my_dir/

    A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using either of the following commands:

    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz -P my_dir/

    curl --remote-name --remote-time https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz

    To use FTP

    Append a '/' to the path if it is a directory. E.g. If the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:

    wget --recursive --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/ -P my_dir/

    A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using either of the following commands:

    wget --timestamping ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz -P my_dir/

    curl --remote-name --remote-time ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz
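The FAQ's recursive HTTPS command can be adapted to a GEO supplementary directory. The sketch below reuses the flags quoted above and adds `--no-parent` so wget does not climb out of the directory. The helper names `count_path_dirs` and `mirror_dir` are made up for illustration. Note that the FAQ passes `--cut-dirs=6` for a seven-level path so that the last directory name is kept; here the full depth is cut so the files land directly in the target directory.

```shell
# count_path_dirs and mirror_dir are hypothetical helper names.
# Count the directory components in a URL path, for wget --cut-dirs.
count_path_dirs() {
    local path="${1#*://}"   # drop the scheme
    path="${path#*/}"        # drop the host
    path="${path%/}"         # drop a trailing slash
    printf '%s\n' "$path" | awk -F/ '{print NF}'
}

# Recursively fetch one directory over HTTPS into dest.
mirror_dir() {
    local url="$1" dest="$2"
    wget --recursive --no-parent -e robots=off --reject "index.html" \
         --no-host-directories --cut-dirs="$(count_path_dirs "$url")" \
         "$url" -P "$dest"
}

# Usage (needs network access):
# mirror_dir https://ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/ my_dir/
```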

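Our single-cell transcriptome tutorials are nearly complete, so it is time to move on to single-cell ATAC and spatial single-cell data. We found this classic scATAC dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129785 . The corresponding paper is: Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat Biotechnol 2019 Aug;37(8):925-936. PMID: 31375813

The GEO page lists the samples in the series.

It has quite a few files, and a single one of them is 67G, so processing it on a personal computer is hardly feasible; go straight to a server instead. Here is how to batch-download all files of a GEO dataset with wget.

First, look at the file list at https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/ (note the URL of the page).

The page URL follows a pattern, and so do the file names, listed here with their modification time and size: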

    GSE129785_RAW.tar 2019-04-18 13:03 67G
    GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz 2019-04-14 19:30 2.9M
    GSE129785_scATAC-Hematopoiesis-All.mtx.gz 2019-04-14 19:20 6.0G
    GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz 2019-04-14 19:30 4.3M
    GSE129785_scATAC-Hematopoiesis-CD34.cell_barcodes.txt.gz 2019-04-14 19:30 849K
    GSE129785_scATAC-Hematopoiesis-CD34.mtx.gz 2019-04-14 19:21 2.1G
    GSE129785_scATAC-Hematopoiesis-CD34.peaks.txt.gz 2019-04-14 19:30 4.3M
    GSE129785_scATAC-PBMCs-Fresh.cell_barcodes.txt.gz 2019-04-14 19:30 192K
    GSE129785_scATAC-PBMCs-Fresh.mtx.gz 2019-04-14 19:21 415M
    GSE129785_scATAC-PBMCs-Fresh.peaks.txt.gz 2019-04-14 19:30 1.8M
    GSE129785_scATAC-PBMCs-Frozen.cell_barcodes.txt.gz 2019-04-14 19:30 228K
    GSE129785_scATAC-PBMCs-Frozen.mtx.gz 2019-04-14 19:21 209M
    GSE129785_scATAC-PBMCs-Frozen.peaks.txt.gz 2019-04-14 19:30 1.2M
    GSE129785_scATAC-PBMCs-FrozenSort.cell_barcodes.txt.gz 2019-04-14 19:30 207K
    GSE129785_scATAC-PBMCs-FrozenSort.mtx.gz 2019-04-14 19:21 319M
    GSE129785_scATAC-PBMCs-FrozenSort.peaks.txt.gz 2019-04-14 19:30 1.3M
    GSE129785_scATAC-TME-All.cell_barcodes.txt.gz 2019-04-14 19:30 1.7M
    GSE129785_scATAC-TME-All.mtx.gz 2019-04-14 19:21 4.5G
    GSE129785_scATAC-TME-All.peaks.txt.gz 2019-04-14 19:30 4.4M
    GSE129785_scATAC-TME-TCells.cell_barcodes.txt.gz 2019-04-14 19:30 1.3M
    GSE129785_scATAC-TME-TCells.mtx.gz 2019-04-14 19:21 2.8G
    GSE129785_scATAC-TME-TCells.peaks.txt.gz 2019-04-14 19:30 4.4M
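Instead of copying the names by hand, the list can be pulled out of the index page. This is a rough sketch that assumes the HTTPS index is a plain HTML listing whose links look like `href="GSE129785_..."`; the helper name `extract_suppl_names` is made up for illustration.

```shell
# extract_suppl_names is a hypothetical helper: it reads the HTML of a
# directory index on stdin and prints the linked file names that start
# with "<accession>_", one per line.
extract_suppl_names() {
    local acc="$1"
    grep -o "href=\"${acc}_[^\"]*\"" | sed -e 's/^href="//' -e 's/"$//'
}

# Usage (needs network access):
# curl -s https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/ \
#   | extract_suppl_names GSE129785 > list.txt
```

A list.txt produced this way has one file name per line, so the first column is still the file name.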

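Save the file names above into a text file, list.txt, and the following commands will download them all:

    mkdir -p ~/scRNA/atac
    cd ~/scRNA/atac
    # start one background wget per file name (first column of list.txt)
    awk '{print $1}' list.txt | while read -r id; do (nohup wget "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/$id" &); done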
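The loop above starts every download at once. If the connection is unstable, or you prefer one download at a time, a variant is to write the full URLs to a file and hand that file to a single wget that can resume partial files. This is only a sketch; `make_url_list`, urls.txt and wget.log are names made up for illustration.

```shell
# make_url_list is a hypothetical helper: it turns list.txt (file name in
# the first column) into one full URL per line, skipping blank lines.
make_url_list() {
    local base="$1" list="$2"
    awk -v base="$base" 'NF {print base $1}' "$list"
}

# Usage (needs network access):
# base=https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/
# make_url_list "$base" list.txt > urls.txt
# nohup wget --continue --input-file=urls.txt > wget.log 2>&1 &
```

With `--continue`, rerunning the same command after an interruption picks up partially downloaded files rather than starting them over.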

With this very simple shell script, all the files are downloaded, as shown below:

    67G Apr 19 2019 GSE129785_RAW.tar
    2.9M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz
    6.0G Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.mtx.gz
    4.3M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz
    849K Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.cell_barcodes.txt.gz
    2.2G Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.mtx.gz
    4.3M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.peaks.txt.gz
    193K Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.cell_barcodes.txt.gz
    415M Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.mtx.gz
    1.9M Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.peaks.txt.gz
    229K Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.cell_barcodes.txt.gz
    210M Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.mtx.gz
    1.2M Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.peaks.txt.gz
    208K Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.cell_barcodes.txt.gz
    320M Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.mtx.gz
    1.3M Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.peaks.txt.gz
    1.8M Apr 15 2019 GSE129785_scATAC-TME-All.cell_barcodes.txt.gz
    4.6G Apr 15 2019 GSE129785_scATAC-TME-All.mtx.gz
    4.4M Apr 15 2019 GSE129785_scATAC-TME-All.peaks.txt.gz
    1.3M Apr 15 2019 GSE129785_scATAC-TME-TCells.cell_barcodes.txt.gz
    2.9G Apr 15 2019 GSE129785_scATAC-TME-TCells.mtx.gz
    4.4M Apr 15 2019 GSE129785_scATAC-TME-TCells.peaks.txt.gz
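With this many background downloads, it is worth checking that nothing was skipped. A minimal sketch, assuming list.txt holds the file names in its first column; `missing_files` is a hypothetical helper name.

```shell
# missing_files is a hypothetical helper: print every name from list.txt
# (first column) that is absent or empty in the download directory.
missing_files() {
    local list="$1" dir="$2"
    awk 'NF {print $1}' "$list" | while read -r name; do
        [ -s "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Usage:
# missing_files list.txt ~/scRNA/atac
# Optional integrity check of the compressed files:
# for f in ~/scRNA/atac/*.gz; do gzip -t "$f" || echo "corrupt: $f"; done
```

No output from `missing_files` means every listed file exists and is non-empty; it does not prove a file is complete, which is what the `gzip -t` loop is for.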

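Super simple, isn't it?

Of course, the downstream analysis is where the real pain begins. Having a grounding in bulk ATAC-seq makes a new technique quicker to learn, but the frustration will not be any less.

Another classic scATAC dataset is:

  • GSE96772
  • Integrated Single-Cell Analysis Maps the Continuous Regulatory Landscape of Human Hematopoietic Differentiation. Cell 2018 May 31;173(6):1535-1548.e16. PMID: 29706549

If you are interested, you can batch-download its files the same way and start on the downstream analysis.

Published 2021-09-30 23:21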
