Format of the GEO supplementary-file path:
ftp.ncbi.nlm.nih.gov/geo/series/GSE171nnn/GSE171524/suppl/
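The path above appears to follow a simple rule: the series directory is the accession with its last three digits replaced by "nnn" (GSE171524 sits under GSE171nnn). Here is a minimal bash sketch of that rule. The function name `geo_suppl_url` is made up for illustration, the `https://` prefix is my own choice, and the handling of accessions with three or fewer digits (a plain `GSEnnn` directory) is an assumption that should be checked against the server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the supplementary-file directory URL for a GEO series.
geo_suppl_url() {
  local acc="$1"             # e.g. GSE171524
  local digits="${acc#GSE}"  # numeric part, e.g. 171524
  local stem
  if [ "${#digits}" -gt 3 ]; then
    # Drop the last three digits and append "nnn": 171524 -> GSE171nnn
    stem="GSE${digits%???}nnn"
  else
    # Assumption: short accessions live directly under GSEnnn
    stem="GSEnnn"
  fi
  printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/\n' "$stem" "$acc"
}
```

Calling `geo_suppl_url GSE171524` prints the path shown above with `https://` in front.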
wget: unable to resolve host address: how to fix the "unable to resolve host address" error
Fixing "wget: unable to resolve host address" on Ubuntu (CodingForBug's blog, CSDN)
Tech | How to upload or download files and folders with sFTP on Linux
Downloading files from an FTP server with wget on Linux (Tencent Cloud Developer Community)
NCBI/NLM/NIH :: Public FTP: https://www.ncbi.nlm.nih.gov/projects/faspftp/
How to download NCBI FTP data (毒鸡蛋's blog, CSDN)
https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/
We recommend using the rsync file transfer program from a Unix command line to download large data files because it is much more efficient than older protocols. The next best options for downloading multiple files are to use the HTTPS protocol, or the even older FTP protocol, using a command line tool such as wget or curl. Web browsers are very convenient options for downloading single files even though they will use the FTP protocol because of how our URLs are constructed. Other FTP clients are also widely available but do not all correctly handle the symbolic links used widely on the genomes FTP site (see below).
To use rsync, replace the "ftp:" at the beginning of the FTP path with "rsync:". For example, if the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following rsync command:
rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1 my_dir/
A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using the following rsync command:
rsync --copy-links --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz my_dir/
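The "ftp:" to "rsync:" replacement described above is easy to script when many paths are involved. The following is only a sketch; `ftp_to_rsync` is a hypothetical helper name, not something NCBI provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite an ftp:// URL as the matching rsync:// URL.
ftp_to_rsync() {
  local url="$1"
  case "$url" in
    ftp://*)
      # Strip the ftp:// scheme and put rsync:// in its place
      printf 'rsync://%s\n' "${url#ftp://}"
      ;;
    *)
      echo "not an ftp:// URL: $url" >&2
      return 1
      ;;
  esac
}
```

The result can then be passed to rsync, for example `rsync --copy-links --recursive --times --verbose "$(ftp_to_rsync "$ftp_path")" my_dir/`, where `ftp_path` is a variable you set yourself.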
To use HTTPS, replace the "ftp:" at the beginning of the FTP path with "https:". Also append a '/' to the path if it is a directory. For example, if the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:
wget --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/ -P my_dir/
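The `--cut-dirs=6` in the command above matches the six leading directories of the path (genomes, all, GCF, 001, 696, 305), so only the final directory is recreated under my_dir/. For other paths the number can be derived rather than counted by hand. A small sketch, assuming you want to keep just the last directory as in that command; `cut_dirs_for` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of leading directories to cut so that only the
# last path component of a directory URL is kept.
cut_dirs_for() {
  local url="$1"
  local path="${url#*://}"  # drop the scheme (https://, ftp://, ...)
  path="${path#*/}"         # drop the host name
  path="${path%/}"          # drop a trailing slash, if any
  # The slashes that remain separate the components, so
  # (number of slashes) = (number of components) - 1
  echo $(( $(printf '%s' "$path" | tr -cd '/' | wc -c) ))
}
```

The output can be plugged into the command as `--cut-dirs="$(cut_dirs_for "$url")"`.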
A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using either of the following commands:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz -P my_dir/
curl --remote-name --remote-time https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz
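The two HTTPS rules above (swap the scheme, and add a trailing '/' for directories) can be wrapped in one small function. This is a sketch under my own naming: `ftp_to_https` and its optional second argument `dir` are hypothetical, and the caller has to say whether the path is a directory because the URL alone does not tell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite an ftp:// URL as https://.
# Pass "dir" as the second argument when the path is a directory.
ftp_to_https() {
  local url="$1" kind="${2:-file}"
  case "$url" in
    ftp://*) url="https://${url#ftp://}" ;;
    *) echo "not an ftp:// URL: $url" >&2; return 1 ;;
  esac
  if [ "$kind" = "dir" ]; then
    # Make sure there is exactly one trailing slash
    url="${url%/}/"
  fi
  printf '%s\n' "$url"
}
```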
To use FTP directly, append a '/' to the path if it is a directory. For example, if the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:
wget --recursive --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/ -P my_dir/
A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using either of the following commands:
wget --timestamping ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz -P my_dir/
curl --remote-name --remote-time ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz
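We have more or less finished writing our single-cell transcriptome tutorials, so it is time to move on to single-cell ATAC and spatial single-cell data. I found this classic single-cell ATAC dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129785 . The corresponding paper is: Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat Biotechnol 2019 Aug;37(8):925-936. PMID: 31375813
Looking at the samples, the dataset has quite a few files, and one of them alone is 67G, so processing them on a personal computer is hardly feasible and I went straight to a server. Here is what I learned about batch-downloading all the files of a GEO dataset with wget: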
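First, look at the file list (pay attention to the URL of the page).
The page URLs follow a pattern, and so do the file names. The file names are listed below: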
- GSE129785_RAW.tar 2019-04-18 13:03 67G
- GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz 2019-04-14 19:30 2.9M
- GSE129785_scATAC-Hematopoiesis-All.mtx.gz 2019-04-14 19:20 6.0G
- GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz 2019-04-14 19:30 4.3M
- GSE129785_scATAC-Hematopoiesis-CD34.cell_barcodes.txt.gz 2019-04-14 19:30 849K
- GSE129785_scATAC-Hematopoiesis-CD34.mtx.gz 2019-04-14 19:21 2.1G
- GSE129785_scATAC-Hematopoiesis-CD34.peaks.txt.gz 2019-04-14 19:30 4.3M
- GSE129785_scATAC-PBMCs-Fresh.cell_barcodes.txt.gz 2019-04-14 19:30 192K
- GSE129785_scATAC-PBMCs-Fresh.mtx.gz 2019-04-14 19:21 415M
- GSE129785_scATAC-PBMCs-Fresh.peaks.txt.gz 2019-04-14 19:30 1.8M
- GSE129785_scATAC-PBMCs-Frozen.cell_barcodes.txt.gz 2019-04-14 19:30 228K
- GSE129785_scATAC-PBMCs-Frozen.mtx.gz 2019-04-14 19:21 209M
- GSE129785_scATAC-PBMCs-Frozen.peaks.txt.gz 2019-04-14 19:30 1.2M
- GSE129785_scATAC-PBMCs-FrozenSort.cell_barcodes.txt.gz 2019-04-14 19:30 207K
- GSE129785_scATAC-PBMCs-FrozenSort.mtx.gz 2019-04-14 19:21 319M
- GSE129785_scATAC-PBMCs-FrozenSort.peaks.txt.gz 2019-04-14 19:30 1.3M
- GSE129785_scATAC-TME-All.cell_barcodes.txt.gz 2019-04-14 19:30 1.7M
- GSE129785_scATAC-TME-All.mtx.gz 2019-04-14 19:21 4.5G
- GSE129785_scATAC-TME-All.peaks.txt.gz 2019-04-14 19:30 4.4M
- GSE129785_scATAC-TME-TCells.cell_barcodes.txt.gz 2019-04-14 19:30 1.3M
- GSE129785_scATAC-TME-TCells.mtx.gz 2019-04-14 19:21 2.8G
- GSE129785_scATAC-TME-TCells.peaks.txt.gz 2019-04-14 19:30 4.4M
Simply save the file names above into a text file, list.txt, and the following commands will download them all in one go:
- mkdir -p ~/scRNA/atac
- cd ~/scRNA/atac
- awk '{print $1}' list.txt | while read -r id; do (nohup wget "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/$id" &); done
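The loop above starts one background wget per file, so all 22 downloads run at once. If you would rather fetch the files one after another and be able to resume after an interruption, one alternative is to turn list.txt into a list of URLs and hand it to a single wget. This is a sketch of a different approach, not the method used above: `make_url_list`, `urls.txt`, and `wget.log` are names I made up, while `--continue` and `--input-file` are standard wget options.

```shell
#!/usr/bin/env bash
base_url="https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl"

# Hypothetical helper: print one download URL per non-empty line of the list
# file, assuming the file name is the first column (as with awk '{print $1}').
make_url_list() {
  awk -v base="$2" 'NF { print base "/" $1 }' "$1"
}

# Usage (left commented out so that sourcing this sketch does not start a download):
# make_url_list list.txt "$base_url" > urls.txt
# nohup wget --continue --input-file=urls.txt > wget.log 2>&1 &
```

Sequential downloading is slower to start but goes easier on the server, and `--continue` lets a rerun pick up a partially downloaded 67G file instead of starting over.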
With this very simple shell script, you get all the files, as shown below:
- 67G Apr 19 2019 GSE129785_RAW.tar
- 2.9M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz
- 6.0G Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.mtx.gz
- 4.3M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz
- 849K Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.cell_barcodes.txt.gz
- 2.2G Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.mtx.gz
- 4.3M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.peaks.txt.gz
- 193K Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.cell_barcodes.txt.gz
- 415M Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.mtx.gz
- 1.9M Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.peaks.txt.gz
- 229K Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.cell_barcodes.txt.gz
- 210M Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.mtx.gz
- 1.2M Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.peaks.txt.gz
- 208K Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.cell_barcodes.txt.gz
- 320M Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.mtx.gz
- 1.3M Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.peaks.txt.gz
- 1.8M Apr 15 2019 GSE129785_scATAC-TME-All.cell_barcodes.txt.gz
- 4.6G Apr 15 2019 GSE129785_scATAC-TME-All.mtx.gz
- 4.4M Apr 15 2019 GSE129785_scATAC-TME-All.peaks.txt.gz
- 1.3M Apr 15 2019 GSE129785_scATAC-TME-TCells.cell_barcodes.txt.gz
- 2.9G Apr 15 2019 GSE129785_scATAC-TME-TCells.mtx.gz
- 4.4M Apr 15 2019 GSE129785_scATAC-TME-TCells.peaks.txt.gz
Super simple, isn't it?
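With files this large, it may be worth checking that the compressed files arrived intact before starting any analysis. The sketch below is an optional extra step of my own, not part of the original workflow: `check_gz_files` is a hypothetical helper that runs `gzip -t` (gzip's built-in integrity test) on every .gz file in a directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report .gz files in a directory that fail gzip's
# integrity test. Returns 1 if any file fails, 0 otherwise.
check_gz_files() {
  local dir="$1" bad=0 f
  for f in "$dir"/*.gz; do
    [ -e "$f" ] || continue   # no .gz files: the glob stays unexpanded
    if ! gzip -t "$f" 2>/dev/null; then
      echo "corrupt or incomplete: $f"
      bad=1
    fi
  done
  return "$bad"
}
```

For the directory used above, that would be `check_gz_files ~/scRNA/atac`. Note that this does not cover GSE129785_RAW.tar, which is not gzip-compressed.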
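Of course, the downstream analysis is where the real suffering begins. Our background in conventional ATAC-seq means we can pick up a new technique quickly, but the frustration that comes with it will be no smaller!
Another classic single-cell ATAC dataset is:
Interested readers can do what I did: batch-download the files and then start the downstream analysis!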
Published 2021-09-30 23:21