Downloading and Configuring the GTDB Data for SemiBin, a Semi-Supervised Metagenomic Binning Tool

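Lately I have wanted to learn how to use metagenomic binning tools (honestly, binning feels quite complicated, not something a beginner like me should be dabbling in). I had planned to study the details of the well-established metaWRAP, but then WeChat pushed me an article about a new binning tool, SemiBin, which is built on the currently popular combination of semi-supervised learning and neural networks. It looked fresh and I wanted to try it, so I went to SemiBin's GitHub page to learn the basic usage: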

GitHub - BigDataBiology/SemiBin

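Installation is straightforward if you follow the install instructions, although a few large files and packages have to be downloaded, so be patient. Note also that the assembler megahit, the aligner bowtie2, and samtools must be installed separately into the same environment as SemiBin. The installation steps, copied from GitHub: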

    # Download and install SemiBin
    conda create -n SemiBin
    conda activate SemiBin
    conda install -c conda-forge -c bioconda semibin
    # Install the auxiliary tools used alongside SemiBin
    conda install -c bioconda bowtie2
    conda install -c bioconda samtools
    conda install -c bioconda megahit
    # Check that SemiBin is installed correctly
    SemiBin check_install
    # Output like the following means the installation succeeded:
    Looking for dependencies...
    bedtools : /home/dell/miniconda3/envs/SemiBin/bin/bedtools
    hmmsearch : /home/dell/miniconda3/envs/SemiBin/bin/hmmsearch
    mmseqs : /home/dell/miniconda3/envs/SemiBin/bin/mmseqs
    FragGeneScan : /home/dell/miniconda3/envs/SemiBin/bin/FragGeneScan
    prodigal : /home/dell/miniconda3/envs/SemiBin/bin/prodigal
    Installation OK
    If you find SemiBin useful, please cite:
    Pan, S., Zhu, C., Zhao, XM. et al. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 13, 2326 (2022). https://doi.org/10.1038/s41467-022-29843-y.
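
The `SemiBin check_install` output above lists bedtools, hmmsearch, mmseqs, FragGeneScan, and prodigal, but not the three tools installed separately. A quick check that megahit, bowtie2, and samtools are visible in the active environment can save a failed run later. This is a minimal sketch assuming a POSIX shell; `missing_tools` is a hypothetical helper name, not part of SemiBin:

```shell
# Hypothetical helper: print each named command that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the three tools this post installs next to SemiBin.
missing=$(missing_tools megahit bowtie2 samtools)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All auxiliary tools found."
fi
```

Run it after `conda activate SemiBin`, since the check only sees the environment that is currently active.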

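With installation done, it was time for a real run. Judging from the documentation, the software is simple to use, although the documentation presumably describes the ideal case. I tried the simplest scenario, single-sample easy binning:

The example files are available on GitHub: GitHub - BigDataBiology/SemiBin_tutorial_from_scratch

Open the single_sample_binning folder; it contains the paired-end sequencing files for one sample.

The steps on the Linux side are: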

    # 1. Download the example dataset (I downloaded it manually and uploaded it to the server)
    # 2. cd into the single_sample_binning folder
    cd single_sample_binning
    # 3. Assembly
    megahit -1 sample1_R1.fastq.gz \
        -2 sample1_R2.fastq.gz \
        --out-dir assembly_contig \
        --out-prefix R1
    # 4. Mapping and pre-processing for binning:
    bowtie2-build \
        -f assembly_contig/R1.contigs.fa assembly_contig/R1.contigs.fa
    bowtie2 -q --fr \
        -x assembly_contig/R1.contigs.fa \
        -1 sample1_R1.fastq.gz \
        -2 sample1_R2.fastq.gz \
        -S sample1.sam \
        -p 64
    samtools view -h -b -S sample1.sam -o sample1.bam -@ 64
    samtools view -b -F 4 sample1.bam -o sample1.mapped.bam -@ 64
    samtools sort \
        -m 1000000000 sample1.mapped.bam \
        -o sample1.mapped.sorted.bam -@ 64
    # 5. Easy single binning mode, the simplest way to bin
    SemiBin single_easy_bin \
        -i assembly_contig/R1.contigs.fa \
        -b sample1.mapped.sorted.bam \
        -o easy_single_sample_output
    # ========================================================
    # This is where things went wrong:
    (SemiBin) dell 20:18:05 /home/DB/fqz/mydata/SemiBin_testdata/single_sample_binning
    $ SemiBin single_easy_bin \
    > -i assembly_contig/R1.contigs.fa \
    > -b sample1.mapped.sorted.bam \
    > -o easy_single_sample_output
    2022-08-25 20:19:36,299 - Setting number of CPUs to 112
    2022-08-25 20:19:36,300 - Do not detect GPU. Running with CPU.
    2022-08-25 20:19:36,315 - Generate training data.
    2022-08-25 20:19:37,052 - Calculating coverage for every sample.
    2022-08-25 20:19:37,057 - Processing `sample1.mapped.sorted.bam`
    2022-08-25 20:19:37,805 - Processed:sample1.mapped.sorted.bam
    2022-08-25 20:19:37,873 - Start generating kmer features from fasta file.
    2022-08-25 20:19:39,087 - Running mmseqs and generate cannot-link file.
    2022-08-25 20:19:39,149 - Downloading GTDB to /home/dell/.cache/SemiBin/mmseqs2-GTDB. It will take a while..
    # The first run downloads the GTDB database that SemiBin depends on. It is hosted overseas, so the download was painfully slow and failed repeatedly
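
Step 4 above writes an intermediate SAM file and two intermediate BAM files. As an alternative sketch (not the approach used in this post), the same bowtie2 and samtools options can be chained with pipes so that only the final sorted BAM of mapped reads is written. The function name `map_and_sort` and the default of 4 threads are assumptions for illustration, and the bowtie2 index is assumed to have been built already with `bowtie2-build`:

```shell
# Hypothetical helper: map paired-end reads to an indexed assembly and write
# a sorted BAM containing mapped reads only, with no intermediate files.
# Usage: map_and_sort INDEX READS_1 READS_2 OUT_BAM [THREADS]
map_and_sort() {
    index=$1 r1=$2 r2=$3 out=$4 threads=${5:-4}
    # bowtie2 writes SAM to stdout when -S is not given;
    # samtools view -F 4 drops unmapped reads and emits BAM;
    # samtools sort reads that BAM from stdin ("-") and writes the result.
    bowtie2 -q --fr -x "$index" -1 "$r1" -2 "$r2" -p "$threads" \
        | samtools view -b -F 4 -@ "$threads" - \
        | samtools sort -@ "$threads" -o "$out" -
}
```

For the example data this would be called as `map_and_sort assembly_contig/R1.contigs.fa sample1_R1.fastq.gz sample1_R2.fastq.gz sample1.mapped.sorted.bam 64`. The step-by-step version in the post is equally valid and easier to debug, because each intermediate file can be inspected.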

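Calling SemiBin's built-in database download command directly did not work either, but it did print a hint: the GTDB release that SemiBin depends on is v95.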

 

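So the main problem in this attempt was that the download of the GTDB database SemiBin depends on kept failing, probably because access to overseas servers is throttled. I do not know how to get around the firewall and cannot afford a proxy, so I had to find another way: once again, download manually, upload manually, and install GTDB v95 by hand.

I first found release v95 on the GTDB website, downloaded it, and uploaded it to /home/dell/.cache/SemiBin/mmseqs2-GTDB. When I reran SemiBin single_easy_bin, the file was overwritten outright (the crawling download simply restarted). I then renamed the manually downloaded file to the name the software expects, and it was still overwritten. I repeated those steps and also extracted the archive, and it was overwritten again. My conclusion was that the file from the GTDB website has the right version but is not the file SemiBin looks for. After I interrupted the slow download once more, the error traceback gave me a hint.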

The Python script gtdb.py appears to be responsible for downloading the GTDB database, so I looked for that file in the GitHub repository.

 

Fortunately, the author keeps a copy of the GTDB v95 database on Zenodo; otherwise this would have been really hard to sort out.

GTDB reference genome generated by MMseqs2 used in SemiBin. | Zenodo

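Download link: https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz?download=1

Starting from this link, I used Xunlei (Thunder): saving the file to Xunlei cloud storage first and then downloading it locally felt noticeably faster.

Once the download finishes, upload the archive to the expected location (mine is ~/.cache/SemiBin/mmseqs2-GTDB) and extract it. It must be extracted; otherwise SemiBin still will not recognize it and will overwrite your file and start downloading again.

How you extract it is up to you; I used pigz + tar: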

    pigz -d GTDB_v95.tar.gz
    tar -xf GTDB_v95.tar
    # Alternatively, run tar -zxvf GTDB_v95.tar.gz directly
    # Verify that the database is in place and that SemiBin recognizes it:
    SemiBin download_GTDB
    2022-08-25 20:36:01,669 - Found GTDB directory: `/home/dell/.cache/SemiBin/mmseqs2-GTDB`.
    If you find SemiBin useful, please cite:
    Pan, S., Zhu, C., Zhao, XM. et al. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 13, 2326 (2022). https://doi.org/10.1038/s41467-022-29843-y.
    # Ready to continue
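
For a server that can reach Zenodo directly, even if slowly, the manual download and extraction above could be scripted roughly as follows. This is only a sketch: the names `fetch_gtdb`, `extract_gtdb`, and `GTDB_DIR` are made up for illustration, and `curl` is swapped in for the Xunlei download used in this post. Only the Zenodo URL and the cache path come from the post. `curl -C -` resumes a partially downloaded file instead of starting over, provided the server supports resuming:

```shell
# URL and default cache path as given in this post.
GTDB_URL='https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz?download=1'
GTDB_DIR="${GTDB_DIR:-$HOME/.cache/SemiBin/mmseqs2-GTDB}"

# Hypothetical helper: download the archive into GTDB_DIR.
# -L follows redirects; -C - resumes from wherever a previous attempt stopped.
fetch_gtdb() {
    mkdir -p "$GTDB_DIR" || return 1
    curl -L -C - -o "$GTDB_DIR/GTDB_v95.tar.gz" "$GTDB_URL"
}

# Hypothetical helper: unpack a tarball ($1) inside GTDB_DIR.
extract_gtdb() {
    mkdir -p "$GTDB_DIR" || return 1
    tar -xzf "$1" -C "$GTDB_DIR"
}

# Typical use (not run automatically here):
#   fetch_gtdb && extract_gtdb "$GTDB_DIR/GTDB_v95.tar.gz"
```

If an attempt is interrupted, calling `fetch_gtdb` again continues from the partial file. `SemiBin download_GTDB` can then be used, as shown above, to confirm that the directory is recognized.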

After checking the extracted files, I reran the single-sample easy binning. This time there were no errors and everything ran normally:

    (SemiBin) dell 20:36:42 /home/DB/fqz/mydata/SemiBin_testdata/single_sample_binning
    $ SemiBin single_easy_bin \
    > -i assembly_contig/R1.contigs.fa \
    > -b sample1.mapped.sorted.bam \
    > -o easy_single_sample_output
    # Everything below is SemiBin's output during binning:
    2022-08-25 20:37:06,248 - Setting number of CPUs to 112
    2022-08-25 20:37:06,249 - Do not detect GPU. Running with CPU.
    2022-08-25 20:37:06,264 - Generate training data.
    2022-08-25 20:37:06,920 - Calculating coverage for every sample.
    2022-08-25 20:37:06,925 - Processing `sample1.mapped.sorted.bam`
    2022-08-25 20:37:07,667 - Processed:sample1.mapped.sorted.bam
    2022-08-25 20:37:07,721 - Start generating kmer features from fasta file.
    2022-08-25 20:37:08,903 - Running mmseqs and generate cannot-link file.
    2022-08-25 20:37:08,915 - Found GTDB directory: `/home/dell/.cache/SemiBin/mmseqs2-GTDB`.
    createdb /tmp/tmp5y44n4h6/SemiBinFiltered.fa /tmp/tmp5y44n4h6/contig_DB
    MMseqs Version: 13.45111
    Database type 0
    Shuffle input database true
    Createdb mode 0
    Write lookup file 1
    Offset of numeric ids 0
    Compressed 0
    Verbosity 3
    Converting sequences
    [304] 0s 9ms
    Time for merging to contig_DB_h: 0h 0m 0s 19ms
    Time for merging to contig_DB: 0h 0m 0s 6ms
    Database type: Nucleotide
    Time for processing: 0h 0m 0s 45ms
    taxonomy /tmp/tmp5y44n4h6/contig_DB /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB easy_single_sample_output/mmseqs_contig_annotation/mmseqs_contig_annotation /tmp/tmp5y44n4h6 --tax-lineage 1 --threads 112
    MMseqs Version: 13.45111
    ORF filter 1
    ORF filter e-value 100
    ORF filter sensitivity 2
    LCA mode 3
    Taxonomy output mode 0
    Majority threshold 0.5
    Vote mode 1
    LCA ranks
    Column with taxonomic lineage 1
    Compressed 0
    Threads 112
    Verbosity 3
    Taxon blacklist 12908:unclassified sequences,28384:other sequences
    Substitution matrix nucl:nucleotide.out,aa:blosum62.out
    Add backtrace false
    Alignment mode 1
    Alignment mode 0
    Allow wrapped scoring false
    E-value threshold 1
    Seq. id. threshold 0
    Min alignment length 0
    Seq. id. mode 0
    Alternative alignments 0
    Coverage threshold 0
    Coverage mode 0
    Max sequence length 65535
    Compositional bias 1
    Max reject 5
    Max accept 30
    Include identical seq. id. false
    Preload mode 0
    Pseudo count a 1
    Pseudo count b 1.5
    Score bias 0
    Realign hits false
    Realign score bias -0.2
    Realign max seqs 2147483647
    Gap open cost nucl:5,aa:11
    Gap extension cost nucl:2,aa:1
    Zdrop 40
    Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
    Sensitivity 2
    k-mer length 0
    k-score 2147483647
    Alphabet size nucl:5,aa:21
    Max results per query 300
    Split database 0
    Split mode 2
    Split memory limit 0
    Diagonal scoring true
    Exact k-mer matching 0
    Mask residues 1
    Mask lower case residues 0
    Minimum diagonal score 15
    Spaced k-mers 1
    Spaced k-mer pattern
    Local temporary path
    Rescore mode 0
    Remove hits by seq. id. and coverage false
    Sort results 0
    Mask profile 1
    Profile E-value threshold 0.001
    Global sequence weighting false
    Allow deletions false
    Filter MSA 1
    Maximum seq. id. threshold 0.9
    Minimum seq. id. 0
    Minimum score per column -20
    Minimum coverage 0
    Select N most diverse seqs 1000
    Min codons in orf 30
    Max codons in length 32734
    Max orf gaps 2147483647
    Contig start mode 2
    Contig end mode 2
    Orf start mode 1
    Forward frames 1,2,3
    Reverse frames 1,2,3
    Translation table 1
    Translate orf 0
    Use all table starts false
    Offset of numeric ids 0
    Create lookup 0
    Add orf stop false
    Overlap between sequences 0
    Sequence split mode 1
    Header split mode 0
    Chain overlapping alignments 0
    Merge query 1
    Search type 0
    Search iterations 1
    Start sensitivity 4
    Search steps 1
    Exhaustive search mode false
    Filter results during exhaustive search 0
    Strand selection 1
    LCA search mode false
    Disk space limit 0
    MPI runner
    Force restart with latest tmp false
    Remove temporary files false
    extractorfs /tmp/tmp5y44n4h6/contig_DB /tmp/tmp5y44n4h6/10505879959436434455/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 112 --compressed 0 -v 3
    [=================================================================] 100.00% 392 0s 61ms
    Time for merging to orfs_aa_h: 0h 0m 0s 48ms
    Time for merging to orfs_aa: 0h 0m 0s 60ms
    Time for processing: 0h 0m 0s 299ms
    prefilter /tmp/tmp5y44n4h6/10505879959436434455/orfs_aa /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/orfs_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 2 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 112 --compressed 0 -v 3
    Query database size: 31958 type: Aminoacid
    Estimated memory consumption: 753G
    Target database size: 106052079 type: Aminoacid
    Index table k-mer threshold: 163 at k-mer size 7
    Index table: counting k-mers
    [=================================================================] 100.00% 106.05M 2m 3s 49ms
    Index table: Masked residues: 269144627
    Index table: fill
    [=================================================================] 100.00% 106.05M 2m 31s 544ms
    Index statistics
    Entries: 28079084193
    DB size: 170435 MB
    Avg k-mer size: 21.936785
    Top 10 k-mers
    SGQQRIA 397575
    GPGGKLL 319073
    GGQRVAR 221105
    YTGTGKG 177317
    LSGQQAI 153681
    GGRRVAR 125622
    ALGNGKS 109876
    LLGPGKT 107267
    GRFVVEV 105507
    TPHDFEV 88676
    Time for index table init: 0h 4m 54s 119ms
    Process prefiltering step 1 of 1
    k-mer similarity threshold: 163
    Starting prefiltering scores calculation (step 1 of 1)
    Query db start 1 to 31958
    Target db start 1 to 106052079
    [=================================================================] 100.00% 31.96K 0s 537ms
    30.136279 k-mers per position
    27916 DB matches per sequence
    0 overflows
    0 queries produce too many hits (truncated result)
    0 sequences passed prefiltering per query sequence
    0 median result list length
    29412 sequences with 0 size result lists
    Time for merging to orfs_pref: 0h 0m 0s 21ms
    Time for processing: 0h 5m 16s 25ms
    rescorediagonal /tmp/tmp5y44n4h6/10505879959436434455/orfs_aa /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/orfs_pref /tmp/tmp5y44n4h6/10505879959436434455/orfs_aln --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 100 -c 0 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 112 --compressed 0 -v 3
    [=================================================================] 100.00% 31.96K 0s 63ms
    Time for merging to orfs_aln: 0h 0m 0s 10ms
    Time for processing: 0h 0m 5s 87ms
    createsubdb /tmp/tmp5y44n4h6/10505879959436434455/orfs_aln.list /tmp/tmp5y44n4h6/10505879959436434455/orfs_aa /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter --subdb-mode 1 -v 3
    Time for merging to orfs_filter: 0h 0m 0s 0ms
    Time for processing: 0h 0m 0s 22ms
    rmdb /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter_h -v 3
    Time for processing: 0h 0m 0s 2ms
    createsubdb /tmp/tmp5y44n4h6/10505879959436434455/orfs_aln.list /tmp/tmp5y44n4h6/10505879959436434455/orfs_aa_h /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter_h --subdb-mode 1 -v 3
    Time for merging to orfs_filter_h: 0h 0m 0s 0ms
    Time for processing: 0h 0m 0s 22ms
    Create directory /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy
    taxonomy /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/orfs_tax /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy --tax-output-mode 2 --tax-lineage 0 --threads 112 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1
    Create directory /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/tmp_hsp1
    search /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/first /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/tmp_hsp1 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 --threads 112 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --lca-search 1
    prefilter /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/tmp_hsp1/10382565990596377146/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 112 --compressed 0 -v 3 -s 2.0
    Query database size: 2531 type: Aminoacid
    Estimated memory consumption: 753G
    Target database size: 106052079 type: Aminoacid
    Index table k-mer threshold: 163 at k-mer size 7
    Index table: counting k-mers
    [=================================================================] 100.00% 106.05M 2m 4s 599ms
    Index table: Masked residues: 269144627
    Index table: fill
    [=================================================================] 100.00% 106.05M 2m 32s 452ms
    Index statistics
    Entries: 28079084193
    DB size: 170435 MB
    Avg k-mer size: 21.936785
    Top 10 k-mers
    SGQQRIA 397575
    GPGGKLL 319073
    GGQRVAR 221105
    YTGTGKG 177317
    LSGQQAI 153681
    GGRRVAR 125622
    ALGNGKS 109876
    LLGPGKT 107267
    GRFVVEV 105507
    TPHDFEV 88676
    Time for index table init: 0h 4m 54s 821ms
    Process prefiltering step 1 of 1
    k-mer similarity threshold: 163
    Starting prefiltering scores calculation (step 1 of 1)
    Query db start 1 to 2531
    Target db start 1 to 106052079
    [=================================================================] 100.00% 2.53K 0s 910ms
    31.980804 k-mers per position
    170052 DB matches per sequence
    0 overflows
    0 queries produce too many hits (truncated result)
    235 sequences passed prefiltering per query sequence
    300 median result list length
    0 sequences with 0 size result lists
    Time for merging to pref_0: 0h 0m 0s 7ms
    Time for processing: 0h 5m 20s 93ms
    lcaalign /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/tmp_hsp1/10382565990596377146/pref_0 /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/first --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 1 --alignment-output-mode 0 --wrapped-scoring 0 -e 1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 5 --max-accept 30 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 112 --compressed 0 -v 3
    Compute score only
    Query database size: 2531 type: Aminoacid
    Target database size: 106052079 type: Aminoacid
    [=================================================================] 100.00% 2.53K 0s 737ms
    Time for merging to first: 0h 0m 0s 7ms
    59140 alignments calculated
    54382 sequence pairs passed the thresholds (0.919547 of overall calculated)
    21.486368 hits per query sequence
    Time for processing: 0h 0m 7s 61ms
    lca /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/first /tmp/tmp5y44n4h6/10505879959436434455/orfs_tax --blacklist '12908:unclassified sequences,28384:other sequences' --tax-lineage 0 --compressed 0 --threads 112 -v 3
    Node name 'unclassified sequences' does not match to be blocked name 'RBG-16-66-30'
    Node name 'other sequences' does not match to be blocked name 'B14-G1 sp003648675'
    [=================================================================] 100.00% 2.53K 0s 28ms
    Taxonomy for 0 out of 13195 entries not found
    Time for merging to orfs_tax: 0h 0m 0s 16ms
    Time for processing: 0h 0m 3s 758ms
    mvdb /tmp/tmp5y44n4h6/10505879959436434455/tmp_taxonomy/5836584711541975065/first /tmp/tmp5y44n4h6/10505879959436434455/orfs_tax_aln -v 3
    Time for processing: 0h 0m 0s 2ms
    swapdb /tmp/tmp5y44n4h6/10505879959436434455/orfs_filter_h /tmp/tmp5y44n4h6/10505879959436434455/orfs_h_swapped --split-memory-limit 0 --threads 112 --compressed 0 -v 3
    [=================================================================] 100.00% 2.53K 0s 12ms
    Computing offsets.
    [=================================================================] 100.00% 2.53K 0s 3ms
    Reading results.
    [=================================================================] 100.00% 2.53K 0s 5ms
    Output database: /tmp/tmp5y44n4h6/10505879959436434455/orfs_h_swapped
    [=================================================================] 100.00% 392 0s 105ms
    Time for merging to orfs_h_swapped: 0h 0m 0s 3ms
    Time for processing: 0h 0m 0s 177ms
    aggregatetaxweights /home/dell/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmp5y44n4h6/10505879959436434455/orfs_h_swapped /tmp/tmp5y44n4h6/10505879959436434455/orfs_tax /tmp/tmp5y44n4h6/10505879959436434455/orfs_tax_aln easy_single_sample_output/mmseqs_contig_annotation/mmseqs_contig_annotation --tax-lineage 1 --compressed 0 --threads 112 -v 3
    [=================================================================] 100.00% 392 0s 26ms
    Time for merging to mmseqs_contig_annotation: 0h 0m 0s 6ms
    Time for processing: 0h 0m 0s 143ms
    createtsv /tmp/tmp5y44n4h6/contig_DB easy_single_sample_output/mmseqs_contig_annotation/mmseqs_contig_annotation easy_single_sample_output/mmseqs_contig_annotation/taxonomyResult.tsv
    MMseqs Version: 13.45111
    First sequence as representative false
    Target column 1
    Add full header false
    Sequence source 0
    Database output false
    Threads 112
    Compressed 0
    Verbosity 3
    Time for merging to taxonomyResult.tsv: 0h 0m 0s 4ms
    Time for processing: 0h 0m 0s 133ms
    2022-08-25 20:48:02,164 - Training model and clustering.
    2022-08-25 20:48:02,165 - Start training from one sample.
    2022-08-25 20:48:02,272 - Training model...
    0%| | 0/20 [00:00<?, ?it/s]
    2022-08-25 20:48:10,698 - Generate training data of 0:
    2022-08-25 20:48:10,739 - Number of must link pair:163
    2022-08-25 20:48:10,739 - Number of can not link pair:9394
    100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00, 1.14s/it]
    2022-08-25 20:48:25,037 - Training finished.
    2022-08-25 20:48:25,049 - Start binning.
    2022-08-25 20:48:28,352 - Calculating depth matrix.
    2022-08-25 20:48:28,366 - Edges:9962
    2022-08-25 20:48:33,595 - Reclustering.
    2022-08-25 20:48:45,490 - Binning finish.
    If you find SemiBin useful, please cite:
    Pan, S., Zhu, C., Zhao, XM. et al. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 13, 2326 (2022). https://doi.org/10.1038/s41467-022-29843-y.

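One open question, in case anyone has a solution: can the database be stored somewhere else, with SemiBin pointed at that location through some option or function? I hope this post is helpful!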
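
One generic way to keep the database elsewhere, independent of any SemiBin option, is a symbolic link from the default cache path to the real location. The sketch below works purely at the filesystem level and is untested against SemiBin itself; `link_gtdb` is a hypothetical helper name, and the default link path is the cache directory seen in the logs above. SemiBin's own help output (`SemiBin single_easy_bin --help`) may also list an option for a custom reference directory, which would be worth checking first.

```shell
# Hypothetical helper: keep the extracted database at REAL_DIR and expose it
# at the default cache location through a symbolic link.
# Usage: link_gtdb REAL_DIR [LINK_PATH]   (REAL_DIR should be an absolute path)
link_gtdb() {
    real_dir=$1
    link_path=${2:-$HOME/.cache/SemiBin/mmseqs2-GTDB}
    if [ ! -d "$real_dir" ]; then
        echo "no such directory: $real_dir" >&2
        return 1
    fi
    # Never clobber an existing directory, file, or link at the cache path.
    if [ -e "$link_path" ] || [ -L "$link_path" ]; then
        echo "refusing to overwrite existing $link_path" >&2
        return 1
    fi
    mkdir -p "$(dirname "$link_path")" || return 1
    ln -s "$real_dir" "$link_path"
}
```

For example, after extracting the archive into a directory on a larger disk, `link_gtdb /path/to/that/directory` creates the link, and `SemiBin download_GTDB` can then be run to see whether the directory is reported as found.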
