当前位置:   article > 正文

diamond大基因序列快速比对工具使用详解-包含超算集群多节点计算使用方法_65mb文件做blast比对时间多久

65mb文件做blast比对时间多久

Diamond是一款快速的序列比对工具,其使用方法如下:

1. 安装Diamond:

可从官方网站(https://github.com/bbuchfink/diamond/releases)下载安装包,并安装到本地电脑中。当然还有docker,conda以及编译安装方式,一般用不上,但注意新版对gcc的要求高,出现gcc错误时可选择下载低版本的diamond或者升级gcc到指定版本以上。

  1. #下载diamond程序文件
  2. wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
  3. ###其他版本直接访问http://github.com/bbuchfink/diamond/releases/download/查看
  4. #解压会出来一个diamond的文件
  5. ​tar -xzvf diamond-linux64.tar.gz
  6. #移到系统环境目录、或将当前目录加入系统环境目录,或者直接使用路径加diamond命令运行
  7. diamond blastx
  8. ./diamond blastx
  9. /opt/diamond blastx

2. 准备数据集:

首先需要准备用于比对的序列数据集,比如fasta格式的序列文件。

  1. #下载nr数据库,或这自己需要的数据库
  2. wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
  3. gunzip nr.gz
  4. #使用diamond命令创建dimond格式数据库
  5. diamond makedb --in nr --db nr

3. 运行Diamond:

常规使用

在终端中输入以下命令,即可启动Diamond程序并运行比对任务:
diamond blastx -d [参考序列文件] -q [待比对序列文件] -o [输出文件名]

  1. #下载nr数据库,或这自己需要的数据库
  2. wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
  3. gunzip nr.gz
  4. #使用diamond命令创建dimond格式数据库
  5. diamond makedb --in nr --db nr
  6. #命令使用
  7. diamond blastx --db nr -q reads.fna -o dna_matches_fmt6.txt
  8. diamond blastp --db nr -q reads.faa -o protein_matches_fmt6.txt

其中,blastx表示使用蛋白质序列比对算法,-d和-q分别指定参考序列文件和待比对序列文件,-o指定输出文件名。

超算集群多计算节点并行计算(私房菜)Distributed computing

diamond尽管速度快,但对于大文件进行比对时,大于1G以上的文件对于40核的单个节点可能仍然需要几天的时间,如果有较多的节点时,可以使用多节点的并行计算,这一点太给力了。

准备工作(重要,否则不成功):

1、将diamond程序目录在各节点间共享

2、样品序列目录在各节点间共享

3、所有节点使用相同的临时目录在各节点间共享。

  1. # Diamond distributed-memory parallel processing
  2. #Diamond supports the parallel processing of large alignments on HPC clusters and #supercomputers, spanning numerous compute nodes. Work distribution is orchestrated by #Diamond via a shared file system typically available on such clusters, using lightweight #file-based stacks and POSIX functionality.
  3. #Usage
  4. #To run Diamond in parallel, two steps need to be performed. First, during a short #initialization run using a single process, the query and database are scanned and chunks #of work are written to the file-based stacks on the parallel file system. Second, the #actual parallel run is performed, where multiple DIAMOND processes on different compute #nodes work on chunks of the query and the reference database to perform alignments and #joins.
  5. #Initialization 先进行任务初始化,这个只需要在第一个节点上初始化就行了。其他节点直接启动后面一步的并行计算命令就行
  6. #The initialization of a parallel run should be done (e.g. interactively on a login node) #using the parameters --multiprocessing --mp-init as follows:
  7. diamond blastp --db DATABASE.dmnd --query QUERY.fasta --multiprocessing --mp-init --tmpdir $TMPDIR --parallel-tmpdir $PTMPDIR
  8. #Here $TMPDIR refers to a local temporary directory, whereas $PTMPDIR refers to a #directory in the parallel file system where the file-based stacks containing the work #packages will be created. Note that the size of the chunking and thereby the number of #work packages is controlled via the --block-size parameter.
  9. #Parallel run 开始真实的并行计算,可以在所有计算节点启动
  10. #The actual parallel run should be done using the parameter --multiprocessing as follows:
  11. diamond blastp --db DATABASE.dmnd --query QUERY.fasta -o OUTPUT_FILE --multiprocessing --tmpdir $TMPDIR --parallel-tmpdir $PTMPDIR
  12. #这里特意说明文件夹与任务初始化文件夹的一致性,主要是临时计算目录tmpdir
  13. #Note that $PTMPDIR must refer to the same location as used during the initialization, #and it must be accessible from any of the compute nodes involved. To launch the parallel #processes on many nodes, a batch system such as SLURM is typically used. For the output #not a single stream is used but rather multiple files are created, one for each query #chunk.
  14. #SLURM batch file example slurm超算集群脚本,这个不多说了吧,使用这个更方便一点,没有也不用担心,使用前面那两个命令即可。
  15. #The following script shows an example of how a massively parallel can be performed using #the SLURM batch system on a supercomputer.
  16. #!/bin/bash -l
  17. #SBATCH -D ./
  18. #SBATCH -J DIAMOND
  19. #SBATCH --mem=185000
  20. #SBATCH --nodes=520
  21. #SBATCH --ntasks-per-node=1
  22. #SBATCH --ntasks-per-core=2
  23. #SBATCH --cpus-per-task=80
  24. #SBATCH --mail-type=none
  25. #SBATCH --time=24:00:00
  26. module purge
  27. module load gcc impi
  28. export SLURM_HINT=multithread
  29. ###以下是超算的相关说明,重点关注前面配置即可。
  30. srun diamond FLAGS
  31. FLAGS refers to the aforementioned parallel flags for Diamond. Note that the actual configuration of the nodes varies between machines, and therefore, the parameters shown here are not of general applicability. It is recommended to start with few nodes on small problems, first.
  32. Abort and resume
  33. Parallel runs can be aborted and later resumed, and unfinished work packages from a previous run can be recovered and resubmitted in a subsequent run.
  34. Using the option --multiprocessing --mp-recover for the same value of --parallel-tmpdir will scan the working directory and configure a new parallel run including only the work packages that have not been completed in the previous run.
  35. Placing a file stop in the working directory causes DIAMOND processes to shut down in a controlled way after finishing the current work package. After removing the stop file, the multiprocessing run can be continued.
  36. Parameter optimization
  37. The granularity of the size of the work packages can be adjusted via the --block-size which at the same time affects the memory requirements at runtime. Parallel runs on more than 512 nodes of a supercomputer have been performed successfully.

4. 结果解读:

比对结束后,可以查看输出文件中的比对结果。Diamond的输出文件包含每个待比对序列的匹配结果,包括匹配的参考序列名、匹配位置、匹配得分等信息。

结果字段表示与原生blast结果表示相同:

见: 生物信息学分析-blast序列比对及结果详细说明-CSDN博客

5.帮助说明

 以上就是Diamond的基本使用方法,更详细的说明可以参考官方文档:https://github.com/bbuchfink/diamond/wiki。

  1. # downloading the tool,下载工具
  2. wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
  3. tar xzf diamond-linux64.tar.gz
  4. # creating a diamond-formatted database file 创建diamond数据库
  5. ./diamond makedb --in reference.fasta -d reference
  6. # running a search in blastp mode 使用blastp模式比对序列
  7. ./diamond blastp -d reference -q queries.fasta -o matches.tsv
  8. # running a search in blastx mode 使用blastx 模式比对序列
  9. ./diamond blastx -d reference -q reads.fasta -o matches.tsv
  10. # downloading and using a BLAST database
  11. update_blastdb.pl --decompress --blastdb_version 5 swissprot
  12. ./diamond prepdb -d swissprot
  13. ./diamond blastp -d swissprot -q queries.fasta -o matches.tsv
  14. Some important points to consider:
  15. Repeat masking is applied to the query and reference sequences by default. To disable it, use --masking 0. 默认情况下是允许重复结果,如果只输出最优结果就加上 --masking 0
  16. DIAMOND is optimized for large input files of >1 million proteins. Naturally the tool can be used for smaller files as well, but the algorithm will not reach its full efficiency.
  17. The program may use quite a lot of memory and also temporary disk space. Should the program fail due to running out of either one, you need to set a lower value for the block size parameter -b. DIAMOND是大文件效率更好,对于小文件建议添加 -b 的参数
  18. The sensitivity can be adjusted using the options --fast, --mid-sensitive, --sensitive, --more-sensitive, --very-sensitive and --ultra-sensitive. 比对敏感性,越往后其结果越接近原生blast结果,但速度也越慢,一般使用--more-sensitive比较适中,计算资源不够的就使用fast。

全参数帮助文件

下面是diamond的较为详细的帮助说明:自己慢慢看吧,不过一般不用特意设置了。

  1. diamond --help
  2. diamond v2.0.11.149 (C) Max Planck Society for the Advancement of Science
  3. Documentation, support and updates available at http://www.diamondsearch.org
  4. Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
  5. Syntax: diamond COMMAND [OPTIONS]
  6. Commands:
  7. makedb Build DIAMOND database from a FASTA file #以fasta文件创建diamond格式数据库
  8. blastp Align amino acid query sequences against a protein reference database #功能与原生blastp功能一致
  9. blastx Align DNA query sequences against a protein reference database #功能与原生blastx一致
  10. view View DIAMOND alignment archive (DAA) formatted file
  11. help Produce help message
  12. version Display version information
  13. getseq Retrieve sequences from a DIAMOND database file
  14. dbinfo Print information about a DIAMOND database file
  15. test Run regression tests
  16. makeidx Make database index
  17. General options:
  18. --threads (-p) number of CPU threads #指定需要运行的线程数,可尽量大
  19. --db (-d) database file #diamond makedb产生的diamond可使用格式的数据库
  20. --out (-o) output file #比对结果输出命名
  21. --outfmt (-f) output format #outfmt,一般选6表格格式,与原生blast一致
  22. 0 = BLAST pairwise
  23. 5 = BLAST XML
  24. 6 = BLAST tabular
  25. 100 = DIAMOND alignment archive (DAA)
  26. 101 = SAM
  27. Value 6 may be followed by a space-separated list of these keywords:
  28. qseqid means Query Seq - id
  29. qlen means Query sequence length
  30. sseqid means Subject Seq - id
  31. sallseqid means All subject Seq - id(s), separated by a ';'
  32. slen means Subject sequence length
  33. qstart means Start of alignment in query
  34. qend means End of alignment in query
  35. sstart means Start of alignment in subject
  36. send means End of alignment in subject
  37. qseq means Aligned part of query sequence
  38. qseq_translated means Aligned part of query sequence (translated)
  39. full_qseq means Query sequence
  40. full_qseq_mate means Query sequence of the mate
  41. sseq means Aligned part of subject sequence
  42. full_sseq means Subject sequence
  43. evalue means Expect value
  44. bitscore means Bit score
  45. score means Raw score
  46. length means Alignment length
  47. pident means Percentage of identical matches
  48. nident means Number of identical matches
  49. mismatch means Number of mismatches
  50. positive means Number of positive - scoring matches
  51. gapopen means Number of gap openings
  52. gaps means Total number of gaps
  53. ppos means Percentage of positive - scoring matches
  54. qframe means Query frame
  55. btop means Blast traceback operations(BTOP)
  56. cigar means CIGAR string
  57. staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
  58. sscinames means unique Subject Scientific Name(s), separated by a ';'
  59. sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
  60. skingdoms means unique Subject Kingdom(s), separated by a ';'
  61. sphylums means unique Subject Phylum(s), separated by a ';'
  62. stitle means Subject Title
  63. salltitles means All Subject Title(s), separated by a '<>'
  64. qcovhsp means Query Coverage Per HSP
  65. scovhsp means Subject Coverage Per HSP
  66. qtitle means Query title
  67. qqual means Query quality values for the aligned part of the query
  68. full_qqual means Query quality values
  69. qstrand means Query strand
  70. Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
  71. --verbose (-v) verbose console output
  72. --log enable debug log
  73. --quiet disable console output
  74. --header Write header lines to blast tabular format.
  75. Makedb options:
  76. --in input reference file in FASTA format
  77. --taxonmap protein accession to taxid mapping file
  78. --taxonnodes taxonomy nodes.dmp from NCBI
  79. --taxonnames taxonomy names.dmp from NCBI
  80. Aligner options:
  81. --query (-q) input query file
  82. --strand query strands to search (both/minus/plus)
  83. --un file for unaligned queries
  84. --al file or aligned queries
  85. --unfmt format of unaligned query file (fasta/fastq)
  86. --alfmt format of aligned query file (fasta/fastq)
  87. --unal report unaligned queries (0=no, 1=yes)
  88. --max-target-seqs (-k) maximum number of target sequences to report alignments for (default=25)
  89. --top report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
  90. --max-hsps maximum number of HSPs per target sequence to report for each query (default=1)
  91. --range-culling restrict hit culling to overlapping query ranges
  92. --compress compression for output files (0=none, 1=gzip, zstd)
  93. --evalue (-e) maximum e-value to report alignments (default=0.001)
  94. --min-score minimum bit score to report alignments (overrides e-value setting)
  95. --id minimum identity% to report an alignment
  96. --query-cover minimum query cover% to report an alignment
  97. --subject-cover minimum subject cover% to report an alignment
  98. --fast enable fast mode
  99. --mid-sensitive enable mid-sensitive mode
  100. --sensitive enable sensitive mode)
  101. --more-sensitive enable more sensitive mode
  102. --very-sensitive enable very sensitive mode
  103. --ultra-sensitive enable ultra sensitive mode
  104. --iterate iterated search with increasing sensitivity
  105. --global-ranking (-g) number of targets for global ranking
  106. --block-size (-b) sequence block size in billions of letters (default=2.0)
  107. --index-chunks (-c) number of chunks for index processing (default=4)
  108. --tmpdir (-t) directory for temporary files
  109. --parallel-tmpdir directory for temporary files used by multiprocessing
  110. --gapopen gap open penalty
  111. --gapextend gap extension penalty
  112. --frameshift (-F) frame shift penalty (default=disabled)
  113. --long-reads short for --range-culling --top 10 -F 15
  114. --matrix score matrix for protein alignment (default=BLOSUM62)
  115. --custom-matrix file containing custom scoring matrix
  116. --comp-based-stats composition based statistics mode (0-4)
  117. --masking enable tantan masking of repeat regions (0/1=default)
  118. --query-gencode genetic code to use to translate query (see user manual)
  119. --salltitles include full subject titles in DAA file
  120. --sallseqid include all subject ids in DAA file
  121. --no-self-hits suppress reporting of identical self hits
  122. --taxonlist restrict search to list of taxon ids (comma-separated)
  123. --taxon-exclude exclude list of taxon ids (comma-separated)
  124. --seqidlist filter the database by list of accessions
  125. --skip-missing-seqids ignore accessions missing in the database
  126. Advanced options:
  127. --algo Seed search algorithm (0=double-indexed/1=query-indexed/ctg=contiguous-seed)
  128. --bin number of query bins for seed search
  129. --min-orf (-l) ignore translated sequences without an open reading frame of at least this length
  130. --freq-sd number of standard deviations for ignoring frequent seeds
  131. --id2 minimum number of identities for stage 1 hit
  132. --xdrop (-x) xdrop for ungapped alignment
  133. --gapped-filter-evalue E-value threshold for gapped filter (auto)
  134. --band band for dynamic programming computation
  135. --shapes (-s) number of seed shapes (default=all available)
  136. --shape-mask seed shapes
  137. --multiprocessing enable distributed-memory parallel processing
  138. --mp-init initialize multiprocessing run
  139. --mp-recover enable continuation of interrupted multiprocessing run
  140. --mp-query-chunk process only a single query chunk as specified
  141. --ext-chunk-size chunk size for adaptive ranking (default=auto)
  142. --no-ranking disable ranking heuristic
  143. --ext Extension mode (banded-fast/banded-slow/full)
  144. --culling-overlap minimum range overlap with higher scoring hit to delete a hit (default=50%)
  145. --taxon-k maximum number of targets to report per species
  146. --range-cover percentage of query range to be covered for range culling (default=50%)
  147. --dbsize effective database size (in letters)
  148. --no-auto-append disable auto appending of DAA and DMND file extensions
  149. --xml-blord-format Use gnl|BL_ORD_ID| style format in XML output
  150. --stop-match-score Set the match score of stop codons against each other.
  151. --tantan-minMaskProb minimum repeat probability for masking (default=0.9)
  152. --file-buffer-size file buffer size in bytes (default=67108864)
  153. --memory-limit (-M) Memory limit for extension stage in GB
  154. --no-unlink Do not unlink temporary files.
  155. --target-indexed Enable target-indexed mode
  156. --ignore-warnings Ignore warnings
  157. View options:
  158. --daa (-a) DIAMOND alignment archive (DAA) file
  159. --forwardonly only show alignments of forward strand
  160. Getseq options:
  161. --seq Sequence numbers to display.
  162. Online documentation at http://www.diamondsearch.org

新版本帮助文件:

新版本帮助更简洁,不在一个层次的命令不显示出来,以免混淆。

  1. diamond --help
  2. diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
  3. Documentation, support and updates available at http://www.diamondsearch.org
  4. Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
  5. Syntax: diamond COMMAND [OPTIONS]
  6. Commands:
  7. makedb Build DIAMOND database from a FASTA file
  8. prepdb Prepare BLAST database for use with Diamond
  9. blastp Align amino acid query sequences against a protein reference database
  10. blastx Align DNA query sequences against a protein reference database
  11. cluster Cluster protein sequences
  12. linclust Cluster protein sequences in linear time
  13. realign Realign clustered sequences against their centroids
  14. recluster Recompute clustering to fix errors
  15. reassign Reassign clustered sequences to the closest centroid
  16. view View DIAMOND alignment archive (DAA) formatted file
  17. merge-daa Merge DAA files
  18. help Produce help message
  19. version Display version information
  20. getseq Retrieve sequences from a DIAMOND database file
  21. dbinfo Print information about a DIAMOND database file
  22. test Run regression tests
  23. makeidx Make database index
  24. greedy-vertex-cover Compute greedy vertex cover
  25. Possible [OPTIONS] for COMMAND can be seen with syntax: diamond COMMAND
  26. Online documentation at http://www.diamondsearch.org

要显示更具体的命令下的参数,直接增加功能命令回车即可显示,具体使用大家可在自己系统里面查看即可:

  1. diamond makedb
  2. diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
  3. Documentation, support and updates available at http://www.diamondsearch.org
  4. Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
  5. Options:
  6. --threads number of CPU threads
  7. --verbose verbose console output
  8. --log enable debug log
  9. --quiet disable console output
  10. --tmpdir directory for temporary files
  11. --db database file
  12. --in input reference file in FASTA format/input DAA files for merge-daa
  13. --taxonmap protein accession to taxid mapping file
  14. --taxonnodes taxonomy nodes.dmp from NCBI
  15. --taxonnames taxonomy names.dmp from NCBI
  16. --file-buffer-size file buffer size in bytes (default=67108864)
  17. --no-unlink Do not unlink temporary files.
  18. --ignore-warnings Ignore warnings
  19. --no-parse-seqids Print raw seqids without parsing
  20. Error: Missing parameter: database file (--db/-d)

6、结果过滤

1. 根据比对的得分进行过滤

Diamond Blastp 会产生比对得分,你可以根据得分来筛选结果。通过设置一个阈值,只保留得分高于阈值的比对结果。例如:

  1. diamond blastp -d database.fasta -q query.fasta -o output.m8 --min-score 100
  2. ### --min-score 100 表示只保留得分高于等于 100 的比对结果。

2. 根据期望的价值(E 值)进行过滤

E 值表示在随机情况下,期望观察到给定比对得分的次数。可以根据 E 值来过滤结果,只保留期望值低于设定阈值的比对结果。例如:

  1. diamond blastp -d database.fasta -q query.fasta -o output.m8 --evalue 1e-5
  2. ### --evalue 1e-5 表示只保留 E 值低于或等于 1e-5 的比对结果。

 3. 根据相似性阈值过滤

  1. diamond blastp -d database.fasta -q query.fasta -o output.m8 --id 97
  2. ### --id 97 表示只保留相似性大于等于 97% 的比对结果。

 4. 取唯一结果

  1. ### 使用 AWK 命令根据第一个列(query ID)或其他标识符来提取唯一结果
  2. sort -k1,1 -u output.m8 > unique_output.m8

 5、python脚本处理输出最优结果

  1. # 打开 Diamond Blastp 输出文件
  2. with open('output.m8', 'r') as file:
  3. best_hit = {}
  4. # 逐行读取文件
  5. for line in file:
  6. fields = line.strip().split('\t') # 根据制表符分割字段
  7. query_id, subject_id, percent_identity, alignment_length, e_value, bit_score = fields[:6]
  8. # 如果查询 ID 不在 best_hit 中或当前行比最佳结果更好,则更新最优结果
  9. if query_id not in best_hit or float(bit_score) > float(best_hit[query_id]['bit_score']):
  10. best_hit[query_id] = {
  11. 'subject_id': subject_id,
  12. 'percent_identity': float(percent_identity),
  13. 'alignment_length': int(alignment_length),
  14. 'e_value': float(e_value),
  15. 'bit_score': float(bit_score)
  16. }
  17. # 输出最优结果
  18. for query_id, hit_info in best_hit.items():
  19. print(f"Query ID: {query_id}")
  20. print(f"Subject ID: {hit_info['subject_id']}")
  21. print(f"Percent Identity: {hit_info['percent_identity']}")
  22. print(f"Alignment Length: {hit_info['alignment_length']}")
  23. print(f"E-value: {hit_info['e_value']}")
  24. print(f"Bit Score: {hit_info['bit_score']}")
  25. print("-------------")

 脚本说明:

  • 脚本读取 Diamond Blastp 输出文件,并遍历每一行。
  • 对于每个查询 ID,它保留具有最高比对分数(或其他选择的标准)的比对结果。
  • 最终输出每个查询 ID 对应的最优匹配结果的相关信息。

6. 使用R提取最有结果

  1. # 读取 Diamond Blastp 输出文件
  2. data <- read.table("output.m8", header = FALSE, sep = "\t")
  3. # 命名列名
  4. colnames(data) <- c("query_id", "subject_id", "percent_identity", "alignment_length", "e_value", "bit_score")
  5. # 根据查询 ID 获取最优结果
  6. library(dplyr)
  7. best_hits <- data %>%
  8. group_by(query_id) %>%
  9. slice(which.max(bit_score)) # 根据最高比对分数选择最优结果,可以根据其他标准替换 bit_score
  10. # 显示最优结果
  11. print(best_hits)

 脚本说明:

  • 脚本使用 read.table() 函数读取 Diamond Blastp 输出文件。
  • 使用 dplyr 包中的 group_by()slice() 函数按照查询 ID 分组,并选择每个查询 ID 的最高比对分数,以获取最优结果。
  • 最终输出包含最优匹配结果的数据框。

7. 参考文献

Benjamin Buchfink, Chao Xie, and Daniel H. Huson. Fast and sensitive protein alignment using diamond. Nature methods, 12(1):59–60, Jan 2015.

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/知新_RL/article/detail/453741
推荐阅读
相关标签
  

闽ICP备14008679号