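DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of large sequence datasets.
Official documentation: Home · bbuchfink/diamond Wiki (github.com)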
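- # Create a conda environment named diamond and install diamond into it (assumes the bioconda channel is configured)
- conda create --name diamond diamond
- # Activate the diamond environment
- conda activate diamond
- # Check the diamond version
- diamond --version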
Download the example data. The dataset is in FASTA format and contains 14,323 protein sequences.
wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa
Now use diamond makedb to convert the downloaded file into a DIAMOND database file. This database file is used for the alignment in the next step.
diamond makedb --in astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40
Search the database using the same file as the query
diamond blastp -q astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40 -o out.tsv --very-sensitive
Parameters:
-q the query file
-d the database file generated in the previous step
-o the output file for the search results
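DIAMOND offers several sensitivity settings to suit different applications. The default mode is the fastest and is tailored to finding homologs with >70% sequence identity. The --sensitive mode is tailored to hits with >40% identity, while the --very-sensitive and --ultra-sensitive modes provide high sensitivity across the whole range of pairwise alignments. The higher the sensitivity, the more likely distant homologs are to be detected, at the cost of a longer run time.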
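To compare the modes described above on the same data, the search can be repeated once per sensitivity setting. The following is only a sketch: the DRY_RUN variable, the compare_modes function, and the out.<mode>.tsv file names are hypothetical choices made for this illustration, while the flags are the ones named in the paragraph above. By default it only prints the commands; set DRY_RUN=0 to run them against the astral40 database.

```shell
# Dry run by default: print each command instead of executing it.
# DRY_RUN, compare_modes and the out.<mode>.tsv names are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"
QUERY="astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa"

compare_modes() {
  for mode in default sensitive very-sensitive ultra-sensitive; do
    out="out.${mode}.tsv"
    # The default mode takes no flag; the others map to --<mode>.
    if [ "$mode" = "default" ]; then flag=""; else flag="--$mode"; fi
    cmd="diamond blastp -q $QUERY -d astral40 -o $out${flag:+ $flag}"
    if [ "$DRY_RUN" = "0" ]; then
      # Run the search, then report how many alignments were written.
      $cmd && echo "$mode: $(wc -l < "$out") alignments"
    else
      echo "$cmd"
    fi
  done
}

compare_modes
```

Comparing the line counts of the resulting files gives a rough picture of how many additional alignments each more sensitive mode reports on this dataset.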
Interpreting the results
Partial results:
d1dlwa_  d1dlwa_  100   116  0   0  1   116  1   116  6.42e-77  220
d1dlwa_  d2gkma_  35.4  113  73  0  1   113  13  125  1.43e-21  80.9
d1dlwa_  d4i0va_  31.9  119  75  2  1   113  2   120  9.11e-13  58.2
d2gkma_  d2gkma_  100   127  0   0  1   127  1   127  1.51e-87  248
d2gkma_  d1dlwa_  34.8  115  75  0  13  127  1   115  6.90e-23  84.3
d2gkma_  d4i0va_  33.6  110  69  1  13  118  2   111  1.35e-18  73.6
d2gkma_  d6bmea_  35.5  110  67  1  13  118  2   111  1.32e-16  68.6
d2gkma_  d2bkma_  37.3  67   38  2  13  76   5   70   5.18e-06  40.8
d1ngka_  d1ngka_  100   126  0   0  1   126  1   126  4.34e-91  257
d1ngka_  d2bkma_  38.4  125  73  2  1   125  4   124  1.42e-24  89.0
Meaning of each column, from left to right:
Query accession: the accession of the sequence that was the search query against the database, as specified in the input FASTA file after the > character until the first blank.
Target accession: the accession of the target database sequence (also called subject) that the query was aligned against.
Sequence identity: The percentage of identical amino acid residues that were aligned against each other in the local alignment.
Length: The total length of the local alignment, which includes matching and mismatching positions of query and subject, as well as gap positions in the query and subject.
Mismatches: The number of non-identical amino acid residues aligned against each other.
Gap openings: The number of gap openings.
Query start: The starting coordinate of the local alignment in the query (1-based).
Query end: The ending coordinate of the local alignment in the query (1-based).
Target start: The starting coordinate of the local alignment in the target (1-based).
Target end: The ending coordinate of the local alignment in the target (1-based).
E-value: The expected value of the hit quantifies the number of alignments of similar or better quality that you expect to find searching this query against a database of random sequences the same size as the actual target database. This number is most useful for measuring the significance of a hit. By default, DIAMOND will report all alignments with e-value < 0.001, meaning that a hit of this quality will be found by chance on average once per 1,000 queries.
Bit score: The bit score is a scoring-matrix-independent measure of the (local) similarity of the two aligned sequences, with higher numbers meaning more similar. It is always >= 0 for local Smith-Waterman alignments.
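Because the query file and the database are the same here, every sequence also matches itself with 100% identity. One possible way to post-process the 12-column output is sketched below with awk: it drops self-hits, keeps hits below a stricter E-value cutoff, and reports the highest-scoring remaining hit per query. The 1e-10 cutoff and the variable names are arbitrary choices for this illustration, and the sample rows are copied from the partial results above so the snippet runs without DIAMOND installed.

```shell
# Write a few rows from the partial results above to a temporary tab-separated file.
sample="$(mktemp)"
{
  printf 'd1dlwa_\td1dlwa_\t100\t116\t0\t0\t1\t116\t1\t116\t6.42e-77\t220\n'
  printf 'd1dlwa_\td2gkma_\t35.4\t113\t73\t0\t1\t113\t13\t125\t1.43e-21\t80.9\n'
  printf 'd1dlwa_\td4i0va_\t31.9\t119\t75\t2\t1\t113\t2\t120\t9.11e-13\t58.2\n'
  printf 'd2gkma_\td2gkma_\t100\t127\t0\t0\t1\t127\t1\t127\t1.51e-87\t248\n'
  printf 'd2gkma_\td1dlwa_\t34.8\t115\t75\t0\t13\t127\t1\t115\t6.90e-23\t84.3\n'
  printf 'd2gkma_\td2bkma_\t37.3\t67\t38\t2\t13\t76\t5\t70\t5.18e-06\t40.8\n'
} > "$sample"

# Column 1 = query, 2 = target, 11 = E-value, 12 = bit score.
# Skip self-hits, require E-value < 1e-10 (illustrative cutoff),
# and remember the row with the highest bit score for each query.
best_hits="$(awk -F'\t' '
  $1 != $2 && $11 + 0 < 1e-10 {
    if (!($1 in best) || $12 + 0 > score[$1]) {
      best[$1] = $0
      score[$1] = $12 + 0
    }
  }
  END { for (q in best) print best[q] }
' "$sample" | sort)"

printf '%s\n' "$best_hits"
rm -f "$sample"
```

To apply the same filter to a real run, replace "$sample" with out.tsv.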