当前位置:   article > 正文

PSP - D-I-TASSER DeepMSA2 源码简读

deepmsa2

欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://blog.csdn.net/caroline_wendy/article/details/130519945

DIT

DIT:https://zhanggroup.org/D-I-TASSER/

D-I-TASSER (Deep-learning based Iterative Threading ASSEmbly Refinement) is a new method extended from I-TASSER for high-accuracy protein structure and function predictions.

D-I-TASSER (Deep-learning based Iterative Threading ASSEmbly Refinement,即基于深度学习的迭代、穿线、组装、精调) ,从 I-TASSER 扩展的新方法,用于高精度的蛋白结构和功能预测。

Starting from a query sequence, D-I-TASSER first generates inter-residue contact and distance maps and hydrogen-bond (HB) networks using multiple deep neural-network predictors, including AttentionPotential (self-attention network built on MSA transformer) and DeepPotential.

从一个查询序列开始,D-I-TASSER 首先使用多个深度神经网络预测器,生成,残基间接触 (inter-residue contact) 、距离图 (distance maps) 以及 氢键 (HB) 网络,包括 AttentionPotential (基于 MSA Transformer 的自注意力网络) 和 DeepPotential 。

Meanwhile, it identifies structural templates by the meta-threading LOMETS3 approach, which includes the models built from the state-of-the-art AlphaFold2 program.

同时,通过 元穿线 (meta-threading) LOMETS3 方法,识别结构模板,其中包括,由最先进的AlphaFold2程序构建的模型。

The full-length atomic models are finally assembled by iterative fragment assembly Monte Carlo simultions under the guidance of I-TASSER force field and deep-learning contact/distance/HB restraints, where biological functions of the query protein are derived from the structure models by COFACTOR.

最终,全长原子模型,在 I-TASSER 力场和深度学习 接触(contact) / 距离(distance) / HB 约束的指导下,通过迭代的 片段组装蒙特卡罗模拟 (fragment assembly Monte Carlo simultions) 进行组装,其中查询蛋白的生物学功能,由 COFACTOR 从结构模型中推导出来。

The D-I-TASSER pipeline (as ‘UM-TBM’ and ‘Zheng’) was ranked as the No. 1 server/predictor in all categories of protein structure prediction in the most recent CASP15 experiment, including Multi-domain Targets, Single-domain Targets, and Multi-chain Targets.

D-I-TASSER 流程 (作为“UM-TBM”和“Zheng”),在最近的 CASP15 实验中,被评为所有类别蛋白质结构预测的第一名,在服务器组和预测器组中,包括多域目标、单域目标和多链目标。

DeepMSA2

DeepMSA2:https://zhanggroup.org/DeepMSA/

DeepMSA (version 2) is a hierarchical approach to create high-quality multiple sequence alignments (MSAs) for monomer and multimer proteins.

DeepMSA (version 2) 是层次方法,用于创建单体和多体蛋白质的高质量多序列比对 (MSAs)。

The method is built on iterative sequence database searching followed by fold-based MSA ranking and selection.

该方法基于迭代的序列数据库搜索,然后,根据折叠为基础的 MSA 排名和选择。

For protein monomers, MSAs are produced with three iterative MSA searching pipelines (dMSA, qMSA and mMSA) through whole-genome (Uniclust30 and UniRef90) and metagenome (Metaclust, BFD, Mgnify, TaraDB, MetaSourceDB and JGIclust) sequence databases.

对于蛋白质单体,通过全基因组 (Uniclust30 和 UniRef90) 和宏基因组 (Metaclust、BFD、Mgnify、TaraDB、MetaSourceDB 和 JGIclust) 序列数据库,使用三种迭代的 MSA 搜索流程 (dMSA、qMSA 和 mMSA) 生成 MSAs。

For protein multimers, a number of hybrid MSAs are created by pairing the sequences from monomer MSAs of the component chains, with the optimal multimer MSAs selected based on a combined score of MSA depth and folding score of the monomer chains.

对于蛋白质多体,通过将来自组分链的单体 MSAs 的序列配对,创建了一些混合 MSAs,根据组合得分,包括 MSA 深度和单体链的折叠得分,选择最佳的多体 MSAs。

Large-scale benchmark data show significant advantage of DeepMSA2 in generating accurate MSAs with balanced depth and alignment coverage which are most suitable for deep-learning based protein and protein complex stucture and function predictions.

大规模的基准数据显示,DeepMSA2 在生成准确的 MSAs 方面具有显著的优势,这些 MSAs 具有平衡的深度和比对覆盖度,最适合基于深度学习的蛋白质以及蛋白质复合物结构和功能预测。

DeepMSA2 整体架构:
DeepMSA2

1. DeepMSA2 (no IMG)

核心逻辑位于DeepMSA2_noIMG.pl

dq_time = time.time()
subprocess.run([f"contact/DeepMSA2/scripts/DeepMSA2_noIMG.pl",
                self.seqname, self.datadir, self.pkgdir, self.libdir, python2])
print(f"[Info] dMSA和qMSA步运行时间: {time_elapsed(dq_time, time.time())}")
  • 1
  • 2
  • 3
  • 4

(1) 运行脚本qMSA2.py,即qMSA:

$python2 $HHLIB/scripts/qMSA2.py \\
    -hhblitsdb=$qhhblitsdb \\
    -jackhmmerdb=$qjackhmmerdb \\
    -hhblits3db=$qhhblits3db \\
    -hmmsearchdb=$qhmmsearchdb \\
    -tmpdir=/tmp/$ENV{USER}/$tag \\
    -ncpu=$cpu \\
    $datadir/MSA/qMSA.fasta
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

(2) 运行脚本build_MSA.py,即dMSA,DeepMSA:

$python2 $HHLIB/scripts/build_MSA.py \\
    -hhblitsdb=$dhhblitsdb \\
    -jackhmmerdb=$djackhmmerdb \\
    -hmmsearchdb=$dhmmsearchdb \\
    -tmpdir=/tmp/$ENV{USER}/$tag \\
    -ncpu=$cpu \\
    $datadir/MSA/DeepMSA.fasta
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

1.1 qMSA

qMSA

源码位于qMSA2.py,包括4个部分,即Uniclust30、UniRef90、BFD、MGnify。使用级联搜索,优先以Neff值高的MSA作为索引,最终,选择Neff值最高的作为qMSA。

第1部分

  • 搜索算法:hhblits
  • 搜库:Uniclust30 (uniclust30_2017_04),约64G,时间2017.04,最新是2022.02
  • 输出:qMSA.hhba3m.gz、qMSA.hhbaln.gz
  • 时间:19.1695s (117)

源码如下:

print("[Info] " + "+" * 100)
q1_time = time.time()
print("[Info] #### run hhblits ####")
hhblits_prefix=os.path.join(tmpdir,"hhblits")
if db_dict["hhblitsdb"]:
    c1 = overwrite_dict["hhblits"]
    c2 = not os.path.isfile(prefix+".hhbaln.gz")
    c3 = not os.path.isfile(prefix+".hhba3m.gz")
    print("[Info] 是否运行: {}, 具体条件: {} or {} or {}".format(c1 or c2 or c3, c1, c2, c3))
    if c1 or c2 or c3:
        # generates hhblits_prefix.a3m hhblits_prefix.aln
        # hhblits_prefix.60.a3m hhblits_prefix.60.aln
        sys.stdout = open(os.devnull, 'w')  # 关闭日志
        hhb_nf = run_hhblits(query_fasta, db_dict["hhblitsdb"], ncpu, hhblits_prefix)
        plain2gz(hhblits_prefix + ".aln", prefix + ".hhbaln.gz")
        plain2gz(hhblits_prefix + ".a3m", prefix + ".hhba3m.gz")
        sys.stdout = sys.__stdout__  # 打开日志
    else:
        gz2plain(prefix + ".hhbaln.gz", hhblits_prefix + ".aln")
        gz2plain(prefix + ".hhba3m.gz", hhblits_prefix + ".a3m")
        hhb_nf = getNf(hhblits_prefix)
        sys.stdout.write("%s and %s exists, skip hhblitsdb\n"%(prefix+".hhbaln",prefix+".hhba3m"))

    nf = hhb_nf
    print("[Info] hhb_nf = {}, target_nf = {}, passed: {}".format(nf, target_nf, hhb_nf >= target_nf))
    if hhb_nf >= target_nf:   # 满足条件直接停止
        shutil.copyfile(hhblits_prefix+".aln", prefix+".aln")
        shutil.copyfile(hhblits_prefix+".a3m", prefix+".a3m")
        sys.stdout.write("Final MSA by hhblits with Nf >=%.1f\n" % nf)
        if os.environ.get("DITADDUP", "N") == "N":
            return nf
print("[Info] #### FINISH run hhblits ####")
print("[Info] 数据库: {}".format(db_dict["hhblitsdb"]))
print("[Info] qMSA 第1步耗时: {} 输出: {}".format((time.time() - q1_time), prefix + ".hhbaln.gz(hhba3m.gz)"))
print("[Info] " + "+" * 100)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35

Uniclust30:https://wwwuser.gwdg.de/~compbiol/uniclust/2022_02/

  • 最新版本是2022.02

Uniclust30

第2部分

  • 搜索算法:jackblits
  • 搜库:UniRef90 (uniref90.fasta),约74G
  • 额外输入:hhblits.a3m,搜子库 jackblits-mydb (4.9M)
  • 输出:qMSA.jacaln.gz、qMSA.jaca3m.gz
  • 时间:230.8850s (117)

源码如下:

print("[Info] " + "+" * 100)
q2_time = time.time()
print("[Info] #### run jack_hhblits ####")
jackblits_prefix=os.path.join(tmpdir,"jackblits")
if db_dict["jackhmmerdb"]:
    c1 = overwrite_dict["jackhmmer"]
    c2 = not os.path.isfile(prefix+".jacaln.gz")
    c3 = not os.path.isfile(prefix+".jaca3m.gz")
    print("[Info] 是否运行: {}, 具体条件: {} or {} or {}".format(c1 or c2 or c3, c1, c2, c3))
    if c1 or c2 or c3:
        # generates jackblits_prefix.a3m jackblits_prefix.aln
        # jackblits_prefix.60.a3m jackblits_prefix.60.aln
        sys.stdout = open(os.devnull, 'w')  # 关闭日志
        jack_nf=run_jackblits(query_fasta, db_dict["jackhmmerdb"].split(':'), ncpu, hhblits_prefix, jackblits_prefix)
        plain2gz(jackblits_prefix+".aln",prefix+".jacaln.gz")
        plain2gz(jackblits_prefix+".a3m",prefix+".jaca3m.gz")
        sys.stdout = sys.__stdout__  # 打开日志
    else:
        gz2plain(prefix+".jacaln.gz",jackblits_prefix+".aln")
        gz2plain(prefix+".jaca3m.gz",jackblits_prefix+".a3m")
        jack_nf=getNf(jackblits_prefix)
        sys.stdout.write("%s and %s exists, skip jackhmmerdb\n"%(prefix+".jacaln.gz",prefix+".jaca3m.gz"))
    
    nf = max([jack_nf, hhb_nf])
    print("[Info] jack_nf = {}, max nf = {}, target_nf = {}, passed: {}"
          .format(jack_nf, nf, target_nf, nf >= target_nf))
    if jack_nf >= target_nf:
        shutil.copyfile(jackblits_prefix+".aln",prefix+".aln")
        shutil.copyfile(jackblits_prefix+".a3m",prefix+".a3m")
        sys.stdout.write("Final MSA by jackhmmer with Nf >=%.1f\n"%nf)
        if os.environ.get("DITADDUP", "N") == "N":
            return nf
print("[Info] #### FINISH run jack_hhblits ####")
print("[Info] 数据库: {}".format(db_dict["jackhmmerdb"]))
print("[Info] 是否使用 qMSA 第1步的输出: {}".format(os.path.isfile(hhblits_prefix + ".a3m")))
print("[Info] qMSA 第2步耗时: {} 输出: {}".format((time.time() - q2_time), prefix + ".jacaln.gz(jaca3m.gz)"))
print("[Info] " + "+" * 100)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37

第3部分

  • 搜索算法:hhblits3
  • 搜库:BFD(bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt),1.8T
  • 额外输入:优先使用第2步的输出 (jackblits.a3m),其次第1步 (hhblits.a3m),搜子库hhblits3-mydb (1.5M)。
  • 输出:qMSA.hh3aln.gz、qMSA.hh3a3m.gz
  • 时间:276.7781s (117)

源码如下:

print("[Info] " + "+" * 100)
q3_time = time.time()
print("[Info] #### run hhblits3 ####") # , flush=True)
hhblits3_prefix=os.path.join(tmpdir, "hhblits3")
if db_dict["hhblits3db"]:
    c1 = overwrite_dict["hhblits3"]
    c2 = not os.path.isfile(prefix+".hh3aln.gz")
    c3 = not os.path.isfile(prefix+".hh3a3m.gz")
    print("[Info] 是否运行: {}, 具体条件: {} or {} or {}".format(c1 or c2 or c3, c1, c2, c3))
    if c1 or c2 or c3:
        sys.stdout = open(os.devnull, 'w')  # 关闭日志
        hh3_nf=run_hhblits3(query_fasta, db_dict["hhblits3db"].split(':'),
                            ncpu, hhblits_prefix, jackblits_prefix, hhblits3_prefix)
        plain2gz(hhblits3_prefix+".aln", prefix+".hh3aln.gz")
        plain2gz(hhblits3_prefix+".a3m", prefix+".hh3a3m.gz")
        sys.stdout = sys.__stdout__  # 打开日志
    else:
        gz2plain(prefix+".hh3aln.gz",hhblits3_prefix+".aln")
        gz2plain(prefix+".hh3a3m.gz",hhblits3_prefix+".a3m")
        hh3_nf=getNf(hhblits3_prefix)
        sys.stdout.write("%s and %s exists, skip hhblits3db\n"%(prefix+".hh3aln.gz",prefix+".hh3a3m.gz"))
    nf = max([hh3_nf, jack_nf, hhb_nf])
    print("[Info] hh3_nf = {}, max nf = {}, target_nf = {}, passed: {}"
          .format(hh3_nf, nf, target_nf, hh3_nf >= target_nf))
    if hh3_nf >= target_nf:
        shutil.copyfile(hhblits3_prefix+".aln",prefix+".aln")
        shutil.copyfile(hhblits3_prefix+".a3m",prefix+".a3m")
        sys.stdout.write("Final MSA by hhblits3 with Nf >=%.1f\n"%nf)
        if os.environ.get("DITADDUP", "N") == "N":
            return nf
print("[Info] 数据库: {}".format(db_dict["hhblits3db"]))
print("[Info] 是否使用 qMSA 第1-2步的输出: {}, {}".format(
    os.path.isfile(jackblits_prefix+".a3m"), jackblits_prefix+".a3m"))
print("[Info] qMSA 第3步耗时: {} 输出: {}".format((time.time() - q3_time), prefix + ".hh3aln.gz(hh3a3m.gz)"))
print("[Info] " + "+" * 100)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35

第4部分

  • 搜索算法:hmsblits (hmmsearch)
  • 搜库:MGnify(mgy_clusters.clean.fasta),60G
  • 额外输入:使用最大NF值的a3m
  • 输出:qMSA.hmsaln.gz、qMSA.hmsa3m.gz
  • 时间:343.5184 (117)

源码如下:

q4_time = time.time()
print("[Info] " + "+" * 100)
print("[Info] #### run hmmsearch ####") # , flush=True)
hmmsearch_prefix = os.path.join(tmpdir,"hmmsearch")
if db_dict["hmmsearchdb"]:
    c1 = overwrite_dict["hmmsearch"]
    c2 = not os.path.isfile(prefix+".hmsaln.gz")
    c3 = not os.path.isfile(prefix+".hmsa3m.gz")
    print("[Info] 是否运行: {}, 具体条件: {} or {} or {}".format(c1 or c2 or c3, c1, c2, c3))
    if c1 or c2 or c3:
        # TODO: 建议使用字典
        # 优先处理 jumpstart_profile_prefix,其次处理 query_profile_prefix
        query_profile_prefix = jackblits_prefix  # 使用最大的NF值
        if jack_nf < hhb_nf:
            query_profile_prefix = hhblits_prefix
        jumpstart_profile_prefix = hhblits3_prefix  # 使用最大的NF值
        if hh3_nf < jack_nf:
            jumpstart_profile_prefix = jackblits_prefix
            if jack_nf < hhb_nf:
                jumpstart_profile_prefix = hhblits_prefix

        # generates hmmsearch_prefix.afq hmmsearch_prefix.hmm
        # hmmsearch_prefix.redundant hmmsearch_prefix.nonredundant
        # hmmsearch_prefix.aln
        sys.stdout = open(os.devnull, 'w')  # 关闭日志
        hms_nf=search_metaclust(query_fasta, sequence, query_profile_prefix, jumpstart_profile_prefix,
                                db_dict["hmmsearchdb"].split(':'),ncpu,hmmsearch_prefix)
        plain2gz(hmmsearch_prefix+".aln",prefix+".hmsaln.gz")
        plain2gz(hmmsearch_prefix+".a3m",prefix+".hmsa3m.gz")
        sys.stdout = sys.__stdout__  # 打开日志
        print("[Info] 是否使用 qMSA 第1-3步的输出: {}".format(jumpstart_profile_prefix + ".a3m"))
    else:
        gz2plain(prefix+".hmsaln.gz",hmmsearch_prefix+".aln")
        gz2plain(prefix+".hmsa3m.gz",hmmsearch_prefix+".a3m")
        sys.stdout.write("%s and %s exists, skip hmmsearchdb\n"%(prefix+".hmsaln.gz",prefix+".hmsa3m.gz"))
        hms_nf=getNf(hmmsearch_prefix)
    nf = max([hh3_nf, jack_nf, hhb_nf, hms_nf])
    print("[Info] NF - hhb_nf(s1): {}, jack_nf(s2): {}, hh3_nf(s3): {}, hms_nf(s4): {}"
          .format(hhb_nf, jack_nf, hh3_nf, hms_nf))
    print("[Info] hms_nf = {}, max nf = {} (target_nf), passed: {}".format(hms_nf, nf, hms_nf >= nf))
    if hms_nf > nf:  # hmmsearch replaces jackblits and hhblits result
        nf = hms_nf
        shutil.copyfile(hmmsearch_prefix+".aln", prefix+".aln")
        shutil.copyfile(hmmsearch_prefix+".a3m", prefix+".a3m")
        sys.stdout.write("Final MSA by hmmsearch with Nf >=%.1f\n"%nf)
        if os.environ.get("DITADDUP", "N") == "N":
            return nf
print("[Info] 数据库: {}".format(db_dict["hmmsearchdb"]))
print("[Info] qMSA 第4步耗时: {} 输出: {}".format((time.time() - q4_time), prefix + ".hmsaln.gz(hmsa3m.gz)"))
print("[Info] " + "+" * 100)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50

最终,根据Neff值,在4组MSA中,选出最终的MSA,作为qMSA.a3m,理论上Neff值越来越高。例如:

NF - hhb_nf(s1): 7.98306, jack_nf(s2): 20.9298, hh3_nf(s3): 22.5877, hms_nf(s4): 25.3418
MSA Len - hhb: 121, jac: 402, hh3: 492, hms: 568
  • 1
  • 2

源码如下:

# 选择最优的NF作为qMSA的最终输出
print("[Info] NF - hhb_nf(s1): {}, jack_nf(s2): {}, hh3_nf(s3): {}, hms_nf(s4): {}"
      .format(hhb_nf, jack_nf, hh3_nf, hms_nf))
nf = max([hh3_nf, jack_nf, hhb_nf, hms_nf])
print("[Info] hms_nf = {}, max nf = {} (target_nf), passed: {}".format(hms_nf, nf, hms_nf >= nf))
if hms_nf >= nf:  # hmmsearch replaces jackblits and hhblits result
    nf = hms_nf
    shutil.copyfile(hmmsearch_prefix + ".aln", prefix + ".aln")
    shutil.copyfile(hmmsearch_prefix + ".a3m", prefix + ".a3m")
    sys.stdout.write("[Info] step4 - hmmsearch MSA has %.1f Nf. Output anyway.\n" % hms_nf)
else:
    if hh3_nf >= max(jack_nf, hhb_nf):
        shutil.copyfile(hhblits3_prefix+".aln", prefix+".aln")
        shutil.copyfile(hhblits3_prefix+".a3m", prefix+".a3m")
        sys.stdout.write("[Info] step3 - hhblits3 MSA has %.1f Nf. Output anyway.\n"%hh3_nf)
    elif jack_nf >= max(hh3_nf, hhb_nf):
        shutil.copyfile(jackblits_prefix+".aln", prefix+".aln")
        shutil.copyfile(jackblits_prefix+".a3m", prefix+".a3m")
        sys.stdout.write("[Info] step2 - jackhmmer MSA has %.1f Nf. Output anyway.\n"%jack_nf)
    else:  # hhb_nf >= nf:
        shutil.copyfile(hhblits_prefix+".aln", prefix+".aln")
        shutil.copyfile(hhblits_prefix+".a3m", prefix+".a3m")
        sys.stdout.write("[Info] step1 - hhblits MSA has %.1f Nf. Output anyway.\n"%hhb_nf)
# prefix 即 output 的目录
print("[Info] qMSA 执行完成 {}, 输出 {}".format((time.time() - q1_time), prefix+".aln|a3m"))
print("[Info] " + "+" * 100)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26

Neff,在MSA中,默认高质量阈值是128,公式如下:

Neff

1.2 dMSA (DeepMSA)

dMSA

源码位于build_MSA.py,包括4个部分,即Uniclust30、UniRef90、Metaclust&MGnify,最后一步,联合搜索Metaclust&MGnify。

第1部分:与qMSA完全相同,包括输出MSA的Neff值。

第2部分:与qMSA完全相同,包括输出MSA的Neff值。

第3部分,目标就是搜索,第3部分:

  • 搜索算法:hmsblits (hmmsearch)
  • 搜库:Metaclust(metaclust.fasta),153G;MGnify(mgy_clusters.clean.fasta),60G
  • 额外输入:使用最大NF值的a3m
  • 输出:DeepMSA.hmsaln、DeepMSA.hmsa3m
  • 时间:1086.5113 (117),1T大约84.9765

源码如下:

print("[Info] " + "+" * 100)
d3_time = time.time()
print("[Info] #### run hmmsearch ####") # , flush=True)
hmmsearch_prefix=os.path.join(tmpdir,"hmmsearch")
if db_dict["hmmsearchdb"]:
    c1 = overwrite_dict["hmmsearch"]
    c2 = not os.path.isfile(prefix+".hmsaln")
    c3 = not os.path.isfile(prefix+".hmsa3m")
    c4 = build_hmmsearch_db
    print("[Info] 是否运行: {}, 具体条件: {} or {} or {} or {}".format(c1 or c2 or c3 or c4, c1, c2, c3, c4))
    if c1 or c2 or c3 or c4:
        # generates hmmsearch_prefix.afq hmmsearch_prefix.hmm
        # hmmsearch_prefix.redundant hmmsearch_prefix.nonredundant
        # hmmsearch_prefix.aln
        sys.stdout = open(os.devnull, 'w')  # 关闭日志
        hms_nf=search_metaclust(query_fasta, sequence,
            hhblits_prefix if jack_nf < hhb_nf else jackblits_prefix,
            db_dict["hmmsearchdb"].split(':'), ncpu, hmmsearch_prefix)
        shutil.copyfile(hmmsearch_prefix + ".aln", prefix + ".hmsaln")
        if os.path.isfile(hmmsearch_prefix + ".a3m"):
            shutil.copyfile(hmmsearch_prefix + ".a3m", prefix + ".hmsa3m")
        sys.stdout = sys.__stdout__  # 打开日志
    else:
        shutil.copyfile(prefix+".hmsaln",hmmsearch_prefix+".aln")
        if os.path.isfile(prefix+".hmsa3m"):
            shutil.copyfile(prefix+".hmsa3m",hmmsearch_prefix+".a3m")
        sys.stdout.write("[Info] %s exists, skip hmmsearchdb\n"%(prefix+".hmsaln"))
        hms_nf = getNf(hmmsearch_prefix)

print("[Info] 数据库: {}".format(db_dict["hmmsearchdb"]))
print("[Info] dMSA 第3步耗时: {} 输出: {}".format((time.time() - d3_time), prefix + ".hmsaln(hmsa3m)"))
print("[Info] " + "+" * 100)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32

其余,也与qMSA相同,也是根据Neff选择最优的MSA,例如:

NF - hhb_nf(s1): 7.98306, jack_nf(s2): 20.9298, hms_nf(s3): 23.6956
  • 1

2. DeepMSA2 (IMG)

检测是否需要运行IMG:

next if (-f "$datadir/JGI/$DBfasta.cdhit");  # 检查是否需要运行
  • 1

运行脚本:DeepMSA2_IMG.pl

第1部分

  • 使用qhmmsearch搜索qMSA.hh3aln.gz
  • 输出:DB.fasta.bq.cdhitDB.fasta.bq.cdhit.clstr
  • 耗时:24.9616s

源码如下:

cp $datadir/MSA/qMSA.hh3aln.gz seq.hh3aln.gz
gzip -df seq.hh3aln.gz
if [ ! -s "seq.hh3aln" ];then
    cp $datadir/MSA/DeepMSA.jacaln seq.hh3aln
fi
if [ ! -s "seq.hh3aln" ];then
    cp $datadir/MSA/DeepMSA.hhbaln seq.hh3aln
fi

sed = seq.hh3aln |sed 'N;s/\\n/\\t/'|sed 's/^/>/g'|sed 's/\\t/\\n/g'| $HHLIB/bin/qhmmbuild -n aln --amino -O seq.afq --informat afa seq.hmm -

$HHLIB/qhmmer_new/qhmmsearch --cpu 192 -E 10 --incE 1e-3 -A $DBfasta.match --tblout $DBfasta.tbl -o $DBfasta.out seq.hmm $JGI/$DBfasta
$HHLIB/bin/esl-sfetch -f $JGI/$DBfasta $DBfasta.tbl|sed 's/*//g' > $DBfasta.fseqs
$HHLIB/bin/cd-hit -i $DBfasta.fseqs -o $datadir/JGI/$DBfasta.cdhit -c 1 -M 3000
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

第2部分:常规搜索,类似qMSA和dMSA,输出DeepJGI.alnq3JGI.alnq4JGI.aln三个文件。

源码如下:

my $cmd="$HHLIB/scripts/qMSA2.py -hhblitsdb=$hhblitsdb -jackhmmerdb=$jackhmmerdb -hhblits3db=$hhblits3db -hmmsearchdb=$tmpdir/DB.fasta.fseqs -tmpdir=$tmpdir/MSA $datadir/JGI/q4JGI.fasta";
...
my $cmd="$HHLIB/scripts/qMSA2.py -hhblitsdb=$hhblitsdb -jackhmmerdb=$jackhmmerdb -hmmsearchdb=$tmpdir/DB.fasta.fseqs -tmpdir=$tmpdir/MSA $datadir/JGI/q3JGI.fasta";
...
my $cmd="$HHLIB/scripts/build_MSA.py -hhblitsdb=$hhblitsdb -jackhmmerdb=$jackhmmerdb -hmmsearchdb=$tmpdir/DB.fasta.fseqs -tmpdir=$tmpdir/MSA $datadir/JGI/DeepJGI.fasta";
  • 1
  • 2
  • 3
  • 4
  • 5

3. MSA 评估

使用官方的AF2 (AlphaFold2),评估MSA,区别是模型只使用model1_1、调整(relax)1次,因此,只输出1个模型,通过pLDDT进行排序。

具体步骤如下:

  1. 整理所有MSA,包括qMSA、dMSA和JGI-MSA,去重,保留7个MSA。
  2. 将不同的MSA建立不同的文件夹,文件夹中只有seq.aln、seq.a3m两种格式。
  3. 通过AF2预测结构,输入至MSA文件夹的seq目录。
  4. 整理评估结果至 AF2models 文件夹,MSA 信息位于 model.info 文件。
  5. 最优的MSA,位于MSA文件夹中,即 protein.a3m、protein.aln。

参考脚本:

  1. I-TASSERmod/run_AF2_multiMSA.py
  2. thirdparty/alphafold2/run_alphafold_msa_benchmark.sh
  3. thirdparty/alphafold2/run_alphafold_msa_benchmark.py

AF2模型位置位于:thirdparty/alphafold2/run_alphafold_msa_benchmark.py

model_params = data.get_model_haiku_params(
    model_name=model_name, data_dir=FLAGS.data_dir)
...

# thirdparty/alphafold2/alphafold/model/data.py
def get_model_haiku_params(model_name: str, data_dir: str) -> hk.Params:
  """Get the Haiku parameters from a model name."""

  path = os.path.join(data_dir, 'params', f'params_{model_name}.npz')

  with open(path, 'rb') as f:
    params = np.load(io.BytesIO(f.read()), allow_pickle=False)

  return utils.flat_params_to_haiku(params)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

官网:https://github.com/deepmind/alphafold

Model

参考

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/小蓝xlanll/article/detail/273966
推荐阅读
相关标签
  

闽ICP备14008679号