Training a Text Classification Model with Self-Trained Word Vectors in spaCy v3.0

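The earlier article "Training, Evaluating, Packaging, and Preprocessing Data for a Text Classification Model in spaCy v3.0" used the pretrained word vectors that ship with spaCy's "zh_core_web_lg" pipeline. A second article, "Training Word Vectors with Gensim on a Small, Highly Relevant, Domain-Specific Corpus", trained custom word vectors on our own corpus.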

How do we use those self-trained word vectors to train a text classification model?

1 Save and convert the word vectors

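from gensim.models import FastText

# Load the FastText model trained earlier with Gensim
model = FastText.load('fasttext.bin')
# Export the vectors in word2vec text format, which spacy init vectors can read
model.wv.save_word2vec_format('fasttext_100.txt')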
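Before handing the exported file to spaCy, it can help to confirm that it really is in word2vec text format: a header line with the vector count and dimensionality, followed by one line per token. The following is a minimal sketch of such a check; the function name `inspect_word2vec_lines` is hypothetical and not part of Gensim or spaCy.

```python
def inspect_word2vec_lines(lines):
    """Validate word2vec text-format lines and return (vector_count, dimensions).

    `lines` is any iterable of strings, for example an open text file.
    Assumes tokens contain no whitespace.
    """
    it = iter(lines)
    header = next(it).split()
    count, dim = int(header[0]), int(header[1])

    rows = 0
    for line in it:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if len(parts) != dim + 1:
            raise ValueError(
                f"row {rows + 1}: expected {dim} values, got {len(parts) - 1}"
            )
        for value in parts[1:]:
            float(value)  # raises ValueError on a malformed number
        rows += 1

    if rows != count:
        raise ValueError(f"header declares {count} vectors, found {rows}")
    return count, dim
```

It could be applied to the exported file like this (assuming the file sits in the working directory):

```python
with open("fasttext_100.txt", encoding="utf-8") as f:
    print(inspect_word2vec_lines(f))
```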

2 Compress fasttext_100.txt into vectors.zip and copy it into the assets directory of the text classification project
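The compression and copy can also be done from Python instead of by hand. This is a sketch using only the standard library; the helper name `zip_vectors` is hypothetical, and it deliberately keeps the archive to a single member so there is no ambiguity about which file spaCy should read.

```python
import zipfile
from pathlib import Path


def zip_vectors(src_path, zip_path):
    """Write src_path into a new zip archive at zip_path and return the archive path."""
    src_path, zip_path = Path(src_path), Path(zip_path)
    # Create the target directory (for example assets/) if it does not exist yet
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any leading directories so the archive holds just the file
        zf.write(src_path, arcname=src_path.name)
    return zip_path
```

From the project directory, a call such as `zip_vectors("fasttext_100.txt", "assets/vectors.zip")` would produce the archive the next step expects.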

3 Edit the project.yml file of the text classification project:

commands:
  - name: init-vectors
    help: Download vectors and convert to model
    script:
      - "python -m spacy init vectors zh assets/vectors.zip assets/zh_fasttext_vectors"
    deps:
      - "assets/vectors.zip"
    outputs_no_cache:
      - "assets/zh_fasttext_vectors"

4 Run the following on the command line from the text classification project directory:

spacy project run init-vectors

Output:

================================ init-vectors ================================
Running command: 'c:\users\administrator\appdata\local\programs\python\python37\python.exe' -m spacy init vectors zh assets/vectors.zip assets/zh_fasttext_vectors
ℹ Creating blank nlp object for language 'zh'
[2021-03-31 15:57:12,988] [INFO] Reading vectors from assets\vectors.zip
2845it [00:00, 21086.74it/s]
[2021-03-31 15:57:13,129] [INFO] Loaded vectors from assets\vectors.zip
✔ Successfully converted 2845 vectors
✔ Saved nlp object with vectors to output directory. You can now use the path to
it in your config as the 'vectors' setting in [initialize].

This creates a zh_fasttext_vectors directory under assets, with the following structure:

[Figure 13.1.png: structure of the assets/zh_fasttext_vectors directory (image unavailable)]
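Since the screenshot of the directory did not survive, one way to see the structure on your own machine is to walk the generated directory. This is a small standard-library sketch; `list_tree` is a hypothetical helper name, and what it prints depends on your spaCy version.

```python
from pathlib import Path


def list_tree(root):
    """Return the sorted relative paths under root, with '/' appended to directories."""
    root = Path(root)
    entries = []
    for path in root.rglob("*"):
        rel = path.relative_to(root).as_posix()
        entries.append(rel + "/" if path.is_dir() else rel)
    return sorted(entries)
```

Running `print("\n".join(list_tree("assets/zh_fasttext_vectors")))` from the project directory lists every file and subdirectory that `spacy init vectors` wrote.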

5 Update the project's configuration file, configs/config.cfg:

[pretraining]

[initialize]
vectors = "assets/zh_fasttext_vectors"

# Previously:
# vectors = "zh_core_web_lg"
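A quick sanity check that the edit took effect is to read the `vectors` setting back out of the file. The sketch below uses Python's standard `configparser` rather than spaCy's own config loader, so it is only an approximation of how spaCy parses the file; the function name `read_initialize_vectors` is hypothetical.

```python
import configparser
import json


def read_initialize_vectors(config_text):
    """Return the [initialize] vectors value from config text, or None if unset."""
    # interpolation=None leaves spaCy-style ${section.key} references untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    raw = parser.get("initialize", "vectors", fallback=None)
    if raw is None:
        return None
    try:
        # Values in the config are written as JSON, e.g. "assets/..." or null
        return json.loads(raw)
    except ValueError:
        # Not JSON (for example an unresolved ${paths.vectors} reference)
        return raw
```

For example, `read_initialize_vectors(open("configs/config.cfg", encoding="utf-8").read())` should return the path set in this step if the file was saved correctly.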

6 Train the model:

spacy project run train

Output:

=================================== train ===================================
Running command: 'c:\users\administrator\appdata\local\programs\python\python37\python.exe' -c '"import os; os.makedirs(os.path.join('"'"'training'"'"', '"'"'config'"'"'))"'
Running command: 'c:\users\administrator\appdata\local\programs\python\python37\python.exe' -m spacy train ./configs/config.cfg -o training/config --gpu-id -1
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-03-31 15:59:14,289] [INFO] Set up nlp object from config
[2021-03-31 15:59:14,304] [INFO] Pipeline: ['tok2vec', 'textcat']
[2021-03-31 15:59:14,312] [INFO] Created vocabulary
[2021-03-31 15:59:14,647] [INFO] Added vectors: assets/zh_fasttext_vectors
[2021-03-31 15:59:14,648] [INFO] Finished initializing nlp object
WARNING: features.msgpack does not exist, try loading features.pkl
[2021-03-31 15:59:20,586] [INFO] Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.88        0.00    0.00
  0     200          6.85        141.26       53.80    0.54
  1     400         16.89         77.29       73.09    0.73
  1     600         19.11         42.11       76.79    0.77
  2     800         34.89         34.45       79.56    0.80
  3    1000         26.01         27.75       81.92    0.82
  4    1200          6.29          8.05       80.83    0.81
  6    1400          7.62          8.14       75.74    0.76
  7    1600         15.94          3.76       82.12    0.82
 10    1800          1.71          1.31       82.54    0.83
 12    2000          0.68          0.35       81.38    0.81
 15    2200          0.54          0.22       80.78    0.81
 18    2400          2.44          0.31       82.26    0.82
 21    2600          0.39          0.16       81.98    0.82
 24    2800          2.63          0.30       81.06    0.81
 27    3000          0.17          0.07       81.37    0.81
 30    3200          0.48          0.39       79.65    0.80
 33    3400          0.51          0.40       77.96    0.78
✔ Saved pipeline to output directory
training\config\model-last
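In the table above, CATS_SCORE peaks at 82.54 at step 1800 and has dropped to 77.96 by the final row, so the last checkpoint is not the best one. spaCy's train command normally writes a model-best directory alongside model-last for this reason. As a rough illustration, the sketch below pulls the best-scoring row out of a pasted copy of the log; the function names are hypothetical, and the parser assumes the six-column layout shown here.

```python
def parse_training_rows(log_text):
    """Extract the numeric rows of a spaCy training table as dictionaries."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # headers and status lines have a different column count
        try:
            epoch, step = int(parts[0]), int(parts[1])
            loss_tok2vec, loss_textcat, cats_score, score = map(float, parts[2:])
        except ValueError:
            continue  # e.g. the '---' separator row
        rows.append({
            "epoch": epoch,
            "step": step,
            "loss_tok2vec": loss_tok2vec,
            "loss_textcat": loss_textcat,
            "cats_score": cats_score,
            "score": score,
        })
    return rows


def best_row(rows):
    """Return the row with the highest CATS_SCORE."""
    return max(rows, key=lambda row: row["cats_score"])
```

Applied to the log in this article, `best_row(parse_training_rows(log_text))["step"]` gives 1800.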

7 Test the new model:

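import spacy

# Load the best checkpoint written during training
nlp = spacy.load('training/config/model-best')
texts = ['变频装置操作原则', '变频装置送电启动前检查项目', '凝泵变频器检修转热备用']
# nlp.pipe processes the texts as a stream and yields one Doc per text
for doc in nlp.pipe(texts):
    print(doc.text)
    print(doc.cats)

Output:
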
变频装置操作原则
{'A': 5.065050117991632e-06,
'B': 0.0012312011094763875,
'C': 2.4202020085795084e-06,
'E': 8.4429457274382e-06,
'M': 1.0350024695071625e-06,
'O': 0.9986792206764221,
'R': 7.234238728415221e-05,
'S': 2.103930398789089e-07}
变频装置送电启动前检查项目
{'A': 1.6673536720190896e-06,
'B': 0.9929683804512024,
'C': 4.094211362826172e-06,
'E': 1.6252337218247703e-06,
'M': 1.8851016648113728e-05,
'O': 0.000206330994842574,
'R': 0.006798859685659409,
'S': 2.6590535640025337e-07}
凝泵变频器检修转热备用
{'A': 0.0001738495338940993,
'B': 0.006545054726302624,
'C': 2.776497012746404e-06,
'E': 4.217909008730203e-05,
'M': 0.004038343206048012,
'O': 0.9861150979995728,
'R': 0.0030819326639175415,
'S': 7.60167665703193e-07}

The results are correct!
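Each `doc.cats` dictionary maps every label to a score, so reading off the prediction means picking the highest-scoring label. Here is a minimal sketch of that step; the helper `top_category` and its `threshold` parameter are assumptions added for illustration, not part of spaCy.

```python
def top_category(cats, threshold=0.5):
    """Return (label, score) for the highest-scoring category.

    The label is None when the best score falls below `threshold`,
    which is an arbitrary cut-off chosen for this example.
    """
    label = max(cats, key=cats.get)
    score = cats[label]
    if score < threshold:
        return None, score
    return label, score


# Scores taken from the first test sentence above
cats = {
    'A': 5.065050117991632e-06,
    'B': 0.0012312011094763875,
    'C': 2.4202020085795084e-06,
    'E': 8.4429457274382e-06,
    'M': 1.0350024695071625e-06,
    'O': 0.9986792206764221,
    'R': 7.234238728415221e-05,
    'S': 2.103930398789089e-07,
}
label, score = top_category(cats)
print(label, round(score, 4))  # prints: O 0.9987
```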
