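The earlier post "spaCy V3.0 Text Classification: Model Training, Evaluation, Packaging, and Data Preprocessing" used the pretrained word vectors that ship with spaCy, "zh_core_web_lg". The post "Training Word Vectors with Gensim on a Specialized, Highly Correlated, Small Corpus" then trained custom word vectors on a custom corpus.
How can we train the text classification model with our own word vectors?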
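from gensim.models import FastText

# Load the previously trained FastText model and export its vectors in word2vec text format
model = FastText.load('fasttext.bin')
model.wv.save_word2vec_format('fasttext_100.txt')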
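The export writes fasttext_100.txt, while the init-vectors step that follows reads assets/vectors.zip. The post does not show how one becomes the other; presumably the text file was compressed and placed under assets. The sketch below is one way to do that with the standard library, under that assumption. The helper names zip_vectors and read_header are hypothetical, and the three-dimensional demo file is only a stand-in for the real export.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_vectors(txt_path: Path, zip_path: Path) -> None:
    """Compress a word2vec-format text file into a zip archive."""
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(txt_path, arcname=txt_path.name)


def read_header(zip_path: Path) -> tuple:
    """Return (vector_count, dimensions) from the first line of the archived file."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(zf.namelist()[0]) as f:
            count, dim = f.readline().decode("utf-8").split()
    return int(count), int(dim)


# Demo with a tiny stand-in file: the first line is "<count> <dimensions>",
# every following line is a word and its space-separated vector components.
workdir = Path(tempfile.mkdtemp())
txt = workdir / "fasttext_100.txt"
txt.write_text("2 3\n变频 0.1 0.2 0.3\n装置 0.4 0.5 0.6\n", encoding="utf-8")

archive = workdir / "assets" / "vectors.zip"  # assumed target path
zip_vectors(txt, archive)
header = read_header(archive)
print(header)
```

Reading the header back is a cheap sanity check: the vector count should match the number that init-vectors later reports as converted.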
commands:
spacy project run init-vectors
Result:
================================ init-vectors ================================
Running command: 'c:\users\administrator\appdata\local\programs\python\python37\python.exe' -m spacy init vectors zh assets/vectors.zip assets/zh_fasttext_vectors
ℹ Creating blank nlp object for language 'zh'
[2021-03-31 15:57:12,988] [INFO] Reading vectors from assets\vectors.zip
2845it [00:00, 21086.74it/s]
[2021-03-31 15:57:13,129] [INFO] Loaded vectors from assets\vectors.zip
✔ Successfully converted 2845 vectors
✔ Saved nlp object with vectors to output directory. You can now use the path to
it in your config as the 'vectors' setting in [initialize].
This creates a zh_fasttext_vectors directory under assets. (The original screenshot of its structure, 13.1.png, is unavailable.)
[pretraining]
[initialize]
vectors = "assets/zh_fasttext_vectors"
# Originally:
#vectors = "zh_core_web_lg"
spacy project run train
Result:
=================================== train ===================================
Running command: 'c:\users\administrator\appdata\local\programs\python\python37\python.exe' -c '"import os; os.makedirs(os.path.join('"'"'training'"'"', '"'"'config'"'"'))"'
Running command: 'c:\users\administrator\appdata\local\programs\python\python37\python.exe' -m spacy train ./configs/config.cfg -o training/config --gpu-id -1
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-03-31 15:59:14,289] [INFO] Set up nlp object from config
[2021-03-31 15:59:14,304] [INFO] Pipeline: ['tok2vec', 'textcat']
[2021-03-31 15:59:14,312] [INFO] Created vocabulary
[2021-03-31 15:59:14,647] [INFO] Added vectors: assets/zh_fasttext_vectors
[2021-03-31 15:59:14,648] [INFO] Finished initializing nlp object
WARNING: features.msgpack does not exist, try loading features.pkl
[2021-03-31 15:59:20,586] [INFO] Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.88        0.00    0.00
  0     200          6.85        141.26       53.80    0.54
  1     400         16.89         77.29       73.09    0.73
  1     600         19.11         42.11       76.79    0.77
  2     800         34.89         34.45       79.56    0.80
  3    1000         26.01         27.75       81.92    0.82
  4    1200          6.29          8.05       80.83    0.81
  6    1400          7.62          8.14       75.74    0.76
  7    1600         15.94          3.76       82.12    0.82
 10    1800          1.71          1.31       82.54    0.83
 12    2000          0.68          0.35       81.38    0.81
 15    2200          0.54          0.22       80.78    0.81
 18    2400          2.44          0.31       82.26    0.82
 21    2600          0.39          0.16       81.98    0.82
 24    2800          2.63          0.30       81.06    0.81
 27    3000          0.17          0.07       81.37    0.81
 30    3200          0.48          0.39       79.65    0.80
 33    3400          0.51          0.40       77.96    0.78
✔ Saved pipeline to output directory training\config\model-last
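The log ends by saving model-last, yet the prediction code that follows loads model-best. spaCy keeps both in the output directory, and the two can differ noticeably. As a rough illustration, the sketch below copies the (step, CATS_SCORE) pairs from the log and compares the best evaluation with the final one. It only re-reads the numbers already shown here and does not inspect the saved pipelines themselves.

```python
# (step, CATS_SCORE) pairs copied by hand from the training log
scores = [
    (0, 0.00), (200, 53.80), (400, 73.09), (600, 76.79), (800, 79.56),
    (1000, 81.92), (1200, 80.83), (1400, 75.74), (1600, 82.12),
    (1800, 82.54), (2000, 81.38), (2200, 80.78), (2400, 82.26),
    (2600, 81.98), (2800, 81.06), (3000, 81.37), (3200, 79.65),
    (3400, 77.96),
]

# Evaluation with the highest score, presumably what model-best holds
best_step, best_score = max(scores, key=lambda pair: pair[1])

# Final evaluation, which is what model-last holds
last_step, last_score = scores[-1]

print(f"best: step {best_step}, CATS_SCORE {best_score}")
print(f"last: step {last_step}, CATS_SCORE {last_score}")
```

The best score occurs at step 1800 and training ends 1600 steps later. That would be consistent with early stopping after 1600 steps without improvement, though this is an inference from the log; the post does not show the patience setting.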
import spacy
nlp = spacy.load('training/config/model-best')
texts = ['变频装置操作原则','变频装置送电启动前检查项目','凝泵变频器检修转热备用']
docs = nlp.pipe(texts)
for doc in docs:
    print(doc.text)
    print(doc.cats)

Result:
变频装置操作原则
{'A': 5.065050117991632e-06, 'B': 0.0012312011094763875, 'C': 2.4202020085795084e-06, 'E': 8.4429457274382e-06, 'M': 1.0350024695071625e-06, 'O': 0.9986792206764221, 'R': 7.234238728415221e-05, 'S': 2.103930398789089e-07}
变频装置送电启动前检查项目
{'A': 1.6673536720190896e-06, 'B': 0.9929683804512024, 'C': 4.094211362826172e-06, 'E': 1.6252337218247703e-06, 'M': 1.8851016648113728e-05, 'O': 0.000206330994842574, 'R': 0.006798859685659409, 'S': 2.6590535640025337e-07}
凝泵变频器检修转热备用
{'A': 0.0001738495338940993, 'B': 0.006545054726302624, 'C': 2.776497012746404e-06, 'E': 4.217909008730203e-05, 'M': 0.004038343206048012, 'O': 0.9861150979995728, 'R': 0.0030819326639175415, 'S': 7.60167665703193e-07}
The predicted categories are correct for all three texts!
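The doc.cats dictionaries printed above hold one score per category, so a single predicted label still has to be read off by eye. A minimal sketch of picking the highest-scoring category follows. The scores are copied from the output above, and the helper name top_label is hypothetical.

```python
# Category scores copied from the doc.cats output above
predictions = {
    "变频装置操作原则": {
        "A": 5.065050117991632e-06, "B": 0.0012312011094763875,
        "C": 2.4202020085795084e-06, "E": 8.4429457274382e-06,
        "M": 1.0350024695071625e-06, "O": 0.9986792206764221,
        "R": 7.234238728415221e-05, "S": 2.103930398789089e-07,
    },
    "变频装置送电启动前检查项目": {
        "A": 1.6673536720190896e-06, "B": 0.9929683804512024,
        "C": 4.094211362826172e-06, "E": 1.6252337218247703e-06,
        "M": 1.8851016648113728e-05, "O": 0.000206330994842574,
        "R": 0.006798859685659409, "S": 2.6590535640025337e-07,
    },
    "凝泵变频器检修转热备用": {
        "A": 0.0001738495338940993, "B": 0.006545054726302624,
        "C": 2.776497012746404e-06, "E": 4.217909008730203e-05,
        "M": 0.004038343206048012, "O": 0.9861150979995728,
        "R": 0.0030819326639175415, "S": 7.60167665703193e-07,
    },
}


def top_label(cats: dict) -> tuple:
    """Return the (label, score) pair with the highest score."""
    label = max(cats, key=cats.get)
    return label, cats[label]


labels = {text: top_label(cats)[0] for text, cats in predictions.items()}
for text, label in labels.items():
    print(text, "->", label)
```

With a live pipeline the same helper could be applied directly as top_label(doc.cats) inside the loop over nlp.pipe(texts).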