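Local GPU: NVIDIA GeForce GT 740M, compute capability 3.5. The newest PyTorch version usable on this card is 1.2, while spaCy's transformer models require PyTorch 1.5 at minimum, so after repeated attempts I could not get them to run.
(For GPUs with compute capability 5.2 or higher, using the GPU requires downloading the CUDA Toolkit; the current installer is "cuda_11.1.0_456.43_win10.exe".)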
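The version reasoning above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `version_tuple`, `meets_minimum`, and `local_gpu_capability` are made up for illustration, the "1.5" minimum is the figure quoted in the text, and the GPU query simply returns None when PyTorch or CUDA is not available. Pre-release version strings such as "1.8.0a0" are not handled.

```python
def version_tuple(version):
    """Turn '1.7.1+cu110' into (1, 7, 1) so versions compare numerically."""
    core = version.split("+")[0]  # drop a local build tag such as '+cu110'
    return tuple(int(part) for part in core.split("."))


def meets_minimum(installed, minimum="1.5"):
    """True if the installed PyTorch version is at least the minimum (assumed 1.5, per the text)."""
    return version_tuple(installed) >= version_tuple(minimum)


def local_gpu_capability():
    """Return the (major, minor) compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


# PyTorch 1.2 (the newest usable on this card) falls short of the 1.5 minimum.
print(meets_minimum("1.2"))
print(local_gpu_capability())
```

Comparing tuples rather than strings avoids the trap where "1.10" sorts before "1.5" alphabetically.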
spacy [OPTIONS] COMMAND [ARGS]...
No | Commands | Description |
---|---|---|
1. | convert | Convert files into json or DocBin format for training. |
2. | debug | Suite of helpful commands for debugging and profiling. |
3. | download | Download compatible trained pipeline from the default download… |
4. | evaluate | Evaluate a trained pipeline. |
5. | info | Print info about spaCy installation. |
6. | init | Commands for initializing configs and pipeline packages. |
7. | package | Generate an installable Python package for a pipeline. |
8. | pretrain | Pre-train the ‘token-to-vector’ (tok2vec) layer of pipeline… |
9. | project | Command-line interface for spaCy projects and templates. |
10. | train | Train or update a spaCy pipeline. |
11. | validate | Validate the currently installed pipeline packages and spaCy… |
Taking text classification (textcat) as the example:
Create a directory and run the following command from a command line inside it:
python -m spacy project clone https://hub.fastgit.org/explosion/projects/tree/v3/tutorials/textcat_goemotions
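The data in this project differs from the original project's data in three ways:
1 The original project's data is tab-separated; this project's is comma-separated.
2 The original project's data includes the annotator; this project's does not.
3 In the original project a sample can carry several labels; in this project each sample has exactly one.
Changes to convert_corpus.py: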
from pathlib import Path

import spacy
import typer
from spacy.tokens import DocBin

ASSETS_DIR = Path(__file__).parent / "assets"
CORPUS_DIR = Path(__file__).parent / "corpus"


def read_categories(path: Path):
    # categories.txt holds one category name per line.
    return path.read_text(encoding="utf8").strip().split("\n")


def read_csv(file_):
    for line in file_:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma only, so commas inside the text are kept.
        text, label = line.rsplit(",", 1)
        yield {"text": text, "label": label}


def convert_record(nlp, record, categories):
    """Convert a record from the csv into a spaCy Doc object."""
    doc = nlp.make_doc(record["text"])
    # All categories other than the true one get value 0
    doc.cats = {category: 0 for category in categories}
    # The true label gets value 1
    doc.cats[record["label"]] = 1
    return doc


def main(data_dir: Path = ASSETS_DIR, corpus_dir: Path = CORPUS_DIR, lang: str = "zh"):
    """Convert the corpus's csv files to spaCy's binary format."""
    categories = read_categories(data_dir / "categories.txt")
    nlp = spacy.blank(lang)
    for csv_file in data_dir.iterdir():
        if csv_file.suffix != ".csv":
            continue
        # Close the input file once all of its records are converted.
        with csv_file.open(encoding="utf8") as file_:
            docs = [convert_record(nlp, record, categories) for record in read_csv(file_)]
        out_file = corpus_dir / csv_file.with_suffix(".spacy").name
        out_data = DocBin(docs=docs).to_bytes()
        with out_file.open("wb") as file_:
            file_.write(out_data)


if __name__ == "__main__":
    typer.run(main)
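As an alternative to splitting each line by hand, Python's standard csv module can parse the "text,label" rows and copes with quoted fields that contain commas. The following is a minimal sketch using only the standard library; the function names and the two sample rows are made up for illustration and are not part of the project.

```python
import csv
import io


def read_single_label_csv(file_):
    """Yield {'text', 'label'} records from comma-separated 'text,label' rows."""
    for row in csv.reader(file_):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        text, label = row
        yield {"text": text.strip(), "label": label.strip()}


def to_cats(label, categories):
    """Build a one-hot dict: 1 for the single true label, 0 for every other category."""
    cats = {category: 0 for category in categories}
    cats[label] = 1
    return cats


# Hypothetical sample data; the second row has a quoted text containing a comma.
sample = io.StringIO('变频装置操作原则,C\n"带逗号,的文本",B\n')
records = list(read_single_label_csv(sample))
for record in records:
    print(record["text"], to_cats(record["label"], ["A", "B", "C"]))
```

The one-hot dict has the same shape as the `doc.cats` value the conversion script assigns, with exactly one category set to 1 because each sample has a single label.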
spacy project run train
When training finishes it creates a training directory containing the model-best and model-last directories.
spacy project run evaluate
spacy project run package
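This creates a packages directory; its dist subdirectory holds the trained model package, an XXXX.tar.gz file, which can be installed with pip install and is then used the same way as a standard pretrained pipeline.
The downloaded project is built for English; for Chinese it is better to use a Chinese tokenizer, configured as follows: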
[nlp]
lang = "zh"
pipeline = ["tok2vec","textcat"]
[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "pkuseg"
[initialize]
vectors = "zh_core_web_lg"
[initialize.tokenizer]
pkuseg_model = "mixed"
pkuseg_user_dict = "user_dict.txt"
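The user dictionary can also be placed in the data directory, but the path must then be adjusted to match.
It is best to download the pkuseg "mixed" model in advance and unpack it into the C:\Users\Administrator\.pkuseg directory.
Running with the user dictionary raised the following error: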
File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\spacy_pkuseg\__init__.py", line 58, in __init__
assert isinstance(w_t, tuple)
AssertionError
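To work around it, edit the .py file named in the traceback.
After line 52, add:
w_t = tuple(w_t)
Here w_t is a list with values such as ['密封风', ''], while the code requires a tuple, hence the conversion.
What causes the type mismatch is not yet clear; for now this gets past the assertion so the model can be validated. To be investigated later.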
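The effect of that one-line workaround can be reproduced in isolation. This is only a sketch of the type conversion, not spacy_pkuseg's actual code: `normalize_entry` is a hypothetical helper, and the second entry is made up to show that values that are already tuples pass through unchanged.

```python
def normalize_entry(w_t):
    """Return the (word, tag) pair as a tuple, whether it arrives as a list or a tuple."""
    return tuple(w_t)


# The first entry mirrors the list value observed in the text; the second is hypothetical.
entries = [["密封风", ""], ("变频器", "n")]
normalized = [normalize_entry(entry) for entry in entries]

for entry in normalized:
    # This is the same kind of check that failed in the traceback.
    assert isinstance(entry, tuple)
print(normalized)
```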
pip install zh_textcat_aux-0.0.1.tar.gz
import spacy
nlp = spacy.load('zh_textcat_aux')
texts = ['变频装置操作原则','变频装置送电启动前检查项目','凝泵变频器检修转热备用']
docs = nlp.pipe(texts)
for doc in docs:
    print(doc.text)
    print(doc.cats)
Output:
变频装置操作原则
{'A': 1.8430232273658476e-07, 'B': 2.4967513923002116e-07, 'C': 0.9736177921295166, 'E': 1.4260081115935463e-06, 'M': 0.0003109535900875926, 'O': 0.0260681863874197, 'R': 1.220762669618125e-06, 'S': 3.233772005728497e-08}
变频装置送电启动前检查项目
{'A': 2.9710254256798407e-09, 'B': 0.9999549388885498, 'C': 1.8397947769699385e-06, 'E': 1.1763377472107095e-07, 'M': 2.796637090796139e-07, 'O': 2.1979019493301166e-07, 'R': 4.259566048858687e-05, 'S': 1.0315843231717414e-12}
凝泵变频器检修转热备用
{'A': 0.000247561139985919, 'B': 0.00014120333071332425, 'C': 0.00045776416664011776, 'E': 0.0004549270961433649, 'M': 0.010306362062692642, 'O': 0.9880962371826172, 'R': 0.00028140778886154294, 'S': 1.4502625163004268e-05}
Labels the model assigns to these three sentences: 'C', 'B', 'O'
Labels we expected: 'C', 'B', 'O'
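Reading the predicted label off each score dictionary amounts to taking the key with the highest score. A minimal sketch, using the three score dictionaries from the output above with values rounded for brevity; `predict_label` is a hypothetical helper name, not a spaCy function.

```python
def predict_label(cats):
    """Return the category with the highest score in a doc.cats-style dict."""
    return max(cats, key=cats.get)


# Scores rounded from the output above.
outputs = [
    {"A": 1.8e-07, "B": 2.5e-07, "C": 0.9736, "E": 1.4e-06,
     "M": 0.00031, "O": 0.0261, "R": 1.2e-06, "S": 3.2e-08},
    {"A": 3.0e-09, "B": 0.99995, "C": 1.8e-06, "E": 1.2e-07,
     "M": 2.8e-07, "O": 2.2e-07, "R": 4.3e-05, "S": 1.0e-12},
    {"A": 0.00025, "B": 0.00014, "C": 0.00046, "E": 0.00045,
     "M": 0.0103, "O": 0.9881, "R": 0.00028, "S": 1.5e-05},
]

predictions = [predict_label(cats) for cats in outputs]
print(predictions)  # ['C', 'B', 'O']
```

With a loaded pipeline the same helper could be applied to `doc.cats` directly inside the loop over `nlp.pipe(texts)`.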