Compiled from the latest official documentation on GitHub.
HanLP relies on deep learning frameworks such as PyTorch and TensorFlow, and is aimed at professional NLP engineers, researchers, and scenarios with large volumes of local data. It requires Python 3.6 to 3.10; Windows is supported, *nix is recommended. It can run on a CPU, but a GPU/TPU is recommended. To install the PyTorch flavor:
Turn off any proxy node during installation.
pip install hanlp
Console output:
(MyTest) C:\Users\Lenovo\PycharmProjects\MyTest>pip install hanlp
Collecting hanlp
  Downloading hanlp-2.1.0b52-py3-none-any.whl (651 kB)
     ---------------------------------------- 651.5/651.5 kB 1.2 MB/s eta 0:00:00
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ---------------------------------------- 53.1/53.1 kB ? eta 0:00:00
Collecting transformers>=4.1.1
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ---------------------------------------- 7.2/7.2 MB 5.5 MB/s eta 0:00:00
Collecting hanlp-trie>=0.0.4
  Downloading hanlp_trie-0.0.5.tar.gz (6.7 kB)
  Preparing metadata (setup.py) ... done
Collecting toposort==1.5
  Downloading toposort-1.5-py2.py3-none-any.whl (7.6 kB)
Collecting hanlp-common>=0.0.19
  Downloading hanlp_common-0.0.19.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting termcolor
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Requirement already satisfied: hanlp-downloader in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from hanlp) (0.0.25)
Collecting torch>=1.6.0
  Downloading torch-1.13.1-cp37-cp37m-win_amd64.whl (162.6 MB)
     ---------------------------------------- 162.6/162.6 MB 6.3 MB/s eta 0:00:00
Collecting sentencepiece>=0.1.91
  Downloading sentencepiece-0.1.99-cp37-cp37m-win_amd64.whl (977 kB)
     ---------------------------------------- 977.7/977.7 kB 10.3 MB/s eta 0:00:00
Collecting phrasetree
  Downloading phrasetree-0.0.8.tar.gz (42 kB)
     ---------------------------------------- 42.2/42.2 kB 2.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting typing-extensions
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting regex!=2019.12.17
  Downloading regex-2023.10.3-cp37-cp37m-win_amd64.whl (269 kB)
     ---------------------------------------- 269.9/269.9 kB 17.3 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting importlib-metadata
  Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Collecting huggingface-hub<1.0,>=0.14.1
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     ---------------------------------------- 268.8/268.8 kB 8.3 MB/s eta 0:00:00
Requirement already satisfied: requests in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from transformers>=4.1.1->hanlp) (2.31.0)
Collecting tqdm>=4.27
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
     ---------------------------------------- 78.3/78.3 kB ? eta 0:00:00
Collecting numpy>=1.17
  Downloading numpy-1.21.6-cp37-cp37m-win_amd64.whl (14.0 MB)
     ---------------------------------------- 14.0/14.0 MB 11.7 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp37-cp37m-win_amd64.whl (3.5 MB)
     ---------------------------------------- 3.5/3.5 MB 12.3 MB/s eta 0:00:00
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.0-cp37-none-win_amd64.whl (277 kB)
     ---------------------------------------- 277.3/277.3 kB 17.8 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0.1-cp37-cp37m-win_amd64.whl (153 kB)
     ---------------------------------------- 153.2/153.2 kB 9.5 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
     ---------------------------------------- 53.0/53.0 kB ? eta 0:00:00
Collecting fsspec
  Downloading fsspec-2023.1.0-py3-none-any.whl (143 kB)
     ---------------------------------------- 143.0/143.0 kB ? eta 0:00:00
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting zipp>=0.5
  Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (2.0.7)
Requirement already satisfied: idna<4,>=2.5 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (3.4)
Building wheels for collected packages: hanlp-common, hanlp-trie, phrasetree
  Building wheel for hanlp-common (setup.py) ... done
  Created wheel for hanlp-common: filename=hanlp_common-0.0.19-py3-none-any.whl size=30650 sha256=d3135f8a0e8bde4ff02320c6c84f1d809a9357f9ae2524a5bd99d4a096d2db2e
  Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\f2\70\bf\57226335746d58210d202e3a64428b8e3b4d57ca373f26d77b
  Building wheel for hanlp-trie (setup.py) ... done
  Created wheel for hanlp-trie: filename=hanlp_trie-0.0.5-py3-none-any.whl size=6831 sha256=87b214b03fe0473f53b8b12a34ed2a2bee54a152b96c9531ad22d6044b6eb790
  Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\69\ce\b1\c15e96cb4d3b170d002be4b8fd14c2185d32111080f352b3e6
  Building wheel for phrasetree (setup.py) ... done
  Created wheel for phrasetree: filename=phrasetree-0.0.8-py3-none-any.whl size=44234 sha256=e86b74c1ad7ebacc6dceeace9ea3b9452a2d4367ebf2080b253bb4bfef8fb53d
  Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\c2\81\3f\3ed1a1f06d94d021590de96e6953e44854599db1cd90d66846
Successfully built hanlp-common hanlp-trie phrasetree
Installing collected packages: toposort, tokenizers, sentencepiece, phrasetree, zipp, typing-extensions, termcolor, safetensors, regex, pyyaml, pynvml, packaging, numpy, hanlp-common, fsspec, filelock, colorama, tqdm, torch, importlib-metadata, hanlp-trie, huggingface-hub, transformers, hanlp
Successfully installed colorama-0.4.6 filelock-3.12.2 fsspec-2023.1.0 hanlp-2.1.0b52 hanlp-common-0.0.19 hanlp-trie-0.0.5 huggingface-hub-0.16.4 importlib-metadata-6.7.0 numpy-1.21.6 packaging-23.2 phrasetree-0.0.8 pynvml-11.5.0 pyyaml-6.0.1 regex-2023.10.3 safetensors-0.4.0 sentencepiece-0.1.99 termcolor-2.3.0 tokenizers-0.13.3 toposort-1.5 torch-1.13.1 tqdm-4.66.1 transformers-4.30.2 typing-extensions-4.7.1 zipp-3.15.0
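After the install completes, a quick import check confirms the package is usable. This is a minimal sketch; it only assumes that hanlp exposes a __version__ string matching the version pip reported above.

# Minimal post-install sanity check (assumes hanlp.__version__ exists)
import hanlp
print(hanlp.__version__)  # e.g. 2.1.0b52, matching the pip output above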
On first use, HanLP needs to pre-download an archive of roughly 600 MB.
Pre-download:
hanlp
Console output:
下载 http://download.hanlp.com/hanlp-1.8.4-release.zip 到 C:\Users\Lenovo\anaconda3\envs\MyTest\lib\site-packages\pyhanlp\static\hanlp-1.8.4-release.zip
100% 1.8 MiB 727.1 KiB/s ETA: 0 s [=============================================================]
下载 https://file.hankcs.com/hanlp/data-for-1.7.5.zip 到 C:\Users\Lenovo\anaconda3\envs\MyTest\lib\site-packages\pyhanlp\static\data-for-1.8.4.zip
100% 637.7 MiB 89.3 KiB/s ETA: 0 s [=============================================================]
解压 data.zip...
usage: hanlp [-h] [-v] {segment,parse,serve,update} ...

HanLP: Han Language Processing v1.8.4

positional arguments:
  {segment,parse,serve,update}
                        which task to perform?
    segment             word segmentation
    parse               dependency parsing
    serve               start http server
    update              update jar and data of HanLP

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show installed versions of HanLP
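The Document objects documented in the rest of this section are normally produced by a pretrained pipeline rather than constructed by hand. A minimal sketch following the HanLP 2.x README; the model identifier below comes from hanlp.pretrained.mtl, and any other model in that registry can be substituted:

# Load a multi-task pipeline and run it on one sentence.
# The first call downloads the model; later calls reuse the local cache.
import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
doc = HanLP(['晓美焰来到北京立方庭参观自然语义科技公司。'])
doc.pretty_print()  # doc is a hanlp_common.document.Document, described below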
Syntax:
class hanlp_common.document.Document(*args, **kwargs)
A dict structure that holds parsed annotations. Document is a subclass of dict and supports every dict interface. In addition, it provides interfaces for handling various linguistic structures. Its str and dict representations are compatible with JSON serialization.
Parameters:
*args – An iterator of key-value pairs.
**kwargs – Arguments from ** operator.
# Create a document
from hanlp_common.document import Document

doc = Document(
    tok=[["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]],
    pos=[["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]],
    ner=[[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
    dep=[[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]]
)

# print(doc) or str(doc) to get its JSON representation
print(doc)

print("----------annotation-----------")
# Access an annotation by its task name
print(doc['tok'])

print("----------count_sentences-----------")
# Get the number of sentences
print(f'It has {doc.count_sentences()} sentence(s)')

print("----------n-th sentence-----------")
# Access the n-th sentence
print(doc.squeeze(0)['tok'])

# Pretty-print it right in your console or notebook
print("----------pretty_print-----------")
doc.pretty_print()

# To save the pretty prints in a str
pretty_text: str = '\n\n'.join(doc.to_pretty())

print("----------squeeze-----------")
print(doc.squeeze(i=0))

print("----------to_conll()-----------")
print(doc.to_conll())

print("----------to_dict()-----------")
print(doc.to_dict())

print("----------to_json-----------")
print(doc.to_json())

print("----------to_pretty-----------")
# Note: without parentheses this prints the bound method itself, not the pretty text;
# call doc.to_pretty() to get the string(s).
print(doc.to_pretty)

print("----------translate-----------")
print(doc.translate('zh'))
Console output:
C:\Users\Lenovo\anaconda3\envs\MyTest\python.exe C:/Users/Lenovo/PycharmProjects/MyTest/1113/hanLP/HanLP.py
{
  "tok": [
    ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
  ],
  "pos": [
    ["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
  ],
  "ner": [
    [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "dep": [
    [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]
  ]
}
----------annotation-----------
[['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']]
----------count_sentences-----------
It has 1 sentence(s)
----------n-th sentence-----------
['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']
----------pretty_print-----------
Dep Tree        Tok    Relation  Po  Tok    NER Type
───────────     ───    ────────  ──  ───    ────────────────
         ┌─►    晓美焰  nsubj     NR  晓美焰  ───►PERSON
┌────┬──┴──     来到    root      VV  来到
│    │  ┌─►     北京    name      NR  北京    ◄─┐
│    └─►└──     立方庭  dobj      NR  立方庭  ◄─┴►LOCATION
└─►┌───────     参观    conj      VV  参观
   │  ┌───►     自然    compound  NN  自然    ◄─┐
   │  │┌──►     语义    compound  NN  语义    │
   │  ││┌─►     科技    compound  NN  科技    ├►ORGANIZATION
   └─►└┴┴──     公司    dobj      NN  公司    ◄─┘
----------squeeze-----------
{
  "tok": [
    "晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"
  ],
  "pos": [
    "NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"
  ],
  "ner": [
    ["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]
  ],
  "dep": [
    [2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]
  ]
}
----------to_conll()-----------
1  晓美焰   _  NR  _  _  2  nsubj     _  _
2  来到     _  VV  _  _  0  root      _  _
3  北京     _  NR  _  _  4  name      _  _
4  立方庭   _  NR  _  _  2  dobj      _  _
5  参观     _  VV  _  _  2  conj      _  _
6  自然     _  NN  _  _  9  compound  _  _
7  语义     _  NN  _  _  9  compound  _  _
8  科技     _  NN  _  _  9  compound  _  _
9  公司     _  NN  _  _  5  dobj      _  _
----------to_dict()-----------
{'tok': [['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']], 'pos': [['NR', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN']], 'ner': [[['晓美焰', 'PERSON', 0, 1], ['北京立方庭', 'LOCATION', 2, 4], ['自然语义科技公司', 'ORGANIZATION', 5, 9]]], 'dep': [[[2, 'nsubj'], [0, 'root'], [4, 'name'], [2, 'dobj'], [2, 'conj'], [9, 'compound'], [9, 'compound'], [9, 'compound'], [5, 'dobj']]]}
----------to_json-----------
{
  "tok": [
    ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
  ],
  "pos": [
    ["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
  ],
  "ner": [
    [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "dep": [
    [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]
  ]
}
----------to_pretty-----------
<bound method Document.to_pretty of {'tok': [['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']], 'pos': [['NR', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN']], 'ner': [[['晓美焰', 'PERSON', 0, 1], ['北京立方庭', 'LOCATION', 2, 4], ['自然语义科技公司', 'ORGANIZATION', 5, 9]]], 'dep': [[[2, 'nsubj'], [0, 'root'], [4, 'name'], [2, 'dobj'], [2, 'conj'], [9, 'compound'], [9, 'compound'], [9, 'compound'], [5, 'dobj']]]}>
----------translate-----------
{
  "tok": [
    ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
  ],
  "pos": [
    ["专有名词", "其他动词", "专有名词", "专有名词", "其他动词", "其他名词", "其他名词", "其他名词", "其他名词"]
  ],
  "ner": [
    [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "dep": [
    [[2, "名词性主语"], [0, "核心关系"], [4, "name"], [2, "直接宾语"], [2, "连接性状语"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "直接宾语"]]
  ]
}

Process finished with exit code 0
count_sentences() → int
Count the number of sentences in this document.
Returns: The number of sentences.
get_by_prefix(prefix: str)
Get a value by the prefix of a key.
prefix – The prefix of a key. If multiple keys match, only the first one is used.
Returns: The value assigned to the matched key.
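A short sketch of how this behaves, reusing the doc constructed in the example above; the assumption here is that the prefix 'po' matches the 'pos' key, which is handy when task keys carry suffixes such as 'tok/fine':

# Sketch: return the value of the first key whose name starts with the given prefix.
print(doc.get_by_prefix('po'))   # same result as doc['pos'] for this document
print(doc.get_by_prefix('tok'))  # an exact key name is also a valid prefix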
pretty_print(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False)
Print a pretty text representation which visualizes linguistic structures.
tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to print a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.
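A small usage sketch; the behavior of html=True outside a notebook is an assumption here, as the flag is mainly intended for Jupyter, where HTML output keeps wide CJK glyphs aligned:

# Console rendering vs. notebook-friendly rendering
doc.pretty_print()            # plain-text tree, as in the console output above
doc.pretty_print(html=True)   # HTML output, intended for Jupyter notebooks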
squeeze(i=0)
Squeeze the dimension of each field into one. It is intended to convert a nested document like [[sent_i]] to [sent_i]. When there are multiple sentences, only the i-th one is returned. Note that this is not an in-place operation.
i – Keep the element at index i for all lists.
Returns:
A squeezed document with only one sentence.
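A hedged sketch with a hypothetical two-sentence document (the token lists are made up for illustration):

# squeeze(i=1) keeps only the second sentence; the original document is untouched.
from hanlp_common.document import Document

two = Document(tok=[['我', '爱', '你'], ['你', '爱', '我']])
print(two.count_sentences())    # 2
print(two.squeeze(i=1)['tok'])  # ['你', '爱', '我'], the nesting [[sent]] becomes [sent]
print(two['tok'])               # unchanged, because squeeze is not an in-place operation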
to_conll(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp') → Union[hanlp_common.conll.CoNLLSentence, List[hanlp_common.conll.CoNLLSentence]]
Convert to CoNLLSentence.
tok (str) – Field name for tok.
lem (str) – Field name for lem.
pos (str) – Field name for upos.
dep (str) – Field name for dependency parsing.
sdp (str) – Field name for semantic dependency parsing.
Returns: A CoNLLSentence representation.
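A small sketch of persisting the CoNLL view; the file name is arbitrary, and for a single-sentence document to_conll() returns one CoNLLSentence whose str() is the tabular text shown in the console output above:

# Write the dependency parse to a CoNLL file.
conll = doc.to_conll()
with open('sentence.conll', 'w', encoding='utf-8') as f:
    f.write(str(conll))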
to_dict()
Convert to a JSON-compatible dict.
Returns: A dict representation.
to_json(ensure_ascii=False, indent=2) → str
Convert to a JSON string.
ensure_ascii – False to allow non-ASCII text.
indent – Indentation per nesting level.
Returns: A text representation in str.
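A sketch of a JSON round trip; it assumes that a Document, being a dict subclass, can be rebuilt from the parsed dict:

# Serialize the document to JSON and rebuild it.
import json
from hanlp_common.document import Document

text = doc.to_json(ensure_ascii=False, indent=2)
restored = Document(json.loads(text))  # assumption: dict-style construction restores the fields
assert restored['tok'] == doc['tok']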
to_pretty(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False) → Union[str, List[str]]
Convert to a pretty text representation which can be printed to visualize linguistic structures.
tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to include a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.
Returns: A pretty string.
translate(lang, tok='tok', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl')
Translate the tags of each annotation. Note that this is an in-place operation.
lang – Target language to be translated to.
tok – Token key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
As of November 16, 2023, the languages supported in hanlp.utils.lang are: Simplified Chinese (zh), Traditional Chinese (zh-tw), English (en), Japanese (ja), Korean (ko), French (fr), German (de), Spanish (es), and Russian (ru). All of them are covered by the common part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling tag sets.
Returns: The translated document.
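Because translate() mutates the document in place, a common pattern is to translate a deep copy so the original tag set stays available; copy.deepcopy is standard-library Python, not a HanLP API:

# Translate tags on a copy to keep the original English/CTB tags intact.
import copy

zh = copy.deepcopy(doc)
zh.translate('zh')   # POS and dependency tags rendered in Simplified Chinese, as shown above
print(doc['pos'])    # the original document is unchanged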