Source code is available in the Gitee repository 自然语言处理练习 (code written while learning natural language processing) (gitee.com)
4.1 Installing the spaCy toolkit
The spaCy toolkit claims to do everything NLTK can do, only faster, with better support for deep learning, and, most importantly, it ships a Chinese language model!
For reasons best left unsaid, installing via the official site is hard to get working, so it is recommended to use the integrated packages available through conda instead.
Run
- conda install spacy
- conda install -c conda-forge spacy-model-en_core_web_sm
and the installation should complete successfully.
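To confirm the installation worked (an extra check, not part of the original steps), you can import spaCy and print the installed version:
- # Verify that spaCy is importable and show the installed version
- import spacy
- print(spacy.__version__)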
If that does not work, you can look online for an offline spaCy installation package; this article may help:
安装spaCy(最简单的教程)_spacy安装_御用厨师的博客-CSDN博客
4.2 Loading a model
You can install whichever model you need and then load it with a single call; the English model is used here for demonstration.
Example:
- # Load the model
- import spacy
-
- nlp = spacy.load("en_core_web_sm")
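As a quick sanity check (a small addition, not part of the original example), you can list the pipeline components that the loaded model provides:
- # Show the processing pipeline of the loaded model
- print(nlp.pipe_names)  # typically includes a tagger, parser and ner component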
4.3 Tokenization
spaCy can also perform tokenization.
Example:
- # Load some text
- doc = nlp('Weather is good, very windy and sunny. We have no classes in the afternoon')
- # Tokenize
- for token in doc:
-     print(token)
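Each Token object carries more than its surface text; a minimal sketch (not in the original) that prints a few common attributes:
- # Print each token together with its lemma and whether it is punctuation
- for token in doc:
-     print(token.text, token.lemma_, token.is_punct)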
4.4 Sentence segmentation
spaCy also provides sentence segmentation.
Example:
- # Split into sentences
- for sent in doc.sents:
-     print(sent)
4.5 Part-of-speech tagging
Like NLTK, spaCy provides part-of-speech tagging.
Example:
- # Part-of-speech tags
- for token in doc:
-     print('{}-{}'.format(token, token.pos_))
For the meaning of each tag, refer to a part-of-speech tag reference table.
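spaCy can also describe a tag itself (a small addition, not in the original example) via spacy.explain:
- # Look up a human-readable description of a tag
- print(spacy.explain('ADJ'))  # prints 'adjective'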
4.6 Named entity recognition
spaCy also provides named entity recognition.
Example:
- # Named entity recognition
- doc_2 = nlp("I went to Paris where I met my old friend Jack from uni")
- for ent in doc_2.ents:
-     print('{}-{}'.format(ent, ent.label_))
The results can also be rendered as a visualization.
- # Render the entities and write them to an HTML file
- import os
- from pathlib import Path
- from spacy import displacy
-
- doc = nlp("I went to Paris where I met my old friend Jack from uni")
- svg = displacy.render(doc, style='ent')
- output_path = Path(os.path.join("./", "sentence.html"))
- output_path.open('w', encoding="utf-8").write(svg)
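displacy can also draw the dependency parse; a minimal sketch (the output file name is made up for this example) using style='dep':
- # Render the dependency parse as SVG and save it
- svg_dep = displacy.render(doc, style='dep')
- Path("sentence_dep.svg").open('w', encoding='utf-8').write(svg_dep)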
4.7 Finding the names of all characters in a book
Using Pride and Prejudice as the corpus, here is a hands-on example that finds the names of all the characters.
Example:
- # Find the names of all characters in the book
- import os
- from collections import Counter
-
-
- def read_file(file_name):
-     with open(file_name, 'r') as f:
-         return f.read()
-
-
- text = read_file(os.path.join('./', 'data/Pride and Prejudice.txt'))
- processed_text = nlp(text)
- sentences = [s for s in processed_text.sents]
- print(len(sentences))
- print(sentences[:5])
-
-
- def find_person(doc):
-     # Count how often each PERSON entity appears and return the 10 most frequent
-     c = Counter()
-     for ent in doc.ents:
-         if ent.label_ == 'PERSON':
-             c[ent.lemma_] += 1
-     return c.most_common(10)
-
-
- print(find_person(processed_text))
4.8 Analyzing terrorist attacks
Using terrorist-attack records downloaded from the website of a global counter-terrorism organization, this example analyzes how many times particular groups carried out attacks in particular locations.
Example:
- # Analyze terrorist attacks
- from collections import defaultdict
-
- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns
-
-
- def read_file_to_list(file_name):
-     with open(file_name, 'r') as f:
-         return f.readlines()
-
-
- terrorist_articles = read_file_to_list(os.path.join('./', 'data/rand-terrorism-dataset.txt'))
- print(terrorist_articles[:5])
- terrorist_articles_nlp = [nlp(art.lower()) for art in terrorist_articles]
- common_terrorist_groups = [
- 'taliban',
- 'al-qaeda',
- 'hamas',
- 'fatah',
- 'plo',
- 'bilad al-rafidayn'
- ]
-
- common_locations = [
- 'iraq',
- 'baghdad',
- 'kirkuk',
- 'mosul',
- 'afghanistan',
- 'kabul',
- 'basra',
- 'palestine',
- 'gaza',
- 'israel',
- 'istanbul',
- 'beirut',
- 'pakistan'
- ]
-
- location_entity_dict = defaultdict(Counter)
- for article in terrorist_articles_nlp:
-     # Keep only the known group names (PERSON/ORG entities) and locations (GPE entities)
-     article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == 'PERSON' or ent.label_ == "ORG"]
-     article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
-     terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
-     location_common = [ent for ent in article_locations if ent in common_locations]
-     for found_entity in terrorist_common:
-         for found_location in location_common:
-             location_entity_dict[found_entity][found_location] += 1
-
- print(location_entity_dict)
- location_entity_df = pd.DataFrame.from_dict(dict(location_entity_dict), dtype=int)
- location_entity_df = location_entity_df.fillna(value=0).astype(int)
- print(location_entity_df)
-
- plt.figure(figsize=(12, 10))
- hmap = sns.heatmap(location_entity_df, annot=True, fmt='d', cmap='YlGnBu', cbar=False)
-
- plt.title("Global Incidents by Terrorist group")
- plt.xticks(rotation=30)
- plt.show()
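To keep the heatmap as an image file (the file name here is just an example), a save call can be placed just before plt.show():
- # Save the heatmap to a PNG file (call before plt.show())
- plt.savefig('terrorist_heatmap.png', bbox_inches='tight')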