
Unleashing the Power of Text Analytics with Natural Language Processing


Deep Learning, Natural Language Processing

Natural language is the language humans use for everyday communication. It is highly unstructured in nature, for both text and speech, which makes it difficult for machines to parse and comprehend. Natural language processing (“NLP”) is concerned with the interaction between natural human language and computers. It sits at the intersection of linguistics, computer science, and artificial intelligence.

According to predictions by International Data Corporation in their report, The Digitization of the World: From Edge to Core, the total volume of data will grow from 33 zettabytes in 2018 to 175 zettabytes in 2025.

As per Forbes in 2019, 90% of data generated daily is unstructured, and much of it is text. This makes text the largest data source produced and, therefore, a rich source for analytics and for deploying AI applications in the enterprise. However, companies are often used to managing and analyzing only ‘structured’ data that fits neatly within the rows and columns of a database.

In order to unlock insights from unstructured data, we can leverage the power of text analytics and use NLP to transform the unstructured text in documents and databases into normalized, structured data suitable for analysis.

Two main components of NLP

  • Natural Language Understanding helps computers understand and interpret human language. Rather than requiring users to interact with computers through programming code, NLP allows users to interact with the computer using everyday language, to which the computer can respond appropriately.

  • Natural Language Generation is the process where computers translate data into readable human languages. The data being translated includes the bits and bytes that make up the photos and text that appear on your computer screen.

Main Applications of Natural Language Understanding

  1. Topic modeling extracts meaning from texts by identifying recurrent patterns or topics, unlocking the semantic structure behind each individual text.

  2. Document classification helps to sort discrete collections of text into categories. Examples include email spam filters and differentiating positive from negative product and customer reviews.

  3. Document recommendation selects the most relevant document based on the given information via a content-based recommender system. A good example would be the Google search engine, where the most relevant web pages are shown based on the user’s query.

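The document-classification idea above can be sketched with a simple bag-of-words model and a Naïve Bayes classifier, the same combination used later in this post. A minimal example, where the texts and labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: two spam-like and two ham-like texts
train_texts = [
    "win a free prize now", "claim your free reward",       # spam
    "meeting agenda for monday", "project status update",   # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn the texts into bag-of-words count vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit a multinomial Naïve Bayes classifier on the counts
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classify an unseen text: its vocabulary overlaps the spam examples
print(clf.predict(vectorizer.transform(["free prize inside"])))
```

The same pattern scales to real corpora: only the vectorizer vocabulary and the training data change.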
Main Applications of Natural Language Generation

  1. Document summarization generates text summaries from the growing amount of text data available. With the growth of telecommuting, the ability to capture key ideas and content from conversations is gaining traction. A speech summarization system that could turn voice to text and generate summaries from team meetings would be interesting.

  2. Machine translation translates text between languages. An example is Google Translate, which is probably the most used and well-known machine translation engine to date.

  3. Question answering systems answer questions posed by humans in natural language. Think of voice assistants such as Apple’s Siri and Amazon’s Alexa.

In this post, we shall work through an NLP workflow and explore the basics of using Latent Dirichlet Allocation for topic modeling and Naïve Bayes for text classification.

Problem statement: Perform topic modeling on a news corpus and develop a text classification model.

Let’s start coding!

    import os
    import string
    import pickle
    import re, nltk
    from tqdm import *
    import numpy as np
    import pandas as pd
    from pandas import DataFrame
    from datetime import datetime
    from wordcloud import WordCloud
    from nltk.tag import pos_tag
    from nltk.collocations import *
    from nltk.corpus import wordnet
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem.wordnet import WordNetLemmatizer
    import sklearn
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    import pyLDAvis
    import pyLDAvis.sklearn
    import scikitplot as skplt
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

Data Preparation

Load the news corpus and preview a snippet of the dataset.

    text = ''
    with open('Data/News/News Set B.txt', 'r', encoding='utf-8') as f:
        text = " ".join([l.strip() for l in f.readlines()])
    text[0:500]

URL links are present in the news corpus, so let’s use a regular expression to remove them. The re.compile() method compiles a regular expression pattern into a pattern object for matching, which lets us reuse the pattern without rewriting it. We then split the corpus into individual news articles and strip leading and trailing whitespace from each.

    # Remove URLs
    p = re.compile(r'URL:\shttp[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
                   r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    articles = p.split(text)
    for i, line in enumerate(articles):
        articles[i] = line.strip()
    # Display a sample article after initial clean-up
    print(articles[1])
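To sanity-check the compiled pattern, here is how it behaves on a short invented string (the example.com URLs are hypothetical stand-ins for the markers in the real corpus):

```python
import re

# Same pattern as above: matches a "URL: http(s)://..." marker
p = re.compile(r'URL:\shttp[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
               r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Hypothetical two-article sample with URL markers between articles
sample = ("URL: http://example.com/news1 First article. "
          "URL: https://example.com/news2 Second article.")

# Splitting on the markers yields the article texts (plus a leading empty string)
parts = [s.strip() for s in p.split(sample)]
print(parts)
```

Note the leading empty string: the corpus starts with a URL marker, which is why the first row of the dataframe turns out empty below.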

The sample article displayed above looks good to go. Let’s load the processed news corpus into a dataframe and create a column named “content”.

    # Load data into a dataframe for further analysis
    df = DataFrame(articles, columns=['content'])
    df.head()

It appears that the first row under index 0 is empty. Let’s find out the total number of articles present in the dataframe before cleaning.

    # Display total number of news articles before cleaning
    len(df)

Data Cleaning

    # Assign nan value to cells with empty fields
    nan_value = float("NaN")
    # Co
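The cleaning snippet above is cut off mid-comment. One plausible way the step continues, as a hedged sketch: replace the empty fields with NaN, drop those rows, and re-index. The `df` here is a hypothetical stand-in for the dataframe built from the news articles earlier:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe of article contents
df = pd.DataFrame({"content": ["", "First article text", "Second article text"]})

# Assign NaN to cells with empty fields, drop them, and re-index
nan_value = float("NaN")
df = df.replace("", nan_value).dropna(subset=["content"]).reset_index(drop=True)

print(len(df))  # total number of news articles after cleaning
```

After this step, the empty first row seen earlier (an artifact of splitting on the leading URL marker) no longer appears in the dataframe.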