Did you know?
“Chatbots can cut operational costs by up to 30%. eCommerce chatbot statistics show that businesses spend around $1.3 trillion on customer requests per year. With the assistance of chatbots, this expense could be reduced by 30%.” — The Future is Now — 37 Fascinating Chatbot Statistics by Danica Jovic
A few days back, I attended an online workshop from The MAD — Alpha, where I learned about the development and deployment of various types of chatbots: data-driven chatbots, machine learning chatbots, and a very robust chatbot built with the Rasa framework. In this article, I am going to share the terminology involved in developing a chatbot and show you how to develop a data-driven chatbot and deploy it on Telegram.
What Is a Chatbot?
At a basic level, a chatbot is a computer program that simulates and processes human conversation, allowing people to interact with digital devices as though they were communicating with a real person. Chatbots can be as simple as programs that answer a basic query with a single-line response, or as advanced as digital assistants that learn and evolve, delivering increasing levels of personalization as they gather and process information (what we might call a machine learning bot).
Some of the ways companies are using chatbots are:
- As customer support
- For booking items from a business
Types of chatbots
There are mainly two types of chatbots: data-driven chatbots and contextual chatbots.
- In a data-driven chatbot, the bot analyzes the keywords in the user's question and matches them against predetermined options to deliver the correct response. These bots are implemented with the NLTK (Natural Language Toolkit) library in Python.
- A contextual chatbot is much more advanced. It simulates near-human interactions better than a data-driven chatbot because when a user types a question, the bot tries to learn the intent and sentiment behind the user's query. These chatbots utilize machine learning to learn and improve over time.
In this article, we will build a data-driven chatbot based on the NLTK library in Python and deploy it on Telegram so that we can converse with our bot.
Prerequisites and Terminologies for Chatbots
Prerequisites
Basic knowledge of Python is sufficient for anyone who wants to make this chatbot.
Natural Language Processing (NLP)
Natural language processing helps computers communicate with people in their own language and scales other language-related tasks. For instance, NLP makes it feasible for computers to read text, hear and interpret speech, measure sentiment, and figure out which parts are significant. Using NLP, developers can perform tasks such as text summarization, named-entity recognition, sentiment analysis, and speech recognition.
NLTK
NLTK (Natural Language Toolkit) is the primary platform for building Python programs that work with human language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources, such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.
Downloading and installing NLTK
Install NLTK: run pip install nltk.
Test the installation: run python, then type import nltk.
For platform-specific instructions, read the NLTK documentation.
Installing NLTK packages
In your Python interpreter, run import nltk and then nltk.download(). This will open the NLTK downloader, from which you can choose the corpora, models, and other data packages to download. You can also download all packages at once, or specify the package you need as an argument to nltk.download().
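For the chatbot in this article, the tokenizer and lemmatizer data are the essential pieces; a minimal sketch that downloads just those (punkt and wordnet are NLTK's package names for them):

```python
import nltk

# Download only the data this walkthrough needs:
# 'punkt' for sentence/word tokenization, 'wordnet' for lemmatization.
nltk.download('punkt')
nltk.download('wordnet')
```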
Text pre-processing with NLTK
The principal issue with text data is that it is all in strings (groups of text), while machine learning algorithms need some kind of numerical feature vector to perform their task. So before we start any NLP project, we have to pre-process the text to make it suitable to work with. Fundamental text pre-processing includes the following steps (a combined sketch follows the list):
- Converting the entire content into uppercase or lowercase, so that the algorithm does not treat the same words differently in different contexts.
- Tokenization: a process in which strings of text are converted into lists of tokens. There is a sentence tokenizer that can be used to find the list of sentences, and a word tokenizer that can be used to find the list of words in a string.
- Removing noise: this removes everything that isn't a standard letter or number, such as punctuation marks, extra spaces, etc.
- Removing stop words: stop words are commonly used words (such as the, a, an, in, etc.) that have little value when selecting the phrase that matches a user query.
- Stemming: a process of reducing a derived form of a word to its stem, base, or root form. For example, if we stem the words walks, walking, and walked, the stem would be the single word walk.
- Lemmatization: a refined version of stemming. The significant difference between the two is that stemming operates on a single word without knowledge of the context and can often create a non-existent word, whereas lemmatization yields a valid word that has a meaning in the dictionary. Lemmatization is based on the part of speech of a word, which should be determined to get the word's correct lemma. An example of lemmatization is that is, am, and are are forms of the verb to be; therefore, their lemma is be.
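Here is a minimal sketch that chains these steps together with NLTK; the sample sentence is illustrative:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

text = "Knowing is not enough; we must apply. Walking helps."

# Lowercase so the same word is treated identically in any context.
tokens = nltk.word_tokenize(text.lower())

# Remove noise: keep only alphanumeric tokens (drops punctuation).
tokens = [t for t in tokens if t.isalnum()]

# Remove stop words.
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

# Lemmatize each remaining token.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
```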
Bag of words
After the initial cleaning and processing phase, we need to transform the text into a meaningful vector (or array) of numbers. According to A Gentle Introduction to the Bag-of-Words Model by Jason Brownlee,
“The bag of words is a representation of text that describes the occurrence of words inside a document. It includes two things:
1. A vocabulary of known words
2. A measure of the presence of all the known words”
In the bag-of-words model, a text (such as a sentence or a document) is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity. The main reason for using it is to check whether a sentence is similar in content to a document.
Suppose the vocabulary contains the words {knowing, must, do, enough}, and we have the sentence "Knowing is not enough; we must apply." The bag-of-words representation for this sentence would then be the vector [1, 1, 0, 1].
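A quick sketch of this counting with scikit-learn's CountVectorizer, pinning the vocabulary to the four example words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fix the vocabulary so the output columns follow this exact order.
vectorizer = CountVectorizer(vocabulary=['knowing', 'must', 'do', 'enough'])

bow = vectorizer.transform(["Knowing is not enough; we must apply."])
print(bow.toarray())  # [[1 1 0 1]]
```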
TF-IDF approach
The main disadvantage of the bag-of-words model is that the large vocabulary produces a high-dimensional feature vector, and that vector is highly sparse, since each sentence has non-zero values only in the dimensions corresponding to the few words that actually occur in it.
To overcome this disadvantage of bag of words, we normalize the frequency values. A newer approach, TF-IDF (term frequency-inverse document frequency), was invented for document search, similarity, and information retrieval. With it, words that are common in every document (such as this, what, and if) rank low even though they may appear many times, since they don't mean much to any particular document.
TF (Term Frequency): a score of how frequently a word occurs in the current document.
TF = (Number of times term t appears in a document)/(Number of terms in the document)
IDF (Inverse Document Frequency): a score of how rare a word is across all documents; the rarer the word, the higher its IDF.
IDF = log(N/n), where N is the total number of documents and n is the number of documents the term t has appeared in.
Finally, by multiplying TF and IDF, we get the TF-IDF score.
TF-IDF = TF * IDF
Let's take an example where we have a document of four words, and the word best occurs once in it. The TF for best is then (1/4) = 0.25. Now assume we have five documents, and best occurs in two of them. The IDF is then calculated as log(5/2) = 0.39. Thus the TF-IDF is the product of these quantities: 0.25 * 0.39 = 0.0975.
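The same arithmetic in Python (base-10 logarithm, as in the example):

```python
import math

tf = 1 / 4               # "best" occurs once in a 4-word document
idf = math.log10(5 / 2)  # 5 documents, 2 contain "best": ~0.398
print(tf * idf)          # ~0.0995 (0.0975 if IDF is rounded to 0.39 first)
```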
Cosine similarity
We applied TF-IDF to transform the text of each document into a real-valued vector in vector space. We can then use cosine similarity to determine how similar two vectors are, irrespective of their size: take their dot product and divide it by the product of their norms, which yields the cosine of the angle between the vectors. Using the following formula, we can find the similarity between any two documents d1 and d2, where d1 and d2 are non-zero vectors:

Cosine Similarity (d1, d2) = (d1 • d2) / (||d1|| * ||d2||)
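A small sketch of this computation with scikit-learn, comparing two short example documents (the document strings are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "COVID-19 affects different people in different ways.",
    "How does COVID-19 affect people?",
]

# Vectorize both documents with TF-IDF, then compare the vectors.
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))  # value in [0, 1]
```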
Chatbot Creation
Now we know some of the terminology used in chatbot creation and have a fair idea of the NLP process. We are going to create and deploy a chatbot on Telegram that will give you information about the symptoms and prevention of the coronavirus, and also provide updates about COVID-19 cases in India.
You can find the entire code in my GitHub repository.
Throughout the chatbot creation process that follows, I used a Jupyter notebook to write the Python code, and I recommend you do the same. If you don't want to install Jupyter on your machine, you can use Google Colab.
Importing the necessary libraries
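The import cell from the original notebook did not survive extraction; a sketch of the imports the rest of the walkthrough relies on, based on the libraries discussed above:

```python
import random
import string

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# NLTK data ('punkt', 'wordnet') is assumed to be downloaded already.
```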
Corpus
In NLP, a corpus refers to a collection of texts. The corpus contains the text against which user queries will be matched; the chatbot tries to fetch the most relevant sentence from it to give as a response to the user. In our case, the corpus will be the information about COVID-19 from the Wikipedia page. We will store this information in the content.txt file.
Reading the data
We will read the entire text from the content.txt file and tokenize it into sentences and words:
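The reading cell itself is missing from the extraction; a sketch consistent with the sent_tokens and word_tokens used below:

```python
# Read the corpus and tokenize it into sentences and words.
with open('content.txt', 'r', errors='ignore') as f:
    raw = f.read()

sent_tokens = nltk.sent_tokenize(raw)  # list of sentences
word_tokens = nltk.word_tokenize(raw)  # list of words
```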
Keyword matching
We will define greetings and their responses so that our chatbot can match them against pre-defined lists and greet the user appropriately at the start and end of the conversation.
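The greeting lists from the original gist aren't shown here; a hypothetical sketch of this keyword matching (the names GREETING_INPUTS, GREETING_RESPONSES, and greeting are illustrative):

```python
GREETING_INPUTS = ('hello', 'hi', 'hey', 'greetings', 'sup')
GREETING_RESPONSES = ['Hi!', 'Hey there!', 'Hello! How can I help you?']

def greeting(sentence):
    # Return a canned greeting if the user's sentence contains one.
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None
```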
Let's look at an example of sent_tokens and word_tokens:

sent_tokens[:1]
['COVID-19 affects different people in different ways.']

word_tokens[:7]
['COVID-19', 'affects', 'different', 'people', 'in', 'different', 'ways']
Pre-processing the raw text
In the following code snippet, the function Normalize will tokenize the text, lowercase it, and remove punctuation marks; after that, the function LemTokens will lemmatize each token.
Generating responses
To generate a chatbot response to a user query, we will use document similarity: the TF-IDF vectorizer and cosine similarity.
In the following code snippet, the response function takes the user_response passed from the bot_initialize function and appends it to the corpus in order to vectorize the text and find the similarity between user_response and the sentences in the corpus. If there is no similarity between user_response and the corpus, the chatbot returns "I am sorry! I don't understand. Please rephrase your query." Otherwise, it returns the sentence from the corpus with the second-highest cosine_similarity to user_response (the highest match is user_response itself, since it was just appended to the corpus).
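The response function's gist is missing from this extraction; a sketch matching the description above, assuming the Normalize helper and sent_tokens defined earlier:

```python
def response(user_response):
    robo_response = ''
    sent_tokens.append(user_response)

    # Vectorize the corpus plus the user's query with TF-IDF.
    TfidfVec = TfidfVectorizer(tokenizer=Normalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)

    # Compare the query (the last row) against every sentence.
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]  # second-highest; the highest is the query itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]

    if req_tfidf == 0:
        robo_response = "I am sorry! I don't understand. Please rephrase your query."
    else:
        robo_response = sent_tokens[idx]

    sent_tokens.remove(user_response)
    return robo_response
```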
The following bot_initialize function takes the user's query from Telegram and processes it to generate a credible response. We should also make our chatbot sophisticated enough to greet the user at the start and end of the conversation.
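That gist is also absent from the extraction; a hypothetical bot_initialize wiring the greeting check and the response function together (the 'bye' and 'thanks' handling is illustrative):

```python
def bot_initialize(user_msg):
    user_response = user_msg.lower().strip()
    if user_response == 'bye':
        return 'Bye! Take care.'
    if user_response in ('thanks', 'thank you'):
        return 'You are welcome!'
    greet = greeting(user_response)
    if greet is not None:
        return greet
    return response(user_response)
```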
So far we have made our chatbot respond to user queries, but where will the bot get those queries from? We are now going to integrate our chatbot with Telegram so that users can converse with our bot through Telegram.
Activating the data-driven Telegram bot
For activating and deploying your chatbot on Telegram, you have to meet some initial prerequisites.
First of all, you should have an account on Telegram.
After creating your account on Telegram, search for BotFather and create your chatbot as instructed by BotFather. After successfully creating your chatbot, BotFather will give you a token to authorize the bot and send requests to the Bot API. You should get a message like this:

Use this token to access the HTTP API:
1245993642:AAGc_EZbIoHag4SXXXXXXXXXXXXXX

After successfully getting your Telegram bot token, write the following snippet after the bot_initialize function.
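That snippet is another gist lost in extraction. Since the exact library the author used isn't recoverable, here is a sketch that talks to Telegram directly through the documented getUpdates and sendMessage HTTP endpoints; the token is the placeholder from the message above, and the polling loop simply forwards each incoming message to bot_initialize:

```python
import time
import requests

TOKEN = '1245993642:AAGc_EZbIoHag4SXXXXXXXXXXXXXX'  # placeholder token from BotFather
URL = f'https://api.telegram.org/bot{TOKEN}/'

def get_updates(offset=None):
    # Long-poll Telegram for new messages.
    params = {'timeout': 100, 'offset': offset}
    return requests.get(URL + 'getUpdates', params=params).json()['result']

def send_message(chat_id, text):
    requests.get(URL + 'sendMessage', params={'chat_id': chat_id, 'text': text})

last_update_id = None
while True:
    for update in get_updates(last_update_id):
        last_update_id = update['update_id'] + 1
        message = update.get('message', {})
        if 'text' in message:
            send_message(message['chat']['id'], bot_initialize(message['text']))
    time.sleep(0.5)
```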
These are the lines of code you can copy to make your chatbot work through Telegram. You can read the Telegram Bot API documentation to get in-depth knowledge of how to add more functionality to your Telegram chatbot.
Finally, we have made a chatbot with NLTK that is able to converse with us on Telegram while the Jupyter notebook is running on our system.
You can see my conversation with the chatbot on Telegram. It is generally able to find the similarity between the user's message and the corpus, and it replies with the sentence of maximum similarity. You can also see that when I send "okay," the bot cannot find any suitable match in the greetings or the corpus; therefore, it replies, "I am sorry! I don't understand. Please rephrase your query."
Conclusion
In conclusion, I would like to say that the chatbot above is just meant to help you understand the concepts of NLP, and that data-driven chatbots are considered the foundation for chatbots. For more capable chatbot development, one should learn about machine learning chatbots. You can also consider a framework like Rasa for an AI-powered chatbot.