AI Is Nothing Special: Build Your Own Chatbot (ChatRobot) with the Python 3 Deep Learning Libraries Keras/TensorFlow

Author: 木道寻08 | 2024-08-03 23:35:15

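Chatbots (ChatRobot) are a familiar concept. Maybe you have flirted with Siri out of sheer boredom, or bantered with XiaoAI (小爱同学) in an idle moment; either way, we have to admit that artificial intelligence is now part of everyday life. Plenty of bots already offer third-party APIs: Microsoft XiaoIce, Turing Robot, Tencent's chit-chat service, the Qingyunke bot and so on, and any of them can be plugged into an app or a web application whenever we like. But how do these services actually work under the hood? And without a network connection, could a bot talk to people the way the robots in the TV series Westworld do, needing nothing but a locally stored "mind sphere"? If being a mere "package caller" is not enough for you, follow along: this time we will use the deep learning libraries Keras/TensorFlow for the first time to build a local chatbot of our own, one that depends on no third-party API and no network.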

First, install the dependencies (pandas is included because the script below imports it):

pip3 install tensorflow
pip3 install keras
pip3 install nltk
pip3 install pandas

Then create the script test_bot.py and import the libraries we need:

import nltk
import ssl
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import pandas as pd
import pickle
import random

There is one pitfall here: the natural language toolkit NLTK raises an error:

Resource punkt not found

Normally, adding one line that calls the downloader is enough:

import nltk
nltk.download('punkt')

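However, because of network restrictions it is often hard to fetch the data through the Python downloader, so we take a detour and download the archive by hand:

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip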

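After unzipping, place the folder under nltk_data\tokenizers in your user directory, which is one of the places NLTK searches:

C:\Users\liuyue\nltk_data\tokenizers\punkt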
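To double-check where the unzipped folder should end up, here is a minimal sketch that uses only the standard library. It assumes NLTK's default behaviour of searching an `nltk_data` folder in the user's home directory for `tokenizers/punkt`; the helper name `expected_punkt_dir` is made up for illustration and is not part of NLTK.

```python
from pathlib import Path


def expected_punkt_dir(home=None):
    """Return the folder where the unzipped punkt data is expected (assumed default layout)."""
    base = Path(home) if home is not None else Path.home()
    return base / "nltk_data" / "tokenizers" / "punkt"


if __name__ == "__main__":
    target = expected_punkt_dir()
    # Report whether the folder is already in place for the current user
    print(target, "exists" if target.is_dir() else "is missing")
```

If the data lives somewhere else, appending that folder to the `nltk.data.path` list before tokenizing is another option.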

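OK, back to the main topic. The biggest challenge in building a chatbot is classifying what the user types and recognizing the person's actual intent (this could be solved with classical machine learning, but that is too complicated and I got lazy, so I used deep learning with Keras). The second challenge is keeping context, that is, analyzing and tracking the conversation so far. In many cases we do not need to classify the user's intent at all and can simply treat the input as the answer to a question the bot asked; for the classification part, we use the Keras deep learning library to build the model.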

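The bot's intents and the patterns it needs to learn are defined in one simple variable. There is no need for a corpus that runs to terabytes. People who build bots without a corpus tend to get laughed at, but our goal is only a specific bot for one specific context. So the classification model is built over a small vocabulary and can recognize only the small set of patterns provided for training.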

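Put plainly, so-called machine learning means repeatedly teaching a machine to do one or a few things correctly. During training you keep demonstrating what correct looks like and hope the machine learns to generalize. This time we simply don't teach it many things, only one, to test how it reacts. A bit like training your pet dog at home, isn't it? Except the dog can't chat with you.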

Here is a simple example of the intents variable; if you like, you can extend it without limit using a corpus:

intents = {"intents": [  
        {"tag": "打招呼",  
         "patterns": ["你好", "您好", "请问", "有人吗", "师傅","不好意思","美女","帅哥","靓妹","hi"],  
         "responses": ["您好", "又是您啊", "吃了么您内","您有事吗"],  
         "context": [""]  
        },  
        {"tag": "告别",  
         "patterns": ["再见", "拜拜", "88", "回见", "回头见"],  
         "responses": ["再见", "一路顺风", "下次见", "拜拜了您内"],  
         "context": [""]  
        },  
   ]  
}

As you can see, I defined two context tags, 打招呼 (greeting) and 告别 (farewell), each with user input patterns and bot responses.
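To make the role of the two fields concrete, here is a small sketch of how a response could be looked up once a tag is known. It uses a shortened copy of the intents data, and the helper `pick_response` is a hypothetical name introduced only for this illustration.

```python
import random

# Shortened copy of the intents variable, for illustration only
intents = {"intents": [
    {"tag": "打招呼", "patterns": ["你好", "您好"], "responses": ["您好", "又是您啊"]},
    {"tag": "告别", "patterns": ["再见", "拜拜"], "responses": ["再见", "一路顺风"]},
]}


def pick_response(tag, intents, default=""):
    """Return a random response for the given tag, or default if the tag is unknown."""
    for intent in intents["intents"]:
        if intent["tag"] == tag:
            return random.choice(intent["responses"])
    return default


print(pick_response("告别", intents))
```

The patterns are what the model learns from; the responses are never seen by the model and are only used after classification.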

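Before training the classification model, we need to build the vocabulary. The patterns are processed to build it: each word is stemmed to a common root, which helps match more variations of user input.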

# containers for the vocabulary, the intent tags, and (tokens, tag) pairs
words = []
classes = []
documents = []
# tokens to skip; this list is not shown in the original post, so it is an assumption
ignore_words = ['?']

for intent in intents['intents']:
    for pattern in intent['patterns']:
        # tokenize each word in the sentence
        w = nltk.word_tokenize(pattern)
        # add to our words list
        words.extend(w)
        # add to documents in our corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# stem and lowercase every word, then remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))

classes = sorted(list(set(classes)))

# 语境 = contexts
print (len(classes), "语境", classes)

# 词数 = word count
print (len(words), "词数", words)

Output:

2 语境 ['告别', '打招呼']  
14 词数 ['88', '不好意思', '你好', '再见', '回头见', '回见', '帅哥', '师傅', '您好', '拜拜', '有人吗', '美女', '请问', '靓妹']

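Training does not operate on the words themselves, because words mean nothing to a machine. This is also a misconception that many Chinese word-segmentation libraries fall into: the machine does not actually understand whether your input is English or Chinese. We only need to turn each English word or Chinese term into a bag of words, an array of 0s and 1s. The array length equals the vocabulary size, and a position is set to 1 when the word at that position occurs in the current pattern.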
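A tiny worked example may help here. The sketch below encodes one pattern over a toy four-word vocabulary and also builds the matching 0/1 label row for its tag; the helper names `to_bag` and `to_one_hot` are made up for this illustration.

```python
def to_bag(tokens, vocabulary):
    """Return a 0/1 list marking which vocabulary entries occur in tokens."""
    return [1 if word in tokens else 0 for word in vocabulary]


def to_one_hot(tag, classes):
    """Return a 0/1 list with a single 1 at the position of tag."""
    return [1 if c == tag else 0 for c in classes]


vocabulary = ["88", "你好", "再见", "您好"]   # toy vocabulary, already sorted
classes = ["告别", "打招呼"]

print(to_bag(["你好"], vocabulary))      # [0, 1, 0, 0]
print(to_one_hot("打招呼", classes))     # [0, 1]
```

Each training sample is one such pair: the bag is the model input and the one-hot row is the target.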

# create our training data
training = []
# create an empty array for our output
output_empty = [0] * len(classes)
# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []

    # list of tokenized words for the pattern, stemmed like the vocabulary
    pattern_words = doc[0]
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]

    # 1 if the vocabulary word occurs in this pattern, else 0
    for w in words:
        bag.append(1 if w in pattern_words else 0)

    # one-hot label: 1 at the index of this pattern's tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])

random.shuffle(training)

# split inputs and labels with list comprehensions; calling np.array() on
# these ragged [bag, output_row] rows raises an error on recent NumPy
train_x = [row[0] for row in training]
train_y = [row[1] for row in training]

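Now we start training. The model is built with Keras and has three layers. Because the dataset is small, the classification output is a multi-class array, which helps identify the encoded intent. A softmax activation produces the multi-class output: an array of per-class probabilities that approximates a one-hot array such as [1,0,0,…,0], and that array identifies the encoded intent.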

model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# recent Keras releases name this argument learning_rate (formerly lr) and no
# longer support decay=1e-6 here, so the decay from the original call is dropped
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
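To see what the softmax output layer does with the raw scores, here is a minimal pure-Python sketch. The two input scores are hypothetical values chosen for illustration; they are not taken from the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, -1.0])                 # hypothetical scores for two intents
best = max(range(len(probs)), key=probs.__getitem__)
print(probs, "-> predicted class index", best)
```

The index of the largest probability is the predicted class, which is then mapped back to a tag through the sorted classes list.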

This runs training for 200 epochs with a batch size of 5. Since my sample is small, 100 epochs would also do; that is not the point here.
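As a quick piece of arithmetic on what those two settings mean, the sketch below counts the weight updates, assuming the 14 training samples reported by the author's run.

```python
import math

samples = 14        # number of training samples in the author's run
batch_size = 5
epochs = 200

# Each epoch is split into batches of 5, 5 and 4 samples
steps_per_epoch = math.ceil(samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)   # 3 600
```

So even 200 epochs amount to only a few hundred small updates, which is why training finishes almost instantly.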

Start training:

Epoch 1/200  
14/14 [==============================] - 0s 32ms/step - loss: 0.7305 - acc: 0.5000  
Epoch 2/200  
14/14 [==============================] - 0s 391us/step - loss: 0.7458 - acc: 0.4286  
Epoch 3/200  
14/14 [==============================] - 0s 390us/step - loss: 0.7086 - acc: 0.3571  
Epoch 4/200  
14/14 [==============================] - 0s 395us/step - loss: 0.6941 - acc: 0.6429  
Epoch 5/200  
14/14 [==============================] - 0s 426us/step - loss: 0.6358 - acc: 0.7143  
Epoch 6/200  
14/14 [==============================] - 0s 356us/step - loss: 0.6287 - acc: 0.5714  
Epoch 7/200  
14/14 [==============================] - 0s 366us/step - loss: 0.6457 - acc: 0.6429  
Epoch 8/200  
14/14 [==============================] - 0s 899us/step - loss: 0.6336 - acc: 0.6429  
Epoch 9/200  
14/14 [==============================] - 0s 464us/step - loss: 0.5815 - acc: 0.6429  
Epoch 10/200  
14/14 [==============================] - 0s 408us/step - loss: 0.5895 - acc: 0.6429  
Epoch 11/200  
14/14 [==============================] - 0s 548us/step - loss: 0.6050 - acc: 0.6429  
Epoch 12/200  
14/14 [==============================] - 0s 468us/step - loss: 0.6254 - acc: 0.6429  
Epoch 13/200  
14/14 [==============================] - 0s 388us/step - loss: 0.4990 - acc: 0.7857  
Epoch 14/200  
14/14 [==============================] - 0s 392us/step - loss: 0.5880 - acc: 0.7143  
Epoch 15/200  
14/14 [==============================] - 0s 370us/step - loss: 0.5118 - acc: 0.8571  
Epoch 16/200  
14/14 [==============================] - 0s 457us/step - loss: 0.5579 - acc: 0.7143  
Epoch 17/200  
14/14 [==============================] - 0s 432us/step - loss: 0.4535 - acc: 0.7857  
Epoch 18/200  
14/14 [==============================] - 0s 357us/step - loss: 0.4367 - acc: 0.7857  
Epoch 19/200  
14/14 [==============================] - 0s 384us/step - loss: 0.4751 - acc: 0.7857  
Epoch 20/200  
14/14 [==============================] - 0s 346us/step - loss: 0.4404 - acc: 0.9286  
Epoch 21/200  
14/14 [==============================] - 0s 500us/step - loss: 0.4325 - acc: 0.8571  
Epoch 22/200  
14/14 [==============================] - 0s 400us/step - loss: 0.4104 - acc: 0.9286  
Epoch 23/200  
14/14 [==============================] - 0s 738us/step - loss: 0.4296 - acc: 0.7857  
Epoch 24/200  
14/14 [==============================] - 0s 387us/step - loss: 0.3706 - acc: 0.9286  
Epoch 25/200  
14/14 [==============================] - 0s 430us/step - loss: 0.4213 - acc: 0.8571  
Epoch 26/200  
14/14 [==============================] - 0s 351us/step - loss: 0.2867 - acc: 1.0000  
Epoch 27/200  
14/14 [==============================] - 0s 3ms/step - loss: 0.2903 - acc: 1.0000  
Epoch 28/200  
14/14 [==============================] - 0s 366us/step - loss: 0.3010 - acc: 0.9286  
Epoch 29/200  
14/14 [==============================] - 0s 404us/step - loss: 0.2466 - acc: 0.9286  
Epoch 30/200  
14/14 [==============================] - 0s 428us/step - loss: 0.3035 - acc: 0.7857  
Epoch 31/200  
14/14 [==============================] - 0s 407us/step - loss: 0.2075 - acc: 1.0000  
Epoch 32/200  
14/14 [==============================] - 0s 457us/step - loss: 0.2167 - acc: 0.9286  
Epoch 33/200  
14/14 [==============================] - 0s 613us/step - loss: 0.1266 - acc: 1.0000  
Epoch 34/200  
14/14 [==============================] - 0s 534us/step - loss: 0.2906 - acc: 0.9286  
Epoch 35/200  
14/14 [==============================] - 0s 463us/step - loss: 0.2560 - acc: 0.9286  
Epoch 36/200  
14/14 [==============================] - 0s 500us/step - loss: 0.1686 - acc: 1.0000  
Epoch 37/200  
14/14 [==============================] - 0s 387us/step - loss: 0.0922 - acc: 1.0000  
Epoch 38/200  
14/14 [==============================] - 0s 430us/step - loss: 0.1620 - acc: 1.0000  
Epoch 39/200  
14/14 [==============================] - 0s 371us/step - loss: 0.1104 - acc: 1.0000  
Epoch 40/200  
14/14 [==============================] - 0s 488us/step - loss: 0.1330 - acc: 1.0000  
Epoch 41/200  
14/14 [==============================] - 0s 381us/step - loss: 0.1322 - acc: 1.0000  
Epoch 42/200  
14/14 [==============================] - 0s 462us/step - loss: 0.0575 - acc: 1.0000  
Epoch 43/200  
14/14 [==============================] - 0s 1ms/step - loss: 0.1137 - acc: 1.0000  
Epoch 44/200  
14/14 [==============================] - 0s 450us/step - loss: 0.0245 - acc: 1.0000  
Epoch 45/200  
14/14 [==============================] - 0s 470us/step - loss: 0.1824 - acc: 1.0000  
Epoch 46/200  
14/14 [==============================] - 0s 444us/step - loss: 0.0822 - acc: 1.0000  
Epoch 47/200  
14/14 [==============================] - 0s 436us/step - loss: 0.0939 - acc: 1.0000  
Epoch 48/200  
14/14 [==============================] - 0s 396us/step - loss: 0.0288 - acc: 1.0000  
Epoch 49/200  
14/14 [==============================] - 0s 580us/step - loss: 0.1367 - acc: 0.9286  
Epoch 50/200  
14/14 [==============================] - 0s 351us/step - loss: 0.0363 - acc: 1.0000  
Epoch 51/200  
14/14 [==============================] - 0s 379us/step - loss: 0.0272 - acc: 1.0000  
Epoch 52/200  
14/14 [==============================] - 0s 358us/step - loss: 0.0712 - acc: 1.0000  
Epoch 53/200  
14/14 [==============================] - 0s 4ms/step - loss: 0.0426 - acc: 1.0000  
Epoch 54/200  
14/14 [==============================] - 0s 370us/step - loss: 0.0430 - acc: 1.0000  
Epoch 55/200  
14/14 [==============================] - 0s 368us/step - loss: 0.0292 - acc: 1.0000  
Epoch 56/200  
14/14 [==============================] - 0s 494us/step - loss: 0.0777 - acc: 1.0000  
Epoch 57/200  
14/14 [==============================] - 0s 356us/step - loss: 0.0496 - acc: 1.0000  
Epoch 58/200  
14/14 [==============================] - 0s 427us/step - loss: 0.1485 - acc: 1.0000  
Epoch 59/200  
14/14 [==============================] - 0s 381us/step - loss: 0.1006 - acc: 1.0000  
Epoch 60/200  
14/14 [==============================] - 0s 421us/step - loss: 0.0183 - acc: 1.0000  
Epoch 61/200  
14/14 [==============================] - 0s 344us/step - loss: 0.0788 - acc: 0.9286  
Epoch 62/200  
14/14 [==============================] - 0s 529us/step - loss: 0.0176 - acc: 1.0000

OK, after 200 epochs the model is trained. Now define the functions that convert a sentence into a bag of words:

def clean_up_sentence(sentence):  
    # tokenize the pattern - split words into array  
    sentence_words = nltk.word_tokenize(sentence)  
    # stem each word - create short form for word  
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]  
    return sentence_words

def bow(sentence, words, show_details=True):  
    # tokenize the pattern  
    sentence_words = clean_up_sentence(sentence)  
    # bag of words - matrix of N words, vocabulary matrix  
    bag = [0]*len(words)    
    for s in sentence_words:  
        for i,w in enumerate(words):  
            if w == s:   
                # assign 1 if current word is in the vocabulary position  
                bag[i] = 1  
                if show_details:  
                    print ("found in bag: %s" % w)  
    return(np.array(bag))

Let's test whether the input hits the bag of words:

p = bow("你好", words)
print (p)

Output:

found in bag: 你好
[0 0 1 0 0 0 0 0 0 0 0 0 0 0]

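Clearly a match: the word is in the bag.

Before packaging the model, we can use the model.predict function to classify user input and return the user's intent from the computed probabilities (several intents may be returned, sorted by descending probability):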

def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model; a 2-D NumPy array (one row per
    # sentence) is passed here instead of the original list-wrapped DataFrame
    input_data = np.array([bow(sentence, words)], dtype=float)
    results = model.predict(input_data)[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    # build the list of (intent, probability) tuples
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))

    return return_list

Test it:

print(classify_local('您好'))

Output:

found in bag: 您好
[('打招呼', '0.999913')]

Another test:

print(classify_local('88'))

Output:

found in bag: 88
[('告别', '0.9995449')]

Perfect: both inputs match the right context tag (打招呼 for 您好, 告别 for 88). If you like, test a few more inputs and refine the model.
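One case worth probing when testing more inputs: a sentence with no known word produces an all-zero bag, yet softmax still returns a probability for every class, so some tag may pass the threshold anyway. Below is a minimal sketch of a guard. The helper `choose_tag` and the fallback tag "unknown" are hypothetical and not part of the original script, and the 0.61 probability is an invented illustrative value.

```python
def choose_tag(bag, ranked, fallback="unknown"):
    """Pick the top-ranked tag, or fallback when no vocabulary word matched."""
    if not any(bag) or not ranked:
        return fallback
    return ranked[0][0]


# A sentence that hit the vocabulary keeps its predicted tag
print(choose_tag([0, 0, 1, 0], [("打招呼", "0.999913")]))   # 打招呼
# An all-zero bag falls back, whatever the model predicted
print(choose_tag([0, 0, 0, 0], [("告别", "0.61")]))         # unknown
```

Here `bag` is what bow() returns and `ranked` is what classify_local() returns for the same sentence.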

Once testing is done, we can package the trained model so that it does not have to be retrained before every call:

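# save the model architecture as JSON
json_file = model.to_json()
with open('v3ucn.json', "w") as file:
   file.write(json_file)

# save the weights; note that Keras 3 requires weight file names to end in
# .weights.h5, so the author's file name below suits older Keras releases
model.save_weights('./v3ucn.h5f')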

The model is stored as an architecture file (json) and a weights file (h5f). Keep both; we will need them shortly.
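The two model files do not contain the vocabulary, the class list or the intents, and a separate serving script needs those as well. The post imports pickle but never shows it in use, so the following is only a sketch of one way to store and reload them. The file name `v3ucn_vocab.pkl` and the dictionary layout are assumptions, and the data here is a toy stand-in.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the words, classes and intents built during training
words = ["88", "你好", "再见"]
classes = ["告别", "打招呼"]
intents = {"intents": [{"tag": "告别", "patterns": ["再见"], "responses": ["再见"]}]}

# Hypothetical file name; a temp folder is used here so the sketch runs anywhere,
# in practice the file would sit next to the model files
path = os.path.join(tempfile.gettempdir(), "v3ucn_vocab.pkl")

with open(path, "wb") as file:
    pickle.dump({"words": words, "classes": classes, "intents": intents}, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored["classes"])
```

The order of words and classes must be exactly the one used in training, since the model only knows positions, not the words themselves.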

Next, let's build an API for the chatbot. We use FastAPI, a very popular framework at the moment. After placing the model files in the project directory, write main.py:

import pickle
import random

import nltk
import numpy as np
import uvicorn
from fastapi import FastAPI
from keras.models import model_from_json
from nltk.stem.lancaster import LancasterStemmer

app = FastAPI()
stemmer = LancasterStemmer()

# Assumption: the training script pickled words, classes and intents into
# this file (hypothetical name); the original post does not show that step.
with open("./v3ucn_vocab.pkl", "rb") as file:
    saved = pickle.load(file)
words, classes, intents = saved["words"], saved["classes"], saved["intents"]

# load json and create the model once at startup, not on every request
with open("./v3ucn.json", "r") as file:
    model = model_from_json(file.read())
model.load_weights("./v3ucn.h5f")


def bow(sentence, words):
    # tokenize and stem the sentence, then mark which vocabulary words occur
    sentence_words = [stemmer.stem(w.lower()) for w in nltk.word_tokenize(sentence)]
    return np.array([1 if w in sentence_words else 0 for w in words])


def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model
    input_data = np.array([bow(sentence, words)], dtype=float)
    results = model.predict(input_data)[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    # return (intent, probability) tuples
    return [(classes[r[0]], str(r[1])) for r in results]


@app.get('/')
async def root(word: str = None):
    a = ""
    # guard against a missing query parameter or no intent above the threshold
    wordlist = classify_local(word) if word else []
    if wordlist:
        for intent in intents['intents']:
            if intent['tag'] == wordlist[0][0]:
                a = random.choice(intent['responses'])

    return {'message':a}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)

This part:

from keras.models import model_from_json  
file = open("./v3ucn.json", 'r')  
model_json = file.read()  
file.close()  
model = model_from_json(model_json)  
model.load_weights("./v3ucn.h5f")

loads the model we just trained. Then start the service:

uvicorn main:app --reload

The result looks like this:
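To try the endpoint from a script rather than a browser, a request could be sent along these lines. This is a sketch that assumes the service is running locally on port 8000 as configured in main.py; the helper names `build_url` and `ask` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(word, base="http://127.0.0.1:8000/"):
    """Build the request URL, percent-encoding the (possibly Chinese) word."""
    return base + "?" + urlencode({"word": word})


def ask(word):
    """Send the word to the running service and return the reply text."""
    with urlopen(build_url(word)) as response:
        return json.loads(response.read().decode("utf-8"))["message"]


print(build_url("你好"))   # http://127.0.0.1:8000/?word=%E4%BD%A0%E5%A5%BD
```

With the server running, `ask("你好")` should return one of the greeting responses, picked at random.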

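Closing words: without a doubt, technology changes life. A chatbot lets us hear pleasant chatter even with no company by our side, and I believe that before long a smiling, elegant "Ex Machina" will keep us company under the cool breeze and the bright moon.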
