
Text Classification with Keras RNN in Practice: Using Tencent and Baidu Chinese Word Vectors


Chinese Word Vectors

Deep learning has made great strides in NLP, and applying deep learning to text is impossible without first turning the text into vectors.

English grammar lets a sentence be split into words simply by whitespace, which makes obtaining word vectors, and the whole NLP preprocessing pipeline, much simpler. Industry and academia also provide excellent resources, such as Google's word2vec algorithm and Stanford's GloVe algorithm; these open-source English word vector models have greatly expanded the depth and breadth of English natural language processing.

Chinese is different. Ever since Classical Chinese, its rich semantics have made word segmentation and sentence breaking the first hurdle a reader must clear. Even in modern vernacular Chinese with punctuation, there is no simple rule-based method for segmentation. Mimicking the English NLP pipeline therefore makes Chinese considerably harder to analyze, and Chinese NLP research has lagged behind as a result.

Currently the most common way to obtain Chinese word vectors is to use the gensim toolkit and train them on the full sample made up of the training and test sets. The quality of the vectors depends on the size of that sample: the larger it is, the better the downstream deep learning results, and the better the semantic similarity judged by computing the "distance" between two word vectors. Producing high-quality Chinese word vectors takes massive Chinese corpora, huge storage, strong compute, and careful algorithm design; none of these can be missing.
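As a reference point, a minimal gensim training sketch on a locally segmented corpus might look like the following. The corpus file name, the query words, and the hyperparameters are illustrative choices, not taken from the original article.

import jieba  # noqa: segmentation assumed done beforehand
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# 'corpus_seg.txt' is a hypothetical file: one document per line,
# already segmented with jieba and joined by spaces.
sentences = LineSentence('corpus_seg.txt')        # streams the corpus lazily
model = Word2Vec(sentences, size=200, window=5,   # 200-dim vectors ('size' is 'vector_size' in gensim >= 4.0)
                 min_count=5, workers=4, sg=1)    # sg=1: skip-gram

print(model.wv['开发'][:5])                        # first 5 dimensions of one word vector
print(model.wv.similarity('开发', '研发'))          # cosine similarity between two words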

Tencent AI Lab recently open-sourced its pre-trained Chinese word vectors (download: https://ai.tencent.com/ailab/nlp/embedding.html), covering more than 8 million Chinese terms, including popular slang such as 一颗赛艇 ("exciting") and 因吹斯听 ("interesting"). According to the release notes, "the training corpus comes from Tencent News and Tian Tian Kuai Bao news articles, plus web pages and novels crawled in-house", and "the vectors are trained with the in-house Directional Skip-Gram (DSG) algorithm". The data is a txt file of roughly 16.7 GB (corrected), with 200-dimensional vectors. After downloading it to a local server, I loaded it into MySQL and query it by SQL to fetch the vectors of the segmented words in each text.
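The loading step itself is not shown in this article. A minimal sketch, assuming the released file format (a header line with vocabulary size and dimension, then one word followed by 200 floats per line) and a hypothetical table tencent_embedding(word, vec), might look like this:

# Hypothetical loader: stream the Tencent embedding txt into MySQL.
# Table name, column layout, and connection parameters are illustrative.
import pymysql

db = pymysql.connect(host="***", user="***", password="***", db="***", charset="utf8mb4")
cursor = db.cursor()

with open("Tencent_AILab_ChineseEmbedding.txt", encoding="utf-8") as f:
    header = f.readline()                           # e.g. "<vocab size> 200"
    batch = []
    for line in f:
        parts = line.rstrip("\n").split(" ")
        word, vec = parts[0], " ".join(parts[1:])   # store the 200 floats as one space-separated text field
        batch.append((word, vec))
        if len(batch) >= 10000:                     # insert in batches to keep it fast
            cursor.executemany("INSERT INTO tencent_embedding (word, vec) VALUES (%s, %s)", batch)
            db.commit()
            batch = []
    if batch:
        cursor.executemany("INSERT INTO tencent_embedding (word, vec) VALUES (%s, %s)", batch)
        db.commit()
db.close()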

The Baidu AI open platform gives ordinary developers turnkey Chinese NLP capabilities: commercial-grade base NLP resources trained on crawled web data, free to try, supporting tasks such as Chinese word embeddings, text classification (currently 26 content categories, e.g. entertainment, sports, technology), and dependency parsing. Word vectors can be fetched via POST requests or the SDK; the data stays on Baidu's servers, the trial quota is limited to 5 QPS, and the vectors are 1024-dimensional, which is plenty for NLP tasks on small datasets.

This article applies a One-hot model, an Embedding model, the Tencent open-source Chinese word vectors, and the Baidu open-platform Chinese word vectors, all with the Keras deep learning API, to a simple financial text classification task, to see how well the two public sets of Chinese word vectors work.

 

Text Preprocessing

We use a small financial text dataset. After preliminary processing (cleaning dirty data is a big job in itself), it contains 4 label classes with 1000 samples per class. Training uses a Keras LSTM model in a Jupyter Notebook.

Read the data:

import pandas as pd

df_train = pd.read_excel('./data/taxdata.xls', header=0)

1. One-hot + RNN

Encode the labels and build the vocabulary with the Tokenizer:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# Split off data and labels
data_list = df_train['data']
class_list = df_train['class']
# Train/validation split (this step is implied by the 3200/800 split used below)
X_train, X_test, y_train, y_test = train_test_split(data_list, class_list, test_size=0.2)
# Encode the labels
y_labels = list(y_train.value_counts().index)
le = LabelEncoder()
le.fit(y_labels)
num_labels = len(y_labels)
y_train = to_categorical(y_train.map(lambda x: le.transform([x])[0]), num_labels)
y_test = to_categorical(y_test.map(lambda x: le.transform([x])[0]), num_labels)
# Build the word-to-id dictionary
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")
tokenizer.fit_on_texts(data_list)
vocab = tokenizer.word_index
x_train_word_ids = tokenizer.texts_to_sequences(X_train)
x_test_word_ids = tokenizer.texts_to_sequences(X_test)

Encode the inputs. A full One-hot encoding would produce 24271-dimensional word vectors, so to lighten the computation we use an integer-sequence encoding in the One-hot spirit (each word's occurrence treated as independent): every word gets its own index, a position holds that index when the word occurs there and 0 otherwise, and we then add an extra feature dimension:

from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Integer-sequence encoding, padded/truncated to length 200
x_train_padded_seqs = pad_sequences(x_train_word_ids, maxlen=200)
x_test_padded_seqs = pad_sequences(x_test_word_ids, maxlen=200)
x_train_padded_seqs = np.expand_dims(x_train_padded_seqs, axis=2)
x_test_padded_seqs = np.expand_dims(x_test_padded_seqs, axis=2)

Build the RNN model:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# Sequence encoding + RNN model
model = Sequential()
# input shape: 200 timesteps, 1 feature per step (matching the expand_dims above)
model.add(LSTM(256, dropout=0.5, recurrent_dropout=0.1, input_shape=(200, 1)))
model.add(Dense(256, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_padded_seqs, y_train,
          batch_size=32,
          epochs=12,
          validation_data=(x_test_padded_seqs, y_test))

Training results:

Train on 3200 samples, validate on 800 samples
Epoch 1/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.4094 - acc: 0.2594 - val_loss: 1.4007 - val_acc: 0.2437
Epoch 2/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.3944 - acc: 0.2559 - val_loss: 1.3860 - val_acc: 0.2550
Epoch 3/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.3905 - acc: 0.2672 - val_loss: 1.3946 - val_acc: 0.2637
Epoch 4/12
3200/3200 [==============================] - 29s 9ms/step - loss: 1.3937 - acc: 0.2625 - val_loss: 1.4082 - val_acc: 0.2662
Epoch 5/12
3200/3200 [==============================] - 30s 9ms/step - loss: 1.3904 - acc: 0.2591 - val_loss: 1.3824 - val_acc: 0.2575
Epoch 6/12
3200/3200 [==============================] - 27s 9ms/step - loss: 1.3907 - acc: 0.2497 - val_loss: 1.3866 - val_acc: 0.2687
Epoch 7/12
3200/3200 [==============================] - 27s 8ms/step - loss: 1.3875 - acc: 0.2753 - val_loss: 1.3793 - val_acc: 0.2687
Epoch 8/12
3200/3200 [==============================] - 27s 8ms/step - loss: 1.3876 - acc: 0.2581 - val_loss: 1.3811 - val_acc: 0.2662
Epoch 9/12
3200/3200 [==============================] - 28s 9ms/step - loss: 1.3872 - acc: 0.2638 - val_loss: 1.3919 - val_acc: 0.2700
Epoch 10/12
3200/3200 [==============================] - 28s 9ms/step - loss: 1.3871 - acc: 0.2697 - val_loss: 1.3900 - val_acc: 0.2475
Epoch 11/12
3200/3200 [==============================] - 28s 9ms/step - loss: 1.3946 - acc: 0.2603 - val_loss: 1.3881 - val_acc: 0.2725
Epoch 12/12
3200/3200 [==============================] - 29s 9ms/step - loss: 1.3888 - acc: 0.2544 - val_loss: 1.3786 - val_acc: 0.2650

As you can see, with the One-hot + RNN model both training and validation accuracy stay very low.

 

2. Embedding + RNN

Each sample is segmented with the popular jieba package, and meaningless words such as "的" and "是" are removed according to a stopword list (a segmentation sketch follows). Using the Keras Tokenizer to build the vocabulary and a trainable Embedding layer, we can then learn 300-dimensional word embeddings specific to this local dataset.
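The segmentation step itself is not shown in the original. A minimal sketch, assuming a local stopword file (the path is illustrative) and producing the space-joined strings that the Tokenizer code below consumes, might look like this:

# Hypothetical segmentation step with jieba and a stopword list.
# './data/stopwords.txt' is an assumed local file, one stopword per line.
import jieba

with open('./data/stopwords.txt', encoding='utf-8') as f:
    stopwords_set = set(line.strip() for line in f)

def segment(text):
    """Segment one document, drop stopwords, and return a space-joined string."""
    words = [w for w in jieba.cut(text) if w.strip() and w not in stopwords_set]
    return " ".join(words)

data_list = df_train['data'].map(segment)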

MAX_SEQUENCE_LENGTH = 100  # keep 100 words per document
EMBEDDING_DIM = 300        # word vector dimension: 300

tokenizer = Tokenizer(num_words=15000)
tokenizer.fit_on_texts(data_list)  # build the vocabulary over the full dataset
sequences = tokenizer.texts_to_sequences(data_list)
word_index_dict = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # truncate/pad each document
labels = to_categorical(np.asarray(class_list))  # one-hot label representation (assumes integer-coded classes)

Shuffle the documents and split the data:

# Shuffle the documents
VALIDATION_SPLIT = 0.2
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
# Split the data
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

Define the Embedding layer. With trainable=True, Keras learns the word embeddings from the local dataset and applies them to the training and validation data:

from keras.layers import Embedding

# trainable=True: the word embeddings are updated as model parameters
num_words = len(word_index_dict) + 1  # +1 because Tokenizer indices start at 1
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

Add the RNN; here we use an LSTM layer:

from keras import backend as K
from keras.models import Model
from keras.layers import Embedding, Input, Dense, Conv1D, MaxPooling1D, Flatten, Lambda, LSTM

K.set_image_dim_ordering('tf')
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(256, dropout=0.5)(embedded_sequences)
x = Dense(256, activation='relu')(x)
preds = Dense(num_labels, activation='softmax')(x)
model = Model(sequence_input, preds)
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=12, batch_size=128)

Model training results:

Train on 3200 samples, validate on 800 samples
Epoch 1/12
3200/3200 [==============================] - 15s 5ms/step - loss: 1.3157 - acc: 0.4903 - val_loss: 0.9386 - val_acc: 0.6562
Epoch 2/12
3200/3200 [==============================] - 15s 5ms/step - loss: 0.7593 - acc: 0.7184 - val_loss: 0.7342 - val_acc: 0.7163
Epoch 3/12
3200/3200 [==============================] - 16s 5ms/step - loss: 0.6163 - acc: 0.7737 - val_loss: 0.6189 - val_acc: 0.7688
Epoch 4/12
3200/3200 [==============================] - 14s 5ms/step - loss: 0.4820 - acc: 0.8147 - val_loss: 0.6228 - val_acc: 0.7512
Epoch 5/12
3200/3200 [==============================] - 15s 5ms/step - loss: 0.4272 - acc: 0.8491 - val_loss: 0.6892 - val_acc: 0.7550
Epoch 6/12
3200/3200 [==============================] - 14s 4ms/step - loss: 0.3695 - acc: 0.8606 - val_loss: 0.6637 - val_acc: 0.7475
Epoch 7/12
3200/3200 [==============================] - 14s 5ms/step - loss: 0.3239 - acc: 0.8791 - val_loss: 0.7030 - val_acc: 0.7500
Epoch 8/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.2788 - acc: 0.9056 - val_loss: 0.7326 - val_acc: 0.7538
Epoch 9/12
3200/3200 [==============================] - 14s 4ms/step - loss: 0.2457 - acc: 0.9138 - val_loss: 0.7843 - val_acc: 0.7362
Epoch 10/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.2107 - acc: 0.9241 - val_loss: 0.8854 - val_acc: 0.7350
Epoch 11/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.1975 - acc: 0.9331 - val_loss: 0.8781 - val_acc: 0.7350
Epoch 12/12
3200/3200 [==============================] - 13s 4ms/step - loss: 0.1770 - acc: 0.9384 - val_loss: 0.8481 - val_acc: 0.7538

Embedding + RNN raises validation accuracy markedly, peaking at 0.7688, while training accuracy climbs above 90%, a clear sign of overfitting. With some hyperparameter tuning, validation accuracy could probably be pushed higher (see the sketch below).
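One simple mitigation, not used in the original runs, would be to stop training once validation loss stops improving, e.g. with Keras's EarlyStopping callback. A sketch under that assumption:

# Optional: early stopping to curb overfitting (not part of the original experiments).
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',         # watch validation loss
                           patience=3,                 # stop after 3 epochs without improvement
                           restore_best_weights=True)  # roll back to the best epoch (Keras >= 2.2.3)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=12, batch_size=128, callbacks=[early_stop])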

 

3. Tencent Chinese Word Vectors + RNN

Upload the full set of Tencent AI Lab's open-source Chinese word vectors into MySQL, then query MySQL to fetch the 200-dimensional vector of each segmented word in the samples.

import pymysql

# Connect to MySQL
db = pymysql.connect(host="***", user="***", password="***", db="***", port=***)
cursor = db.cursor()

# Query one word vector
def querywordvec(word):
    vec_int = []
    query_sql = "select * from XXX where word='%s'" % word
    try:
        cursor.execute(query_sql)
        results = cursor.fetchall()
        vec = results[0][2]
        # the vector column is assumed to hold the 200 floats as one space-separated string
        for x in vec.split():
            vec_int.append(float(x))
    except:
        print("Error: unable to fetch data")
    return vec_int

For example, querying the word "开发" (develop) gives:

print(querywordvec("开发"))
[0.205318, 0.02924, 0.025059, -0.031507, -0.035252, 0.147428, 0.064118, 0.402488, 0.424321, 0.437024, 0.012467, -0.098729, -0.158572, -0.088177, -0.043449, 0.089409, -0.099055, -0.283804, 0.112545, 0.025541, -0.01726, -0.150909, -0.083299, 0.037459, 0.29605, 0.01388, -0.287553, 0.117286, 0.13666, 0.493275, 0.302443, 0.082535, -0.009056, 0.24045, -0.007371, 0.119541, 0.432921, 0.025741, -0.29922, 0.21198, 0.021523, 0.220857, 0.44779, 0.291499, -0.184952, -0.006434, -0.115189, -0.266904, 0.003495, -0.159119, -0.384113, -0.387713, 0.170096, 0.198, 0.07035, 0.177311, -0.019644, -0.188508, 0.031889, -0.392723, 0.227364, 0.616728, -0.059071, -0.364697, -0.077505, -0.260351, -0.268732, 0.238778, 0.427052, 0.321993, -0.037369, -0.159352, 0.400518, 0.229699, -0.3446, 0.046306, -0.066257, 0.377816, -0.055773, -0.325963, -0.102563, 0.205084, 0.118749, 0.403796, 0.085079, -0.134903, -0.035444, 0.126386, -0.142862, -0.293126, 0.142639, 0.108202, -0.022327, 0.011597, 0.426736, -0.090832, 0.168828, -0.330628, -0.333454, 0.214868, -0.452769, -0.024319, 0.072544, 0.127925, -0.389489, -0.088286, -0.190296, -0.085807, 0.077976, -0.020705, -0.219265, -0.360965, 0.207212, -0.285199, -0.211401, -0.120366, 0.086652, 0.090502, -0.144671, -0.392925, 0.285612, -0.427401, -0.148718, -0.061124, 0.139129, 0.199844, -0.39937, -0.123523, -0.21283, -0.129926, 0.249201, -0.196021, 0.166995, -0.452386, -0.371797, -0.234237, -0.063534, 0.056776, -0.159089, 0.260067, -0.412732, -0.195597, -0.431985, -0.471183, -0.170963, 0.180416, 0.197121, 0.296519, 0.081022, -0.383157, -0.227555, -0.285242, 0.040487, -0.224609, -0.121581, 0.186237, 0.010203, 0.502054, -0.2188, -0.088945, 0.219765, -0.045787, 0.119763, 0.30921, 0.231384, -0.163442, -0.1442, 0.06971, -0.325053, -0.247143, 0.112627, 0.034369, -0.096266, 0.194837, -0.301401, -0.099836, -0.075422, 0.367559, -0.319538, 0.470193, -0.165735, -0.350219, 0.295977, 0.009617, 0.201713, 0.33146, 0.03736, 0.224218, -0.09293, 0.10523, -0.018303, 0.191042, 0.260462, 0.025095, 0.122858, 0.635381, 0.26528, 0.309128, -0.30828, -0.015132]
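As a quick sanity check on the claim that vector "distance" reflects semantic similarity, two queried vectors can be compared with cosine similarity. This is a small sketch using numpy; the second query word is just an illustrative choice.

# Cosine similarity between two Tencent word vectors (illustrative check).
import numpy as np

def cosine_sim(w1, w2):
    v1, v2 = np.array(querywordvec(w1)), np.array(querywordvec(w2))
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_sim("开发", "研发"))  # similar words should score close to 1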

Build the word vector matrix. Counting the words that could not be found, k = 1002 words are missing, e.g. "租代建", "按平销", "65%", "税盘"; so of the 13281 words in the dataset, 13281 - 1002 = 12279 get a Chinese word vector.

k = 0  # number of words without a vector
num_words = len(word_index_dict) + 1  # +1 because Tokenizer indices start at 1
EMBEDDING_DIM = 200                   # Tencent vectors are 200-dimensional
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index_dict.items():
    vec = querywordvec(word)
    if len(vec) != 0:
        embedding_matrix[i] = vec
    else:
        k += 1
db.close()

Set up the Embedding layer:

# trainable=False: the word vectors are not updated during training
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

As before, train with the LSTM model from section 2 (Embedding + RNN), this time with epochs=20. Results:

Train on 3200 samples, validate on 800 samples
Epoch 1/20
3200/3200 [==============================] - 23s 7ms/step - loss: 1.3878 - acc: 0.3259 - val_loss: 1.3240 - val_acc: 0.3287
Epoch 2/20
3200/3200 [==============================] - 21s 7ms/step - loss: 1.2471 - acc: 0.4431 - val_loss: 1.2147 - val_acc: 0.4225
Epoch 3/20
3200/3200 [==============================] - 21s 6ms/step - loss: 1.1815 - acc: 0.4797 - val_loss: 1.1546 - val_acc: 0.4888
Epoch 4/20
3200/3200 [==============================] - 21s 7ms/step - loss: 1.1166 - acc: 0.5075 - val_loss: 1.0379 - val_acc: 0.5587
Epoch 5/20
3200/3200 [==============================] - 22s 7ms/step - loss: 1.0926 - acc: 0.5291 - val_loss: 0.9903 - val_acc: 0.5962
Epoch 6/20
3200/3200 [==============================] - 21s 6ms/step - loss: 1.0647 - acc: 0.5384 - val_loss: 0.9586 - val_acc: 0.5850
Epoch 7/20
3200/3200 [==============================] - 12s 4ms/step - loss: 1.0155 - acc: 0.5725 - val_loss: 1.1208 - val_acc: 0.4913
Epoch 8/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.9712 - acc: 0.5816 - val_loss: 0.8816 - val_acc: 0.6275
Epoch 9/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.9227 - acc: 0.6134 - val_loss: 0.8344 - val_acc: 0.6338
Epoch 10/20
3200/3200 [==============================] - 11s 4ms/step - loss: 0.9085 - acc: 0.6163 - val_loss: 0.8979 - val_acc: 0.6312
Epoch 11/20
3200/3200 [==============================] - 12s 4ms/step - loss: 0.8666 - acc: 0.6331 - val_loss: 0.9032 - val_acc: 0.6138
Epoch 12/20
3200/3200 [==============================] - 12s 4ms/step - loss: 0.8023 - acc: 0.6597 - val_loss: 0.7601 - val_acc: 0.6800
Epoch 13/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.7834 - acc: 0.6778 - val_loss: 0.8206 - val_acc: 0.6488
Epoch 14/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.7422 - acc: 0.6966 - val_loss: 0.8015 - val_acc: 0.6637
Epoch 15/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.7179 - acc: 0.7041 - val_loss: 0.7536 - val_acc: 0.6637
Epoch 16/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6965 - acc: 0.7084 - val_loss: 0.7869 - val_acc: 0.6937
Epoch 17/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6855 - acc: 0.7216 - val_loss: 0.7411 - val_acc: 0.6800
Epoch 18/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6342 - acc: 0.7341 - val_loss: 0.7292 - val_acc: 0.7125
Epoch 19/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.6188 - acc: 0.7550 - val_loss: 0.7849 - val_acc: 0.6987
Epoch 20/20
3200/3200 [==============================] - 11s 3ms/step - loss: 0.5950 - acc: 0.7641 - val_loss: 0.7121 - val_acc: 0.7188

With the Tencent Chinese word vectors, training and validation accuracy stay roughly in step; the best validation accuracy is 0.7188.

 

4. Baidu Chinese Word Vectors + RNN

To use the base NLP resources on the Baidu open platform (docs: https://cloud.baidu.com/doc/NLP/index.html), you need a Baidu Cloud account and the APP_ID, API_KEY, and SECRET_KEY credentials obtained through its application process.

# Fetch Chinese word vectors with the Python SDK
from aip import AipNlp

APP_ID = '***'
API_KEY = '***'
SECRET_KEY = '***'
client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

Fetch the 1024-dimensional Chinese word vectors:

import re
import time

k = 0  # number of words without a vector
num_words = len(word_index_dict) + 1  # +1 because Tokenizer indices start at 1
EMBEDDING_DIM = 1024                  # Baidu vectors are 1024-dimensional
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index_dict.items():
    # skip words containing digits
    pattern = re.compile('[0-9]+')
    match = pattern.findall(word)
    if len(word) <= 2 and not match:
        wordinfo = client.wordEmbedding(word)
        time.sleep(0.6)  # stay under the 5 QPS trial limit
        try:
            wordinfo['vec']
        except Exception as e:
            k += 1
        else:
            embedding_vector = wordinfo['vec']
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

The Baidu word vectors could only be retrieved for two-character words: 2566 two-character words came back without a vector, 4658 words are longer than two characters, and 590 are numeric, so the dataset ends up with 13281 - 2566 - 590 - 4658 = 5467 Chinese word vectors.

Train the model, again reusing the LSTM from section 2 (Embedding + RNN) with epochs=20. Results:

Train on 3200 samples, validate on 800 samples
Epoch 1/20
3200/3200 [==============================] - 18s 5ms/step - loss: 1.1565 - acc: 0.5000 - val_loss: 0.8308 - val_acc: 0.6700
Epoch 2/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.8307 - acc: 0.6653 - val_loss: 0.7311 - val_acc: 0.7163
Epoch 3/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.7342 - acc: 0.7109 - val_loss: 0.8210 - val_acc: 0.6613
Epoch 4/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.6248 - acc: 0.7591 - val_loss: 0.7152 - val_acc: 0.7175
Epoch 5/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.5598 - acc: 0.7812 - val_loss: 0.6750 - val_acc: 0.7350
Epoch 6/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.4954 - acc: 0.8069 - val_loss: 0.6367 - val_acc: 0.7625
Epoch 7/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.4472 - acc: 0.8341 - val_loss: 0.6642 - val_acc: 0.7625
Epoch 8/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.4024 - acc: 0.8453 - val_loss: 0.6992 - val_acc: 0.7488
Epoch 9/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.3542 - acc: 0.8656 - val_loss: 0.7054 - val_acc: 0.7650
Epoch 10/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.3318 - acc: 0.8694 - val_loss: 0.8433 - val_acc: 0.7300
Epoch 11/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.3042 - acc: 0.8809 - val_loss: 0.8931 - val_acc: 0.7113
Epoch 12/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.2804 - acc: 0.8919 - val_loss: 0.7937 - val_acc: 0.7525
Epoch 13/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.2454 - acc: 0.9106 - val_loss: 0.8601 - val_acc: 0.7475
Epoch 14/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.2255 - acc: 0.9131 - val_loss: 0.8541 - val_acc: 0.7450
Epoch 15/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1984 - acc: 0.9266 - val_loss: 0.9004 - val_acc: 0.7550
Epoch 16/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1934 - acc: 0.9300 - val_loss: 0.8683 - val_acc: 0.7612
Epoch 17/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1783 - acc: 0.9291 - val_loss: 0.9406 - val_acc: 0.7525
Epoch 18/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.1735 - acc: 0.9353 - val_loss: 0.9408 - val_acc: 0.7600
Epoch 19/20
3200/3200 [==============================] - 15s 5ms/step - loss: 0.1453 - acc: 0.9519 - val_loss: 1.0023 - val_acc: 0.7612
Epoch 20/20
3200/3200 [==============================] - 16s 5ms/step - loss: 0.1463 - acc: 0.9425 - val_loss: 1.0696 - val_acc: 0.7575

Training accuracy gets very high (up to 0.9519) while validation accuracy peaks at 0.7650, so overfitting shows up here as well.

 

5. Classic Classifiers

Besides deep learning, we can also try classic models such as naive Bayes, random forests, and support vector machines, and see what accuracy they reach.

from sklearn.model_selection import train_test_split

# Chinese text processing
def TextProcessing(data_list, class_list):
    # Split into training and validation sets
    train_data_list, test_data_list, train_class_list, test_class_list = train_test_split(data_list, class_list, test_size=0.25)
    # Count word frequencies on the training set
    all_words_dict = {}
    for word_list in train_data_list:
        for word in word_list:
            if word in all_words_dict.keys():
                all_words_dict[word] += 1
            else:
                all_words_dict[word] = 1
    # Sort by frequency, descending
    all_words_tuple_list = sorted(all_words_dict.items(), key=lambda f: f[1], reverse=True)
    all_words_list, all_words_nums = zip(*all_words_tuple_list)  # unzip
    all_words_list = list(all_words_list)  # convert to list
    return all_words_list, train_data_list, test_data_list, train_class_list, test_class_list

# Texts must be segmented with jieba beforehand; keep the top N words
def Words_dict(all_words_list, deleteN, stopwords_set=set()):
    feature_words = []  # feature word list
    n = 1
    for t in range(deleteN, len(all_words_list), 1):
        if n > 1000:  # cap feature_words at 1000 dimensions
            break
        # A word qualifies as a feature if it is not a number, not a stopword, and its length is between 2 and 4
        if not all_words_list[t].isdigit() and all_words_list[t] not in stopwords_set and 1 < len(all_words_list[t]) < 5:
            feature_words.append(all_words_list[t])
            n += 1
    return feature_words

# Build the One-hot feature vectors
def TextFeatures(train_data_list, test_data_list, feature_words):
    def text_features(text, feature_words):  # set 1 if the feature word occurs in the text
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]
        return features
    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list

Define the classifiers:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Multinomial naive Bayes classifier
def TextClassifier_MultinomialNB(train_feature_list, test_feature_list, train_class_list, test_class_list):
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    return test_accuracy

# Support vector machine classifier
def TextClassifier_SVC(train_feature_list, test_feature_list, train_class_list, test_class_list):
    clf = SVC(C=150, kernel='rbf', degree=51, gamma='auto', coef0=0.0, shrinking=False,
              probability=False, tol=0.001, cache_size=300, class_weight=None, verbose=False,
              max_iter=-1, decision_function_shape=None, random_state=None)
    clf.fit(train_feature_list, train_class_list)
    test_accuracy = clf.score(test_feature_list, test_class_list)
    return test_accuracy

# Random forest classifier
def TextClassifier_RF(train_feature_list, test_feature_list, train_class_list, test_class_list):
    forest = RandomForestClassifier(n_estimators=500, random_state=5, warm_start=False,
                                    min_impurity_decrease=0.0,
                                    min_samples_split=15)
    forest.fit(train_feature_list, train_class_list)
    test_accuracy = forest.score(test_feature_list, test_class_list)
    return test_accuracy

Compare the classifiers:

if __name__ == '__main__':
    # Text preprocessing (data_list here holds the jieba-segmented word lists)
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(data_list, class_list)
    feature_words = Words_dict(all_words_list, 0, stopwords_set)
    train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)
    # Compare the accuracy of each classifier
    test_accuracy_MultinomialNB = TextClassifier_MultinomialNB(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("MultinomialNB Accuracy:", test_accuracy_MultinomialNB)
    test_accuracy_SVC = TextClassifier_SVC(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("SVC Accuracy:", test_accuracy_SVC)
    test_accuracy_RF = TextClassifier_RF(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("RF Accuracy:", test_accuracy_RF)

Results:

MultinomialNB Accuracy: 0.71
SVC Accuracy: 0.7542857142857143
RF Accuracy: 0.7557142857142857

So on a small dataset, classic classifiers can perform about as well as the deep learning models.

 

Summary

The experiments above show that classic classifiers work quite well on small datasets. Among the deep learning runs, the models with pretrained Chinese word vectors are noticeably more accurate than the One-hot model. Oddly, although fewer than half as many words received a Baidu vector as received a Tencent vector, the Baidu run trained somewhat better; this deserves further experiments.

Finally, the Chinese Word Vectors project has also been released (https://github.com/Embedding/Chinese-Word-Vectors); interested readers can run further experiments with it.

Happy learning!

 

Follow the WeChat official account: a_white_deer

Technical questions are welcome at: xiezj2010@126.com

 
