Deep learning has made great strides in NLP, and processing text with deep learning is impossible without first turning the text into vectors.
English grammar lets a sentence be split into words with nothing more than spaces, which makes it easy to obtain word vectors and greatly simplifies English NLP preprocessing. Industry and academia also provide excellent resources, such as Google's word2vec algorithm and Stanford's GloVe algorithm; these open-source English word vector models have greatly expanded the depth and breadth of English natural language processing.
Chinese is different. From classical Chinese onward, its dense semantics have made word segmentation and sentence breaking the first hurdle any reader must clear; even modern vernacular Chinese with punctuation has no simple rule-based segmentation method. Copying the English NLP pipeline therefore makes Chinese considerably harder to analyze, and has left Chinese NLP research lagging behind.
The popular way to obtain Chinese word vectors today is the gensim toolkit: train vectors on the combined training and test samples. The quality of the vectors depends on the size of that corpus; the larger the corpus, the better the vectors work for downstream deep learning, and the better the "distance" between two vectors reflects their semantic similarity. Producing high-quality Chinese word vectors takes massive Chinese text, huge storage, strong compute, and careful algorithm design, and none of these can be missing.
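As a minimal sketch of that gensim approach (the corpus file, parameters, and example words below are placeholders rather than this article's actual setup, and the gensim 3.x API with size= is assumed; newer gensim uses vector_size=):

# Minimal gensim sketch: train word vectors on a jieba-segmented corpus (hypothetical file and parameters).
import jieba
from gensim.models import Word2Vec

# Each line of corpus.txt is one document; segment it into a word list with jieba.
with open('corpus.txt', encoding='utf-8') as f:
    sentences = [list(jieba.cut(line.strip())) for line in f if line.strip()]

# Train 200-dimensional skip-gram vectors on the whole sample.
model = Word2Vec(sentences, size=200, window=5, min_count=2, sg=1, workers=4)

# The "distance" between two vectors is used as a proxy for semantic similarity.
print(model.wv.similarity('增值税', '所得税'))
print(model.wv.most_similar('发票', topn=5))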
Tencent AI Lab recently open-sourced its pre-trained Chinese word vectors (download: https://ai.tencent.com/ailab/nlp/embedding.html), covering more than 8 million Chinese words and phrases, including popular slang such as 一颗赛艇 and 因吹斯听. According to the release notes, the training corpus comes "from Tencent News and Tiantian Kuaibao news articles, plus web pages and novels crawled by the team", and the vectors were trained with "our in-house Directional Skip-Gram (DSG) algorithm". The data is roughly 16.7 GB (corrected), the vectors are 200-dimensional, and the format is a plain txt file. After downloading it to a local server, I loaded the data into MySQL and retrieve the vectors for each segmented word with SQL queries.
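A rough sketch of that loading step follows. The table and column names (word_vectors, word, vec) and the connection values are hypothetical; the file is assumed to follow the standard word2vec text format, i.e. a header line "<vocab_size> <dim>" followed by one "word v1 v2 ... v200" line per entry:

# Hedged sketch: bulk-load the Tencent embedding txt file into a MySQL table.
import pymysql

db = pymysql.connect(host='***', user='***', password='***', db='***', port=3306, charset='utf8mb4')
cursor = db.cursor()

with open('Tencent_AILab_ChineseEmbedding.txt', encoding='utf-8') as f:
    next(f)  # skip the "<vocab_size> <dim>" header line
    batch = []
    for line in f:
        parts = line.rstrip().split(' ')
        # Store the 200 floats as a single space-separated text column.
        batch.append((parts[0], ' '.join(parts[1:])))
        if len(batch) >= 10000:
            cursor.executemany("INSERT INTO word_vectors (word, vec) VALUES (%s, %s)", batch)
            db.commit()
            batch = []
    if batch:
        cursor.executemany("INSERT INTO word_vectors (word, vec) VALUES (%s, %s)", batch)
        db.commit()
db.close()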
The Baidu AI open platform gives ordinary developers out-of-the-box Chinese NLP capability: commercial-grade basic NLP resources trained on crawled web data, free to try, supporting many tasks such as Chinese word vectors, text classification (currently 26 content categories including entertainment, sports, and technology), dependency parsing, and more. Chinese word vectors can be fetched via POST requests or the SDK; the data lives on Baidu's servers, the trial quota is limited to 5 QPS by default, and the vectors are 1024-dimensional, which is plenty for NLP tasks on small datasets.
In this article I apply a one-hot model, a locally trained embedding model, the Tencent open-source Chinese word vectors, and the Baidu platform's Chinese word vectors to a simple financial text classification task with the Keras deep learning API, to check how well these two public sets of Chinese word vectors work in practice.
The dataset is a small financial text corpus that has already been preprocessed (cleaning dirty data is a big job in itself). It has 4 label classes with 1,000 samples per class, and training uses a Keras LSTM model in a Jupyter Notebook.
Read the data:
import pandas as pd

df_train = pd.read_excel('./data/taxdata.xls', header=0)
Encode the labels and build the word index with a Tokenizer:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Separate the text and the labels
data_list = df_train['data']
class_list = df_train['class']

# Split into training and test sets (80/20, matching the 3200/800 samples seen below)
X_train, X_test, y_train, y_test = train_test_split(data_list, class_list, test_size=0.2)

# Encode the labels
y_labels = list(y_train.value_counts().index)
le = LabelEncoder()
le.fit(y_labels)
num_labels = len(y_labels)
y_train = to_categorical(y_train.map(lambda x: le.transform([x])[0]), num_labels)
y_test = to_categorical(y_test.map(lambda x: le.transform([x])[0]), num_labels)

# Build the word-to-id dictionary (the texts are space-separated jieba tokens)
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")
tokenizer.fit_on_texts(data_list)
vocab = tokenizer.word_index
x_train_word_ids = tokenizer.texts_to_sequences(X_train)
x_test_word_ids = tokenizer.texts_to_sequences(X_test)
Get the one-hot style encoding. A full one-hot encoding would produce vectors of dimension 24,271, so to lighten the computation I instead use an index-sequence encoding in the one-hot spirit (each word is treated as occurring independently): every word gets an integer id, a position in the sequence holds that id if the word occurs there and 0 otherwise, and an extra feature dimension is added:
# Index-sequence encoding: pad/truncate each sample to 200 word ids
x_train_padded_seqs = pad_sequences(x_train_word_ids, maxlen=200)
x_test_padded_seqs = pad_sequences(x_test_word_ids, maxlen=200)
x_train_padded_seqs = np.expand_dims(x_train_padded_seqs, axis=2)  # shape (n, 200, 1) for the LSTM
x_test_padded_seqs = np.expand_dims(x_test_padded_seqs, axis=2)
Build the RNN model:
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Index sequences + RNN model
model = Sequential()
model.add(LSTM(256, dropout=0.5, recurrent_dropout=0.1, input_shape=(200, 1)))
model.add(Dense(256, activation='relu'))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train_padded_seqs, y_train,
          batch_size=32,
          epochs=12,
          validation_data=(x_test_padded_seqs, y_test))
Training results:
Train on 3200 samples, validate on 800 samples
Epoch 1/12 3200/3200 [==============================] - 30s 9ms/step - loss: 1.4094 - acc: 0.2594 - val_loss: 1.4007 - val_acc: 0.2437
Epoch 2/12 3200/3200 [==============================] - 30s 9ms/step - loss: 1.3944 - acc: 0.2559 - val_loss: 1.3860 - val_acc: 0.2550
Epoch 3/12 3200/3200 [==============================] - 30s 9ms/step - loss: 1.3905 - acc: 0.2672 - val_loss: 1.3946 - val_acc: 0.2637
Epoch 4/12 3200/3200 [==============================] - 29s 9ms/step - loss: 1.3937 - acc: 0.2625 - val_loss: 1.4082 - val_acc: 0.2662
Epoch 5/12 3200/3200 [==============================] - 30s 9ms/step - loss: 1.3904 - acc: 0.2591 - val_loss: 1.3824 - val_acc: 0.2575
Epoch 6/12 3200/3200 [==============================] - 27s 9ms/step - loss: 1.3907 - acc: 0.2497 - val_loss: 1.3866 - val_acc: 0.2687
Epoch 7/12 3200/3200 [==============================] - 27s 8ms/step - loss: 1.3875 - acc: 0.2753 - val_loss: 1.3793 - val_acc: 0.2687
Epoch 8/12 3200/3200 [==============================] - 27s 8ms/step - loss: 1.3876 - acc: 0.2581 - val_loss: 1.3811 - val_acc: 0.2662
Epoch 9/12 3200/3200 [==============================] - 28s 9ms/step - loss: 1.3872 - acc: 0.2638 - val_loss: 1.3919 - val_acc: 0.2700
Epoch 10/12 3200/3200 [==============================] - 28s 9ms/step - loss: 1.3871 - acc: 0.2697 - val_loss: 1.3900 - val_acc: 0.2475
Epoch 11/12 3200/3200 [==============================] - 28s 9ms/step - loss: 1.3946 - acc: 0.2603 - val_loss: 1.3881 - val_acc: 0.2725
Epoch 12/12 3200/3200 [==============================] - 29s 9ms/step - loss: 1.3888 - acc: 0.2544 - val_loss: 1.3786 - val_acc: 0.2650
As you can see, with the one-hot + RNN model both training and validation accuracy are very low, essentially random guessing for four classes.
For each sample, segment the text with the popular jieba package and remove meaningless words such as 的 and 是 with a stopword list; then the Keras Tokenizer can be used to train word embeddings on the local dataset, with an embedding dimension of 300.
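A minimal sketch of that segmentation step (the stopword file name is a placeholder; data_list here is the same raw text column read from df_train above):

# Minimal jieba segmentation sketch; 'stopwords.txt' is a hypothetical stopword list file.
import jieba

with open('stopwords.txt', encoding='utf-8') as f:
    stopwords_set = set(line.strip() for line in f)

def segment(text):
    # Cut the raw text with jieba and drop stopwords such as 的 and 是
    words = [w for w in jieba.cut(text) if w.strip() and w not in stopwords_set]
    return ' '.join(words)  # space-joined so the Keras Tokenizer can split on " "

data_list = df_train['data'].map(segment)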
MAX_SEQUENCE_LENGTH = 100  # keep 100 words per document
EMBEDDING_DIM = 300        # embedding dimension, 300

tokenizer = Tokenizer(num_words=15000)
tokenizer.fit_on_texts(data_list)  # fit on the whole dataset to build the vocabulary
sequences = tokenizer.texts_to_sequences(data_list)
word_index_dict = tokenizer.word_index

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # cap each document's length
labels = to_categorical(np.asarray(le.transform(class_list)))  # one-hot labels (le is the LabelEncoder fitted above)
Shuffle the documents and split the data:
# Shuffle the documents
VALIDATION_SPLIT = 0.2
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

# Split the data
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
Define the embedding layer. With trainable=True, Keras learns the embedding vectors from the local dataset itself and applies them to the training and validation data:
from keras.layers import Embedding

# trainable=True means the embedding vectors are updated as model parameters
num_words = len(word_index_dict) + 1  # +1 because Tokenizer ids start at 1; index 0 is reserved for padding
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
Build the RNN; here an LSTM layer is used:
from keras.models import Model
from keras.layers import Input, Dense, LSTM

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = LSTM(256, dropout=0.5)(embedded_sequences)
x = Dense(256, activation='relu')(x)
preds = Dense(num_labels, activation='softmax')(x)

model = Model(sequence_input, preds)
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=12, batch_size=128)
Model training results:
Train on 3200 samples, validate on 800 samples
Epoch 1/12 3200/3200 [==============================] - 15s 5ms/step - loss: 1.3157 - acc: 0.4903 - val_loss: 0.9386 - val_acc: 0.6562
Epoch 2/12 3200/3200 [==============================] - 15s 5ms/step - loss: 0.7593 - acc: 0.7184 - val_loss: 0.7342 - val_acc: 0.7163
Epoch 3/12 3200/3200 [==============================] - 16s 5ms/step - loss: 0.6163 - acc: 0.7737 - val_loss: 0.6189 - val_acc: 0.7688
Epoch 4/12 3200/3200 [==============================] - 14s 5ms/step - loss: 0.4820 - acc: 0.8147 - val_loss: 0.6228 - val_acc: 0.7512
Epoch 5/12 3200/3200 [==============================] - 15s 5ms/step - loss: 0.4272 - acc: 0.8491 - val_loss: 0.6892 - val_acc: 0.7550
Epoch 6/12 3200/3200 [==============================] - 14s 4ms/step - loss: 0.3695 - acc: 0.8606 - val_loss: 0.6637 - val_acc: 0.7475
Epoch 7/12 3200/3200 [==============================] - 14s 5ms/step - loss: 0.3239 - acc: 0.8791 - val_loss: 0.7030 - val_acc: 0.7500
Epoch 8/12 3200/3200 [==============================] - 13s 4ms/step - loss: 0.2788 - acc: 0.9056 - val_loss: 0.7326 - val_acc: 0.7538
Epoch 9/12 3200/3200 [==============================] - 14s 4ms/step - loss: 0.2457 - acc: 0.9138 - val_loss: 0.7843 - val_acc: 0.7362
Epoch 10/12 3200/3200 [==============================] - 13s 4ms/step - loss: 0.2107 - acc: 0.9241 - val_loss: 0.8854 - val_acc: 0.7350
Epoch 11/12 3200/3200 [==============================] - 13s 4ms/step - loss: 0.1975 - acc: 0.9331 - val_loss: 0.8781 - val_acc: 0.7350
Epoch 12/12 3200/3200 [==============================] - 13s 4ms/step - loss: 0.1770 - acc: 0.9384 - val_loss: 0.8481 - val_acc: 0.7538
With Embedding + RNN the validation accuracy improves markedly, peaking at 0.7688, while training accuracy exceeds 90%, a clear sign of overfitting. With some tuning the validation accuracy could probably be pushed higher.
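One simple tuning direction, shown only as a hedged sketch (the callback settings are illustrative, not the configuration behind the numbers above), is to stop training when validation loss stops improving and keep the best weights:

# Illustrative only: early stopping to curb the overfitting seen above.
# restore_best_weights requires Keras >= 2.2.3.
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=30, batch_size=128, callbacks=[early_stop])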
Next, upload the full Tencent AI Lab Chinese word vector file to MySQL and query MySQL for the vector of each segmented word; these vectors are 200-dimensional.
import pymysql

# Connect to MySQL
db = pymysql.connect(host="***", user="***", password="***", db="***", port=***)
cursor = db.cursor()

# Look up one word's vector (the vec column is assumed to store the 200 floats as a space-separated string)
def querywordvec(word):
    vec_int = []
    query_sql = "select * from XXX where word = %s"
    try:
        cursor.execute(query_sql, (word,))
        results = cursor.fetchall()
        vec = results[0][2].split()
        for v in vec:
            vec_int.append(float(v))
    except Exception:
        print("Error: unable to fetch data")
    return vec_int
For example, querying 开发 gives:
print(querywordvec("开发"))
[0.205318, 0.02924, 0.025059, -0.031507, -0.035252, 0.147428, 0.064118, 0.402488, 0.424321, 0.437024, 0.012467, -0.098729, -0.158572, -0.088177, -0.043449, 0.089409, -0.099055, -0.283804, 0.112545, 0.025541, -0.01726, -0.150909, -0.083299, 0.037459, 0.29605, 0.01388, -0.287553, 0.117286, 0.13666, 0.493275, 0.302443, 0.082535, -0.009056, 0.24045, -0.007371, 0.119541, 0.432921, 0.025741, -0.29922, 0.21198, 0.021523, 0.220857, 0.44779, 0.291499, -0.184952, -0.006434, -0.115189, -0.266904, 0.003495, -0.159119, -0.384113, -0.387713, 0.170096, 0.198, 0.07035, 0.177311, -0.019644, -0.188508, 0.031889, -0.392723, 0.227364, 0.616728, -0.059071, -0.364697, -0.077505, -0.260351, -0.268732, 0.238778, 0.427052, 0.321993, -0.037369, -0.159352, 0.400518, 0.229699, -0.3446, 0.046306, -0.066257, 0.377816, -0.055773, -0.325963, -0.102563, 0.205084, 0.118749, 0.403796, 0.085079, -0.134903, -0.035444, 0.126386, -0.142862, -0.293126, 0.142639, 0.108202, -0.022327, 0.011597, 0.426736, -0.090832, 0.168828, -0.330628, -0.333454, 0.214868, -0.452769, -0.024319, 0.072544, 0.127925, -0.389489, -0.088286, -0.190296, -0.085807, 0.077976, -0.020705, -0.219265, -0.360965, 0.207212, -0.285199, -0.211401, -0.120366, 0.086652, 0.090502, -0.144671, -0.392925, 0.285612, -0.427401, -0.148718, -0.061124, 0.139129, 0.199844, -0.39937, -0.123523, -0.21283, -0.129926, 0.249201, -0.196021, 0.166995, -0.452386, -0.371797, -0.234237, -0.063534, 0.056776, -0.159089, 0.260067, -0.412732, -0.195597, -0.431985, -0.471183, -0.170963, 0.180416, 0.197121, 0.296519, 0.081022, -0.383157, -0.227555, -0.285242, 0.040487, -0.224609, -0.121581, 0.186237, 0.010203, 0.502054, -0.2188, -0.088945, 0.219765, -0.045787, 0.119763, 0.30921, 0.231384, -0.163442, -0.1442, 0.06971, -0.325053, -0.247143, 0.112627, 0.034369, -0.096266, 0.194837, -0.301401, -0.099836, -0.075422, 0.367559, -0.319538, 0.470193, -0.165735, -0.350219, 0.295977, 0.009617, 0.201713, 0.33146, 0.03736, 0.224218, -0.09293, 0.10523, -0.018303, 0.191042, 0.260462, 0.025095, 0.122858, 0.635381, 0.26528, 0.309128, -0.30828, -0.015132]
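To sanity-check the claim that vector "distance" reflects semantic similarity, here is a small hedged sketch using the same lookup; the comparison words are just illustrative and both are assumed to exist in the table:

# Illustrative cosine-similarity check between two queried Tencent vectors.
import numpy as np

def cosine_sim(w1, w2):
    v1, v2 = np.array(querywordvec(w1)), np.array(querywordvec(w2))
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_sim("开发", "研发"))  # related words should score higher
print(cosine_sim("开发", "发票"))  # than unrelated ones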
Build the embedding matrix. Counting the words that could not be found gives k = 1002 (e.g. 租代建, 按平销, 65%, 税盘), so the full dataset yields 13281 - 1002 = 12279 Chinese word vectors.
k = 1  # counter for words without a Tencent vector
EMBEDDING_DIM = 200  # the Tencent vectors are 200-dimensional
num_words = len(word_index_dict) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index_dict.items():
    vec = querywordvec(word)
    if len(vec) != 0:
        embedding_matrix[i] = vec  # row index matches the Tokenizer word id; row 0 stays zero for padding
    else:
        k += 1
db.close()
Set up the embedding layer:
# trainable=False means the pre-trained word vectors are not updated during training
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
As before, reuse the LSTM model from the Embedding + RNN part above, this time with epochs=20. Results:
Train on 3200 samples, validate on 800 samples
Epoch 1/20 3200/3200 [==============================] - 23s 7ms/step - loss: 1.3878 - acc: 0.3259 - val_loss: 1.3240 - val_acc: 0.3287
Epoch 2/20 3200/3200 [==============================] - 21s 7ms/step - loss: 1.2471 - acc: 0.4431 - val_loss: 1.2147 - val_acc: 0.4225
Epoch 3/20 3200/3200 [==============================] - 21s 6ms/step - loss: 1.1815 - acc: 0.4797 - val_loss: 1.1546 - val_acc: 0.4888
Epoch 4/20 3200/3200 [==============================] - 21s 7ms/step - loss: 1.1166 - acc: 0.5075 - val_loss: 1.0379 - val_acc: 0.5587
Epoch 5/20 3200/3200 [==============================] - 22s 7ms/step - loss: 1.0926 - acc: 0.5291 - val_loss: 0.9903 - val_acc: 0.5962
Epoch 6/20 3200/3200 [==============================] - 21s 6ms/step - loss: 1.0647 - acc: 0.5384 - val_loss: 0.9586 - val_acc: 0.5850
Epoch 7/20 3200/3200 [==============================] - 12s 4ms/step - loss: 1.0155 - acc: 0.5725 - val_loss: 1.1208 - val_acc: 0.4913
Epoch 8/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.9712 - acc: 0.5816 - val_loss: 0.8816 - val_acc: 0.6275
Epoch 9/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.9227 - acc: 0.6134 - val_loss: 0.8344 - val_acc: 0.6338
Epoch 10/20 3200/3200 [==============================] - 11s 4ms/step - loss: 0.9085 - acc: 0.6163 - val_loss: 0.8979 - val_acc: 0.6312
Epoch 11/20 3200/3200 [==============================] - 12s 4ms/step - loss: 0.8666 - acc: 0.6331 - val_loss: 0.9032 - val_acc: 0.6138
Epoch 12/20 3200/3200 [==============================] - 12s 4ms/step - loss: 0.8023 - acc: 0.6597 - val_loss: 0.7601 - val_acc: 0.6800
Epoch 13/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.7834 - acc: 0.6778 - val_loss: 0.8206 - val_acc: 0.6488
Epoch 14/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.7422 - acc: 0.6966 - val_loss: 0.8015 - val_acc: 0.6637
Epoch 15/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.7179 - acc: 0.7041 - val_loss: 0.7536 - val_acc: 0.6637
Epoch 16/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.6965 - acc: 0.7084 - val_loss: 0.7869 - val_acc: 0.6937
Epoch 17/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.6855 - acc: 0.7216 - val_loss: 0.7411 - val_acc: 0.6800
Epoch 18/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.6342 - acc: 0.7341 - val_loss: 0.7292 - val_acc: 0.7125
Epoch 19/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.6188 - acc: 0.7550 - val_loss: 0.7849 - val_acc: 0.6987
Epoch 20/20 3200/3200 [==============================] - 11s 3ms/step - loss: 0.5950 - acc: 0.7641 - val_loss: 0.7121 - val_acc: 0.7188
With the Tencent Chinese word vectors, training and validation accuracy stay roughly in step, and the best validation accuracy is 0.7188.
To use the Baidu open platform's basic NLP resources (docs: https://cloud.baidu.com/doc/NLP/index.html), register a Baidu Cloud account and obtain the APP_ID, API_KEY and SECRET_KEY configuration values.
# Fetch Chinese word vectors with the Python SDK
from aip import AipNlp

APP_ID = '***'
API_KEY = '***'
SECRET_KEY = '***'
client = AipNlp(APP_ID, API_KEY, SECRET_KEY)
Fetch the Chinese word vectors (1024-dimensional):
import re
import time

k = 1  # counter for words without a Baidu vector
EMBEDDING_DIM = 1024  # the Baidu word vectors are 1024-dimensional
num_words = len(word_index_dict) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
pattern = re.compile('[0-9]+')  # used to skip purely numeric tokens
for word, i in word_index_dict.items():
    match = pattern.findall(word)
    if len(word) <= 2 and not match:  # only short (one- or two-character) non-numeric words are queried
        wordinfo = client.wordEmbedding(word)
        time.sleep(0.6)  # stay under the 5 QPS trial limit
        try:
            embedding_vector = wordinfo['vec']
        except Exception:
            k += 1
        else:
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
The Baidu Chinese word vectors can only be queried for two-character words: 2,566 two-character words returned no vector, 4,658 words were longer than two characters, and 590 tokens were numeric, so the full dataset yields 13281 - 2566 - 590 - 4658 = 5467 Chinese word vectors.
Train the model by reusing the LSTM from the Embedding + RNN part above, again with epochs=20. Results:
Train on 3200 samples, validate on 800 samples
Epoch 1/20 3200/3200 [==============================] - 18s 5ms/step - loss: 1.1565 - acc: 0.5000 - val_loss: 0.8308 - val_acc: 0.6700
Epoch 2/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.8307 - acc: 0.6653 - val_loss: 0.7311 - val_acc: 0.7163
Epoch 3/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.7342 - acc: 0.7109 - val_loss: 0.8210 - val_acc: 0.6613
Epoch 4/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.6248 - acc: 0.7591 - val_loss: 0.7152 - val_acc: 0.7175
Epoch 5/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.5598 - acc: 0.7812 - val_loss: 0.6750 - val_acc: 0.7350
Epoch 6/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.4954 - acc: 0.8069 - val_loss: 0.6367 - val_acc: 0.7625
Epoch 7/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.4472 - acc: 0.8341 - val_loss: 0.6642 - val_acc: 0.7625
Epoch 8/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.4024 - acc: 0.8453 - val_loss: 0.6992 - val_acc: 0.7488
Epoch 9/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.3542 - acc: 0.8656 - val_loss: 0.7054 - val_acc: 0.7650
Epoch 10/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.3318 - acc: 0.8694 - val_loss: 0.8433 - val_acc: 0.7300
Epoch 11/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.3042 - acc: 0.8809 - val_loss: 0.8931 - val_acc: 0.7113
Epoch 12/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.2804 - acc: 0.8919 - val_loss: 0.7937 - val_acc: 0.7525
Epoch 13/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.2454 - acc: 0.9106 - val_loss: 0.8601 - val_acc: 0.7475
Epoch 14/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.2255 - acc: 0.9131 - val_loss: 0.8541 - val_acc: 0.7450
Epoch 15/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.1984 - acc: 0.9266 - val_loss: 0.9004 - val_acc: 0.7550
Epoch 16/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.1934 - acc: 0.9300 - val_loss: 0.8683 - val_acc: 0.7612
Epoch 17/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.1783 - acc: 0.9291 - val_loss: 0.9406 - val_acc: 0.7525
Epoch 18/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.1735 - acc: 0.9353 - val_loss: 0.9408 - val_acc: 0.7600
Epoch 19/20 3200/3200 [==============================] - 15s 5ms/step - loss: 0.1453 - acc: 0.9519 - val_loss: 1.0023 - val_acc: 0.7612
Epoch 20/20 3200/3200 [==============================] - 16s 5ms/step - loss: 0.1463 - acc: 0.9425 - val_loss: 1.0696 - val_acc: 0.7575
Training accuracy is very high at 0.9519 while the best validation accuracy is 0.7650, so overfitting shows up here as well.
Besides deep learning, we can also try classic models such as a naive Bayes classifier, a random forest, and a support vector machine, and see what accuracy they reach.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Chinese text preprocessing
def TextProcessing(data_list, class_list):
    # Split into training and validation sets
    train_data_list, test_data_list, train_class_list, test_class_list = train_test_split(data_list, class_list, test_size=0.25)

    # Count word frequencies in the training set (each sample is a list of jieba tokens)
    all_words_dict = {}
    for word_list in train_data_list:
        for word in word_list:
            if word in all_words_dict.keys():
                all_words_dict[word] += 1
            else:
                all_words_dict[word] = 1

    # Sort words by frequency, descending
    all_words_tuple_list = sorted(all_words_dict.items(), key=lambda f: f[1], reverse=True)
    all_words_list, all_words_nums = zip(*all_words_tuple_list)  # unzip
    all_words_list = list(all_words_list)  # convert to list

    return all_words_list, train_data_list, test_data_list, train_class_list, test_class_list

# Texts must already be jieba-segmented; keep the top N words as feature words
def Words_dict(all_words_list, deleteN, stopwords_set=set()):
    feature_words = []  # feature list
    n = 1
    for t in range(deleteN, len(all_words_list), 1):
        if n > 1000:  # cap feature_words at 1000 dimensions
            break
        # A word qualifies as a feature if it is not a number, not a stopword, and 2-4 characters long
        if not all_words_list[t].isdigit() and all_words_list[t] not in stopwords_set and 1 < len(all_words_list[t]) < 5:
            feature_words.append(all_words_list[t])
            n += 1
    return feature_words

# Build one-hot feature vectors
def TextFeatures(train_data_list, test_data_list, feature_words):
    def text_features(text, feature_words):  # 1 if the feature word appears in the text, else 0
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]
        return features

    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list  # return the feature matrices
Define the classifiers:
# Multinomial naive Bayes classifier
def TextClassifier_MultinomialNB(train_feature_list, test_feature_list, train_class_list, test_class_list):
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    return test_accuracy

# Support vector machine classifier
def TextClassifier_SVC(train_feature_list, test_feature_list, train_class_list, test_class_list):
    # Only the parameters relevant to an RBF kernel are kept
    clf = SVC(C=150, kernel='rbf', gamma='auto', shrinking=False, tol=0.001, cache_size=300)
    clf.fit(train_feature_list, train_class_list)
    test_accuracy = clf.score(test_feature_list, test_class_list)
    return test_accuracy

# Random forest classifier
def TextClassifier_RF(train_feature_list, test_feature_list, train_class_list, test_class_list):
    forest = RandomForestClassifier(n_estimators=500, random_state=5, warm_start=False,
                                    min_impurity_decrease=0.0,
                                    min_samples_split=15)
    forest.fit(train_feature_list, train_class_list)
    test_accuracy = forest.score(test_feature_list, test_class_list)
    return test_accuracy
Now compare the classifiers:
if __name__ == '__main__':
    # Text preprocessing: each sample is passed as its list of jieba tokens
    word_lists = [text.split() for text in data_list]
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(word_lists, class_list)
    feature_words = Words_dict(all_words_list, 0, stopwords_set)
    train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)

    # Compare the accuracy of the classifiers
    test_accuracy_MultinomialNB = TextClassifier_MultinomialNB(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("MultinomialNB Accuracy:", test_accuracy_MultinomialNB)

    test_accuracy_SVC = TextClassifier_SVC(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("SVC Accuracy:", test_accuracy_SVC)

    test_accuracy_RF = TextClassifier_RF(train_feature_list, test_feature_list, train_class_list, test_class_list)
    print("RF Accuracy:", test_accuracy_RF)
Results:
MultinomialNB Accuracy: 0.71
SVC Accuracy: 0.7542857142857143
RF Accuracy: 0.7557142857142857
So on a small dataset, classic classifiers can deliver results that are not inferior to deep learning.
From these experiments we can see that classic classifiers work well on small datasets. Among the deep learning runs with word vectors, the models using pre-trained Chinese word vectors are clearly more accurate than the one-hot model. Curiously, the Baidu vectors covered less than half as many words as the Tencent vectors yet trained somewhat better; this deserves further experiments.
Finally, I have learned that the Chinese Word Vectors project has also been released (https://github.com/Embedding/Chinese-Word-Vectors); interested readers can run further experiments with it.
Happy learning, everyone~
Feel free to follow the WeChat official account: a_white_deer
Technical questions are welcome at: xiezj2010@126.com