textCNN was first proposed in 2014 by Yoon Kim of New York University (the paper's sole author) in Convolutional Neural Networks for Sentence Classification. In that paper the author concisely lays out the principle and network structure of using a convolutional neural network for text classification, and demonstrates the model's generalization ability on seven datasets. The figure below compares textCNN with other models on the seven datasets MR, SST-1, SST-2, Subj, TREC, CR and MPQA; the textCNN variants achieve the best results on four of them.
Building on a brief description of the model in the paper, this article explains in detail how to build the textCNN model with TensorFlow. The dataset consists of short review texts, and the task is binary sentiment classification (positive vs. negative).
Figure 1 above is taken from the original paper.
Figure 2 above is taken from an online illustration.
The essence of textCNN is to represent every input sentence as a matrix of fixed height and width by stacking equal-length word vectors. One-dimensional convolutions are then applied along the column (sequence) direction of this matrix, the output of each convolution is max-pooled, and the pooled results of all convolution kernels are concatenated. The concatenated vector is fed into a fully connected layer with dropout, and finally into a softmax classifier, as sketched below.
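For orientation, here is a minimal sketch of this structure written with the Keras functional API. It is only an illustrative summary, not the code used later in this article, and vocab_size is a placeholder value; the detailed TF 1.x implementation appears in the sections below.

import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim, seq_len, num_filters = 3000, 60, 100, 2   # placeholder hyperparameters

inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(inputs)              # (batch, 100, 60)
branches = []
for k in (2, 3, 4):                                              # three kernel widths
    c = layers.Conv1D(num_filters, k)(x)                         # (batch, 100 - k + 1, 2)
    p = layers.GlobalMaxPooling1D()(c)                           # (batch, 2)
    branches.append(p)
merged = layers.Concatenate()(branches)                          # (batch, 6)
h = layers.Dense(120)(merged)                                    # fully connected layer
h = layers.Activation("relu")(layers.Dropout(0.25)(h))           # dropout then activation, as in the article
outputs = layers.Dense(2, activation="softmax")(h)               # 2-class softmax output
model = Model(inputs, outputs)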
The whole process follows the same idea a convolutional neural network uses to extract image features; the key point is likewise extracting features from text. When turning the input words into a fixed-size matrix there are (at least) three options: (1) build features character by character; (2) extract word features with TF-IDF; (3) pre-train word vectors with word2vec. The three methods differ considerably, especially for Chinese text; this article uses the first one, illustrated briefly below.
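As a tiny illustration of method 1 (the sample sentence below is made up), Chinese text can be tokenized by simply treating every character as a token, which avoids word segmentation entirely:

sentence = "普通攻击往后拉"                         # hypothetical review fragment
chars = list(sentence)                              # ['普', '通', '攻', '击', '往', '后', '拉']
vocab = {ch: i for i, ch in enumerate(sorted(set(chars)))}
ids = [vocab[ch] for ch in chars]                   # every character mapped to an integer id
print(ids)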
The dataset used here consists of short review texts crawled from a website. Because the amount of data is small, no separate validation set is kept: the labeled data is simply split into a training set and a test set at a ratio of 4:1.
Baidu Netdisk link to the archive of the original data files: https://pan.baidu.com/s/13vwd3lfKWfXlD1a8uB6ngg
Extraction code: urj2
Note: place the extracted text_data folder in the same directory as the program.
The original data is stored as txt files with the labels pos and neg, so preprocessing mainly separates the labels from the text, saves them to corresponding files, and splits the data into a training set and a test set.
import os

# Collect the paths of all files under the text_data folder
temp_list = list(os.walk(r"text_data"))
original = temp_list[0][0]
file_name = temp_list[0][2]
path_list = [original + "\\" + eve_name for eve_name in file_name]

# Create the output files
train_data = open(r"train_data.txt", "w", encoding="utf-8")
train_label = open(r"train_label.txt", "w", encoding="utf-8")
test_data = open(r"test_data.txt", "w", encoding="utf-8")
test_label = open(r"test_label.txt", "w", encoding="utf-8")

# Separate labels from text and split into training / test sets
for every_path in path_list:
    with open(every_path, "r", encoding="utf-8") as temp_file:
        corpus = [eve for eve in temp_file if len(eve.strip("\n")) != 0]
    limit1 = len(corpus) * 0.9
    limit2 = len(corpus) * 0.1
    for i in range(len(corpus)):
        if limit2 < i < limit1:              # middle 80% -> training set (4:1 split overall)
            if corpus[i][:3] == "pos":
                train_data.write(corpus[i][3:])
                train_label.write("1" + "\n")
            else:
                train_data.write(corpus[i][3:])
                train_label.write("0" + "\n")
        else:                                # first and last 10% -> test set
            if corpus[i][:3] == "pos":
                test_data.write(corpus[i][3:])
                test_label.write("1" + "\n")
            else:
                test_data.write(corpus[i][3:])
                test_label.write("0" + "\n")

# Close the files so the buffered content is flushed before reading them back
train_data.close()
train_label.close()
test_data.close()
test_label.close()

# Build the character vocabulary: every character that appears in the data, written to vocabulary.txt for later use
with open(r"test_data.txt", "r", encoding="utf-8") as file1:
    corpus1 = [eve for eve in file1]
with open(r"train_data.txt", "r", encoding="utf-8") as file2:
    corpus2 = [eve for eve in file2]
with open(r"vocabulary.txt", "w", encoding="utf-8") as file3:
    word_list = []
    corpus = corpus1 + corpus2
    for line in corpus:
        for word in line:
            word_list.append(word)
    word_list = list(set(word_list))
    for word in word_list:
        file3.write(word + "\n")
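As a quick sanity check (not part of the original script), you can count the lines written to each output file and confirm that the training/test split is roughly 4:1:

# Optional: count the lines in each generated file
for name in ["train_data.txt", "train_label.txt", "test_data.txt", "test_label.txt"]:
    with open(name, "r", encoding="utf-8") as f:
        print(name, sum(1 for _ in f))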
The network is built following Figure 2 above: convolution kernels of widths 2, 3 and 4 are applied in parallel, with only 2 filters per kernel width. Since the reviews are mostly short texts, the length of every text matrix is fixed at 100. A quick shape check follows, and then the full program.
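With sequence length 100 and kernel widths 2, 3 and 4 (valid padding, the tf.layers.conv1d default), the convolution outputs have lengths 100 - k + 1 = 99, 98 and 97. Max-pooling over the time axis reduces each branch to a vector of length num_filters = 2, so the concatenated feature vector fed to the fully connected layer has 3 × 2 = 6 dimensions.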
# Read the preprocessed training files
with open(r"train_data.txt", "r", encoding="utf-8") as file1:
    corpus = [eve.strip("\n") for eve in file1]
with open(r"vocabulary.txt", "r", encoding="utf-8") as file2:
    vocabulary = [word.strip("\n") for word in file2]
with open(r"train_label.txt", "r", encoding="utf-8") as file3:
    label_list = [int(eve.strip("\n")) for eve in file3]
assert len(label_list) == len(corpus)

# Map every character in the vocabulary to an id, in order of appearance
word2id = {word: id_ for id_, word in enumerate(vocabulary)}
# line2id turns a sentence into a list of ids, e.g. "普通攻击往后拉" -> [23, 25, 45, 98, 12, 98, 36]
line2id = lambda line: [word2id[word] for word in line]
# Convert every sentence to its id representation
train_list = [line2id(line) for line in corpus]

import tensorflow.contrib.keras as kr
# Pad / truncate every sentence to a fixed length of 100 -> 2-D id matrix train_x with equal-length rows
train_x = kr.preprocessing.sequence.pad_sequences(train_list, 100)
# Convert the labels to one-hot encoding, e.g. [0, 1, 0] -> [[1, 0], [0, 1], [1, 0]]
train_y = kr.utils.to_categorical(label_list, num_classes=2)

import tensorflow as tf
tf.compat.v1.reset_default_graph()
# Placeholders for the input ids and the one-hot labels
X_holder = tf.compat.v1.placeholder(tf.int32, [None, 100])
Y_holder = tf.compat.v1.placeholder(tf.float32, [None, 2])

# Randomly initialised word-embedding matrix; the first dimension (2775) is chosen freely
# but must be at least as large as the actual vocabulary size
embedding = tf.compat.v1.get_variable('embedding', [2775, 60])
embedding_inputs = tf.nn.embedding_lookup(embedding, X_holder)

num_filters = 2        # number of filters per kernel width
hidden_dim = 120       # neurons in the fully connected layer
learning_rate = 1e-2   # learning rate

# Network structure: convolution - pooling - concatenation - dense - dropout - activation - dense - 2-class softmax output
conv1 = tf.layers.conv1d(embedding_inputs, num_filters, 2)        # (batch, 99, 2)
max_pooling1 = tf.reduce_max(conv1, reduction_indices=[1])        # max over the time axis -> (batch, 2)
conv2 = tf.layers.conv1d(embedding_inputs, num_filters, 3)        # (batch, 98, 2)
max_pooling2 = tf.reduce_max(conv2, reduction_indices=[1])        # (batch, 2)
conv3 = tf.layers.conv1d(embedding_inputs, num_filters, 4)        # (batch, 97, 2)
max_pooling3 = tf.reduce_max(conv3, reduction_indices=[1])        # (batch, 2)
gmp = tf.concat([max_pooling1, max_pooling2, max_pooling3], 1)    # concatenated features (batch, 6)

full_connect = tf.layers.dense(gmp, hidden_dim)                                  # fully connected layer (batch, 120)
full_connect_dropout = tf.contrib.layers.dropout(full_connect, keep_prob=0.75)   # dropout
full_connect_activate = tf.nn.relu(full_connect_dropout)                         # activation
softmax_before = tf.layers.dense(full_connect_activate, 2)   # final dense layer; 2 units because this is binary classification
predict_Y = tf.nn.softmax(softmax_before)                    # predicted class probabilities

# Softmax cross-entropy loss
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y_holder, logits=softmax_before)
loss = tf.reduce_mean(cross_entropy)                         # mean loss over the batch
optimizer = tf.train.AdamOptimizer(learning_rate)            # optimizer
train = optimizer.minimize(loss)                             # training op tying the optimizer to the loss

isCorrect = tf.equal(tf.argmax(Y_holder, 1), tf.argmax(predict_Y, 1))  # per-sample correctness
accuracy = tf.reduce_mean(tf.cast(isCorrect, tf.float32))              # mean accuracy

init = tf.global_variables_initializer()   # initialise all variables
session = tf.Session()
session.run(init)

# Read the test set files
with open(r"test_data.txt", "r", encoding="utf-8") as file4:
    corpus_ = [eve.strip("\n") for eve in file4]
with open(r"test_label.txt", "r", encoding="utf-8") as file5:
    label_list_ = [int(eve.strip("\n")) for eve in file5]
assert len(label_list_) == len(corpus_)
test_list = [line2id(line) for line in corpus_]
test_x = kr.preprocessing.sequence.pad_sequences(test_list, 100)   # pad to length 100 as well
test_y = kr.utils.to_categorical(label_list_, num_classes=2)

import random
for i in range(3000):
    # Randomly sample a training batch of 60
    selected_index = random.sample(list(range(len(train_y))), k=60)
    batch_X = train_x[selected_index]
    batch_Y = train_y[selected_index]
    session.run(train, {X_holder: batch_X, Y_holder: batch_Y})
    step = i + 1
    if step % 100 == 0:
        # Every 100 steps, evaluate on a random sample of 150 test examples
        selected_index = random.sample(list(range(len(test_y))), k=150)
        batch_X = test_x[selected_index]
        batch_Y = test_y[selected_index]
        loss_value, accuracy_value = session.run([loss, accuracy], {X_holder: batch_X, Y_holder: batch_Y})
        print('step:%d loss:%.4f accuracy:%.4f' % (step, loss_value, accuracy_value))
step:100 loss:0.3313 accuracy:0.8733
step:200 loss:0.2813 accuracy:0.8733
step:300 loss:0.2850 accuracy:0.9000
step:400 loss:0.1988 accuracy:0.9467
step:500 loss:0.1910 accuracy:0.9333
step:600 loss:0.2706 accuracy:0.8733
step:700 loss:0.2093 accuracy:0.9067
step:800 loss:0.2728 accuracy:0.9200
step:900 loss:0.3189 accuracy:0.9000
step:1000 loss:0.2950 accuracy:0.9000
step:1100 loss:0.2883 accuracy:0.9067
step:1200 loss:0.2701 accuracy:0.8800
step:1300 loss:0.1406 accuracy:0.9533
step:1400 loss:0.2119 accuracy:0.9267
step:1500 loss:0.2927 accuracy:0.9133
step:1600 loss:0.1648 accuracy:0.9333
step:1700 loss:0.1925 accuracy:0.9267
step:1800 loss:0.2637 accuracy:0.9133
step:1900 loss:0.2819 accuracy:0.9133
step:2000 loss:0.4662 accuracy:0.8933
step:2100 loss:0.1797 accuracy:0.9333
step:2200 loss:0.3282 accuracy:0.9000
step:2300 loss:0.1991 accuracy:0.9333
step:2400 loss:0.2945 accuracy:0.8933
step:2500 loss:0.5351 accuracy:0.8733
step:2600 loss:0.3430 accuracy:0.9200
step:2700 loss:0.3642 accuracy:0.8933
step:2800 loss:0.1883 accuracy:0.9200
step:2900 loss:0.4503 accuracy:0.8333
step:3000 loss:0.2943 accuracy:0.9200
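Once training finishes, the same graph can also be used for prediction on new text. Below is a minimal inference sketch, assuming the session, line2id, kr, X_holder and predict_Y objects defined above are still in memory; the sample review is made up, and every character in it must already exist in vocabulary.txt.

# Predict the sentiment of one new review (illustrative only)
new_review = "这个手机用起来很流畅"   # hypothetical input sentence
new_ids = kr.preprocessing.sequence.pad_sequences([line2id(new_review)], 100)
probs = session.run(predict_Y, {X_holder: new_ids})
print("positive" if probs[0].argmax() == 1 else "negative", probs[0])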
The overall results are mediocre, but compared with an LSTM on the same dataset, textCNN performs slightly better. The limited accuracy mainly comes from the data itself, in particular the widely varying and often very short text lengths and the small amount of training data.