During graduate school I used TensorFlow to implement a simple CNN for sentiment analysis (classification) — admittedly a fairly basic binary-classification setup, later extended to multi-class — but all of that earlier work was trained on English corpora. Recently, wanting to organize my project materials, I picked this small text-classification project back up and spent some time adapting the original English text-classification code to Chinese. The results are quite good: on the THUCNews Chinese dataset, the accuracy is around 93.9%.
THUCNews was generated by filtering the historical data of Sina News RSS feeds from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text. Based on the original Sina News category system, the documents were reorganized into 14 candidate categories: 财经 (finance), 彩票 (lottery), 房产 (real estate), 股票 (stocks), 家居 (home), 教育 (education), 科技 (technology), 社会 (society), 时尚 (fashion), 时政 (politics), 体育 (sports), 星座 (horoscope), 游戏 (games), and 娱乐 (entertainment). For more details, see [THUCTC: 一个高效的中文文本分类工具](http://thuctc.thunlp.org/).
Download links:
1. Official dataset download: http://thuctc.thunlp.org/message
2. Baidu Netdisk: https://pan.baidu.com/s/1DT5xY9m2yfu1YGaGxpWiBQ (extraction code: bbpe)
The network structure for CNN text classification is shown below:
A brief walk-through:
(1) We assume the input to the CNN is two-dimensional, where each row represents one sample (i.e., one word), such as "I" and "like" in the figure. Each sample (word) has d dimensions, which can be seen as the word-vector length, i.e., the dimension of each word, denoted `embedding_dim` in the code.
(2) The CNN convolves this two-dimensional data. In image CNNs the kernel size is typically 3×3, 5×5, and so on, but that does not work in NLP, because here each row of the input is a whole sample! Given a kernel of size [filter_height, filter_width], the height filter_height can be any value such as 1, 2, or 3, but the width filter_width must equal `embedding_dim` — only then does the kernel cover complete samples.
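To make the shapes concrete, here is a quick walk-through in plain Python (using the hyperparameters from the configuration below) of how one filter size flows through convolution and max-pooling:

```python
# Shape walk-through for the TextCNN below (sequence_length=300, embedding_dim=128,
# num_filters=200, as configured later in this post).
sequence_length, embedding_dim, num_filters = 300, 128, 200
filter_sizes = [3, 4, 5, 6]

for filter_height in filter_sizes:
    # conv2d input:  [batch, sequence_length, embedding_dim, 1]
    # kernel shape:  [filter_height, embedding_dim, 1, num_filters]
    # With VALID padding the width collapses to 1, so only the height slides:
    conv_height = sequence_length - filter_height + 1
    # max_pool with ksize=[1, conv_height, 1, 1] keeps one value per filter:
    print("filter %d -> conv [N, %d, 1, %d] -> pool [N, 1, 1, %d]"
          % (filter_height, conv_height, num_filters, num_filters))

# Concatenating the 4 pooled outputs yields a [N, 4*200] = [N, 800] feature vector.
```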
Below is the TextCNN text-classification network implemented in TensorFlow. First, the hyperparameters:
```python
max_sentence_length = 300   # maximum sentence length: shorter samples are zero-padded, longer ones truncated
embedding_dim = 128         # word-vector length, i.e. the dimension of each word
filter_sizes = [3, 4, 5, 6] # convolution kernel sizes (heights)
num_filters = 200           # number of filters per filter size
base_lr = 0.001             # learning rate
dropout_keep_prob = 0.5     # dropout keep probability
l2_reg_lambda = 0.0         # L2 regularization lambda (default: 0.0)
```
And the network definition:

```python
import tensorflow as tf
import numpy as np


class TextCNN(object):
    '''
    A CNN for text classification.
    Uses an embedding layer, followed by convolutional, max-pooling and softmax layers.
    '''
    def __init__(self, sequence_length, num_classes, embedding_size,
                 filter_sizes, num_filters, l2_reg_lambda=0.0):
        # Placeholders for input, output and dropout
        self.input_x = tf.placeholder(tf.float32, [None, sequence_length, embedding_size], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        # Keeping track of l2 regularization loss (optional)
        l2_loss = tf.constant(0.0)

        # Embedding layer
        # self.embedded_chars          = [None(batch_size), sequence_length, embedding_size]
        # self.embedded_chars_expended = [None(batch_size), sequence_length, embedding_size, 1(num_channels)]
        self.embedded_chars = self.input_x
        self.embedded_chars_expended = tf.expand_dims(self.embedded_chars, -1)

        # Create a convolution + maxpool layer for each filter size
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):  # filter_sizes = [3, 4, 5, 6]
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution layer
                # filter_shape = [height, width, in_channels, output_channels]
                filter_shape = [filter_size, embedding_size, 1, num_filters]  # num_filters = 200
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expended, W,
                                    strides=[1, 1, 1, 1], padding="VALID", name="conv")
                # Apply nonlinearity
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Maxpooling over the outputs
                pooled = tf.nn.max_pool(h,
                                        ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                        strides=[1, 1, 1, 1],
                                        padding="VALID",
                                        name="pool")
                pooled_outputs.append(pooled)

        # Combine all the pooled features
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        # Add dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Final (unnormalized) scores and predictions
        with tf.name_scope("output"):
            W = tf.get_variable("W",
                                shape=[num_filters_total, num_classes],
                                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Calculate mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
```
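Note that `input_x` here takes pre-embedded floats, not word indices. As a usage sketch (shapes only; it assumes the class above is saved as `text_cnn.py`, as in the training code later, and feeds a random batch under TF 1.x):

```python
import numpy as np
import tensorflow as tf
from text_cnn import TextCNN  # the class defined above

# 14 classes for THUCNews; other values match the configuration above
cnn = TextCNN(sequence_length=300, num_classes=14, embedding_size=128,
              filter_sizes=[3, 4, 5, 6], num_filters=200, l2_reg_lambda=0.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # A fake batch of 2 samples, already converted to 128-dim word vectors
    x = np.random.rand(2, 300, 128)
    scores = sess.run(cnn.scores, {cnn.input_x: x, cnn.dropout_keep_prob: 1.0})
    print(scores.shape)  # (2, 14): one unnormalized score per class
```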
This post uses the jieba tool for Chinese word segmentation; training on words gives better results than training on individual characters.
This part has already been explained in detail in 《[使用gensim训练中文语料word2vec](https://blog.csdn.net/guyuealian/article/details/84072158)》 — go read it there.
jieba needs to be installed separately: `pip install jieba` or `pip3 install jieba`.
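As a quick sanity check that jieba works (the sentence below is just an illustrative example):

```python
import jieba

# lcut segments a sentence and returns a plain Python list of words
words = jieba.lcut("我爱自然语言处理")
print(words)  # typically: ['我', '爱', '自然语言', '处理']
```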
Here jieba is used to segment the THUCNews dataset, and gensim is used to train a word2vec model on THUCNews. A pre-trained word2vec model is provided here: https://pan.baidu.com/s/1n4ZgiF0gbY0zsK0706wZiw (extraction code: mtrj)
With the word2vec model, the THUCNews data can be processed into word vectors: first use jieba to split each Chinese sentence into words, then map each word to its embedding index via the word2vec model; given the index, the embedding vector can be looked up. The index data is saved as npy files, so at training time the CNN network only needs to read these npy files and convert the indices back to embeddings.
Download link for the preprocessed THUCNews data: https://pan.baidu.com/s/12Hdf36QafQ3y6KgV_vLTsw (extraction code: m9dx)
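The word → index → embedding round trip is just a dictionary lookup plus a fancy-index into the model's vector table. A minimal sketch, assuming the gensim (3.x) word2vec model provided above and an illustrative word list:

```python
import numpy as np
from gensim.models import Word2Vec

# Path assumed: the 128-dim THUCNews model linked above
w2vModel = Word2Vec.load("THUCNews_word2Vec_128.model")

words = ['我', '爱', '自然语言', '处理']  # illustrative example
# word -> index; index 0 is used as a fallback for out-of-vocabulary words
indexMat = np.array([[w2vModel.wv.vocab[w].index if w in w2vModel.wv.vocab else 0
                      for w in words]], dtype='int32')
# index -> embedding: a fancy-index into the model's vector table
vectors = w2vModel.wv.vectors[indexMat]
print(vectors.shape)  # (1, 4, 128): 1 sample, 4 words, 128-dim vectors
```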
The code below implements this: it uses jieba to split Chinese sentences into words, converts the words into embedding-index matrices via the word2vec model, and saves these index matrices as *.npy files. In the source code, batchSize=20000 means that 20,000 Chinese TXT files are segmented, converted into one index matrix, and saved as a single *.npy file — essentially packing 20,000 TXT files into one *.npy file, mainly to compress the data and avoid overly large individual files.
```python
# -*-coding: utf-8 -*-
"""
@Project: nlp-learning-tutorials
@File   : create_word2vec.py
@Author : panjq
@E-mail : pan_jinquan@163.com
@Date   : 2018-11-08 17:37:21
"""
from gensim.models import Word2Vec
import random
import numpy as np
import os
import math
from utils import files_processing, segment


def info_npy(file_list):
    sizes = 0
    for file in file_list:
        data = np.load(file)
        print("data.shape:{}".format(data.shape))
        size = data.shape[0]
        sizes += size
    print("files nums:{}, data nums:{}".format(len(file_list), sizes))
    return sizes


def save_multi_file(files_list, labels_list, word2vec_path, out_dir, prefix,
                    batchSize, max_sentence_length, labels_set=None, shuffle=False):
    '''
    Map file contents to index matrices and save the data as multiple files.
    :param files_list:
    :param labels_list:
    :param word2vec_path: location of the word2vec model
    :param out_dir: directory where the files are saved
    :param prefix: prefix of the saved file names
    :param batchSize: number of source files merged into one saved file
    :param labels_set: the set of labels
    :return:
    '''
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    # delete all files already in that directory
    files_processing.delete_dir_file(out_dir)

    if shuffle:
        random.seed(100)
        random.shuffle(files_list)
        random.seed(100)
        random.shuffle(labels_list)

    sample_num = len(files_list)
    w2vModel = load_wordVectors(word2vec_path)
    if labels_set is None:
        labels_set = files_processing.get_labels_set(labels_list)

    labels_list, labels_set = files_processing.labels_encoding(labels_list, labels_set)
    labels_list = labels_list.tolist()

    batchNum = int(math.ceil(1.0 * sample_num / batchSize))
    for i in range(batchNum):
        start = i * batchSize
        end = min((i + 1) * batchSize, sample_num)
        batch_files = files_list[start:end]
        batch_labels = labels_list[start:end]
        # read the file contents and segment them into words
        batch_content = files_processing.read_files_list_to_segment(batch_files,
                                                                    max_sentence_length,
                                                                    padding_token='<PAD>',
                                                                    segment_type='word')
        # convert the words to an index matrix
        batch_indexMat = word2indexMat(w2vModel, batch_content, max_sentence_length)
        batch_labels = np.asarray(batch_labels)
        batch_labels = batch_labels.reshape([len(batch_labels), 1])

        # save the *.npy file
        filename = os.path.join(out_dir, prefix + '{0}.npy'.format(i))
        labels_indexMat = cat_labels_indexMat(batch_labels, batch_indexMat)
        np.save(filename, labels_indexMat)
        print('step:{}/{}, save:{}, data.shape{}'.format(i, batchNum, filename, labels_indexMat.shape))


def cat_labels_indexMat(labels, indexMat):
    indexMat_labels = np.concatenate([labels, indexMat], axis=1)
    return indexMat_labels


def split_labels_indexMat(indexMat_labels, label_index=0):
    labels = indexMat_labels[:, 0:label_index + 1]   # the first column holds the labels
    indexMat = indexMat_labels[:, label_index + 1:]  # the rest is the index matrix
    return labels, indexMat


def load_wordVectors(word2vec_path):
    w2vModel = Word2Vec.load(word2vec_path)
    return w2vModel


def word2vector_lookup(w2vModel, sentences):
    '''
    Convert words directly to word vectors.
    :param w2vModel: the word2vec model
    :param sentences: type->list[list[str]]
    :return: the word vectors of sentences, type->list[list[ndarray[list]]
    '''
    all_vectors = []
    embeddingDim = w2vModel.vector_size
    embeddingUnknown = [0 for i in range(embeddingDim)]
    for sentence in sentences:
        this_vector = []
        for word in sentence:
            if word in w2vModel.wv.vocab:
                v = w2vModel.wv[word]
                this_vector.append(v)
            else:
                this_vector.append(embeddingUnknown)
        all_vectors.append(this_vector)
    all_vectors = np.array(all_vectors)
    return all_vectors


def word2indexMat(w2vModel, sentences, max_sentence_length):
    '''
    Convert words to an index matrix.
    :param w2vModel:
    :param sentences:
    :param max_sentence_length:
    :return:
    '''
    nums_sample = len(sentences)
    indexMat = np.zeros((nums_sample, max_sentence_length), dtype='int32')
    rows = 0
    for sentence in sentences:
        indexCounter = 0
        for word in sentence:
            try:
                index = w2vModel.wv.vocab[word].index  # get the index of the word
                indexMat[rows][indexCounter] = index
            except:
                indexMat[rows][indexCounter] = 0  # index for unknown words
            indexCounter = indexCounter + 1
            if indexCounter >= max_sentence_length:
                break
        rows += 1
    return indexMat


def indexMat2word(w2vModel, indexMat, max_sentence_length=None):
    '''
    Convert an index matrix back to words.
    :param w2vModel:
    :param indexMat:
    :param max_sentence_length:
    :return:
    '''
    if max_sentence_length is None:
        row, col = indexMat.shape
        max_sentence_length = col
    sentences = []
    for Mat in indexMat:
        indexCounter = 0
        sentence = []
        for index in Mat:
            try:
                word = w2vModel.wv.index2word[index]  # get the word at that index
                sentence += [word]
            except:
                sentence += ['<PAD>']
            indexCounter = indexCounter + 1
            if indexCounter >= max_sentence_length:
                break
        sentences.append(sentence)
    return sentences


def save_indexMat(indexMat, path):
    np.save(path, indexMat)


def load_indexMat(path):
    indexMat = np.load(path)
    return indexMat


def indexMat2vector_lookup(w2vModel, indexMat):
    '''
    Convert an index matrix to word vectors.
    :param w2vModel:
    :param indexMat:
    :return: word vectors
    '''
    all_vectors = w2vModel.wv.vectors[indexMat]
    return all_vectors


def pos_neg_test():
    positive_data_file = "./data/ham_5000.utf8"
    negative_data_file = './data/spam_5000.utf8'
    word2vec_path = 'out/trained_word2vec.model'
    sentences, labels = files_processing.load_pos_neg_files(positive_data_file, negative_data_file)
    sentences, max_document_length = segment.padding_sentences(sentences, '<PADDING>',
                                                               padding_sentence_length=190)
    # train_wordVectors(sentences, embedding_size=128, word2vec_path=word2vec_path)  # train word2vec and save it to word2vec_path
    w2vModel = load_wordVectors(word2vec_path)  # load the trained word2vec model
    '''
    There are two ways to convert words to vectors:
    [1] direct conversion: map words straight to vectors: word2vector_lookup
    [2] indirect conversion: first map words to an index matrix, then map the indices to vectors:
        word2indexMat -> indexMat2vector_lookup
    '''
    # [1] map words directly to word vectors
    x1 = word2vector_lookup(w2vModel, sentences)

    # [2] first convert words to an index matrix, then map the indices to vectors
    indexMat_path = 'out/indexMat.npy'
    indexMat = word2indexMat(w2vModel, sentences, max_sentence_length=190)  # words -> index matrix
    save_indexMat(indexMat, indexMat_path)
    x2 = indexMat2vector_lookup(w2vModel, indexMat)  # index matrix -> word vectors
    print("x.shape = {}".format(x2.shape))  # shape=(10000, 190, 128) -> (10000 samples, 190 words each, 128-dim vectors)


if __name__ == '__main__':
    # THUCNews_path = '/home/ubuntu/project/tfTest/THUCNews/test'
    # THUCNews_path = '/home/ubuntu/project/tfTest/THUCNews/spam'
    THUCNews_path = '/home/ubuntu/project/tfTest/THUCNews/THUCNews'

    # read the full file list
    files_list, label_list = files_processing.gen_files_labels(THUCNews_path)
    max_sentence_length = 300
    word2vec_path = "../../word2vec/models/THUCNews_word2Vec/THUCNews_word2Vec_128.model"

    # get the label set and save it locally
    # labels_set = ['星座', '财经', '教育']
    # labels_set = files_processing.get_labels_set(label_list)
    labels_file = '../data/THUCNews_labels.txt'
    # files_processing.write_txt(labels_file, labels_set)

    # split the data into train/val sets
    train_files, train_label, val_files, val_label = files_processing.split_train_val_list(files_list,
                                                                                           label_list,
                                                                                           facror=0.9,
                                                                                           shuffle=True)
    train_out_dir = '../data/train_data'
    prefix = 'train_data'
    batchSize = 20000
    labels_set = files_processing.read_txt(labels_file)
    save_multi_file(files_list=train_files,
                    labels_list=train_label,
                    word2vec_path=word2vec_path,
                    out_dir=train_out_dir,
                    prefix=prefix,
                    batchSize=batchSize,
                    max_sentence_length=max_sentence_length,
                    labels_set=labels_set,
                    shuffle=True)
    print("*******************************************************")

    val_out_dir = '../data/val_data'
    prefix = 'val_data'
    save_multi_file(files_list=val_files,
                    labels_list=val_label,
                    word2vec_path=word2vec_path,
                    out_dir=val_out_dir,
                    prefix=prefix,
                    batchSize=batchSize,
                    max_sentence_length=max_sentence_length,
                    labels_set=labels_set,
                    shuffle=True)
```
The training code follows. Note that GitHub does not allow large files to be uploaded, so you need to download the files provided above and put them in the corresponding directories before training.
Training needs to read the training data, i.e. the *.npy files. Since these files store index data, the indices must be converted into the CNN's embedding inputs. This is done by the function `indexMat2vector_lookup`: `train_batch_data = create_word2vec.indexMat2vector_lookup(w2vModel, train_batch_data)`.
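Concretely, each *.npy file stores the encoded label in column 0 and the index matrix in the remaining columns (see `cat_labels_indexMat` / `split_labels_indexMat` above), so rebuilding a batch is a load, a split and a lookup. A minimal sketch, with the file name assumed from the preprocessing step:

```python
import numpy as np
from gensim.models import Word2Vec

w2vModel = Word2Vec.load("THUCNews_word2Vec_128.model")          # path assumed
indexMat_labels = np.load("../data/train_data/train_data0.npy")  # file name assumed

labels = indexMat_labels[:, 0:1]   # column 0: encoded labels
indexMat = indexMat_labels[:, 1:]  # remaining columns: word indices
# This is exactly what indexMat2vector_lookup does internally:
batch_data = w2vModel.wv.vectors[indexMat]
print(batch_data.shape)  # e.g. (20000, 300, 128) for batchSize=20000, max_sentence_length=300
```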
```python
#! /usr/bin/env python
# encoding: utf-8

import tensorflow as tf
import numpy as np
import os
from text_cnn import TextCNN
from utils import create_batch_data, create_word2vec, files_processing


def train(train_dir, val_dir, labels_file, word2vec_path, batch_size, max_steps,
          log_step, val_step, snapshot, out_dir):
    '''
    Train the model.
    :param train_dir: training-data directory
    :param val_dir: val-data directory
    :param labels_file: labels file path
    :param word2vec_path: word-vector model file
    :param batch_size: batch size
    :param max_steps: maximum number of iterations
    :param log_step: logging interval
    :param val_step: evaluation interval
    :param snapshot: model-saving interval
    :param out_dir: output directory for model ckpt and summaries
    :return:
    '''
    max_sentence_length = 300
    embedding_dim = 128
    filter_sizes = [3, 4, 5, 6]
    num_filters = 200    # number of filters per filter size
    base_lr = 0.001      # learning rate
    dropout_keep_prob = 0.5
    l2_reg_lambda = 0.0  # L2 regularization lambda (default: 0.0)

    allow_soft_placement = True   # let TF pick a device if the specified one does not exist
    log_device_placement = False  # whether to log device placement

    print("Loading data...")
    w2vModel = create_word2vec.load_wordVectors(word2vec_path)

    labels_set = files_processing.read_txt(labels_file)
    labels_nums = len(labels_set)

    train_file_list = create_batch_data.get_file_list(file_dir=train_dir, postfix='*.npy')
    train_batch = create_batch_data.get_data_batch(train_file_list, labels_nums=labels_nums,
                                                   batch_size=batch_size, shuffle=False, one_hot=True)
    val_file_list = create_batch_data.get_file_list(file_dir=val_dir, postfix='*.npy')
    val_batch = create_batch_data.get_data_batch(val_file_list, labels_nums=labels_nums,
                                                 batch_size=batch_size, shuffle=False, one_hot=True)

    print("train data info *****************************")
    train_nums = create_word2vec.info_npy(train_file_list)
    print("val data info *****************************")
    val_nums = create_word2vec.info_npy(val_file_list)
    print("labels_set info *****************************")
    files_processing.info_labels_set(labels_set)

    # Training
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(allow_soft_placement=allow_soft_placement,
                                      log_device_placement=log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            cnn = TextCNN(sequence_length=max_sentence_length,
                          num_classes=labels_nums,
                          embedding_size=embedding_dim,
                          filter_sizes=filter_sizes,
                          num_filters=num_filters,
                          l2_reg_lambda=l2_reg_lambda)

            # Define training procedure
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(learning_rate=base_lr)
            # optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
            grads_and_vars = optimizer.compute_gradients(cnn.loss)
            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

            # Keep track of gradient values and sparsity (optional)
            grad_summaries = []
            for g, v in grads_and_vars:
                if g is not None:
                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name),
                                                         tf.nn.zero_fraction(g))
                    grad_summaries.append(grad_hist_summary)
                    grad_summaries.append(sparsity_summary)
            grad_summaries_merged = tf.summary.merge(grad_summaries)

            # Output directory for models and summaries
            print("Writing to {}\n".format(out_dir))

            # Summaries for loss and accuracy
            loss_summary = tf.summary.scalar("loss", cnn.loss)
            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)

            # Train summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

            # Checkpoint directory. TensorFlow assumes this directory already exists, so we need to create it
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

            # Initialize all variables
            sess.run(tf.global_variables_initializer())

            def train_step(x_batch, y_batch):
                """
                A single training step
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: dropout_keep_prob
                }
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy], feed_dict)
                if step % log_step == 0:
                    print("training: step {}, loss {:g}, acc {:g}".format(step, loss, accuracy))
                train_summary_writer.add_summary(summaries, step)

            def dev_step(x_batch, y_batch, writer=None):
                """
                Evaluates model on a dev set
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: 1.0
                }
                step, summaries, loss, accuracy = sess.run(
                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy], feed_dict)
                if writer:
                    writer.add_summary(summaries, step)
                return loss, accuracy

            for i in range(max_steps):
                train_batch_data, train_batch_label = create_batch_data.get_next_batch(train_batch)
                train_batch_data = create_word2vec.indexMat2vector_lookup(w2vModel, train_batch_data)

                train_step(train_batch_data, train_batch_label)
                current_step = tf.train.global_step(sess, global_step)
                if current_step % val_step == 0:
                    val_losses = []
                    val_accs = []
                    # for k in range(int(val_nums / batch_size)):
                    for k in range(100):
                        val_batch_data, val_batch_label = create_batch_data.get_next_batch(val_batch)
                        val_batch_data = create_word2vec.indexMat2vector_lookup(w2vModel, val_batch_data)
                        val_loss, val_acc = dev_step(val_batch_data, val_batch_label,
                                                     writer=dev_summary_writer)
                        val_losses.append(val_loss)
                        val_accs.append(val_acc)
                    mean_loss = np.array(val_losses, dtype=np.float32).mean()
                    mean_acc = np.array(val_accs, dtype=np.float32).mean()
                    print("--------Evaluation:step {}, loss {:g}, acc {:g}".format(current_step,
                                                                                   mean_loss, mean_acc))

                if current_step % snapshot == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                    print("Saved model checkpoint to {}\n".format(path))


def main():
    # Data preprocess
    labels_file = 'data/THUCNews_labels.txt'
    word2vec_path = "../word2vec/models/THUCNews_word2Vec/THUCNews_word2Vec_128.model"

    max_steps = 100000    # number of iterations
    batch_size = 128
    out_dir = "./models"  # output directory for model ckpt and summaries
    train_dir = './data/train_data'
    val_dir = './data/val_data'
    train(train_dir=train_dir,
          val_dir=val_dir,
          labels_file=labels_file,
          word2vec_path=word2vec_path,
          batch_size=batch_size,
          max_steps=max_steps,
          log_step=50,
          val_step=500,
          snapshot=1000,
          out_dir=out_dir)


if __name__ == "__main__":
    main()
```
Two test methods are provided:

(1) `text_predict(files_list, labels_file, models_path, word2vec_path, batch_size)`: classifies raw Chinese text files directly.

(2) `batch_predict(val_dir, labels_file, models_path, word2vec_path, batch_size)`: batch evaluation; the `val_dir` directory holds the test data as *.npy files, i.e. the THUCNews data preprocessed with the word2vec pipeline described above.
```python
#! /usr/bin/env python
# encoding: utf-8

import tensorflow as tf
import numpy as np
import os
import math
from text_cnn import TextCNN
from utils import create_batch_data, create_word2vec, files_processing


def text_predict(files_list, labels_file, models_path, word2vec_path, batch_size):
    '''
    Predict labels for raw Chinese text files.
    :param files_list: list of text files to classify
    :param labels_file: labels file path
    :param models_path: model checkpoint file
    :param word2vec_path: word-vector model file
    :param batch_size: batch size
    :return:
    '''
    max_sentence_length = 300
    embedding_dim = 128
    filter_sizes = [3, 4, 5, 6]
    num_filters = 200    # number of filters per filter size
    l2_reg_lambda = 0.0  # L2 regularization lambda (default: 0.0)

    print("Loading data...")
    w2vModel = create_word2vec.load_wordVectors(word2vec_path)
    labels_set = files_processing.read_txt(labels_file)
    labels_nums = len(labels_set)
    sample_num = len(files_list)
    labels_list = [-1] * sample_num

    with tf.Graph().as_default():
        sess = tf.Session()
        with sess.as_default():
            cnn = TextCNN(sequence_length=max_sentence_length,
                          num_classes=labels_nums,
                          embedding_size=embedding_dim,
                          filter_sizes=filter_sizes,
                          num_filters=num_filters,
                          l2_reg_lambda=l2_reg_lambda)
            # Initialize all variables, then restore the trained weights
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            saver.restore(sess, models_path)

            def pred_step(x_batch):
                """
                Runs the model's predictions on a batch
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.dropout_keep_prob: 1.0
                }
                pred = sess.run([cnn.predictions], feed_dict)
                return pred

            batchNum = int(math.ceil(1.0 * sample_num / batch_size))
            for i in range(batchNum):
                start = i * batch_size
                end = min((i + 1) * batch_size, sample_num)
                batch_files = files_list[start:end]
                # read the file contents and segment them into words
                batch_content = files_processing.read_files_list_to_segment(batch_files,
                                                                            max_sentence_length,
                                                                            padding_token='<PAD>')
                # [1] convert words to an index matrix, then map the indices to word vectors
                batch_indexMat = create_word2vec.word2indexMat(w2vModel, batch_content, max_sentence_length)
                val_batch_data = create_word2vec.indexMat2vector_lookup(w2vModel, batch_indexMat)
                # [2] or map words directly to word vectors:
                # val_batch_data = create_word2vec.word2vector_lookup(w2vModel, batch_content)
                pred = pred_step(val_batch_data)
                pred = pred[0].tolist()
                pred = files_processing.labels_decoding(pred, labels_set)
                for k, file in enumerate(batch_files):
                    print("{}, pred:{}".format(file, pred[k]))


def batch_predict(val_dir, labels_file, models_path, word2vec_path, batch_size):
    '''
    Batch evaluation on preprocessed *.npy data.
    :param val_dir: val-data directory
    :param labels_file: labels file path
    :param models_path: model checkpoint file
    :param word2vec_path: word-vector model file
    :param batch_size: batch size
    :return:
    '''
    max_sentence_length = 300
    embedding_dim = 128
    filter_sizes = [3, 4, 5, 6]
    num_filters = 200    # number of filters per filter size
    l2_reg_lambda = 0.0  # L2 regularization lambda (default: 0.0)

    print("Loading data...")
    w2vModel = create_word2vec.load_wordVectors(word2vec_path)
    labels_set = files_processing.read_txt(labels_file)
    labels_nums = len(labels_set)

    val_file_list = create_batch_data.get_file_list(file_dir=val_dir, postfix='*.npy')
    val_batch = create_batch_data.get_data_batch(val_file_list, labels_nums=labels_nums,
                                                 batch_size=batch_size, shuffle=False, one_hot=True)
    print("val data info *****************************")
    val_nums = create_word2vec.info_npy(val_file_list)
    print("labels_set info *****************************")
    files_processing.info_labels_set(labels_set)

    with tf.Graph().as_default():
        sess = tf.Session()
        with sess.as_default():
            cnn = TextCNN(sequence_length=max_sentence_length,
                          num_classes=labels_nums,
                          embedding_size=embedding_dim,
                          filter_sizes=filter_sizes,
                          num_filters=num_filters,
                          l2_reg_lambda=l2_reg_lambda)
            # Initialize all variables, then restore the trained weights
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            saver.restore(sess, models_path)

            def dev_step(x_batch, y_batch):
                """
                Evaluates model on a dev set
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: 1.0
                }
                loss, accuracy = sess.run([cnn.loss, cnn.accuracy], feed_dict)
                return loss, accuracy

            val_losses = []
            val_accs = []
            for k in range(int(val_nums / batch_size)):
                val_batch_data, val_batch_label = create_batch_data.get_next_batch(val_batch)
                val_batch_data = create_word2vec.indexMat2vector_lookup(w2vModel, val_batch_data)
                val_loss, val_acc = dev_step(val_batch_data, val_batch_label)
                val_losses.append(val_loss)
                val_accs.append(val_acc)
                print("--------Evaluation:step {}, loss {:g}, acc {:g}".format(k, val_loss, val_acc))
            mean_loss = np.array(val_losses, dtype=np.float32).mean()
            mean_acc = np.array(val_accs, dtype=np.float32).mean()
            print("--------Evaluation:step {}, mean loss {:g}, mean acc {:g}".format(k, mean_loss, mean_acc))


def main():
    labels_file = 'data/THUCNews_labels.txt'
    word2vec_path = "../word2vec/models/THUCNews_word2Vec/THUCNews_word2Vec_128.model"
    models_path = 'models/checkpoints/model-30000'
    batch_size = 128

    # (2) batch evaluation on the preprocessed val data
    val_dir = './data/val_data'
    batch_predict(val_dir=val_dir,
                  labels_file=labels_file,
                  models_path=models_path,
                  word2vec_path=word2vec_path,
                  batch_size=batch_size)

    # (1) classify raw Chinese text files
    test_path = '/home/ubuntu/project/tfTest/THUCNews/my_test'
    files_list = files_processing.get_files_list(test_path, postfix='*.txt')
    text_predict(files_list, labels_file, models_path, word2vec_path, batch_size)


if __name__ == "__main__":
    main()
```

**This project is quite old and is no longer maintained. A PyTorch implementation of TextCNN for Chinese text classification is now available; see:** [Pytorch TextCNN实现中文文本分类(附完整训练代码)](https://blog.csdn.net/guyuealian/article/details/127846717)