
[TensorFlow] Text Classification with a Convolutional Neural Network (CNN)

How a convolutional neural network, originally built for computer vision, carries over to text classification

CNNs are mostly seen in image-classification settings, which can leave the impression that they only work on images. In fact, a CNN can also be used to classify text.

Convolution is simply one way of extracting features; it is not limited to images. As long as convolution can extract useful features from the input, it can be applied.

1. How convolution is applied to text classification

The overall approach of applying convolution to text classification is as follows (the original post illustrates it with a figure; a small numeric sketch also follows this list):

  1. Tokenize the text and map it to vectors: convert the text (strings) into numbers, i.e. encode the text; the figure stores each sentence's encoding in a 7×5 matrix.
  2. Apply three different convolution window sizes, with two filters per size, producing 6 feature maps. (For example, a 2×5 window means "look at two neighboring words at a time".)
  3. Pooling: reduce the 6 feature maps to the same size.
  4. Concatenate the pooled feature maps.
  5. Use the resulting features for binary classification.
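
To make the window sizes concrete, here is a minimal NumPy sketch of steps 1-4 with toy numbers (a 7-word sentence with 5-dimensional word vectors, window heights 2 and 3, one filter each). It is only an illustration and is independent of the TensorFlow code later in this post:

# Toy sketch: full-width convolution windows sliding over an encoded sentence.
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.standard_normal((7, 5))        # 7 words x 5-dim word vectors (the 7x5 matrix above)

def conv_and_pool(text, window_height):
    """Slide one full-width window down the sentence, then max-pool over all positions."""
    kernel = rng.standard_normal((window_height, text.shape[1]))   # e.g. a 2x5 window
    feature_map = np.array([
        np.sum(text[i:i + window_height] * kernel)                  # one value per window position
        for i in range(text.shape[0] - window_height + 1)
    ])
    return feature_map.max()                                        # pooling: keep the strongest response

features = [conv_and_pool(sentence, h) for h in (2, 3)]
print(features)   # pooled features from the two window sizes; these would be concatenated for the classifier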

2. Classifying names with convolution

2.1 Training dataset

The training set is a .csv file storing name-to-gender mappings, 351,791 records in total. We will train a model that predicts whether a given name is male ("男") or female ("女").
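
Each row is expected to hold two columns, name and gender. The two lines below are made-up examples of that name,gender format, not actual records from the dataset:

李雷,男
韩梅梅,女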

2.2 Approach

  1. Encode gender with one-hot encoding, e.g. "男" → [0, 1] and "女" → [1, 0].
  2. Split each name into individual characters, count the frequency of every character, and sort the characters by frequency in descending order.
  3. Encode each character by its rank in that sorted list, e.g. if "李" ranks 9th it is encoded as 0009 (assuming the sorted vocabulary holds between 1,000 and 10,000 characters).
  4. Encode each name using those character codes. With a maximum name length of 8, names shorter than 8 are padded with 0, e.g. "李四" becomes [0009, 2901, 0, 0, 0, 0, 0, 0].
  5. Use embedding_lookup() to map each name to a (?, 8, 128) tensor.
  6. Use expand_dims() to turn the 3-D (?, 8, 128) tensor into a 4-D (?, 8, 128, 1) tensor so convolution can be applied.
  7. Convolve the names with filters of shape (3, 128, 1, 128), (4, 128, 1, 128) and (5, 128, 1, 128), producing feature maps of different sizes.
  8. Max-pool each feature map down to (?, 1, 1, 128); each name therefore yields three (?, 1, 1, 128) feature maps.
  9. Concatenate the three (?, 1, 1, 128) maps into (?, 1, 1, 384) (128*3 = 384), then reshape() the concatenated features into a flat (?, 384).
  10. Run binary classification on the result.

2.3 Implementation

Training program

main.py

# coding:utf-8
import tensorflow as tf
import numpy as np
import csv

name_dataset = 'name.csv'
train_x = []
train_y = []
with open(name_dataset, 'r', encoding='utf-8') as csvfile:
    read = csv.reader(csvfile)
    # read the CSV file row by row
    for sample in read:
        # keep only rows that have a label
        if len(sample) == 2:
            train_x.append(sample[0])
            if sample[1] == '男':
                train_y.append([0, 1])  # male, one-hot [0, 1]
            else:
                train_y.append([1, 0])  # female, one-hot [1, 0]

# maximum name length; longer names are truncated, shorter names are padded
max_name_length = max([len(name) for name in train_x])
print("Longest name (number of characters):", max_name_length)
max_name_length = 8

counter = 0
# vocabulary table
vocabulary = {}
# for each name
for name in train_x:
    counter += 1
    tokens = [word for word in name]
    # count the frequency of every character
    for word in tokens:
        if word in vocabulary:
            vocabulary[word] += 1
        else:
            vocabulary[word] = 1

# sort by frequency in descending order; index 0 is reserved for padding
vocabulary_list = [' '] + sorted(vocabulary, key=vocabulary.get, reverse=True)
print(len(vocabulary_list))

# encode the characters: each character gets a unique id (its rank)
vocab = dict([(x, y) for (y, x) in enumerate(vocabulary_list)])
train_x_vec = []
for name in train_x:
    name_vec = []
    # encode each character of the name
    for word in name:
        name_vec.append(vocab.get(word))
    # pad names shorter than the maximum length with 0
    while len(name_vec) < max_name_length:
        name_vec.append(0)
    # truncate names longer than the maximum length
    name_vec = name_vec[:max_name_length]
    train_x_vec.append(name_vec)

#######################################
input_size = max_name_length
num_classes = 2
batch_size = 64
num_batch = len(train_x_vec) // batch_size

X = tf.placeholder(tf.int32, [None, input_size])
Y = tf.placeholder(tf.float32, [None, num_classes])
dropout_keep_prob = tf.placeholder(tf.float32)

# vocabulary_size: number of characters in the vocabulary; embedding_size: each character maps to a 128-dim vector
def neural_network(vocabulary_size, embedding_size=128, num_filters=128):
    # embedding layer
    with tf.name_scope("embedding"):
        W = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        # map the name to vectors: (?, 8, 128)
        embedded_chars = tf.nn.embedding_lookup(W, X)
        # add a channel dimension of size 1 so conv2d can be applied: (?, 8, 128, 1)
        embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)
    # convolution + maxpool layers
    # different filter_sizes extract different features
    filter_sizes = [3, 4, 5]
    pooled_outputs = []
    for i, filter_size in enumerate(filter_sizes):
        with tf.name_scope("conv-maxpool-%s" % filter_size):
            filter_shape = [filter_size, embedding_size, 1, num_filters]
            W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]))
            conv = tf.nn.conv2d(embedded_chars_expanded, W, strides=[1, 1, 1, 1], padding="VALID")
            h = tf.nn.relu(tf.nn.bias_add(conv, b))
            # pool each feature map down to (?, 1, 1, 128)
            pooled = tf.nn.max_pool(h, ksize=[1, input_size - filter_size + 1, 1, 1], strides=[1, 1, 1, 1],
                                    padding="VALID")
            pooled_outputs.append(pooled)
    # 128 * 3
    num_filters_total = num_filters * len(filter_sizes)
    # concatenate into 384 features
    h_pool = tf.concat(pooled_outputs, 3)
    # flatten to a 384-dim feature vector
    h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])
    with tf.name_scope("dropout"):
        h_drop = tf.nn.dropout(h_pool_flat, dropout_keep_prob)
    with tf.name_scope("output"):
        # 384 * 2
        W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                            initializer=tf.contrib.layers.xavier_initializer())
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        output = tf.nn.xw_plus_b(h_drop, W, b)
    return output

def train_neural_network():
    output = neural_network(len(vocabulary_list))
    optimizer = tf.train.AdamOptimizer(1e-3)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=Y))
    # compute_gradients + apply_gradients together are equivalent to minimize():
    # the former computes the gradients, the latter uses them to update the corresponding variables
    grads_and_vars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(grads_and_vars)
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # iterate over epochs 0-200 so the final checkpoint is saved as name2sex-200
        for e in range(201):
            # iterate over the batches
            for i in range(num_batch):
                batch_x = train_x_vec[i * batch_size: (i + 1) * batch_size]
                batch_y = train_y[i * batch_size: (i + 1) * batch_size]
                _, loss_ = sess.run([train_op, loss], feed_dict={X: batch_x, Y: batch_y, dropout_keep_prob: 0.5})
                if i % 1000 == 0:
                    print('epoch:', e, 'iter:', i, 'loss:', loss_)
            if e % 100 == 0:
                # .meta stores the graph structure; .data stores the current weights
                saver.save(sess, "./model/name2sex", global_step=e)

train_neural_network()
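
As the comment in train_neural_network() notes, the compute_gradients()/apply_gradients() pair does the same job as a single minimize() call; the split form only makes it possible to inspect or clip the gradients in between. The one-step equivalent would be:

# equivalent one-step form (TensorFlow 1.x)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)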

Test program:

test.py

# coding:utf-8
import tensorflow as tf
import csv

name_dataset = 'name.csv'
train_x = []
train_y = []
with open(name_dataset, 'r', encoding='utf-8') as csvfile:
    read = csv.reader(csvfile)
    for sample in read:
        if len(sample) == 2:
            train_x.append(sample[0])
            if sample[1] == '男':
                train_y.append([0, 1])  # male
            else:
                train_y.append([1, 0])  # female

max_name_length = max([len(name) for name in train_x])
print("Longest name (number of characters):", max_name_length)
max_name_length = 8

counter = 0
vocabulary = {}
for name in train_x:
    counter += 1
    tokens = [word for word in name]
    for word in tokens:
        if word in vocabulary:
            vocabulary[word] += 1
        else:
            vocabulary[word] = 1

vocabulary_list = [' '] + sorted(vocabulary, key=vocabulary.get, reverse=True)
print(len(vocabulary_list))
vocab = dict([(x, y) for (y, x) in enumerate(vocabulary_list)])

train_x_vec = []
for name in train_x:
    name_vec = []
    for word in name:
        name_vec.append(vocab.get(word))
    while len(name_vec) < max_name_length:
        name_vec.append(0)
    train_x_vec.append(name_vec)

input_size = max_name_length
num_classes = 2
batch_size = 64
num_batch = len(train_x_vec) // batch_size

X = tf.placeholder(tf.int32, [None, input_size])
Y = tf.placeholder(tf.float32, [None, num_classes])
dropout_keep_prob = tf.placeholder(tf.float32)

def neural_network(vocabulary_size, embedding_size=128, num_filters=128):
    # embedding layer
    with tf.name_scope("embedding"):
        W = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embedded_chars = tf.nn.embedding_lookup(W, X)
        embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)
    filter_sizes = [3, 4, 5]
    pooled_outputs = []
    for i, filter_size in enumerate(filter_sizes):
        with tf.name_scope("conv-maxpool-%s" % filter_size):
            filter_shape = [filter_size, embedding_size, 1, num_filters]
            W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]))
            conv = tf.nn.conv2d(embedded_chars_expanded, W, strides=[1, 1, 1, 1], padding="VALID")
            h = tf.nn.relu(tf.nn.bias_add(conv, b))
            pooled = tf.nn.max_pool(h, ksize=[1, input_size - filter_size + 1, 1, 1], strides=[1, 1, 1, 1],
                                    padding="VALID")
            pooled_outputs.append(pooled)
    num_filters_total = num_filters * len(filter_sizes)
    h_pool = tf.concat(pooled_outputs, 3)
    h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])
    with tf.name_scope("dropout"):
        h_drop = tf.nn.dropout(h_pool_flat, dropout_keep_prob)
    with tf.name_scope("output"):
        W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                            initializer=tf.contrib.layers.xavier_initializer())
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        output = tf.nn.xw_plus_b(h_drop, W, b)
    return output

def detect_sex(name_list):
    x = []
    for name in name_list:
        name_vec = []
        for word in name:
            # characters outside the vocabulary fall back to the padding index 0
            name_vec.append(vocab.get(word, 0))
        while len(name_vec) < max_name_length:
            name_vec.append(0)
        x.append(name_vec)
    output = neural_network(len(vocabulary_list))
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # restore the checkpoint from a previous training run
        '''
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt != None:
            print(ckpt.model_checkpoint_path)
        '''
        # load the trained model
        saver.restore(sess, './model/name2sex-200')
        predictions = tf.argmax(output, 1)
        res = sess.run(predictions, {X: x, dropout_keep_prob: 1.0})
        i = 0
        for name in name_list:
            print(name, '女' if res[i] == 0 else '男')
            i += 1

detect_sex(["张金龙", "段玉刚", "金华花"])
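
If you would rather not hard-code the checkpoint file name, the commented-out get_checkpoint_state() idea in detect_sex() can be completed along these lines (a sketch replacing the saver.restore() call inside the session, assuming the checkpoints were saved under ./model as in main.py):

ckpt = tf.train.get_checkpoint_state('./model')
if ckpt and ckpt.model_checkpoint_path:
    # restore the newest checkpoint found in the directory
    saver.restore(sess, ckpt.model_checkpoint_path)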

Test results:

张金龙 男
段玉刚 男
金华花 女

 
