
【Intro to NLP: Chinese Text Classification】Step-by-Step Walkthrough, with Keras Code


I. NLP Text Classification Steps

Step 1: prepare the dataset. X: the sentences; Y: the class labels.

Step 2: tokenize and remove stop words (Chinese has stop words such as 而且, commas and the like; for English you would also need things like tense normalization).

Step 3: word2idx / word2vec. With word2vec, you can train a model on a corpus that maps each word to a vector: you feed the model a word and it returns a vector, and it can also compute similarities between words, which effectively normalizes the vocabulary in advance. With word2idx, you simply use each word's vocabulary id as the vector element.

Step 4: build and train a model.

II. Code

1. Data Preparation and Preprocessing

  1. We use the Toutiao (今日头条) news headline dataset for this demo.
  2. https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset

  After downloading it, the data needs some preprocessing. I prefer to convert it into a JSON-style dict, and I split it 80/20 into training and test sets:

    import random

    # Each line of the raw file holds one sample, with fields separated by "_!_":
    # id _!_ class code _!_ class name _!_ title _!_ ...
    file = open("./toutiao.txt", 'r')
    file = file.readlines()
    print(file[0], len(file))
    print(file[0].split("_!_"))

    data = {"train": [],
            "test": [],
            "class_name": {},
            "class_info": {}}

    # shuffle data
    random.shuffle(file)
    max_sentence = 0
    for i, line in enumerate(file):
        if i < int(0.8 * len(file)):                # first 80% -> training set
            line = line.split("_!_")
            if line[1] not in data["class_name"].keys():
                data["class_name"][line[1]] = line[2]       # class code -> readable name
            if line[1] not in data["class_info"].keys():    # per-class sample count
                data["class_info"][line[1]] = 1
            else:
                data["class_info"][line[1]] += 1
            data["train"].append({"x": line[3],
                                  "y": line[1]})
            max_sentence = len(line[3]) if len(line[3]) > max_sentence else max_sentence
        else:                                        # remaining 20% -> test set
            line = line.split("_!_")
            if line[1] not in data["class_name"].keys():
                data["class_name"][line[1]] = line[2]
            if line[1] not in data["class_info"].keys():
                data["class_info"][line[1]] = 1
            else:
                data["class_info"][line[1]] += 1
            data["test"].append({"x": line[3],
                                 "y": line[1]})
            max_sentence = len(line[3]) if len(line[3]) > max_sentence else max_sentence

    data["max_sentence"] = max_sentence
    data["num_train"] = len(data["train"])
    data["num_test"] = len(data["test"])

  

    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline   # Jupyter notebook magic; remove when running as a plain script
    from matplotlib import gridspec

    fig = plt.figure(figsize=(20, 4.5))
    gs = gridspec.GridSpec(1, 2, width_ratios=[1, 2.5])
    ax1 = plt.subplot(gs[0])
    ax2 = plt.subplot(gs[1])

    # left panel: train/test split as a pie chart
    counts = [data["num_train"], data["num_test"]]
    colors = ['silver', 'purple']
    explode = (0.1, 0)  # explode 1st slice
    labels = ['train', 'test']
    ax1.pie(counts, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True, startangle=140)

    # right panel: number of samples per class as a bar chart
    counts = []
    labels = []
    for namecode in data["class_name"].keys():
        counts.append(data["class_info"][namecode])
        labels.append(data["class_name"][namecode])
    print(len(counts), len(labels))
    print(counts)
    print(labels)
    df = pd.DataFrame({"labels": labels,
                       "counts": counts})
    ax2.bar(df["labels"], df["counts"])
    ax2.set_title("nums")
    ax2.set_ylabel("nums")
    # ax2.set_xticks(rotation=-15)
    ax2.set_xticklabels(labels=labels, rotation=-15)
    plt.show()

That wraps up the data analysis and preprocessing.

2. Tokenizing the Sentences

    import re
    import jieba

    # Chinese stop-word list (see the GitHub link below)
    stopwords = [i.strip() for i in open('stop_words.txt').readlines()]

    def pretty_cut(sentence):
        # keep only Chinese characters, then tokenize with jieba in full mode
        cut_list = jieba.lcut(''.join(re.findall('[\u4e00-\u9fa5]', sentence)), cut_all=True)
        # iterate backwards so deleting stop words does not shift the remaining indices
        for i in range(len(cut_list) - 1, -1, -1):
            if cut_list[i] in stopwords:
                del cut_list[i]
        return cut_list

The stop_words.txt file used here can be downloaded from GitHub:

GitHub - goto456/stopwords: 中文常用停用词表 (commonly used Chinese stop-word lists: the HIT list, the Baidu list, and others)

The cn_stopwords.txt file is sufficient.

The output looks like the following:
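
As a quick, hypothetical check (the headline below is made up, not taken from the dataset):

    # Tokenize a made-up headline to see pretty_cut's output format.
    example = "人工智能正在改变我们的生活方式"
    print(pretty_cut(example))   # a list of jieba tokens with the stop words filtered out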

3. Converting Chinese Text to Vectors (several approaches are possible)

Method 1: word2vec

    import logging
    import sys
    # import gensim.models as word2vec
    from gensim.models.word2vec import LineSentence, logger
    from gensim.models import Word2Vec

    # tokenize every title once and cache the result on each sample as "x_jieba"
    train_data_wv = []
    for sentence in data["train"]:
        jieba_word = " ".join(pretty_cut(sentence["x"]))
        train_data_wv.append(jieba_word)
        sentence["x_jieba"] = jieba_word
    for sentence in data["test"]:
        jieba_word = " ".join(pretty_cut(sentence["x"]))
        train_data_wv.append(jieba_word)
        sentence["x_jieba"] = jieba_word

    # gensim expects an iterable of token lists, so split the space-joined sentences back up
    tokenized = [s.split(" ") for s in train_data_wv]
    train_w2v = Word2Vec(tokenized, window=5, min_count=0, vector_size=50, workers=10)
    train_w2v.train(tokenized, total_examples=len(tokenized), epochs=10)
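
Once trained, the word2vec model can be queried directly; a quick sanity check might look like the snippet below (the query word 电影 is an arbitrary choice and should be replaced with any token that actually appears in your corpus):

    # Inspect the learned vectors: dimensionality and nearest neighbours of one token.
    vec = train_w2v.wv["电影"]                      # any word present in the corpus
    print(vec.shape)                                # (50,) given vector_size=50 above
    print(train_w2v.wv.most_similar("电影", topn=5))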

Method 2: word2idx

    import numpy as np

    # build the vocabulary: every word gets an integer id starting from 1
    # (id 0 is kept free for padding, see the note below)
    vocabs = {}
    index_word = 1
    for se in train_data_wv:
        vo = set(se.split(' '))
        for word in vo:
            if word not in vocabs.keys():
                vocabs[word] = index_word
                index_word += 1

    def get_pretrain_pad_seq(vocab, sentence, maxlen):
        # map each token to its id and pad the sequence with 0 up to maxlen
        transformed_sentence = []
        for word in sentence:
            tran_word = vocab.get(word, None)
            if tran_word:
                transformed_sentence.append(tran_word)
            else:
                transformed_sentence.append(107335)  # hard-coded out-of-vocabulary id
        transformed_sentence += [0 for _ in range(abs(maxlen - len(sentence)))]
        return np.array(transformed_sentence)

    # length of the longest tokenized sentence, used as the padding length
    max_len = 0
    for sentence in train_data_wv:
        max_len = len(sentence.split(' ')) if len(sentence.split(' ')) > max_len else max_len

  With Method 2, you usually need the vocabulary size plus one: the extra index, normally 0, is reserved for padding.
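
A small sketch of that bookkeeping, using the vocabs, pretty_cut, get_pretrain_pad_seq, and max_len defined above (the headline is again a made-up example):

    # index 0 never appears in `vocabs` (ids start at 1), so it is free to act as padding
    max_features = len(vocabs) + 1
    print("embedding input size:", max_features)

    # encode and zero-pad one made-up headline to length max_len
    tokens = pretty_cut("明天全国大部分地区气温下降")
    print(get_pretrain_pad_seq(vocabs, tokens, max_len))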

  Next, the whole dataset is organized and converted into training matrices:

    trainX = []
    trainY = []
    testX = []
    testY = []
    for sample in data["train"]:
        jieba_word = sample["x_jieba"].split(' ')
        x = get_pretrain_pad_seq(vocabs, jieba_word, max_len)
        trainX.append(x)
        trainY.append(int(sample["y"]))
    for sample in data["test"]:
        jieba_word = sample["x_jieba"].split(' ')
        x = get_pretrain_pad_seq(vocabs, jieba_word, max_len)
        testX.append(x)
        testY.append(int(sample["y"]))

    trainX = np.array(trainX)
    trainY = np.array(trainY)
    testX = np.array(testX)
    testY = np.array(testY)

    import tensorflow as tf
    # the three-digit class codes are used directly as one-hot indices
    trainY = tf.keras.utils.to_categorical(trainY)
    testY = tf.keras.utils.to_categorical(testY)
    print(trainX.shape, trainY.shape, testX.shape, testY.shape)

4. Building and Training the Model

    from tensorflow.keras import Model
    from tensorflow.keras.layers import Embedding, Dense, LSTM

    class TextRNN(Model):
        def __init__(self,
                     maxlen,
                     max_features,
                     embedding_dims,
                     class_num=1,
                     last_activation='sigmoid'):
            super(TextRNN, self).__init__()
            self.maxlen = maxlen
            self.max_features = max_features
            self.embedding_dims = embedding_dims
            self.class_num = class_num
            self.last_activation = last_activation
            self.embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen)
            self.rnn = LSTM(128)  # LSTM or GRU
            self.classifier = Dense(self.class_num, activation=self.last_activation)

        def call(self, inputs):
            if len(inputs.get_shape()) != 2:
                raise ValueError('The rank of inputs of TextRNN must be 2, but now is %d' % len(inputs.get_shape()))
            if inputs.get_shape()[1] != self.maxlen:
                raise ValueError('The maxlen of inputs of TextRNN must be %d, but now is %d' % (self.maxlen, inputs.get_shape()[1]))
            embedding = self.embedding(inputs)
            x = self.rnn(embedding)
            output = self.classifier(x)
            return output

  That is the model code; next comes the training code:

    from tensorflow.keras.callbacks import EarlyStopping
    from tensorflow.keras.preprocessing import sequence

    max_features = 107335  # vocabulary size + 1
    maxlen = 80            # note: the model below is built with the max_len computed earlier, not this value
    batch_size = 32
    embedding_dims = 32
    epochs = 10

    print('Build model...')
    # 15 classes, but the class codes are three-digit numbers,
    # so the output layer uses 117 units when the codes serve directly as one-hot indices
    model = TextRNN(max_len, max_features, embedding_dims, class_num=117, last_activation='softmax')
    model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])

    print('Train...')
    early_stopping = EarlyStopping(monitor='val_accuracy', patience=3, mode='max')
    model.fit(trainX, trainY,
              batch_size=batch_size,
              epochs=epochs,
              callbacks=[early_stopping],
              validation_data=(testX, testY))

    print('Test...')
    result = model.predict(testX)
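
model.predict only returns per-class probabilities, so turning result into a test accuracy takes one more step; one straightforward option is to compare argmaxes against the one-hot labels:

    import numpy as np

    # Convert probabilities and one-hot labels back to class indices and compare them.
    pred_classes = np.argmax(result, axis=1)
    true_classes = np.argmax(testY, axis=1)
    print("test accuracy:", np.mean(pred_classes == true_classes))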

 

Here the Embedding layer maps each word's original index to a dense word vector.
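
If you trained the word2vec model from Method 1, those vectors can also be used to initialize the Embedding layer instead of learning it from scratch. A minimal sketch, assuming the vocabs mapping from Method 2 and the train_w2v model from Method 1 are both available:

    import numpy as np
    from tensorflow.keras.layers import Embedding

    # Row i of the matrix holds the word2vec vector of the word whose id is i;
    # words missing from the word2vec vocabulary stay all-zero.
    embedding_dims = 50                      # matches vector_size=50 used for train_w2v
    embedding_matrix = np.zeros((max_features, embedding_dims))
    for word, idx in vocabs.items():
        if word in train_w2v.wv:
            embedding_matrix[idx] = train_w2v.wv[word]

    # A drop-in alternative for the Embedding layer inside TextRNN (a sketch, not the original code).
    pretrained_embedding = Embedding(max_features, embedding_dims,
                                     weights=[embedding_matrix],
                                     input_length=max_len,
                                     trainable=False)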

 
