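The dataset used in this article is a subset of the THUCNews news text classification dataset provided by the Tsinghua NLP group (the original dataset contains roughly 740,000 documents and takes a long time to train on). This experiment uses 10 of its categories: sports (体育), finance (财经), real estate (房产), home (家居), education (教育), technology (科技), fashion (时尚), politics (时政), games (游戏) and entertainment (娱乐), with 6,500 articles per category and 65,000 articles in total. The project was run on the 和鲸社区 (Heywhale) platform and references the dataset hosted there directly.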

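The dataset is split as follows: cnews.train.txt is the training set (50,000 articles); cnews.val.txt is the validation set (5,000 articles); cnews.test.txt is the test set (10,000 articles).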
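The loading code in this article calls read_file from data_loader.cnews_loader, whose source is not shown. As a rough sketch of what such a reader might look like, assuming each line of the files holds a category label and the article text separated by a tab (the function name read_labeled_file and the three sample lines are hypothetical, not part of the original project):

```python
import os
import tempfile
from collections import Counter


def read_labeled_file(path):
    """Read lines of the form 'label<TAB>text' into two parallel lists."""
    contents, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip blank or malformed lines
            label, text = line.split("\t", 1)
            labels.append(label)
            contents.append(text)
    return contents, labels


# Tiny stand-in file so the sketch runs without the real dataset.
sample = "体育\t球队昨晚赢得了比赛\n财经\t股市今日小幅上涨\n体育\t马拉松比赛周日开跑\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

demo_contents, demo_labels = read_labeled_file(tmp.name)
os.remove(tmp.name)

demo_counts = Counter(demo_labels)  # articles per category
print(demo_counts)
```

Counting the labels this way is also how the per-category totals are checked after loading.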


import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint
from time import time
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report

from data_loader.cnews_loader import *
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
# Paths for reading the data and saving the model and results
base_dir = '/home/kesci/input/new3021'
train_dir = os.path.join(base_dir, 'cnews.train.txt')
test_dir = os.path.join(base_dir, 'cnews.test.txt')
val_dir = os.path.join(base_dir, 'cnews.val.txt')
vocab_dir = os.path.join(base_dir, 'cnews.vocab.txt')
save_dir = 'checkpoints/textcnn'
save_path = os.path.join(save_dir, 'best_validation')
train_contents, train_labels = read_file(train_dir)
test_contents, test_labels = read_file(test_dir)
label_counts = Counter(train_labels)  # number of training articles per category
label_counts
    Counter({'体育': 5000,
             '娱乐': 5000,
             '家居': 5000,
             '房产': 5000,
             '教育': 5000,
             '时尚': 5000,
             '时政': 5000,
             '游戏': 5000,
             '科技': 5000,
             '财经': 5000})
    time: 577 ms
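import re

def clear_character(sentence):
    """Strip bracketed annotations and every character that is not Chinese, a letter or a digit."""
    pattern1 = r'\[.*?\]'  # text enclosed in square brackets
    pattern2 = re.compile('[^\u4e00-\u9fa5a-zA-Z0-9]')  # anything except Chinese characters, letters and digits
    line1 = re.sub(pattern1, '', sentence)
    line2 = pattern2.sub('', line1)
    new_sentence = ''.join(line2.split())  # remove whitespace
    return new_sentence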
train_text=list(map(lambda s: clear_character(s), train_contents))
test_text=list(map(lambda s: clear_character(s), test_contents))
import jieba
train_seg_text=list(map(lambda s: jieba.lcut(s), train_text))
test_seg_text=list(map(lambda s: jieba.lcut(s), test_text))
stop_words_path = "/home/kesci/work/data_loader/百度停用词列表.txt"
def get_stop_words():
    # The stop-word list is GBK-encoded with Windows line endings
    with open(stop_words_path, 'rb') as f:
        return set(f.read().decode('gbk').split('\r\n'))
stopwords = get_stop_words()
# Remove stop words from a tokenized article
def drop_stopwords(line, stopwords):
    line_clean = []
    for word in line:
        if word in stopwords:
            continue
        line_clean.append(word)
    return line_clean
train_st_text=list(map(lambda s: drop_stopwords(s,stopwords), train_seg_text))
test_st_text=list(map(lambda s: drop_stopwords(s,stopwords), test_seg_text))
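le = LabelEncoder()
# Encode the category names as integer ids; every model below uses label_train_id and label_test_id
label_train_id = le.fit_transform(train_labels)
label_test_id = le.transform(test_labels)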
train_c_text=list(map(lambda s: ' '.join(s), train_st_text))
test_c_text=list(map(lambda s: ' '.join(s), test_st_text))
tfidf_model = TfidfVectorizer(binary=False,token_pattern=r"(?u)\b\w+\b")
train_Data = tfidf_model.fit_transform(train_c_text)
test_Data = tfidf_model.transform(test_c_text)
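The token_pattern passed to TfidfVectorizer above matters for Chinese text: the library's default pattern only keeps tokens of two or more characters, so single-character words would be dropped. The following standard-library sketch illustrates the difference and then reproduces the weighting that TfidfVectorizer applies with its default settings (smoothed idf followed by L2 normalization). The three toy documents are made up for illustration.

```python
import math
import re
from collections import Counter

docs = ["我 喜欢 篮球 比赛", "股市 今日 上涨", "我 关注 股市"]

default_pattern = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn default: 2+ characters
custom_pattern = re.compile(r"(?u)\b\w+\b")     # pattern used above: 1+ characters

default_tokens = default_pattern.findall(docs[0])  # the single character "我" is dropped
custom_tokens = custom_pattern.findall(docs[0])    # "我" is kept

tokenized = [custom_pattern.findall(d) for d in docs]
n_docs = len(tokenized)
# Document frequency: in how many documents each token occurs.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))


def tfidf(tokens):
    """tf * (ln((1 + n) / (1 + df)) + 1), then L2-normalized."""
    tf = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


weights = tfidf(tokenized[0])
print(default_tokens)
print(custom_tokens)
print(weights)
```

Tokens that occur in fewer documents receive a larger weight, which is why "喜欢" ends up above "我" in the first document.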
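Logistic regression is a machine learning method for binary (0 or 1) classification that estimates the probability of an event: for example, the probability that a user buys a product, that a patient has a disease, or that a user clicks an ad. scikit-learn's LogisticRegression also handles problems with more than two classes, such as the 10 news categories here.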

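Logistic regression and linear regression are both generalized linear models. Logistic regression assumes the dependent variable y follows a Bernoulli distribution, whereas linear regression assumes y follows a Gaussian distribution. The two therefore have much in common: without the sigmoid mapping, the logistic regression score is just the linear predictor of a linear regression. Logistic regression builds on linear regression, and the sigmoid function adds the nonlinearity that maps this score to a probability between 0 and 1, which is what makes it suitable for 0/1 classification.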
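The relationship between the linear score and the sigmoid can be sketched in a few lines. The weights, bias and feature values below are hypothetical numbers chosen for illustration, not parameters learned from the news data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    # The linear part is exactly what a linear regression would compute.
    score = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid turns that score into P(y = 1 | x).
    return sigmoid(score)


weights, bias = [2.0, -1.0], 0.5            # hypothetical parameters
p = predict_proba(weights, bias, [1.0, 2.5])
label = 1 if p >= 0.5 else 0                # threshold the probability at 0.5
print(p, label)
```

A score of zero sits exactly on the decision boundary, where the predicted probability is 0.5.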

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()  # assumption: default hyperparameters, the original settings are not shown
classifier.fit(train_Data, label_train_id)
pred = classifier.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))
                  precision    recall  f1-score   support
               0     0.9970    0.9950    0.9960      1000
               1     0.9850    0.9850    0.9850      1000
               2     0.9651    0.8560    0.9073      1000
               3     0.8963    0.9080    0.9021      1000
               4     0.9680    0.9070    0.9365      1000
               5     0.9676    0.9850    0.9762      1000
               6     0.9251    0.9630    0.9437      1000
               7     0.9682    0.9750    0.9716      1000
               8     0.9438    0.9910    0.9668      1000
               9     0.9457    0.9920    0.9683      1000
        accuracy                         0.9557     10000
       macro avg     0.9562    0.9557    0.9553     10000
    weighted avg     0.9562    0.9557    0.9553     10000
    time: 1min 9s
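The columns of the report are related in a simple way, and the figures can be checked by hand. The sketch below recomputes the F1 score of class 2 from its precision and recall, and the macro-averaged recall from the ten per-class recalls, using the numbers printed in the logistic regression report above. Because every class has the same support (1,000 test articles), the macro average, the weighted average and the accuracy coincide for recall.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 2 of the logistic regression report.
f1_class2 = f1_from(0.9651, 0.8560)

# Per-class recall values of the same report, classes 0 to 9.
recalls = [0.9950, 0.9850, 0.8560, 0.9080, 0.9070,
           0.9850, 0.9630, 0.9750, 0.9910, 0.9920]
macro_recall = sum(recalls) / len(recalls)

print(round(f1_class2, 4), round(macro_recall, 4))
```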
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge", penalty="l2")
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred,digits=4))
                  precision    recall  f1-score   support
               0     0.9980    1.0000    0.9990      1000
               1     0.9841    0.9880    0.9860      1000
               2     0.9793    0.8500    0.9101      1000
               3     0.9105    0.9160    0.9133      1000
               4     0.9755    0.9160    0.9448      1000
               5     0.9519    0.9900    0.9706      1000
               6     0.9370    0.9660    0.9513      1000
               7     0.9771    0.9820    0.9796      1000
               8     0.9447    0.9910    0.9673      1000
               9     0.9403    0.9930    0.9660      1000
        accuracy                         0.9592     10000
       macro avg     0.9598    0.9592    0.9588     10000
    weighted avg     0.9598    0.9592    0.9588     10000
    time: 4.24 s
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion = 'entropy' ,random_state = 0)
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred,digits=4))
                 precision    recall  f1-score   support
               0     0.9619    0.9850    0.9733      1000
               1     0.9360    0.9070    0.9213      1000
               2     0.8000    0.5440    0.6476      1000
               3     0.9681    0.9700    0.9690      1000
               4     0.7939    0.7240    0.7573      1000
               5     0.8329    0.9170    0.8729      1000
               6     0.7935    0.8070    0.8002      1000
               7     0.8624    0.9090    0.8851      1000
               8     0.8170    0.8930    0.8533      1000
               9     0.8458    0.9710    0.9041      1000
        accuracy                         0.8627     10000
       macro avg     0.8612    0.8627    0.8584     10000
    weighted avg     0.8612    0.8627    0.8584     10000
    time: 1min 45s
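A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. By default each sub-sample has the same size as the original input sample, and with bootstrap=True (the default) the samples are drawn with replacement.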
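The two ideas in this description, bootstrap sampling and averaging, can be sketched with the standard library alone. The sample indices and the per-tree class probabilities below are made up for illustration; a real forest would obtain the probabilities from trees fitted on each bootstrap sample.

```python
import random

random.seed(0)
n_samples = 10
indices = list(range(n_samples))

# One bootstrap sample: same size as the original, drawn with replacement,
# so some indices repeat and others are left out ("out of bag").
bootstrap = [random.choice(indices) for _ in range(n_samples)]
out_of_bag = sorted(set(indices) - set(bootstrap))

# Hypothetical class probabilities from three trees for one article.
tree_probas = [
    {"体育": 0.9, "财经": 0.1},
    {"体育": 0.4, "财经": 0.6},
    {"体育": 0.7, "财经": 0.3},
]

# The forest averages the per-tree probabilities and picks the largest.
avg_proba = {
    label: sum(p[label] for p in tree_probas) / len(tree_probas)
    for label in tree_probas[0]
}
forest_prediction = max(avg_proba, key=avg_proba.get)
print(bootstrap, out_of_bag, avg_proba, forest_prediction)
```

Even though one tree prefers "财经", the averaged probabilities favour "体育", which is how averaging smooths out the errors of individual trees.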

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(criterion='gini')
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))
                  precision    recall  f1-score   support
               0     0.9831    0.9870    0.9850      1000
               1     0.9221    0.9700    0.9454      1000
               2     0.7577    0.5690    0.6499      1000
               3     0.8127    0.9330    0.8687      1000
               4     0.8690    0.7960    0.8309      1000
               5     0.9248    0.9220    0.9234      1000
               6     0.8408    0.8660    0.8532      1000
               7     0.8966    0.9020    0.8993      1000
               8     0.9325    0.9390    0.9357      1000
               9     0.8962    0.9760    0.9344      1000
        accuracy                         0.8860     10000
       macro avg     0.8835    0.8860    0.8826     10000
    weighted avg     0.8835    0.8860    0.8826     10000
    time: 31.6 s
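A support vector machine (SVM) is a binary classification model. Its basic form is the linear classifier with the largest margin in feature space, and the maximum margin is what distinguishes it from the perceptron. With the kernel trick, an SVM becomes in effect a nonlinear classifier. The learning strategy is margin maximization, which can be formulated as a convex quadratic programming problem and is equivalent to minimizing a regularized hinge loss. The learning algorithm is therefore an optimization algorithm for solving that convex quadratic program.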
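The regularized hinge loss mentioned here is the same objective that SGDClassifier(loss="hinge", penalty="l2") minimizes earlier in this article. A minimal sketch of that objective for a linear classifier follows; the weights, the regularization strength reg and the three data points are hypothetical values chosen so the arithmetic is easy to follow.

```python
def hinge_loss(weights, bias, features, label):
    """label is +1 or -1; the loss is zero once the margin reaches 1."""
    margin = label * (bias + sum(w * x for w, x in zip(weights, features)))
    return max(0.0, 1.0 - margin)


def objective(weights, bias, data, reg=0.01):
    """Average hinge loss plus an L2 penalty on the weights."""
    data_term = sum(hinge_loss(weights, bias, x, y) for x, y in data) / len(data)
    return data_term + reg * sum(w * w for w in weights)


weights, bias = [1.0, -1.0], 0.0   # hypothetical linear classifier
data = [
    ([2.0, 0.0], +1),   # margin 2.0: correct and outside the margin, loss 0
    ([0.5, 0.0], +1),   # margin 0.5: correct but inside the margin, loss 0.5
    ([1.0, 0.0], -1),   # margin -1.0: misclassified, loss 2
]
losses = [hinge_loss(weights, bias, x, y) for x, y in data]
total = objective(weights, bias, data)
print(losses, total)
```

Points that are classified correctly with a margin of at least 1 contribute nothing, so only points near or across the boundary shape the solution.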

from sklearn import svm

# Option 1: SVC(); kernel = 'linear', 'rbf' (default), 'poly' or a custom kernel; penalty C = 1.0 (default)
clf = svm.SVC()
# Option 2: NuSVC()
# clf = svm.NuSVC()
# Option 3: LinearSVC(); penalty: 'l1' or 'l2' (default 'l2')
# clf = svm.LinearSVC()
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))

                  precision    recall  f1-score   support
               0     1.0000    0.6530    0.7901      1000
               1     0.9913    0.6850    0.8102      1000
               2     0.6515    0.7030    0.6763      1000
               3     0.9700    0.3880    0.5543      1000
               4     1.0000    0.2690    0.4240      1000
               5     1.0000    0.6090    0.7570      1000
               6     0.6097    0.9450    0.7412      1000
               7     0.9927    0.1360    0.2392      1000
               8     0.2565    0.9990    0.4082      1000
               9     1.0000    0.7170    0.8352      1000
        accuracy                         0.6104     10000
       macro avg     0.8472    0.6104    0.6236     10000
    weighted avg     0.8472    0.6104    0.6236     10000
    time: 2h 16min 32s
from sklearn import svm

# Option 1: SVC(); kernel = 'linear', 'rbf' (default), 'poly' or a custom kernel; penalty C = 1.0 (default)
# clf = svm.SVC()
# Option 2: NuSVC()
clf = svm.NuSVC()
# Option 3: LinearSVC(); penalty: 'l1' or 'l2' (default 'l2')
# clf = svm.LinearSVC()
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))

                  precision    recall  f1-score   support
               0     0.9979    0.9430    0.9697      1000
               1     0.9862    0.9290    0.9567      1000
               2     0.3267    0.8180    0.4669      1000
               3     0.7879    0.7060    0.7447      1000
               4     0.9816    0.6920    0.8117      1000
               5     0.9949    0.7810    0.8751      1000
               6     0.8250    0.8060    0.8154      1000
               7     0.9891    0.4550    0.6233      1000
               8     0.2503    0.2460    0.2481      1000
               9     0.9564    0.7680    0.8519      1000
        accuracy                         0.7144     10000
       macro avg     0.8096    0.7144    0.7364     10000
    weighted avg     0.8096    0.7144    0.7364     10000
    time: 1h 7min 16s
from sklearn import svm

# Option 1: SVC(); kernel = 'linear', 'rbf' (default), 'poly' or a custom kernel; penalty C = 1.0 (default)
# clf = svm.SVC()
# Option 2: NuSVC()
# clf = svm.NuSVC()
# Option 3: LinearSVC(); penalty: 'l1' or 'l2' (default 'l2')
clf = svm.LinearSVC()
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))

                  precision    recall  f1-score   support
               0     0.9970    1.0000    0.9985      1000
               1     0.9889    0.9840    0.9865      1000
               2     0.9627    0.8780    0.9184      1000
               3     0.9147    0.9110    0.9128      1000
               4     0.9660    0.9100    0.9372      1000
               5     0.9725    0.9900    0.9812      1000
               6     0.9467    0.9590    0.9528      1000
               7     0.9666    0.9840    0.9752      1000
               8     0.9521    0.9940    0.9726      1000
               9     0.9377    0.9930    0.9645      1000
        accuracy                         0.9603     10000
       macro avg     0.9605    0.9603    0.9600     10000
    weighted avg     0.9605    0.9603    0.9600     10000
    time: 4.99 s
from sklearn.naive_bayes import ComplementNB

clf = ComplementNB()
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))
                  precision    recall  f1-score   support
               0     0.9940    0.9990    0.9965      1000
               1     0.9482    0.9890    0.9682      1000
               2     0.9798    0.7260    0.8340      1000
               3     0.8016    0.9210    0.8571      1000
               4     0.9413    0.9300    0.9356      1000
               5     0.9722    0.9810    0.9766      1000
               6     0.9525    0.9220    0.9370      1000
               7     0.9879    0.9790    0.9834      1000
               8     0.9441    0.9960    0.9693      1000
               9     0.9477    0.9960    0.9712      1000
        accuracy                         0.9439     10000
       macro avg     0.9469    0.9439    0.9429     10000
    weighted avg     0.9469    0.9439    0.9429     10000
    time: 386 ms
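AdaBoost is short for adaptive boosting. Boosting is a technique closely related to bagging. In bagging, the original dataset is sampled S times to produce S new datasets, each the same size as the original; every new dataset is built by drawing samples from the original at random with replacement, so the same sample can be chosen more than once. Once the S datasets are built, a learning algorithm is applied to each of them to obtain S classifiers; to classify new data, all S classifiers are applied and the class with the most votes is the final result. Boosting differs in that each new classifier is obtained by concentrating on the data that the existing classifiers misclassified. The classifiers also carry unequal weights: each weight reflects how successful the corresponding classifier was in the previous iteration, and the final prediction is a weighted sum over all classifiers.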
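One round of the reweighting described here can be worked through by hand. The sketch below uses the classic two-class AdaBoost update with labels +1 and -1, which is simpler than the multi-class SAMME variant selected in the code that follows; the labels and the weak classifier's predictions are hypothetical.

```python
import math

labels      = [1,  1, -1, -1, 1]
predictions = [1, -1, -1, -1, 1]   # hypothetical weak classifier; the second sample is wrong
sample_weights = [1 / len(labels)] * len(labels)   # start from uniform weights

# Weighted error of the weak classifier.
error = sum(w for w, y, p in zip(sample_weights, labels, predictions) if y != p)

# Classifier weight: the lower the error, the larger its say in the final vote.
alpha = 0.5 * math.log((1 - error) / error)

# Misclassified samples (y * p = -1) are scaled up, correct ones scaled down.
updated = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(sample_weights, labels, predictions)]
total = sum(updated)
sample_weights = [w / total for w in updated]   # renormalize to sum to 1

print(error, alpha, sample_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.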

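# Ensemble of decision trees
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adaboost_dt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(
        max_depth=2,           # maximum tree depth; unlimited if omitted
        min_samples_split=20,  # minimum samples needed to split an internal node (default 2); raise it for large datasets
        min_samples_leaf=5     # minimum samples per leaf (default 1); raise it for large datasets
    ),
    algorithm="SAMME",   # boosting algorithm, 'SAMME' or 'SAMME.R' (the latter was the default in older scikit-learn releases)
    n_estimators=200,    # at most 200 weak learners (default 50)
    learning_rate=0.8    # learning rate (default 1)
)
adaboost_dt_clf.fit(train_Data, label_train_id)
pred = adaboost_dt_clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))
# Note: the report below is identical to the ComplementNB report above (same scores, same 386 ms).
# The original cell most likely predicted with `clf`, which still held the ComplementNB model,
# so these figures should not be read as AdaBoost results.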
                  precision    recall  f1-score   support
               0     0.9940    0.9990    0.9965      1000
               1     0.9482    0.9890    0.9682      1000
               2     0.9798    0.7260    0.8340      1000
               3     0.8016    0.9210    0.8571      1000
               4     0.9413    0.9300    0.9356      1000
               5     0.9722    0.9810    0.9766      1000
               6     0.9525    0.9220    0.9370      1000
               7     0.9879    0.9790    0.9834      1000
               8     0.9441    0.9960    0.9693      1000
               9     0.9477    0.9960    0.9712      1000
        accuracy                         0.9439     10000
       macro avg     0.9469    0.9439    0.9429     10000
    weighted avg     0.9469    0.9439    0.9429     10000
    time: 386 ms
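Unlike AdaBoost, gradient boosting chooses the direction of gradient descent at each iteration to get the best final result. The loss function measures how reliable the model is: assuming the model does not overfit, the larger the loss, the higher the error rate. If the loss keeps decreasing, the model keeps improving, and the most effective way to achieve that is to move along the negative gradient of the loss, so each new learner is fitted to that negative gradient.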
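A small worked example may make the idea concrete. The sketch below boosts one-split regression stumps on a toy one-dimensional problem with squared loss, for which the negative gradient is simply the residual y - F(x). This is an illustration under simplifying assumptions: GradientBoostingClassifier uses a classification loss and deeper trees, and the data, the helper fit_stump and the learning rate here are made up.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split minimizing squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
learning_rate = 0.5

preds = [sum(ys) / len(ys)] * len(ys)   # stage 0: a constant model
losses = [mse(preds, ys)]
for _ in range(20):
    # Negative gradient of the squared loss = residual of the current model.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Take a damped step in the direction of the fitted negative gradient.
    preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    losses.append(mse(preds, ys))

print(losses[0], losses[-1])
```

The recorded loss never increases from one stage to the next, which is the behaviour the paragraph above describes.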

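from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    max_depth=4,          # maximum depth of each tree (default 3)
    # Features considered per split: int, float, "sqrt", "log2" or None.
    # The original run used "auto", which meant sqrt(n_features) for this classifier
    # and is no longer accepted by recent scikit-learn releases.
    max_features="sqrt",
    n_estimators=100      # number of boosting stages (default 100)
)
clf.fit(train_Data, label_train_id)
pred = clf.predict(test_Data)
print(classification_report(label_test_id, pred, digits=4))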
                  precision    recall  f1-score   support
               0     0.9980    0.9970    0.9975      1000
               1     0.9810    0.9800    0.9805      1000
               2     0.8911    0.8760    0.8835      1000
               3     0.9910    0.9930    0.9920      1000
               4     0.9412    0.8970    0.9186      1000
               5     0.9437    0.9720    0.9576      1000
               6     0.9309    0.9430    0.9369      1000
               7     0.9832    0.9340    0.9579      1000
               8     0.9315    0.9660    0.9485      1000
               9     0.9613    0.9940    0.9774      1000
        accuracy                         0.9552     10000
       macro avg     0.9553    0.9552    0.9550     10000
    weighted avg     0.9553    0.9552    0.9550     10000
    time: 1h 16min 36s
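To compare the models at a glance, the accuracy and cell time printed in the reports above can be collected and sorted. The figures below are copied from those reports, with times converted to seconds; AdaBoost is left out because the report shown for it duplicates the ComplementNB one. The times are single-run cell timings on the author's platform, so they are only a rough guide.

```python
# (model, test accuracy, reported cell time in seconds)
results = [
    ("LogisticRegression",         0.9557,   69.0),
    ("SGDClassifier (hinge)",      0.9592,    4.24),
    ("DecisionTreeClassifier",     0.8627,  105.0),
    ("RandomForestClassifier",     0.8860,   31.6),
    ("SVC",                        0.6104, 8192.0),
    ("NuSVC",                      0.7144, 4036.0),
    ("LinearSVC",                  0.9603,    4.99),
    ("ComplementNB",               0.9439,    0.386),
    ("GradientBoostingClassifier", 0.9552, 4596.0),
]

ranked = sorted(results, key=lambda r: r[1], reverse=True)  # best accuracy first
fastest = min(results, key=lambda r: r[2])

for name, accuracy, seconds in ranked:
    print(f"{name:<28}{accuracy:.4f}{seconds:>10.1f} s")
```

On these runs the linear models (LinearSVC, SGDClassifier, LogisticRegression) are both the most accurate and among the fastest, while the kernel SVMs are the slowest and least accurate.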
