As in the previous posts, the data keeps the same format:
sentence,label
游戏太坑,暴率太低,太克金,平民不能玩,negative
让人失望,negative
能解决一下服务器问题?网络正常老掉线,换手机也一样。。。,negative
期待,positive
一星也不想给,这特么简直龟速,炫舞老年版?,negative
衣服不好看游戏内容无特色,界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩,很喜欢呀,希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative
import time

import jieba
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def get_stop_words():
    """Load the stop-word list, one word per line."""
    filename = "your stop words file path"
    stop_word_list = []
    with open(filename, encoding='utf-8') as f:
        for line in f:
            stop_word_list.append(line.strip())
    return stop_word_list


def processing_sentence(x, stop_words):
    """Segment a sentence with jieba and drop stop words."""
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    return ' '.join(words)


def data_processing():
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    # Note: xgboost >= 1.6 requires numeric class labels; if needed, map them
    # first with df['label'].map({'negative': 0, 'positive': 1}).
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))
    # Fit TF-IDF on the training split only, then reuse it on the test split.
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()
    return x_train_weight, x_test_weight, y_train, y_test
What text preprocessing produces is, as before, TF-IDF features over the segmented text.
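As a minimal sketch of what that step yields, here are two toy, already-segmented sentences (whitespace-joined jieba tokens; the sentences and resulting weights are illustrative, and get_feature_names_out assumes scikit-learn 1.0+):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["游戏 太坑 失望", "喜欢 游戏 期待"]  # pre-segmented toy sentences
tf = TfidfVectorizer()
weights = tf.fit_transform(toy_corpus).toarray()
print(tf.get_feature_names_out())  # the learned vocabulary
print(weights)                     # one TF-IDF row per sentence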
def train_model():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    start = time.time()
    print("start time is: ", start)
    # silent is deprecated in recent xgboost releases (use verbosity instead);
    # it is kept here to match the original run.
    model = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=100,
                              silent=False, objective='binary:logistic')
    model.fit(x_train_weight, y_train)
    end = time.time()
    print("end time is: ", end)
    print("cost time is: ", (end - start))
    y_predict = model.predict(x_test_weight)
    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('accuracy: ', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('classification report: ', metrics.classification_report(y_test, y_predict))
The training run produces the following output:
start time is:  1649228843.700035
end time is:  1649229253.274875
cost time is:  409.57483983039856
accuracy:  0.7524366471734892
confusion_matrix is:  [[137  80]
 [ 47 249]]
classification report:                precision    recall  f1-score   support

    negative       0.74      0.63      0.68       217
    positive       0.76      0.84      0.80       296

    accuracy                           0.75       513
   macro avg       0.75      0.74      0.74       513
weighted avg       0.75      0.75      0.75       513
XGBoost has quite a few parameters, and tuning them is an important part of using it in practice. Let's go through the main parameters below.
booster: string
Specify which booster to use: gbtree, gblinear or dart.
n_jobs : int
Number of parallel threads used to run xgboost. (replaces ``nthread``)
verbosity : int
The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
scale_pos_weight : float
Balancing of positive and negative weights.
booster selects the type of base learner; the default is gbtree.
scale_pos_weight mainly deals with class imbalance; the default is 1. When the classes are highly imbalanced, say a positive-to-negative ratio of 1:100, you can set scale_pos_weight to a value such as 10 (the heuristic in the XGBoost docs is sum(negative instances) / sum(positive instances)) to speed up convergence.
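As a hedged sketch, that suggested value can be computed straight from the label counts (the file-path placeholder matches data_processing above):

import pandas as pd

df = pd.read_csv("your train file path")  # same placeholder as data_processing()
neg = (df["label"] == "negative").sum()
pos = (df["label"] == "positive").sum()
# The heuristic from the XGBoost docs: sum(negative) / sum(positive).
print("suggested scale_pos_weight:", neg / pos)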
n_estimators : int
Number of trees to fit.
max_depth : int
Maximum tree depth for base learners.
min_child_weight : int
Minimum sum of instance weight(hessian) needed in a child.
gamma : float
Minimum loss reduction required to make a further partition on a leaf node of the tree.
max_delta_step : int
Maximum delta step we allow each tree's weight estimation to be.
subsample : float
Subsample ratio of the training instance.
colsample_bytree : float
Subsample ratio of columns when constructing each tree.
n_estimators: the number of trees
max_depth: the maximum depth of each tree
min_child_weight: the minimum sum of instance weights (hessian) required in a leaf node
gamma: the minimum loss reduction required to make a further split
max_delta_step: the maximum delta step allowed for each tree's weight estimation
subsample: the fraction of training samples drawn for each tree
colsample_bytree: the fraction of features used when building each tree
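For reference, a minimal sketch of where these tree parameters plug into XGBClassifier; the values here are illustrative, not tuned for this dataset:

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=100,      # number of trees
    max_depth=4,           # maximum depth per tree
    min_child_weight=1,    # minimum hessian sum required in a leaf
    gamma=0.1,             # minimum loss reduction to split further
    subsample=0.8,         # fraction of samples drawn per tree
    colsample_bytree=0.8,  # fraction of features used per tree
)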
learning_rate : float
Boosting learning rate (xgb's "eta")
objective : string or callable
Specify the learning task and the corresponding learning objective or
a custom objective function to be used (see note below).
reg_alpha : float (xgb's alpha)
L1 regularization term on weights
reg_lambda : float (xgb's lambda)
L2 regularization term on weights
objective includes:
Regression
reg:linear (the default; renamed reg:squarederror in newer releases)
reg:logistic
Binary classification
binary:logistic (outputs probabilities)
binary:logitraw (outputs raw scores before the logistic transform)
Multi-class classification
multi:softmax, with num_class=n (returns the predicted class)
multi:softprob, with num_class=n (returns per-class probabilities)
Ranking
rank:pairwise
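A minimal sketch with the native API, on random data, showing how objective and num_class interact for multi-class training (the data here is purely illustrative):

import numpy as np
import xgboost as xgb

X = np.random.rand(30, 5)
y = np.random.randint(0, 3, size=30)  # three classes
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "multi:softprob", "num_class": 3}
booster = xgb.train(params, dtrain, num_boost_round=10)

pred = booster.predict(dtrain)
print(pred.shape)  # (30, 3): one probability per class under multi:softprob
# With "multi:softmax" instead, predict() returns the class index directly.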
The parameters most commonly tuned in practice are the following (a tuning sketch follows the list):
n_estimators
max_depth
min_child_weight
gamma
subsample
colsample_bytree
learning_rate
num_round
scale_pos_weight
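A hedged sketch of searching part of that grid with scikit-learn's GridSearchCV; the grid values are illustrative only, and x_train_weight / y_train are assumed to come from data_processing above:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 4, 6],
    "min_child_weight": [1, 3],
    "subsample": [0.8, 1.0],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    xgb.XGBClassifier(n_estimators=100, objective="binary:logistic"),
    param_grid,
    scoring="accuracy",
    cv=3,
)
# search.fit(x_train_weight, y_train)
# print(search.best_params_)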
With a small change to the train method, we add code that plots feature importance:
import matplotlib.pyplot as plt
from xgboost import plot_importance


def train_model():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    start = time.time()
    print("start time is: ", start)
    model = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=50, n_jobs=2,
                              silent=False, objective='binary:logistic')
    model.fit(x_train_weight, y_train)
    end = time.time()
    print("end time is: ", end)
    print("cost time is: ", (end - start))
    y_predict = model.predict(x_test_weight)
    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('accuracy: ', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('classification report: ', metrics.classification_report(y_test, y_predict))
    # Plot the ten most important features by F score.
    fig, ax = plt.subplots(figsize=(15, 15))
    plot_importance(model,
                    height=0.5,
                    ax=ax,
                    max_num_features=10)
    plt.show()
Running this produces a feature-importance figure (figure omitted here), which visualizes the ten features with the highest F scores.
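plot_importance labels features f0, f1, ... by matrix column index. Here is a hypothetical helper for mapping those indices back to actual words, assuming data_processing() is modified to also return the fitted TfidfVectorizer (called tf below):

def top_words(model, vectorizer, k=10):
    # get_score returns {"f<col>": F score}; strip the "f" to index the vocab.
    scores = model.get_booster().get_score(importance_type="weight")
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(vocab[int(name[1:])], score) for name, score in ranked]

# print(top_words(model, tf))  # the ten words behind the plot above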