Language identification: Latin-script languages are all built from letters, often even the same letters => what differs is how the letters are used (their ordering and frequencies).
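As a toy illustration of that observation, here is a minimal sketch of frequency-based language guessing (the reference samples below are tiny stand-ins for real per-language corpora, and real detectors also use character n-grams):

from collections import Counter

def letter_profile(text):
    # relative frequency of each letter in the text
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters) or 1
    counts = Counter(letters)
    return {c: counts[c] / total for c in counts}

def overlap(p, q):
    # histogram intersection of two letter-frequency profiles (higher = more similar)
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))

def guess_language(text, reference_texts):
    # pick the language whose letter usage best matches the input
    profile = letter_profile(text)
    return max(reference_texts,
               key=lambda lang: overlap(profile, letter_profile(reference_texts[lang])))

reference_texts = {
    'english': 'the quick brown fox jumps over the lazy dog and then runs away',
    'spanish': 'el rápido zorro marrón salta sobre el perro perezoso y luego huye',
}
print(guess_language('where is the nearest train station', reference_texts))  # english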
Bag-of-words model (for Chinese):
① Word segmentation:
Sentence 1: [w1 w3 w5 w2 w1 …]
Sentence 2: [w11 w32 w51 w21 w15 …]
Sentence 3: …
…
② Count word frequencies:
w3 count3
w7 count7
wi count_i
…
③ Build the vocabulary:
Pick the N most frequent words
Allocate a [1*N] vector space
(each position corresponds to one word)
④ Mapping: map each sentence onto the shared vocabulary (steps ①–④ are sketched in code right after this outline)
Sentence 1: [1 0 1 0 1 0 …]
Sentence 2: [0 0 0 0 0 0 … 1, 0 … 1, 0 …]
⑤ Make the representation more expressive (e.g., with the n-gram features we add later).
⑥ All the representations above treat words independently (no notion of how words are distributed in a semantic space):
喜欢 = 在乎 = “稀罕” = “中意” (four different words, all meaning roughly "to like / to care for", yet unrelated in this representation)
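Here is a minimal, self-contained sketch of steps ① through ④ above (the sentences and variable names are our own toy examples):

from collections import Counter

# ① segmentation: assume the sentences are already split into words
segmented = [
    ['我', '喜欢', '自然语言', '处理'],
    ['我', '在乎', '机器', '学习'],
]

# ② count word frequencies over the whole corpus
freq = Counter(w for sent in segmented for w in sent)

# ③ build the vocabulary from the N most frequent words; one vector slot per word
N = 6
vocab = [w for w, _ in freq.most_common(N)]

# ④ map each sentence onto the shared vocabulary as a 0/1 presence vector
vectors = [[1 if w in sent else 0 for w in vocab] for sent in segmented]
print(vocab)
print(vectors)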
Modeling on the vectorized input:
① Classical models: NB / LR / SVM …
② Neural models: MLP / CNN / LSTM
Let's use Naive Bayes to build a Chinese text classifier. Given enough data with sufficient variety, Naive Bayes usually achieves quite respectable accuracy on this task.
Machine learning algorithms can't get good results without data, so let's load the data first and take a look.
With the data ready, we pick five categories of text to process: technology, car, entertainment, military, and sports.
import jieba
import pandas as pd

df_technology = pd.read_csv("/jhub/students/data/course11/项目2/origin_data/technology_news.csv", encoding='utf-8')
df_technology = df_technology.dropna()

df_car = pd.read_csv("/jhub/students/data/course11/项目2/origin_data/car_news.csv", encoding='utf-8')
df_car = df_car.dropna()

df_entertainment = pd.read_csv("/jhub/students/data/course11/项目2/origin_data/entertainment_news.csv", encoding='utf-8')
df_entertainment = df_entertainment.dropna()

df_military = pd.read_csv("/jhub/students/data/course11/项目2/origin_data/military_news.csv", encoding='utf-8')
df_military = df_military.dropna()

df_sports = pd.read_csv("/jhub/students/data/course11/项目2/origin_data/sports_news.csv", encoding='utf-8')
df_sports = df_sports.dropna()

# Take 20000 items per category to keep the classes balanced
technology = df_technology.content.values.tolist()[1000:21000]
car = df_car.content.values.tolist()[1000:21000]
entertainment = df_entertainment.content.values.tolist()[:20000]
military = df_military.content.values.tolist()[:20000]
sports = df_sports.content.values.tolist()[:20000]

# Pick a few samples and take a look
print(technology[12])
print(car[100])
# Load the stopword list
stopwords = pd.read_csv("/jhub/students/data/course11/项目2/origin_data/stopwords.txt",
                        index_col=False, quoting=3, sep="\t",
                        names=['stopword'], encoding='utf-8')
stopwords = stopwords['stopword'].values
We preprocess the data (segmentation and stopword removal); the target_path argument is where the processed data would be written out to a new folder, so the work isn't repeated each time.
def preprocess_text(content_lines, sentences, category, target_path):
    # target_path: reserved for writing the processed data out (unused in this snippet)
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = list(filter(lambda x: len(x) > 1, segs))           # drop single-character tokens
            segs = list(filter(lambda x: x not in stopwords, segs))   # drop stopwords
            sentences.append((" ".join(segs), category))
        except Exception as e:
            # skip lines that fail to segment
            print(line)
            continue

# Build the training data
sentences = []
preprocess_text(technology, sentences, 'technology', '../data/lesson2_data/data')
preprocess_text(car, sentences, 'car', '../data/lesson2_data/data')
preprocess_text(entertainment, sentences, 'entertainment', '../data/lesson2_data/data')
preprocess_text(military, sentences, 'military', '../data/lesson2_data/data')
preprocess_text(sports, sentences, 'sports', '../data/lesson2_data/data')
Let's shuffle the order to produce a more reliable training set.
import random
random.shuffle(sentences)
print(sentences[:10])
To check how well our classifier performs later, we need a test set.
So we split the original dataset into a training set and a test set, using sklearn's built-in split function.
from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
print(len(y))
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1234)
print(len(x_train)) # 65696
The next step is to extract useful features from the cleaned data: we extract bag-of-words features from the text.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word',    # split on whitespace-separated words (the text was pre-segmented with jieba)
    max_features=4000,  # keep the 4000 most frequent terms
)
vec.fit(x_train)
def get_features(x):
    return vec.transform(x)
Import the classifier and train it:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
print(classifier.score(vec.transform(x_test), y_test)) # 0.831
print(len(x_test)) #21899
We can see that on 20,000+ test samples, we reach about 83% accuracy over the 5 classes.
Is there a way to push the accuracy higher?
We can make the features stronger: for example, add 2-gram and 3-gram statistical features, and also enlarge the vocabulary a bit.
['我', '爱', '自然语言', '处理']
2-gram: ['我爱', '爱自然语言', '自然语言处理']
3-gram: ['我爱自然语言', '爱自然语言处理']
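A minimal sketch of how such n-grams can be generated from a segmented token list (the ngrams helper below is ours, for illustration; CountVectorizer builds its word n-grams internally):

# slide a window of size n over the token list and join each window
def ngrams(tokens, n):
    return [''.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['我', '爱', '自然语言', '处理']
print(ngrams(tokens, 2))  # ['我爱', '爱自然语言', '自然语言处理']
print(ngrams(tokens, 3))  # ['我爱自然语言', '爱自然语言处理']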
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word',      # split on whitespace-separated words
    ngram_range=(1, 4),   # use n-grams of size 1, 2, 3, 4
    max_features=20000,   # keep the 20000 most frequent n-grams
)
vec.fit(x_train)
def get_features(x):
    return vec.transform(x)
Train the classifier:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
classifier.score(vec.transform(x_test), y_test) # 0.873
A more reliable way to validate is cross-validation; but for cross-validation it's best to keep the class distribution within each fold relatively balanced, so here we use StratifiedKFold.
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score
import numpy as np

def stratifiedkfold_cv(x, y, clf_class, shuffle=True, n_folds=5, **kwargs):
    stratifiedk_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    y_pred = y.copy()  # use a copy; slicing an ndarray returns a view, which would overwrite y
    for train_index, test_index in stratifiedk_fold.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

NB = MultinomialNB
print(precision_score(y, stratifiedkfold_cv(vec.transform(x), np.array(y), NB), average='macro'))
After the K-fold cross-validation, the macro-averaged precision over the 5 classes comes out to about 88%.
Finally, let's wrap the vectorizer and the classifier into a reusable class:

from joblib import dump, load  # needed by save_model / load_model below
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

class TextClassifier():

    def __init__(self, classifier=MultinomialNB()):
        self.classifier = classifier
        self.vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 4), max_features=20000)

    def features(self, X):
        return self.vectorizer.transform(X)

    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    def predict(self, x):
        return self.classifier.predict(self.features([x]))

    def score(self, X, y):
        return self.classifier.score(self.features(X), y)

    def save_model(self, path):
        dump((self.classifier, self.vectorizer), path)

    def load_model(self, path):
        self.classifier, self.vectorizer = load(path)
text_classifier = TextClassifier()
text_classifier.fit(x_train, y_train)
print(text_classifier.predict('这 是 有史以来 最 大 的 一 次 军舰 演习'))
print(text_classifier.score(x_test, y_test))
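The class already exposes save_model / load_model (backed by joblib's dump and load), so persisting the trained vectorizer and classifier is straightforward; the file name below is just an illustrative choice:

# save the trained (classifier, vectorizer) pair, then restore it in a fresh object
text_classifier.save_model('text_classifier.joblib')

restored = TextClassifier()
restored.load_model('text_classifier.joblib')
print(restored.predict('这 是 有史以来 最 大 的 一 次 军舰 演习'))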
Let's try a support vector machine.
from sklearn.svm import SVC
svm = SVC(kernel='linear')
# svm = SVC()  # you can also try the RBF kernel
svm.fit(vec.transform(x_train), y_train)
svm.score(vec.transform(x_test), y_test)
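As a side note: with a linear kernel on high-dimensional sparse bag-of-words features like these, sklearn's LinearSVC (liblinear-based) is usually much faster to train than SVC(kernel='linear'), at comparable accuracy. A minimal sketch using the same matrices as above:

from sklearn.svm import LinearSVC

# liblinear scales much better to large sparse term-count matrices than the kernel SVC
fast_svm = LinearSVC()
fast_svm.fit(vec.transform(x_train), y_train)
print(fast_svm.score(vec.transform(x_test), y_test))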
Note: data preprocessing is quite slow on a Windows laptop.
The data has already been segmented and stripped of stopwords, and placed under processed_data; here we read it in directly.
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

def parse_lines(p):
    # p is a (path, content) pair from wholeTextFiles; the file name is the category
    lines = p[1].split('\n')
    category = p[0].split('/')[-1].split('.')[0]
    return [Row(cate=category, sentence=sent) for sent in lines]

def words_classify_main(spark):
    sc = spark.sparkContext
    # Tokenizer lowercases the input string and splits it on whitespace
    tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
    # Hash the features into numFeatures buckets
    hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=8000)
    # Inverse Document Frequency
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    # Load the input files into a DataFrame
    srcdf = sc.wholeTextFiles("file://" + './processed_data').flatMap(parse_lines).toDF()
    # Split into 80% training and 20% test
    training, testing = srcdf.randomSplit([0.8, 0.2])
    # Tokenize the training set
    wordsData = tokenizer.transform(training)
    # Turn the token lists into feature vectors
    featurizedData = hashingTF.transform(wordsData)
    # Fit the IDF model on the training features
    idfModel = idf.fit(featurizedData)
    # Get the TF-IDF weight of each term
    rescaledData = idfModel.transform(featurizedData)
    # Encode the category labels
    label_stringIdx = StringIndexer(inputCol="cate", outputCol="label")
    pipeline = Pipeline(stages=[label_stringIdx])
    pipelineFit = pipeline.fit(rescaledData)
    trainData = pipelineFit.transform(rescaledData)
    # Persist to avoid recomputation
    trainData.persist()
    # Convert the dataset into the input format NaiveBayes expects
    trainDF = trainData.select("features", "label").rdd.map(
        lambda x: Row(label=x['label'], features=Vectors.dense(x['features']))
    ).toDF()
    # NaiveBayes classifier
    naivebayes = NaiveBayes(smoothing=1.0, modelType="multinomial")
    # Fit the NaiveBayesModel on the training set
    model = naivebayes.fit(trainDF)
    # Tokenize the test set
    testWordsData = tokenizer.transform(testing)
    # Turn the token lists into feature vectors
    testFeaturizedData = hashingTF.transform(testWordsData)
    # Reuse the IDF model fitted on the training set (fitting on the test set would leak information)
    testRescaledData = idfModel.transform(testFeaturizedData)
    # Apply the same label encoding to the test set
    testData = pipelineFit.transform(testRescaledData)
    # Persist to avoid recomputation
    testData.persist()
    testDF = testData.select("features", "label").rdd.map(
        lambda x: Row(label=x['label'], features=Vectors.dense(x['features']))
    ).toDF()
    # Predict on the test set with the trained model
    predictions = model.transform(testDF)
    predictions.show()
    # Compute the model's accuracy on the test set
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Accuracy on the test set: " + str(accuracy))

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("spark_naivebayes_classify") \
        .getOrCreate()
    words_classify_main(spark)
    spark.stop()