
Text Classification in Natural Language Processing

Note: all code in this post runs under Python 3. Python 3 and Python 2 differ in some details, so please keep that in mind. This post is code-centric, with detailed comments in the code. Related articles will appear in my personal blog column "Python Natural Language Processing"; you are welcome to follow it.


1. First, an example that uses a naive Bayes classifier to predict gender from a name.

import random
import nltk
from nltk.corpus import names

# Feature extractor
def gender_features(word):  # extract the last letter of the string
    return {'last_letter': word[-1]}

# Collect names and genders from the names corpus
names_set = ([(name, 'male') for name in names.words('male.txt')] +
             [(name, 'female') for name in names.words('female.txt')])
print(names_set[:10])
random.shuffle(names_set)  # shuffle the names in names_set
print(names_set[:10])
featuresets = [(gender_features(n), g) for (n, g) in names_set]  # build the feature set
train_set, test_set = featuresets[500:], featuresets[:500]  # split off training and test data
classifier = nltk.NaiveBayesClassifier.train(train_set)  # naive Bayes classifier
# Test
classifier.classify(gender_features('Neo'))
classifier.classify(gender_features('Trinity'))

The test output:

'male'
'female'

2. Computing classification accuracy and the most informative features

# Classification accuracy
print(nltk.classify.accuracy(classifier, test_set))
0.768
# Most informative features
classifier.show_most_informative_features(5)  # show the 5 most informative features
Most Informative Features
             last_letter = 'a'            female : male   =     40.2 : 1.0
             last_letter = 'k'              male : female =     30.8 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0

3. When the data set is very large, it can be split as follows.

# Splitting a large data set
from nltk.classify import apply_features
train_set = apply_features(gender_features, names_set[500:])
test_set = apply_features(gender_features, names_set[:500])

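apply_features builds feature sets lazily instead of materializing one feature dict per example up front. The idea can be sketched without NLTK; LazyFeatures below is a hypothetical stand-in for illustration, not NLTK's actual implementation:

```python
# Sketch of the idea behind apply_features: features are computed on access,
# so the full list of feature dicts never has to fit in memory at once.
class LazyFeatures:
    def __init__(self, feature_func, labeled_data):
        self.feature_func = feature_func
        self.labeled_data = labeled_data

    def __len__(self):
        return len(self.labeled_data)

    def __getitem__(self, i):
        item, label = self.labeled_data[i]
        # the feature dict is built only when this element is requested
        return (self.feature_func(item), label)

def gender_features(word):
    return {'last_letter': word[-1]}

lazy = LazyFeatures(gender_features, [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazy), lazy[1])  # behaves like a list of (features, label) pairs
```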
4. Bayes' theorem and its implementation in code.

Bayes' theorem, with the feature playing the role of the evidence:

P(class | feature) = P(feature | class) * P(class) / P(feature)

The naive Bayes classifier assigns each input the class that maximizes P(class | feature).

Implementing the classifier in code:

# Computing the Bayes classifier by hand
# P(feature | class)
def f_c(data, fea, cla):
    cfd = nltk.ConditionalFreqDist((classes, features) for (features, classes) in data)
    return cfd[cla].freq(fea)

# P(feature)
def p_feature(data, fea):
    fd = nltk.FreqDist(fea for (fea, cla) in data)
    return fd.freq(fea)

# P(class)
def p_class(data, cla):
    fd = nltk.FreqDist(cla for (fea, cla) in data)
    return fd.freq(cla)

# P(class | feature)
def res(data, fea, cla):
    return f_c(data, fea, cla) * p_class(data, cla) / p_feature(data, fea)

Testing Bayes' theorem:

# Build the input data set of (last letter, gender) pairs
data = ([(name[-1], 'male') for name in names.words('male.txt')] +
        [(name[-1], 'female') for name in names.words('female.txt')])
random.shuffle(data)
train, test = data[500:], data[:500]
# P(male | last letter 'k') and P(female | last letter 'a')
res(train, 'k', 'male')
res(train, 'a', 'female')

The results:

0.955223880597015
0.9829612220916567
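The same manual computation can be checked without NLTK, using plain counting on a toy data set (the four (letter, gender) pairs below are invented for illustration):

```python
# P(fea | cla) * P(cla) / P(fea), all three estimated from raw counts
def bayes(data, fea, cla):
    n = len(data)
    cla_count = sum(1 for f, c in data if c == cla)
    fea_count = sum(1 for f, c in data if f == fea)
    joint = sum(1 for f, c in data if f == fea and c == cla)
    p_f_given_c = joint / cla_count
    p_c = cla_count / n
    p_f = fea_count / n
    return p_f_given_c * p_c / p_f

# toy (last letter, gender) data -- hypothetical, for illustration only
data = [('a', 'female'), ('a', 'female'), ('a', 'male'), ('k', 'male')]
print(bayes(data, 'a', 'female'))  # 2 of the 3 'a' names are female -> 2/3
```

Note that the result collapses to joint / fea_count, which is exactly the fraction of 'a'-final names that are female; the three-factor form just mirrors the formula above.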

5. Choosing the right features.

(1) Overfitting

# Overfitting
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

featuresets = [(gender_features2(n), g) for (n, g) in names_set]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

Result:

0.76
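One way to see why this feature set overfits: gender_features2 emits 54 features per name (first/last letter plus a count and a has flag for each of 26 letters), most of them rare or irrelevant, versus the single last_letter feature used earlier. A quick NLTK-free check, restating the extractor from the text:

```python
# gender_features2 restated from the text, so it can be inspected standalone
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

print(len(gender_features2('John')))  # 2 + 26 + 26 = 54 features per name
```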

(2) Splitting the data set and re-testing

# Split the data into training, dev-test, and test sets
train_names = names_set[1500:]
devtest_names = names_set[500:1500]
test_names = names_set[:500]
# Retrain the model
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

Result:

0.761

(3) Printing the misclassified examples

# Print the error list
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

Result (excerpt):

correct=female   guess=male     name=Adelind
correct=female   guess=male     name=Aeriel
correct=female   guess=male     name=Aeriell
correct=female   guess=male     name=Ag
correct=female   guess=male     name=Aidan
correct=female   guess=male     name=Allsun
correct=female   guess=male     name=Anabel
correct=female   guess=male     name=Ardelis
correct=female   guess=male     name=Aryn
correct=female   guess=male     name=Betteann
correct=female   guess=male     name=Bill
correct=female   guess=male     name=Blondell

(4) Rebuilding the features and retraining

# Rebuild the features
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

# Retrain the model
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

Result:

0.791

6. Document classification

from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
# Feature extractor for document classification
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.most_common()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for (word, freq) in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
# Build the classifier
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.show_most_informative_features(5))

Results:

0.86
Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.4 : 1.0
        contains(seagal) = True              neg : pos    =      8.7 : 1.0
         contains(mulan) = True              pos : neg    =      8.1 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.3 : 1.0
         contains(damon) = True              pos : neg    =      5.7 : 1.0
None
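Each contains(word) feature is just a set-membership test. A minimal NLTK-free sketch, with a made-up three-word feature list:

```python
# Hypothetical word list standing in for the 2000 most common corpus words
word_features = ['outstanding', 'boring', 'wonderfully']

def document_features(document):
    document_words = set(document)  # set membership is O(1) per lookup
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

feats = document_features(['an', 'outstanding', 'film'])
print(feats)
```

Converting the document to a set first matters: checking membership in a list would rescan the whole document for every one of the 2000 feature words.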

7. Part-of-speech tagging

# Part-of-speech tagging
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = suffix_fdist.most_common()[:100]
print(common_suffixes)

# Define the feature extractor
def pos_features(word):
    features = {}
    for (suffix, freq) in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

# Train the classifier (small slices keep decision-tree training fast)
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
train_set, test_set = featuresets[:1000], featuresets[2000:3000]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(pos_features('cats')))

Results:

0.611
NNS

Printing the decision tree:

# Print the decision tree as pseudocode
print(classifier.pseudocode(depth=4))

Output:

if endswith(he) == False:
  if endswith(s) == False:
    if endswith(.) == False:
      if endswith(of) == False: return 'CD'
      if endswith(of) == True: return 'IN'
    if endswith(.) == True: return '.'
  if endswith(s) == True:
    if endswith(as) == False:
      if endswith('s) == False: return 'NPS'
      if endswith('s) == True: return 'NN$'
    if endswith(as) == True:
      if endswith(was) == False: return 'CS'
      if endswith(was) == True: return 'BEDZ'
if endswith(he) == True:
  if endswith(the) == False: return 'PPS'
  if endswith(the) == True: return 'AT'

8. A feature extractor based on context

# Feature extractor that uses context
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
    return features

pos_features(brown.sents()[0], 8)
tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

Result:

0.7891596220785678

9. Sequence classification

# Sequence classification
# Define the feature extractor; history holds the tags predicted so far
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-tag"] = history[i - 1]
    return features

# Build the sequence classifier
class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))

Result:

0.7980528511821975

10. Sentence segmentation

# Sentence segmentation
# Get the pre-segmented sentence data
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in nltk.corpus.treebank_raw.sents():
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

# Define the feature extractor
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i + 1][0].isupper(),
            'prevword': tokens[i - 1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i - 1]) == 1}

# Label each punctuation token: is it a sentence boundary?
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens) - 1)
               if tokens[i] in '.?!']
# Build the classifier
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

Result:

 0.936026936026936
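The boundary bookkeeping above (offset accumulates sentence lengths, so offset - 1 is the index of each sentence-final token in the flat stream) can be verified on a toy corpus; the sentences below are made up:

```python
# Toy check of the boundary-offset bookkeeping
sents = [['Hello', '.'], ['How', 'are', 'you', '?'], ['Fine', '.']]
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)  # index of the last token of each sentence

print(tokens)              # flat token stream of length 8
print(sorted(boundaries))  # [1, 5, 7] -> positions of '.', '?', '.'
```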

A classification-based sentence segmenter:

# Classification-based sentence segmenter
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i + 1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    return sents

11. Identifying dialogue act types

# Identifying dialogue act types
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

# Define the feature extractor
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

# Train the classifier
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

Result:

0.668

12. Evaluation

(1) Accuracy

#### Evaluation ####
# Create training and test sets
import random
from nltk.corpus import brown
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]
# Using files of the same genre
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])
# Using files of different genres
train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')
## Accuracy ##
names_set = ([(name, 'male') for name in names.words('male.txt')] +
             [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names_set)
featuresets = [(gender_features(n), g) for (n, g) in names_set]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: %4.2f' % nltk.classify.accuracy(classifier, test_set))

Result:

Accuracy: 0.78

(2) Precision and recall

## Precision and recall ##
from sklearn.metrics import classification_report
test_set_fea = [features for (features, gender) in test_set]
test_set_gen = [gender for (features, gender) in test_set]
pre = classifier.classify_many(test_set_fea)
print(classification_report(test_set_gen, pre))

Result:

             precision    recall  f1-score   support

     female       0.83      0.82      0.83       316
       male       0.70      0.71      0.70       184

avg / total       0.78      0.78      0.78       500

(3) Confusion matrix

## Confusion matrix ##
cm = nltk.ConfusionMatrix(test_set_gen, pre)
print(cm)

Result:

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<260> 56 |
  male |  54<130>|
-------+---------+
(row = reference; col = test)

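The confusion matrix and the classification report agree: precision and recall can be read directly off the matrix counts. A quick check using the numbers above (rows are reference labels, columns are predictions):

```python
# Confusion-matrix counts from the output above: (reference, predicted) -> n
cm = {('female', 'female'): 260, ('female', 'male'): 56,
      ('male', 'female'): 54, ('male', 'male'): 130}

def precision(cm, label):
    # of everything predicted as `label`, how much really was?
    predicted = sum(n for (ref, test), n in cm.items() if test == label)
    return cm[(label, label)] / predicted

def recall(cm, label):
    # of everything that really was `label`, how much did we find?
    actual = sum(n for (ref, test), n in cm.items() if ref == label)
    return cm[(label, label)] / actual

print(round(precision(cm, 'female'), 2), round(recall(cm, 'female'), 2))
print(round(precision(cm, 'male'), 2), round(recall(cm, 'male'), 2))
```

These reproduce the 0.83/0.82 (female) and 0.70/0.71 (male) figures from the classification report.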
(4) Entropy and information gain in decision trees

# Entropy and information gain
import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)]
    return -sum([p * math.log(p, 2) for p in probs])

print(entropy(['male', 'male', 'male', 'male']))
-0.0
print(entropy(['male', 'female', 'male', 'male']))
0.8112781244591328
print(entropy(['female', 'male', 'female', 'male']))
1.0
print(entropy(['female', 'female', 'male', 'female']))
0.8112781244591328
print(entropy(['female', 'female', 'female', 'female']))
-0.0
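The same entropy computation works without NLTK, using collections.Counter in place of FreqDist:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the label distribution
    counts = Counter(labels)
    total = len(labels)
    probs = [n / total for n in counts.values()]
    return -sum(p * math.log(p, 2) for p in probs)

print(entropy(['female', 'male', 'female', 'male']))  # 1.0 for a 50/50 split
```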

 
