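- Pros: remains effective with small amounts of data and handles multi-class problems.
- Cons: sensitive to how the input data is prepared.
- Works with: nominal values.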
Conditional probability: P(A|B) = P(AB)/P(B)
Bayes' rule: p(c|x) = p(x|c)p(c) / p(x)
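As a quick numeric illustration of Bayes' rule, the sketch below computes the posterior for two classes. All of the probabilities are made-up values chosen for the example, and the helper name `posterior` is hypothetical.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: p(c|x) = p(x|c) * p(c) / p(x)."""
    return likelihood * prior / evidence

# Hypothetical values: two classes with equal priors
p_c1, p_c0 = 0.5, 0.5
p_x_given_c1 = 0.2
p_x_given_c0 = 0.05

# p(x) via the law of total probability over both classes
p_x = p_x_given_c1 * p_c1 + p_x_given_c0 * p_c0

p_c1_given_x = posterior(p_x_given_c1, p_c1, p_x)
p_c0_given_x = posterior(p_x_given_c0, p_c0, p_x)

print(p_c1_given_x, p_c0_given_x)  # roughly 0.8 and 0.2
```

Because p(x) is the same for every class, a classifier only needs to compare the numerators p(x|c)p(c).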
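- Collect: any method can be used.
- Prepare: numeric or Boolean values are needed.
- Analyze: with a large number of features, plotting the features is of little use; histograms work better.
- Train: calculate the conditional probabilities of the independent features.
- Test: calculate the error rate.
- Use: a common application of naive Bayes is document classification, but the classifier can be used in any classification setting, not only with text.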
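4.1 Prepare: making word vectors from text
Naive Bayes classifiers are usually implemented in one of two ways: with a Bernoulli model or with a multinomial model. The Bernoulli version is used here. It ignores how many times a word occurs in a document and considers only whether the word occurs at all, so in that sense every word is given equal weight.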

    # Create a toy data set
    def loadDataSet():
        # postingList: tokenized documents taken from a Dalmatian lovers' message board
        postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                       ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                       ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                       ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                       ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                       ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        # classVec: class labels assigned by hand; 1 = abusive, 0 = normal
        classVec = [0, 1, 0, 1, 0, 1]
        return postingList, classVec

    # Build a list of the unique words that appear across all documents
    def createVocabList(dataSet):
        vocabSet = set()  # start with an empty set
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union of the two sets
        return list(vocabSet)

    # Convert a document into a 0/1 vector marking which vocabulary words it contains
    def setOfWords2Vec(vocabList, inputSet):
        returnVec = [0] * len(vocabList)  # all-zero vector, same length as the vocabulary
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("the word: %s is not in my Vocabulary!" % word)
        return returnVec

    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    print(myVocabList)

['cute', 'quit', 'maybe', 'food', 'not', 'garbage', 'help', 'him', 'has', 'problems', 'I', 'posting', 'so', 'buying', 'park', 'dalmation', 'ate', 'mr', 'licks', 'take', 'please', 'dog', 'love', 'stop', 'how', 'steak', 'is', 'stupid', 'worthless', 'to', 'flea', 'my']
print(setOfWords2Vec(myVocabList, listOPosts[0]))
[0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
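4.2 Train: calculating probabilities from word vectors
p(ci|w) = p(w|ci)p(ci) / p(w)
p(ci) = number of documents in class i (abusive or not abusive) / total number of documents
Next comes p(w|ci), which is where the naive Bayes assumption is used. If w is expanded into its individual features, the probability can be written as p(w0, w1, ..., wN|ci). Assuming that all words are independent of one another given the class (the conditional independence assumption), this probability can be computed as p(w0|ci)p(w1|ci)...p(wN|ci).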
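The factorization under the conditional independence assumption can be sketched in a few lines. The per-word probabilities below are invented for illustration, and `joint_likelihood` is a hypothetical helper, not part of the code in this section.

```python
# Hypothetical per-word conditional probabilities p(w_j | c_i) for one class
p_word_given_class = {"stupid": 0.15, "garbage": 0.05, "dog": 0.10}

def joint_likelihood(words, p_word):
    """p(w0, ..., wN | ci) as a product of per-word terms,
    which is valid only under the conditional independence assumption."""
    prob = 1.0
    for w in words:
        prob *= p_word[w]
    return prob

likelihood = joint_likelihood(["stupid", "garbage"], p_word_given_class)
print(likelihood)  # about 0.0075 (0.15 * 0.05)
```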
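- Count the number of documents in each class
- For every training document:
  - For each class:
    - If a token appears in the document → increment the count for that token
    - Increment the total token count
- For each class:
  - For each token:
    - Divide the token's count by the total token count to get its conditional probability
- Return the conditional probabilities for each class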

    from numpy import *

    # trainMatrix: matrix of document vectors
    # trainCategory: vector of class labels, one per document
    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)   # number of documents
        numWords = len(trainMatrix[0])    # vocabulary size
        # Probability that a document is abusive (class = 1)
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        # Initialize the counts
        p0Num = zeros(numWords); p1Num = zeros(numWords)
        p0Denom = 0.0; p1Denom = 0.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                # Vector addition
                p1Num += trainMatrix[i]          # per-word counts over all abusive documents
                p1Denom += sum(trainMatrix[i])   # total word count of abusive documents
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # Element-wise division
        p1Vect = p1Num / p1Denom   # changed to log() in trainNB1
        p0Vect = p0Num / p0Denom
        return p0Vect, p1Vect, pAbusive

    trainMat = []  # matrix of document vectors
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    print(p0V)  # the array shown below is p0V; pAb itself is a single number


    [0.04166667 0.         0.         0.         0.         0.
     0.04166667 0.08333333 0.04166667 0.04166667 0.04166667 0.
     0.04166667 0.         0.         0.04166667 0.04166667 0.04166667
     0.04166667 0.         0.04166667 0.04166667 0.04166667 0.04166667
     0.04166667 0.04166667 0.04166667 0.         0.         0.04166667
     0.04166667 0.125     ]

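The values in this array can be checked by hand. The sketch below recounts the toy data set without NumPy; it assumes the same six posts and labels as loadDataSet(), and the helper name `p_word_given_normal` is hypothetical.

```python
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

# Set-of-words view of the normal (class 0) documents
normal_docs = [set(doc) for doc, label in zip(posts, labels) if label == 0]
total_tokens = sum(len(doc) for doc in normal_docs)  # 7 + 8 + 9 = 24

def p_word_given_normal(word):
    """Share of class-0 word slots occupied by `word`."""
    return sum(word in doc for doc in normal_docs) / total_tokens

p_my = p_word_given_normal('my')     # 3/24 = 0.125
p_him = p_word_given_normal('him')   # 2/24, about 0.0833
p_abusive = sum(labels) / len(labels)  # 3/6 = 0.5
print(total_tokens, p_my, p_him, p_abusive)
```

The entry 0.125 in the array corresponds to 'my', which appears in all three normal posts, and 0.08333333 corresponds to 'him', which appears in two of them.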
4.3 Test: modifying the classifier for real-world conditions

    def trainNB1(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)   # number of documents
        numWords = len(trainMatrix[0])    # vocabulary size
        # Probability that a document is abusive (class = 1)
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        # Start every count at 1 and every denominator at 2.0 so that a word
        # never seen in a class does not produce a zero probability
        p0Num = ones(numWords); p1Num = ones(numWords)
        p0Denom = 2.0; p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                # Vector addition
                p1Num += trainMatrix[i]          # per-word counts over all abusive documents
                p1Denom += sum(trainMatrix[i])   # total word count of abusive documents
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # Element-wise division; take logs to avoid underflow when many
        # small probabilities are multiplied
        p1Vect = log(p1Num / p1Denom)
        p0Vect = log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive

    # vec2Classify: the vector to classify
    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        # Element-wise multiplication; summing logs equals the log of the product
        p1 = sum(vec2Classify * p1Vec) + log(pClass1)
        p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0

    def testingNB():
        listOPosts, listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)
        trainMat = []
        for postingDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList, postingDoc))
        p0V, p1V, pAb = trainNB1(array(trainMat), array(listClasses))
        testEntry = ['love', 'my', 'dalmation']
        thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
        testEntry = ['stupid', 'garbage']
        thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

    testingNB()

- ['love', 'my', 'dalmation'] classified as: 0
- ['stupid', 'garbage'] classified as: 1
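trainNB1 differs from trainNB0 in two places: the counts start at 1 (with denominators at 2.0), and the probability vectors are log-transformed. The sketch below illustrates, with made-up numbers, the two problems these changes are meant to avoid; it is an illustration of the idea rather than part of the classifier.

```python
import math

# Problem 1: a word never seen in a class gives a zero factor,
# which wipes out the whole product of probabilities.
count, total = 0, 24                       # hypothetical counts for one word
unsmoothed = count / total                 # 0.0
smoothed = (count + 1) / (total + 2.0)     # same initialization as trainNB1

# Problem 2: multiplying many small probabilities underflows to 0.0.
probs = [0.01] * 500                       # hypothetical per-word probabilities
product = 1.0
for p in probs:
    product *= p                           # 1e-1000 is below the float range

# Summing logs keeps the value finite, and because log is monotonic
# the comparison between classes is unchanged.
log_sum = sum(math.log(p) for p in probs)

print(unsmoothed, smoothed)
print(product, log_sum)
```

This is why classifyNB adds log(pClass1) to a sum of log probabilities instead of multiplying raw probabilities.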
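4.4 Prepare: the bag-of-words document model
**Bag-of-words model:** If a word appears more than once in a document, that may carry information that mere presence or absence cannot express. In a bag of words each word can occur multiple times, whereas in a set of words each word occurs at most once.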

    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1  # count occurrences instead of setting to 1
        return returnVec

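The difference between the two representations shows up as soon as a word repeats. The sketch below uses a hypothetical three-word vocabulary and a short document to compare them; it mirrors the behaviour of setOfWords2Vec and bagOfWords2VecMN without calling them.

```python
vocab = ['dog', 'my', 'stupid']        # hypothetical vocabulary
doc = ['stupid', 'dog', 'stupid']      # 'stupid' occurs twice

# Set of words: 1 if the word is present, regardless of how often
set_vec = [1 if word in doc else 0 for word in vocab]

# Bag of words: number of occurrences of each vocabulary word
bag_vec = [doc.count(word) for word in vocab]

print(set_vec)  # [1, 0, 1]
print(bag_vec)  # [1, 0, 2]
```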
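- Collect: text files are provided.
- Prepare: parse the text files into token vectors.
- Analyze: inspect the tokens to make sure the parsing is correct.
- Train: use the trainNB1() function built earlier.
- Test: use classifyNB() and write a new test function that calculates the error rate over a set of documents.
- Use: build a complete program that classifies a group of documents and prints the misclassified ones to the screen.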
5.1 Prepare: tokenizing text

    import re

    mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
    # \W matches any non-word character, i.e. anything other than a letter
    # (including Chinese characters), a digit, or an underscore;
    # + matches the preceding pattern one or more times
    regEx = re.compile(r'\W+')
    listOfTokens = regEx.split(mySent)
    print(listOfTokens)

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

    # Drop empty strings by keeping only tokens whose length is greater than 0,
    # and convert every token to lowercase
    print([tok.lower() for tok in listOfTokens if len(tok) > 0])

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
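
    emailText = open('email/ham/6.txt', "r", encoding='utf-8', errors='ignore').read()
    listOfTokens = regEx.split(emailText)
    # 6.txt is very long; it is an email from a company announcing that it is
    # ending some of its support.
    # Words such as "en" and "py" show up because they are part of the URL
    # answer.py?hl=en&answer=174623.
    # Splitting a URL produces many such short tokens, so the implementation
    # filters out strings shorter than 3 characters.
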
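The effect of the length filter can be seen on the URL fragment mentioned above. This is a small standalone sketch that applies the same split and the same "longer than 2 characters" rule to that string alone.

```python
import re

url_text = 'answer.py?hl=en&answer=174623'

# Split on runs of non-word characters
tokens = re.split(r'\W+', url_text)
print(tokens)  # ['answer', 'py', 'hl', 'en', 'answer', '174623']

# Keep only tokens longer than 2 characters, lowercased
kept = [tok.lower() for tok in tokens if len(tok) > 2]
print(kept)    # ['answer', 'answer', '174623']
```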
5.2 Test: cross-validation with naive Bayes

    import re

    def textParse(bigString):
        # Split on non-word characters; keep lowercase tokens longer than 2 characters
        listOfTokens = re.split(r'\W+', bigString)
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]

    # Automated test of the naive Bayes spam classifier
    def spamTest():
        docList = []; classList = []; fullText = []
        for i in range(1, 26):
            # Load the text files under spam/ and ham/ and parse each into a word list
            wordList = textParse(open('email/spam/%d.txt' % i, "r", encoding='utf-8', errors='ignore').read())
            docList.append(wordList)    # append() adds the whole list as one element
            fullText.extend(wordList)   # extend() adds each element of the list in turn
            classList.append(1)
            wordList = textParse(open('email/ham/%d.txt' % i, "r", encoding='utf-8', errors='ignore').read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = createVocabList(docList)  # vocabulary
        # There are 50 emails in total; 10 are chosen at random as the test set.
        # The probabilities are computed from the training documents only.
        trainingSet = list(range(50)); testSet = []
        for i in range(10):
            # Move a randomly chosen index from the training set to the test set.
            # Randomly holding out part of the data for testing and training on
            # the rest is called hold-out cross validation.
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []; trainClasses = []
        # Build the training matrix and train the classifier
        for docIndex in trainingSet:
            trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))  # word vector
            trainClasses.append(classList[docIndex])                       # label
        p0V, p1V, pSpam = trainNB1(array(trainMat), array(trainClasses))
        # Classify the held-out documents and count the errors
        errorCount = 0
        for docIndex in testSet:
            wordVector = setOfWords2Vec(vocabList, docList[docIndex])
            if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print('classification error ', docList[docIndex])
        print('the error rate is: ', float(errorCount)/len(testSet))

    spamTest()

- classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'dont', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
- the error rate is: 0.1
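The hold-out split used in spamTest() can also be written with the standard library's random.sample. The sketch below is an alternative formulation, not the code used above; `holdout_split` and its `seed` parameter are hypothetical names introduced for the example.

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Pick n_test distinct document indices for testing; the rest are for training."""
    rng = random.Random(seed)                 # seeded generator for repeatable splits
    test_set = rng.sample(range(n_docs), n_test)
    test_lookup = set(test_set)
    training_set = [i for i in range(n_docs) if i not in test_lookup]
    return training_set, test_set

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
```

Because the test documents are chosen at random, the error rate varies from run to run; repeating the split several times and averaging the error rates gives a steadier estimate.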
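- Collect: gather content from RSS feeds; this requires building an interface to the feeds.
- Prepare: parse the text into token vectors.
- Analyze: inspect the tokens to make sure the parsing is correct.
- Train: use the trainNB1() function built earlier.
- Test: look at the error rate to make sure the classifier is usable. The tokenizer can be modified to lower the error rate and improve the results.
- Use: build a complete program that wraps everything together. Given two RSS feeds, it displays the most commonly used words.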
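6.1 Collect: importing RSS feeds
Universal Feed Parser is the most commonly used RSS library for Python. Its documentation can be browsed at http://code.google.com/p/feedparser/. First unpack the downloaded package and change to the directory that contains the unpacked files, then enter python setup.py install at the system command prompt (not at the Python prompt).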

    import feedparser

    ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    print(ny['entries'])
    print(len(ny['entries']))


    import operator
    import re

    def calcMostFreq(vocabList, fullText):
        # Count how often each vocabulary word occurs and return the 10 most frequent
        freqDict = {}
        for token in vocabList:
            freqDict[token] = fullText.count(token)
        sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
        return sortedFreq[:10]

    def stopWords():
        # Read the stop-word list from stopwords.txt and return it as lowercase tokens
        with open('stopwords.txt') as f:
            wordList = f.read()
        listOfTokens = re.split(r'\W+', wordList)
        listOfTokens = [tok.lower() for tok in listOfTokens]
        return listOfTokens


    def localWords(feed1, feed0):
        docList = []; classList = []; fullText = []
        minLen = min(len(feed1['entries']), len(feed0['entries']))
        for i in range(minLen):
            # Take one entry from each RSS feed per iteration
            wordList = textParse(feed1['entries'][i]['summary'])
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)   # feed1 is class 1
            wordList = textParse(feed0['entries'][i]['summary'])
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)   # feed0 is class 0
        vocabList = createVocabList(docList)
        # Remove the most frequently occurring words
        top10Words = calcMostFreq(vocabList, fullText)
        for pairW in top10Words:
            if pairW[0] in vocabList:
                vocabList.remove(pairW[0])
        # Remove stop words
        stopWordList = stopWords()
        for stopWord in stopWordList:
            if stopWord in vocabList:
                vocabList.remove(stopWord)
        # Hold out 10 randomly chosen documents as the test set
        trainingSet = list(range(2*minLen)); testSet = []
        for i in range(10):
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = trainNB1(array(trainMat), array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
            if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
        print('the error rate is: ', float(errorCount)/len(testSet))
        # Note the order: the class-1 (feed1) vector is returned before the class-0 one
        return vocabList, p1V, p0V

    ny = feedparser.parse('https://newyork.craigslist.org/search/res?format=rss')
    sf = feedparser.parse('https://sfbay.craigslist.org/search/apa?format=rss')
    print(len(ny['entries']))
    print(len(sf['entries']))
    vocabList, pNY, pSF = localWords(ny, sf)

- 25
- 25
- the error rate is: 0.1
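calcMostFreq() calls fullText.count(token) once per vocabulary word, which rescans the whole token list each time. A single pass with collections.Counter gives the same counts; the sketch below is one possible alternative, with `calc_most_freq` and its parameter `n` as hypothetical names. Ties may come out in a different order than with the sorted() call in calcMostFreq().

```python
from collections import Counter

def calc_most_freq(vocab_list, full_text, n=10):
    """Return the n most frequent vocabulary words as (word, count) pairs."""
    counts = Counter(full_text)            # one pass over all tokens
    vocab = set(vocab_list)                # fast membership test
    in_vocab = [(tok, c) for tok, c in counts.most_common() if tok in vocab]
    return in_vocab[:n]

# Small hypothetical example
sample_text = ['a', 'b', 'a', 'c', 'a', 'b']
top_two = calc_most_freq(['a', 'b', 'c'], sample_text, n=2)
print(top_two)  # [('a', 3), ('b', 2)]
```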
6.2 Analyze: displaying locally used words
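
    def getTopWords(ny, sf):
        # Caution: localWords returns (vocabList, p1V, p0V), so the unpacking below
        # swaps the two vectors: the name p0V ends up holding the class-1 (NY)
        # log probabilities and p1V the class-0 (SF) ones. The two banners printed
        # by this function are therefore reversed; unpack as
        #     vocabList, p1V, p0V = localWords(ny, sf)
        # to label the lists correctly.
        vocabList, p0V, p1V = localWords(ny, sf)
        topNY = []; topSF = []
        # Keep only words whose log conditional probability exceeds -6.0
        for i in range(len(p0V)):
            if p0V[i] > -6.0:
                topSF.append((vocabList[i], p0V[i]))
            if p1V[i] > -6.0:
                topNY.append((vocabList[i], p1V[i]))
        # Print each list from most to least probable
        sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
        print("SF**SF**SF**SF**SF**SF**SF**SF**SF**")
        for item in sortedSF:
            print(item[0])
        sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
        print("NY**NY**NY**NY**NY**NY**NY**NY**NY**")
        for item in sortedNY:
            print(item[0])

    getTopWords(ny, sf)
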
- the error rate is: 0.3
- SF**SF**SF**SF**SF**SF**SF**SF**SF**
- experience
- amp
- job
- services
- looking
- marketing
- provide
- experienced
- time
- please
- money
- online
- specialty
- com
- virtual
- reseller
- support
- work
- lot
- postings
- delivery
- personal
- interested
- budgets
- female
- just
- wolf
- position
- starting
- shopper
- well
- help
- copy
- picks
- packages
- one
- enable
- need
- tasty
- spending
- clients
- photos
- leads
- assistance
- part
- descriptions
- research
- without
- willing
- item
- big
- hour
- errands
- ride
- top
- people
- businesses
- expertise
- products
- years
- get
- contact
- email
- monthly
- anais
- clean
- skills
- pricing
- https
- sounds
- tests
- true
- good
- superv
- outpatient
- service
- white
- cacaoethescribendi
- prescription
- content
- home
- hope
- pdf
- jobs
- summer
- reclaim
- paralegal
- sell
- run
- calendar
- lawyer
- startups
- proud
- choir
- information
- technician
- year
- electronics
- straightforward
- film
- independently
- building
- director
- text
- experiences
- w0rd
- last
- licensing
- pos
- rate
- wordpress
- upside
- expand
- punctual
- degree
- black
- cashiers
- location
- woman
- island
- yes
- communications
- dishwasher
- world
- social
- f0rmat
- relisting
- clothing
- regarding
- business
- courier
- professionally
- take
- person
- http
- scimcoating
- journalist
- quickly
- isnt
- free
- wanted
- billy
- vocalist
- short
- cardiovascular
- producer
- taper
- legal
- predetermined
- century
- school
- walker
- startup
- commercial
- new
- assistant
- cashier
- offering
- shopping
- articles
- via
- hello
- layersofv
- affordable
- 2018
- pharmacy
- hey
- open
- option
- versatile
- offer
- listing
- 23yrs
- york
- fast
- even
- corporate
- 446k
- department
- sonyc
- excellent
- will
- plenty
- clinic
- studies
- hear
- thanks
- practice
- around
- level
- soon
- typing
- rican
- within
- etc
- stories
- effective
- based
- satisfy
- resume
- pay
- data
- messenger
- realize
- samples
- happy
- old
- arawumi
- proofreading
- present
- media
- marketer
- nonprofits
- coveted
- copywriting
- program
- lost
- bronx
- patient
- name
- pile
- typical
- limited
- client
- provided
- select
- painter
- note
- collection
- assist
- soundcloud
- partners
- motivated
- cpa
- full
- care
- reduced
- link
- hardworking
- minute
- 162k
- potential
- rent
- according
- budget
- thank
- associate
- list
- tech
- freelance
- updating
- therefore
- affordably
- really
- collaborations
- gift
- responsible
- collaboration
- handova
- pleas
- derek
- brooklyn
- extra
- hospital
- house
- request
- edge
- puerto
- odd
- interpersonal
- video
- budgeting
- wants
- secured
- retail
- writing
- boss
- upon
- queens
- music
- clerk
- per
- essays
- tobi
- 235q
- make
- canadian
- 21st
- writer
- february
- fandalism
- base
- seeking
- price
- death
- fertility
- qualit
- history
- huge
- runs
- multiple
- cleaning
- course
- long
- concept
- NY**NY**NY**NY**NY**NY**NY**NY**NY**
- close
- great
- location
- bathroom
- bath
- hardwood
- unit
- kitchen
- shopping
- coming
- space
- soon
- house
- entrances
- apartments
- home
- room
- 2019
- laundry
- floor
- located
- tops
- remodeled
- living
- valley
- beautiful
- one
- rent
- near
- floors
- freeway
- quiet
- centers
- amp
- rooms
- 1200
- colony
- silicon
- inside
- laminate
- hello
- duplex
- parking
- sunnyvale
- approx
- storage
- glen
- 30th
- heart
- berdroom
- conveniently
- restaurants
- separate
- tech
- firms
- countless
- cupertino
- includes
- april
- private
- perfect
- major
- manor
- campus
- updated
- enough
- three
- newpark
- patio
- spacious
- complex
- 2017
- block
- heating
- country
- nice
- granite
- site
- easy
- pay
- counter
- grand
- show
- eyrie
- currently
- appointment
- security
- drive
- measured
- water
- garbage
- 900
- 500
- throughout
- downstairs
- luxury
- vineyard
- two
- top
- pics
- high
- beautifully
- dryer
- central
- covered
- washer
- jun1
- north
- neighborhood
- gorgeous
- upgraded
- recycling
- supermarkets
- san
- freshly
- painted
- flooring
- lake
- inc
- victorian
- molding
- market
- noise
- wine
- district
- bri
- immed
- theater
- please
- newly
- attractive
- millbrae
- find
- acalanes
- oven
- bed
- napa
- end
- area
- deck
- lots
- deserve
- building
- jacuzzi
- included
- setting
- unfurnished
- irma
- tile
- 150
- restaur
- small
- enjoy
- dishwasher
- francisco
- ceilings
- professionally
- refrigerators
- murchison
- customized
- bart
- short
- hard
- best
- come
- commute
- school
- enclosed
- sides
- south
- ground
- built
- new
- far
- schedule
- back
- offering
- see
- law
- open
- sunlight
- tranquil
- farmers
- special
- broadway
- deposit
- towers
- lion
- 10am
- crown
- roommates
- photo
- shared
- antique
- bustling
- 94611
- quartz
- welcome
- executive
- 4pm
- leads
- street
- management
- glass
- plaza
- tub
- owned
- hea
- 1109
- halfway
- perfectly
- roger
- door
- coffee
- xpiedmont
- 495rent
- garden
- feel
- link
- viewing
- nearly
- features
- gibson
- jose
- closet
- center
- super
- furniture
- 02071565
- walk
- managed
- 94030
- showings
- sunroom
- court
- sinks
- minutes
- additional
- people
- windows
- fou
- request
- shops
- newer
- portfolio
- walnut
- ceiling
- condominiums
- oakland
- 101
- dining
- foyer
- mall
- 103
- dre
- cameras
- food
- entertaining
- beam
- bio
- individually
- make
- access
- chef
- email
- sliding
- apartment
- ave
- creek
- call
- facing
- clean
- huge
- sitting
- charming
- village
- modern
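The cutoff of -6.0 in getTopWords() is applied to log probabilities, which is why so many words pass it. The sketch below converts the threshold back to a plain probability and applies it to a few made-up (word, log probability) pairs; the sample values are hypothetical.

```python
import math

threshold = -6.0
prob_cutoff = math.exp(threshold)   # about 0.0025, i.e. roughly 1 in 400
print(prob_cutoff)

# Hypothetical (word, log probability) pairs
sample = [('kitchen', -4.2), ('rent', -5.1), ('zeppelin', -7.3)]
kept = [word for word, logp in sample if logp > threshold]
print(kept)  # ['kitchen', 'rent']
```

Raising the threshold (for example to -5.0) would shorten the printed lists by keeping only words with higher conditional probabilities.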