Naive Bayes classification is a family of classification algorithms built on a probabilistic model. The underlying theory is covered in detail in 《小瓜讲机器学习——分类算法(三)朴素贝叶斯法(naive Bayes)算法原理及Python代码实现》. Here we introduce the naive Bayes classifiers available in sklearn.
MultinomialNB handles discrete features, for example text classification with word-frequency features. The theory behind the multinomial model is detailed in the article cited above.
sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
Parameters:
1. alpha: smoothing parameter;
2. fit_prior: whether to learn the class prior probabilities P(y); if False, all classes get the same prior;
3. class_prior: user-specified class priors P(y); if not set, the priors are estimated from the samples by maximum likelihood.
Attributes:
1. class_log_prior_: log of the smoothed class prior probabilities;
2. intercept_: same as above;
3. feature_log_prob_: per-class feature probabilities (i.e. the conditional probabilities P(x_i|y)), as log values;
4. coef_: same as feature_log_prob_;
5. class_count_: number of training samples in each class;
6. feature_count_: occurrence count of each feature in each class.
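As a sketch of how feature_log_prob_ relates to feature_count_ and alpha (the variable names other than the sklearn attributes are illustrative), the smoothed per-class feature probabilities can be reproduced by hand from the counts:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(5, size=(6, 4))          # 6 samples, 4 count-valued features
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)

# Laplace smoothing: log P(x_i|y) = log((N_yi + alpha) / (N_y + alpha * n_features)),
# where N_yi is the count of feature i in class y and N_y the total count in class y.
smoothed = clf.feature_count_ + clf.alpha
manual = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
print(np.allclose(manual, clf.feature_log_prob_))  # True
```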
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# 6 samples with 4 count-valued features, one sample per class
x = np.random.randint(5, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB()
clf.fit(x, y)
print('------- feature vector -------')
print(x)
print('------- prior of each class ------')
print(clf.class_log_prior_)
print('------- probability of features in each class ------')
print(clf.feature_log_prob_)
print('------- predict ------')
print(clf.predict(x[2:3]))
The output is as follows:
------- feature vector -------
[[0 3 0 0]
 [4 1 4 4]
 [1 4 0 1]
 [2 3 0 3]
 [1 0 2 0]
 [0 0 4 1]]
------- prior of each class ------
[-1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947]
------- probability of features in each class ------
[[-1.94591015 -0.55961579 -1.94591015 -1.94591015]
 [-1.22377543 -2.14006616 -1.22377543 -1.22377543]
 [-1.60943791 -0.69314718 -2.30258509 -1.60943791]
 [-1.38629436 -1.09861229 -2.48490665 -1.09861229]
 [-1.25276297 -1.94591015 -0.84729786 -1.94591015]
 [-2.19722458 -2.19722458 -0.58778666 -1.5040774 ]]
------- predict ------
[3]
BernoulliNB handles only binary-valued discrete features. Assuming each feature follows a Bernoulli distribution, the conditional probability is determined by:

when feature x_i = 1, P(x_i|y) = P(x_i=1|y); when feature x_i = 0, P(x_i|y) = P(x_i=0|y).
sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
Parameters:
1. alpha (default=1.0): smoothing parameter;
2. binarize (default=0.0): threshold for binarizing the features (values above the threshold map to 1, the rest to 0); if None, the input is assumed to be binary already;
3. fit_prior (default=True): whether to learn the class prior probabilities P(y); if False, all classes get the same prior;
4. class_prior (default=None): user-specified class priors P(y); if not set, the priors are estimated from the samples by maximum likelihood.
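To illustrate the binarize parameter, here is a small sketch (the data and threshold are made up for illustration) showing that fitting with binarize=2.0 on count data is equivalent to thresholding the data by hand and fitting with binarize=None:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(5, size=(6, 4))   # count-valued features, not binary
y = np.array([1, 2, 3, 4, 5, 6])

# binarize=2.0: features > 2.0 become 1, the rest 0, before fitting
clf = BernoulliNB(binarize=2.0)
clf.fit(X, y)

# Equivalent: binarize by hand, then tell BernoulliNB the input is already binary
clf_manual = BernoulliNB(binarize=None)
clf_manual.fit((X > 2.0).astype(float), y)
print(np.allclose(clf.feature_log_prob_, clf_manual.feature_log_prob_))  # True
```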
Attributes:
1. class_log_prior_: log of the smoothed class prior probabilities;
2. feature_log_prob_: per-class feature probabilities (i.e. the conditional probabilities P(x_i|y)), as log values;
3. class_count_: number of training samples in each class;
4. feature_count_: occurrence count of each feature in each class.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# 6 samples with 4 binary features, one sample per class
x = np.random.randint(2, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])

clf = BernoulliNB()
clf.fit(x, y)

print('------- feature vector -------')
print(x)
print('------- prior of each class ------')
print(clf.class_log_prior_)
print('------- probability of features in each class ------')
print(clf.feature_log_prob_)
print('------- predict ------')
print(clf.predict(x[2:3]))
The output is as follows:
------- feature vector -------
[[0 1 1 0]
 [0 0 0 0]
 [0 1 0 0]
 [0 1 1 0]
 [0 1 1 1]
 [1 1 0 1]]
------- prior of each class ------
[-1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947]
------- probability of features in each class ------
[[-1.09861229 -0.40546511 -0.40546511 -1.09861229]
 [-1.09861229 -1.09861229 -1.09861229 -1.09861229]
 [-1.09861229 -0.40546511 -1.09861229 -1.09861229]
 [-1.09861229 -0.40546511 -0.40546511 -1.09861229]
 [-1.09861229 -0.40546511 -0.40546511 -0.40546511]
 [-0.40546511 -0.40546511 -1.09861229 -0.40546511]]
------- predict ------
[3]
GaussianNB is the Gaussian naive Bayes classifier. It assumes that the conditional probability P(x_i|y) follows a Gaussian distribution, i.e.

P(x_i|y)=\frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)

The parameters \sigma_y and \mu_y are estimated by maximum likelihood.
sklearn.naive_bayes.GaussianNB(priors=None, var_smoothing=1e-09)
Parameters:
1. priors: prior probability of each class; if not set, the priors are estimated from the samples by maximum likelihood;
2. var_smoothing: portion of the largest variance of all features that is added to the variances, for calculation stability.
Attributes:
1. class_prior_: probability of each class;
2. class_count_: number of samples in each class;
3. theta_: mean of each feature per class;
4. sigma_: variance of each feature per class (renamed var_ in newer scikit-learn versions);
5. epsilon_: absolute value added to the variances (var_smoothing times the largest per-feature variance).
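The relation between var_smoothing and epsilon_ can be checked directly (a minimal sketch on random data; to my understanding epsilon_ is var_smoothing times the largest per-feature variance of the training data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = rng.randint(2, size=50)

clf = GaussianNB(var_smoothing=1e-9)
clf.fit(X, y)

# epsilon_ = var_smoothing * max over features of Var(X[:, i])
print(np.isclose(clf.epsilon_, 1e-9 * X.var(axis=0).max()))  # True
```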
The training samples are as follows (feature 1, feature 2, label):
4.45925637575900 8.22541838354701 0
0.0432761720122110 6.30740040001402 0
6.99716180262699 9.31339338579386 0
4.75483224215432 9.26037784240288 0
8.66190392439652 9.76797698918454 0
...
4.21408348419092 2.97014277918461 1
5.52248511695330 3.63263027130760 1
4.15244831176753 1.44597290703838 1
9.55986996363196 1.13832040773527 1
1.63276516895206 0.446783742774178 1
9.38532498107474 0.913169554364942 1
The code is as follows:
import numpy as np
from sklearn.naive_bayes import GaussianNB

with open(r'H:\python dataanalysis\sklearn\naive_bayes_data.txt') as f:
    data = []
    label = []
    for loopi in f.readlines():
        line = loopi.strip().split('\t')
        data.append([float(line[0]), float(line[1])])
        label.append(float(line[2]))

feature = np.array(data)
label = np.array(label)

clf = GaussianNB()
clf.fit(feature, label)

print('----priors probablity of each class----')
print(clf.class_prior_)
print('----number of samples in each class----')
print(clf.class_count_)
print('-----mean of each feature per class------')
print(clf.theta_)
print('-----variance of each feature per class------')
print(clf.sigma_)

x = np.array([1.0, 1.0]).reshape((1, 2))
print('-----predict------')
print(clf.predict(x))
The output is as follows:
----priors probablity of each class----
[0.5 0.5]
----number of samples in each class----
[100. 100.]
-----mean of each feature per class------
[[4.11770459 7.57552293]
[5.76083782 2.3511532 ]]
-----variance of each feature per class------
[[6.60834055 4.73694733]
[6.18880712 3.96585982]]
-----predict------
[1.]