Date: 2022/5/11
For numerical attributes we again use the iris dataset, which needs no further introduction. For nominal attributes we use the weather.arff dataset, shown below. It has five attributes; the decision attribute is the last one, whether to go out and play. All of its attributes are nominal and the dataset is small, which makes it well suited for learning the basics.
```
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
```
Naive Bayes is a classification algorithm based on Bayes' theorem, which states:
$$P(H|X)=\frac{P(X|H)P(H)}{P(X)} \tag{1}$$
Here X is the set of measured values of n attributes, and H is some hypothesis. P(H|X) is the posterior probability of H conditioned on X; taking the weather dataset as an example, a posterior probability is the probability of playing on a sunny day, P(yes|sunny). P(H) is the prior probability, e.g., the probability of playing, P(yes). The prior takes fewer factors into account; in other words, the posterior is based on more information than the prior. The prior P(H) is independent of X.
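As a quick check on (1) against the weather data above: among the 14 records, 9 are yes, 5 have outlook=sunny, and 2 are both sunny and yes, so P(yes|sunny) = P(sunny|yes)P(yes)/P(sunny) = (2/9 · 9/14)/(5/14) = 2/5. A minimal sketch with the counts hard-coded from the dataset:

```java
public class BayesCheck {
    // Counts taken from the 14-row weather dataset above.
    static final double N = 14, YES = 9, SUNNY = 5, SUNNY_AND_YES = 2;

    // P(yes|sunny) via Bayes' theorem: P(sunny|yes) * P(yes) / P(sunny).
    public static double posteriorYesGivenSunny() {
        double pSunnyGivenYes = SUNNY_AND_YES / YES; // 2/9
        double pYes = YES / N;                       // 9/14
        double pSunny = SUNNY / N;                   // 5/14
        return pSunnyGivenYes * pYes / pSunny;
    }

    public static void main(String[] args) {
        // Direct counting agrees: 2 of the 5 sunny days are yes, i.e. 2/5.
        System.out.println(posteriorYesGivenSunny()); // ~0.4
    }
}
```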
So how does Naive Bayes use Bayes' theorem to predict a class?
Let $X=x_1\wedge x_2\wedge\cdots\wedge x_m$ denote a combination of conditions, e.g. outlook=sunny ∧ temperature=hot ∧ humidity=high ∧ windy=false, corresponding to one record. Let $D_i$ be the hypothesis about playing, where i takes two values, yes/no. Then (1) gives:
$$P(D_i|X)=\frac{P(X|D_i)P(D_i)}{P(X)} \tag{2}$$
From the observed data X we can compute the probability of each hypothesis; the one with the largest probability is the most likely to occur, and we call it the maximum a posteriori hypothesis. Since P(X) is a constant for all hypotheses, it suffices to maximize $P(X|D_i)P(D_i)$. If the prior $P(D_i)$ is unknown, all hypotheses are assumed equally probable, and we then simply maximize $P(X|D_i)$.
To reduce the cost of computing $P(X|D_i)$, Naive Bayes makes the naive assumption of class-conditional independence: all attributes are assumed to be mutually independent. Therefore:
$$P(X|D_i)=\prod_{j=1}^{m} P(x_j|D_i)=P(x_1|D_i)P(x_2|D_i)\cdots P(x_m|D_i) \tag{3}$$
Equation (2) then becomes:
$$P(D_i|X)=\frac{P(D_i)\prod_{j=1}^{m} P(x_j|D_i)}{P(X)} \tag{4}$$
Taking logarithms turns the product into a sum, giving the final prediction function:
$$D(X)=\arg\max_{1\le i\le k} P(D_i|X)=\arg\max_{1\le i\le k} P(D_i)\prod_{j=1}^{m} P(x_j|D_i)=\arg\max_{1\le i\le k}\left(\log P(D_i)+\sum_{j=1}^{m}\log P(x_j|D_i)\right) \tag{5}$$
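To make (5) concrete, consider classifying the instance X = (sunny, cool, high, TRUE) with the raw, unsmoothed probabilities counted from the weather dataset above (P(yes)=9/14, P(sunny|yes)=2/9, P(cool|yes)=3/9, and so on). A minimal sketch:

```java
public class NaivePredict {
    // The log-score log P(Di) + sum_j log P(xj|Di) from equation (5).
    public static double score(double prior, double... conds) {
        double s = Math.log(prior);
        for (double c : conds) {
            s += Math.log(c);
        }
        return s;
    }

    public static void main(String[] args) {
        // Instance X = (sunny, cool, high, TRUE); probabilities counted from the dataset.
        double yes = score(9.0 / 14, 2.0 / 9, 3.0 / 9, 3.0 / 9, 3.0 / 9);
        double no = score(5.0 / 14, 3.0 / 5, 1.0 / 5, 4.0 / 5, 3.0 / 5);
        // The larger log-score wins; for this instance the prediction is "no".
        System.out.println(yes > no ? "yes" : "no");
    }
}
```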
When evaluating (3), if any single attribute probability is 0, the whole product becomes 0 and the computation breaks down. To avoid this zero-probability situation, the French mathematician Laplace proposed a smoothing method, Laplace smoothing, given by:
$$P^{L}(x_j|D_i)=\frac{nP(x_jD_i)+1}{nP(D_i)+v_j} \tag{6}$$
where n is the number of objects and $v_j$ is the number of possible values of attribute j (so $nP(x_jD_i)$ and $nP(D_i)$ are the joint count and the class count). Adding 1 to the numerator eliminates the zero-probability case.
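A sketch of equation (6) on the weather data: for $P^L(\text{outlook}=\text{sunny}\,|\,\text{yes})$ there are 2 sunny-and-yes rows, 9 yes rows, and outlook has $v_j=3$ values, giving (2+1)/(9+3) = 0.25.

```java
public class LaplaceSmoothing {
    // Equation (6): (joint count + 1) / (class count + number of attribute values).
    public static double smoothed(int jointCount, int classCount, int numValues) {
        return (jointCount + 1.0) / (classCount + numValues);
    }

    public static void main(String[] args) {
        // P^L(outlook=sunny | yes): 2 sunny-yes rows, 9 yes rows, 3 outlook values.
        System.out.println(smoothed(2, 9, 3)); // 0.25
        // A value never seen with a class no longer gets probability 0.
        System.out.println(smoothed(0, 9, 3)); // 1/12 instead of 0
    }
}
```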
For numerical data we cannot use P(humidity=87): the probability that the humidity equals exactly 87 is too small, and the frequency of such a small-probability event over many random trials tends to 0. The probability P(80<humidity<=87), however, is not 0; this is how numerical attributes can be discretized. If we do not want to discretize, we can instead use a probability density function.
First, from the data and a distributional assumption, obtain the probability density p(humidity=87), then use p directly in place of P in (5). A common choice of distribution is the Gaussian, whose density is:
$$p(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{7}$$
Substituting into (5) gives:
$$D(X)=\arg\max_{1\le i\le k}\left(\log P(D_i)+\sum_{j=1}^{m}\left(-\log\sigma_{ij}-\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right)\right) \tag{8}$$
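The per-attribute term of (8) can be sketched as follows; the mu and sigma values in the usage example are made-up humidity numbers, not parameters fitted from any dataset:

```java
public class GaussianTerm {
    // The per-attribute term in equation (8): -log(sigma) - (x - mu)^2 / (2 sigma^2).
    // The constant -log(sqrt(2*pi)) from (7) is dropped: it is identical for every class,
    // so it cannot change the arg max.
    public static double logDensityTerm(double x, double mu, double sigma) {
        double d = x - mu;
        return -Math.log(sigma) - d * d / (2 * sigma * sigma);
    }

    public static void main(String[] args) {
        // The term peaks at x = mu and falls off symmetrically, so a value
        // close to the mean scores higher than one far from it.
        System.out.println(logDensityTerm(87, 80, 10));
        System.out.println(logDensityTerm(50, 80, 10));
    }
}
```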
Model construction
```java
/**
 * MyNaiveBayes.java
 *
 * @author zjy
 * @date 2022/5/10
 * @Description:
 * @version V1.0
 */
package swpu.zjy.ML.NB;

import weka.core.Instance;
import weka.core.Instances;

import java.io.FileReader;
import java.util.Arrays;

public class MyNaiveBayes {
    /**
     * Inner class storing the mean mu and standard deviation sigma of a
     * Gaussian distribution, used for Naive Bayes on numerical data.
     */
    private class GaussianParamters {
        double mu;
        double sigma;

        public GaussianParamters(double mu, double sigma) {
            this.mu = mu;
            this.sigma = sigma;
        }

        @Override
        public String toString() {
            return "GaussianParamters{" + "mu=" + mu + ", sigma=" + sigma + '}';
        }
    }

    // Nominal attributes.
    public static final int NOMINAL = 0;
    // Numerical attributes.
    public static final int NUMERICAL = 1;
    // Attribute type of the dataset.
    private int dataType = NOMINAL;
    // The dataset.
    Instances dataset;
    // Number of values of the decision attribute.
    int numClasses;
    // Number of instances in the dataset.
    int numInstances;
    // Number of condition attributes.
    int numConditions;
    // Predicted labels, length = numInstances.
    int predicts[];
    // Class distribution, length = numClasses.
    double[] classDistribution;
    // Laplacian-smoothed class distribution, length = numClasses.
    double[] classDistributionLaplacian;
    /**
     * Counts of condition attribute values, used to compute P(xi|Di).
     * First dimension: decision class, length = numClasses.
     * Second dimension: condition attribute, length = numConditions.
     * Third dimension: occurrence count of each value, length = the number
     * of possible values of that attribute.
     */
    double[][][] conditionalCounts;
    /**
     * Laplacian-smoothed conditional probabilities; same dimensions as above,
     * except the third dimension holds probabilities instead of counts.
     */
    double[][][] conditionalLaplacianOdds;
    /**
     * Gaussian parameters, used to classify numerical data.
     * First dimension: decision class, length = numClasses.
     * Second dimension: condition attribute, length = numConditions.
     */
    GaussianParamters[][] gaussianParameters;

    /**
     * Constructor. Reads the dataset from the given file path and
     * initializes the related parameters.
     *
     * @param dataSetFileName path of the dataset file
     */
    public MyNaiveBayes(String dataSetFileName) {
        try {
            FileReader fileReader = new FileReader(dataSetFileName);
            dataset = new Instances(fileReader);
            fileReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        dataset.setClassIndex(dataset.numAttributes() - 1);
        // Initialize dataset parameters.
        numClasses = dataset.numClasses();
        numConditions = dataset.numAttributes() - 1;
        numInstances = dataset.numInstances();
    }

    /**
     * Set the attribute type of the dataset.
     *
     * @param dataType the attribute type
     */
    public void setDataType(int dataType) {
        this.dataType = dataType;
    }

    /**
     * Compute the class distribution and its Laplacian-smoothed version.
     */
    public void calculateClassDistribution() {
        // Step 1. Initialize the arrays.
        classDistribution = new double[numClasses];
        classDistributionLaplacian = new double[numClasses];
        // Step 2. Count the occurrences of each class.
        double[] tempCnt = new double[numClasses];
        int tempType;
        for (int i = 0; i < numInstances; i++) {
            tempType = (int) dataset.instance(i).classValue();
            tempCnt[tempType]++;
        }
        // Step 3. Compute the class distribution, with and without smoothing.
        for (int i = 0; i < numClasses; i++) {
            classDistribution[i] = tempCnt[i] / numInstances;
            classDistributionLaplacian[i] = (tempCnt[i] + 1) / (numInstances + numClasses);
        }
        System.out.println("Class distribution: " + Arrays.toString(classDistribution));
        System.out.println("Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
    }

    /**
     * Compute the conditional probabilities P(xi|Di).
     */
    public void calculateConditionalOdds() {
        // Step 1. Initialize the arrays; the last dimension is allocated later.
        conditionalCounts = new double[numClasses][numConditions][];
        conditionalLaplacianOdds = new double[numClasses][numConditions][];
        // Step 2. Allocate the last dimension according to the number of
        // values of each condition attribute.
        int tempCnt;
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                tempCnt = dataset.attribute(j).numValues();
                conditionalCounts[i][j] = new double[tempCnt];
                conditionalLaplacianOdds[i][j] = new double[tempCnt];
            }
        }
        // Step 3. Count the occurrences of each attribute value per class.
        int[] tempClassCount = new int[numClasses];
        int tempClass, tempValue;
        for (int i = 0; i < numInstances; i++) {
            tempClass = (int) dataset.instance(i).classValue();
            tempClassCount[tempClass]++;
            for (int j = 0; j < numConditions; j++) {
                tempValue = (int) dataset.instance(i).value(j);
                conditionalCounts[tempClass][j][tempValue]++;
            }
        }
        // Step 4. Compute the Laplacian-smoothed conditional probabilities.
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                int tempNumvalue = dataset.attribute(j).numValues();
                for (int k = 0; k < tempNumvalue; k++) {
                    conditionalLaplacianOdds[i][j][k] =
                            (conditionalCounts[i][j][k] + 1) / (tempClassCount[i] + tempNumvalue);
                }
            }
        }
        System.out.println("Conditional probabilities: " + Arrays.deepToString(conditionalCounts));
    }

    /**
     * Compute the Gaussian parameters mu and sigma for numerical data.
     */
    public void calculateGausssianParameters() {
        gaussianParameters = new GaussianParamters[numClasses][numConditions];
        double[] tempValuesArray = new double[numInstances];
        int tempNumValues = 0;
        double tempSum = 0;
        for (int i = 0; i < numClasses; i++) {
            for (int j = 0; j < numConditions; j++) {
                // Sum the values of this attribute for class i and count them.
                tempSum = 0;
                tempNumValues = 0;
                for (int k = 0; k < numInstances; k++) {
                    if ((int) dataset.instance(k).classValue() != i) {
                        continue;
                    }
                    tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
                    tempSum += tempValuesArray[tempNumValues];
                    tempNumValues++;
                }
                // The mean.
                double tempMu = tempSum / tempNumValues;
                // The standard deviation.
                double tempSigma = 0;
                for (int k = 0; k < tempNumValues; k++) {
                    tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
                }
                tempSigma /= tempNumValues;
                tempSigma = Math.sqrt(tempSigma);
                gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
            }
        }
        System.out.println(Arrays.deepToString(gaussianParameters));
    }

    /**
     * Classify an instance with nominal attributes.
     *
     * @param instance the data tuple
     * @return the predicted label
     */
    public int classifyNominal(Instance instance) {
        // Track the maximal score.
        double tempMaxOdds = -Double.MAX_VALUE;
        // Track the label.
        int classIndex = 0;
        for (int i = 0; i < numClasses; i++) {
            // log Pl(Di)
            double tempClassfiyOdds = Math.log(classDistributionLaplacian[i]);
            for (int j = 0; j < numConditions; j++) {
                int tempConditionValue = (int) instance.value(j);
                // + sum of log Pl(xi|Di)
                tempClassfiyOdds += Math.log(conditionalLaplacianOdds[i][j][tempConditionValue]);
            }
            if (tempClassfiyOdds > tempMaxOdds) {
                tempMaxOdds = tempClassfiyOdds;
                classIndex = i;
            }
        }
        return classIndex;
    }

    /**
     * Classify an instance with numerical attributes. Same idea as the
     * nominal case, except the density p replaces the probability P.
     *
     * @param instance the data tuple
     * @return the predicted label
     */
    public int classifyNumerical(Instance instance) {
        // Find the biggest one.
        double tempBiggest = -10000;
        int resultBestIndex = 0;
        for (int i = 0; i < numClasses; i++) {
            double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
            double tempPseudoProbability = tempClassProbabilityLaplacian;
            // Accumulate the log-density score.
            for (int j = 0; j < numConditions; j++) {
                double tempAttributeValue = instance.value(j);
                double tempSigma = gaussianParameters[i][j].sigma;
                double tempMu = gaussianParameters[i][j].mu;
                tempPseudoProbability += -Math.log(tempSigma)
                        - (tempAttributeValue - tempMu) * (tempAttributeValue - tempMu)
                        / (2 * tempSigma * tempSigma);
            }
            if (tempBiggest < tempPseudoProbability) {
                tempBiggest = tempPseudoProbability;
                resultBestIndex = i;
            }
        }
        return resultBestIndex;
    }

    /**
     * Classify one instance, choosing the method by attribute type.
     *
     * @param paraInstance the data tuple to predict
     * @return the predicted label
     */
    public int classify(Instance paraInstance) {
        if (dataType == NOMINAL) {
            return classifyNominal(paraInstance);
        } else if (dataType == NUMERICAL) {
            return classifyNumerical(paraInstance);
        }
        return -1;
    }

    /**
     * Classify every instance in the dataset.
     */
    public void classify() {
        predicts = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            predicts[i] = classify(dataset.instance(i));
        }
    }

    /**
     * Compute the accuracy.
     *
     * @return the accuracy
     */
    public double computeAccuracy() {
        double tempCorrect = 0;
        for (int i = 0; i < numInstances; i++) {
            if (predicts[i] == (int) dataset.instance(i).classValue()) {
                tempCorrect++;
            }
        }
        double resultAccuracy = tempCorrect / numInstances;
        return resultAccuracy;
    }

    /**
     * Classification test on nominal data.
     */
    public static void testNominal() {
        System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
        String tempFilename = "E:\\DataSet\\weather.arff";
        MyNaiveBayes tempLearner = new MyNaiveBayes(tempFilename);
        tempLearner.setDataType(NOMINAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateConditionalOdds();
        tempLearner.classify();
        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }

    /**
     * Classification test on numerical data.
     */
    public static void testNumerical() {
        System.out.println("Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
        String tempFilename = "E:\\JAVA项目\\mytest\\src\\main\\java\\swpu\\zjy\\ML\\DataSet\\iris.arff";
        MyNaiveBayes tempLearner = new MyNaiveBayes(tempFilename);
        tempLearner.setDataType(NUMERICAL);
        tempLearner.calculateClassDistribution();
        tempLearner.calculateGausssianParameters();
        tempLearner.classify();
        System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
    }

    public static void main(String[] args) {
        // testNominal();
        testNumerical();
    }
}
```