
Naive Bayes: Iris Classification Project (no third-party libraries, worked out by hand)

I. Data Source

1. Source: Kaggle

2. Data format

The model is built from the sepal and petal measurements (sepal_length, sepal_width, petal_length, petal_width) and used to predict the iris species (species).

 

II. Method

Naive Bayes

Method description:

Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)

In practice, we need the probability that a sample belongs to each class given the observed data, so the formula becomes:

P(class|data) = P(data|class) · P(class) / P(data)

When building the model we only compare these probabilities across classes, and the denominator P(data) is the same for every class, so it can be dropped. With the naive conditional-independence assumption over the features x1, ..., xn, the formula becomes:

P(class|X) ∝ P(class) × P(x1|class) × P(x2|class) × ... × P(xn|class)
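The proportional form above can be checked with a tiny numeric example. The priors and per-feature likelihoods below are made-up numbers, not iris data: the class with the largest unnormalized product wins, and no division by P(data) is ever needed.

```python
# Toy example of the proportional form of Bayes' rule: two classes,
# each with a prior and two (made-up) per-feature likelihoods.
priors = {'setosa': 0.5, 'versicolor': 0.5}
likelihoods = {'setosa': [0.8, 0.6], 'versicolor': [0.2, 0.3]}  # P(x1|class), P(x2|class)

scores = {}
for c in priors:
    score = priors[c]               # start from the prior P(class)
    for p in likelihoods[c]:
        score *= p                  # multiply in each P(xi|class)
    scores[c] = score

# 'setosa' wins: 0.5*0.8*0.6 = 0.24 vs 0.5*0.2*0.3 = 0.03
best = max(scores, key=scores.get)
print(best)  # setosa
```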

 

III. Code Implementation

Starting from reading the data, with no third-party libraries: everything is worked out by hand.

1. Import the basic libraries

#1. Import the basic libraries
from csv import reader
from math import exp, pi, sqrt
from random import randrange, seed
import copy

2. Load the CSV file and convert the data types

#2. Load the CSV file and convert the data types
#Load the CSV file
def csv_loader(file):
    dataset = list()
    with open(file, 'r') as f:
        csv_reader = reader(f)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

#Convert the feature columns to float
def str_to_float_converter(dataset):
    dataset = dataset[1:]               # skip the header row
    for i in range(len(dataset[0])-1):
        for row in dataset:
            row[i] = float(row[i].strip())

#Convert the class column to integer labels
def str_to_int_converter(dataset):
    class_values = [row[-1] for row in dataset]
    unique_values = set(class_values)
    converter_dict = dict()
    for i, value in enumerate(unique_values):
        converter_dict[value] = i
    for row in dataset:
        row[-1] = converter_dict[row[-1]]
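As a quick sanity check, the label converter can be run on a tiny hand-made dataset (the rows below are made up). Note that because the mapping is built from a set, the concrete integer assigned to each species can differ between runs; only the grouping is stable.

```python
#Quick check of str_to_int_converter on a made-up three-row dataset
def str_to_int_converter(dataset):
    class_values = [row[-1] for row in dataset]
    unique_values = set(class_values)
    converter_dict = dict()
    for i, value in enumerate(unique_values):
        converter_dict[value] = i
    for row in dataset:
        row[-1] = converter_dict[row[-1]]

data = [[5.1, 3.5, 'Iris-setosa'],
        [7.0, 3.2, 'Iris-versicolor'],
        [4.9, 3.0, 'Iris-setosa']]
str_to_int_converter(data)
print(data)  # rows 0 and 2 share one integer label, row 1 gets the other
```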

3. Split the data for k-fold cross-validation

#3. Split the data for k-fold cross-validation
def k_fold_cross_validation(dataset, n_folds):
    dataset_splitted = list()
    fold_size = int(len(dataset)/n_folds)
    dataset_copy = list(dataset)
    for i in range(n_folds):
        fold_data = list()
        while len(fold_data) < fold_size:
            index = randrange(len(dataset_copy))
            fold_data.append(dataset_copy.pop(index))
        dataset_splitted.append(fold_data)
    return dataset_splitted
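A quick check that the splitter produces equally sized folds and never reuses a row (the nine single-element rows below are made up):

```python
from random import randrange, seed

def k_fold_cross_validation(dataset, n_folds):
    dataset_splitted = list()
    fold_size = int(len(dataset)/n_folds)
    dataset_copy = list(dataset)
    for i in range(n_folds):
        fold_data = list()
        while len(fold_data) < fold_size:
            index = randrange(len(dataset_copy))     # sample without replacement
            fold_data.append(dataset_copy.pop(index))
        dataset_splitted.append(fold_data)
    return dataset_splitted

seed(1)
folds = k_fold_cross_validation([[i] for i in range(9)], 3)
print([len(f) for f in folds])  # [3, 3, 3]
```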

4. Compute accuracy

#4. Compute accuracy
def calculate_accuracy(actual, predicted):
    correct_num = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct_num += 1
    accuracy = correct_num/float(len(actual)) * 100.0
    return accuracy
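A minimal check of the scorer: with three of four labels matched it should return 75.0 (the label lists are made up):

```python
#calculate_accuracy returns a percentage, not a fraction
def calculate_accuracy(actual, predicted):
    correct_num = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct_num += 1
    return correct_num/float(len(actual)) * 100.0

print(calculate_accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 75.0
```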

5. Model testing

#5. Model testing
def mode_test(dataset, algo, n_folds, *args):
    dataset_splitted = k_fold_cross_validation(dataset, n_folds)
    scores = list()
    for fold in dataset_splitted:
        train = copy.deepcopy(dataset_splitted)
        train.remove(fold)
        train = sum(train, [])          # flatten the remaining folds into one training set
        test = copy.deepcopy(fold)
        predicted = algo(train, test, *args)
        actual = [row[-1] for row in fold]
        accuracy = calculate_accuracy(actual, predicted)
        scores.append(accuracy)
    return scores

6. Group the data by class and describe it

First, build a dictionary with the class label as the key and the matching rows as the values;

then compute the mean, standard deviation, and count of each column, and describe each class as {class: [(mean, std, len)]}.

#6. Group the data by class and describe it
#Group the rows by class label
def split_class(dataset):
    splitted = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if class_value not in splitted:
            splitted[class_value] = list()
        splitted[class_value].append(vector)
    return splitted

#Mean of one column (x_i)
def calculate_mean(column):
    mean = sum(column)/len(column)
    return mean

#Sample standard deviation of one column (x_i)
def calculate_std(column):
    mean = calculate_mean(column)
    var = sum([(x - mean)**2 for x in column])/float(len(column)-1)
    std = sqrt(var)
    return std

#Describe the data as [(mean, std, len)]
def describe_data(dataset):
    description = [(calculate_mean(column), calculate_std(column),
                    len(column)) for column in zip(*dataset)]
    del description[-1]                 # drop the stats of the class column
    return description

#Describe the data per class: {class: [(mean, std, len)]}
def describe_class(dataset):
    splitted = split_class(dataset)
    descriptions = dict()
    for class_value, rows in splitted.items():
        descriptions[class_value] = describe_data(rows)
    return descriptions
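The mean and standard deviation helpers can be verified by hand. With the n-1 denominator, the made-up column [2.0, 4.0, 6.0] has mean 4.0 and sample standard deviation sqrt((4+0+4)/2) = 2.0:

```python
from math import sqrt

def calculate_mean(column):
    return sum(column)/len(column)

def calculate_std(column):
    mean = calculate_mean(column)
    # sample variance: divide by n-1, not n
    var = sum([(x - mean)**2 for x in column])/float(len(column)-1)
    return sqrt(var)

print(calculate_mean([2.0, 4.0, 6.0]))  # 4.0
print(calculate_std([2.0, 4.0, 6.0]))   # 2.0
```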

7. Set up the basic probability model

Probability density of the normal distribution:

If a random variable X follows N(μ, σ²), then

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

#7. Basic model for computing probabilities (Gaussian density)
def calculate_probability(x, mean, std):
    exponent = exp(-((x - mean)**2)/(2*(std**2)))
    probability = (1/(sqrt(2*pi)*std)) * exponent
    return probability
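One point is easy to verify by hand: at x = mean the density of N(0, 1) is 1/√(2π) ≈ 0.3989, which gives a quick sanity check of the function:

```python
from math import exp, pi, sqrt

def calculate_probability(x, mean, std):
    exponent = exp(-((x - mean)**2)/(2*(std**2)))
    return (1/(sqrt(2*pi)*std)) * exponent

# at the mean of a standard normal the exponent is 1,
# so the density reduces to 1/sqrt(2*pi)
print(calculate_probability(0.0, 0.0, 1.0))  # ≈ 0.3989
```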

8. Compute the class probabilities for each row

#8. Compute the class probabilities for one row
def calculate_class_probabilities(dataset, row):
    descriptions = describe_class(dataset)
    total = sum([descriptions[label][0][-1] for label in descriptions])
    probabilities = dict()
    for class_key, class_value in descriptions.items():
        probabilities[class_key] = class_value[0][-1]/float(total)   # prior P(class)
        for i in range(len(class_value)):
            mean, std, count = class_value[i]
            probabilities[class_key] *= calculate_probability(row[i], mean, std)
    return probabilities

9. Find the best label for each row

#9. Find the most probable label for one row
def predict(dataset, row):
    probabilities = calculate_class_probabilities(dataset, row)
    best_label, best_probability = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_probability:
            best_probability = probability
            best_label = class_value
    return best_label

10. Predict the class of the test data

#10. Predict the class of the test data
def naive_bayes(train, test):
    predictions = list()
    for row in test:
        prediction = predict(train, row)
        predictions.append(prediction)
    return predictions

11. Run and tune the parameters

#11. Run and tune the parameters
seed(5)
file = './download_datas/IRIS.csv'
dataset = csv_loader(file)
str_to_float_converter(dataset)
dataset = dataset[1:]                   # drop the header row
str_to_int_converter(dataset)
n_folds = 3
algo = naive_bayes
scores = mode_test(dataset, algo, n_folds)
print('The scores of our model are : %s' % scores)
print('The average score of our model is : %.3f%%' % (sum(scores)/float(len(scores))))

Output:

#Output
The scores of our model are : [94.0, 98.0, 96.0]
The average score of our model is : 96.000%

IV. Complete Code

#1. Import the basic libraries
from csv import reader
from math import exp, pi, sqrt
from random import randrange, seed
import copy

#2. Load the CSV file and convert the data types
#Load the CSV file
def csv_loader(file):
    dataset = list()
    with open(file, 'r') as f:
        csv_reader = reader(f)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

#Convert the feature columns to float
def str_to_float_converter(dataset):
    dataset = dataset[1:]               # skip the header row
    for i in range(len(dataset[0])-1):
        for row in dataset:
            row[i] = float(row[i].strip())

#Convert the class column to integer labels
def str_to_int_converter(dataset):
    class_values = [row[-1] for row in dataset]
    unique_values = set(class_values)
    converter_dict = dict()
    for i, value in enumerate(unique_values):
        converter_dict[value] = i
    for row in dataset:
        row[-1] = converter_dict[row[-1]]

#3. Split the data for k-fold cross-validation
def k_fold_cross_validation(dataset, n_folds):
    dataset_splitted = list()
    fold_size = int(len(dataset)/n_folds)
    dataset_copy = list(dataset)
    for i in range(n_folds):
        fold_data = list()
        while len(fold_data) < fold_size:
            index = randrange(len(dataset_copy))
            fold_data.append(dataset_copy.pop(index))
        dataset_splitted.append(fold_data)
    return dataset_splitted

#4. Compute accuracy
def calculate_accuracy(actual, predicted):
    correct_num = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct_num += 1
    accuracy = correct_num/float(len(actual)) * 100.0
    return accuracy

#5. Model testing
def mode_test(dataset, algo, n_folds, *args):
    dataset_splitted = k_fold_cross_validation(dataset, n_folds)
    scores = list()
    for fold in dataset_splitted:
        train = copy.deepcopy(dataset_splitted)
        train.remove(fold)
        train = sum(train, [])          # flatten the remaining folds into one training set
        test = copy.deepcopy(fold)
        predicted = algo(train, test, *args)
        actual = [row[-1] for row in fold]
        accuracy = calculate_accuracy(actual, predicted)
        scores.append(accuracy)
    return scores

#6. Group the data by class and describe it
#Group the rows by class label
def split_class(dataset):
    splitted = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if class_value not in splitted:
            splitted[class_value] = list()
        splitted[class_value].append(vector)
    return splitted

#Mean of one column (x_i)
def calculate_mean(column):
    mean = sum(column)/len(column)
    return mean

#Sample standard deviation of one column (x_i)
def calculate_std(column):
    mean = calculate_mean(column)
    var = sum([(x - mean)**2 for x in column])/float(len(column)-1)
    std = sqrt(var)
    return std

#Describe the data as [(mean, std, len)]
def describe_data(dataset):
    description = [(calculate_mean(column), calculate_std(column),
                    len(column)) for column in zip(*dataset)]
    del description[-1]                 # drop the stats of the class column
    return description

#Describe the data per class: {class: [(mean, std, len)]}
def describe_class(dataset):
    splitted = split_class(dataset)
    descriptions = dict()
    for class_value, rows in splitted.items():
        descriptions[class_value] = describe_data(rows)
    return descriptions

#7. Basic model for computing probabilities (Gaussian density)
def calculate_probability(x, mean, std):
    exponent = exp(-((x - mean)**2)/(2*(std**2)))
    probability = (1/(sqrt(2*pi)*std)) * exponent
    return probability

#8. Compute the class probabilities for one row
def calculate_class_probabilities(dataset, row):
    descriptions = describe_class(dataset)
    total = sum([descriptions[label][0][-1] for label in descriptions])
    probabilities = dict()
    for class_key, class_value in descriptions.items():
        probabilities[class_key] = class_value[0][-1]/float(total)   # prior P(class)
        for i in range(len(class_value)):
            mean, std, count = class_value[i]
            probabilities[class_key] *= calculate_probability(row[i], mean, std)
    return probabilities

#9. Find the most probable label for one row
def predict(dataset, row):
    probabilities = calculate_class_probabilities(dataset, row)
    best_label, best_probability = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_probability:
            best_probability = probability
            best_label = class_value
    return best_label

#10. Predict the class of the test data
def naive_bayes(train, test):
    predictions = list()
    for row in test:
        prediction = predict(train, row)
        predictions.append(prediction)
    return predictions

#11. Run and tune the parameters
seed(5)
file = './download_datas/IRIS.csv'
dataset = csv_loader(file)
str_to_float_converter(dataset)
dataset = dataset[1:]                   # drop the header row
str_to_int_converter(dataset)
n_folds = 3
algo = naive_bayes
scores = mode_test(dataset, algo, n_folds)
print('The scores of our model are : %s' % scores)
print('The average score of our model is : %.3f%%' % (sum(scores)/float(len(scores))))
