Example: to decide whether a melon is good or bad, the discriminative approach learns a model from historical data, then extracts the melon's features and directly predicts the probability that it is a good melon and the probability that it is a bad one.

Example: the generative approach first learns a model of good melons from the features of good melons, then learns a separate model of bad melons from the features of bad melons. To classify a new melon, extract its features, evaluate the probability under the fitted good-melon model and under the fitted bad-melon model, and predict whichever class gives the higher probability.

Example:

Suppose your task is to identify which language a piece of speech belongs to. Someone walks up and says a sentence, and you need to decide whether it is Chinese, English, French, and so on. There are two ways to achieve this:

1. Learn every language. You invest a great deal of effort and actually learn Chinese, English, French, etc. By "learn" I mean you know what speech corresponds to what language. Then when someone speaks to you, you know which language it is.

2. Do not learn any of the languages; learn only the differences between them, and then classify. That is, I learn that Chinese, English, and so on sound different, and knowing the differences is enough.

The first method is the generative approach; the second is the discriminative approach.

A generative model is a full joint probability model over all variables, whereas a discriminative model is a conditional probability model of the target variable given the observed variables. A generative model can therefore be used to simulate (i.e., generate) values of any variable in the model, while a discriminative model can only sample the target variable conditioned on the observations. A discriminative model does not model the distribution of the observed variables, so it cannot express more complex relationships between the observed and target variables. For this reason, generative models are better suited to unsupervised tasks such as clustering.
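As a concrete sketch of this contrast (my addition, assuming scikit-learn is available; the toy data is made up purely for illustration), GaussianNB is a generative classifier that models p(x|y) and p(y), while LogisticRegression is discriminative and models p(y|x) directly:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB           # generative: models p(x|y) and p(y)
from sklearn.linear_model import LogisticRegression  # discriminative: models p(y|x) directly

# Hypothetical toy data for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

gen = GaussianNB().fit(X, y)
disc = LogisticRegression().fit(X, y)

# Both can produce p(y|x) for classification...
print(gen.predict_proba(X[:1]))
print(disc.predict_proba(X[:1]))

# ...but only the generative model can also generate a feature vector for a class,
# by sampling from its fitted per-class Gaussians
# (the variance attribute is var_ in scikit-learn >= 1.0, sigma_ in older versions)
cls = 0
sample = np.random.normal(gen.theta_[cls], np.sqrt(gen.var_[cls]))
print('generated feature vector for class 0:', sample)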
Conditional probability: the probability that event A occurs given that event B has occurred. It is written P(A|B), read "the probability of A given that B has occurred".
Bayes' theorem:

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$

P(X) is the probability that event X occurs, also called the prior probability;
P(Y|X) is the probability that event Y occurs given that X has occurred, also called the likelihood;
P(X|Y) is the probability that event X occurs given that Y has occurred, also called the posterior probability.
Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probabilistic model: it picks the parameter values that maximize the probability of the observed data.
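As a quick textbook illustration of MLE (my addition, not from the original post): for $n$ independent coin flips with $k$ heads, the Bernoulli likelihood and its maximizer are

$$L(p) = p^{k}(1-p)^{n-k}, \qquad \frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Longrightarrow\; \hat{p}_{\mathrm{MLE}} = \frac{k}{n}.$$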
Conditional probability: here, the probability that a melon is a good melon given that its color is green.

Prior probability: the probability of the "cause" as revealed by common sense, experience, or statistics, i.e. the probability that a melon's color is green.

Posterior probability: the probability of the "cause" inferred after the "effect" is known. That is, given that we already know a melon is good, what is the probability that its color is green? Relating the posterior to the prior is exactly what Bayesian decision theory solves.
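A small numeric example (the numbers are made up for illustration): suppose 60% of melons are good, 50% of good melons are green, and 40% of all melons are green. Then by Bayes' theorem the probability that a green melon is good is

$$P(\text{good} \mid \text{green}) = \frac{P(\text{green} \mid \text{good})\,P(\text{good})}{P(\text{green})} = \frac{0.5 \times 0.6}{0.4} = 0.75.$$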
Under the conditional independence assumption, the posterior probability over multiple attributes can be written as

$$P(c \mid \boldsymbol{x}) = \frac{P(c)\,P(\boldsymbol{x} \mid c)}{P(\boldsymbol{x})} = \frac{P(c)}{P(\boldsymbol{x})} \prod_{i=1}^{d} P(x_i \mid c)$$

where $d$ is the number of attributes and $x_i$ is the value of $\boldsymbol{x}$ on the $i$-th attribute.

Since $P(\boldsymbol{x})$ is the same for every class, the Bayes decision rule gives the naive Bayes classifier:

$$h_{nb}(\boldsymbol{x}) = \arg\max_{c \in \mathcal{Y}} P(c) \prod_{i=1}^{d} P(x_i \mid c)$$
A naive Bayes implementation:
# coding:utf-8
# P(y|x) = [P(x|y) * P(y)] / P(x)

import numpy as np
import pandas as pd


class Naive_Bayes:
    def __init__(self):
        pass

    # Naive Bayes training
    def nb_fit(self, X, y):
        classes = y[y.columns[0]].unique()
        class_count = y[y.columns[0]].value_counts()
        # Class prior probabilities P(y)
        class_prior = class_count / len(y)
        print('==class_prior:', class_prior)
        # Class-conditional probabilities, i.e. P(xi=? | y=?)
        prior = dict()
        for col in X.columns:
            for j in classes:
                p_x_y = X[(y == j).values][col].value_counts()
                for i in p_x_y.index:
                    prior[(col, i, j)] = p_x_y[i] / class_count[j]
        print('==prior:', prior)
        # Store the fitted quantities on the instance so predict() can use them
        self.classes = classes
        self.class_prior = class_prior
        self.prior = prior
        return classes, class_prior, prior

    # Predict a new instance: argmax over c of P(y=c) * prod_i P(xi=? | y=c)
    def predict(self, X_test):
        res = []
        for c in self.classes:
            p_y = self.class_prior[c]
            p_x_y = 1
            for i in X_test.items():
                p_x_y *= self.prior[tuple(list(i) + [c])]
            res.append(p_y * p_x_y)
        return self.classes[np.argmax(res)]


if __name__ == "__main__":
    x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
    x2 = ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L']
    y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
    df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
    print('==df:\n', df)
    X = df[['x1', 'x2']]
    y = df[['y']]
    X_test = {'x1': 2, 'x2': 'S'}

    nb = Naive_Bayes()
    classes, class_prior, prior = nb.nb_fit(X, y)
    print('Predicted class for the test instance:', nb.predict(X_test))
Naive Bayes classifier code:

The naive Bayes classifier adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be mutually independent. In other words, each attribute is assumed to influence the classification result independently of the others.

Using GaussianNB (Gaussian naive Bayes), the class-conditional probability density is

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{y})^{2}}{2\sigma_{y}^{2}}\right)$$
import math


class NaiveBayes:
    def __init__(self):
        self.model = None

    # Mean
    @staticmethod
    def mean(X):
        """Compute the mean.
        Param: X : list or np.ndarray
        Return:
            avg : float
        """
        avg = sum(X) / float(len(X))
        return avg

    # Standard deviation
    def stdev(self, X):
        """Compute the standard deviation.
        Param: X : list or np.ndarray
        Return:
            res : float
        """
        avg = self.mean(X)
        res = math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))
        return res

    # Probability density function
    def gaussian_probability(self, x, mean, stdev):
        """Density of x under a Gaussian with the given mean and standard deviation.
        Parameters:
        ----------
        x : input value
        mean : mean
        stdev : standard deviation
        Return:
            res : float, density of x under the Gaussian
        """
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        res = (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
        return res

    # Process X_train
    def summarize(self, train_data):
        """Compute the mean and stdev of each feature for one class.
        Param: train_data : list
        Return : [(mean, stdev), ...], one tuple per feature
        """
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # Compute the mean and standard deviation per class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {
            label: self.summarize(value) for label, value in data.items()
        }
        print(self.model)  # mean and stdev of every feature for each class
        return 'gaussianNB train done!'

    # Compute probabilities
    def calculate_probabilities(self, input_data):
        """Probability of the data under each class's Gaussians.
        Parameter:
            input_data : input sample
        Return:
            probabilities : {label : p}
        """
        # model, e.g.: {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
        # input_data, e.g.: [1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # Predicted class
    def predict(self, X_test):
        # e.g. {0.0: 2.97e-27, 1.0: 3.57e-26} -> pick the label with the largest probability
        label = sorted(self.calculate_probabilities(X_test).items(), key=lambda x: x[-1])[-1][0]
        return label

    # Accuracy
    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1
        return right / float(len(X_test))


def test_bayes_model():
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
    print(len(X_train))
    print(len(y_train))
    model = NaiveBayes()
    model.fit(X_train, y_train)
    print(model.predict([4.4, 3.2, 1.3, 0.2]))


if __name__ == '__main__':
    test_bayes_model()
A Bayesian network example based on pgmpy:

pgmpy is a Python package for probabilistic graphical models. It implements common models such as Bayesian networks and Markov chain Monte Carlo, together with the corresponding inference methods.

The example below is the classic student network for the quality of the recommendation letter a student receives. (The original post showed the directed graph and its conditional probability tables as figures.)

Code:
# coding:utf-8
# git clone https://github.com/pgmpy/pgmpy
# cd pgmpy
# python setup.py install


from pgmpy.factors.discrete import TabularCPD
from pgmpy.models import BayesianModel

student_model = BayesianModel([('D', 'G'),
                               ('I', 'G'),
                               ('G', 'L'),
                               ('I', 'S')])
# Grade node
grade_cpd = TabularCPD(
    variable='G',                     # node name
    variable_card=3,                  # number of values the node can take
    values=[[0.3, 0.05, 0.9, 0.5],    # the node's probability table
            [0.4, 0.25, 0.08, 0.3],
            [0.3, 0.7, 0.02, 0.2]],
    evidence=['I', 'D'],              # the nodes this node depends on
    evidence_card=[2, 2]              # number of values each parent can take
)
# Exam difficulty node
difficulty_cpd = TabularCPD(
    variable='D',
    variable_card=2,
    values=[[0.6, 0.4]]
)
# Intelligence node
intel_cpd = TabularCPD(
    variable='I',
    variable_card=2,
    values=[[0.7, 0.3]]
)
# Recommendation letter node
letter_cpd = TabularCPD(
    variable='L',
    variable_card=2,
    values=[[0.1, 0.4, 0.99],
            [0.9, 0.6, 0.01]],
    evidence=['G'],
    evidence_card=[3]
)
# SAT score node
sat_cpd = TabularCPD(
    variable='S',
    variable_card=2,
    values=[[0.95, 0.2],
            [0.05, 0.8]],
    evidence=['I'],
    evidence_card=[2]
)

student_model.add_cpds(
    grade_cpd,
    difficulty_cpd,
    intel_cpd,
    letter_cpd,
    sat_cpd
)
print(student_model.get_cpds())


print('Active trail from node D:', student_model.active_trail_nodes('D'))
print('Active trail from node I:', student_model.active_trail_nodes('I'))

print(student_model.local_independencies('G'))

# print(student_model.get_independencies())

# print(student_model.to_markov_model())

# Bayesian inference
from pgmpy.inference import VariableElimination
student_infer = VariableElimination(student_model)
prob_G = student_infer.query(variables=['G'])

print('Marginal distribution over all grades, prob_G:', prob_G)

prob_G = student_infer.query(
    variables=['G'],
    evidence={'I': 1, 'D': 0})
print('Grade distribution for a smart student, prob_G:', prob_G)

# prob_G = student_infer.query(
#     variables=['G'],
#     evidence={'I': 0, 'D': 1})
# print(prob_G)


# # Generate data
# import numpy as np
# import pandas as pd
#
# raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
# data = pd.DataFrame(raw_data, columns=['D', 'I', 'G', 'L', 'S'])
# data.head()
#
#
# # Define the model
# from pgmpy.models import BayesianModel
# from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
#
# model = BayesianModel([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])
#
# # Fit the model with maximum likelihood estimation
# model.fit(data, estimator=MaximumLikelihoodEstimator)
# for cpd in model.get_cpds():
#     # Print the conditional probability distribution
#     print("CPD of {variable}:".format(variable=cpd.variable))
#     print(cpd)
Detailed post on kNN: https://blog.csdn.net/fanzonghao/article/details/86411102
Detailed post on decision trees: https://blog.csdn.net/fanzonghao/article/details/85246720
1. SVM: finding the optimal margin

The optimal solution under equality constraints (via Lagrange multipliers).

The optimal solution under inequality constraints: via the KKT conditions.

This finally yields the classifier (see the formulas below).

In other words, the larger the penalty parameter C on the slack variables, the higher the variance and the lower the bias of the resulting model, i.e. the stronger the tendency to overfit;
the smaller C is, the lower the variance and the higher the bias, i.e. the stronger the tendency to underfit.
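For reference, these are the standard soft-margin formulas (reconstructed in LaTeX because the original figures are missing; this is the textbook form, not anything specific to this post). The primal problem is

$$\min_{w,b,\xi}\;\frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1-\xi_i,\;\; \xi_i \ge 0,\; i=1,\dots,N,$$

and the resulting classifier is

$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{N}\alpha_i^{*}\,y_i\,K(x_i, x) + b^{*}\right).$$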
An SVM example using the SMO algorithm:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt


def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = [
        'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
    ]
    data = np.array(df.iloc[:100, [0, 1, -1]])
    for i in range(len(data)):
        if data[i, -1] == 0:
            data[i, -1] = -1
    return data[:, :2], data[:, -1]


X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print('==X_train.shape:', X_train.shape)
print('==y_train.shape:', y_train.shape)
plt.scatter(X[:50, 0], X[:50, 1], label='0', color='r')
plt.scatter(X[50:, 0], X[50:, 1], label='1', color='g')
plt.legend()
# plt.show()


# w = alpha * y * x
class SVM:
    def __init__(self, max_iter=100, kernel='linear'):
        self.max_iter = max_iter
        self._kernel = kernel

    def init_args(self, features, labels):
        self.m, self.n = features.shape  # m samples, n feature dimensions
        self.X = features
        self.Y = labels
        self.b = 0.0
        self.alpha = np.ones(self.m)
        # Keep the E_i values in a list
        self.E = [self._E(i) for i in range(self.m)]
        # Penalty parameter on the slack variables
        self.C = 1.0

    def _KKT(self, i):
        y_g = self._g(i) * self.Y[i]
        if self.alpha[i] == 0:
            return y_g >= 1
        elif 0 < self.alpha[i] < self.C:
            return y_g == 1
        else:
            return y_g <= 1

    # g(x): the prediction for input x_i (X[i])
    def _g(self, i):
        r = self.b
        for j in range(self.m):
            r += self.alpha[j] * self.Y[j] * self.kernel(self.X[i], self.X[j])
        return r

    # E(x): the difference between the prediction g(x) and the label y
    def _E(self, i):
        return self._g(i) - self.Y[i]

    # Kernel function
    def kernel(self, x1, x2):
        if self._kernel == 'linear':
            return sum([x1[k] * x2[k] for k in range(self.n)])
        elif self._kernel == 'poly':
            return (sum([x1[k] * x2[k] for k in range(self.n)]) + 1)**2
        return 0

    def _init_alpha(self):
        # The outer loop first sweeps all samples with 0 < alpha < C and checks KKT
        index_list = [i for i in range(self.m) if 0 < self.alpha[i] < self.C]
        # Otherwise it sweeps the whole training set
        non_satisfy_list = [i for i in range(self.m) if i not in index_list]
        index_list.extend(non_satisfy_list)
        for i in index_list:
            if self._KKT(i):
                continue
            E1 = self.E[i]
            # If E1 is positive, pick the smallest E as E2; if negative, pick the largest
            if E1 >= 0:
                j = min(range(self.m), key=lambda x: self.E[x])
            else:
                j = max(range(self.m), key=lambda x: self.E[x])
            return i, j
        # Every sample satisfies KKT: nothing left to optimize
        return None

    def _compare(self, _alpha, L, H):
        if _alpha > H:
            return H
        elif _alpha < L:
            return L
        else:
            return _alpha

    def fit(self, features, labels):
        self.init_args(features, labels)
        for t in range(self.max_iter):
            # train
            pair = self._init_alpha()
            if pair is None:
                break
            i1, i2 = pair
            # Bounds on alpha2
            if self.Y[i1] == self.Y[i2]:
                L = max(0, self.alpha[i1] + self.alpha[i2] - self.C)
                H = min(self.C, self.alpha[i1] + self.alpha[i2])
            else:
                L = max(0, self.alpha[i2] - self.alpha[i1])
                H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])
            E1 = self.E[i1]
            E2 = self.E[i2]
            # eta = K11 + K22 - 2 * K12
            eta = self.kernel(self.X[i1], self.X[i1]) + self.kernel(
                self.X[i2],
                self.X[i2]) - 2 * self.kernel(self.X[i1], self.X[i2])
            if eta <= 0:
                # print('eta <= 0')
                continue
            alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (
                E1 - E2) / eta  # corrected: per the book (pp. 130-131) this should be E1 - E2
            alpha2_new = self._compare(alpha2_new_unc, L, H)

            alpha1_new = self.alpha[i1] + self.Y[i1] * self.Y[i2] * (
                self.alpha[i2] - alpha2_new)

            b1_new = -E1 - self.Y[i1] * self.kernel(self.X[i1], self.X[i1]) * (
                alpha1_new - self.alpha[i1]) - self.Y[i2] * self.kernel(
                    self.X[i2],
                    self.X[i1]) * (alpha2_new - self.alpha[i2]) + self.b
            b2_new = -E2 - self.Y[i1] * self.kernel(self.X[i1], self.X[i2]) * (
                alpha1_new - self.alpha[i1]) - self.Y[i2] * self.kernel(
                    self.X[i2],
                    self.X[i2]) * (alpha2_new - self.alpha[i2]) + self.b

            if 0 < alpha1_new < self.C:
                b_new = b1_new
            elif 0 < alpha2_new < self.C:
                b_new = b2_new
            else:
                # Take the midpoint
                b_new = (b1_new + b2_new) / 2
            # Update the parameters
            self.alpha[i1] = alpha1_new
            self.alpha[i2] = alpha2_new
            self.b = b_new
            self.E[i1] = self._E(i1)
            self.E[i2] = self._E(i2)
        return 'train done!'

    def predict(self, data):
        r = self.b
        for i in range(self.m):
            r += self.alpha[i] * self.Y[i] * self.kernel(data, self.X[i])
        return 1 if r > 0 else -1

    def score(self, X_test, y_test):
        right_count = 0
        for i in range(len(X_test)):
            result = self.predict(X_test[i])
            if result == y_test[i]:
                right_count += 1
        return right_count / len(X_test)

    # def _weight(self):
    #     # linear model
    #     yx = self.Y.reshape(-1, 1) * self.X
    #     self.w = np.dot(yx.T, self.alpha)
    #     return self.w


svm = SVM(max_iter=200)
svm.fit(X_train, y_train)
score = svm.score(X_test, y_test)
print('===score:', score)
An SVM example classifying the fruit dataset with scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap


def plot_class_regions_for_classifier(clf, X, y, X_test=None, y_test=None,
                                      title=None, target_names=None,
                                      plot_decision_regions=True):
    """Visualize the classifier's decision regions.
    Only works for data with two features."""
    num_classes = np.amax(y) + 1
    color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
    color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
    cmap_light = ListedColormap(color_list_light[0:num_classes])
    cmap_bold = ListedColormap(color_list_bold[0:num_classes])

    h = 0.03
    k = 0.5
    x_plot_adjust = 0.1
    y_plot_adjust = 0.1
    plot_symbol_size = 50

    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()
    x2, y2 = np.meshgrid(np.arange(x_min - k, x_max + k, h),
                         np.arange(y_min - k, y_max + k, h))

    P = clf.predict(np.c_[x2.ravel(), y2.ravel()])
    P = P.reshape(x2.shape)
    plt.figure()
    if plot_decision_regions:
        plt.contourf(x2, y2, P, cmap=cmap_light, alpha=0.8)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size,
                edgecolor='black')
    plt.xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
    plt.ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)

    if X_test is not None:
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold,
                    s=plot_symbol_size, marker='^', edgecolor='black')
        train_score = clf.score(X, y)
        test_score = clf.score(X_test, y_test)
        title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)

    if target_names is not None:
        legend_handles = []
        for i in range(0, len(target_names)):
            patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
            legend_handles.append(patch)
        plt.legend(loc=0, handles=legend_handles)

    if title is not None:
        plt.title(title)
    plt.show()


# Load the dataset
fruits_df = pd.read_table('fruit_data_with_colors.txt')

X = fruits_df[['width', 'height']]
y = fruits_df['fruit_label'].copy()

# Relabel everything that is not an apple as 0
y[y != 1] = 0

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
print(y_test.shape)

# Different values of C
c_values = [0.0001, 1, 100]

for c_value in c_values:
    # Build the model
    svm_model = SVC(C=c_value, kernel='rbf')
    # Train the model
    svm_model.fit(X_train, y_train)
    # Evaluate the model
    y_pred = svm_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print('C={}, accuracy: {:.3f}'.format(c_value, acc))
    # Visualize
    plot_class_regions_for_classifier(svm_model, X_test.values, y_test.values,
                                      title='C={}'.format(c_value))
Two-dimensional Gaussian distribution (the original post illustrated the RBF kernel with a figure of this density).

Replacing the kernel with 'linear', as sketched below (the original post showed the resulting decision regions as figures).
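For concreteness, a minimal sketch of that kernel swap, reusing the variables from the code above (my addition; the choice of C=1 is arbitrary for illustration):

# Same data and evaluation as above, but with a linear kernel instead of RBF
svm_model = SVC(C=1, kernel='linear')
svm_model.fit(X_train, y_train)
print('linear kernel accuracy: {:.3f}'.format(svm_model.score(X_test, y_test)))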
2. Ensemble learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def load_data():
    # Load the fruit dataset
    fruits_df = pd.read_table('fruit_data_with_colors.txt')
    print('Number of samples:', len(fruits_df))
    # Build a dict mapping target labels to fruit names
    fruit_name_dict = dict(zip(fruits_df['fruit_label'], fruits_df['fruit_name']))
    # Split the dataset
    X = fruits_df[['mass', 'width', 'height', 'color_score']]
    y = fruits_df['fruit_label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
    print('Dataset size: {}, training set: {}, test set: {}'.format(len(X), len(X_train), len(X_test)))
    return X_train, X_test, y_train, y_test


# Feature scaling
def minmax_scaler(X_train, X_test):
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    # The scaler has now learned the min and max, so the test set is only transformed
    X_test_scaled = scaler.transform(X_test)
    for i in range(4):
        print('Before scaling, training feature {}: max {:.3f}, min {:.3f}'.format(
            i + 1, X_train.iloc[:, i].max(), X_train.iloc[:, i].min()))
        print('After scaling, training feature {}: max {:.3f}, min {:.3f}'.format(
            i + 1, X_train_scaled[:, i].max(), X_train_scaled[:, i].min()))
    return X_train_scaled, X_test_scaled
def stack(X_train_scaled, y_train, X_test_scaled, y_test):
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from mlxtend.classifier import StackingClassifier

    clf1 = KNeighborsClassifier(n_neighbors=1)
    clf2 = SVC(kernel='linear')
    clf3 = DecisionTreeClassifier()
    lr = LogisticRegression(C=100)
    # Three base classifiers feeding a logistic-regression meta classifier
    sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr)

    clf1.fit(X_train_scaled, y_train)
    clf2.fit(X_train_scaled, y_train)
    clf3.fit(X_train_scaled, y_train)
    sclf.fit(X_train_scaled, y_train)

    print('kNN test accuracy: {:.3f}'.format(clf1.score(X_test_scaled, y_test)))
    print('SVM test accuracy: {:.3f}'.format(clf2.score(X_test_scaled, y_test)))
    print('DT test accuracy: {:.3f}'.format(clf3.score(X_test_scaled, y_test)))
    print('Stacking test accuracy: {:.3f}'.format(sclf.score(X_test_scaled, y_test)))
if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()
    X_train_scaled, X_test_scaled = minmax_scaler(X_train, X_test)
    stack(X_train_scaled, y_train, X_test_scaled, y_test)
2.1 Boosting
2.1.1 AdaBoost
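The original post illustrates AdaBoost with figures only; below is a minimal runnable sketch I have added, using scikit-learn's AdaBoostClassifier on the scaled fruit features prepared by the helpers above (that choice of data is my assumption):

def adaboost(X_train_scaled, y_train, X_test_scaled, y_test):
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Boost shallow stumps: each round reweights the samples that earlier
    # rounds misclassified and adds a new weak learner
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=100,
                             learning_rate=0.5)
    clf.fit(X_train_scaled, y_train)
    print('AdaBoost test accuracy: {:.3f}'.format(clf.score(X_test_scaled, y_test)))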
2.1.2 GBDT
def gbdt(X_train_scaled, y_train, X_test_scaled, y_test):
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    # Grid-search the learning rate with 3-fold cross-validation
    parameters = {'learning_rate': [0.001, 0.01, 0.1, 1, 10, 100]}
    clf = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3, scoring='accuracy')
    clf.fit(X_train_scaled, y_train)

    print('Best parameters:', clf.best_params_)
    print('Best cross-validation score:', clf.best_score_)
    print('Test accuracy: {:.3f}'.format(clf.score(X_test_scaled, y_test)))
2.2 Bagging
import warnings
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
warnings.filterwarnings('ignore')

X, y = make_circles(n_samples=300, noise=0.15, factor=0.5, random_state=233)
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
# plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y)
print('X_train.shape=', X_train.shape)
print('X_test.shape=', X_test.shape)
print(y_test)
print('===========knn==============')
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
print('knn accuracy={}'.format(knn_clf.score(X_test, y_test)))
print('\n')
print('===========logistic regression==============')
log_clf = LogisticRegression()
log_clf.fit(X_train, y_train)
print('logistic regression accuracy={}'.format(log_clf.score(X_test, y_test)))
print('\n')
print('===========SVM==============')
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
print('SVM accuracy={}'.format(svm_clf.score(X_test, y_test)))
print('\n')
print('===========Decision tree==============')
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
print('Decision tree accuracy={}'.format(dt_clf.score(X_test, y_test)))
print('\n')
print('===========ensemble classifier==============')
voting_clf = VotingClassifier(estimators=[('knn', KNeighborsClassifier()),
                                          ('logistic', LogisticRegression()),
                                          ('SVM', SVC()),
                                          ('decision tree', DecisionTreeClassifier())],
                              voting='hard')  # hard voting: strict majority rule
voting_clf.fit(X_train, y_train)
print('voting classifier accuracy={}'.format(voting_clf.score(X_test, y_test)))
print('\n')
print('===========random forest==============')
rf_clf = RandomForestClassifier(n_estimators=500,  # 500 trees
                                max_depth=6,       # depth of each tree
                                bootstrap=True,    # sample with replacement
                                oob_score=True,    # validate on the samples never drawn
                                )
rf_clf.fit(X, y)  # since oob_score is True, fit directly on the whole dataset
print('rf accuracy={}'.format(rf_clf.oob_score_))
print('\n')
print('===========extremely randomized trees==============')
ex_clf = ExtraTreesClassifier(n_estimators=500,
                              max_depth=6,
                              bootstrap=True,
                              oob_score=True)
ex_clf.fit(X, y)
print('extremely randomized trees accuracy={}'.format(ex_clf.oob_score_))
print('\n')
print('===========AdaBoost classifier==============')
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(),
                             n_estimators=500,
                             learning_rate=0.3)
ada_clf.fit(X_train, y_train)
print('AdaBoost accuracy={}'.format(ada_clf.score(X_test, y_test)))
print('\n')
One of the clever aspects of the random forest algorithm is its use of randomness, which makes the model more robust. If the forest contains N trees, N training sets are drawn at random, each tree is trained on its own set, and the forest's prediction is obtained by aggregating the predictions of the individual trees.

Because the main building block of a random forest is the decision tree, many of its hyperparameters are the same as a decision tree's. Beyond those, two are worth noting. One is bootstrap, taking True or False, which controls whether the training subsets are drawn with replacement. The other is oob_score: with sampling with replacement, roughly a third of the data (about 1/e, i.e. 37%) is never drawn while building the forest, so when oob_score is True there is no need to split the data into training and test sets; the never-drawn (out-of-bag) samples are used directly to validate the model's accuracy, as checked in the simulation below.

The results above show that the Extremely Randomized Trees algorithm achieves the highest accuracy. It not only selects the samples at random when building each data subset, but also selects the features at random (i.e., each tree is trained on a subset of the features rather than all of them). In other words, over the feature matrix X, a random forest randomizes only over the rows, while Extremely Randomized Trees randomizes over both rows and columns.
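As a quick sanity check of the out-of-bag fraction (a small simulation I added, not from the original post):

import numpy as np

n = 10000
rng = np.random.default_rng(0)
# Bootstrap: draw n samples with replacement from n indices
sample = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(sample)) / n
print('out-of-bag fraction:', oob_fraction)  # close to 1/e ~ 0.368
print('1/e =', 1 / np.e)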
Averaging the predictions of $n$ independent, uncorrelated models with variance $\sigma^2$ reduces the variance to $1/n$ of the original: $\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{\sigma^2}{n}$.
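A minimal numeric check of this variance reduction (my addition; the independent Gaussian "model errors" are assumed purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 10, 100000
# Each row is one trial; each column is one model's prediction error (variance 1)
errors = rng.normal(0.0, 1.0, size=(n_trials, n_models))
print('variance of a single model:', errors[:, 0].var())             # ~1
print('variance of the averaged model:', errors.mean(axis=1).var())  # ~1/10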