Simply put, linear regression treats one dimension of a coordinate system as the output and the remaining dimensions as features (in a two-dimensional plane, for instance, the horizontal axis is the feature and the vertical axis is the output). When the training samples are plotted in such a coordinate system, they turn out to be distributed around a line. The goal of linear regression is to find the straight line that best fits the relationship between the sample features and the sample output labels.

Take simple linear regression (the case where each sample has only one feature) as an example: put house area on the x axis and house price on the y axis, so that each sample is a point $(x^{(i)}, y^{(i)})$. The line we are looking for is $y = ax + b$, and when a new point $x^{(j)}$ is given, the prediction is $\hat{y}^{(j)} = a x^{(j)} + b$.

Note that $\hat{y}^{(j)}$ is the predicted value, while $y^{(j)}$ is the true value of the sample. The goal of simple linear regression is therefore to find the $a$ and $b$ that make the differences between the predicted values and the true values over the whole training set as small as possible.

From this setup we can also summarize the basic approach behind a whole family of machine learning algorithms: define a loss function that measures the gap between the true values and the predicted values and make that gap (the loss) as small as possible, or define a utility function that measures the goodness of fit and make the fit as good as possible; the learned model is the one whose parameters optimize that function.
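In symbols, the loss function for simple linear regression is the standard least-squares objective:

$$J(a, b) = \sum_{i=1}^{m} \left( y^{(i)} - a\,x^{(i)} - b \right)^{2}$$

and training means finding the $a$ and $b$ that minimize $J(a, b)$.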
1. Dataset
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
Formulas for a and b
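Minimizing $J(a, b)$ (setting its partial derivatives with respect to $a$ and $b$ to zero) yields the standard closed-form solution, which is exactly what both implementations below compute:

$$a = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^{2}}, \qquad b = \bar{y} - a\,\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of the training features and labels.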
2. Implementing simple linear regression with a for loop
class SimpleLinearRegression1:

    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on x_train, y_train."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        # means of the features and of the labels
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # numerator of the formula for a
        num = 0.0
        # denominator of the formula for a
        d = 0.0

        # accumulate numerator and denominator with a for loop
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i - x_mean) * (y_i - y_mean)
            d += (x_i - x_mean) ** 2

        # parameters a and b
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):
        """Given a data set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Return the prediction for a single sample x_single."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"
a. Instantiate and fit our own class
reg1 = SimpleLinearRegression1()
reg1.fit(x, y)
b. Make a prediction
x_predict = 6
print(reg1.predict(np.array([x_predict])))
[5.2]
c. Inspect the parameters a and b
print(reg1.a_)
print(reg1.b_)
0.8
0.39999999999999947
d. Plot the fitted line
y_hat1 = reg1.predict(x)

plt.scatter(x, y)
plt.plot(x, y_hat1, color="red")
plt.axis([0, 6, 0, 6])
plt.show()
3. Implementing simple linear regression with vectorization
class SimpleLinearRegression2:

    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on x_train, y_train."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        # means of the features and of the labels
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # compute a and b with vectorized dot products instead of a loop
        self.a_ = (x_train - x_mean).dot(y_train - y_mean) / (x_train - x_mean).dot(x_train - x_mean)
        self.b_ = y_mean - self.a_ * x_mean

        return self

    def predict(self, x_predict):
        """Given a data set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Return the prediction for a single sample x_single."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"
a. Instantiate and fit our own class
reg2 = SimpleLinearRegression2()
reg2.fit(x, y)
b. Make a prediction
print(reg2.predict(np.array([x_predict])))
[5.2]
c. Inspect the parameters a and b
print(reg2.a_)
print(reg2.b_)
0.8
0.39999999999999947
d. Plot the fitted line
y_hat2 = reg2.predict(x)

plt.scatter(x, y)
plt.plot(x, y_hat2, color="red")
plt.axis([0, 6, 0, 6])
plt.show()
4. Performance comparison: for loop vs. vectorization
a. Generate a large data set
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2.0 + 3.0 + np.random.normal(size=m)
b. Timing test
%timeit reg1.fit(big_x, big_y)
%timeit reg2.fit(big_x, big_y)
1.11 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.4 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On one million samples the vectorized implementation is roughly 60 times faster than the for-loop version.
1. Evaluation metrics
a. MSE (Mean Squared Error)
The most direct measure, the sum of squared errors on the test set $\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}$, depends on $m$: a larger test set tends to produce a larger error sum, even though a model evaluated or trained on more data is, if anything, better. We therefore divide by $m$ to remove the dependence on the number of test samples.
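This yields the mean squared error, the definition implemented from scratch later in this post:

$$\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}$$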
b. RMSE (Root Mean Squared Error)
A drawback of MSE is that its unit is off: if y is measured in units of 10,000 yuan, the squared error is measured in (10,000 yuan)², which can be awkward to interpret. Taking the square root restores the original unit.
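The resulting metric, again in the same unit as y, is:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}}$$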
c. MAE (Mean Absolute Error)
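MAE instead averages the absolute errors, so it is also in the same unit as y:

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|$$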
Summary:
Because RMSE squares the errors before averaging and taking the root, a few predictions that are far from the truth make it noticeably larger; RMSE therefore tends to amplify large errors, while MAE does not — it directly reflects the average gap between predictions and true values. For this reason it is usually more meaningful to try to make RMSE small: a small RMSE means that even the largest errors on the sample are relatively small. Moreover, the quantity inside RMSE's square root is, up to the 1/m factor, exactly the sum of squared errors we minimized when training the model, so optimizing the training objective and reducing RMSE are essentially the same thing.
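A minimal sketch of this effect (the two error vectors here are made up purely for illustration): both have the same MAE, but the one containing a single large error has a much larger RMSE.

import numpy as np

# two hypothetical error vectors with the same mean absolute error
errors_even = np.array([1.0, 1.0, 1.0, 1.0])      # errors spread evenly
errors_outlier = np.array([0.0, 0.0, 0.0, 4.0])   # one large error

for err in (errors_even, errors_outlier):
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    print("MAE =", mae, "RMSE =", rmse)

# MAE = 1.0 RMSE = 1.0
# MAE = 1.0 RMSE = 2.0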
2. Implementing MSE, RMSE and MAE from scratch
a. The Boston housing data set
from sklearn import datasets

# note: load_boston has been removed in recent scikit-learn versions (1.2+); an older version is assumed here
boston = datasets.load_boston()
# simple linear regression: use only the RM feature (average number of rooms)
x = boston.data[:, 5]
y = boston.target
# show the data
plt.scatter(x, y)
plt.show()
# the plot shows prices capped at an upper limit (50.0); exclude those points
x = x[y < 50.0]
y = y[y < 50.0]
# plot again
plt.scatter(x, y)
plt.show()
b. Fit the simple linear regression model
# split the data into a training set and a test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
# inspect the training and test data
print(x_train.shape, type(x_train))
print(y_train.shape, type(y_train))
print(x_test.shape, type(x_test))
print(y_test.shape, type(y_test))
(392,) <class 'numpy.ndarray'>
(392,) <class 'numpy.ndarray'>
(98,) <class 'numpy.ndarray'>
(98,) <class 'numpy.ndarray'>

reg = SimpleLinearRegression2()
reg.fit(x_train, y_train)
# plot the fitted line on the training data
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color="r")
plt.show()
# predictions on the test set
y_predict = reg.predict(x_test)
c. MSE
mse_test = np.sum((y_predict - y_test) ** 2) / len(y_test)
print(mse_test)
24.156602134387438
d. RMSE
from math import sqrt

rmse_test = sqrt(mse_test)
print(rmse_test)
4.914936635846635
e. MAE
mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
print(mae_test)
3.5430974409463873
3. Using scikit-learn's built-in MSE and MAE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
mean_squared_error(y_test,y_predict)
24.156602134387438
mean_absolute_error(y_test,y_predict)
3.5430974409463873
Summary:
MSE is expressed in the squared unit of y, so its value cannot be compared directly with RMSE or MAE. RMSE and MAE, on the other hand, share the same unit as y; here RMSE (about 4.91) is larger than MAE (about 3.54), which again reflects RMSE's tendency to amplify the larger errors.
1. Introduction to R Squared
2. What R Squared means
A baseline model that always predicts the mean ȳ makes large errors; our model makes comparatively small errors, because it actually uses the relationship between y and x. Dividing the error of our model by the error of the baseline gives the fraction of error our model still commits, and subtracting this quotient from 1 gives the fraction of error our model has eliminated.
3. Interpreting R Squared
R² is at most 1, and larger is better: R² = 1 means the model makes no errors at all, R² = 0 means it does no better than the baseline model, and a negative R² means it is even worse than the baseline, which usually suggests the data has no linear relationship to begin with.
4. The formula for R Squared
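The definition, written so that it matches the one-line implementation below:

$$R^{2} = 1 - \frac{\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\sum_{i=1}^{m}\left(\bar{y} - y^{(i)}\right)^{2}} = 1 - \frac{\text{MSE}(\hat{y}, y)}{\text{Var}(y)}$$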
5. Implementing R Squared
1-mean_squared_error(y_test,y_predict)/np.var(y_test)
0.6129316803937322
# or call scikit-learn directly
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
0.6129316803937324
1. Introduction to the multivariate linear regression model
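With n features the prediction and the closed-form (normal equation) solution for the parameter vector, which is exactly what fit_normal below computes, are:

$$\hat{y} = X_b \cdot \theta, \qquad \theta = \left(X_b^{T} X_b\right)^{-1} X_b^{T} y$$

where $X_b$ is the training matrix with an extra first column of ones and $\theta = (\theta_0, \theta_1, \dots, \theta_n)$ stacks the intercept and the coefficients.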
2. Implementing multivariate linear regression
import numpy as np
from sklearn.metrics import r2_score  # R^2 for score(); the formula from the previous section would work equally well


class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model."""
        # coefficient vector (theta_1, theta_2, ..., theta_n)
        self.coef_ = None
        # intercept (theta_0)
        self.interception_ = None
        # full theta vector
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Train the Linear Regression model on X_train, y_train using the normal equation."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        # np.ones((len(X_train), 1)) builds a column of ones with as many rows as X_train,
        # and np.hstack concatenates it with X_train to form X_b
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        # X_b.T is the transpose, np.linalg.inv() the matrix inverse, dot() the matrix product
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X_predict):
        """Given a data set X_predict, return the vector of predictions."""
        assert self.coef_ is not None and self.interception_ is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Return the R^2 of the model on the test data X_test, y_test."""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
1. Load the data set
# imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# load the data
boston = datasets.load_boston()

x = boston.data
y = boston.target

x = x[y < 50.0]
y = y[y < 50.0]
# inspect the data
print(x.shape, type(x))
print(y.shape, type(y))
(490, 13) <class 'numpy.ndarray'>
(490,) <class 'numpy.ndarray'>
# split into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)
2. Fit scikit-learn's LinearRegression model
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
# train the model
lin_reg.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
# inspect the model parameters
print(lin_reg.coef_)
[-1.04156138e-01  3.86619027e-02 -4.51033368e-02  3.98933894e-01
 -1.44605989e+01  3.72025936e+00 -1.38861106e-02 -1.21248375e+00
  2.65937183e-01 -1.33953987e-02 -9.40424839e-01  8.81098776e-03
 -3.56489837e-01]
print(lin_reg.intercept_)
34.29394259562285
# evaluate with the R^2 metric
lin_reg.score(x_test, y_test)
0.7392238735878867
In the earlier post 01机器学习算法之KNN we mentioned that KNN can also be used for regression, but gave no example there. Here we solve the same regression problem with KNN.
Regression with KNN
a. Use KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(x_train, y_train)
print(knn_reg.score(x_test, y_test))
0.6545104997332957
b. Grid-search the hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(x_train, y_train)
# inspect the best parameters
grid_search.best_params_
{'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
# evaluate the tuned estimator on the test set
grid_search.best_estimator_.score(x_test, y_test)
0.6735218877136675

Even with tuned hyperparameters, KNN's R² here (about 0.67) is still below the 0.74 obtained by linear regression on the same split.
Interpretability of linear regression
from sklearn import datasets
import numpy as np
from sklearn.linear_model import LinearRegression

boston = datasets.load_boston()
x = boston.data
y = boston.target

lin_reg = LinearRegression()
lin_reg.fit(x, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
lin_reg.coef_
[-1.08011358e-01, 4.64204584e-02, 2.05586264e-02, 2.68673382e+00,
 -1.77666112e+01, 3.80986521e+00, 6.92224640e-04, -1.47556685e+00,
 3.06049479e-01, -1.23345939e-02, -9.52747232e-01, 9.31168327e-03,
 -5.24758378e-01]
# indices that sort the coefficients from most negative to most positive
np.argsort(lin_reg.coef_)
[ 4, 7, 10, 12, 0, 9, 6, 11, 2, 1, 8, 3, 5]
# map the sorted indices to feature names for easier interpretation
boston.feature_names[np.argsort(lin_reg.coef_)]
['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'TAX', 'AGE', 'B','INDUS', 'ZN', 'RAD', 'CHAS', 'RM']
Summary
RM is the average number of rooms and has the largest positive coefficient: the more rooms, the higher the price, which is very reasonable.
NOX is the nitric oxide concentration and has the most negative coefficient: the higher the concentration, the lower the price, which is also very reasonable.
This shows that linear regression is interpretable: when studying a new problem we can first fit a linear model, look at its coefficients, and use that intuition to judge whether the results agree with common sense.