
Hands-on Python sklearn diabetes project: preprocessing the sklearn diabetes dataset

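Contents

1 Download the dataset and preprocess

1.1 Load/download the dataset

1.2 Data visualization

1.3 Data cleaning

1.4 Feature engineering

1.5 Build the feature set and label set

1.6 Split into training and test sets

2 Train the model

2.1 Choose the algorithm and define the model

2.2 Fit the model

3 Evaluate and optimize model performance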


This article uses the diabetes dataset to train linear regression models:

1 Download the dataset and preprocess

1.1 Load/download the dataset

    """
    @Title: Collect data
    @Time: 2024/3/11
    @Author: Michael Jie

    Data collection and preprocessing:
    1. Collect the data;
    2. Visualize the data;
    3. Clean the data;
    4. Feature engineering;
    5. Build the feature set and label set (supervised learning only);
    6. Split into training and test sets.
    """
    import sklearn.datasets as ds
    import pandas as pd

    # Load and return the diabetes dataset (regression)
    diabetes = ds.load_diabetes(
        # If True, return a (data, target) tuple instead of a Bunch object
        return_X_y=False,
        # If True, return the data as pandas DataFrame/Series objects
        as_frame=False,
        # If True, return the mean-centered and scaled features;
        # if False, return the raw feature values
        scaled=False
    )
    # A Bunch object is essentially a dictionary
    print(diabetes.keys())
    """
    dict_keys([
        'data',             # feature set
        'target',           # label set
        'frame',            # DataFrame of features and target; None unless as_frame=True
        'DESCR',            # dataset description
        'feature_names',    # feature column names
        'data_filename',    # name of the file that holds the features
        'target_filename',  # name of the file that holds the labels
        'data_module'       # module that contains the data files
    ])
    """
    # Feature set
    data = diabetes.data
    print(type(data), data.shape)
    """
    <class 'numpy.ndarray'>
    (442, 10)
    """
    feature_names = diabetes.feature_names
    print(feature_names, type(feature_names))
    """
    ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    <class 'list'>
    """
    # Label set
    target = diabetes.target
    print(type(target), target.shape)
    """
    <class 'numpy.ndarray'>
    (442,)
    """
    # Dataset description
    print(diabetes.DESCR)
    """
    Diabetes dataset
    ----------------
    Ten baseline variables, age, sex, body mass index, average blood
    pressure, and six blood serum measurements were obtained for each of n =
    442 diabetes patients, as well as the response of interest, a
    quantitative measure of disease progression one year after baseline.
    **Data Set Characteristics:**
    :Number of Instances: 442
    :Number of Attributes: First 10 columns are numeric predictive values
    :Target: Column 11 is a quantitative measure of disease progression one year after baseline
    :Attribute Information:
    - age   age in years
    - sex
    - bmi   body mass index
    - bp    average blood pressure
    - s1    tc, total serum cholesterol
    - s2    ldl, low-density lipoproteins
    - s3    hdl, high-density lipoproteins
    - s4    tch, total cholesterol / HDL
    - s5    ltg, possibly log of serum triglycerides level
    - s6    glu, blood sugar level
    Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
    Source URL:
    https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
    For more information see:
    Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
    (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
    """
    # The scaling described in the note applies to the default scaled=True;
    # with scaled=False, as here, `data` holds the raw feature values.

    # Save the dataset to a CSV file
    data_csv = pd.DataFrame(data=data, columns=feature_names)
    target_csv = pd.DataFrame(data=target, columns=['target'])
    diabetes_csv = pd.concat([data_csv, target_csv], axis=1)
    diabetes_csv.to_csv(r'diabetes_datasets.csv', index=False)

1.2 Data visualization

    """
    @Title: Data visualization
    @Time: 2024/3/11
    @Author: Michael Jie
    """
    import pandas as pd
    import matplotlib.pyplot as plt

    # Read the data
    csv = pd.read_csv(r'diabetes_datasets.csv')
    print(csv.shape)  # (442, 11)
    # Scatter-plot each feature against the target, one subplot per feature
    plt.figure(figsize=(19.2, 10.8))
    for i in range(csv.shape[1] - 1):
        ax = plt.subplot(2, 5, i + 1)
        ax.scatter(csv[csv.columns[i]], csv["target"])
        ax.set_title(csv.columns[i])
    # Save the figure
    plt.savefig(r'diabetes_datasets.png')
    # plt.show()

1.3 Data cleaning

    """
    @Title: Data cleaning
    @Time: 2024/3/11
    @Author: Michael Jie
    """
    import pandas as pd

    """
    1. Missing data: drop incomplete rows, or fill the gaps with the mean, a random value, or 0;
    2. Duplicate data: drop rows that are exact duplicates;
    3. Erroneous data: fix or drop logically inconsistent values;
    4. Unusable data: fix or drop values with a malformed format.
    """
    # Read the data
    csv = pd.read_csv(r'diabetes_datasets.csv')
    # Count the NaN values in each column
    print(csv.isna().sum())
    """
    age       0
    sex       0
    bmi       0
    bp        0
    s1        0
    s2        0
    s3        0
    s4        0
    s5        0
    s6        0
    target    0
    dtype: int64
    """
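
The diabetes dataset has no missing values, so the first two cleaning steps listed above never get exercised. A minimal sketch of them on a hypothetical toy table (the values and the name toy are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one duplicated row
toy = pd.DataFrame({
    "bmi": [21.0, np.nan, 30.0, 30.0],
    "target": [100.0, 150.0, 200.0, 200.0],
})

# 1. Missing data: fill NaN with the column mean (dropna() would drop the row instead)
filled = toy.fillna(toy.mean())

# 2. Duplicate data: drop rows that are exact copies of an earlier row
cleaned = filled.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Whether to fill or drop depends on how much data is missing; filling with the mean keeps the row count but reduces the variance of that column.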

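1.4 Feature engineering

    """
    @Title: Feature engineering
    @Time: 2024/3/11
    @Author: Michael Jie
    """
    import numpy as np
    import sklearn.datasets as ds


    # z-score standardization: each column gets mean 0 and standard deviation 1
    def z_score_normalization(x, axis=0):
        x = np.array(x)
        x = (x - np.mean(x, axis=axis)) / np.std(x, axis=axis)
        return x


    # scaled=True returns features that are mean-centered and scaled so that
    # the sum of squares of each column is 1
    diabetes_pre = ds.load_diabetes(scaled=True)
    print(diabetes_pre.data)
    # Standardize the raw features manually; the result differs from the
    # scaled=True output by a constant factor of sqrt(n_samples)
    diabetes = ds.load_diabetes(scaled=False)
    print(z_score_normalization(diabetes.data))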
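
The two scalings printed above are related by a constant factor, as the dataset description states (standard deviation times the square root of n_samples). A short check of that relationship on synthetic stand-in data, assuming only NumPy; the array x is random and does not come from the diabetes dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in features: 442 rows, 3 columns
x = rng.normal(loc=50.0, scale=10.0, size=(442, 3))

# z-score: every column ends up with mean 0 and standard deviation 1
z = (x - x.mean(axis=0)) / x.std(axis=0)

# Dividing by sqrt(n_samples) makes the sum of squares of each column 1
unit = z / np.sqrt(x.shape[0])

print(z.std(axis=0))
print((unit ** 2).sum(axis=0))
```

For ordinary least squares the choice between the two only rescales the coefficients, but for a penalized model such as Ridge the feature scale changes how strongly a given alpha shrinks them.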

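1.5 Build the feature set and label set

Nothing to do here: load_diabetes already returns the features (data) and the labels (target) separately.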
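
When the data arrives as a single table instead, for example when it is read back from a CSV file, the feature set and label set have to be separated by hand. A minimal sketch on a hypothetical three-row table (the name table and its values are for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for a table that holds features and target together
table = pd.DataFrame({
    "age": [59, 48, 72],
    "bmi": [32.1, 21.6, 30.5],
    "target": [151.0, 75.0, 141.0],
})

X = table.drop(columns=["target"])  # feature set: every column except the label
y = table["target"]                 # label set: the column to predict
print(X.shape, y.shape)
```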

1.6 Split into training and test sets

    """
    @Title: Split into training and test sets
    @Time: 2024/3/11
    @Author: Michael Jie
    """
    import sklearn.datasets as ds
    from sklearn.model_selection import train_test_split

    # Load the data
    diabetes = ds.load_diabetes(scaled=False)
    # Split the dataset into 80% training data and 20% test data
    x_train, x_test, y_train, y_test = train_test_split(
        diabetes.data, diabetes.target, test_size=0.2, random_state=0
    )
    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
    """
    (353, 10) (89, 10) (353,) (89,)
    """
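
The sizes 353 and 89 follow from simple arithmetic. The sketch below reproduces them under the assumption that, when only a fractional test_size is given, train_test_split rounds the test count up and assigns the remaining rows to the training set:

```python
import math

n_samples = 442
test_size = 0.2

# 442 * 0.2 = 88.4, which rounds up to 89 test samples
n_test = math.ceil(test_size * n_samples)
# The remaining samples form the training set
n_train = n_samples - n_test
print(n_train, n_test)
```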

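2 Train the model

2.1 Choose the algorithm and define the model

    from sklearn.linear_model import LinearRegression, Ridge

    # Create an ordinary least squares linear regression model
    linear = LinearRegression(
        # Whether to fit an intercept
        fit_intercept=True,
        # Whether to copy the feature set (otherwise it may be overwritten)
        copy_X=True,
    )
    # Create a linear regression model with L2 regularization (ridge regression)
    ridge = Ridge(
        # Regularization strength (not a learning rate); larger values shrink the coefficients more
        alpha=1.0,
        # Whether to fit an intercept
        fit_intercept=True,
        # Whether to copy the feature set (otherwise it may be overwritten)
        copy_X=True,
        # Maximum number of iterations for the iterative solvers
        max_iter=None,
        # Tolerance used by the iterative solvers to decide convergence
        tol=1e-4,
    )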
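
For reference, the role of alpha can be read off the objective that ridge regression minimizes. In the notation of the scikit-learn documentation, with feature matrix X, target y and coefficient vector w:

```latex
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
```

With \alpha = 0 the penalty vanishes and the problem reduces to ordinary least squares, which is what LinearRegression solves.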

2.2 Fit the model

    """
    @Title: Train and evaluate the model
    @Time: 2024/3/11
    @Author: Michael Jie
    """
    import sklearn.datasets as ds
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import train_test_split

    # Load the data
    diabetes = ds.load_diabetes(scaled=True)
    # Split the dataset into 80% training data and 20% test data
    x_train, x_test, y_train, y_test = train_test_split(
        diabetes.data, diabetes.target, test_size=0.2, random_state=0
    )
    # Create the ordinary linear regression model
    linear = LinearRegression()
    # Fit
    linear.fit(x_train, y_train)
    print(linear.coef_, linear.intercept_)
    """
    [ -35.55025079 -243.16508959 562.76234744 305.46348218 -662.70290089
      324.20738537 24.74879489 170.3249615 731.63743545 43.0309307 ] 152.5380470138517
    """
    # Create the ridge regression model
    ridge = Ridge()
    # Fit
    ridge.fit(x_train, y_train)
    print(ridge.coef_, ridge.intercept_)
    """
    [ 21.34794489 -72.97401935 301.36593604 177.49036347 2.82093648
      -35.27784862 -155.52090285 118.33395129 257.37783937 102.22540041] 151.9441509473086
    """
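
The ridge coefficients printed above are noticeably smaller in magnitude than the ordinary least squares ones, which is the expected effect of the L2 penalty. The sketch below illustrates that shrinkage with the closed-form solution on synthetic data; the helper ridge_coef is hypothetical and is not how scikit-learn's solvers are implemented:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 5 features, known coefficients plus noise
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Center X and y so that the intercept does not have to be penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def ridge_coef(Xc, yc, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)


w_ols = ridge_coef(Xc, yc, alpha=0.0)     # alpha = 0 is ordinary least squares
w_ridge = ridge_coef(Xc, yc, alpha=10.0)  # alpha > 0 shrinks the coefficients
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The coefficient norm shrinks as alpha grows, so a very large alpha pushes every coefficient toward zero and the model toward predicting the mean.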

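3 Evaluate and optimize model performance

    # Continues from section 2.2: the imports and x_train, x_test, y_train, y_test are already defined

    # Create the ordinary linear regression model
    linear = LinearRegression()
    linear.fit(x_train, y_train)
    # Evaluate the model: score() returns the coefficient of determination R^2.
    # The best possible value is 1; it can be negative for a model that does
    # worse than always predicting the mean.
    print(linear.score(x_test, y_test))
    """
    0.33223321731061806
    """
    # Create the ridge regression model
    ridge = Ridge()
    ridge.fit(x_train, y_train)
    # Evaluate the model
    print(ridge.score(x_test, y_test))
    """
    0.3409800318493461
    """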
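
The value returned by score() for a regressor is the coefficient of determination R^2. A minimal pure-Python sketch of the definition, using the hypothetical helper r2_score_manual and made-up values, shows why the result is not confined to the range 0 to 1:

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0, 4.0])    # 1.0: exact predictions
baseline = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])   # 0.0: always predicts the mean
reversed_ = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])  # -3.0: worse than the mean
print(perfect, baseline, reversed_)
```

Read this way, the scores of about 0.33 and 0.34 above mean that both models explain roughly a third of the variance of the target on the test set.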
