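The red wine quality data set contains data on 1,599 red wines. Each wine has a set of chemical measurements, including alcohol content, volatile acidity, and sulphates. Each wine also has a taste score, which is the average of the scores given by three professional wine tasters.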
import pandas as pd
import matplotlib.pyplot as plot

target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")

## Read the data set
wine = pd.read_csv(target_url, header=0, sep=";")
print(wine.head())

## Summary statistics
summary = wine.describe()
print(summary)

## Normalize every column to zero mean and unit standard deviation.
## This builds a new DataFrame, so the original `wine` data is left unchanged.
wineNormalized = (wine - wine.mean()) / wine.std()
array = wineNormalized.values

## Draw the box plot
plot.boxplot(array)
plot.xlabel("Attribute Index")
plot.ylabel("Quartile Ranges - Normalized ")
plot.show()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000
mean        8.319637          0.527821     0.270976        2.538806
std         1.741096          0.179060     0.194801        1.409928
min         4.600000          0.120000     0.000000        0.900000
25%         7.100000          0.390000     0.090000        1.900000
50%         7.900000          0.520000     0.260000        2.200000
75%         9.200000          0.640000     0.420000        2.600000
max        15.900000          1.580000     1.000000       15.500000

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000
mean      0.087467            15.874922             46.467792     0.996747
std       0.047065            10.460157             32.895324     0.001887
min       0.012000             1.000000              6.000000     0.990070
25%       0.070000             7.000000             22.000000     0.995600
50%       0.079000            14.000000             38.000000     0.996750
75%       0.090000            21.000000             62.000000     0.997835
max       0.611000            72.000000            289.000000     1.003690

                pH    sulphates      alcohol      quality
count  1599.000000  1599.000000  1599.000000  1599.000000
mean      3.311113     0.658149    10.422983     5.636023
std       0.154386     0.169507     1.065668     0.807569
min       2.740000     0.330000     8.400000     3.000000
25%       3.210000     0.550000     9.500000     5.000000
50%       3.310000     0.620000    10.200000     6.000000
75%       3.400000     0.730000    11.100000     6.000000
max       4.010000     2.000000    14.900000     8.000000
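The box plot makes the outliers in the data set easy to spot. Both the numeric summary statistics and the box plot show a large number of extreme values; residual sugar, for example, has a 75th percentile of 2.6 but a maximum of 15.5. Keep this in mind when training on this data set. When you analyze the performance of a predictive model, these extreme points are likely to be an important source of prediction errors.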
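To put a number on "a large number of extreme values", one option is to count, per column, the points that fall outside the box plot whiskers. The sketch below assumes the common 1.5 × IQR whisker rule (the default used by matplotlib's `boxplot`). The helper name `count_outliers` and the small `toy` frame are illustrative only and are not part of the original code; the same function could be applied to the full `wine` DataFrame.

```python
import pandas as pd


def count_outliers(frame: pd.DataFrame, whisker: float = 1.5) -> pd.Series:
    """Count, per column, the values lying outside the box plot whiskers."""
    q1 = frame.quantile(0.25)          # first quartile of each column
    q3 = frame.quantile(0.75)          # third quartile of each column
    iqr = q3 - q1                      # interquartile range
    lower = q1 - whisker * iqr         # lower whisker limit
    upper = q3 + whisker * iqr         # upper whisker limit
    outside = (frame < lower) | (frame > upper)
    return outside.sum()


# Hypothetical toy data: a few wine-like values plus deliberate extremes.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 14.9],
    "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70, 0.66],
})

counts = count_outliers(toy)
print(counts)
```

Columns with high counts are the ones whose extreme values deserve a second look when a model's errors are examined.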
import pandas as pd
import matplotlib.pyplot as plot
from math import exp

target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")

## Read the data set
wine = pd.read_csv(target_url, header=0, sep=";")
print(wine.head())

## Summary statistics
summary = wine.describe()
nrows = len(wine.index)
tasteCol = len(summary.columns)
meanTaste = summary.iloc[1, tasteCol - 1]   # mean of the quality column
sdTaste = summary.iloc[2, tasteCol - 1]     # standard deviation of the quality column
nDataCol = len(wine.columns) - 1            # index of the quality (target) column

## Draw the parallel coordinates plot
for i in range(nrows):
    # plot rows of data as if they were series data
    # (the slice starts at column 1, so "fixed acidity" is not drawn)
    dataRow = wine.iloc[i, 1:nDataCol]
    normTarget = (wine.iloc[i, nDataCol] - meanTaste) / sdTaste
    # squash the normalized score into (0, 1) to pick a color
    labelColor = 1.0 / (1.0 + exp(-normTarget))
    dataRow.plot(color=plot.cm.RdYlBu(labelColor), alpha=0.5)

plot.xlabel("Attribute Index")
plot.ylabel("Attribute Values")
plot.show()

## Normalize every column to zero mean and unit standard deviation.
## This builds a new DataFrame, so the original `wine` data is left unchanged.
wineNormalized = (wine - wine.mean()) / wine.std()

## Redraw the parallel coordinates plot after normalization
for i in range(nrows):
    # plot rows of data as if they were series data
    dataRow = wineNormalized.iloc[i, 1:nDataCol]
    normTarget = wineNormalized.iloc[i, nDataCol]
    labelColor = 1.0 / (1.0 + exp(-normTarget))
    dataRow.plot(color=plot.cm.RdYlBu(labelColor), alpha=0.5)

plot.xlabel("Attribute Index")
plot.ylabel("Attribute Values")
plot.show()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5
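Adding color coding to the parallel coordinates plot makes it easier to see how strongly each attribute is related to the target. The main shortcoming of the plot in Figure 1 is that variables with small ranges are compressed. To overcome this, the wine data are normalized first and the plot is redrawn; Figure 2 shows the result. With normalized data, it is much easier to see which attributes are related to the target, and Figure 2 shows clear relationships. At the far right of the plot, dark blue lines (high taste scores) cluster at high values of alcohol content, while at the far left, dark red lines (low taste scores) cluster at high values of volatile acidity. These are the most visibly correlated attributes.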
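The two ideas behind the redrawn plot can be seen on a handful of rows: z-score normalization puts columns with very different ranges on a common scale, and the logistic function turns the normalized score into a colormap position between 0 and 1. The sketch below uses three columns of the first five rows from the `head()` output above; the names `sample` and `color_position` are illustrative and do not appear in the original code.

```python
from math import exp

import pandas as pd

# First five rows of three columns, copied from the head() output above.
sample = pd.DataFrame({
    "total sulfur dioxide": [34.0, 67.0, 54.0, 60.0, 34.0],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978],
    "quality": [5, 5, 5, 6, 5],
})

# Z-score normalization: subtract each column's mean, divide by its
# (sample) standard deviation. The result is a new DataFrame.
normalized = (sample - sample.mean()) / sample.std()

# Column ranges before and after: the raw ranges differ by orders of
# magnitude, the normalized ones are comparable.
raw_range = sample.max() - sample.min()
norm_range = normalized.max() - normalized.min()
print(raw_range)
print(norm_range)


def color_position(z: float) -> float:
    """Map a normalized score to (0, 1) with the logistic function."""
    return 1.0 / (1.0 + exp(-z))


# Scores above the mean land above 0.5, scores below it land below 0.5.
colors = normalized["quality"].map(color_position)
print(colors)
```

In the plotting code, this position is what gets passed to `plot.cm.RdYlBu`, so below-average wines are drawn toward the red end and above-average wines toward the blue end.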
import pandas as pd
import matplotlib.pyplot as plot

target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")

## Read the data set
wine = pd.read_csv(target_url, header=0, sep=";")

## Compute the correlation matrix of all real-valued columns (including the target)
corMat = wine.corr()
print(corMat)

## Visualize the correlation matrix as a heat map
plot.pcolor(corMat)
plot.show()
                      fixed acidity  volatile acidity  citric acid  \
fixed acidity              1.000000         -0.256131     0.671703
volatile acidity          -0.256131          1.000000    -0.552496
citric acid                0.671703         -0.552496     1.000000
residual sugar             0.114777          0.001918     0.143577
chlorides                  0.093705          0.061298     0.203823
free sulfur dioxide       -0.153794         -0.010504    -0.060978
total sulfur dioxide      -0.113181          0.076470     0.035533
density                    0.668047          0.022026     0.364947
pH                        -0.682978          0.234937    -0.541904
sulphates                  0.183006         -0.260987     0.312770
alcohol                   -0.061668         -0.202288     0.109903
quality                    0.124052         -0.390558     0.226373

                      residual sugar  chlorides  free sulfur dioxide  \
fixed acidity               0.114777   0.093705            -0.153794
volatile acidity            0.001918   0.061298            -0.010504
citric acid                 0.143577   0.203823            -0.060978
residual sugar              1.000000   0.055610             0.187049
chlorides                   0.055610   1.000000             0.005562
free sulfur dioxide         0.187049   0.005562             1.000000
total sulfur dioxide        0.203028   0.047400             0.667666
density                     0.355283   0.200632            -0.021946
pH                         -0.085652  -0.265026             0.070377
sulphates                   0.005527   0.371260             0.051658
alcohol                     0.042075  -0.221141            -0.069408
quality                     0.013732  -0.128907            -0.050656

                      total sulfur dioxide   density        pH  sulphates  \
fixed acidity                    -0.113181  0.668047 -0.682978   0.183006
volatile acidity                  0.076470  0.022026  0.234937  -0.260987
citric acid                       0.035533  0.364947 -0.541904   0.312770
residual sugar                    0.203028  0.355283 -0.085652   0.005527
chlorides                         0.047400  0.200632 -0.265026   0.371260
free sulfur dioxide               0.667666 -0.021946  0.070377   0.051658
total sulfur dioxide              1.000000  0.071269 -0.066495   0.042947
density                           0.071269  1.000000 -0.341699   0.148506
pH                               -0.066495 -0.341699  1.000000  -0.196648
sulphates                         0.042947  0.148506 -0.196648   1.000000
alcohol                          -0.205654 -0.496180  0.205633   0.093595
quality                          -0.185100 -0.174919 -0.057731   0.251397

                       alcohol   quality
fixed acidity        -0.061668  0.124052
volatile acidity     -0.202288 -0.390558
citric acid           0.109903  0.226373
residual sugar        0.042075  0.013732
chlorides            -0.221141 -0.128907
free sulfur dioxide  -0.069408 -0.050656
total sulfur dioxide -0.205654 -0.185100
density              -0.496180 -0.174919
pH                    0.205633 -0.057731
sulphates             0.093595  0.251397
alcohol               1.000000  0.476166
quality               0.476166  1.000000
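The heat map produced by this code shows the correlations among the attributes and between the attributes and the target. In this heat map, yellow corresponds to strong correlation (the color scale is the reverse of the one used in the parallel coordinates plots). For the wine data, the heat map shows that the taste score (last column) has its strongest positive correlation with alcohol content (second-to-last column, about 0.48) and is negatively correlated with several other attributes, most notably volatile acidity (second column, about -0.39). The parallel coordinates plot and the correlation heat map agree: high alcohol content goes with high taste scores, while high volatile acidity goes with low taste scores. Part of the work in building a predictive model is studying how important each attribute is to the prediction. The wine data set is a good example of how exploring the data shows which direction to take in building a predictive model and how to evaluate it.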
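Reading the strongest relationships off a heat map can be imprecise, so it may help to rank the attributes by the absolute value of their correlation with the target. The sketch below hard-codes the `quality` column of the correlation matrix printed above so that it runs without downloading the data; with the full matrix available, `corMat["quality"].drop("quality")` would give the same series. The names `quality_corr` and `ranked` are illustrative.

```python
import pandas as pd

# Correlation of each attribute with quality, copied from the printed
# correlation matrix (the quality column, excluding quality itself).
quality_corr = pd.Series({
    "fixed acidity": 0.124052,
    "volatile acidity": -0.390558,
    "citric acid": 0.226373,
    "residual sugar": 0.013732,
    "chlorides": -0.128907,
    "free sulfur dioxide": -0.050656,
    "total sulfur dioxide": -0.185100,
    "density": -0.174919,
    "pH": -0.057731,
    "sulphates": 0.251397,
    "alcohol": 0.476166,
})

# Order by strength of the relationship (absolute value), strongest first,
# while keeping the sign so the direction is still visible.
order = quality_corr.abs().sort_values(ascending=False).index
ranked = quality_corr.reindex(order)
print(ranked)
```

Alcohol and volatile acidity come out on top, which matches what the parallel coordinates plot and the heat map suggest. Note that correlation only captures linear, pairwise relationships, so this ranking is a starting point rather than a measure of attribute importance in a fitted model.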