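Main topics:
Why data preprocessing is necessary
Data cleaning
Data integration
Data standardization
Data reduction
Data transformation and discretization
Data preprocessing with sklearn
Summary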
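Data integration merges data from multiple sources and stores the result in a single, consistent data store.
1. Key issues in data integration
1. Entity identification
2. Data redundancy and correlation analysis
3. Tuple duplication
4. Detecting and resolving data value conflicts
5. Outlier detection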
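Tuple duplication, item 3 above, is the easiest of these issues to check mechanically. The following is a minimal sketch, assuming a small made-up `customers` table that stands in for records merged from two sources; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical records merged from two sources; row 2 repeats row 0 exactly.
customers = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Ann', 'Cid'],
    'city': ['Oslo', 'Rome', 'Oslo', 'Lima'],
})

# duplicated() marks every repeat of an earlier row as True.
dup_mask = customers.duplicated()
print(dup_mask.tolist())   # [False, False, True, False]

# drop_duplicates() keeps the first occurrence of each tuple.
deduped = customers.drop_duplicates()
print(len(deduped))        # 3
```

Both methods accept a `subset` argument, which helps when only some columns identify an entity.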
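Some redundancy can be detected by correlation analysis. For nominal attributes, use the chi-square (χ²) test; for numeric attributes, use the correlation coefficient and the covariance to assess how strongly two attributes are related.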
(1) Chi-square correlation test for nominal data
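As a rough illustration of the chi-square test for two nominal attributes, the sketch below builds a contingency table with `pd.crosstab` and computes the statistic by hand. The survey data and the column names `gender` and `reads_fiction` are made up for this example, and no continuity correction is applied.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: two nominal attributes per respondent.
df = pd.DataFrame({
    'gender': ['M'] * 50 + ['F'] * 50,
    'reads_fiction': ['yes'] * 10 + ['no'] * 40 + ['yes'] * 30 + ['no'] * 20,
})

# Observed counts for every (gender, reads_fiction) combination.
observed = pd.crosstab(df['gender'], df['reads_fiction']).to_numpy()

# Expected counts if the two attributes were independent:
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic and degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(round(chi2, 3), dof)   # 16.667 1
```

With 1 degree of freedom, the critical value at the 0.05 significance level is 3.841. The statistic here is well above it, so the hypothesis that the two attributes are independent would be rejected for this toy data.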
(2) Correlation coefficient for numeric data
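The article does not spell out a formula here. For reference, the Pearson correlation coefficient of two numeric attributes A and B with n observations is usually written as follows, where the symbols $a_i$, $b_i$, $\bar{A}$ and $\bar{B}$ (the individual values and the means) are notation chosen for this note:

$$
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{A})^2}\;\sqrt{\sum_{i=1}^{n} (b_i - \bar{B})^2}}
$$

The value lies between -1 and 1. A value near 1 or -1 indicates a strong linear relationship, so one of the two attributes may be redundant; a value near 0 indicates no linear relationship.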
(3) Covariance for numeric data
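The article does not spell out a formula here either. The sample covariance, which is what pandas' `Series.cov` computes by default, can be written as follows, using the same assumed notation ($a_i$, $b_i$ for the values and $\bar{A}$, $\bar{B}$ for the means):

$$
\operatorname{Cov}(A, B) = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})
$$

Some textbooks divide by $n$ instead of $n-1$, which gives the population covariance. The correlation coefficient is the covariance divided by the product of the two standard deviations, $r_{A,B} = \operatorname{Cov}(A,B) / (\sigma_A \sigma_B)$, provided both are computed with the same divisor.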
import pandas as pd
import numpy as np
a=[47, 83, 81, 18, 72, 41, 50, 66, 47, 20, 96, 21, 16, 60, 37, 59, 22, 16, 32, 63]
b=[56, 96, 84, 21, 87, 67, 43, 64, 85, 67, 68, 64, 95, 58, 56, 75, 6, 11, 68, 63]
# Transpose (T) so that each list becomes a column
data = np.array([a,b]).T
dfab = pd.DataFrame(data,columns = ['A','B'])
# display(dfab)
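print('Covariance of attributes A and B:',dfab.A.cov(dfab.B))
print('Correlation coefficient of attributes A and B:',dfab.A.corr(dfab.B))
# Covariance of attributes A and B: 310.2157894736842
# Correlation coefficient of attributes A and B: 0.49924871046524394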
(1) Merging data with merge
merge(left,right,how = 'inner',on = None,left_on = None,right_on = None,left_index = False,right_index = False,sort = False,suffixes = ('_x','_y'),copy = True,indicator = False,validate = None)
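Parameter | Description |
---|---|
left | The left DataFrame in the merge |
right | The right DataFrame in the merge |
how | Join type: inner, left, right, or outer |
on | Column name(s) to join on; must exist in both DataFrames |
left_on | Column(s) of the left DataFrame to use as join keys |
right_on | Column(s) of the right DataFrame to use as join keys |
left_index | Use the row index of the left DataFrame as the join key |
right_index | Use the row index of the right DataFrame as the join key |
sort | Sort the merged result by the join keys; defaults to False |
suffixes | Suffixes appended to overlapping column names; defaults to ('_x','_y') |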
Default merge behavior: with no extra arguments, merge performs an inner join on the columns the two DataFrames have in common.
price = pd.DataFrame({'fruit':['apple','grape','orange','orange'],'price':[8,7,9,11]})
amount = pd.DataFrame({'fruit':['apple','grape','orange'],'amount':[5,11,8]})
display(price,amount,pd.merge(price,amount))
# fruit price
# 0 apple 8
# 1 grape 7
# 2 orange 9
# 3 orange 11
# fruit amount
# 0 apple 5
# 1 grape 11
# 2 orange 8
# fruit price amount
# 0 apple 8 5
# 1 grape 7 11
# 2 orange 9 8
# 3 orange 11 8
Specifying the key columns for the merge.
display(pd.merge(price,amount,left_on = 'fruit',right_on = 'fruit'))
# fruit price amount
# 0 apple 8 5
# 1 grape 7 11
# 2 orange 9 8
# 3 orange 11 8
Left join.
display(pd.merge(price,amount,how = 'left'))
# fruit price amount
# 0 apple 8 5
# 1 grape 7 11
# 2 orange 9 8
# 3 orange 11 8
Right join.
display(pd.merge(price,amount,how = 'right'))
# fruit price amount
# 0 apple 8 5
# 1 grape 7 11
# 2 orange 9 8
# 3 orange 11 8
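In the left and right joins above, both tables contain the same set of keys, so the results look identical to the inner join. The sketch below uses slightly different, made-up data (a `kiwi` row that exists only on the left) to show an outer join, together with the `indicator` parameter from the signature, which records where each row came from.

```python
import pandas as pd

# Hypothetical data: 'kiwi' appears only in price, 'orange' only in amount.
price = pd.DataFrame({'fruit': ['apple', 'grape', 'kiwi'], 'price': [8, 7, 6]})
amount = pd.DataFrame({'fruit': ['apple', 'grape', 'orange'], 'amount': [5, 11, 8]})

# how='outer' keeps keys from both sides and fills the gaps with NaN.
# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = pd.merge(price, amount, how='outer', indicator=True)
print(outer)

# Map each fruit to its origin for a quick look.
origin = outer.set_index('fruit')['_merge'].astype(str).to_dict()
print(origin)
```

Rows marked `left_only` or `right_only` are the ones that an inner join would have dropped, which makes `indicator` a handy check before choosing a join type.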
Merging on multiple keys.
left = pd.DataFrame({'key1':['one','one','two'],'key2':['a','b','a'],'value1':range(3)})
right = pd.DataFrame({'key1':['one','one','two','two'],'key2':['a','a','a','b'],'value2':range(4)})
display(left,right,pd.merge(left,right,on = ['key1','key2'],how = 'left'))
# key1 key2 value1
# 0 one a 0
# 1 one b 1
# 2 two a 2
# key1 key2 value2
# 0 one a 0
# 1 one a 1
# 2 two a 2
# 3 two b 3
# key1 key2 value1 value2
# 0 one a 0 0.0
# 1 one a 0 1.0
# 2 one b 1 NaN
# 3 two a 2 2.0
Using the suffixes parameter of merge: when the two DataFrames share a non-key column name, the suffixes tell the two copies apart.
print(pd.merge(left,right,on = 'key1'))
print(pd.merge(left,right,on = 'key1',suffixes = ('_left','_right')))
# key1 key2_x value1 key2_y value2
# 0 one a 0 a 0
# 1 one a 0 a 1
# 2 one b 1 a 0
# 3 one b 1 a 1
# 4 two a 2 a 2
# 5 two a 2 b 3
# key1 key2_left value1 key2_right value2
# 0 one a 0 a 0
# 1 one a 0 a 1
# 2 one b 1 a 0
# 3 one b 1 a 1
# 4 two a 2 a 2
# 5 two a 2 b 3
(2) Concatenating data with concat
Concatenating several Series.
s1 = pd.Series([0,1],index = ['a','b'])
s2 = pd.Series([2,3,4],index = ['a','d','e'])
s3 = pd.Series([5,6],index = ['f','g'])
print(pd.concat([s1,s2,s3]))
# a 0
# b 1
# a 2
# d 3
# e 4
# f 5
# g 6
# dtype: int64
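Note that the label `a` appears twice in the result above, because concat keeps the original index labels. Two common ways to deal with this are sketched below; the variable names `renumbered` and `labelled` are chosen for this illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['a', 'd', 'e'])

# ignore_index=True discards the original labels and renumbers from 0.
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered.index.tolist())       # [0, 1, 2, 3, 4]

# keys=... adds an outer index level recording which input each value came from,
# so ('first', 'a') and ('second', 'a') stay distinguishable.
labelled = pd.concat([s1, s2], keys=['first', 'second'])
print(labelled.loc[('second', 'a')])   # 2
```

Which option fits depends on whether the original labels carry meaning: `ignore_index` throws them away, while `keys` preserves them under a new outer level.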
Concatenating two DataFrames. Columns that exist in only one of the inputs are filled with NaN.
data1 = pd.DataFrame(np.arange(6).reshape(2,3),columns = list('abc'))
data2 = pd.DataFrame(np.arange(20,26).reshape(2,3),columns = list('ayz'))
data = pd.concat([data1,data2],axis = 0)
display(data1,data2,data)
# a b c
# 0 0 1 2
# 1 3 4 5
# a y z
# 0 20 21 22
# 1 23 24 25
# a b c y z
# 0 0 1.0 2.0 NaN NaN
# 1 3 4.0 5.0 NaN NaN
# 0 20 NaN NaN 21.0 22.0
# 1 23 NaN NaN 24.0 25.0
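The NaN values above come from concat's default `join='outer'`, which keeps the union of the columns. As a small sketch of the alternative, `join='inner'` keeps only the columns present in every input; the variable name `common` is chosen for this illustration.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('abc'))
data2 = pd.DataFrame(np.arange(20, 26).reshape(2, 3), columns=list('ayz'))

# join='inner' keeps only the shared column 'a', so no NaN is introduced.
# ignore_index=True renumbers the rows instead of repeating labels 0 and 1.
common = pd.concat([data1, data2], axis=0, join='inner', ignore_index=True)
print(list(common.columns))   # ['a']
print(common['a'].tolist())   # [0, 3, 20, 23]
```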
Specifying the index order.
import pandas as pd
s1 = pd.Series([0,1],index = ['a','b'])
s2 = pd.Series([2,3,4],index = ['a','d','e'])
s3 = pd.Series([5,6],index = ['f','g'])
s4 = pd.concat([s1*5,s3],sort = False)
s5 = pd.concat([s1,s4],axis=1,sort=False)
s6 = pd.concat([s1,s4],axis=1,join='inner',sort=False)
# join_axes was removed in pandas 1.0; reindex the result to set the row order instead
s7 = pd.concat([s1,s4],axis=1,join='inner',sort=False).reindex(['b','a'])
display(s6,s7)
# 0 1
# a 0 0
# b 1 5
# 0 1
# b 1 5
# a 0 0
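(3) Merging data with combine_first
Merging with combine_first, for the case where the two DataFrames have overlapping index labels. Values from the calling DataFrame take priority, and its missing entries are filled from the other one.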
s6.combine_first(s5)
# 0 1
# a 0.0 0.0
# b 1.0 5.0
# f NaN 5.0
# g NaN 6.0
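The same method exists on Series, where the patching behavior is easier to see. The following is a minimal sketch with made-up data; the names `primary` and `backup` are hypothetical.

```python
import numpy as np
import pandas as pd

# `primary` has a missing value at 'b'; `backup` covers 'a', 'b' and an extra 'd'.
primary = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
backup = pd.Series([10.0, 20.0, 40.0], index=['a', 'b', 'd'])

# Values in `primary` win wherever they are present; NaN entries and labels
# missing from `primary` are taken from `backup`.
patched = primary.combine_first(backup)
print(patched.to_dict())   # {'a': 1.0, 'b': 20.0, 'c': 3.0, 'd': 40.0}
```

The result index is the union of both indexes, which is why `d` appears even though `primary` never had it.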