Here I record an extremely silly mistake I made (ha, no words), and along the way discuss how to handle the NaN value.
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py:816: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison result = getattr(x, name)(y)
....................
TypeError: invalid type comparison
Here is a CSV table of coupon data:
- import numpy as np
- import pandas as pd
- dfoff = pd.read_csv("datalab/4901/ccf_offline_stage1_train.csv")
- dfofftest = pd.read_csv("datalab/4901/ccf_offline_stage1_test_revised.csv")
- dfoff.head()
----------------------------------------------------------------------------------------------------------------------------------------------------------------
In general, to count the rows where, say, Discount_rate is 20:1 and Distance is not 1.0, we can do this:
- dfoff.info()
- print('Number of rows:',dfoff[(dfoff['Discount_rate']=='20:1')&(dfoff['Distance']!=1.0)].shape[0])
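To make the filtering pattern easier to try without the competition files, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the coupon table; the values are invented.
toy = pd.DataFrame({
    "Discount_rate": ["20:1", "20:1", "150:20", "20:1"],
    "Distance": [1.0, 0.0, 1.0, 2.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and !=.
mask = (toy["Discount_rate"] == "20:1") & (toy["Distance"] != 1.0)

n_rows = int(mask.sum())  # counting True values is the same as toy[mask].shape[0]
print("Number of rows:", n_rows)
```

Both operands here are compared against a value of a matching type (a str column against a str, a float column against a float), which is exactly what goes wrong in the next attempt.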
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
So I wrote my filter like this:
- dfoff.info()
- print('Customers who received a coupon but did not use it to make a purchase:',dfoff[(dfoff['Coupon_id']!='NaN')&(dfoff['Date']=='NaN')].shape[0])
The result was an error:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id float64
Discount_rate object
Distance float64
Date_received float64
Date float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py:816: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = getattr(x, name)(y)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-c27c94978405> in <module>()
1 dfoff.info()
----> 2 print('Customers who received a coupon but did not use it to make a purchase:',dfoff[(dfoff['Coupon_id']!='NaN')&(dfoff['Date']=='NaN')].shape[0])
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
877
878 with np.errstate(all='ignore'):
--> 879 res = na_op(values, other)
880 if is_scalar(res):
881 raise TypeError('Could not compare {typ} type with Series'
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
816 result = getattr(x, name)(y)
817 if result is NotImplemented:
--> 818 raise TypeError("invalid type comparison")
819 except AttributeError:
820 result = op(x, y)
TypeError: invalid type comparison
The cause is actually simple. Look at the dtypes in the info() output above: Coupon_id and Date are both float64, yet the code compares them with dfoff['Coupon_id']!='NaN', and 'NaN' is a string!
print(type('NaN'))
<class 'str'>
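Comparing a float64 column with a str naturally fails in the pandas version used here. Comparing them directly like that was quite a blunder on my part, haha.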
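So the fix is to use pandas' built-in methods:
- dfoff.info()
- print('Customers who received a coupon but did not use it to make a purchase:',dfoff[(dfoff['Coupon_id'].notnull())&(dfoff['Date'].isnull())].shape[0])
That is, it uses the following two methods: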
- .notnull()
- .isnull()
Both test whether a value is missing. They work the same way if the missing entries in the CSV are written as null instead of NaN.
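As a small self-contained illustration of the two methods, the sketch below builds a four-row stand-in for the Coupon_id and Date columns. The `toy` table and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the float64 Coupon_id and Date columns.
toy = pd.DataFrame({
    "Coupon_id": [np.nan, 11002.0, 8591.0, 1078.0],
    "Date": [20160217.0, np.nan, np.nan, 20160301.0],
})

has_coupon = toy["Coupon_id"].notnull()  # True where a coupon id is present
no_purchase = toy["Date"].isnull()       # True where the purchase date is missing

unused = toy[has_coupon & no_purchase]
print(unused.shape[0])
```

Only the two middle rows have a coupon id but no purchase date, so those are the rows that survive the filter.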
While we are here, this is how to replace NaN with something else, for example 0.0:
dfoff['Coupon_id']=dfoff['Coupon_id'].replace(np.nan, 0.0)
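pandas also has `fillna`, which is dedicated to this job. The sketch below, on an invented three-value Series, suggests that the two calls give the same result for this case.

```python
import numpy as np
import pandas as pd

# Invented values standing in for the Coupon_id column.
coupon = pd.Series([np.nan, 11002.0, 8591.0], name="Coupon_id")

via_replace = coupon.replace(np.nan, 0.0)  # the approach used above
via_fillna = coupon.fillna(0.0)            # the dedicated missing-value method

print(via_fillna.tolist())
print(via_replace.equals(via_fillna))
```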
-----------------------------------------------------------------------------------------------------------------------------------------------------------
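Now a few words about NaN itself. The name stands for "not a number", and it is usually discussed together with another special value, inf.
Similarity: both are special floating-point values rather than ordinary finite numbers.
Difference: inf represents infinity, for example the result of overflowing the floating-point range, whereas NaN is the result of an undefined operation (for example inf - inf), and pandas also uses it as its marker for a missing value.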
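The difference between the two is easy to poke at in plain Python. This is only a quick sketch using the standard `math` module; the variable names are my own.

```python
import math

inf = float("inf")
nan = float("nan")

overflowed = 1e308 * 10  # multiplying past the largest double yields inf
undefined = inf - inf    # an undefined operation yields nan

print(overflowed, undefined)
print(inf > 1e308)       # inf is ordered: it is larger than any finite float
print(nan > 0, nan < 0)  # nan is unordered: both ordering comparisons are False
print(math.isinf(overflowed), math.isnan(undefined))
```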
Suppose we have this piece of code:
- def ConvertRate(row):
-     if row.isnull():
-         return 0
-     elif ':' in str(row):
-         rows = str(row).split(':')
-         return 1.0-float(rows[1])/float(rows[0])
-     else:
-         return float(row)
- dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
- print(dfoff.head(3))
It raises an error:
- ---------------------------------------------------------------------------
- AttributeError Traceback (most recent call last)
- <ipython-input-3-0aa06185ee75> in <module>()
- 7 else:
- 8 return float(row)
- ----> 9 dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
- 10 print(dfoff.head(3))
-
- /opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
- 2549 else:
- 2550 values = self.asobject
- -> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
- 2552
- 2553 if len(mapped) and isinstance(mapped[0], Series):
-
- pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
-
- <ipython-input-3-0aa06185ee75> in ConvertRate(row)
- 1 def ConvertRate(row):
- ----> 2 if row.isnull():
- 3 return 0
- 4 elif ':' in str(row):
- 5 rows = str(row).split(':')
-
- AttributeError: 'float' object has no attribute 'isnull'
So what type is it, really?
- print(type(np.nan))
- print(type(np.inf))
- <class 'float'>
- <class 'float'>
'NaN' is just an ordinary string, whereas np.nan is a real NaN. So can we write it like this:
- def ConvertRate(row):
-     if row==np.nan:
-         return 0
-     elif ':' in str(row):
-         rows = str(row).split(':')
-         return 1.0-float(rows[1])/float(rows[0])
-     else:
-         return float(row)
- dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
- print(dfoff.head(3))
- User_id Merchant_id Coupon_id Discount_rate Distance Date_received \
- 0 1439408 2632 NaN NaN 0.0 NaN
- 1 1439408 4663 11002.0 150:20 1.0 20160528.0
- 2 1439408 2632 8591.0 20:1 0.0 20160217.0
-
- Date discount_rate
- 0 20160217.0 NaN
- 1 NaN 0.866667
- 2 NaN 0.950000
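As you can see, the value is still NaN rather than 0, so this is still wrong: the check never fires, and the value falls through to float(row), which is NaN again.
Let's try this instead:
- def ConvertRate(row):
-     if row==float('NaN'):
-         return 0
-     elif ':' in str(row):
-         rows = str(row).split(':')
-         return 1.0-float(rows[1])/float(rows[0])
-     else:
-         return float(row)
- dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
- print(dfoff.head(3))
The result is the same as above. NaN is simply a special float value, and float('NaN') is just a conversion from the string to that same value, so the comparison still comes out False.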
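The reason both attempts fail can be reproduced in a few lines. This is a minimal sketch with variable names of my own choosing: under IEEE 754 floating-point rules, an equality test involving NaN is always False, no matter how the NaN was produced.

```python
import numpy as np

a = np.nan
b = float("NaN")

print(type(a), type(b))  # both are plain Python floats
print(a == b)            # two NaNs from different sources do not compare equal
print(a == a)            # a NaN does not even compare equal to itself
print(a == 0.0)          # nor to any ordinary number
```

So the branch guarded by `row==np.nan` or `row==float('NaN')` can never be taken.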
So what should we do? One way to test for NaN is:
row!=row
If the result is True, the value is NaN; if it is False, it is not.
Let's look at the result:
- def ConvertRate(row):
-     if row!=row:
-         return 0
-     elif ':' in str(row):
-         rows = str(row).split(':')
-         return 1.0-float(rows[1])/float(rows[0])
-     else:
-         return float(row)
- dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
- print(dfoff.head(3))
- User_id Merchant_id Coupon_id Discount_rate Distance Date_received \
- 0 1439408 2632 NaN NaN 0.0 NaN
- 1 1439408 4663 11002.0 150:20 1.0 20160528.0
- 2 1439408 2632 8591.0 20:1 0.0 20160217.0
-
- Date discount_rate
- 0 20160217.0 0.000000
- 1 NaN 0.866667
- 2 NaN 0.950000
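The `row!=row` trick works, but it can puzzle readers who have not seen it before. As an alternative sketch, `pd.isna` accepts scalars as well as Series, so it can be used inside the function passed to `apply`. The name `convert_rate` and the sample values below are hypothetical, not part of the original code.

```python
import numpy as np
import pandas as pd

def convert_rate(value):
    """Hypothetical variant of ConvertRate that uses pd.isna for the NaN check."""
    if pd.isna(value):
        return 0.0
    text = str(value)
    if ":" in text:
        # "150:20" means: spend 150, get 20 off
        threshold, reduction = text.split(":")
        return 1.0 - float(reduction) / float(threshold)
    return float(text)

# Invented sample mimicking the object-typed Discount_rate column.
rates = pd.Series([np.nan, "150:20", "20:1", "0.95"], name="Discount_rate")
converted = rates.apply(convert_rate)
print(converted.tolist())
```

For a plain float, `math.isnan(value)` would also do, but it raises a TypeError on strings such as "20:1", which is why `pd.isna` is the more convenient choice for a mixed column like this one.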
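So my original problem can also be solved this way:
print('Customers who received a coupon but did not use it to make a purchase:',dfoff[(dfoff['Coupon_id']==dfoff['Coupon_id'])&(dfoff['Date']!=dfoff['Date'])].shape[0])
Customers who received a coupon but did not use it to make a purchase: 977900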
---------------------------------------------------------------------------------------------------------------------------------------------------------------
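apply sometimes raises an error when it is called on a whole DataFrame, because by default the function receives one column at a time. Passing axis = 1 makes it receive one row at a time instead. Note that Series.apply has no axis parameter (extra keyword arguments are forwarded to the function itself), so the column-level call used above:
dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
should stay as it is. The row-wise DataFrame equivalent would be:
dfoff['discount_rate'] = dfoff.apply(lambda r: ConvertRate(r['Discount_rate']), axis = 1)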
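To see what the function actually receives in each case, here is a small sketch on an invented two-row table. With `Series.apply` every call gets a single scalar, and `axis` is not one of its parameters; with `DataFrame.apply(..., axis=1)` every call gets a whole row as a Series.

```python
import pandas as pd

# Invented two-row stand-in for the coupon table.
toy = pd.DataFrame({
    "Discount_rate": ["150:20", "20:1"],
    "Distance": [1.0, 0.0],
})

# Series.apply: the function is called once per element, with a scalar.
scalar_types = toy["Discount_rate"].apply(lambda v: type(v).__name__)

# DataFrame.apply with axis=1: the function is called once per row, with a Series.
row_types = toy.apply(lambda row: type(row).__name__, axis=1)

print(scalar_types.tolist())
print(row_types.tolist())
```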
------------------------------------------------------------------------------------------------------------------------------------------------------------
So, to summarize:
------------------------------------------------------------------------------------------------------------------------------------------------------
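When loading data with pandas, we can actually control the data types. For example, we can have missing values come through as the text null rather than NaN, so that the column's dtype is object instead of float. Here is an example: https://blog.csdn.net/weixin_42001089/article/details/85013073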
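As a rough sketch of what that control can look like, the example below reads the same invented two-row CSV text twice. It assumes the `dtype` and `keep_default_na` options of `pd.read_csv` are the relevant knobs; the linked article may take a different route.

```python
import io
import pandas as pd

csv_text = "User_id,Coupon_id\n1,null\n2,11002\n"  # invented sample, not real data

# Default behaviour: "null" is treated as missing, so the column becomes float64 with NaN.
default = pd.read_csv(io.StringIO(csv_text))

# Read everything as text and switch off the default missing-value markers:
# "null" is kept as an ordinary string.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default["Coupon_id"].dtype, int(default["Coupon_id"].isnull().sum()))
print(as_text["Coupon_id"].tolist())
```

With the second form, a comparison such as `as_text['Coupon_id'] != 'null'` is a string-to-string comparison, which is what the very first attempt in this post was implicitly assuming.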