(1) A Kaggle project: predict the total duration of taxi trips in New York City.
(2) Engineer additional useful features from the raw data to improve prediction accuracy.
(3) Using the engineered features, explore how NYC taxi order volume changes over time and where trips cluster geographically.
(4) Result: an xgboost model achieves a validation RMSE of 0.35568 on the log-transformed target, with slight overfitting.
(5) Kaggle competition link: https://www.kaggle.com/c/nyc-taxi-trip-duration/overview
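For context, the competition is scored with the root mean squared logarithmic error (RMSLE), which is why the trip duration is log-transformed before modeling; the RMSE quoted above is computed on the log-scale target and so corresponds to this metric:

RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}

where p_i is the predicted and a_i the actual duration of trip i.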
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime,date
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.cluster import KMeans,MiniBatchKMeans
import warnings
warnings.filterwarnings('ignore')
# Read the datasets
train = pd.read_csv('train.csv', parse_dates=['pickup_datetime'])
test = pd.read_csv('test.csv', parse_dates=['pickup_datetime'])
train.shape,test.shape
((1458644, 11), (625134, 9))
The training set has about 1.46 million rows and the test set about 630 thousand. The training set has 2 more columns than the test set. Which 2 columns are they?
print([i for i in train.columns if i not in test.columns])
['dropoff_datetime', 'trip_duration']
These are the two extra columns: dropoff time and trip duration. Trip duration is the dependent variable we need to predict, and the test set naturally has no dropoff time (if it did, there would be nothing left to predict).
# Read the 2016 NYC holiday calendar and parse the dates
holiday = pd.read_csv('NYC_2016Holidays.csv', sep=';')
holiday['Date'] = holiday['Date'].apply(lambda x: x + ' 2016')
holidays = [datetime.strptime(holiday.loc[i, 'Date'], '%B %d %Y').date() for i in range(len(holiday))]
holidays
[datetime.date(2016, 1, 1),
datetime.date(2016, 1, 18),
datetime.date(2016, 2, 12),
datetime.date(2016, 2, 15),
datetime.date(2016, 5, 8),
datetime.date(2016, 5, 30),
datetime.date(2016, 6, 19),
datetime.date(2016, 7, 4),
datetime.date(2016, 9, 5),
datetime.date(2016, 10, 10),
datetime.date(2016, 11, 11),
datetime.date(2016, 11, 24),
datetime.date(2016, 12, 26),
datetime.date(2016, 7, 4),
datetime.date(2016, 11, 8)]
# Read the fastest-route data for the training set
fast_route1 = pd.read_csv('fastest_routes_train_part_1.csv',
usecols=['id', 'total_distance','total_travel_time'])
fast_route2 = pd.read_csv('fastest_routes_train_part_2.csv',
usecols=['id', 'total_distance','total_travel_time'])
fast_route = pd.concat((fast_route1, fast_route2))
# Read the fastest-route data for the test set
fast_route_test = pd.read_csv('fastest_routes_test.csv',
                              usecols=['id', 'total_distance', 'total_travel_time'])
fast_route.head()
4. Weather data
weather = pd.read_csv('KNYC_Metars.csv', parse_dates=['Time'])
weather.head()
The dataset contains no null values; next, examine the data distribution.
train.info()
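The no-null claim can also be verified column by column; a minimal check:

# Count missing values per column (all zeros for the raw training set)
train.isnull().sum()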
trip_duration contains outliers: the maximum trip duration is 3,526,282 seconds, roughly 980 hours, and driving non-stop for 980 hours is clearly impossible.
First inspect the distribution with a box plot.
train.describe()
sns.boxplot(train['trip_duration'])
As the figure shows, the box is compressed beyond visibility, and several extremely large outliers stand apart from it.
To bring the distribution back to normal, these outliers need to be removed.
As the next figure shows, after removal the box plot recovers somewhat, though quite a few outliers remain.
If you need cleaner data, you can consider another round of outlier removal.
# Remove the anomalous records (27,359 rows)
train = train[(train['trip_duration']!=1)&(train['trip_duration']!=3526282)]
time_mean = np.mean(train.trip_duration)
time_std = np.std(train.trip_duration)
train = train[(train['trip_duration']<=(time_mean + 3 * time_std)) & (train['trip_duration']>=(time_mean - 3 * time_std))]
# After removing the outliers, inspect the distribution again
sns.boxplot(train['trip_duration'])
4. Handling outliers in passenger count
Trips with 7, 8, or 9 passengers are very rare, so they are removed for model stability; trips with 0 passengers are removed as well.
train.passenger_count.value_counts(), test.passenger_count.value_counts()
plt.figure(figsize=(8, 6))
train.groupby(['passenger_count'])['passenger_count'].count().plot.bar(color='b', alpha=0.8)
plt.xlabel('Number of passenger')
plt.ylabel('Count')
plt.title('Count of passenger')
# Remove records with passenger counts of 0, 7, 8, or 9
train = train[train['passenger_count'].between(left=1, right=6)]
test = test[test['passenger_count'].between(left=1, right=6)]
train.shape, test.shape
((1458579, 11), (625109, 9))
(1) Extract time features: year, month, day, hour, minute, day of week, and date.
(2) Since demand differs between weekends, holidays, and workdays, derive two boolean features from the date: whether it is a weekend and whether it is a rest day.
(3) Fold minutes into the hour to form a single fractional pickup_time field.
(4) With the time features added, the feature count grows by 10.
(5) Based on the 24-hour order distribution shown later, split the day into 4 periods: morning peak, evening peak, daytime, and night. The plots show distinct demand in each period, matching everyday experience.
# Extract year, month, day, hour, minute, day of week, and date
for df in (train, test):
    df['year'] = df.pickup_datetime.dt.year
    df['month'] = df.pickup_datetime.dt.month
    df['day'] = df.pickup_datetime.dt.day
    df['hour'] = df.pickup_datetime.dt.hour
    df['minute'] = df.pickup_datetime.dt.minute
    df['dayofweek'] = df.pickup_datetime.dt.dayofweek
    df['date'] = df.pickup_datetime.dt.date

# From the date, flag weekends and rest days (holidays strongly affect taxi demand)
def is_rest_day(year, month, day, holidays):
    is_weekend = [None] * len(year)
    is_rest = [None] * len(year)
    i = 0
    for yy, mm, dd in zip(year, month, day):
        is_weekend[i] = date(yy, mm, dd).isoweekday() in (6, 7)
        is_rest[i] = is_weekend[i] or (date(yy, mm, dd) in holidays)
        i += 1
    return is_weekend, is_rest

weekend, rest_day = is_rest_day(train.year, train.month, train.day, holidays)
train['weekend'] = weekend
train['rest_day'] = rest_day
weekend, rest_day = is_rest_day(test.year, test.month, test.day, holidays)
test['weekend'] = weekend
test['rest_day'] = rest_day

# Fold minutes into the hour as a fractional pickup time
train['pickup_time'] = train.hour + train.minute / 60
test['pickup_time'] = test.hour + test.minute / 60   # bug fix: the original assigned train's values here

train.shape, test.shape
((1456467, 21), (625109, 19))
""" 工作日: 早高峰:7-9 晚高峰:17-21 白天:9-17 晚上:7点以前, 21点以后 非工作日: 白天:7-21 晚上:7点以前, 21点以后 """ for df in (train, test): df['hour_category'] = np.nan df.loc[(df.rest_day == False)&(df.hour>=7 )& (df.hour<=9), 'hour_category'] = 'morning_peak' df.loc[(df.rest_day == False)&(df.hour>=17 )& (df.hour<=21), 'hour_category'] = 'evening_peak' df.loc[(df.rest_day == False)&(df.hour>9 )& (df.hour<17), 'hour_category'] = 'day' df.loc[(df.rest_day == False)&(df.hour<7)|(df.hour>21), 'hour_category'] = 'night' df.loc[(df.rest_day == True)&(df.hour>=7 )& (df.hour<=21), 'hour_category'] = 'day' df.loc[(df.rest_day == True)&(df.hour<7)|(df.hour>21), 'hour_category'] = 'night'
Based on demand, the day is split into morning peak, evening peak, daytime, and night.
Next, add distance features to the datasets, in preparation for computing driving speed per cluster later.
# Add distance features to the datasets
train = train.join(fast_route.set_index('id'), on='id')
test = test.join(fast_route_test.set_index('id'), on='id')
train.shape, test.shape
((1456467, 24), (625109, 22))
3. Weather features
(1) Extract year, month, and day from the weather data, keep only 2016, and join each date/hour's weather onto the datasets to help the model account for driving speed on that day.
(2) Weather factors chosen for their effect on trip speed: temperature, visibility, precipitation, and whether it snowed or was foggy.
Extract the time features from the weather data, then filter to 2016 and the weather columns of interest.
weather['year'] = weather.Time.dt.year
weather['month'] = weather.Time.dt.month
weather['day'] = weather.Time.dt.day
weather['hour'] = weather.Time.dt.hour
# Flag snow events; note the Events labels in this dataset appear capitalized ('Snow')
weather['snow'] = 1*(weather.Events=='Snow') + 1*(weather.Events == 'Fog\n\t,\nSnow')
weather = weather[weather['year']==2016][['Temp.', 'Visibility', 'snow', 'month', 'day','hour', 'Precip']]
weather.head()
train = pd.merge(train, weather, on=['month', 'day', 'hour'], how='left')
test = pd.merge(test, weather, on=['month', 'day', 'hour'], how='left')
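Because these are left joins, any pickup hour absent from the weather table leaves NaNs in the weather columns. xgboost handles NaNs natively, but it is worth knowing how many there are; a minimal check:

# Count rows whose weather columns failed to match on (month, day, hour)
train[['Temp.', 'Visibility', 'snow', 'Precip']].isnull().sum()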
(1) Stack the pickup and dropoff longitude/latitude pairs vertically and cluster them with k-means into 8 groups.
(2) Use the trained k-means model to assign a cluster label to every sample in both datasets.
(3) Count the orders per pickup cluster and per dropoff cluster, and merge the counts back into the datasets.
(4) Order counts differ considerably between clusters. They were not standardized before modeling, which may affect the results; consider standardizing them first (see the sketch after the cluster-count merge below).
coords = np.vstack((train[['pickup_longitude', 'pickup_latitude']].values,
                    train[['dropoff_longitude', 'dropoff_latitude']].values,
                    test[['pickup_longitude', 'pickup_latitude']].values,
                    test[['dropoff_longitude', 'dropoff_latitude']].values))
# Fit MiniBatchKMeans on a random sample of 1M coordinate pairs
sample_ids = np.random.permutation(len(coords))[:1000000]
kmeans = MiniBatchKMeans(n_clusters=8, batch_size=10000).fit(coords[sample_ids])
train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_longitude', 'pickup_latitude']])
train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_longitude', 'dropoff_latitude']])
test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_longitude', 'pickup_latitude']])
test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_longitude', 'dropoff_latitude']])
len(train), len(test)
(1456467, 625109)
# Order counts per pickup cluster and per dropoff cluster (computed over train + test);
# naming the count columns avoids the default 0_x/0_y names after the merges
a = pd.concat([train, test]).groupby(['pickup_cluster']).size().reset_index(name='pickup_cluster_count')
b = pd.concat([train, test]).groupby(['dropoff_cluster']).size().reset_index(name='dropoff_cluster_count')
train = pd.merge(train, a, on=['pickup_cluster'], how='left')
train = pd.merge(train, b, on=['dropoff_cluster'], how='left')
test = pd.merge(test, a, on=['pickup_cluster'], how='left')
test = pd.merge(test, b, on=['dropoff_cluster'], how='left')
train.shape, test.shape
((1456467, 32), (625109, 30))
The feature count grows further.
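As flagged in point (4) earlier, the new cluster-count columns sit on a much larger scale than the other features. Tree-based models such as xgboost are largely insensitive to feature scale, so this step is optional, but a minimal standardization sketch (column names as created above) would be:

from sklearn.preprocessing import StandardScaler

count_cols = ['pickup_cluster_count', 'dropoff_cluster_count']
scaler = StandardScaler().fit(train[count_cols])         # fit on train only to avoid leakage
train[count_cols] = scaler.transform(train[count_cols])
test[count_cols] = scaler.transform(test[count_cols])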
# Average speed per pickup cluster and per dropoff cluster
train['speed'] = train['total_distance'] / train['total_travel_time']
pickup_speed = train[['speed', 'pickup_cluster']].groupby('pickup_cluster').mean().reset_index()
pickup_speed.rename(columns={'speed': 'avg_pickup_speed'}, inplace=True)
dropoff_speed = train[['speed', 'dropoff_cluster']].groupby('dropoff_cluster').mean().reset_index()
dropoff_speed.rename(columns={'speed': 'avg_dropoff_speed'}, inplace=True)

train = pd.merge(train, pickup_speed, on=['pickup_cluster'], how='left')
train = pd.merge(train, dropoff_speed, on=['dropoff_cluster'], how='left')
test = pd.merge(test, pickup_speed, on=['pickup_cluster'], how='left')
test = pd.merge(test, dropoff_speed, on=['dropoff_cluster'], how='left')

train.drop('speed', axis=1, inplace=True)
train.shape, test.shape
((1456467, 34), (625109, 32))
# Use log1p so the inverse transform exp(pred) - 1 applied at prediction time
# is exact, and to match the competition's RMSLE metric
train['log_trip_duration'] = np.log1p(train['trip_duration'])
train['store_and_fwd_flag'] = 1 * (train['store_and_fwd_flag'] == 'Y')
test['store_and_fwd_flag'] = 1 * (test['store_and_fwd_flag'] == 'Y')
Trip durations span a huge range; to reduce the influence of that spread, the target is log-transformed.
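A quick way to see the effect is to compare the skewness of the raw and log targets; a minimal sketch (assumes scipy is available):

from scipy.stats import skew

# The raw durations are strongly right-skewed; the log target is far more symmetric
print('raw skew: %.2f' % skew(train['trip_duration']))
print('log skew: %.2f' % skew(train['log_trip_duration']))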
# Monthly change in order volume
plt.figure(figsize=(10, 5))
plt.plot(train.groupby(['month'])['id'].count(), 'go--', linewidth=2, markersize=12)
plt.xlabel('Month')
plt.ylabel('Trip of orders')
plt.title('Monthly change of orders')
plt.show()
(2) Order volume by date:
(1) Over the time series, order volume fluctuates around a stable level.
(2) The data show some periodicity: orders climb for several consecutive days and then fall back, presumably tied to the workday/weekend cycle.
(3) Two anomalously low points stand out, where order volume collapses; inspection shows they are Jan 24 (plus the days on either side) and May 30.
(4) January's unusually low total is probably tied to the anomaly around the 24th; the sharp drop for June in the monthly plot above corresponds to a month-long downward trend in the daily series.
2016-01-23 1644
2016-01-24 3376
2016-01-25 6076
2016-05-30 5564
# Order volume over the time series
plt.figure(figsize=(10, 6))
plt.plot(train.groupby(['date'])['id'].count(), 'ro-', linewidth=1, markersize=5, alpha=0.8)
plt.xlabel('Date')
plt.ylabel('Trip of orders')
plt.title('Daily change of orders')
plt.show()
# Orders on Jan 23, 24, 25 and May 30 are anomalously low
# Flag daily order counts outside 1.5 * IQR
def outlier(data):
    iqr = np.quantile(data, 0.75) - np.quantile(data, 0.25)
    upper = np.quantile(data, 0.75) + 1.5 * iqr
    lower = np.quantile(data, 0.25) - 1.5 * iqr
    outlier_data = data[(data > upper) | (data < lower)]
    return outlier_data
outlier(train.groupby(['date'])['id'].count())
(3) Order volume by day of month:
a. Daily order volume within a month fluctuates but stays fairly stable;
b. after the 20th, volume contracts and trends downward;
c. the 30th is lower probably because February lacks that day, and the 31st is anomalously low likely because three of the six months in the data have no 31st (see the quick check after the plot below); other causes cannot be ruled out.
# Order volume by day of month
plt.figure(figsize=(10, 6))
plt.plot(train.groupby(['day'])['id'].count(), 'bo-', linewidth=1, markersize=5, alpha=0.8)
plt.xlabel('Day')
plt.ylabel('Trip of orders')
plt.title('Day change of orders')
plt.show()
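The day-of-month coverage behind point (c) can be checked directly; a quick sketch with the standard library (the data cover January through June 2016):

import calendar

# How many of the six months in the data contain each late-month day?
for d in (29, 30, 31):
    n = sum(1 for m in range(1, 7) if calendar.monthrange(2016, m)[1] >= d)
    print('day %d occurs in %d of 6 months' % (d, n))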
(4) Orders vs. day of week:
a. Overall, workday orders exceed weekend orders, confirming the periodic fluctuation seen in the time series;
b. demand drops quickly on weekends, reaching its low on Sunday. Possible reasons:
(a) commuters in a hurry on workdays create inherent demand;
(b) business trips taken by taxi;
(c) after-work dinners and group outings;
(d) lower willingness to travel on rest days;
(e) local cultural factors such as weekend church attendance reduce travel.
# Orders vs. day of week
plt.figure(figsize=(10, 6))
plt.plot(train.groupby(['dayofweek'])['id'].count(), 'g*-', linewidth=1, markersize=8, alpha=0.8)
plt.xlabel('Dayofweek')
plt.ylabel('Trip of orders')
plt.title('Dayofweek change of orders')
plt.show()
trip_week = (train[['trip_duration', 'month', 'dayofweek', 'day', 'hour']]
             .groupby(['month', 'dayofweek', 'day', 'hour'])['trip_duration']
             .agg(['mean', 'count']).reset_index())
trip_week.rename(columns={'mean': 'week_mean_trip_dur', 'count': 'week_trip_order'}, inplace=True)
plt.figure(figsize=(15, 8))
sns.swarmplot(x='dayofweek', y='week_trip_order', hue='month', data=trip_week)
plt.title('Dayofweek change of orders')
plt.show()
(5) Order volume across the 24 hours of the day:
Overall, four periods stand out clearly: morning peak (7:00-9:00), evening peak (17:00-21:00), daytime (9:00-17:00), and night (before 7:00 and after 21:00).
# Order volume across the 24 hours of the day
plt.figure(figsize=(12, 6))
sns.stripplot(x='hour', y='week_trip_order', data=trip_week)
plt.xlabel('Hour')
plt.ylabel('Trip of orders')
plt.title('Hourly change of orders')
plt.show()
(6) Geographic distribution of orders:
(1) Orders concentrate in Manhattan, with parts of Brooklyn and Queens also seeing many orders;
(2) demand is also high around the airports, JFK and LaGuardia.
# Stack pickup and dropoff coordinates and clip to the NYC bounding box
longitude = list(train.pickup_longitude) + list(train.dropoff_longitude)
latitude = list(train.pickup_latitude) + list(train.dropoff_latitude)
print('The length of train.pickup_longitude', len(train.pickup_longitude))
print('The length of train.dropoff_longitude', len(train.dropoff_longitude))
print('The length of train.pickup_latitude', len(train.pickup_latitude))
print('The length of train.dropoff_latitude', len(train.dropoff_latitude))
print('The length of longitude', len(longitude))
print('The length of latitude', len(latitude))

loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude

long_lim = [-74.03, -73.77]
lat_lim = [40.63, 40.85]
print(loc_df.shape)
loc_df = loc_df[loc_df['longitude'].between(left=long_lim[0], right=long_lim[1])]
loc_df = loc_df[loc_df['latitude'].between(left=lat_lim[0], right=lat_lim[1])]
The length of train.pickup_longitude 1456467
The length of train.dropoff_longitude 1456467
The length of train.pickup_latitude 1456467
The length of train.dropoff_latitude 1456467
The length of longitude 2912934
The length of latitude 2912934
kmeans = KMeans(n_clusters=15, n_init=10, random_state=123).fit(loc_df)
loc_df['labels'] = kmeans.labels_
loc_df.head()
plt.figure(figsize=(10, 6))
for label in loc_df.labels.unique():
plt.plot(loc_df.longitude[loc_df['labels']==label], loc_df.latitude[loc_df['labels']==label],
'.', markersize=0.3, alpha=0.3)
plt.title('Cluster of New York')
plt.show()
Categorical features such as month, day, and hour carry no ordinal meaning; to avoid spurious ordering effects, one-hot (dummy) encode them.
# One-hot encode the categorical features for both datasets
vendor_id_train = pd.get_dummies(train.vendor_id, prefix='vi', prefix_sep='_')
store_and_fwd_flag_train = pd.get_dummies(train.store_and_fwd_flag, prefix='flag', prefix_sep='_')
pickup_cluster_train = pd.get_dummies(train.pickup_cluster, prefix='pc', prefix_sep='_')
dropoff_cluster_train = pd.get_dummies(train.dropoff_cluster, prefix='dc', prefix_sep='_')
month_train = pd.get_dummies(train.month, prefix='m', prefix_sep='_')
day_train = pd.get_dummies(train.day, prefix='d', prefix_sep='_')
hour_train = pd.get_dummies(train.hour, prefix='h', prefix_sep='_')
dayofweek_train = pd.get_dummies(train.dayofweek, prefix='dw', prefix_sep='_')
hour_category_train = pd.get_dummies(train.hour_category, prefix='hc', prefix_sep='_')

vendor_id_test = pd.get_dummies(test.vendor_id, prefix='vi', prefix_sep='_')
store_and_fwd_flag_test = pd.get_dummies(test.store_and_fwd_flag, prefix='flag', prefix_sep='_')
pickup_cluster_test = pd.get_dummies(test.pickup_cluster, prefix='pc', prefix_sep='_')
dropoff_cluster_test = pd.get_dummies(test.dropoff_cluster, prefix='dc', prefix_sep='_')
month_test = pd.get_dummies(test.month, prefix='m', prefix_sep='_')
day_test = pd.get_dummies(test.day, prefix='d', prefix_sep='_')
hour_test = pd.get_dummies(test.hour, prefix='h', prefix_sep='_')
dayofweek_test = pd.get_dummies(test.dayofweek, prefix='dw', prefix_sep='_')
hour_category_test = pd.get_dummies(test.hour_category, prefix='hc', prefix_sep='_')

# Drop the original (now encoded or unused) columns
train = train.drop(['id', 'vendor_id', 'dropoff_datetime', 'store_and_fwd_flag',
                    'trip_duration', 'year', 'month', 'day', 'hour', 'minute',
                    'dayofweek', 'date', 'pickup_time', 'hour_category',
                    'pickup_cluster', 'dropoff_cluster'], axis=1)
Test_id = test['id']   # keep the ids for the submission file (typo fix: was Teat_id)
test = test.drop(['id', 'vendor_id', 'store_and_fwd_flag', 'year', 'month',
                  'day', 'hour', 'minute', 'dayofweek', 'date', 'pickup_time',
                  'hour_category', 'pickup_cluster', 'dropoff_cluster'], axis=1)
Train_master = pd.concat([train, vendor_id_train, store_and_fwd_flag_train,
                          pickup_cluster_train, dropoff_cluster_train,
                          month_train, day_train, hour_train,
                          dayofweek_train, hour_category_train], axis=1)
Test_master = pd.concat([test, vendor_id_test, store_and_fwd_flag_test,
                         pickup_cluster_test, dropoff_cluster_test,
                         month_test, day_test, hour_test,
                         dayofweek_test, hour_category_test], axis=1)
Train_master = Train_master.drop(['pickup_datetime'], axis=1)
Test_master = Test_master.drop(['pickup_datetime'], axis=1)
Train_master.shape, Test_master.shape
((1456467, 110), (625109, 109))
After this series of feature additions and dummy encodings, the training set has grown to 110 features (including the target).
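Dummy encoding can leave train and test with mismatched columns (a category value present in only one set), which would break prediction. A minimal sanity check before building the DMatrix objects (a sketch, not part of the original pipeline):

# Feature columns the model will see (everything except the target)
feature_cols = [c for c in Train_master.columns if c != 'log_trip_duration']
missing = set(feature_cols) - set(Test_master.columns)
assert not missing, 'columns absent from test: %s' % missing
Test_master = Test_master[feature_cols]   # enforce identical column order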
# Hold out 1% of the training data for validation
Train, Test = train_test_split(Train_master, test_size=0.01)
X_train = Train.drop(['log_trip_duration'], axis=1)
Y_train = Train['log_trip_duration']
X_test = Test.drop(['log_trip_duration'], axis=1)
Y_test = Test['log_trip_duration']
Y_train = Y_train.reset_index().drop(['index'], axis=1)
Y_test = Y_test.reset_index().drop(['index'], axis=1)

dtrain = xgb.DMatrix(X_train, label=Y_train)
dvalid = xgb.DMatrix(X_test, label=Y_test)
dtest = xgb.DMatrix(Test_master)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
No hyperparameter search was done in this article; if your machine can afford it, searching for better parameters will improve the model (a sketch of a simple search follows the parameter block below).
xgb_params = {
    'objective': 'reg:linear',   # fix: the original key 'object' is silently ignored by xgboost
    'learning_rate': 0.05,
    'max_depth': 7,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,
    'silent': 1,
    'reg_alpha': 1
}
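As noted above, no parameter search was run; a minimal sketch of how one could be done with xgboost's built-in cross-validation (the grid values here are illustrative, not tuned):

# Try a few max_depth values with 3-fold CV and keep the best
best = (None, float('inf'))
for depth in (5, 7, 9):
    params = dict(xgb_params, max_depth=depth)
    cv = xgb.cv(params, dtrain, num_boost_round=200, nfold=3,
                early_stopping_rounds=10, seed=123)
    score = cv['test-rmse-mean'].min()
    if score < best[1]:
        best = (depth, score)
print('best max_depth: %d (cv rmse %.5f)' % best)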
3. Build the model
model = xgb.train(xgb_params, dtrain, 500, watchlist, early_stopping_rounds=10,
                  maximize=False, verbose_eval=1)
print('RMSLE of modeling is %.5f' % model.best_score)
4. Feature importance
plt.figure(figsize=(10, 8))
xgb.plot_importance(model, max_num_features=20, height=1)
5. Prediction
pred = model.predict(dtest)
pred = np.expm1(pred)   # inverse of log1p, back to seconds
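To produce a Kaggle submission, pair the predictions with the test ids saved earlier (a minimal sketch; the column names follow the competition's sample submission):

submission = pd.DataFrame({'id': Test_id, 'trip_duration': pred})
submission.to_csv('submission.csv', index=False)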
Copyright notice: this is an original article by CSDN blogger "SophiaSSSSS", released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_44216391/article/details/90115972