1. Project Background
This dataset captures a range of customer behaviors on a travel website. Analyzing it helps us understand users' travel habits, preferences, and how they interact with travel content, which is valuable for the site's marketing, user-experience optimization, and new-service development. With these insights, a travel company can meet customer needs more effectively, improve service quality, and strengthen user engagement and loyalty.
This project examines users from three angles (purchase behavior analysis, cluster analysis, and a random forest model) and identifies the main factors that influence ticket purchases.
2. Data Description
| Variable | Description |
| --- | --- |
| UserID | Unique user ID |
| Taken_product | Whether the user buys a flight ticket next month (target variable) |
| Yearly_avg_view_on_travel_page | Average yearly views of travel-related pages |
| preferred_device | Preferred login device |
| total_likes_on_outstation_checkin_given | Total likes the user gave to outstation check-ins over the past year |
| yearly_avg_Outstation_checkins | Average number of outstation check-ins per year |
| member_in_family | Total family members mentioned in the user's account |
| preferred_location_type | Preferred type of travel destination |
| Yearly_avg_comment_on_travel_page | Average yearly comments on travel-related pages |
| total_likes_on_outofstation_checkin_received | Total likes the user received on outstation check-ins over the past year |
| week_since_last_outstation_checkin | Weeks since the user's last outstation check-in update |
| following_company_page | Whether the customer follows the company page (yes or no) |
| montly_avg_comment_on_company_page | Average monthly comments on the company page |
| working_flag | Whether the customer is working |
| travelling_network_rating | Rating of whether the user has close friends who like to travel; 1 is highest, 4 is lowest |
| Adult_flag | The customer's age status (since the values are 0-3, I suspect it encodes an age bracket rather than a simple adult/non-adult flag) |
| Daily_Avg_mins_spend_on_traveling_page | Average daily minutes spent on the company's travel page |

3. Importing Python Libraries and Reading the Data
In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
In [2]:
# Read the data
data = pd.read_csv("/home/mw/input/data5466/Customer behaviour Tourism.csv")
4. Data Preview and Processing
4.1 Data Preview
In [3]:
# Check the dimensions of the data
data.shape
(11760, 17)
In [4]:
# Check the data's summary information
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 17 columns):
UserID                                          11760 non-null int64
Taken_product                                   11760 non-null object
Yearly_avg_view_on_travel_page                  11179 non-null float64
preferred_device                                11707 non-null object
total_likes_on_outstation_checkin_given         11379 non-null float64
yearly_avg_Outstation_checkins                  11685 non-null object
member_in_family                                11760 non-null object
preferred_location_type                         11729 non-null object
Yearly_avg_comment_on_travel_page               11554 non-null float64
total_likes_on_outofstation_checkin_received    11760 non-null int64
week_since_last_outstation_checkin              11760 non-null int64
following_company_page                          11657 non-null object
montly_avg_comment_on_company_page              11760 non-null int64
working_flag                                    11760 non-null object
travelling_network_rating                       11760 non-null int64
Adult_flag                                      11759 non-null float64
Daily_Avg_mins_spend_on_traveling_page          11759 non-null float64
dtypes: float64(5), int64(5), object(7)
memory usage: 1.5+ MB
In [5]:
# Count missing values per column
data.isna().sum()
UserID                                            0
Taken_product                                     0
Yearly_avg_view_on_travel_page                  581
preferred_device                                 53
total_likes_on_outstation_checkin_given         381
yearly_avg_Outstation_checkins                   75
member_in_family                                  0
preferred_location_type                          31
Yearly_avg_comment_on_travel_page               206
total_likes_on_outofstation_checkin_received      0
week_since_last_outstation_checkin                0
following_company_page                          103
montly_avg_comment_on_company_page                0
working_flag                                      0
travelling_network_rating                         0
Adult_flag                                        1
Daily_Avg_mins_spend_on_traveling_page            1
dtype: int64
In [6]:
# Check for duplicate rows
data.duplicated().sum()
0
4.2 Data Processing
In [7]:
# Drop rows with missing values
data.dropna(inplace=True)
In [8]:
# Re-check the missing values
data.isna().sum()
UserID                                          0
Taken_product                                   0
Yearly_avg_view_on_travel_page                  0
preferred_device                                0
total_likes_on_outstation_checkin_given         0
yearly_avg_Outstation_checkins                  0
member_in_family                                0
preferred_location_type                         0
Yearly_avg_comment_on_travel_page               0
total_likes_on_outofstation_checkin_received    0
week_since_last_outstation_checkin              0
following_company_page                          0
montly_avg_comment_on_company_page              0
working_flag                                    0
travelling_network_rating                       0
Adult_flag                                      0
Daily_Avg_mins_spend_on_traveling_page          0
dtype: int64
In [9]:
# Inspect the unique values of selected features (the raw data is messy)
characteristic = ['Taken_product', 'preferred_device', 'yearly_avg_Outstation_checkins', 'member_in_family',
                  'preferred_location_type', 'following_company_page', 'working_flag']
for i in characteristic:
    print(f'{i}:')
    print(data[i].unique())
    print('-' * 50)
Taken_product:
['Yes' 'No']
--------------------------------------------------
preferred_device:
['iOS and Android' 'iOS' 'ANDROID' 'Android' 'Android OS' 'Other' 'Others' 'Tab' 'Laptop' 'Mobile']
--------------------------------------------------
yearly_avg_Outstation_checkins:
['1' '23' '16' '26' '19' '24' '21' '11' '15' '10' '25' '12' '18' '29' '22' '20' '28' '14' '27' '13' '17' '*' '5' '8' '2' '3' '9' '7' '6' '4']
--------------------------------------------------
member_in_family:
['2' '1' '4' '3' 'Three' '5' '10']
--------------------------------------------------
preferred_location_type:
['Financial' 'Other' 'Medical' 'Game' 'Entertainment' 'Social media' 'Tour and Travel' 'Movie' 'OTT' 'Tour Travel' 'Beach' 'Historical site' 'Big Cities' 'Trekking' 'Hill Stations']
--------------------------------------------------
following_company_page:
['Yes' 'No' '1' '0']
--------------------------------------------------
working_flag:
['No' 'Yes']
--------------------------------------------------
**From the output above:
1. The preferred login device contains both Other and Others, which should be unified as Others; ANDROID, Android, and Android OS should be unified as Android. Tab means a tablet and Mobile presumably also means a mobile phone, so strictly these should be cleaned up as well, but since they may refer to special systems they are left as-is here.
2. The average yearly outstation check-ins contain a '*' symbol; rows with this anomalous value are dropped and the column is converted to int.
3. The family-member column contains 'Three'; it is replaced with 3 and the column is converted to int.
4. In the preferred travel location type, Tour and Travel and Tour Travel are the same category; they are unified as Tour and Travel.
5. Whether the customer follows the company page is recorded as Yes, No, 1, and 0; Yes is mapped to 1 and No to 0.
6. In the target variable (ticket purchase next month) and the working flag, Yes and No are likewise mapped to 1 and 0.**
In [10]:
# 1. Clean the preferred login device
data['preferred_device'] = data['preferred_device'].replace({'Other': 'Others', 'ANDROID': 'Android', 'Android OS': 'Android'})
# 2. Clean the average yearly outstation check-ins
data = data[data['yearly_avg_Outstation_checkins'] != '*']
data['yearly_avg_Outstation_checkins'] = data['yearly_avg_Outstation_checkins'].astype(int)
# 3. Clean the family-member count
data['member_in_family'] = data['member_in_family'].replace({'Three': '3'})
data['member_in_family'] = data['member_in_family'].astype(int)
# 4. Clean the preferred travel location type
data['preferred_location_type'] = data['preferred_location_type'].replace({'Tour Travel': 'Tour and Travel'})
# 5. Clean the company-page following flag
data['following_company_page'] = data['following_company_page'].replace({'Yes': '1', 'No': '0'})
# 6. Map Yes/No to 1/0 for the target variable and the working flag
data['Taken_product'] = data['Taken_product'].replace({'Yes': '1', 'No': '0'})
data['working_flag'] = data['working_flag'].replace({'Yes': '1', 'No': '0'})
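A pandas aside of mine, not from the original notebook: the boolean filter in step 2 returns a slice of `data`, so the column assignments that follow it can trigger a SettingWithCopyWarning in some pandas versions. A defensive variant takes an explicit copy:

```python
# Defensive variant of step 2 (a sketch, not the original code): copy after
# filtering so the later column assignments modify an owned frame, not a view.
data = data[data['yearly_avg_Outstation_checkins'] != '*'].copy()
data['yearly_avg_Outstation_checkins'] = data['yearly_avg_Outstation_checkins'].astype(int)
```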
In [11]:
# Convert UserID to string
data['UserID'] = data['UserID'].astype(str)
# Convert Taken_product to a categorical type
data['Taken_product'] = data['Taken_product'].astype('category')
# Convert following_company_page to a categorical type
data['following_company_page'] = data['following_company_page'].astype('category')
# Convert working_flag to a categorical type
data['working_flag'] = data['working_flag'].astype('category')
# Convert travelling_network_rating to a categorical type
data['travelling_network_rating'] = data['travelling_network_rating'].astype('category')
# Convert Adult_flag to a categorical type
data['Adult_flag'] = data['Adult_flag'].astype('category')
# Re-check the dtypes after the conversions
data.dtypes
UserID                                            object
Taken_product                                   category
Yearly_avg_view_on_travel_page                   float64
preferred_device                                  object
total_likes_on_outstation_checkin_given          float64
yearly_avg_Outstation_checkins                     int64
member_in_family                                   int64
preferred_location_type                           object
Yearly_avg_comment_on_travel_page                float64
total_likes_on_outofstation_checkin_received       int64
week_since_last_outstation_checkin                 int64
following_company_page                          category
montly_avg_comment_on_company_page                 int64
working_flag                                    category
travelling_network_rating                       category
Adult_flag                                      category
Daily_Avg_mins_spend_on_traveling_page           float64
dtype: object
In [12]:
# Preview the cleaned data
data.head(10)
| | UserID | Taken_product | Yearly_avg_view_on_travel_page | preferred_device | total_likes_on_outstation_checkin_given | yearly_avg_Outstation_checkins | member_in_family | preferred_location_type | Yearly_avg_comment_on_travel_page | total_likes_on_outofstation_checkin_received | week_since_last_outstation_checkin | following_company_page | montly_avg_comment_on_company_page | working_flag | travelling_network_rating | Adult_flag | Daily_Avg_mins_spend_on_traveling_page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1000001 | 1 | 307.0 | iOS and Android | 38570.0 | 1 | 2 | Financial | 94.0 | 5993 | 8 | 1 | 11 | 0 | 1 | 0.0 | 8.0 |
| 1 | 1000002 | 0 | 367.0 | iOS | 9765.0 | 1 | 1 | Financial | 61.0 | 5130 | 1 | 0 | 23 | 1 | 4 | 1.0 | 10.0 |
| 2 | 1000003 | 1 | 277.0 | iOS and Android | 48055.0 | 1 | 2 | Other | 92.0 | 2090 | 6 | 1 | 15 | 0 | 2 | 0.0 | 7.0 |
| 3 | 1000004 | 0 | 247.0 | iOS | 48720.0 | 1 | 4 | Financial | 56.0 | 2909 | 1 | 1 | 11 | 0 | 3 | 0.0 | 8.0 |
| 4 | 1000005 | 0 | 202.0 | iOS and Android | 20685.0 | 1 | 1 | Medical | 40.0 | 3468 | 9 | 0 | 12 | 0 | 4 | 1.0 | 6.0 |
| 5 | 1000006 | 0 | 240.0 | iOS | 35175.0 | 1 | 2 | Financial | 79.0 | 3068 | 0 | 0 | 13 | 0 | 3 | 0.0 | 8.0 |
| 8 | 1000009 | 0 | 285.0 | iOS | 7560.0 | 23 | 3 | Financial | 44.0 | 9526 | 0 | 0 | 21 | 1 | 2 | 0.0 | 10.0 |
| 10 | 1000011 | 0 | 262.0 | iOS and Android | 28315.0 | 16 | 3 | Medical | 84.0 | 2426 | 0 | 0 | 13 | 0 | 3 | 1.0 | 6.0 |
| 12 | 1000013 | 0 | 232.0 | iOS and Android | 23450.0 | 26 | 1 | Financial | 31.0 | 2911 | 1 | 0 | 17 | 0 | 4 | 1.0 | 5.0 |
| 13 | 1000014 | 0 | 255.0 | iOS and Android | 47110.0 | 19 | 2 | Medical | 93.0 | 2661 | 0 | 0 | 11 | 0 | 3 | 1.0 | 3.0 |

5. User Purchase Behavior Analysis
5.1 The Relationship Between Basic User Characteristics and Purchase Behavior
In [13]:
adult_flag_counts = pd.crosstab(data['Adult_flag'], data['Taken_product'])
working_flag_counts = pd.crosstab(data['working_flag'], data['Taken_product'])
member_in_family_counts = pd.crosstab(data['member_in_family'], data['Taken_product'])
# Convert the counts to row percentages
adult_flag_percent = adult_flag_counts.div(adult_flag_counts.sum(1), axis=0) * 100
working_flag_percent = working_flag_counts.div(working_flag_counts.sum(1), axis=0) * 100
member_in_family_percent = member_in_family_counts.div(member_in_family_counts.sum(1), axis=0) * 100
plt.figure(figsize=(20, 8))
# Adult_flag
plt.subplot(1, 3, 1)
ax_adult = adult_flag_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Adult Flag and Ticket Purchase')
plt.xlabel('Adult Flag')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
for i, rect in enumerate(ax_adult.patches):
    if i >= len(adult_flag_counts):  # annotate only the upper (purchase) segments
        height = rect.get_height()
        if height > 0:
            percentage = adult_flag_percent.iloc[i % len(adult_flag_counts)][1]
            x = rect.get_x() + rect.get_width() / 2
            y = rect.get_y() + height / 2
            ax_adult.text(x, y, f'{percentage:.1f}%', ha='center', va='center', color='white')
# Working_flag
plt.subplot(1, 3, 2)
ax_working = working_flag_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Working Flag and Ticket Purchase')
plt.xlabel('Working Flag')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
for i, rect in enumerate(ax_working.patches):
    if i >= len(working_flag_counts):
        height = rect.get_height()
        if height > 0:
            percentage = working_flag_percent.iloc[i % len(working_flag_counts)][1]
            x = rect.get_x() + rect.get_width() / 2
            y = rect.get_y() + height / 2
            ax_working.text(x, y, f'{percentage:.1f}%', ha='center', va='center', color='white')
# Member_in_family
plt.subplot(1, 3, 3)
ax_family = member_in_family_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Family Members and Ticket Purchase')
plt.xlabel('Number of Family Members')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
for i, rect in enumerate(ax_family.patches):
    if i >= len(member_in_family_counts):
        height = rect.get_height()
        if height > 0:
            percentage = member_in_family_percent.iloc[i % len(member_in_family_counts)][1]
            x = rect.get_x() + rect.get_width() / 2
            y = rect.get_y() + height / 2
            ax_family.text(x, y, f'{percentage:.1f}%', ha='center', va='center', color='white')
plt.tight_layout()
plt.show()
Conclusions:
1. Users with an age status of 0 have the highest purchase rate, which supports my reading of this attribute: it likely encodes an age bracket, with 0 probably representing young users. Users in states 1 and 2 have the lowest purchase rates and may be middle-aged.
2. Working and non-working users purchase at the same rate; the site simply has more non-working users.
3. Purchase rates are higher for smaller households, but drop off markedly once the family size reaches 5.
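The percentages annotated on the bars come from the row-normalized crosstabs computed above; for the record, the same rates can be printed directly as a small sketch using the objects already defined:

```python
# Direct tabular view of the purchase rates behind the bar annotations above.
for name, counts in [('Adult_flag', adult_flag_counts),
                     ('working_flag', working_flag_counts),
                     ('member_in_family', member_in_family_counts)]:
    print(name)
    print((counts.div(counts.sum(1), axis=0) * 100).round(1))
```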
5.2 User Interaction Behavior Analysis
In [14]:
plt.figure(figsize=(20, 8))
# Average yearly page views
plt.subplot(1, 3, 1)
sns.boxplot(x='Taken_product', y='Yearly_avg_view_on_travel_page', data=data)
plt.title('Yearly Avg View on Travel Page by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Yearly Avg View')
# Average yearly page comments
plt.subplot(1, 3, 2)
sns.boxplot(x='Taken_product', y='Yearly_avg_comment_on_travel_page', data=data)
plt.title('Yearly Avg Comment on Travel Page by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Yearly Avg Comment')
# Total likes given on outstation check-ins
plt.subplot(1, 3, 3)
sns.boxplot(x='Taken_product', y='total_likes_on_outstation_checkin_given', data=data)
plt.title('Total Likes on Outstation Checkin Given by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Total Likes on Checkin')
plt.tight_layout()
plt.show()
Conclusions:
1. Buyers have fewer average yearly views of travel-related pages than non-buyers.
2. Average yearly comments on travel-related pages differ little between the two groups, suggesting this is probably not an important factor.
3. For total likes given to outstation check-ins over the past year, non-buyers are slightly higher.
5.3 User Travel Activity Analysis
In [15]:
plt.figure(figsize=(16, 8))
# Average yearly outstation check-ins
plt.subplot(1, 2, 1)
sns.boxplot(x='Taken_product', y='yearly_avg_Outstation_checkins', data=data)
plt.title('Yearly Avg Outstation Checkins by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Yearly Avg Outstation Checkins')
# Weeks since the last outstation check-in update
plt.subplot(1, 2, 2)
sns.boxplot(x='Taken_product', y='week_since_last_outstation_checkin', data=data)
plt.title('Weeks Since Last Outstation Checkin by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Weeks Since Last Outstation Checkin')
plt.tight_layout()
plt.show()
Conclusions:
1. Buyers check in at outstation locations more often per year than non-buyers.
2. Buyers also show a longer gap since their last outstation check-in update.
5.4 User Preference Analysis
In [16]:
# Filter the users who purchased tickets
purchased_data = data[data['Taken_product'] == '1']
plt.figure(figsize=(20, 20))
# Preferred login device
plt.subplot(2, 1, 1)
device_counts_purchased = purchased_data['preferred_device'].value_counts()
ax_device = device_counts_purchased.plot(kind='bar')
plt.title('Preferred Device for Users Who Purchased Tickets')
plt.xlabel('Preferred Device')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
# Add data labels
for i, count in enumerate(device_counts_purchased):
    ax_device.text(i, count, str(count), ha='center', va='bottom')
# Preferred travel location type
plt.subplot(2, 1, 2)
location_counts_purchased = purchased_data['preferred_location_type'].value_counts()
ax_location = location_counts_purchased.plot(kind='bar')
plt.title('Preferred Location Type for Users Who Purchased Tickets')
plt.xlabel('Preferred Location Type')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
# Add data labels
for i, count in enumerate(location_counts_purchased):
    ax_location.text(i, count, str(count), ha='center', va='bottom')
plt.tight_layout()
plt.show()
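Raw counts among buyers largely mirror how popular each device or location type is overall. A complementary view (my addition, not part of the original analysis) is the purchase rate within each category:

```python
# Hypothetical follow-up: purchase rate per preferred location type, using the
# cleaned `data` frame (Taken_product holds the strings '1'/'0').
purchase_rate = (
    data.assign(purchased=(data['Taken_product'] == '1').astype(int))
        .groupby('preferred_location_type')['purchased']
        .mean()
        .sort_values(ascending=False)
)
print(purchase_rate.round(3))
```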
5.5 Statistical Exploration of Influencing Factors
In [17]:
from scipy.stats import chi2_contingency, spearmanr

# Chi-square test helper
def chi_square_test(data, column):
    crosstab = pd.crosstab(data[column], data['Taken_product'])
    chi2, p, dof, expected = chi2_contingency(crosstab)
    return chi2, p

# Spearman correlation helper
def spearman_correlation_test(data, column):
    correlation, p = spearmanr(data[column], data['Taken_product'].cat.codes)
    return correlation, p

# Chi-square tests for the categorical variables
categorical_features = ['preferred_device', 'preferred_location_type', 'following_company_page',
                        'working_flag', 'travelling_network_rating', 'Adult_flag']
chi_square_results = {feature: chi_square_test(data, feature) for feature in categorical_features}

# Spearman correlation tests for the continuous variables
continuous_features = ['Yearly_avg_view_on_travel_page', 'total_likes_on_outstation_checkin_given',
                       'yearly_avg_Outstation_checkins', 'member_in_family',
                       'Yearly_avg_comment_on_travel_page', 'total_likes_on_outofstation_checkin_received',
                       'week_since_last_outstation_checkin', 'montly_avg_comment_on_company_page',
                       'Daily_Avg_mins_spend_on_traveling_page']
spearman_results = {feature: spearman_correlation_test(data, feature) for feature in continuous_features}

chi_square_df = pd.DataFrame.from_dict(chi_square_results, orient='index', columns=['Chi-Square', 'P-Value'])
spearman_df = pd.DataFrame.from_dict(spearman_results, orient='index', columns=['Spearman Correlation', 'P-Value'])
results_df = pd.concat([chi_square_df, spearman_df])
results_df
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:23: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'.
| | Chi-Square | P-Value | Spearman Correlation |
| --- | --- | --- | --- |
| preferred_device | 157.855971 | 1.684407e-31 | NaN |
| preferred_location_type | 118.495746 | 1.239277e-18 | NaN |
| following_company_page | 626.894350 | 2.367415e-138 | NaN |
| working_flag | 0.000029 | 9.956765e-01 | NaN |
| travelling_network_rating | 41.280573 | 5.701811e-09 | NaN |
| Adult_flag | 429.348020 | 9.718849e-93 | NaN |
| Yearly_avg_view_on_travel_page | NaN | 1.004774e-67 | -0.168867 |
| total_likes_on_outstation_checkin_given | NaN | 7.511826e-08 | -0.052574 |
| yearly_avg_Outstation_checkins | NaN | 2.427538e-13 | 0.071538 |
| member_in_family | NaN | 4.625198e-04 | -0.034241 |
| Yearly_avg_comment_on_travel_page | NaN | 3.772695e-01 | -0.008636 |
| total_likes_on_outofstation_checkin_received | NaN | 1.084557e-91 | -0.196705 |
| week_since_last_outstation_checkin | NaN | 2.979126e-05 | 0.040822 |
| montly_avg_comment_on_company_page | NaN | 2.197510e-01 | -0.012003 |
| Daily_Avg_mins_spend_on_traveling_page | NaN | 3.701855e-72 | -0.174373 |

Chi-square test results:
1. preferred_device (preferred login device): chi-square 157.86, p < 0.0001; the preferred login device has a significant relationship with ticket purchase.
2. preferred_location_type (preferred travel location type): chi-square 118.50, p < 0.0001; significantly related to ticket purchase.
3. following_company_page (whether the customer follows the company page): chi-square 626.89, p < 0.0001; significantly related to ticket purchase.
4. working_flag (whether the customer is working): chi-square near 0, p = 0.996; no significant relationship with ticket purchase.
5. travelling_network_rating (rating of whether the user has close friends who like to travel): chi-square 41.28, p < 0.0001; significantly related to ticket purchase.
6. Adult_flag (the customer's age status): chi-square 429.35, p < 0.0001; significantly related to ticket purchase.
Spearman correlation test results:
1. Yearly_avg_view_on_travel_page (average yearly views of travel-related pages): correlation -0.169, p < 0.0001; yearly page views are significantly negatively correlated with purchase.
2. total_likes_on_outstation_checkin_given (total likes given to outstation check-ins over the past year): correlation -0.053, p < 0.0001; a weak but significant negative correlation with purchase.
3. yearly_avg_Outstation_checkins (average yearly outstation check-ins): correlation 0.072, p < 0.0001; a weak but significant positive correlation with purchase.
4. member_in_family (total family members mentioned in the account): correlation -0.034, p ≈ 0.0005; a weak but significant negative correlation with purchase.
5. Yearly_avg_comment_on_travel_page (average yearly comments on travel-related pages): correlation -0.009, p ≈ 0.377; not significantly correlated with purchase.
6. total_likes_on_outofstation_checkin_received (total likes received on outstation check-ins over the past year): correlation -0.197, p < 0.0001; a significant negative correlation with purchase.
7. week_since_last_outstation_checkin (weeks since the last outstation check-in update): correlation 0.041, p < 0.0001; a significant positive correlation with purchase.
8. montly_avg_comment_on_company_page (average monthly comments on the company page): correlation -0.012, p ≈ 0.220; not significantly correlated with purchase.
9. Daily_Avg_mins_spend_on_traveling_page (average daily minutes spent on the company's travel page): correlation -0.174, p < 0.0001; a significant negative correlation with purchase.
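Since 15 separate tests are run above, a multiple-comparison correction is a sensible robustness check (my addition, not part of the original analysis). A minimal Bonferroni sketch over the collected p-values:

```python
# Bonferroni adjustment over the 15 tests in results_df (a sketch). The
# conclusions above are unchanged: every p-value called significant sits
# well below the adjusted threshold of 0.05 / 15.
alpha = 0.05
results_df['Significant (Bonferroni)'] = results_df['P-Value'] < alpha / len(results_df)
print(results_df[['P-Value', 'Significant (Bonferroni)']])
```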
6. Cluster Analysis
6.1 Data Preprocessing
In [18]:
# Feature list for clustering (note: despite the original comment, preferred_device is not actually included)
features = ['Taken_product', 'Yearly_avg_view_on_travel_page', 'total_likes_on_outstation_checkin_given',
            'yearly_avg_Outstation_checkins', 'member_in_family', 'Yearly_avg_comment_on_travel_page',
            'total_likes_on_outofstation_checkin_received', 'week_since_last_outstation_checkin',
            'following_company_page', 'montly_avg_comment_on_company_page', 'working_flag',
            'travelling_network_rating', 'Adult_flag', 'Daily_Avg_mins_spend_on_traveling_page']
# Numeric variables: everything not in the categorical list from Section 5.5
numeric_features = list(set(features) - set(categorical_features))
# Build the preprocessor (columns without a transformer are dropped by default)
preprocessor = ColumnTransformer(transformers=[('num', StandardScaler(), numeric_features)])
# Apply the preprocessor
X = preprocessor.fit_transform(data[features])
# KMeans clustering: use the elbow method to pick the number of clusters
inertia = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    inertia.append(kmeanModel.inertia_)
# Plot the elbow curve
plt.figure(figsize=(16, 8))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Determining the Number of Clusters Using the Elbow Method')
plt.show()
The elbow plot shows that the drop in inertia slows from the fourth cluster onward, so the number of clusters is set to 4.
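As a cross-check on the elbow choice (a sketch of mine, not run in the original notebook), silhouette scores for a few candidate k values can be computed on the same standardized matrix `X`:

```python
from sklearn.metrics import silhouette_score

# Average silhouette width for k = 2..6; higher means better-separated clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=15).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```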
6.2 Building the KMeans Clustering Model
In [19]:
# Initialize the KMeans algorithm
kmeans = KMeans(n_clusters=4, random_state=15)
# Fit the model
kmeans.fit(X)
# Get the cluster labels
cluster_labels = kmeans.labels_
Kmean_data = data.copy()
Kmean_data['Cluster'] = cluster_labels
Kmean_data.head()
| | UserID | Taken_product | Yearly_avg_view_on_travel_page | preferred_device | total_likes_on_outstation_checkin_given | yearly_avg_Outstation_checkins | member_in_family | preferred_location_type | Yearly_avg_comment_on_travel_page | total_likes_on_outofstation_checkin_received | week_since_last_outstation_checkin | following_company_page | montly_avg_comment_on_company_page | working_flag | travelling_network_rating | Adult_flag | Daily_Avg_mins_spend_on_traveling_page | Cluster |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1000001 | 1 | 307.0 | iOS and Android | 38570.0 | 1 | 2 | Financial | 94.0 | 5993 | 8 | 1 | 11 | 0 | 1 | 0.0 | 8.0 | 2 |
| 1 | 1000002 | 0 | 367.0 | iOS | 9765.0 | 1 | 1 | Financial | 61.0 | 5130 | 1 | 0 | 23 | 1 | 4 | 1.0 | 10.0 | 0 |
| 2 | 1000003 | 1 | 277.0 | iOS and Android | 48055.0 | 1 | 2 | Other | 92.0 | 2090 | 6 | 1 | 15 | 0 | 2 | 0.0 | 7.0 | 2 |
| 3 | 1000004 | 0 | 247.0 | iOS | 48720.0 | 1 | 4 | Financial | 56.0 | 2909 | 1 | 1 | 11 | 0 | 3 | 0.0 | 8.0 | 0 |
| 4 | 1000005 | 0 | 202.0 | iOS and Android | 20685.0 | 1 | 1 | Medical | 40.0 | 3468 | 9 | 0 | 12 | 0 | 4 | 1.0 | 6.0 | 0 |

6.3 Analyzing User Characteristics by Cluster
In [20]:
# Analyze the feature distributions within each cluster
# Use descriptive statistics for the numerical features
numerical_features = Kmean_data.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Compute descriptive statistics per cluster
cluster_descriptions = Kmean_data.groupby('Cluster')[numerical_features].describe()
# Print the descriptive statistics for each cluster
for cluster in cluster_descriptions.index:
    print(f"Cluster {cluster} statistics:\n")
    print(cluster_descriptions.loc[cluster])

Output (pandas truncates the printout; values not shown are marked "…"):

Cluster 0 statistics (n = 6725):

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yearly_avg_view_on_travel_page | 6725 | 264.288773 | 53.176680 | 35 | 227 | 262 | 293 | 462 |
| total_likes_on_outstation_checkin_given | 6725 | 28416.541413 | 14142.871913 | 3710 | 16807 | 28453 | 40927 | 152430 |
| yearly_avg_Outstation_checkins | 6725 | 8.003271 | 8.320505 | 1 | 1 | 4 | 12 | 29 |
| member_in_family | 6725 | 2.895019 | 1.048205 | 1 | 2 | 3 | … | … |
| total_likes_on_outofstation_checkin_received | … | … | 2395.897640 | 1051 | 2871 | 4449 | 6145 | 17452 |
| week_since_last_outstation_checkin | 6725 | 2.802677 | 2.454801 | 0 | 1 | 2 | 4 | 11 |
| montly_avg_comment_on_company_page | 6725 | 22.922825 | 6.984228 | 11 | 18 | 23 | 27 | 48 |
| Daily_Avg_mins_spend_on_traveling_page | 6725 | 11.068550 | 5.205495 | 0 | 7 | 10 | 15 | 28 |

Cluster 1 statistics (n = 2001):

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yearly_avg_view_on_travel_page | 2001 | 364.580710 | 47.464425 | 223 | 329 | 367 | 403 | 464 |
| total_likes_on_outstation_checkin_given | 2001 | 28573.460270 | 14618.311084 | 3570 | 16283 | 28785 | 41304 | 152465 |
| yearly_avg_Outstation_checkins | 2001 | 8.064468 | 8.796041 | 1 | 1 | 3 | 13 | 29 |
| member_in_family | 2001 | 3.109445 | 1.014651 | 1 | 3 | 3 | … | … |
| total_likes_on_outofstation_checkin_received | … | … | 4575.686433 | 2320 | 10539 | 13966 | 17924 | 20065 |
| week_since_last_outstation_checkin | 2001 | 4.371314 | 2.655100 | 0 | 2 | 4 | 6 | 11 |
| montly_avg_comment_on_company_page | 2001 | 23.740130 | 7.181325 | 11 | 18 | 23 | 29 | 46 |
| Daily_Avg_mins_spend_on_traveling_page | 2001 | 26.392304 | 9.141035 | 9 | 21 | 26 | 31 | 235 |

Cluster 2 statistics (n = 1566):

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yearly_avg_view_on_travel_page | 1566 | 248.678799 | 66.522126 | 35 | 206.25 | 240 | 279 | 446 |
| total_likes_on_outstation_checkin_given | 1566 | 26432.889527 | 13967.138942 | 3605 | 14352.5 | 24780 | 38171.25 | 52414 |
| yearly_avg_Outstation_checkins | 1566 | 9.839080 | 8.985947 | 1 | 1 | 8 | 17 | 29 |
| member_in_family | 1566 | 2.811622 | 1.015984 | 1 | 2 | 3 | … | … |
| total_likes_on_outofstation_checkin_received | … | … | 2678.781065 | 1009 | 2391 | 3017.5 | 5355.75 | 13766 |
| week_since_last_outstation_checkin | 1566 | 3.434227 | 2.749699 | 0 | 1 | 3 | 5 | 11 |
| montly_avg_comment_on_company_page | 1566 | 22.931673 | 7.066263 | 11 | 18 | 23 | 28 | 46 |
| Daily_Avg_mins_spend_on_traveling_page | 1566 | 9.457854 | 5.826935 | 0 | 5 | 9 | 13 | 29 |

Cluster 3 statistics (n = 162):

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yearly_avg_view_on_travel_page | 162 | 272.222222 | 69.541980 | 144 | 226 | 260 | 316.75 | 436 |
| total_likes_on_outstation_checkin_given | 162 | 29008.716049 | 13407.054543 | 4241 | 17414 | 29051.5 | 40562.5 | 52199 |
| yearly_avg_Outstation_checkins | 162 | 7.925926 | 10.049910 | 1 | 1 | 1 | 16 | 29 |
| member_in_family | 162 | 2.944444 | 1.087764 | 1 | 2 | 3 | … | … |
| total_likes_on_outofstation_checkin_received | … | … | 4745.050630 | 1099 | 2907.5 | 4826.5 | 8646 | 19894 |
| week_since_last_outstation_checkin | 162 | 3.580247 | 2.543359 | 0 | 2 | 3 | 5 | 11 |
| montly_avg_comment_on_company_page | 162 | 399.753086 | 59.456134 | 300 | 346.25 | 403.5 | 454.75 | 499 |
| Daily_Avg_mins_spend_on_traveling_page | 162 | 16.709877 | 8.747445 | 3 | 11 | 14 | 22 | 45 |

In [21]:
# Explore the differences between clusters
# For numerical features, use boxplots to visualize distribution differences across clusters
def plot_feature_distribution(df, cols, hue):
    """Plot boxplots comparing feature distributions across clusters."""
    fig, axes = plt.subplots(len(cols), 1, figsize=(10, 5 * len(cols)))
    if len(cols) == 1:
        axes = [axes]  # keep indexing uniform when there is only one subplot
    for i, col in enumerate(cols):
        sns.boxplot(x=hue, y=col, data=df, ax=axes[i])
        axes[i].set_title(f'Boxplot of {col} by {hue}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Plot the distribution comparisons for the numerical features
plot_feature_distribution(Kmean_data, numerical_features, 'Cluster')
In [22]:
# Proportion of ticket buyers in each cluster
ticket_purchase_by_cluster = Kmean_data.groupby('Cluster')['Taken_product'].value_counts(normalize=True).unstack()
# Visualize the purchase proportions per cluster
ticket_purchase_by_cluster.plot(kind='bar', stacked=True, figsize=(18, 8))
plt.title('Ticket Purchase Proportion by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Proportion')
plt.legend(title='Taken Product', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=0)
plt.show()
Cluster 0 (the largest cluster): moderate average travel-page views, a high average number of likes given on outstation check-ins, a moderate check-in frequency, typically 2-3 family members, few comments on the company page, moderate daily time on the travel page, and they do not purchase tickets.
Cluster 1 (active users): high average travel-page views, many likes given on outstation check-ins, a moderate check-in frequency, slightly larger families, moderate comments on the company page, high daily time on the travel page, and only a small probability of purchasing tickets.
Cluster 2 (moderately active users): moderate travel-page views, moderate likes given on outstation check-ins, a high check-in frequency, typically 2-3 family members, few comments on the company page, moderate daily time on the travel page, yet these users all purchased tickets.
Cluster 3 (a small but highly active cluster): moderate travel-page views, many likes given on outstation check-ins, a moderate check-in frequency, around 3 family members, an extremely high number of comments on the company page, and high daily time on the travel page. Their purchase probability lies between that of clusters 1 and 2.
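These profiles can be checked compactly against the per-cluster means and buyer shares (a one-off sketch of mine, using the objects already defined above):

```python
# Compact numeric backing for the cluster profiles: feature means per cluster
# plus the share of buyers (Taken_product holds the strings '1'/'0').
summary = Kmean_data.groupby('Cluster')[numerical_features].mean().round(2)
summary['purchase_rate'] = (Kmean_data.groupby('Cluster')['Taken_product']
                            .apply(lambda s: (s == '1').mean()).round(3))
print(summary)
```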
7. Exploring Purchase Drivers with a Random Forest
7.1 Data Processing
In [23]:
# Select the features (mainly those with a significant relationship to the target in Section 5.5) and the target variable
features = ['Taken_product', 'preferred_device', 'preferred_location_type', 'following_company_page',
            'travelling_network_rating', 'Adult_flag', 'Yearly_avg_view_on_travel_page',
            'total_likes_on_outstation_checkin_given', 'yearly_avg_Outstation_checkins', 'member_in_family',
            'total_likes_on_outofstation_checkin_received', 'week_since_last_outstation_checkin',
            'Daily_Avg_mins_spend_on_traveling_page']
new_data = data[features]
In [24]:
# Categorical variables that need one-hot encoding
categorical_features = ['preferred_device', 'preferred_location_type']
# Reset the row index
new_data.reset_index(drop=True, inplace=True)
# Create the one-hot encoder
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
# Apply the one-hot encoding
encoded_data = encoder.fit_transform(new_data[categorical_features])
# Convert the encoded array to a DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names(categorical_features))
# Merge the encoded columns back into the data
new_data = new_data.drop(categorical_features, axis=1)
new_data = pd.concat([new_data, encoded_df], axis=1)
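A version note of mine: `OneHotEncoder(sparse=False)` and `encoder.get_feature_names(...)` match the older scikit-learn in this environment. On scikit-learn 1.2 or later, the equivalent calls would be:

```python
# Equivalent one-hot encoding on modern scikit-learn (>= 1.2); shown for
# reference only, not run in this notebook's environment.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_data = encoder.fit_transform(new_data[categorical_features])
encoded_df = pd.DataFrame(encoded_data,
                          columns=encoder.get_feature_names_out(categorical_features))
```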
In [25]:
x = new_data.drop('Taken_product', axis=1)
# Convert the target to numeric codes; leaving it as a category would interfere with the model
y = new_data['Taken_product'].cat.codes
# Stratified sampling keeps the Taken_product distribution in the train and test sets similar to the full dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y)  # 70/30 split
In [26]:
# Separate the minority and majority classes
x_minority = x_train[y_train == 1]
y_minority = y_train[y_train == 1]
x_majority = x_train[y_train == 0]
y_majority = y_train[y_train == 0]
# Oversample the minority class with replacement up to the majority-class size
x_minority_resampled = resample(x_minority, replace=True, n_samples=len(x_majority), random_state=15)
y_minority_resampled = resample(y_minority, replace=True, n_samples=len(y_majority), random_state=15)
new_x_train = pd.concat([x_majority, x_minority_resampled])
new_y_train = pd.concat([y_majority, y_minority_resampled])
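Manual oversampling is one way to handle the class imbalance; a lighter alternative I would also consider is letting the forest reweight classes internally (a sketch, not what this notebook does):

```python
# Alternative to duplicating minority rows: class reweighting inside the model.
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=15)
rf_weighted.fit(x_train, y_train)
```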
7.2 Model Building
In [27]:
rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(new_x_train, new_y_train)
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
        oob_score=False, random_state=15, verbose=0, warm_start=False)
7.3 Model Evaluation
In [28]:
y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)
              precision    recall  f1-score   support
           0       0.98      1.00      0.99      2629
           1       0.98      0.92      0.95       508
    accuracy                           0.98      3137
   macro avg       0.98      0.96      0.97      3137
weighted avg       0.98      0.98      0.98      3137
In [29]:
# Plot the confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Random Forest Model')
plt.show()
In [30]:
# Plot the ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_clf.predict_proba(x_test)[:, 1])
roc_auc_rf = auc(fpr_rf, tpr_rf)
plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, color='darkorange', lw=2, label='ROC curve (area = %0.4f)' % roc_auc_rf)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Random Forest')
plt.legend(loc="lower right")
plt.show()
The random forest model scores as follows:
1. Precision: 0.98 for class 0 and 0.98 for class 1.
2. Recall: 1.00 for class 0 and 0.92 for class 1.
3. F1 score: 0.99 for class 0 and 0.95 for class 1.
4. Accuracy: 0.98
5. ROC AUC: 0.9973
The model already performs very well, so no further hyperparameter tuning is needed.
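Should tuning ever become necessary, the GridSearchCV imported in In [1] (but unused so far) could search a small grid; a hypothetical sketch:

```python
# Hypothetical tuning sketch; the grid values are illustrative, not tested here.
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=15),
                    param_grid, cv=5, scoring='roc_auc')
grid.fit(new_x_train, new_y_train)
print(grid.best_params_, grid.best_score_)
```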
7.4 Key Feature Analysis
In [31]:
rf_feature_importance = rf_clf.feature_importances_
feature_names = new_x_train.columns
rf_feature_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf_feature_importance
})
# Keep the five most important features
sorted_rf_feature_df = rf_feature_df.sort_values(by='Importance', ascending=False).head()
sorted_rf_feature_df
| | Feature | Importance |
| --- | --- | --- |
| 7 | total_likes_on_outofstation_checkin_received | 0.174251 |
| 4 | total_likes_on_outstation_checkin_given | 0.122688 |
| 3 | Yearly_avg_view_on_travel_page | 0.115198 |
| 9 | Daily_Avg_mins_spend_on_traveling_page | 0.079616 |
| 5 | yearly_avg_Outstation_checkins | 0.077870 |

In the random forest model, the most important features in descending order are: total likes received on outstation check-ins over the past year > total likes given to outstation check-ins over the past year > average yearly views of travel-related pages > average daily minutes spent on the company's travel page > average yearly outstation check-ins.
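One caveat worth noting: impurity-based importances can favor high-cardinality numeric features. A common cross-check is permutation importance on the held-out test set; the sketch below requires scikit-learn 0.22 or newer, which is more recent than the environment shown in this notebook:

```python
# Permutation importance cross-check (needs scikit-learn >= 0.22).
from sklearn.inspection import permutation_importance

perm = permutation_importance(rf_clf, x_test, y_test, n_repeats=10, random_state=15)
print(pd.Series(perm.importances_mean, index=x_test.columns)
        .sort_values(ascending=False).head())
```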
8. Summary
This project examined users through purchase behavior analysis, cluster analysis, and a random forest model, and identified the main factors influencing ticket purchases. The key conclusions are:
1. Users with age status 0 have the highest purchase rate; 0 probably represents young users, while states 1 and 2 have the lowest purchase rates and may correspond to middle-aged users. Working and non-working users purchase at the same rate, consistent with the later statistical tests; the site simply has more non-working users. Smaller households purchase at higher rates, with a marked drop once the family size reaches 5.
2. The chi-square tests show that the preferred login device, the preferred travel location type, whether the customer follows the company page, the rating of whether the user has close friends who like to travel, and the customer's age status are all significantly related to ticket purchase. The Spearman analysis shows that the average yearly views of travel-related pages, total likes given to outstation check-ins, average yearly outstation check-ins, number of family members, total likes received on outstation check-ins, weeks since the last outstation check-in update, and average daily minutes on the company's travel page are all correlated with ticket purchase, with p-values below 0.001.
3. Clustering produced 4 distinct user groups, with cluster 2 containing the main ticket buyers and cluster 3 the potential buyers.
4. The random forest yielded a strong predictive model and surfaced the top five purchase drivers: total likes received on outstation check-ins over the past year > total likes given to outstation check-ins > average yearly views of travel-related pages > average daily minutes on the company's travel page > average yearly outstation check-ins.