
Exploring a Travel Website User Behavior Dataset


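1. Project Background

This dataset records customer behavior on a travel website. Analyzing it helps clarify users' travel habits, preferences, and how they interact with travel content, which in turn informs marketing, user-experience optimization, and new-service development. With these insights, a travel company can meet customer needs more effectively, improve service quality, and strengthen user engagement and loyalty.
This project examines users from three angles: purchase-behavior analysis, cluster analysis, and a random forest model. It also investigates the main factors that influence whether a user buys a ticket.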

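2. Data Description

Variable: Description
UserID: unique ID of the user
Taken_product: whether the user buys a ticket next month (target variable)
Yearly_avg_view_on_travel_page: average number of views per year on travel-related pages
preferred_device: the user's preferred login device
total_likes_on_outstation_checkin_given: total likes the user gave to out-of-station check-ins in the past year
yearly_avg_Outstation_checkins: average number of out-of-station check-ins per year
member_in_family: total number of family members mentioned in the user's account
preferred_location_type: the user's preferred type of travel location
Yearly_avg_comment_on_travel_page: average number of comments per year on travel-related pages
total_likes_on_outofstation_checkin_received: total likes the user received on out-of-station check-ins in the past year
week_since_last_outstation_checkin: number of weeks since the user's last out-of-station check-in update
following_company_page: whether the customer follows the company page (yes or no)
montly_avg_comment_on_company_page: average number of comments per month on the company page
working_flag: whether the customer is working
travelling_network_rating: rating indicating whether the user has close friends who like to travel; 1 is the highest, 4 is the lowest
Adult_flag: age status of the customer (the values range from 0 to 3, so I suspect it encodes an age group rather than a plain adult/non-adult indicator)
Daily_Avg_mins_spend_on_traveling_page: average time per day the user spends on the company's travel pages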

3. Importing Python Libraries and Loading the Data

In [1]:

# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,roc_curve, auc

In [2]:

# Load the data
data = pd.read_csv("/home/mw/input/data5466/Customer behaviour Tourism.csv")

4. Data Preview and Data Processing

4.1 Data Preview

In [3]:

# Check the dimensions of the data
data.shape
(11760, 17)

In [4]:

# Inspect column types and non-null counts
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 17 columns):
UserID                                          11760 non-null int64
Taken_product                                   11760 non-null object
Yearly_avg_view_on_travel_page                  11179 non-null float64
preferred_device                                11707 non-null object
total_likes_on_outstation_checkin_given         11379 non-null float64
yearly_avg_Outstation_checkins                  11685 non-null object
member_in_family                                11760 non-null object
preferred_location_type                         11729 non-null object
Yearly_avg_comment_on_travel_page               11554 non-null float64
total_likes_on_outofstation_checkin_received    11760 non-null int64
week_since_last_outstation_checkin              11760 non-null int64
following_company_page                          11657 non-null object
montly_avg_comment_on_company_page              11760 non-null int64
working_flag                                    11760 non-null object
travelling_network_rating                       11760 non-null int64
Adult_flag                                      11759 non-null float64
Daily_Avg_mins_spend_on_traveling_page          11759 non-null float64
dtypes: float64(5), int64(5), object(7)
memory usage: 1.5+ MB

In [5]:

# Count the missing values in each column
data.isna().sum()
UserID                                            0
Taken_product                                     0
Yearly_avg_view_on_travel_page                  581
preferred_device                                 53
total_likes_on_outstation_checkin_given         381
yearly_avg_Outstation_checkins                   75
member_in_family                                  0
preferred_location_type                          31
Yearly_avg_comment_on_travel_page               206
total_likes_on_outofstation_checkin_received      0
week_since_last_outstation_checkin                0
following_company_page                          103
montly_avg_comment_on_company_page                0
working_flag                                      0
travelling_network_rating                         0
Adult_flag                                        1
Daily_Avg_mins_spend_on_traveling_page            1
dtype: int64

In [6]:

# Count duplicate rows
data.duplicated().sum()
0

4.2 Data Processing

In [7]:

# Drop rows that contain missing values
data.dropna(inplace=True)
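
Before dropping rows it can help to know how much data `dropna()` will discard. The following is a minimal sketch on a toy frame; the helper name `missing_row_share` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_row_share(df: pd.DataFrame) -> float:
    """Return the fraction of rows that contain at least one missing value."""
    return float(df.isna().any(axis=1).mean())

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "views": [307.0, np.nan, 277.0, 247.0],
    "device": ["iOS", "Android", None, "iOS"],
})

share = missing_row_share(toy)
remaining = toy.dropna()
print(f"{share:.0%} of rows contain a missing value; {len(remaining)} rows remain after dropna()")
```

Applied to the real `data` frame before In [7], the same helper would report the share of the 11760 rows that the deletion removes.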

In [8]:

# Check the missing values again
data.isna().sum()
UserID                                          0
Taken_product                                   0
Yearly_avg_view_on_travel_page                  0
preferred_device                                0
total_likes_on_outstation_checkin_given         0
yearly_avg_Outstation_checkins                  0
member_in_family                                0
preferred_location_type                         0
Yearly_avg_comment_on_travel_page               0
total_likes_on_outofstation_checkin_received    0
week_since_last_outstation_checkin              0
following_company_page                          0
montly_avg_comment_on_company_page              0
working_flag                                    0
travelling_network_rating                       0
Adult_flag                                      0
Daily_Avg_mins_spend_on_traveling_page          0
dtype: int64

In [9]:

# Inspect the unique values of selected features (the raw data is rather messy)
characteristic = ['Taken_product','preferred_device','yearly_avg_Outstation_checkins','member_in_family','preferred_location_type','following_company_page','working_flag']
for i in characteristic:
    print(f'{i}:')
    print(data[i].unique())
    print('-'*50)
Taken_product:
['Yes' 'No']
--------------------------------------------------
preferred_device:
['iOS and Android' 'iOS' 'ANDROID' 'Android' 'Android OS' 'Other' 'Others'
 'Tab' 'Laptop' 'Mobile']
--------------------------------------------------
yearly_avg_Outstation_checkins:
['1' '23' '16' '26' '19' '24' '21' '11' '15' '10' '25' '12' '18' '29' '22'
 '20' '28' '14' '27' '13' '17' '*' '5' '8' '2' '3' '9' '7' '6' '4']
--------------------------------------------------
member_in_family:
['2' '1' '4' '3' 'Three' '5' '10']
--------------------------------------------------
preferred_location_type:
['Financial' 'Other' 'Medical' 'Game' 'Entertainment' 'Social media'
 'Tour and Travel' 'Movie' 'OTT' 'Tour  Travel' 'Beach' 'Historical site'
 'Big Cities' 'Trekking' 'Hill Stations']
--------------------------------------------------
following_company_page:
['Yes' 'No' '1' '0']
--------------------------------------------------
working_flag:
['No' 'Yes']
--------------------------------------------------

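Observations:
1. preferred_device contains both Other and Others, which should be unified as Others; ANDROID, Android, and Android OS should be unified as Android. Tab is a tablet and Mobile presumably means a mobile phone. These would normally be merged as well, but they may denote specific systems, so they are left unchanged here.
2. yearly_avg_Outstation_checkins contains the placeholder '*'. Rows with this value are dropped and the column is converted to int.
3. member_in_family contains the string Three, which is replaced with 3 before the column is converted to int.
4. In preferred_location_type, Tour and Travel and Tour  Travel (written with a double space in the raw data) denote the same category and are unified as Tour and Travel.
5. following_company_page mixes Yes, No, 1, and 0; Yes is replaced with 1 and No with 0.
6. In Taken_product and working_flag, Yes and No are replaced with 1 and 0, respectively.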

In [10]:

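# 1. preferred_device: unify equivalent labels
data['preferred_device'] = data['preferred_device'].replace({'Other': 'Others', 'ANDROID': 'Android', 'Android OS': 'Android'})

# 2. yearly_avg_Outstation_checkins: drop rows containing '*', then convert to int
#    (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data = data[data['yearly_avg_Outstation_checkins'] != '*'].copy()
data['yearly_avg_Outstation_checkins'] = data['yearly_avg_Outstation_checkins'].astype(int)

# 3. member_in_family: replace 'Three' with '3', then convert to int
data['member_in_family'] = data['member_in_family'].replace({'Three': '3'})
data['member_in_family'] = data['member_in_family'].astype(int)

# 4. preferred_location_type: the raw label is 'Tour  Travel' with two spaces (see the unique values above);
#    the single-space spelling is mapped too in case it occurs
data['preferred_location_type'] = data['preferred_location_type'].replace({'Tour  Travel': 'Tour and Travel', 'Tour Travel': 'Tour and Travel'})

# 5. following_company_page: map Yes/No to '1'/'0'
data['following_company_page'] = data['following_company_page'].replace({'Yes': '1', 'No': '0'})

# 6. Taken_product and working_flag: map Yes/No to '1'/'0'
data['Taken_product'] = data['Taken_product'].replace({'Yes': '1', 'No': '0'})
data['working_flag'] = data['working_flag'].replace({'Yes': '1', 'No': '0'})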
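
Exact-match replacement is sensitive to stray whitespace: the unique values printed earlier show the label 'Tour  Travel' with two spaces. One way to make label cleaning more robust is to normalize whitespace before applying the mapping. This is a minimal sketch; the helper `normalize_labels` and the toy series are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def normalize_labels(s: pd.Series, mapping: dict) -> pd.Series:
    """Strip, collapse repeated whitespace, then apply an exact-match mapping."""
    cleaned = s.str.strip().str.replace(r"\s+", " ", regex=True)
    return cleaned.replace(mapping)

# Toy labels imitating the spellings seen in preferred_location_type
locations = pd.Series(["Tour and Travel", "Tour  Travel", "Beach", " Tour Travel "])
fixed = normalize_labels(locations, {"Tour Travel": "Tour and Travel"})
print(fixed.unique())
```

Because whitespace is collapsed first, a single mapping entry covers every spacing variant of the same label.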

In [11]:

# Convert UserID to string
data['UserID'] = data['UserID'].astype(str)
# Convert Taken_product to a categorical type
data['Taken_product'] = data['Taken_product'].astype('category')
# Convert following_company_page to a categorical type
data['following_company_page'] = data['following_company_page'].astype('category')
# Convert working_flag to a categorical type
data['working_flag'] = data['working_flag'].astype('category')
# Convert travelling_network_rating to a categorical type
data['travelling_network_rating'] = data['travelling_network_rating'].astype('category')
# Convert Adult_flag to a categorical type
data['Adult_flag'] = data['Adult_flag'].astype('category')
# Check the data types after the conversion
data.dtypes
UserID                                            object
Taken_product                                   category
Yearly_avg_view_on_travel_page                   float64
preferred_device                                  object
total_likes_on_outstation_checkin_given          float64
yearly_avg_Outstation_checkins                     int64
member_in_family                                   int64
preferred_location_type                           object
Yearly_avg_comment_on_travel_page                float64
total_likes_on_outofstation_checkin_received       int64
week_since_last_outstation_checkin                 int64
following_company_page                          category
montly_avg_comment_on_company_page                 int64
working_flag                                    category
travelling_network_rating                       category
Adult_flag                                      category
Daily_Avg_mins_spend_on_traveling_page           float64
dtype: object

In [12]:

# Preview the cleaned data
data.head(10)
index, UserID, Taken_product, Yearly_avg_view_on_travel_page, preferred_device, total_likes_on_outstation_checkin_given, yearly_avg_Outstation_checkins, member_in_family, preferred_location_type, Yearly_avg_comment_on_travel_page, total_likes_on_outofstation_checkin_received, week_since_last_outstation_checkin, following_company_page, montly_avg_comment_on_company_page, working_flag, travelling_network_rating, Adult_flag, Daily_Avg_mins_spend_on_traveling_page
0, 1000001, 1, 307.0, iOS and Android, 38570.0, 1, 2, Financial, 94.0, 5993, 8, 1, 11, 0, 1, 0.0, 8.0
1, 1000002, 0, 367.0, iOS, 9765.0, 1, 1, Financial, 61.0, 5130, 1, 0, 23, 1, 4, 1.0, 10.0
2, 1000003, 1, 277.0, iOS and Android, 48055.0, 1, 2, Other, 92.0, 2090, 6, 1, 15, 0, 2, 0.0, 7.0
3, 1000004, 0, 247.0, iOS, 48720.0, 1, 4, Financial, 56.0, 2909, 1, 1, 11, 0, 3, 0.0, 8.0
4, 1000005, 0, 202.0, iOS and Android, 20685.0, 1, 1, Medical, 40.0, 3468, 9, 0, 12, 0, 4, 1.0, 6.0
5, 1000006, 0, 240.0, iOS, 35175.0, 1, 2, Financial, 79.0, 3068, 0, 0, 13, 0, 3, 0.0, 8.0
8, 1000009, 0, 285.0, iOS, 7560.0, 23, 3, Financial, 44.0, 9526, 0, 0, 21, 1, 2, 0.0, 10.0
10, 1000011, 0, 262.0, iOS and Android, 28315.0, 16, 3, Medical, 84.0, 2426, 0, 0, 13, 0, 3, 1.0, 6.0
12, 1000013, 0, 232.0, iOS and Android, 23450.0, 26, 1, Financial, 31.0, 2911, 1, 0, 17, 0, 4, 1.0, 5.0
13, 1000014, 0, 255.0, iOS and Android, 47110.0, 19, 2, Medical, 93.0, 2661, 0, 0, 11, 0, 3, 1.0, 3.0

5. Purchase Behavior Analysis

5.1 Basic User Attributes and Purchase Behavior

In [13]:

adult_flag_counts = pd.crosstab(data['Adult_flag'], data['Taken_product'])
working_flag_counts = pd.crosstab(data['working_flag'], data['Taken_product'])
member_in_family_counts = pd.crosstab(data['member_in_family'], data['Taken_product'])

# Convert the counts to row percentages
adult_flag_percent = adult_flag_counts.div(adult_flag_counts.sum(1), axis=0) * 100
working_flag_percent = working_flag_counts.div(working_flag_counts.sum(1), axis=0) * 100
member_in_family_percent = member_in_family_counts.div(member_in_family_counts.sum(1), axis=0) * 100

plt.figure(figsize=(20,8))

# Adult_flag
plt.subplot(1, 3, 1)
ax_adult = adult_flag_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Adult Flag and Ticket Purchase')
plt.xlabel('Adult Flag')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
for i, rect in enumerate(ax_adult.patches):
    if i >= len(adult_flag_counts):
        height = rect.get_height()
        if height > 0:
            percentage = adult_flag_percent.iloc[i % len(adult_flag_counts), 1]
            x = rect.get_x() + rect.get_width() / 2
            y = rect.get_y() + height / 2
            ax_adult.text(x, y, f'{percentage:.1f}%', ha='center', va='center', color='white')

# Working_flag
plt.subplot(1, 3, 2)
ax_working = working_flag_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Working Flag and Ticket Purchase')
plt.xlabel('Working Flag')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
for i, rect in enumerate(ax_working.patches):
    if i >= len(working_flag_counts):
        height = rect.get_height()
        if height > 0:
            percentage = working_flag_percent.iloc[i % len(working_flag_counts), 1]
            x = rect.get_x() + rect.get_width() / 2
            y = rect.get_y() + height / 2
            ax_working.text(x, y, f'{percentage:.1f}%', ha='center', va='center', color='white')

# Member_in_family
plt.subplot(1, 3, 3)
ax_family = member_in_family_counts.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Family Members and Ticket Purchase')
plt.xlabel('Number of Family Members')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
for i, rect in enumerate(ax_family.patches):
    if i >= len(member_in_family_counts):
        height = rect.get_height()
        if height > 0:
            percentage = member_in_family_percent.iloc[i % len(member_in_family_counts), 1]
            x = rect.get_x() + rect.get_width() / 2
            y = rect.get_y() + height / 2
            ax_family.text(x, y, f'{percentage:.1f}%', ha='center', va='center', color='white')

plt.tight_layout()
plt.show()

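Conclusions:
1. Users with Adult_flag = 0 have the highest purchase rate. This supports my reading of the variable as an age group rather than an adult indicator: 0 may correspond to young users, while groups 1 and 2, which have the lowest purchase rates, may be middle-aged.
2. The purchase rate is the same for working and non-working users; the site simply has more non-working users.
3. Purchase rates are relatively high for small families and drop markedly once the family size reaches 5.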
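
The purchase rates read off the stacked bars can also be tabulated directly. The sketch below uses `pd.crosstab` with `normalize="index"` on a toy frame; the helper name `purchase_rate_by` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def purchase_rate_by(df: pd.DataFrame, column: str, target: str = "Taken_product") -> pd.DataFrame:
    """Percentage of each target class within every level of `column`."""
    return pd.crosstab(df[column], df[target], normalize="index") * 100

# Toy data: both working groups buy at the same rate (1 out of 4)
toy = pd.DataFrame({
    "working_flag":  ["0", "0", "0", "1", "1", "1", "1", "0"],
    "Taken_product": ["1", "0", "0", "1", "0", "0", "0", "0"],
})

rates = purchase_rate_by(toy, "working_flag")
print(rates.round(1))
```

On the real data, calling the helper with 'Adult_flag', 'working_flag', or 'member_in_family' would give the same percentages that the plots annotate.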

5.2 User Interaction Analysis

In [14]:

plt.figure(figsize=(20, 8))

# Yearly average page views
plt.subplot(1, 3, 1)
sns.boxplot(x='Taken_product', y='Yearly_avg_view_on_travel_page', data=data)
plt.title('Yearly Avg View on Travel Page by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Yearly Avg View')

# Yearly average page comments
plt.subplot(1, 3, 2)
sns.boxplot(x='Taken_product', y='Yearly_avg_comment_on_travel_page', data=data)
plt.title('Yearly Avg Comment on Travel Page by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Yearly Avg Comment')

# Total likes given on out-of-station check-ins
plt.subplot(1, 3, 3)
sns.boxplot(x='Taken_product', y='total_likes_on_outstation_checkin_given', data=data)
plt.title('Total Likes on Outstation Checkin Given by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Total Likes on Checkin')

plt.tight_layout()
plt.show()

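Conclusions:
1. Yearly average views of travel-related pages are lower for purchasers than for non-purchasers.
2. Yearly average comments on travel-related pages differ little between the two groups, which suggests this is probably not an important factor.
3. Total likes given on out-of-station check-ins over the past year are slightly higher for non-purchasers.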
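
Box plots are compared by eye; the group medians they draw can be computed as numbers too. This is a minimal sketch on toy values; the helper `median_by_purchase` and the data are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

def median_by_purchase(df: pd.DataFrame, columns: list, target: str = "Taken_product") -> pd.DataFrame:
    """Median of each column within every target class (the center line of a box plot)."""
    return df.groupby(target, observed=True)[columns].median()

# Toy data: non-purchasers ('0') view travel pages more often than purchasers ('1')
toy = pd.DataFrame({
    "Taken_product": ["0", "0", "0", "1", "1", "1"],
    "Yearly_avg_view_on_travel_page": [300, 280, 260, 250, 240, 230],
})

medians = median_by_purchase(toy, ["Yearly_avg_view_on_travel_page"])
print(medians)
```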

5.3 Travel Activity Analysis

In [15]:

plt.figure(figsize=(16,8))

# Yearly average out-of-station check-ins
plt.subplot(1, 2, 1)
sns.boxplot(x='Taken_product', y='yearly_avg_Outstation_checkins', data=data)
plt.title('Yearly Avg Outstation Checkins by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Yearly Avg Outstation Checkins')

# Weeks since the last out-of-station check-in update
plt.subplot(1, 2, 2)
sns.boxplot(x='Taken_product', y='week_since_last_outstation_checkin', data=data)
plt.title('Weeks Since Last Outstation Checkin by Purchase Status')
plt.xlabel('Ticket Purchase')
plt.ylabel('Weeks Since Last Outstation Checkin')

plt.tight_layout()
plt.show()

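Conclusions:
1. Purchasers have more yearly out-of-station check-ins on average than non-purchasers.
2. Purchasers have gone more weeks since their last out-of-station check-in update.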

5.4 User Preference Analysis

In [16]:

# Keep only the users who purchased a ticket
purchased_data = data[data['Taken_product'] == '1']

plt.figure(figsize=(20,20))

# Preferred login device
plt.subplot(2,1, 1)
device_counts_purchased = purchased_data['preferred_device'].value_counts()
ax_device = device_counts_purchased.plot(kind='bar')
plt.title('Preferred Device for Users Who Purchased Tickets')
plt.xlabel('Preferred Device')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
# Add data labels
for i, count in enumerate(device_counts_purchased):
    ax_device.text(i, count, str(count), ha='center', va='bottom')

# Preferred location type for travel
plt.subplot(2,1,2)
location_counts_purchased = purchased_data['preferred_location_type'].value_counts()
ax_location = location_counts_purchased.plot(kind='bar')
plt.title('Preferred Location Type for Users Who Purchased Tickets')
plt.xlabel('Preferred Location Type')
plt.ylabel('Number of Users')
plt.xticks(rotation=0)
# Add data labels
for i, count in enumerate(location_counts_purchased):
    ax_location.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()

5.5 Statistical Tests for Influencing Factors

In [17]:

from scipy.stats import chi2_contingency, spearmanr
# Chi-square test of independence between a column and the target
def chi_square_test(data, column):
    crosstab = pd.crosstab(data[column], data['Taken_product'])
    chi2, p, dof, expected = chi2_contingency(crosstab)
    return chi2, p

# Spearman rank correlation between a column and the target
def spearman_correlation_test(data, column):
    correlation, p = spearmanr(data[column], data['Taken_product'].cat.codes)
    return correlation, p

# Run the chi-square test on the categorical variables
categorical_features = ['preferred_device', 'preferred_location_type', 'following_company_page', 'working_flag','travelling_network_rating','Adult_flag']
chi_square_results = {feature: chi_square_test(data, feature) for feature in categorical_features}

# Run the Spearman correlation test on the continuous variables
continuous_features = ['Yearly_avg_view_on_travel_page', 'total_likes_on_outstation_checkin_given', 'yearly_avg_Outstation_checkins',  'member_in_family', 'Yearly_avg_comment_on_travel_page','total_likes_on_outofstation_checkin_received','week_since_last_outstation_checkin','montly_avg_comment_on_company_page','Daily_Avg_mins_spend_on_traveling_page']
spearman_results = {feature: spearman_correlation_test(data, feature) for feature in continuous_features}

chi_square_df = pd.DataFrame.from_dict(chi_square_results,orient='index',columns=['Chi-Square','P-Value'])
spearman_df = pd.DataFrame.from_dict(spearman_results,orient='index',columns=['Spearman Correlation','P-Value'])
results_df = pd.concat([chi_square_df, spearman_df], sort=False)
results_df

                                              Chi-Square        P-Value  Spearman Correlation
preferred_device                              157.855971   1.684407e-31                   NaN
preferred_location_type                       118.495746   1.239277e-18                   NaN
following_company_page                        626.894350  2.367415e-138                   NaN
working_flag                                    0.000029   9.956765e-01                   NaN
travelling_network_rating                      41.280573   5.701811e-09                   NaN
Adult_flag                                    429.348020   9.718849e-93                   NaN
Yearly_avg_view_on_travel_page                       NaN   1.004774e-67             -0.168867
total_likes_on_outstation_checkin_given              NaN   7.511826e-08             -0.052574
yearly_avg_Outstation_checkins                       NaN   2.427538e-13              0.071538
member_in_family                                     NaN   4.625198e-04             -0.034241
Yearly_avg_comment_on_travel_page                    NaN   3.772695e-01             -0.008636
total_likes_on_outofstation_checkin_received         NaN   1.084557e-91             -0.196705
week_since_last_outstation_checkin                   NaN   2.979126e-05              0.040822
montly_avg_comment_on_company_page                   NaN   2.197510e-01             -0.012003
Daily_Avg_mins_spend_on_traveling_page               NaN   3.701855e-72             -0.174373

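Chi-square test results:
1. preferred_device (preferred login device): chi-square = 157.86, p < 0.0001; the preferred device is significantly associated with ticket purchase.
2. preferred_location_type (preferred location type for travel): chi-square = 118.50, p < 0.0001; the preferred location type is significantly associated with ticket purchase.
3. following_company_page (whether the customer follows the company page): chi-square = 626.89, p < 0.0001; following the company page is significantly associated with ticket purchase.
4. working_flag (whether the customer is working): chi-square ≈ 0, p = 0.996; there is no significant association between working status and ticket purchase.
5. travelling_network_rating (rating of whether the user has close friends who like to travel): chi-square = 41.28, p < 0.0001; the rating is significantly associated with ticket purchase.
6. Adult_flag (age status of the customer): chi-square = 429.35, p < 0.0001; age status is significantly associated with ticket purchase.

Spearman correlation results:
1. Yearly_avg_view_on_travel_page (yearly average views of travel-related pages): correlation = -0.169, p < 0.0001; a significant negative correlation with purchase.
2. total_likes_on_outstation_checkin_given (total likes given on out-of-station check-ins in the past year): correlation = -0.053, p < 0.0001; a weak but significant negative correlation with purchase.
3. yearly_avg_Outstation_checkins (yearly average out-of-station check-ins): correlation = 0.072, p < 0.0001; a weak but significant positive correlation with purchase.
4. member_in_family (total family members mentioned in the account): correlation = -0.034, p ≈ 0.0005; a weak but significant negative correlation with purchase.
5. Yearly_avg_comment_on_travel_page (yearly average comments on travel-related pages): correlation = -0.009, p ≈ 0.377; the correlation with purchase is not significant.
6. total_likes_on_outofstation_checkin_received (total likes received on out-of-station check-ins in the past year): correlation = -0.197, p < 0.0001; a significant negative correlation with purchase.
7. week_since_last_outstation_checkin (weeks since the last out-of-station check-in update): correlation = 0.041, p < 0.0001; a weak but significant positive correlation with purchase.
8. montly_avg_comment_on_company_page (monthly average comments on the company page): correlation = -0.012, p ≈ 0.220; the correlation with purchase is not significant.
9. Daily_Avg_mins_spend_on_traveling_page (average daily time spent on the company's travel pages): correlation = -0.174, p < 0.0001; a significant negative correlation with purchase.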
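
With more than ten thousand rows, even small associations produce tiny p-values, so an effect size is a useful companion to the chi-square statistic. The sketch below computes Cramér's V, defined as sqrt(chi2 / (n * (min(rows, cols) - 1))). It is a supplement that the original analysis does not include; the helper `cramers_v` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V effect size for two categorical series (0 = none, 1 = perfect association)."""
    table = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction, which scipy applies
    # by default to 2x2 tables; without it a perfect association yields exactly 1
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Toy data: one perfectly associated pair and one independent pair
perfect = pd.DataFrame({"following": ["1"] * 10 + ["0"] * 10,
                        "taken":     ["1"] * 10 + ["0"] * 10})
independent = pd.DataFrame({"following": ["1", "1", "0", "0"] * 5,
                            "taken":     ["1", "0", "1", "0"] * 5})

v_perfect = cramers_v(perfect["following"], perfect["taken"])
v_indep = cramers_v(independent["following"], independent["taken"])
print(f"perfect: {v_perfect:.2f}, independent: {v_indep:.2f}")
```

On the real data, `cramers_v(data[feature], data['Taken_product'])` for each entry of `categorical_features` would rank the categorical variables by strength of association rather than by p-value alone.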

6. Cluster Analysis

6.1 Data Preprocessing

In [18]:

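# Features considered for clustering (preferred_device and preferred_location_type are not included)
features = [
    'Taken_product', 'Yearly_avg_view_on_travel_page', 'total_likes_on_outstation_checkin_given',
    'yearly_avg_Outstation_checkins', 'member_in_family', 'Yearly_avg_comment_on_travel_page', 'total_likes_on_outofstation_checkin_received',
    'week_since_last_outstation_checkin', 'following_company_page', 'montly_avg_comment_on_company_page', 'working_flag', 'travelling_network_rating',
    'Adult_flag', 'Daily_Avg_mins_spend_on_traveling_page']


# Numeric features: everything in `features` that is not in categorical_features.
# A list comprehension keeps the column order stable (a set difference does not).
# Note: Taken_product is not in categorical_features, so it is scaled and clustered as a numeric feature.
numeric_features = [f for f in features if f not in categorical_features]

# Build the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features)])

# Apply the preprocessor
X = preprocessor.fit_transform(data[features])

# KMeans clustering
# Use the elbow method to choose the number of clusters
inertia = []
K = range(1, 10)
for k in K:
    # Fix random_state so the elbow curve is reproducible
    kmeanModel = KMeans(n_clusters=k, random_state=15)
    kmeanModel.fit(X)
    inertia.append(kmeanModel.inertia_)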

# Plot the elbow curve
plt.figure(figsize=(16, 8))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Determining the Number of Clusters Using the Elbow Method')
plt.show()

The elbow plot shows that the decrease in inertia slows from the fourth cluster onward, so the number of clusters is set to 4.
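
Reading an elbow is somewhat subjective, so a second criterion can help confirm the choice. The sketch below computes the silhouette score for several values of k on synthetic data. It is a supplementary check that the original notebook does not perform; the helper `silhouette_by_k`, the synthetic blobs, and the `n_init=10` setting are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, random_state=15):
    """Fit KMeans for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Synthetic stand-in for the scaled feature matrix: four well-separated blobs
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X_toy, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

scores = silhouette_by_k(X_toy, range(2, 8))
best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()}, "best k:", best_k)
```

The silhouette score needs at least two clusters, so the range starts at 2. Passing the real matrix `X` in place of `X_toy` would show whether k = 4 is also favored by this criterion.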

6.2 Building the KMeans Clustering Model

In [19]:

# Initialize KMeans
kmeans = KMeans(n_clusters=4, random_state=15)

# Fit the model
kmeans.fit(X)

# Get the cluster labels
cluster_labels = kmeans.labels_

Kmean_data = data.copy()
Kmean_data['Cluster'] = cluster_labels

Kmean_data.head()
index, UserID, Taken_product, Yearly_avg_view_on_travel_page, preferred_device, total_likes_on_outstation_checkin_given, yearly_avg_Outstation_checkins, member_in_family, preferred_location_type, Yearly_avg_comment_on_travel_page, total_likes_on_outofstation_checkin_received, week_since_last_outstation_checkin, following_company_page, montly_avg_comment_on_company_page, working_flag, travelling_network_rating, Adult_flag, Daily_Avg_mins_spend_on_traveling_page, Cluster
0, 1000001, 1, 307.0, iOS and Android, 38570.0, 1, 2, Financial, 94.0, 5993, 8, 1, 11, 0, 1, 0.0, 8.0, 2
1, 1000002, 0, 367.0, iOS, 9765.0, 1, 1, Financial, 61.0, 5130, 1, 0, 23, 1, 4, 1.0, 10.0, 0
2, 1000003, 1, 277.0, iOS and Android, 48055.0, 1, 2, Other, 92.0, 2090, 6, 1, 15, 0, 2, 0.0, 7.0, 2
3, 1000004, 0, 247.0, iOS, 48720.0, 1, 4, Financial, 56.0, 2909, 1, 1, 11, 0, 3, 0.0, 8.0, 0
4, 1000005, 0, 202.0, iOS and Android, 20685.0, 1, 1, Medical, 40.0, 3468, 9, 0, 12, 0, 4, 1.0, 6.0, 0
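
Once every row carries a Cluster label, a compact profile (cluster size plus per-cluster means) is often easier to compare than a full `describe()` dump. This is a minimal sketch on a toy frame; the helper `cluster_profile` and the toy values are assumptions for illustration.

```python
import pandas as pd

def cluster_profile(df: pd.DataFrame, cluster_col: str, value_cols: list) -> pd.DataFrame:
    """One row per cluster: number of users followed by the mean of each value column."""
    grouped = df.groupby(cluster_col)
    profile = grouped[value_cols].mean()
    profile.insert(0, "n_users", grouped.size())
    return profile

# Toy frame imitating two clusters (illustrative values only)
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "Yearly_avg_view_on_travel_page": [260.0, 270.0, 360.0, 370.0, 365.0],
    "Daily_Avg_mins_spend_on_traveling_page": [10.0, 12.0, 25.0, 27.0, 26.0],
})

profile = cluster_profile(
    toy, "Cluster",
    ["Yearly_avg_view_on_travel_page", "Daily_Avg_mins_spend_on_traveling_page"],
)
print(profile)
```

With the real data the call would be `cluster_profile(Kmean_data, 'Cluster', numerical_features)`, where `numerical_features` is the list of numeric columns.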

6.3 Analyzing User Characteristics by Cluster

In [20]:

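# Analyze the distribution of user features within each cluster
# Use descriptive statistics for the numeric features
numerical_features = Kmean_data.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Compute descriptive statistics within each cluster
cluster_descriptions = Kmean_data.groupby('Cluster')[numerical_features].describe()

# Print the descriptive statistics for each cluster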
for cluster in cluster_descriptions.index:
    print(f"Cluster {cluster} statistics:\n")
    print(cluster_descriptions.loc[cluster])
Cluster 0 statistics:

Yearly_avg_view_on_travel_page                count      6725.000000
                                              mean        264.288773
                                              std          53.176680
                                              min          35.000000
                                              25%         227.000000
                                              50%         262.000000
                                              75%         293.000000
                                              max         462.000000
total_likes_on_outstation_checkin_given       count      6725.000000
                                              mean      28416.541413
                                              std       14142.871913
                                              min        3710.000000
                                              25%       16807.000000
                                              50%       28453.000000
                                              75%       40927.000000
                                              max      152430.000000
yearly_avg_Outstation_checkins                count      6725.000000
                                              mean          8.003271
                                              std           8.320505
                                              min           1.000000
                                              25%           1.000000
                                              50%           4.000000
                                              75%          12.000000
                                              max          29.000000
member_in_family                              count      6725.000000
                                              mean          2.895019
                                              std           1.048205
                                              min           1.000000
                                              25%           2.000000
                                              50%           3.000000
                                                           ...      
total_likes_on_outofstation_checkin_received  std        2395.897640
                                              min        1051.000000
                                              25%        2871.000000
                                              50%        4449.000000
                                              75%        6145.000000
                                              max       17452.000000
week_since_last_outstation_checkin            count      6725.000000
                                              mean          2.802677
                                              std           2.454801
                                              min           0.000000
                                              25%           1.000000
                                              50%           2.000000
                                              75%           4.000000
                                              max          11.000000
montly_avg_comment_on_company_page            count      6725.000000
                                              mean         22.922825
                                              std           6.984228
                                              min          11.000000
                                              25%          18.000000
                                              50%          23.000000
                                              75%          27.000000
                                              max          48.000000
Daily_Avg_mins_spend_on_traveling_page        count      6725.000000
                                              mean         11.068550
                                              std           5.205495
                                              min           0.000000
                                              25%           7.000000
                                              50%          10.000000
                                              75%          15.000000
                                              max          28.000000
Name: 0, Length: 72, dtype: float64
Cluster 1 statistics:

Yearly_avg_view_on_travel_page                count      2001.000000
                                              mean        364.580710
                                              std          47.464425
                                              min         223.000000
                                              25%         329.000000
                                              50%         367.000000
                                              75%         403.000000
                                              max         464.000000
total_likes_on_outstation_checkin_given       count      2001.000000
                                              mean      28573.460270
                                              std       14618.311084
                                              min        3570.000000
                                              25%       16283.000000
                                              50%       28785.000000
                                              75%       41304.000000
                                              max      152465.000000
yearly_avg_Outstation_checkins                count      2001.000000
                                              mean          8.064468
                                              std           8.796041
                                              min           1.000000
                                              25%           1.000000
                                              50%           3.000000
                                              75%          13.000000
                                              max          29.000000
member_in_family                              count      2001.000000
                                              mean          3.109445
                                              std           1.014651
                                              min           1.000000
                                              25%           3.000000
                                              50%           3.000000
                                                           ...      
total_likes_on_outofstation_checkin_received  std        4575.686433
                                              min        2320.000000
                                              25%       10539.000000
                                              50%       13966.000000
                                              75%       17924.000000
                                              max       20065.000000
week_since_last_outstation_checkin            count      2001.000000
                                              mean          4.371314
                                              std           2.655100
                                              min           0.000000
                                              25%           2.000000
                                              50%           4.000000
                                              75%           6.000000
                                              max          11.000000
montly_avg_comment_on_company_page            count      2001.000000
                                              mean         23.740130
                                              std           7.181325
                                              min          11.000000
                                              25%          18.000000
                                              50%          23.000000
                                              75%          29.000000
                                              max          46.000000
Daily_Avg_mins_spend_on_traveling_page        count      2001.000000
                                              mean         26.392304
                                              std           9.141035
                                              min           9.000000
                                              25%          21.000000
                                              50%          26.000000
                                              75%          31.000000
                                              max         235.000000
Name: 1, Length: 72, dtype: float64
Cluster 2 statistics:

Yearly_avg_view_on_travel_page                count     1566.000000
                                              mean       248.678799
                                              std         66.522126
                                              min         35.000000
                                              25%        206.250000
                                              50%        240.000000
                                              75%        279.000000
                                              max        446.000000
total_likes_on_outstation_checkin_given       count     1566.000000
                                              mean     26432.889527
                                              std      13967.138942
                                              min       3605.000000
                                              25%      14352.500000
                                              50%      24780.000000
                                              75%      38171.250000
                                              max      52414.000000
yearly_avg_Outstation_checkins                count     1566.000000
                                              mean         9.839080
                                              std          8.985947
                                              min          1.000000
                                              25%          1.000000
                                              50%          8.000000
                                              75%         17.000000
                                              max         29.000000
member_in_family                              count     1566.000000
                                              mean         2.811622
                                              std          1.015984
                                              min          1.000000
                                              25%          2.000000
                                              50%          3.000000
                                                           ...     
total_likes_on_outofstation_checkin_received  std       2678.781065
                                              min       1009.000000
                                              25%       2391.000000
                                              50%       3017.500000
                                              75%       5355.750000
                                              max      13766.000000
week_since_last_outstation_checkin            count     1566.000000
                                              mean         3.434227
                                              std          2.749699
                                              min          0.000000
                                              25%          1.000000
                                              50%          3.000000
                                              75%          5.000000
                                              max         11.000000
montly_avg_comment_on_company_page            count     1566.000000
                                              mean        22.931673
                                              std          7.066263
                                              min         11.000000
                                              25%         18.000000
                                              50%         23.000000
                                              75%         28.000000
                                              max         46.000000
Daily_Avg_mins_spend_on_traveling_page        count     1566.000000
                                              mean         9.457854
                                              std          5.826935
                                              min          0.000000
                                              25%          5.000000
                                              50%          9.000000
                                              75%         13.000000
                                              max         29.000000
Name: 2, Length: 72, dtype: float64
Cluster 3 statistics:

Yearly_avg_view_on_travel_page                count      162.000000
                                              mean       272.222222
                                              std         69.541980
                                              min        144.000000
                                              25%        226.000000
                                              50%        260.000000
                                              75%        316.750000
                                              max        436.000000
total_likes_on_outstation_checkin_given       count      162.000000
                                              mean     29008.716049
                                              std      13407.054543
                                              min       4241.000000
                                              25%      17414.000000
                                              50%      29051.500000
                                              75%      40562.500000
                                              max      52199.000000
yearly_avg_Outstation_checkins                count      162.000000
                                              mean         7.925926
                                              std         10.049910
                                              min          1.000000
                                              25%          1.000000
                                              50%          1.000000
                                              75%         16.000000
                                              max         29.000000
member_in_family                              count      162.000000
                                              mean         2.944444
                                              std          1.087764
                                              min          1.000000
                                              25%          2.000000
                                              50%          3.000000
                                                           ...     
total_likes_on_outofstation_checkin_received  std       4745.050630
                                              min       1099.000000
                                              25%       2907.500000
                                              50%       4826.500000
                                              75%       8646.000000
                                              max      19894.000000
week_since_last_outstation_checkin            count      162.000000
                                              mean         3.580247
                                              std          2.543359
                                              min          0.000000
                                              25%          2.000000
                                              50%          3.000000
                                              75%          5.000000
                                              max         11.000000
montly_avg_comment_on_company_page            count      162.000000
                                              mean       399.753086
                                              std         59.456134
                                              min        300.000000
                                              25%        346.250000
                                              50%        403.500000
                                              75%        454.750000
                                              max        499.000000
Daily_Avg_mins_spend_on_traveling_page        count      162.000000
                                              mean        16.709877
                                              std          8.747445
                                              min          3.000000
                                              25%         11.000000
                                              50%         14.000000
                                              75%         22.000000
                                              max         45.000000
Name: 3, Length: 72, dtype: float64

In [21]:

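# Explore the differences between clusters.
# For numerical features, box plots show how each feature's distribution differs by cluster.
def plot_feature_distribution(df, cols, hue):
    """Draw one box plot per feature, grouped by `hue`."""
    # squeeze=False always returns a 2-D array of axes, so indexing also works for a single feature
    fig, axes = plt.subplots(len(cols), 1, figsize=(10, 5 * len(cols)), squeeze=False)
    for i, col in enumerate(cols):
        ax = axes[i, 0]
        sns.boxplot(x=hue, y=col, data=df, ax=ax)
        ax.set_title(f'Boxplot of {col} by {hue}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Plot the distribution comparison for the numerical features
plot_feature_distribution(Kmean_data, numerical_features, 'Cluster')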

In [22]:

# Share of ticket buyers within each cluster
ticket_purchase_by_cluster = Kmean_data.groupby('Cluster')['Taken_product'].value_counts(normalize=True).unstack()

# Visualize ticket purchases per cluster
ticket_purchase_by_cluster.plot(kind='bar', stacked=True, figsize=(18,8))
plt.title('Ticket Purchase Proportion by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Proportion')
plt.legend(title='Taken Product', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=0)
plt.show()

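Cluster 0 (the largest cluster): these users view travel pages a moderate number of times, give a relatively high number of likes on outstation check-ins, check in out of station a moderate number of times, usually have 2 to 3 family members, comment little on the company page, spend a moderate amount of time per day on the travel page, and do not buy tickets.
Cluster 1 (the active users): these users view travel pages relatively often, give a relatively high number of likes on outstation check-ins, check in out of station a moderate number of times, have slightly larger families, comment a moderate amount on the company page, spend the most time per day on the travel page, and have a small probability of buying tickets.
Cluster 2 (the moderately active users): these users view travel pages a moderate number of times, give a moderate number of likes on outstation check-ins, check in out of station relatively often, usually have 2 to 3 family members, comment little on the company page, and spend a moderate amount of time per day on the travel page, yet all of them bought tickets.
Cluster 3 (a small but highly active cluster): these users view travel pages a moderate number of times, give a relatively high number of likes on outstation check-ins, check in out of station a moderate number of times, have about 3 family members, comment extremely often on the company page, and spend a relatively high amount of time per day on the travel page. Their purchase probability lies between those of cluster 1 and cluster 2.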
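
The cluster profiles above can also be tabulated directly instead of being read off the printed statistics. The following is a minimal sketch, assuming a DataFrame shaped like `Kmean_data` with a `Cluster` column and a `Taken_product` column holding 'Yes'/'No' labels (the label values are an assumption); the toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for Kmean_data; all values are illustrative, not from the real dataset.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 2, 2, 3],
    "Taken_product": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No"],
    "Daily_Avg_mins_spend_on_traveling_page": [12, 14, 13, 25, 27, 9, 10, 17],
})

# Per cluster: number of users, mean of one numeric feature, and share of ticket buyers.
profile = toy.groupby("Cluster").agg(
    users=("Taken_product", "size"),
    mean_daily_mins=("Daily_Avg_mins_spend_on_traveling_page", "mean"),
    purchase_rate=("Taken_product", lambda s: (s == "Yes").mean()),
)
print(profile)
```

Putting the purchase rate next to the feature means makes it easier to check statements such as "all users in this cluster bought tickets" against the numbers.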

7. Exploring the factors behind ticket purchases with a random forest

7.1 Data processing

In [23]:

# Select the features (mainly those found to be related to the target, with p-values below 0.0001) and the target variable
features = ['Taken_product','preferred_device', 'preferred_location_type', 'following_company_page', 'travelling_network_rating', 'Adult_flag',
            'Yearly_avg_view_on_travel_page', 'total_likes_on_outstation_checkin_given', 'yearly_avg_Outstation_checkins', 'member_in_family',
            'total_likes_on_outofstation_checkin_received', 'week_since_last_outstation_checkin', 'Daily_Avg_mins_spend_on_traveling_page']
# .copy() avoids pandas' SettingWithCopyWarning when new_data is modified in place later
new_data = data[features].copy()

In [24]:

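# Categorical variables that need one-hot encoding
categorical_features = ['preferred_device', 'preferred_location_type']
# Reset the row index so it lines up with the encoded DataFrame in the concat below
new_data.reset_index(drop=True, inplace=True)
# Create the one-hot encoder
# (this notebook runs an old scikit-learn; from version 1.2 onward the argument is sparse_output=False)
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
# Apply one-hot encoding
encoded_data = encoder.fit_transform(new_data[categorical_features])
# Convert the encoded array to a DataFrame
# (newer scikit-learn versions use encoder.get_feature_names_out instead of get_feature_names)
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names(categorical_features))
# Merge the encoded columns back into the remaining data
new_data = new_data.drop(categorical_features, axis=1)
new_data = pd.concat([new_data, encoded_df], axis=1)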

In [25]:

x = new_data.drop('Taken_product', axis=1)
# Convert the target to numeric codes; leaving it as a category dtype would affect the model
y = new_data['Taken_product'].cat.codes
# Stratified sampling keeps the Taken_product distribution in the training and test sets close to that of the full dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y)  # 70/30 split

In [26]:

# Separate the minority and majority classes
x_minority = x_train[y_train == 1]
y_minority = y_train[y_train == 1]
x_majority = x_train[y_train == 0]
y_majority = y_train[y_train == 0]
# Oversample the minority class with replacement until it matches the majority class size;
# the same random_state draws the same row positions for features and labels
x_minority_resampled = resample(x_minority, replace=True, n_samples=len(x_majority), random_state=15)
y_minority_resampled = resample(y_minority, replace=True, n_samples=len(y_majority), random_state=15)
new_x_train = pd.concat([x_majority, x_minority_resampled])
new_y_train = pd.concat([y_majority, y_minority_resampled])
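
The oversampling step can also be written as a single `resample` call, so that features and labels are drawn with the same indices by construction. This is a minimal sketch on made-up toy data; the frame `x_toy`, the series `y_toy`, and their values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training data: 6 majority rows (0) and 2 minority rows (1).
x_toy = pd.DataFrame({"f1": range(8), "f2": [v * 10 for v in range(8)]})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])

x_major, y_major = x_toy[y_toy == 0], y_toy[y_toy == 0]
x_minor, y_minor = x_toy[y_toy == 1], y_toy[y_toy == 1]

# Passing both objects to one call keeps each sampled row paired with its label.
x_minor_up, y_minor_up = resample(
    x_minor, y_minor, replace=True, n_samples=len(x_major), random_state=15
)

x_balanced = pd.concat([x_major, x_minor_up])
y_balanced = pd.concat([y_major, y_minor_up])
print(y_balanced.value_counts())
```

Only the training split is oversampled; the test split keeps its original class ratio, so the evaluation still reflects the real class distribution.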

7.2 Model building

In [27]:

rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(new_x_train, new_y_train)
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=15, verbose=0,
                       warm_start=False)

7.3 Model evaluation

In [28]:

y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      2629
           1       0.98      0.92      0.95       508

    accuracy                           0.98      3137
   macro avg       0.98      0.96      0.97      3137
weighted avg       0.98      0.98      0.98      3137

In [29]:

# Plot the confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(8,6))
sns.heatmap(cm_rf, annot=True, fmt='g', cmap='Blues', 
            xticklabels=['Predicted 0', 'Predicted 1'], 
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix for Random Forest Model')
plt.show()

In [30]:

# Plot the ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_clf.predict_proba(x_test)[:,1])
roc_auc_rf = auc(fpr_rf, tpr_rf)
plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, color='darkorange', lw=2, label='ROC curve (area = %0.4f)' % roc_auc_rf)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Random Forest')
plt.legend(loc="lower right")
plt.show()

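The random forest model scores as follows:
1. Precision: 0.98 for class 0 and 0.98 for class 1.
2. Recall: 1.00 for class 0 and 0.92 for class 1.
3. F1 score: 0.99 for class 0 and 0.95 for class 1.
4. Accuracy: 0.98
5. ROC AUC: 0.9973
The model performs very well on the held-out test set, so no further hyperparameter tuning was done.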
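
The scores in the classification report all derive from the four cells of the confusion matrix. The sketch below walks through the arithmetic for class 1. The counts are illustrative values (an assumption) chosen to be consistent with the rounded report and the class supports of 2629 and 508; they are not the actual matrix, which appears only as a figure.

```python
# Illustrative counts (assumption): consistent with the rounded report, not the real matrix.
tn, fp = 2619, 10   # actual class 0: predicted 0, predicted 1
fn, tp = 41, 467    # actual class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)   # of the predicted buyers, how many really bought
recall_1 = tp / (tp + fn)      # of the real buyers, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.98 recall=0.92 f1=0.95 accuracy=0.98
```

Because class 1 makes up only about 16% of the test set (508 of 3137), recall on class 1 is the score that reacts most strongly to missed buyers, while accuracy is dominated by class 0.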

7.4 Feature importance analysis

In [31]:

rf_feature_importance = rf_clf.feature_importances_
feature_names = new_x_train.columns
rf_feature_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': rf_feature_importance
})
sorted_rf_feature_df = rf_feature_df.sort_values(by='Importance', ascending=False).head()  # keep the five most important features

sorted_rf_feature_df
                                        Feature  Importance
7  total_likes_on_outofstation_checkin_received    0.174251
4       total_likes_on_outstation_checkin_given    0.122688
3                Yearly_avg_view_on_travel_page    0.115198
9        Daily_Avg_mins_spend_on_traveling_page    0.079616
5                yearly_avg_Outstation_checkins    0.077870

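In the random forest model, the features rank by importance as follows: total likes received on out-of-station check-ins in the past year > total likes given on outstation check-ins in the past year > yearly average views on travel-related pages > average daily minutes spent on the company's travel page > yearly average number of outstation check-ins.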
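
scikit-learn's impurity-based `feature_importances_` are normalized to sum to 1, so the five values in the table can be read as shares of the total importance. A quick check of the arithmetic, using only the five values printed in the table:

```python
# The five importances printed in the table, keyed by feature name.
top5 = {
    "total_likes_on_outofstation_checkin_received": 0.174251,
    "total_likes_on_outstation_checkin_given": 0.122688,
    "Yearly_avg_view_on_travel_page": 0.115198,
    "Daily_Avg_mins_spend_on_traveling_page": 0.079616,
    "yearly_avg_Outstation_checkins": 0.077870,
}

top5_share = sum(top5.values())
print(f"top-5 share of total importance: {top5_share:.4f}")
print(f"left for all remaining features: {1 - top5_share:.4f}")
```

The top five features therefore carry about 57% of the total importance, and the remaining features, including the one-hot encoded columns, share the other 43% or so.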

8. Summary

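This project examined the users from three angles (purchase behavior analysis, cluster analysis, and a random forest model) and identified the main factors that influence ticket purchases. The conclusions are:
1. The purchase rate is highest when the user's age status is 0, which may represent young people. Users with age status 1 or 2 have the lowest purchase rate and may be middle-aged. Working and non-working users have the same purchase rate, which is consistent with the later analysis; the site has more non-working users. The purchase rate is relatively high for smaller families but drops noticeably once the family reaches 5 members.
2. The chi-square tests show that the user's preferred login device, preferred location type for travel, whether the customer follows the company page, and the travelling network rating are significantly related to ticket purchases. The Spearman correlation analysis shows that yearly average views on travel-related pages, total likes given on outstation check-ins in the past year, yearly average number of outstation check-ins, number of family members mentioned in the account, total likes received on out-of-station check-ins in the past year, weeks since the last outstation check-in, and average daily minutes spent on the company's travel page are all correlated with ticket purchases to some degree, each with a p-value below 0.0001.
3. Clustering produced 4 distinct groups of users; cluster 2 contains the main ticket buyers, and cluster 3 contains potential ticket buyers.
4. The random forest produced a strong predictive model and identified the most important factors for ticket purchases (top five by importance): total likes received on out-of-station check-ins in the past year > total likes given on outstation check-ins in the past year > yearly average views on travel-related pages > average daily minutes spent on the company's travel page > yearly average number of outstation check-ins.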
