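From recruiting to training, every stage of bringing on an employee costs a company considerable resources, so each departure brings some loss. To understand why employees leave and to predict who is likely to leave next, IBM released a dataset of its employee records along with the following problem statement:
"Predict employee attrition, that is, whether an employee will leave, given the employee's details, i.e. the factors that drive attrition."
This article uses logistic regression to explore the problem.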
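Before turning to the real data, here is a minimal sketch of what fitting a logistic regression looks like. It runs on synthetic data, not the IBM dataset: the two features (`years`, `satisfaction`), the rule that generates the labels, and the sample size are all assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: years at the company and a 1-4 satisfaction score
years = rng.uniform(0, 20, n)
satisfaction = rng.integers(1, 5, n)

# Synthetic rule (an assumption): short tenure and low satisfaction
# raise the odds of leaving
logit = 1.5 - 0.2 * years - 0.6 * satisfaction
prob_leave = 1 / (1 + np.exp(-logit))
left = (rng.random(n) < prob_leave).astype(int)

X = np.column_stack([years, satisfaction])
X_train, X_test, y_train, y_test = train_test_split(
    X, left, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
coef_years, coef_satisfaction = model.coef_[0]
print(test_accuracy, coef_years, coef_satisfaction)
```

A negative coefficient means the feature lowers the estimated odds of leaving, which is how the fitted model can hint at the causes of attrition as well as predict it.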
import matplotlib.pyplot as plt
import pylab as pl
import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)
sns.set()
# Load the dataset with pandas
data = pd.read_csv('/.../logistic_regression_data.csv', sep=",")
data.head()  # Output shown below
Only part of the output is shown here.
Fill in missing values:
# Data preprocessing
data.fillna(0, inplace=True)
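To see what this step does, here is a small sketch on a toy frame. The `MonthlyIncome` column and all the values are hypothetical, chosen only to show the effect. Counting the missing entries per column first is a cheap check, since filling with 0 is a simple choice that can distort numeric columns such as `Age`.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names and values are illustrative only
toy = pd.DataFrame({
    "Age": [29, np.nan, 41],
    "MonthlyIncome": [5000.0, 6200.0, np.nan],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Replace every missing value with 0, returning a new frame
filled = toy.fillna(0)
print(filled)
```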
Inspecting the data shows that the Age column spans too wide a range of values, so we group this feature into bins:
# Group ages into bins; this helps because Age has 78 distinct values here
def Age(dataframe):
    dataframe.loc[dataframe['Age'] <= 30, 'Age'] = 1
    dataframe.loc[(dataframe['Age'] > 30) & (dataframe['Age'] <= 40), 'Age'] = 2
    dataframe.loc[(dataframe['Age'
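The same kind of grouping can also be sketched with `pd.cut`, which assigns every bin in one pass. Only the first two cut points (30 and 40) come from the code above. The bins above 40, the sample ages, and the `AgeGroup` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample ages, made up for illustration
ages = pd.DataFrame({"Age": [22, 30, 31, 40, 47, 63]})

# Right-closed intervals: (0, 30], (30, 40], (40, 50], (50, inf)
# Cut points above 40 are assumed, not taken from the original code
bins = [0, 30, 40, 50, np.inf]
labels = [1, 2, 3, 4]

# Write the group code to a new column so the raw ages are preserved
ages["AgeGroup"] = pd.cut(ages["Age"], bins=bins, labels=labels).astype(int)
print(ages)
```

Writing to a separate column also makes the step safe to re-run, whereas overwriting `Age` in place would map every already-coded value into the first group on a second pass.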