当前位置:   article > 正文

基于随机森林算法实现泰坦尼克号的生存预测,python_train_x.loc

train_x.loc

Kaggle有这样一个经典的题目,根据船上的用户基本信息,判断剩下的人是否能生存下来。话不多说直接进入主题。

下载数据集

Kaggle

数据集下载

在这里插入图片描述
包含了源代码+训练集+ 测试集

整理数据

  • 这一部主要处理缺失的数据,
  • 将年龄等常数用平均值代替
  • 将登船口用众数代替
def select_data():
    selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare','Embarked']
    train_data = load_data()
    test_data = load_data("test")
    train_x = train_data[selected_features]
    train_y = train_data["Survived"]
    test_x = test_data[selected_features]
    
    train_x["Age"].fillna(train_x["Age"].mean(), inplace = True)
    
    train_x['Embarked'].fillna('S',inplace=True) #'S'出现次数最多,因此以'S'进行填充
    test_x["Age"].fillna(test_x["Age"].mean(), inplace = True)
    test_x["Fare"].fillna(test_x["Fare"].mean(), inplace = True)
    
    train_x = format_data(train_x)
    test_x = format_data(test_x)
    print(test_x.info())
    return train_x, train_y, test_x

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19

数据数字化

  • 将性别,登机口用 数学的形式来表示,方便训练
def format_data(train_x):
    # 数据化性别
    train_x.loc[train_x['Sex'] == "male", "Sex"] = 0
    train_x.loc[train_x["Sex"] == "female", "Sex"] = 1
    
    train_x.loc[train_x['Embarked'] == "S", "Embarked"] = 0
    train_x.loc[train_x["Embarked"] == "C", "Embarked"] = 1
    train_x.loc[train_x['Embarked'] == "Q", "Embarked"] = 2
    return train_x
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

训练模型,并预测,写入到文件当中

def random_forest():
    test_data = load_data('test')
    x_train, y_train, x_test = select_data()
    model = RandomForestClassifier()
    paras = {'n_estimators': np.arange(10, 100, 10), 'criterion': ['gini', 'entropy'], 'max_depth': np.arange(5, 50, 5)}
    gs = GridSearchCV(model, paras, cv=5, verbose=1,n_jobs=-1)
    gs.fit(x_train, y_train)
    y_pre = gs.predict(x_test)
    print('best score:', gs.best_score_)
    print('best parameters:', gs.best_params_)
    print((test_data))
    result = ''
    with open('./result.csv', 'w', encoding="utf-8") as f:
        f.write("PassengerId,Survived" + "\n")
        for i in range(len(y_pre)): 
           
            result = str(test_data.iloc[i,0]) + "," +  str(y_pre[i])
            f.write(result + "\n")
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/很楠不爱3/article/detail/237112
推荐阅读
相关标签
  

闽ICP备14008679号