Logistic Regression (LR) is a classic machine-learning classification model. Despite the "regression" in its name, it is essentially a linear regression model with a Sigmoid function (a non-linear mapping) applied to the output. Because it is simple, efficient, and easy to parallelize, it is widely used in industry.
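The idea above, a linear combination of the features pushed through a Sigmoid, can be sketched in a few lines (the weights and inputs here are arbitrary illustrative values, not learned parameters):

```python
import numpy as np

def sigmoid(z):
    # Squash any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict_proba(w, b, x):
    # Linear regression output (w . x + b) followed by the Sigmoid mapping
    return sigmoid(np.dot(w, x) + b)

# Illustrative example: 3 features with hand-picked weights
w = np.array([0.5, -0.25, 0.1])
b = 0.2
x = np.array([1.0, 2.0, 3.0])
p = lr_predict_proba(w, b, x)  # probability of the positive class
```

The output `p` is interpreted as the probability that the sample belongs to the positive class; thresholding it (typically at 0.5) turns the model into a binary classifier.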
The LR model is mainly used for classification tasks, usually binary classification. In recommender-system work, LR often serves as a baseline model that can be brought online quickly.
Fundamentally, logistic regression, like linear regression, belongs to the family of generalized linear models. Although its name suggests regression, in recommendation algorithms we treat it as a linear model applied to classification tasks.
Summary: logistic regression assumes the labels follow a Bernoulli distribution, estimates the parameters by maximum likelihood, and solves the resulting optimization with gradient descent, yielding a binary classifier.
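The summary above, maximum likelihood under a Bernoulli assumption solved by gradient descent, can be sketched as a minimal batch-gradient-descent training loop (a toy one-dimensional dataset and hand-picked learning rate and epoch count, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.1, epochs=2000):
    # Minimize the negative log-likelihood of the Bernoulli model
    # (binary cross-entropy) with batch gradient descent.
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted P(y=1 | x)
        grad_w = X.T @ (p - y) / n      # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)         # gradient of the loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data: label flips between x=1 and x=2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_lr(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

The gradient `X.T @ (p - y) / n` falls out of differentiating the cross-entropy loss, which is exactly the negative log-likelihood the summary refers to.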
```python
import pandas as pd
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.preprocessing import OneHotEncoder

# Load the MovieLens 1M dataset
# Dataset: https://grouplens.org/datasets/movielens/
ratings = pd.read_csv("../../data/ml-1m/ratings.dat", sep="::", header=None,
                      names=["user_id", "movie_id", "rating", "timestamp"],
                      encoding="ISO-8859-1", engine="python")
movies = pd.read_csv("../../data/ml-1m/movies.dat", sep="::", header=None,
                     names=["movie_id", "title", "genres"],
                     encoding="ISO-8859-1", engine="python")

# Merge the two tables on movie_id and drop timestamp and title
data = pd.merge(ratings, movies, on="movie_id").drop(columns=["timestamp", "title"])

# Expand the pipe-separated genres field into binary indicator columns
genres_df = data.genres.str.get_dummies(sep="|")
data = pd.concat([data, genres_df], axis=1).drop(columns=["genres"])

# Features and label for the GBDT and LR models
# (the label here is the 1-5 rating, so the LR step is multi-class)
features = data.drop(columns=["user_id", "movie_id", "rating"])
label = data["rating"]

# 80/20 train/test split
split_index = int(len(data) * 0.8)
train_x, train_y = features[:split_index], label[:split_index]
test_x, test_y = features[split_index:], label[split_index:]

# Train the GBDT model and extract the leaf index each sample lands in
gbdt_model = lgb.LGBMRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
gbdt_model.fit(train_x, train_y)
gbdt_train_leaves = gbdt_model.predict(train_x, pred_leaf=True)
gbdt_test_leaves = gbdt_model.predict(test_x, pred_leaf=True)

# One-hot encode the leaf indices; fit the encoder on the training leaves only
# and reuse it on the test leaves (handle_unknown covers unseen leaf ids).
# The matrices stay sparse, which LogisticRegression accepts directly.
one_hot = OneHotEncoder(handle_unknown="ignore")
one_hot_train = one_hot.fit_transform(gbdt_train_leaves)
one_hot_test = one_hot.transform(gbdt_test_leaves)

# Train the LR model on the encoded leaf features (default hyperparameters)
lr_model = LogisticRegression(max_iter=100)
lr_model.fit(one_hot_train, train_y)

# Evaluate on the test set
y_pred = lr_model.predict(one_hot_test)
print(f"Accuracy: {accuracy_score(test_y, y_pred)}")
print(f"Precision (macro): {precision_score(test_y, y_pred, average='macro')}")
print(f"Recall (macro): {recall_score(test_y, y_pred, average='macro')}")
print(f"F1-Score (macro): {f1_score(test_y, y_pred, average='macro')}")
```