Training Set, Validation Set, Test Set
Hold-out split: by default, 75% of the dataset is used as the training set and 25% as the test set.
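A minimal sketch of this default hold-out split with sklearn's `train_test_split` (the iris dataset is used here only for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target  # 150 samples

# test_size defaults to 0.25, i.e. a 75%/25% train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))  # 112 38
```

Passing `test_size=` or `train_size=` explicitly overrides the 75/25 default.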
Leave-one-out cross-validation: split the dataset into k subsets, where k equals the number of samples in the dataset; each round, a single sample serves as the test set and all remaining samples form the training set. The estimate this method produces is the closest to the expected performance of training on the entire dataset, but its computational cost is very high, so it is only suitable for small datasets.
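A sketch of leave-one-out cross-validation using sklearn's `LeaveOneOut` splitter, again on the iris dataset (the choice of `LogisticRegression` and `max_iter=1000` here is illustrative, not from the original text):

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 150 splits: one per sample

# Each of the 150 folds trains on 149 samples and tests on 1,
# which is why the cost grows quickly with dataset size.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print('mean accuracy:', scores.mean())
```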
k-fold cross-validation: split the dataset into k subsets; in each round, k-1 subsets are used for training and the remaining subset for testing. Repeat this k times and report the mean of the k cross-validation accuracies as the result. `train_test_split` uses a default train/test ratio of 3:1; with 5-fold cross-validation the train/test ratio is 4:1, and with 10-fold it is 9:1. The more data available for training, the better the model generally tends to perform.
```python
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()  # load the iris dataset
X = data.data
y = data.target

kf = KFold(n_splits=5)  # 5-fold cross-validation
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1
pred = model.predict_proba(X_test)[:, 1]  # class-1 probabilities for the last fold's model
```
Stratification rearranges the data so that each fold is a good representative of the whole dataset.
For example, in a binary classification problem with two classes (F and M) where the F:M ratio in the original data is about 1:3, a stratified 5-fold split keeps the F:M ratio in every fold at roughly 1:3, consistent with the original data.
```python
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(X, y):  # split needs y to stratify on the labels
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=skf)
print("stratified cross validation scores: {}".format(scores))
print("Mean score of stratified cross validation: {:.2f}".format(scores.mean()))
```
RepeatedKFold simply repeats k-fold cross-validation n times, with different randomization in each repetition.
```python
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()
X = data.data
y = data.target

# 5-fold cross-validation, repeated twice: 10 splits in total
kf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=None)
i = 1
for train_index, test_index in kf.split(X):
    print('\n{} of {} splits'.format(i, kf.get_n_splits()))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1
pred = model.predict_proba(X_test)[:, 1]  # class-1 probabilities for the last split's model
```