
[Machine Learning] A Guide to PyCaret, a Low-Code Machine Learning Library

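PyCaret is an open-source, low-code machine learning library for Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that shortens the experiment cycle and improves productivity. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray. Compared with other open-source machine learning libraries, it can replace hundreds of lines of code with just a few. PyCaret open-source repository: pycaret; official documentation: pycaret-docs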

The basic version of PyCaret is installed with:

pip install pycaret

The full version is installed with:

pip install pycaret[full]

In addition, the following models can run on a GPU:

  • Extreme Gradient Boosting
  • Catboost
  • Light Gradient Boosting (requires lightgbm to be installed)
  • Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (requires cuml 0.15 or later)
# Check the PyCaret version
import pycaret
pycaret.__version__
'3.3.2'

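1 Quick Start

PyCaret supports a range of machine learning tasks, including classification, regression, clustering, anomaly detection, and time series forecasting. This section covers the basics of building a model for each task. For more detailed, task-specific usage, see: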

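| Topic | Notebook | Link |
| Binary classification | BinaryClassification | link |
| Multiclass classification | MulticlassClassification | link |
| Regression | Regression | link |
| Clustering | Clustering | link |
| Anomaly detection | AnomalyDetection | link |
| Time series forecasting | TimeSeriesForecasting | link |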

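1.1 Classification

PyCaret's classification module handles binary and multiclass problems, that is, assigning items to discrete groups. Common use cases include predicting whether a customer will default, predicting whether a customer will churn, and diagnosing a disease (positive or negative). Example code follows.

Data preparation

Load the diabetes sample dataset: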

from pycaret.datasets import get_data
# Load from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/diabetes', verbose=False)
# Download the public dataset from the PyCaret repository
# data = get_data('diabetes', verbose=False)
# Check the data type and shape
type(data), data.shape
(pandas.core.frame.DataFrame, (768, 9))
# The last column indicates whether the patient has diabetes; the other columns are features
data.head()
| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |

PyCaret's core function setup initializes the modeling environment and prepares the data for model training and evaluation:

from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# target: target column; session_id: random seed; preprocess: whether to preprocess the data; train_size: fraction of the data used for training; normalize: whether to scale the features; normalize_method: scaling method
s.setup(data, target = 'Class variable', session_id = 0, verbose= False, train_size = 0.7, normalize = True, normalize_method = 'minmax')
<pycaret.classification.oop.ClassificationExperiment at 0x200b939df40>

List the variables created by setup:

s.get_config()
{'USI',
 'X',
 'X_test',
 'X_test_transformed',
 'X_train',
 'X_train_transformed',
 'X_transformed',
 '_available_plots',
 '_ml_usecase',
 'data',
 'dataset',
 'dataset_transformed',
 'exp_id',
 'exp_name_log',
 'fix_imbalance',
 'fold_generator',
 'fold_groups_param',
 'fold_shuffle_param',
 'gpu_n_jobs_param',
 'gpu_param',
 'html_param',
 'idx',
 'is_multiclass',
 'log_plots_param',
 'logging_param',
 'memory',
 'n_jobs_param',
 'pipeline',
 'seed',
 'target_param',
 'test',
 'test_transformed',
 'train',
 'train_transformed',
 'variable_and_property_keys',
 'variables',
 'y',
 'y_test',
 'y_test_transformed',
 'y_train',
 'y_train_transformed',
 'y_transformed'}

View the normalized data:

s.get_config('X_train_transformed')
| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) |
| 34 | 0.588235 | 0.613065 | 0.639344 | 0.313131 | 0.000000 | 0.411326 | 0.185312 | 0.400000 |
| 221 | 0.117647 | 0.793970 | 0.737705 | 0.000000 | 0.000000 | 0.470939 | 0.310418 | 0.750000 |
| 531 | 0.000000 | 0.537688 | 0.622951 | 0.000000 | 0.000000 | 0.675112 | 0.259607 | 0.050000 |
| 518 | 0.764706 | 0.381910 | 0.491803 | 0.000000 | 0.000000 | 0.488823 | 0.043553 | 0.333333 |
| 650 | 0.058824 | 0.457286 | 0.442623 | 0.252525 | 0.118203 | 0.375559 | 0.066610 | 0.033333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 628 | 0.294118 | 0.643216 | 0.655738 | 0.000000 | 0.000000 | 0.515648 | 0.028181 | 0.400000 |
| 456 | 0.058824 | 0.678392 | 0.442623 | 0.000000 | 0.000000 | 0.397914 | 0.260034 | 0.683333 |
| 398 | 0.176471 | 0.412060 | 0.573770 | 0.000000 | 0.000000 | 0.314456 | 0.132792 | 0.066667 |
| 6 | 0.176471 | 0.391960 | 0.409836 | 0.323232 | 0.104019 | 0.461997 | 0.072588 | 0.083333 |
| 294 | 0.000000 | 0.809045 | 0.409836 | 0.000000 | 0.000000 | 0.326379 | 0.075149 | 0.733333 |

537 rows × 8 columns
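
To make the effect of normalize_method = 'minmax' concrete, here is a minimal pure-Python sketch of min-max scaling. It is not PyCaret's implementation, and the helper name min_max_scale is made up for illustration; it only shows the (x - min) / (max - min) rescaling, applied to the first five "Number of times pregnant" values from data.head().

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# First five "Number of times pregnant" values shown by data.head()
pregnancies = [6, 1, 8, 1, 0]
scaled = min_max_scale(pregnancies)
print(scaled)
```

In a real pipeline the minimum and maximum would normally be learned from the training split only and then reused on the test split, so the values above will not match the table exactly.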

Plot a histogram of a single column:

s.get_config('X_train_transformed')['Number of times pregnant'].hist()
<AxesSubplot:>

[Figure: histogram of the normalized "Number of times pregnant" column]

Alternatively, the environment can be initialized through the functional API:

from pycaret.classification import setup
# s = setup(data, target = 'Class variable', session_id = 0, preprocess = True, train_size = 0.7, verbose = False)

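Model training and evaluation

PyCaret's compare_models function trains every estimator available in the model library and evaluates each one with cross-validation (10 folds by default):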

best = s.compare_models()
# Compare only selected models
# best = s.compare_models(include = ['dt', 'rf', 'et', 'gbc', 'lightgbm'])
# Return the n_select best models ranked by recall
# best_recall_models_top3 = s.compare_models(sort = 'Recall', n_select = 3)
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
| lr | Logistic Regression | 0.7633 | 0.8132 | 0.4968 | 0.7436 | 0.5939 | 0.4358 | 0.4549 | 0.2720 |
| ridge | Ridge Classifier | 0.7633 | 0.8113 | 0.5178 | 0.7285 | 0.6017 | 0.4406 | 0.4560 | 0.0090 |
| lda | Linear Discriminant Analysis | 0.7633 | 0.8110 | 0.5497 | 0.7069 | 0.6154 | 0.4489 | 0.4583 | 0.0080 |
| ada | Ada Boost Classifier | 0.7465 | 0.7768 | 0.5655 | 0.6580 | 0.6051 | 0.4208 | 0.4255 | 0.0190 |
| svm | SVM - Linear Kernel | 0.7408 | 0.8087 | 0.5921 | 0.6980 | 0.6020 | 0.4196 | 0.4480 | 0.0080 |
| nb | Naive Bayes | 0.7391 | 0.7939 | 0.5442 | 0.6515 | 0.5857 | 0.3995 | 0.4081 | 0.0080 |
| rf | Random Forest Classifier | 0.7337 | 0.8033 | 0.5406 | 0.6331 | 0.5778 | 0.3883 | 0.3929 | 0.0350 |
| et | Extra Trees Classifier | 0.7298 | 0.7899 | 0.5181 | 0.6416 | 0.5677 | 0.3761 | 0.3840 | 0.0450 |
| gbc | Gradient Boosting Classifier | 0.7281 | 0.8007 | 0.5567 | 0.6267 | 0.5858 | 0.3857 | 0.3896 | 0.0260 |
| lightgbm | Light Gradient Boosting Machine | 0.7242 | 0.7811 | 0.5827 | 0.6096 | 0.5935 | 0.3859 | 0.3876 | 0.0860 |
| qda | Quadratic Discriminant Analysis | 0.7150 | 0.7875 | 0.4962 | 0.6225 | 0.5447 | 0.3428 | 0.3524 | 0.0080 |
| knn | K Neighbors Classifier | 0.7131 | 0.7425 | 0.5287 | 0.6005 | 0.5577 | 0.3480 | 0.3528 | 0.2200 |
| dt | Decision Tree Classifier | 0.6685 | 0.6461 | 0.5722 | 0.5266 | 0.5459 | 0.2868 | 0.2889 | 0.0100 |
| dummy | Dummy Classifier | 0.6518 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0120 |
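
Each score in the table is an average over cross-validation folds. The sketch below shows, in plain Python, how 10-fold cross-validation could partition the 537 training rows. The function name k_fold_indices is hypothetical, and the sketch deliberately ignores shuffling and stratification, which the library's own fold generator may apply.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold cross-validation."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 537 is the number of training rows reported for the diabetes example
folds = list(k_fold_indices(n_samples=537, n_folds=10))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} validation={len(val_idx)}")
```

Every row is used for validation exactly once, and a model's reported metric is the mean of the per-fold values.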

The automl function returns the best model among all models trained in the current setup:

best_ml = s.automl()
# best_ml
# Print the best-performing model
print(best)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

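Visualization

PyCaret also provides the plot_model function for visualizing a model's evaluation metrics; its plot argument selects the type of plot. The available values are listed below (note that not every model supports every plot):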

  • pipeline: Schematic drawing of the preprocessing pipeline
  • auc: Area Under the Curve
  • threshold: Discrimination Threshold
  • pr: Precision Recall Curve
  • confusion_matrix: Confusion Matrix
  • error: Class Prediction Error
  • class_report: Classification Report
  • boundary: Decision Boundary
  • rfe: Recursive Feature Selection
  • learning: Learning Curve
  • manifold: Manifold Learning
  • calibration: Calibration Curve
  • vc: Validation Curve
  • dimension: Dimension Learning
  • feature: Feature Importance
  • feature_all: Feature Importance (All)
  • parameter: Model Hyperparameter
  • lift: Lift Curve
  • gain: Gain Chart
  • tree: Decision Tree
  • ks: KS Statistic Plot
# Retrieve the results table of all compared models
models_results = s.pull()
models_results
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
| lr | Logistic Regression | 0.7633 | 0.8132 | 0.4968 | 0.7436 | 0.5939 | 0.4358 | 0.4549 | 0.272 |
| ridge | Ridge Classifier | 0.7633 | 0.8113 | 0.5178 | 0.7285 | 0.6017 | 0.4406 | 0.4560 | 0.009 |
| lda | Linear Discriminant Analysis | 0.7633 | 0.8110 | 0.5497 | 0.7069 | 0.6154 | 0.4489 | 0.4583 | 0.008 |
| ada | Ada Boost Classifier | 0.7465 | 0.7768 | 0.5655 | 0.6580 | 0.6051 | 0.4208 | 0.4255 | 0.019 |
| svm | SVM - Linear Kernel | 0.7408 | 0.8087 | 0.5921 | 0.6980 | 0.6020 | 0.4196 | 0.4480 | 0.008 |
| nb | Naive Bayes | 0.7391 | 0.7939 | 0.5442 | 0.6515 | 0.5857 | 0.3995 | 0.4081 | 0.008 |
| rf | Random Forest Classifier | 0.7337 | 0.8033 | 0.5406 | 0.6331 | 0.5778 | 0.3883 | 0.3929 | 0.035 |
| et | Extra Trees Classifier | 0.7298 | 0.7899 | 0.5181 | 0.6416 | 0.5677 | 0.3761 | 0.3840 | 0.045 |
| gbc | Gradient Boosting Classifier | 0.7281 | 0.8007 | 0.5567 | 0.6267 | 0.5858 | 0.3857 | 0.3896 | 0.026 |
| lightgbm | Light Gradient Boosting Machine | 0.7242 | 0.7811 | 0.5827 | 0.6096 | 0.5935 | 0.3859 | 0.3876 | 0.086 |
| qda | Quadratic Discriminant Analysis | 0.7150 | 0.7875 | 0.4962 | 0.6225 | 0.5447 | 0.3428 | 0.3524 | 0.008 |
| knn | K Neighbors Classifier | 0.7131 | 0.7425 | 0.5287 | 0.6005 | 0.5577 | 0.3480 | 0.3528 | 0.220 |
| dt | Decision Tree Classifier | 0.6685 | 0.6461 | 0.5722 | 0.5266 | 0.5459 | 0.2868 | 0.2889 | 0.010 |
| dummy | Dummy Classifier | 0.6518 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.012 |
s.plot_model(best, plot = 'confusion_matrix')

[Figure: confusion matrix of the best model]
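
The Accuracy, Recall, Prec. and F1 columns can all be derived from the four cells of a binary confusion matrix. The following sketch uses made-up counts (they are not taken from the diabetes experiment) and a hypothetical helper binary_metrics to show the arithmetic.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero, e.g. a model that never predicts the positive class
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Recall": recall, "Prec.": precision, "F1": f1}

# Hypothetical counts, for illustration only
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# A classifier that always predicts the negative class gets zero recall, precision and F1
always_negative = binary_metrics(tp=0, fp=0, fn=60, tn=140)
```

The second call mirrors the Dummy Classifier row in the comparison table, where Recall, Prec. and F1 are all 0.0000.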

In a Jupyter environment, the evaluate_model function displays model performance interactively:

# s.evaluate_model(best)

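Prediction

The predict_model function generates predictions and returns a pandas DataFrame containing the predicted label (prediction_label) and score (prediction_score). When data is None, it predicts labels and scores on the test set created during setup.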

# Predict on the entire dataset
res = s.predict_model(best, data=data)
# Inspect the per-row predictions
# res
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| 0 | Logistic Regression | 0.7708 | 0.8312 | 0.5149 | 0.7500 | 0.6106 | 0.4561 | 0.4723 |
# Predict on the hold-out test set created during setup
res = s.predict_model(best)
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| 0 | Logistic Regression | 0.7576 | 0.8553 | 0.5062 | 0.7193 | 0.5942 | 0.4287 | 0.4422 |

Saving and loading models

# Save the model to disk
_ = s.save_model(best, 'best_model', verbose = False)
# Load the model
model = s.load_model( 'best_model')
# Inspect the model structure
# model
Transformation Pipeline and Model Successfully Loaded
# Predict on the entire dataset
res = s.predict_model(model, data=data)
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| 0 | Logistic Regression | 0.7708 | 0.8312 | 0.5149 | 0.7500 | 0.6106 | 0.4561 | 0.4723 |

1.2 Regression

PyCaret provides the regression module for regression tasks; it is used in the same way as the classification module.

# Load the insurance sample dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/insurance', verbose=False)
# Download from the web
# data = get_data(dataset='insurance', verbose=False)
data.head()
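| | age | sex | bmi | children | smoker | region | charges |
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |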
# Create the data pipeline
from pycaret.regression import RegressionExperiment
s = RegressionExperiment()
# Predict the charges column
s.setup(data, target = 'charges', session_id = 0)
# Alternative way to create the data pipeline
# from pycaret.regression import *
# s = setup(data, target = 'charges', session_id = 0)
| | Description | Value |
| 0 | Session id | 0 |
| 1 | Target | charges |
| 2 | Target type | Regression |
| 3 | Original data shape | (1338, 7) |
| 4 | Transformed data shape | (1338, 10) |
| 5 | Transformed train set shape | (936, 10) |
| 6 | Transformed test set shape | (402, 10) |
| 7 | Numeric features | 3 |
| 8 | Categorical features | 3 |
| 9 | Preprocess | True |
| 10 | Imputation type | simple |
| 11 | Numeric imputation | mean |
| 12 | Categorical imputation | mode |
| 13 | Maximum one-hot encoding | 25 |
| 14 | Encoding method | None |
| 15 | Fold Generator | KFold |
| 16 | Fold Number | 10 |
| 17 | CPU Jobs | -1 |
| 18 | Use GPU | False |
| 19 | Log Experiment | False |
| 20 | Experiment Name | reg-default-name |
| 21 | USI | eb9d |
<pycaret.regression.oop.RegressionExperiment at 0x200dedc2d30>
# Evaluate all available models
best = s.compare_models()
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
| gbr | Gradient Boosting Regressor | 2723.2453 | 23787529.5872 | 4832.4785 | 0.8254 | 0.4427 | 0.3140 | 0.0550 |
| lightgbm | Light Gradient Boosting Machine | 2998.1311 | 25738691.2181 | 5012.2404 | 0.8106 | 0.5525 | 0.3709 | 0.1140 |
| rf | Random Forest Regressor | 2915.7018 | 26780127.0016 | 5109.5098 | 0.8031 | 0.4855 | 0.3520 | 0.0670 |
| et | Extra Trees Regressor | 2841.8257 | 28559316.9533 | 5243.5828 | 0.7931 | 0.4671 | 0.3218 | 0.0670 |
| ada | AdaBoost Regressor | 4180.2669 | 28289551.0048 | 5297.6817 | 0.7886 | 0.5935 | 0.6545 | 0.0210 |
| ridge | Ridge Regression | 4304.2640 | 38786967.4768 | 6188.6966 | 0.7152 | 0.5794 | 0.4283 | 0.0230 |
| lar | Least Angle Regression | 4293.9886 | 38781666.5991 | 6188.3301 | 0.7151 | 0.5893 | 0.4263 | 0.0210 |
| llar | Lasso Least Angle Regression | 4294.2135 | 38780221.0039 | 6188.1906 | 0.7151 | 0.5891 | 0.4264 | 0.0200 |
| br | Bayesian Ridge | 4299.8532 | 38785479.0984 | 6188.6026 | 0.7151 | 0.5784 | 0.4274 | 0.0200 |
| lasso | Lasso Regression | 4294.2186 | 38780210.5665 | 6188.1898 | 0.7151 | 0.5892 | 0.4264 | 0.0370 |
| lr | Linear Regression | 4293.9886 | 38781666.5991 | 6188.3301 | 0.7151 | 0.5893 | 0.4263 | 0.0350 |
| dt | Decision Tree Regressor | 3550.6534 | 51149204.9032 | 7095.9170 | 0.6127 | 0.5839 | 0.4537 | 0.0230 |
| huber | Huber Regressor | 3769.3076 | 53638697.2337 | 7254.7108 | 0.6095 | 0.4528 | 0.2187 | 0.0250 |
| par | Passive Aggressive Regressor | 4144.7180 | 62949698.1775 | 7862.7604 | 0.5433 | 0.4634 | 0.2465 | 0.0210 |
| en | Elastic Net | 7248.9376 | 89841235.9517 | 9405.5846 | 0.3534 | 0.7346 | 0.9238 | 0.0210 |
| omp | Orthogonal Matching Pursuit | 8916.1927 | 130904492.3067 | 11356.4120 | 0.0561 | 0.8781 | 1.1598 | 0.0180 |
| knn | K Neighbors Regressor | 8161.8875 | 137982796.8000 | 11676.3735 | -0.0011 | 0.8744 | 0.9742 | 0.0250 |
| dummy | Dummy Regressor | 8892.4478 | 141597492.8000 | 11823.4271 | -0.0221 | 0.9868 | 1.4909 | 0.0210 |
print(best)
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='squared_error',
                          max_depth=3, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=100, n_iter_no_change=None,
                          random_state=0, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

1.3 Clustering

PyCaret provides the clustering module for unsupervised clustering.

Data preparation

# Load the jewellery dataset
from pycaret.datasets import get_data
# The samples will be clustered based on the dataset's features
data = get_data('./datasets/jewellery')
# data = get_data('jewellery')
| | Age | Income | SpendingScore | Savings |
| 0 | 58 | 77769 | 0.791329 | 6559.829923 |
| 1 | 59 | 81799 | 0.791082 | 5417.661426 |
| 2 | 62 | 74751 | 0.702657 | 9258.992965 |
| 3 | 59 | 74373 | 0.765680 | 7346.334504 |
| 4 | 87 | 17760 | 0.348778 | 16869.507130 |
# Create the data pipeline
from pycaret.clustering import ClusteringExperiment
s = ClusteringExperiment()
# normalize: scale the features
s.setup(data, normalize = True, verbose = False)
# Alternative way to create the data pipeline
# from pycaret.clustering import *
# s = setup(data, normalize = True)
<pycaret.clustering.oop.ClusteringExperiment at 0x200dec86340>

Model creation

For clustering tasks, PyCaret provides create_model to build a clustering model with a chosen algorithm, rather than comparing all of them.

kmeans = s.create_model('kmeans')
| | Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness |
| 0 | 0.7581 | 1611.2647 | 0.3743 | 0 | 0 | 0 |

The clustering algorithms supported by create_model can be listed as follows:

s.models()
| ID | Name | Reference |
| kmeans | K-Means Clustering | sklearn.cluster._kmeans.KMeans |
| ap | Affinity Propagation | sklearn.cluster._affinity_propagation.Affinity... |
| meanshift | Mean Shift Clustering | sklearn.cluster._mean_shift.MeanShift |
| sc | Spectral Clustering | sklearn.cluster._spectral.SpectralClustering |
| hclust | Agglomerative Clustering | sklearn.cluster._agglomerative.AgglomerativeCl... |
| dbscan | Density-Based Spatial Clustering | sklearn.cluster._dbscan.DBSCAN |
| optics | OPTICS Clustering | sklearn.cluster._optics.OPTICS |
| birch | Birch Clustering | sklearn.cluster._birch.Birch |
print(kmeans)
# Check the number of clusters
print(kmeans.n_clusters)
KMeans(algorithm='lloyd', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init='auto', random_state=1459, tol=0.0001, verbose=0)
4

Visualization

# Interactive visualization in a Jupyter environment
# s.evaluate_model(kmeans)
# Plot the results
# 'cluster' - Cluster PCA Plot (2d)
# 'tsne' - Cluster t-SNE (3d)
# 'elbow' - Elbow Plot
# 'silhouette' - Silhouette Plot
# 'distance' - Distance Plot
# 'distribution' - Distribution Plot
s.plot_model(kmeans, plot = 'elbow')

[Figure: elbow plot for the k-means model]

Label assignment and prediction

Assign cluster labels to the training data:

result = s.assign_model(kmeans)
result.head()
| | Age | Income | SpendingScore | Savings | Cluster |
| 0 | 58 | 77769 | 0.791329 | 6559.830078 | Cluster 2 |
| 1 | 59 | 81799 | 0.791082 | 5417.661621 | Cluster 2 |
| 2 | 62 | 74751 | 0.702657 | 9258.993164 | Cluster 2 |
| 3 | 59 | 74373 | 0.765680 | 7346.334473 | Cluster 2 |
| 4 | 87 | 17760 | 0.348778 | 16869.507812 | Cluster 1 |

Assign labels to new data:

predictions = s.predict_model(kmeans, data = data)
predictions.head()
| | Age | Income | SpendingScore | Savings | Cluster |
| 0 | -0.042287 | 0.062733 | 1.103593 | -1.072467 | Cluster 2 |
| 1 | -0.000821 | 0.174811 | 1.102641 | -1.303473 | Cluster 2 |
| 2 | 0.123577 | -0.021200 | 0.761727 | -0.526556 | Cluster 2 |
| 3 | -0.000821 | -0.031712 | 1.004705 | -0.913395 | Cluster 2 |
| 4 | 1.160228 | -1.606165 | -0.602619 | 1.012686 | Cluster 1 |
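
For a fitted k-means model, labelling new data amounts to finding the nearest centroid for each (already normalized) point. Here is a minimal sketch of that step; the centroids and points are hypothetical two-dimensional values, and assign_cluster is an illustrative helper, not a PyCaret function.

```python
import math

def assign_cluster(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids and new observations, for illustration only
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
new_points = [(0.5, 0.2), (4.8, 5.1), (1.0, 4.0)]

labels = [f"Cluster {assign_cluster(p, centroids)}" for p in new_points]
print(labels)
```

This also explains why the prediction output above shows transformed feature values: distances are computed in the normalized space the model was trained in.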

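1.4 Anomaly Detection

PyCaret's anomaly detection module is an unsupervised machine learning module for identifying rare items, events, or observations that differ significantly from the majority of the data. Such anomalies usually point to some kind of problem, such as bank fraud, a structural defect, a medical issue, or an error. The anomaly detection module is used much like the clustering module.

Data preparation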

from pycaret.datasets import get_data
data = get_data('./datasets/anomaly')
# data = get_data('anomaly')
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 |
| 0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 |
| 1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 |
| 2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 |
| 3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 |
| 4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 |
from pycaret.anomaly import AnomalyExperiment
s = AnomalyExperiment()
s.setup(data, session_id = 0)
# Alternative way to set up the experiment
# from pycaret.anomaly import *
# s = setup(data, session_id = 0)
| | Description | Value |
| 0 | Session id | 0 |
| 1 | Original data shape | (1000, 10) |
| 2 | Transformed data shape | (1000, 10) |
| 3 | Numeric features | 10 |
| 4 | Preprocess | True |
| 5 | Imputation type | simple |
| 6 | Numeric imputation | mean |
| 7 | Categorical imputation | mode |
| 8 | CPU Jobs | -1 |
| 9 | Use GPU | False |
| 10 | Log Experiment | False |
| 11 | Experiment Name | anomaly-default-name |
| 12 | USI | 54db |
<pycaret.anomaly.oop.AnomalyExperiment at 0x200e14f5250>

Model creation

iforest = s.create_model('iforest')
print(iforest)
IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=0, verbose=0)

The models supported by the anomaly detection module can be listed as follows:

s.models()
| ID | Name | Reference |
| abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
| cluster | Clustering-Based Local Outlier | pycaret.internal.patches.pyod.CBLOFForceToDouble |
| cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
| iforest | Isolation Forest | pyod.models.iforest.IForest |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
| knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
| lof | Local Outlier Factor | pyod.models.lof.LOF |
| svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
| pca | Principal Component Analysis | pyod.models.pca.PCA |
| mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
| sod | Subspace Outlier Detection | pyod.models.sod.SOD |
| sos | Stochastic Outlier Selection | pyod.models.sos.SOS |

Label assignment and prediction

Assign anomaly labels to the training data:

result = s.assign_model(iforest)
result.head()
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
| 0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.016205 |
| 1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.068052 |
| 2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.009221 |
| 3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.056690 |
| 4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.012945 |

Assign labels to new data:

predictions = s.predict_model(iforest, data = data)
predictions.head()
| | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Anomaly | Anomaly_Score |
| 0 | 0.263995 | 0.764929 | 0.138424 | 0.935242 | 0.605867 | 0.518790 | 0.912225 | 0.608234 | 0.723782 | 0.733591 | 0 | -0.016205 |
| 1 | 0.546092 | 0.653975 | 0.065575 | 0.227772 | 0.845269 | 0.837066 | 0.272379 | 0.331679 | 0.429297 | 0.367422 | 0 | -0.068052 |
| 2 | 0.336714 | 0.538842 | 0.192801 | 0.553563 | 0.074515 | 0.332993 | 0.365792 | 0.861309 | 0.899017 | 0.088600 | 1 | 0.009221 |
| 3 | 0.092108 | 0.995017 | 0.014465 | 0.176371 | 0.241530 | 0.514724 | 0.562208 | 0.158963 | 0.073715 | 0.208463 | 1 | 0.056690 |
| 4 | 0.325261 | 0.805968 | 0.957033 | 0.331665 | 0.307923 | 0.355315 | 0.501899 | 0.558449 | 0.885169 | 0.182754 | 0 | -0.012945 |
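
The printed IForest model carries contamination=0.05, the expected share of anomalies. One common way to turn anomaly scores into 0/1 labels is to flag the highest-scoring fraction of samples. The sketch below illustrates that idea with made-up scores and a hypothetical helper label_anomalies; the exact thresholding used inside PyOD may differ in detail.

```python
def label_anomalies(scores, contamination=0.05):
    """Flag the highest-scoring fraction of samples as anomalies (1); the rest are normal (0)."""
    n_outliers = max(1, round(len(scores) * contamination))
    # The n-th largest score acts as the decision threshold
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical anomaly scores for 20 samples (higher means more anomalous)
scores = [i / 100 for i in range(20)]

labels_5pct = label_anomalies(scores, contamination=0.05)
labels_10pct = label_anomalies(scores, contamination=0.10)
print(sum(labels_5pct), sum(labels_10pct))
```

Raising contamination flags more samples, so the value should reflect how rare anomalies are expected to be in the data.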

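1.5 Time Series Forecasting

PyCaret's time series module supports a variety of forecasting methods, such as ARIMA, exponential smoothing, and Prophet. It also provides functionality for handling missing values, time series decomposition, and data visualization.

Data preparation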

# Airline passenger time series
from pycaret.datasets import get_data
# Download URL: https://raw.githubusercontent.com/sktime/sktime/main/sktime/datasets/data/Airline/Airline.csv
data = get_data('./datasets/airline')
# data = get_data('airline')
| | Date | Passengers |
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |
import pandas as pd
data['Date'] = pd.to_datetime(data['Date'])
# Set Date as the index
data.set_index('Date', inplace=True)
from pycaret.time_series import TSForecastingExperiment
s = TSForecastingExperiment()
# fh: forecast horizon (defaults to 1, i.e. one step ahead); fold: number of cross-validation folds
s.setup(data, fh = 3, fold = 5, session_id = 0, verbose = False)
# from pycaret.time_series import *
# s = setup(data, fh = 3, fold = 5, session_id = 0)
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x200dee26910>
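
To see how fh and fold interact, the sketch below lays out expanding-window cross-validation splits: each fold validates on fh consecutive points, and the training window grows from fold to fold. The series length of 141 (an assumed 144 monthly observations minus the last fh = 3 held out as the test set) and the helper expanding_window_splits are assumptions for illustration; the splitter PyCaret actually uses may place the windows differently.

```python
def expanding_window_splits(n_obs, fh, n_folds):
    """Return (train_size, validation_indices) per fold, oldest fold first."""
    splits = []
    for k in range(n_folds, 0, -1):
        val_start = n_obs - k * fh
        # Everything before val_start is training data; the next fh points are validated
        splits.append((val_start, list(range(val_start, val_start + fh))))
    return splits

# Assumed training length: 144 observations minus a 3-point test set
splits = expanding_window_splits(n_obs=141, fh=3, n_folds=5)
for train_size, val_idx in splits:
    print(f"train on first {train_size} points, validate on {val_idx}")
```

Unlike ordinary k-fold, the validation points always come after the training points, which avoids leaking future values into the model.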

Model training and evaluation

best = s.compare_models()
| | Model | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 | TT (Sec) |
| stlf | STLF | 0.4240 | 0.4429 | 12.8002 | 15.1933 | 0.0266 | 0.0268 | 0.4296 | 0.0300 |
| exp_smooth | Exponential Smoothing | 0.5063 | 0.5378 | 15.2900 | 18.4455 | 0.0334 | 0.0335 | -0.0521 | 0.0500 |
| ets | ETS | 0.5520 | 0.5801 | 16.6164 | 19.8391 | 0.0354 | 0.0357 | -0.0740 | 0.0680 |
| arima | ARIMA | 0.6480 | 0.6501 | 19.5728 | 22.3027 | 0.0412 | 0.0420 | -0.0796 | 0.0420 |
| auto_arima | Auto ARIMA | 0.6526 | 0.6300 | 19.7405 | 21.6202 | 0.0414 | 0.0421 | -0.0560 | 10.5220 |
| theta | Theta Forecaster | 0.8458 | 0.8223 | 25.7024 | 28.3332 | 0.0524 | 0.0541 | -0.7710 | 0.0220 |
| huber_cds_dt | Huber w/ Cond. Deseasonalize & Detrending | 0.9002 | 0.8900 | 27.2568 | 30.5782 | 0.0550 | 0.0572 | -0.0309 | 0.0680 |
| knn_cds_dt | K Neighbors w/ Cond. Deseasonalize & Detrending | 0.9381 | 0.8830 | 28.5678 | 30.5007 | 0.0555 | 0.0575 | 0.0908 | 0.0920 |
| lr_cds_dt | Linear w/ Cond. Deseasonalize & Detrending | 0.9469 | 0.9297 | 28.6337 | 31.9163 | 0.0581 | 0.0605 | -0.1620 | 0.0820 |
| ridge_cds_dt | Ridge w/ Cond. Deseasonalize & Detrending | 0.9469 | 0.9297 | 28.6340 | 31.9164 | 0.0581 | 0.0605 | -0.1620 | 0.0680 |
| en_cds_dt | Elastic Net w/ Cond. Deseasonalize & Detrending | 0.9499 | 0.9320 | 28.7271 | 31.9952 | 0.0582 | 0.0606 | -0.1579 | 0.0700 |
| llar_cds_dt | Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending | 0.9520 | 0.9336 | 28.7917 | 32.0528 | 0.0583 | 0.0607 | -0.1559 | 0.0560 |
| lasso_cds_dt | Lasso w/ Cond. Deseasonalize & Detrending | 0.9521 | 0.9337 | 28.7941 | 32.0557 | 0.0583 | 0.0607 | -0.1560 | 0.0720 |
| br_cds_dt | Bayesian Ridge w/ Cond. Deseasonalize & Detrending | 0.9551 | 0.9347 | 28.9018 | 32.1013 | 0.0582 | 0.0606 | -0.1377 | 0.0580 |
| et_cds_dt | Extra Trees w/ Cond. Deseasonalize & Detrending | 1.0322 | 0.9942 | 31.4048 | 34.3054 | 0.0607 | 0.0633 | -0.1660 | 0.1280 |
| rf_cds_dt | Random Forest w/ Cond. Deseasonalize & Detrending | 1.0851 | 1.0286 | 32.9791 | 35.4666 | 0.0641 | 0.0670 | -0.3545 | 0.1400 |
| lightgbm_cds_dt | Light Gradient Boosting w/ Cond. Deseasonalize & Detrending | 1.1409 | 1.1040 | 34.5999 | 37.9918 | 0.0670 | 0.0701 | -0.3994 | 0.0900 |
| ada_cds_dt | AdaBoost w/ Cond. Deseasonalize & Detrending | 1.1441 | 1.0843 | 34.7451 | 37.3681 | 0.0664 | 0.0697 | -0.3004 | 0.0920 |
| gbr_cds_dt | Gradient Boosting w/ Cond. Deseasonalize & Detrending | 1.1697 | 1.1094 | 35.4408 | 38.1373 | 0.0697 | 0.0729 | -0.4163 | 0.0900 |
| omp_cds_dt | Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending | 1.1793 | 1.1250 | 35.7348 | 38.6755 | 0.0706 | 0.0732 | -0.5095 | 0.0620 |
| dt_cds_dt | Decision Tree w/ Cond. Deseasonalize & Detrending | 1.2704 | 1.2371 | 38.4976 | 42.4846 | 0.0773 | 0.0814 | -1.0382 | 0.0860 |
| snaive | Seasonal Naive Forecaster | 1.7700 | 1.5999 | 53.5333 | 54.9143 | 0.1136 | 0.1211 | -4.1630 | 0.1580 |
| naive | Naive Forecaster | 1.8145 | 1.7444 | 54.8667 | 59.8160 | 0.1135 | 0.1151 | -3.7710 | 0.1460 |
| polytrend | Polynomial Trend Forecaster | 2.3154 | 2.2507 | 70.1138 | 77.3400 | 0.1363 | 0.1468 | -4.6202 | 0.1080 |
| croston | Croston | 2.6211 | 2.4985 | 79.3645 | 85.8439 | 0.1515 | 0.1684 | -5.2294 | 0.0140 |
| grand_means | Grand Means Forecaster | 7.1261 | 6.3506 | 216.0214 | 218.4259 | 0.4377 | 0.5682 | -59.2684 | 0.1400 |
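
The table is sorted by MASE (mean absolute scaled error), which divides the forecast's MAE by the in-sample MAE of a naive forecast. A value below 1 means the model beats that naive baseline. The sketch below computes it for the first five Passengers values with hypothetical hold-out values and predictions; it uses a lag of sp = 1, whereas forecasting libraries often use the seasonal period instead.

```python
def mase(y_train, y_true, y_pred, sp=1):
    """Mean absolute scaled error: forecast MAE divided by in-sample naive MAE at lag `sp`."""
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale

y_train = [112, 118, 132, 129, 121]  # first five Passengers values from data.head()
y_true = [135, 148]                  # hypothetical hold-out values
y_pred = [130, 140]                  # hypothetical forecasts

score = mase(y_train, y_true, y_pred)
print(round(score, 4))
```

Because the error is scaled by the series' own naive error, MASE can be compared across series with very different magnitudes.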

Visualization

# Interactive visualization in a Jupyter environment
# Supported values of the plot argument:
# - 'ts' - Time Series Plot
# - 'train_test_split' - Train Test Split
# - 'cv' - Cross Validation
# - 'acf' - Auto Correlation (ACF)
# - 'pacf' - Partial Auto Correlation (PACF)
# - 'decomp' - Classical Decomposition
# - 'decomp_stl' - STL Decomposition
# - 'diagnostics' - Diagnostics Plot
# - 'diff' - Difference Plot
# - 'periodogram' - Frequency Components (Periodogram)
# - 'fft' - Frequency Components (FFT)
# - 'ccf' - Cross Correlation (CCF)
# - 'forecast' - "Out-of-Sample" Forecast Plot
# - 'insample' - "In-Sample" Forecast Plot
# - 'residuals' - Residuals Plot
# s.plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 24})

Forecasting

# Refit the model on the full dataset, including the test samples
final_best = s.finalize_model(best)
s.predict_model(best, fh = 24)
s.save_model(final_best, 'final_best_model')
Transformation Pipeline and Model Successfully Saved
(ForecastingPipeline(steps=[('forecaster',
                             TransformedTargetForecaster(steps=[('model',
                                                                 ForecastingPipeline(steps=[('forecaster',
                                                                                             TransformedTargetForecaster(steps=[('model',
                                                                                                                                 STLForecaster(sp=12))]))]))]))]),
 'final_best_model.pkl')

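2 Data Processing and Cleaning

2.1 Handling Missing Values

Datasets can contain missing values or empty records for many reasons. Dropping samples with missing values is a common strategy, but it may discard valuable data. An alternative is to impute the missing values. The following setup parameters control how missing values are handled: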

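  • imputation_type: 'simple', 'iterative', or None. With 'simple', PyCaret fills missing values using simple strategies (numeric_imputation and categorical_imputation). With 'iterative', it fills them using model-based estimation (numeric_iterative_imputer and categorical_iterative_imputer). With None, no imputation is performed
  • numeric_imputation: strategy for missing numeric values:
    • mean: fill with the column mean (default)
    • drop: remove rows that contain missing values
    • median: fill with the column median
    • mode: fill with the column's most frequent value
    • knn: estimate using K-nearest neighbors
    • int or float: fill with the given value
  • categorical_imputation: strategy for missing categorical values:
    • mode: fill with the column's most frequent value (default)
    • drop: remove rows that contain missing values
    • str: fill with the given string
  • numeric_iterative_imputer: estimator used to impute numeric values; accepts a str or an sklearn model, defaults to lightgbm
  • categorical_iterative_imputer: estimator used to impute categorical values; accepts a str or an sklearn model, defaults to lightgbm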
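
The simple strategies in the list above can be sketched in a few lines of plain Python. The helper impute below is illustrative only (it is not a PyCaret function) and works on a single column, with None standing in for NaN; the sample values are the STEROID entries of the first five rows of the hepatitis dataset used in this section.

```python
from statistics import mean, median, mode

def impute(column, strategy="mean"):
    """Fill None entries of one column using a simple strategy."""
    observed = [x for x in column if x is not None]
    if strategy == "drop":
        # For a single column, dropping rows just removes the missing entries
        return observed
    if isinstance(strategy, (int, float)):
        fill = strategy  # constant fill value
    else:
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if x is None else x for x in column]

# STEROID values of the first five hepatitis rows; None marks the NaN
steroid = [1.0, 1.0, 2.0, None, 2.0]

print(impute(steroid, "mean"))
print(impute(steroid, "mode"))
print(impute(steroid, "drop"))
print(impute(steroid, 0))
```

In an actual pipeline the fill statistic is typically computed from the training data as a whole rather than from five rows, so the imputed value will differ from the 1.5 produced here.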

Load the data

# load dataset
from pycaret.datasets import get_data
# Load from a local file; note that dataset is the file name of the data
data = get_data(dataset='./datasets/hepatitis', verbose=False)
# data = get_data('hepatitis',verbose=False)
# The row with index 3 has a NaN in the STEROID column
data.head()
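| | Class | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY |
| 0 | 0 | 30 | 2 | 1.0 | 2 | 2 | 2 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.0 | 18.0 | 4.0 | NaN | 1 |
| 1 | 0 | 50 | 1 | 1.0 | 2 | 1 | 2 | 2 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.0 | 42.0 | 3.5 | NaN | 1 |
| 2 | 0 | 78 | 1 | 2.0 | 2 | 1 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.0 | 32.0 | 4.0 | NaN | 1 |
| 3 | 0 | 31 | 1 | NaN | 1 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.0 | 52.0 | 4.0 | 80.0 | 1 |
| 4 | 0 | 34 | 1 | 2.0 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 200.0 | 4.0 | NaN | 1 |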
# Fill missing values with the mean
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
# Mean of the column
# s.data['STEROID'].mean()
s.setup(data = data, session_id=0, target = 'Class',verbose=False, 
        # Set data_split_shuffle and data_split_stratify to False so the rows are not shuffled
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation='mean')
# View the transformed data
s.get_config('dataset_transformed').head()
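| | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
| 0 | 30.0 | 2.0 | 1.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 66.53968 | 1.0 | 0 |
| 1 | 50.0 | 1.0 | 1.000000 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 66.53968 | 1.0 | 0 |
| 2 | 78.0 | 1.0 | 2.000000 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 66.53968 | 1.0 | 0 |
| 3 | 31.0 | 1.0 | 1.509434 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.00000 | 1.0 | 0 |
| 4 | 34.0 | 1.0 | 2.000000 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 99.659088 | 200.0 | 4.0 | 66.53968 | 1.0 | 0 |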
# Fill missing values using KNN
from pycaret.classification import ClassificationExperiment
s = ClassificationExperiment()
s.setup(data = data, session_id=0, target = 'Class',verbose=False, 
        # Set data_split_shuffle and data_split_stratify to False so the rows are not shuffled
        data_split_shuffle = False, data_split_stratify = False,
        imputation_type='simple', numeric_imputation = 'knn')
# View the transformed data
s.get_config('dataset_transformed').head()
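| | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
| 0 | 30.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 91.800003 | 1.0 | 0 |
| 1 | 50.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 61.599998 | 1.0 | 0 |
| 2 | 78.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 75.800003 | 1.0 | 0 |
| 3 | 31.0 | 1.0 | 1.8 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.000000 | 1.0 | 0 |
| 4 | 34.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 108.400002 | 200.0 | 4.0 | 62.799999 | 1.0 | 0 |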
# Fill missing values using lightgbm
# from pycaret.classification import ClassificationExperiment
# s = ClassificationExperiment()
# s.setup(data = data, session_id=0, target = 'Class',verbose=False, 
#         # Set data_split_shuffle and data_split_stratify to False so the rows are not shuffled
#         data_split_shuffle = False, data_split_stratify = False,
#         imputation_type='iterative', numeric_iterative_imputer = 'lightgbm')
# View the transformed data
# s.get_config('dataset_transformed').head()

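2.2 Type Conversion

Although PyCaret infers feature types automatically, it also exposes parameters for specifying data types explicitly. These give finer control over the dataset, so that feature engineering and model training behave as intended. The parameters are:

  • numeric_features: columns to treat as numeric (continuous) features
  • categorical_features: columns to treat as categorical (discrete) features
  • date_features: columns to treat as date features
  • create_date_columns: controls the creation of new date-related columns from the date features
  • text_features: columns to treat as text features
  • text_features_method: the method used to process text features
  • ignore_features: columns to ignore during modeling
  • keep_features: columns to keep during modeling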
# Convert variable types
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/hepatitis', verbose=False)

from pycaret.classification import *
s = setup(data = data, target = 'Class', ignore_features  = ['SEX','AGE'], categorical_features=['STEROID'],verbose = False,
         data_split_shuffle = False, data_split_stratify = False)
# View the transformed data: the AGE and SEX columns are gone and STEROID is treated as categorical
s.get_config('dataset_transformed').head()
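| | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER BIG | LIVER FIRM | SPLEEN PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
| 0 | 0.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 85.000000 | 18.0 | 4.0 | 66.53968 | 1.0 | 0 |
| 1 | 0.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.9 | 135.000000 | 42.0 | 3.5 | 66.53968 | 1.0 | 0 |
| 2 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 96.000000 | 32.0 | 4.0 | 66.53968 | 1.0 | 0 |
| 3 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.7 | 46.000000 | 52.0 | 4.0 | 80.00000 | 1.0 | 0 |
| 4 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 99.659088 | 200.0 | 4.0 | 66.53968 | 1.0 | 0 |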

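2.3 One-Hot Encoding

Categorical variables in a dataset usually need to be converted into a numeric form that models can work with. One-hot encoding is a common approach: each categorical variable is turned into a set of binary variables, one per category value, of which exactly one is 1 for any given sample and the rest are 0. The columns to one-hot encode can be specified via the categorical_features parameter. For example: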

# load dataset
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/pokemon', verbose=False)
# data = get_data('pokemon')
data.head()
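| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |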
# Number of distinct values in Type 1, the column to be one-hot encoded
len(set(data['Type 1']))
18
from pycaret.classification import *
s = setup(data = data, categorical_features =["Type 1"],target = 'Legendary', verbose=False)
# View the transformed data: Type 1 is now one-hot encoded
s.get_config('dataset_transformed').head()
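| | # | Name | Type 1_Grass | Type 1_Ghost | Type 1_Water | Type 1_Steel | Type 1_Psychic | Type 1_Fire | Type 1_Poison | Type 1_Fairy | ... | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
| 202 | 187.0 | Hoppip | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Flying | 250.0 | 35.0 | 35.0 | 40.0 | 35.0 | 55.0 | 50.0 | 2.0 | False |
| 477 | 429.0 | Mismagius | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | 495.0 | 60.0 | 60.0 | 60.0 | 105.0 | 105.0 | 105.0 | 4.0 | False |
| 349 | 319.0 | SharpedoMega Sharpedo | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Dark | 560.0 | 70.0 | 140.0 | 70.0 | 110.0 | 65.0 | 105.0 | 3.0 | False |
| 777 | 707.0 | Klefki | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Fairy | 470.0 | 57.0 | 80.0 | 91.0 | 80.0 | 87.0 | 75.0 | 6.0 | False |
| 50 | 45.0 | Vileplume | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Poison | 490.0 | 75.0 | 80.0 | 85.0 | 110.0 | 90.0 | 50.0 | 1.0 | False |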

5 rows × 30 columns
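
The transformation itself is simple enough to sketch by hand. The following plain-Python example one-hot encodes the Type 1 values of the first five Pokemon rows; the helper one_hot is illustrative and not part of PyCaret, which delegates the encoding to its preprocessing pipeline.

```python
def one_hot(values):
    """Return (categories, rows), where each row has a single 1.0 marking its category."""
    categories = list(dict.fromkeys(values))  # unique values in first-seen order
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

# Type 1 of the first five rows shown by data.head()
type1 = ["Grass", "Grass", "Grass", "Grass", "Fire"]
columns, encoded = one_hot(type1)

print([f"Type 1_{c}" for c in columns])
for row in encoded:
    print(row)
```

With all 18 distinct Type 1 values, the same idea yields 18 indicator columns, which accounts for the growth to 30 columns in the transformed data.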

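2.4 Class Balancing

In PyCaret, fix_imbalance and fix_imbalance_method are the two parameters for handling imbalanced datasets. They preprocess the data before model training to address class imbalance.

  • fix_imbalance: a boolean indicating whether to rebalance the dataset. When True, PyCaret applies a resampling method to address the class imbalance. When False, the model is trained on the original, imbalanced data
  • fix_imbalance_method: the method used to rebalance the dataset. Options include:
    • smote (default): use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples and balance the classes
    • a resampling estimator provided by imbalanced-learn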
# Load the data
from pycaret.datasets import get_data
data = get_data(dataset='./datasets/credit', verbose=False)
# data = get_data('credit')
data.head()
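|  | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 2 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 3 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
| 4 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 19394.0 | 19619.0 | 20024.0 | 2500.0 | 1815.0 | 657.0 | 1000.0 | 1000.0 | 800.0 | 0 |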

5 rows × 24 columns

# Count the rows in each class
category_counts = data['default'].value_counts()
category_counts
default
0    18694
1     5306
Name: count, dtype: int64
from pycaret.classification import *
s = setup(data = data, target = 'default', fix_imbalance = True, verbose = False)
# Class 1 now has more rows than before
s.get_config('dataset_transformed')['default'].value_counts()
default
0    18694
1    14678
Name: count, dtype: int64
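
The idea behind SMOTE can be sketched in a few lines of NumPy: a synthetic minority row is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration on synthetic data, not the imbalanced-learn implementation that PyCaret relies on; the function name smote_like_oversample and all values below are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances inside the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)          # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                             # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))   # 90 majority rows (synthetic)
X_minor = rng.normal(3.0, 1.0, size=(10, 2))   # 10 minority rows (synthetic)

synthetic = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, synthetic])

print(len(X_major), len(X_minor_balanced))
```

Because every synthetic row is an interpolation of two existing minority rows, the new points stay inside the region already occupied by the minority class.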

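2.5 Outlier Handling

Setting remove_outliers=True in setup identifies and removes outliers from the training data before model training. The share of rows treated as outliers is controlled by the outliers_threshold parameter of setup (default 0.05). In PyCaret 3.x the detection algorithm is selected with outliers_method, which defaults to iforest (Isolation Forest); the PCA-based linear dimensionality reduction via singular value decomposition belongs to the PyCaret 2.x implementation.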

from pycaret.datasets import get_data

data = get_data(dataset='./datasets/insurance', verbose=False)
# data = get_data('insurance')
# Shape of the data
data.shape
(1338, 7)
from pycaret.regression import *
s = setup(data = data, target = 'charges', remove_outliers = True, verbose = False, outliers_threshold = 0.02)
# After removing outliers, there are fewer rows
s.get_config('dataset_transformed').shape
(1319, 10)
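
The two shapes above differ by 1338 - 1319 = 19 rows. That is about 2% of 936, the size of a 70% training split of 1338 rows, which is consistent with outliers being dropped from the training portion only, assuming the default train_size of 0.7 was in effect. The sketch below illustrates threshold-based removal with scikit-learn's IsolationForest on synthetic data. It is an approximation of the idea, not PyCaret's internal pipeline, and the array X_train is a hypothetical stand-in.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(936, 3))     # synthetic stand-in for a training split
threshold = 0.02                        # plays the role of outliers_threshold

# contamination sets the share of rows the forest should flag as outliers.
forest = IsolationForest(contamination=threshold, random_state=0)
labels = forest.fit_predict(X_train)    # -1 = outlier, 1 = inlier

X_clean = X_train[labels == 1]
removed = X_train.shape[0] - X_clean.shape[0]
print(removed, "rows flagged out of", X_train.shape[0])
```

The number of flagged rows should land close to threshold times the number of training rows, which is the same proportion seen in the PyCaret output.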

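2.6 Feature Importance

Feature importance is used here to select the features in a dataset that contribute most to predicting the target variable. Compared with using every feature, training on a selected subset can reduce the risk of overfitting, improve accuracy, and shorten training time. In PyCaret this is enabled through the feature_selection parameter. The feature-selection parameters of setup are:

  • feature_selection: whether to perform feature selection during preprocessing. True or False.
  • feature_selection_method: the selection algorithm:
    • 'univariate': uses sklearn's SelectKBest, which picks the features most related to the target based on univariate statistical tests.
    • 'classic' (default): uses sklearn's SelectFromModel, which ranks features by the importances or coefficients of a supervised estimator.
    • 'sequential': uses sklearn's SequentialFeatureSelector, which adds or removes features step by step (forward or backward selection) based on a cross-validated score.
  • n_features_to_select: the maximum number of features to keep, or a fraction. A value below 1 is interpreted as a fraction of the starting features. The default is 0.2. Features listed in ignore_features and keep_features are not counted.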
from pycaret.datasets import get_data
data = get_data('./datasets/diabetes')
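|  | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |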
from pycaret.regression import *
# feature_selection enables feature selection; n_features_to_select sets the fraction of features to keep
s = setup(data = data, target = 'Class variable', feature_selection = True, feature_selection_method = 'univariate',
          n_features_to_select = 0.3, verbose = False)
# See which features were kept
s.get_config('X_transformed').columns
s.get_config('X_transformed').head()
|  | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Body mass index (weight in kg/(height in m)^2) |
|---|---|---|
| 56 | 187.0 | 37.700001 |
| 541 | 128.0 | 32.400002 |
| 269 | 146.0 | 27.500000 |
| 304 | 150.0 | 21.000000 |
| 32 | 88.0 | 24.799999 |
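
With 8 input features and n_features_to_select = 0.3, two features survive, which matches 0.3 × 8 = 2.4 truncated to 2. The following sketch shows univariate selection directly with sklearn's SelectKBest on synthetic data. The f_classif score function and the truncation rule are assumptions made for illustration; PyCaret may choose the score function and rounding differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Synthetic features: f1 and f5 depend on the target, the other six are noise.
X = pd.DataFrame(rng.normal(size=(n, 8)), columns=[f"f{i}" for i in range(8)])
X["f1"] += 2.0 * y
X["f5"] += 1.5 * y

fraction = 0.3
k = int(fraction * X.shape[1])        # fraction of 8 features, truncated

# Keep the k features with the highest univariate F-score against y.
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
kept = X.columns[selector.get_support()].tolist()
print(k, kept)
```

Only the two features that actually carry signal about y are retained; the noise columns are dropped.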

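2.7 Normalization

Data normalization

In PyCaret, the normalize and normalize_method parameters control feature scaling during preprocessing. Feature scaling rescales feature values proportionally into a small, specific range, which removes the effect of differing units and magnitudes across features and often makes training more stable and accurate. The two parameters are:

  • normalize: a boolean that specifies whether features are scaled. The default is False (no scaling); set it to True to enable scaling.
  • normalize_method: a string that selects the scaling method. Options:
    • zscore (default): z-score standardization. Each value has the feature mean subtracted and is divided by the feature's standard deviation, so the scaled feature has mean 0 and standard deviation 1.
    • minmax: min-max scaling. Values are mapped linearly onto the range between a given minimum and maximum, [0, 1] by default.
    • maxabs: MaxAbs scaling. Values are divided by the feature's maximum absolute value, so they fall within [-1, 1].
    • robust: RobustScaler-style scaling. Each feature is centered on its median and scaled by its interquartile range, which makes it less sensitive to outliers.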
from pycaret.datasets import get_data
data = get_data('./datasets/pokemon')
data.head()
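|  | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |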
# Normalize
from pycaret.classification import *
s = setup(data, target='Legendary', normalize=True, normalize_method='robust', verbose=False)

Normalized result:

s.get_config('X_transformed').head()
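|  | # | Name | Type 1_Water | Type 1_Normal | Type 1_Ice | Type 1_Psychic | Type 1_Fire | Type 1_Rock | Type 1_Fighting | Type 1_Grass | ... | Type 2_Electric | Type 2_Normal | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 403 | -0.021629 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.195387 | -0.333333 | 0.200000 | 0.875 | 1.088889 | 0.125 | -0.288889 | 0.000000 |
| 471 | 0.139870 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.179104 | 0.333333 | 0.555556 | -0.100 | -0.111111 | -0.100 | 1.111111 | 0.333333 |
| 238 | -0.448450 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -1.080054 | -0.500000 | -0.555556 | -0.750 | -0.777778 | -1.000 | -0.333333 | -0.333333 |
| 646 | 0.604182 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.618725 | -0.166667 | -0.333333 | -0.500 | -0.555556 | -0.500 | 0.222222 | 0.666667 |
| 69 | -0.898342 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.265943 | -0.833333 | -0.888889 | -1.000 | 1.222222 | 0.000 | 0.888889 | -0.666667 |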

5 rows × 46 columns
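
The four normalize_method options correspond to simple formulas. The sketch below applies each of them to a single toy feature with NumPy so the differences are easy to compare. The sample values are made up, and the formulas follow the standard definitions used by the matching scikit-learn scalers rather than code taken from PyCaret.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # one feature with an extreme value

zscore = (x - x.mean()) / x.std()                  # mean 0, standard deviation 1
minmax = (x - x.min()) / (x.max() - x.min())       # mapped onto [0, 1]
maxabs = x / np.abs(x).max()                       # mapped into [-1, 1]
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)                  # centered on median, scaled by IQR

for name, values in [("zscore", zscore), ("minmax", minmax),
                     ("maxabs", maxabs), ("robust", robust)]:
    print(f"{name:7s}", np.round(values, 3))
```

With minmax and maxabs the extreme value squeezes the remaining points close to 0, while robust keeps them evenly spaced around 0 because the median and interquartile range are barely affected by a single extreme value.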

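Feature transformation

Normalization rescales data into a new range to reduce the influence of magnitude on variance. Feature transformation is a more radical technique: it changes the shape of the distribution so that the transformed data is normal or approximately normal. In PyCaret, the transformation parameter enables feature transformation and transformation_method selects the method: yeo-johnson (default) or quantile. Besides feature transformation there is also target transformation, which changes the shape of the target variable's distribution rather than that of the features. This is only available in the pycaret.regression module: transform_target enables it and transform_target_method selects the method.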

from pycaret.classification import *
s = setup(data = data, target = 'Legendary', transformation = True, verbose = False)
# Result of the feature transformation
s.get_config('X_transformed').head()
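|  | # | Name | Type 1_Psychic | Type 1_Water | Type 1_Rock | Type 1_Grass | Type 1_Dragon | Type 1_Ghost | Type 1_Bug | Type 1_Fairy | ... | Type 2_Electric | Type 2_Bug | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 165 | 52.899003 | 0.009216 | 0.043322 | -0.000000 | -0.000000 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 93.118403 | 12.336844 | 23.649090 | 13.573010 | 10.692443 | 8.081703 | 26.134255 | 0.900773 |
| 625 | 140.730289 | 0.009216 | -0.000000 | 0.095739 | -0.000000 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 66.091344 | 9.286671 | 20.259153 | 13.764668 | 8.160482 | 6.056644 | 9.552506 | 3.679456 |
| 628 | 141.283084 | 0.009216 | -0.000000 | -0.000000 | 0.043322 | -0.000000 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 89.747939 | 10.823299 | 29.105379 | 11.029571 | 11.203335 | 6.942091 | 27.793080 | 3.679456 |
| 606 | 137.396878 | 0.009216 | -0.000000 | -0.000000 | -0.000000 | 0.061897 | -0.000000 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 56.560577 | 8.043018 | 10.276208 | 10.604937 | 6.949265 | 6.302465 | 19.943809 | 3.679456 |
| 672 | 149.303914 | 0.009216 | -0.000000 | -0.000000 | -0.000000 | -0.000000 | 0.029706 | -0.0 | -0.0 | -0.0 | ... | -0.0 | -0.0 | 72.626190 | 10.202245 | 26.061259 | 11.435493 | 7.199607 | 6.302465 | 20.141156 | 3.679456 |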

5 rows × 46 columns
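
To see what a Yeo-Johnson transformation does to a distribution's shape, the sketch below applies scikit-learn's PowerTransformer to a synthetic right-skewed feature and compares the skewness before and after. The data and the helper skewness are illustrative. standardize=False is an assumption, chosen because the transformed values shown above are not centered; PyCaret's exact configuration may differ.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(1000, 1))    # right-skewed toy feature

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / centered.std() ** 3)

# standardize=False applies only the power transform, with no z-scoring afterwards.
pt = PowerTransformer(method="yeo-johnson", standardize=False)
transformed = pt.fit_transform(skewed)

print("skewness before:", round(skewness(skewed), 3))
print("skewness after: ", round(skewness(transformed), 3))
```

A skewness close to 0 after the transform indicates a more symmetric, closer-to-normal shape, which is the purpose of the transformation parameter.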

3 References
