$ pip install -U "mxnet<2.0.0" -i https://pypi.tuna.tsinghua.edu.cn/simple # CPU version
$ pip install -U "mxnet_cu101" -i https://pypi.tuna.tsinghua.edu.cn/simple # GPU version
$ pip install autogluon -i https://pypi.tuna.tsinghua.edu.cn/simple
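After installing, it can be worth confirming which versions actually landed in the environment. The following is a minimal sketch using only the Python standard library (Python 3.8+). The helper name `installed_version` is hypothetical, and the package names are simply the ones used in the pip commands above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip commands above; only one of the two
# mxnet variants is expected to be present in a given environment.
for name in ("mxnet", "mxnet_cu101", "autogluon"):
    print(name, installed_version(name))
```

A value of `None` means pip did not install that distribution into the interpreter running the script, which usually points to a different virtual environment.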
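For standard datasets represented as tables (stored as CSV files, data from a database, etc.),
AutoGluon can produce models that predict the value in one column from the values in the other columns.
This lets you reach high accuracy on standard supervised learning tasks (classification and regression) without dealing with tedious issues such as data cleaning, feature engineering, hyperparameter optimization, and model selection.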
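To make "predict one column from the other columns" concrete, here is a minimal standard-library sketch of separating a label column from the feature columns. It is only an illustration of the data layout, not AutoGluon code: the helper `split_label` is hypothetical, and the two-row sample is a small illustrative excerpt with column names borrowed from the income dataset used in this post.

```python
import csv
import io


def split_label(rows, label):
    """Separate each row into its feature columns and its label value."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


# Illustrative sample: four columns, with "class" as the column to predict.
sample_csv = """age,education,hours-per-week,class
25,Bachelors,40,<=50K
55,HS-grad,50,>50K
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
features, labels = split_label(rows, label="class")
print(features[0])
print(labels)
```

The model is trained on rows that include the label column and is later asked to fill in that column for rows where it has been removed.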
import time

from autogluon.tabular import TabularDataset, TabularPredictor


class DataFrameDataset:
    label = "class"  # label column of the tabular dataset

    def train_data(self):
        """
        Load the training dataset
        : task: predict whether a person's income exceeds 50K USD
        : the dataset is returned as a DataFrame
        :return:
        """
        # train_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
        train_data_ = TabularDataset('./train.csv')
        print("Train Data:\n", train_data_.head())
        return train_data_

    def test_data(self):
        """
        Load the test dataset
        :return:
        """
        # test_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
        test_data_ = TabularDataset('./test.csv')
        print("Test Data:\n", test_data_.head())
        y_test_ = test_data_[self.label]
        test_data_no_label_ = test_data_.drop(columns=[self.label])
        return test_data_, test_data_no_label_, y_test_

    def user_data(self):
        """The user's own test data (here: the first row of the test set)"""
        # test_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')[0]
        test_data_ = TabularDataset('./test.csv').head(1)
        y_test_ = test_data_[self.label]
        test_data_no_label_ = test_data_.drop(columns=[self.label])
        return test_data_no_label_, y_test_


class Model:
    dataset = DataFrameDataset()
    model_path = "table_predictor"

    def __init__(self):
        self.predictor: TabularPredictor = None

    def train(self, eval_metric="roc_auc", presets="medium_quality_faster_train", time_limit=60, holdout_frac=0.1):
        """
        Commonly tuned parameters
        :param eval_metric: evaluation metric
            f1: for binary classification
            roc_auc: for binary classification
            log_loss: for classification
            mean_absolute_error: for regression
            median_absolute_error: for regression
        :param presets: training presets
            best_quality: trade training time for a high-accuracy model
            medium_quality_faster_train: trade quality for a quickly trained model
            good_quality_faster_inference_only_refit: a reasonably good model with relatively fast inference
            optimize_for_deployment:
        :param time_limit: training time limit, in seconds
        :param holdout_frac: fraction of the training set to split off as the validation set
        :param hyperparameters: users can define a search space; see the SDK documentation for details,
            e.g. you can set the number of epochs (num_epochs) for a model
        :return:
        """
        print("Start training ..........")
        self.predictor = TabularPredictor(label=self.dataset.label, path=self.model_path, eval_metric=eval_metric)
        train_data = self.dataset.train_data()
        self.predictor.fit(train_data, time_limit=time_limit,
                           excluded_model_types=['KNN', 'NN', 'custom'], presets=presets, holdout_frac=holdout_frac)

        print("Evaluating model ..........")
        test_data, test_data_no_label, y_test = self.dataset.test_data()
        y_pred = self.predictor.predict_proba(test_data_no_label)
        evaluate = self.predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
        board = self.predictor.leaderboard(test_data, silent=True)
        print("Evaluation results:\n", evaluate)
        print("Model performance on the test set:\n", board)
        print("Feature importance:\n", self.predictor.feature_importance(data=train_data))
        print("Model info:\n", self.predictor.info())
        # Keep only the best model; all other models are deleted
        self.predictor.delete_models(models_to_keep='best', dry_run=False)

    def predict(self):
        """
        Run prediction
        get_model_best(): get the best model
        predict(): output the predicted label
        predict_proba(): output the probability of each label
        :return:
        """
        self.predictor = TabularPredictor.load(self.model_path)
        best_model = self.predictor.get_model_best()
        print("Best Model:\n", best_model)
        test_data_no_label, y_test = self.dataset.user_data()
        start_time = time.time()
        y_pred = self.predictor.predict(test_data_no_label, model=best_model)
        print("inference time: ", time.time() - start_time)
        y_pred_prob = self.predictor.predict_proba(test_data_no_label, model=best_model)
        print("y_test: ", y_test)
        print("y_pred: ", y_pred)
        print(y_pred_prob)
        print("Prediction result:", y_test == y_pred)


if __name__ == '__main__':
    import fire
    fire.Fire(Model())
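The `holdout_frac` argument passed to `fit()` above sets the share of training rows that is held out as a validation set. The sketch below shows the basic idea with the standard library only. It is an assumption-laden illustration, not AutoGluon's internal implementation (which may, for example, stratify by label); the function name `split_holdout` and the `seed` parameter are hypothetical.

```python
import random


def split_holdout(rows, holdout_frac=0.1, seed=0):
    """Shuffle the row indices and hold out a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(rows) * holdout_frac)
    val = [rows[i] for i in indices[:n_val]]
    train = [rows[i] for i in indices[n_val:]]
    return train, val


# 1000 dummy rows with holdout_frac=0.1 give a 900/100 split.
train_rows, val_rows = split_holdout(list(range(1000)), holdout_frac=0.1)
print(len(train_rows), len(val_rows))
```

A larger `holdout_frac` gives a more reliable validation score at the cost of fewer rows to train on.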
$ python3 auto.py train
Start training ..........
Warning: path already exists! This predictor may overwrite an existing predictor! path="table_predictor"
Loaded data from: ./train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
Train Data:
   age  workclass  fnlwgt  education  education-num      marital-status         occupation   relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  class
0   25    Private  178478  Bachelors             13       Never-married       Tech-support      Own-child  White  Female             0             0              40   United-States  <=50K
1   23  State-gov   61743    5th-6th              3       Never-married   Transport-moving  Not-in-family  White    Male             0             0              35   United-States  <=50K
2   46    Private  376789    HS-grad              9       Never-married      Other-service  Not-in-family  White    Male             0             0              15   United-States  <=50K
3   55          ?  200235    HS-grad              9  Married-civ-spouse                  ?        Husband  White    Male             0             0              50   United-States   >50K
4   36    Private  224541    7th-8th              4  Married-civ-spouse  Handlers-cleaners        Husband  White    Male             0             0              40     El-Salvador  <=50K
Presets specified: ['medium_quality_faster_train']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "table_predictor/"
AutoGluon Version: 0.3.1
Train Data Rows: 39073
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' <=50K', ' >50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 2261.62 MB
Train Data (Original) Memory Usage: 22.92 MB (1.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
...
...
AutoGluon training complete, total runtime = 67.34s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("table_predictor/")
Evaluating model ..........
Loaded data from: ./test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
Test Data:
   age         workclass  fnlwgt     education  education-num      marital-status       occupation  relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  class
0   31           Private  169085          11th              7  Married-civ-spouse            Sales          Wife  White  Female             0             0              20   United-States  <=50K
1   17  Self-emp-not-inc  226203          12th              8       Never-married            Sales     Own-child  White    Male             0             0              45   United-States  <=50K
2   47           Private   54260     Assoc-voc             11  Married-civ-spouse  Exec-managerial       Husband  White    Male             0          1887              60   United-States   >50K
3   21           Private  176262  Some-college             10       Never-married  Exec-managerial     Own-child  White  Female             0             0              30   United-States  <=50K
4   17           Private  241185          12th              8       Never-married   Prof-specialty     Own-child  White    Male             0             0              20   United-States  <=50K
Evaluation: roc_auc on test data: 0.9323364763680665
Evaluations on test data:
{
    "roc_auc": 0.9323364763680665,
    "accuracy": 0.8761388064284983,
    "balanced_accuracy": 0.8000729586881633,
    "mcc": 0.6412270975073234,
    "f1": 0.7151600753295669,
    "precision": 0.7870466321243523,
    "recall": 0.6553062985332183
}
Evaluation results:
 {'roc_auc': 0.9323364763680665, 'accuracy': 0.8761388064284983, 'balanced_accuracy': 0.8000729586881633, 'mcc': 0.6412270975073234, 'f1': 0.7151600753295669, 'precision': 0.7870466321243523, 'recall': 0.6553062985332183}
Model performance on the test set:
                 model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2    0.932336   0.935558        0.202042       0.067755  27.769457                 0.018307                0.001736           1.468465            2       True          9
1             CatBoost    0.931855   0.934702        0.038226       0.020100  25.316929                 0.038226                0.020100          25.316929            1       True          5
2             LightGBM    0.931088   0.934438        0.145509       0.045919   0.984063                 0.145509                0.045919           0.984063            1       True          2
3           LightGBMXT    0.928022   0.930726        0.345757       0.102603   2.624385                 0.345757                0.102603           2.624385            1       True          1
4      NeuralNetFastAI    0.914286   0.914985        0.178472       0.074495  15.291426                 0.178472                0.074495          15.291426            1       True          8
5     RandomForestGini    0.911646   0.910570        0.547053       0.109948   3.671543                 0.547053                0.109948           3.671543            1       True          3
6     RandomForestEntr    0.911283   0.911003        0.578620       0.111891   4.344845                 0.578620                0.111891           4.344845            1       True          4
7       ExtraTreesEntr    0.904868   0.905856        0.734225       0.175408   3.133456                 0.734225                0.175408           3.133456            1       True          7
8       ExtraTreesGini    0.904081   0.905642        1.049894       0.122711   2.265818                 1.049894                0.122711           2.265818            1       True          6
Computing feature importance via permutation shuffling for 14 features using 1000 rows with 3 shuffle sets...
2.79s = Expected runtime (0.93s per shuffle set)
1.05s = Actual runtime (Completed 3 of 3 shuffle sets)
Feature importance:
                importance    stddev   p_value  n  p99_high    p99_low
capital-gain      0.067067  0.006993  0.001802  3  0.107138   0.026997
age               0.041595  0.015031  0.020439  3  0.127725  -0.044534
relationship      0.022320  0.003899  0.005010  3  0.044662  -0.000022
education-num     0.021446  0.006913  0.016467  3  0.061057  -0.018166
occupation        0.020063  0.004353  0.007664  3  0.045004  -0.004877
marital-status    0.018524  0.004348  0.008936  3  0.043436  -0.006389
capital-loss      0.016525  0.003324  0.006609  3  0.035571  -0.002520
hours-per-week    0.014453  0.000872  0.000606  3  0.019451   0.009455
fnlwgt            0.010131  0.001359  0.002972  3  0.017917   0.002345
workclass         0.006703  0.001663  0.009949  3  0.016229  -0.002824
education         0.004152  0.000358  0.001231  3  0.006200   0.002103
native-country    0.004013  0.002588  0.057583  3  0.018840  -0.010815
sex               0.002201  0.000911  0.026324  3  0.007422  -0.003020
race              0.002195  0.001788  0.083653  3  0.012439  -0.008049
Model info:
...
Deleting model LightGBMXT. All files under table_predictor/models/LightGBMXT/ will be removed.
Deleting model RandomForestGini. All files under table_predictor/models/RandomForestGini/ will be removed.
Deleting model RandomForestEntr. All files under table_predictor/models/RandomForestEntr/ will be removed.
Deleting model ExtraTreesGini. All files under table_predictor/models/ExtraTreesGini/ will be removed.
Deleting model ExtraTreesEntr. All files under table_predictor/models/ExtraTreesEntr/ will be removed.
Deleting model NeuralNetFastAI. All files under table_predictor/models/NeuralNetFastAI/ will be removed.

$ python3 auto.py predict
Best Model:
 WeightedEnsemble_L2
inference time:  0.6184391975402832
y_test:  0     <=50K
Name: class, dtype: object
y_pred:  0     <=50K
Name: class, dtype: object
      <=50K      >50K
0  0.934742  0.065258
Prediction result: 0    True
Name: class, dtype: bool
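The auxiliary metrics in the evaluation output above are related to each other, which gives a quick sanity check. The sketch below copies the reported precision, recall, and balanced accuracy and recomputes F1 from its standard definition (the harmonic mean of precision and recall). It also derives specificity, under the assumption that balanced accuracy here is the usual mean of recall and specificity for a binary problem.

```python
# Values copied from the evaluation output above.
precision = 0.7870466321243523
recall = 0.6553062985332183
balanced_accuracy = 0.8000729586881633

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Assumption: balanced accuracy = (recall + specificity) / 2 for a binary task,
# so the true-negative rate can be recovered from the two reported numbers.
specificity = 2 * balanced_accuracy - recall

print(f"f1 = {f1:.6f}")
print(f"specificity = {specificity:.6f}")
```

The recomputed F1 agrees with the reported `f1` value, and the derived specificity shows that the model is considerably better at recognizing the `<=50K` class than the `>50K` class, which the single accuracy figure hides.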
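For image classification, AutoGluon can automatically produce high-quality image classification models. It trains highly accurate neural networks on the image dataset you provide and automatically applies accuracy-improving techniques such as transfer learning and hyperparameter optimization on your behalf.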
from autogluon.vision import ImageDataset, ImagePredictor
from tensorflow.keras.datasets import mnist
import os
import time
import numpy as np
import requests
import cv2


class MnistDataSets:
    """
    Configure the dataset and its labels
    """
    datasets_dir = "mnist_datasets"

    def download_mnist_data(self):
        """
        Load the official handwritten-digit dataset
        """
        (self.x_train, self.y_train), (self.x_test, self.y_test) = mnist.load_data()
        # print(self.x_train.shape)
        # print(self.y_train.shape)
        # print(self.x_train[0].shape)
        # (60000, 28, 28)
        # (60000,)
        # (28, 28)
        # The output shows that the dataset contains 60000 images, each a single-channel 28x28 image
        for label in self.label_mapping.keys():
            os.makedirs(name=f"{self.datasets_dir}/train/{label}", exist_ok=True)
            os.makedirs(name=f"{self.datasets_dir}/test/{label}", exist_ok=True)
        train_length = self.x_train.shape[0]  # train_length = 3000
        test_length = self.x_test.shape[0]  # test_length = 1000
        for index in range(train_length):
            cv2.imwrite(filename=f"{self.datasets_dir}/train/{self.y_train[index]}/{time.time()}.jpg",
                        img=self.x_train[index])
            # break
        for index in range(test_length):
            cv2.imwrite(filename=f"{self.datasets_dir}/test/{self.y_test[index]}/{time.time()}.jpg",
                        img=self.x_test[index])
            # break

    @property
    def label_mapping(self):
        """Label mapping"""
        return {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 0: 0}

    def get_online_test_data(self):
        """
        Fetch a handwritten-digit image online and preprocess it
        :return:
        """
        label = 3
        url = "https://img1.baidu.com/it/u=3472197447,93830654&fm=253&fmt=auto&app=138&f=JPEG?w=500&h=281"
        image = requests.get(url).content
        # np.fromstring is deprecated for binary data; np.frombuffer is the replacement
        nparr = np.frombuffer(image, np.uint8)
        gray = cv2.imdecode(nparr, cv2.IMREAD_GRAYSCALE)
        gray = cv2.resize(gray, (28, 28))
        _, gray = cv2.threshold(gray, thresh=165, maxval=255, type=cv2.THRESH_BINARY)
        return gray, label

    def load_data(self):
        """
        Load the datasets
        :return:
        """
        train_data, val_data, test_data = ImageDataset.from_folders(root=self.datasets_dir, train="train", test="test")
        print("Training dataset:\n", train_data)
        print("Test dataset:\n", test_data)
        print("Validation dataset:\n", val_data)
        return train_data, test_data


class Model(MnistDataSets):
    model_path = "mnist-model"

    def __init__(self):
        self.predictor: ImagePredictor = None

    def train(self):
        """
        Train the model
        :return:
        """
        train_data, test_data = self.load_data()
        self.predictor = ImagePredictor()
        print("Start training the model....")
        self.predictor.fit(train_data, hyperparameters={'epochs': 10})
        print("Saving model....")
        self.predictor.save(self.model_path)
        print("Evaluating model....")
        evaluate = self.predictor.evaluate(test_data)
        print("Model evaluation results:\n", evaluate)

    def predict(self):
        gray, label = self.get_online_test_data()
        self.predictor = ImagePredictor.load(self.model_path)
        print(self.predictor.list_models())
        # Write the preprocessed image directly; cv2.imwrite performs the JPEG encoding itself
        cv2.imwrite("3.jpg", gray)
        start_time = time.time()
        pred = self.predictor.predict("./3.jpg")
        print("inference time: ", time.time() - start_time)  # this step is very slow, so the library is not ideal here
        print(pred)


if __name__ == '__main__':
    import fire
    fire.Fire(Model())
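The `get_online_test_data` method above binarizes the downloaded image with `cv2.threshold(..., type=cv2.THRESH_BINARY)`. For readers unfamiliar with that mode, the following pure-Python sketch shows the rule it applies: pixels strictly greater than the threshold become `maxval`, all others become 0. The function `threshold_binary` is a hypothetical stand-in written for illustration; it is not OpenCV code and works on nested lists rather than arrays.

```python
def threshold_binary(image, thresh=165, maxval=255):
    """Pixels strictly above `thresh` become `maxval`; all others become 0."""
    return [[maxval if pixel > thresh else 0 for pixel in row] for row in image]


# A tiny 2x3 grayscale "image"; note that a pixel equal to the threshold maps to 0.
gray = [
    [0, 120, 165],
    [166, 200, 255],
]
print(threshold_binary(gray))  # [[0, 0, 0], [255, 255, 255]]
```

Whether 165 is a good threshold depends on the input image. MNIST digits are light strokes on a dark background, so an image with dark strokes on a light background would additionally need to be inverted to resemble the training data.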
$ python3 predictor-image.py download_mnist_data
$ python3 predictor-image.py train
Start training the model....
`time_limit=auto` set to `time_limit=7200`.
Reset labels to [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Randomly split train_data into train[54000]/validation[6000] splits.
No GPU detected/allowed, using most conservative search space.
Starting fit without HPO
modified configs(<old> != <new>): {
root.img_cls.model   resnet101 != resnet18
root.train.early_stop_baseline 0.0 != -inf
root.train.batch_size 32 != 16
root.train.early_stop_patience -1 != 10
root.train.early_stop_max_value 1.0 != inf
root.train.epochs    200 != 10
root.gpus            (0,) != ()
root.misc.seed       42 != 48
}
Saved config to /Users/rockontrol/Desktop/python_code/code/test/autopluon/e4f22295/.trial_0/config.yaml
Model resnet18 created, param count: 11181642
AMP not enabled. Training in float32.
...
Epoch[0] Batch [2349] Speed: 5.057824 samples/sec accuracy=0.626835 lr=0.000100
Epoch[0] Batch [2399] Speed: 5.103389 samples/sec accuracy=0.630052 lr=0.000100
Epoch[0] Batch [2449] Speed: 5.524156 samples/sec accuracy=0.632398 lr=0.000100
`time_limit=7199.991618871689` reached, exit early...
Finished, total runtime is 7260.00 s
{ 'best_config': { 'batch_size': 16,
                   'dist_ip_addrs': None,
                   'early_stop_baseline': -inf,
                   'early_stop_max_value': inf,
                   'early_stop_patience': 10,
                   'epochs': 10,
                   'final_fit': False,
                   'gpus': [],
                   'log_dir': '/Users/rockontrol/Desktop/python_code/code/test/autopluon/e4f22295',
                   'lr': 0.01,
                   'model': 'resnet18',
                   'ngpus_per_trial': 0,
                   'nthreads_per_trial': 32,
                   'num_trials': 1,
                   'num_workers': 4,
                   'problem_type': 'multiclass',
                   'scheduler': 'local',
                   'search_strategy': 'random',
                   'searcher': 'random',
                   'seed': 48,
                   'time_limits': 7200,
                   'wall_clock_tick': 1642659798.8088949},
  'total_time': 7200.409552574158,
  'train_acc': 0.6342192524115756,
  'valid_acc': -inf}
Saving model....
Evaluating model....
[Epoch 0] validation: top1=0.948200 top5=0.999200
Model evaluation results:
 {'loss': 0.29002630821466446, 'top1': 0.9482, 'top5': 0.9992}

$ python3 predictor-image.py predict
inference time:  23.1709041595459
0    0
Name: label, dtype: int64
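The evaluation output reports `top1` and `top5`. As a reminder of what those numbers mean, here is a minimal pure-Python sketch of top-k accuracy: a prediction counts as correct when the true label is among the k classes with the highest scores. The function name `top_k_accuracy` and the score values are made up for illustration and are unrelated to the model trained above.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up scores for three samples over four classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: hit at k=1
    [0.5, 0.3, 0.1, 0.1],  # true class 1: hit only at k=2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: miss even at k=2
]
labels = [1, 1, 3]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

With only ten digit classes, top-5 accuracy is expected to sit close to 1.0, so `top1` is the more informative figure for MNIST.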