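To support data analysis and the collection of basic training-data information in the RAInS project (an AI accountability systems project: RAInS Project Website), I use Python to build a machine-learning analysis platform that helps with data analysis and data modelling. The platform provides an interactive web-based user interface and supports selecting a dataset from local storage, automated visual analysis, regression and classification analysis, viewing training records, and viewing and plotting the parameters and results of trained models. It also generates the required JSON files, can run predictions on new datasets, and includes anomaly detection, association rules and other smaller features.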
import os
import mlflow
# use streamlit to achieve interactive use on the web side
import streamlit as st
import pandas as pd
# used to display the report in the web page
from streamlit_pandas_profiling import st_profile_report
# used to generate reports
from pandas_profiling import ProfileReport
# machine learning classification
import pycaret.classification as pc_cl
# machine learning regression
import pycaret.regression as pc_rg
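data: stores the training and test datasets
logs.log: records the log messages the system produces while the platform is running
mlruns: holds the run records of the trained machine-learning models; this folder is what mlflow reads
main.py: the main program of the machine-learning analysis platform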
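My original goal was to capture the data flow in a machine-learning (ML) system: basic information about the training data, whether the training data was preprocessed, the real data fed to the model after deployment, the model's predictions on that data, and the time each prediction took. I also wanted to record whether exceptions occurred at run time (for example out-of-memory errors or CPU overload), whether the hardware reported errors, and whether the real input data had an abnormal format or size. All of this would be written to a JSON file that serves as an interface to the other parts of the project. So I first built an OpenCV-based prototype that captured camera information, object motion times and exception information. I soon ran into a serious problem: work tied to one specific ML application can only use that application's methods and parameters, and ML tasks do not all follow the same template or standard. Is there a general method that covers such special cases as a subset and still meets the requirements?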
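The kind of record described above can be sketched with the standard library alone. This is only an illustration, not the project's actual interface: `run_and_record`, `toy_model` and the field names in the record are hypothetical, and the "model" is a stand-in function that sums each row.

```python
import json
import time
import traceback


def run_and_record(predict_fn, rows):
    """Call predict_fn(rows) and return a JSON-serialisable record of the call.

    The field names are illustrative assumptions, not a fixed schema.
    """
    record = {
        "n_rows": len(rows),
        "n_columns": len(rows[0]) if rows else 0,
        "predictions": None,
        "error": None,
    }
    start = time.perf_counter()
    try:
        record["predictions"] = list(predict_fn(rows))
    except Exception as ex:
        # Runtime failures (bad input types, MemoryError, ...) are recorded
        # in the output instead of stopping the program.
        record["error"] = {
            "type": type(ex).__name__,
            "message": str(ex),
            "traceback": traceback.format_exc(),
        }
    record["elapsed_seconds"] = time.perf_counter() - start
    return record


def toy_model(rows):
    """Hypothetical stand-in for a deployed model: sums each input row."""
    return [sum(row) for row in rows]


if __name__ == "__main__":
    good = run_and_record(toy_model, [[1, 2], [3, 4]])
    bad = run_and_record(toy_model, [[1, "x"]])  # malformed input row
    print(json.dumps(good, indent=2))
    print(bad["error"]["type"])
```

Because the record is a plain dictionary, it can be written to a file with `json.dump` and read by any other component without importing the model code.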
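For managing models and running predictions, I turned to mlflow (MLflow Website). Its Tracking component records the parameters and results of every run, along with data such as model plots. Pleasantly, pycaret already integrates mlflow: when pycaret runs with experiment logging enabled, it automatically uses mlflow to manage run records, logs and model information. More model information and data can be retrieved by calling the load_model function, and, conveniently, a developer only needs to supply a dataset to get predictions from the model.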
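MLflow is an end-to-end machine-learning lifecycle management tool released by Databricks (the company behind Spark). It has four main components:
Tracking and recording experiments, and comparing experiment parameters against their results (MLflow Tracking).
Packaging code in a reusable, reproducible format that can be shared with team members and deployed online (MLflow Projects).
Managing models from many different machine-learning frameworks and deploying them to most model-serving and inference platforms (MLflow Models).
Centralised, collaborative management of the full model lifecycle, including model versioning, model stage transitions and annotations (MLflow Model Registry).
MLflow is independent of third-party machine-learning libraries and can be combined with any ML library and any language, because all of its functionality is available through a REST API and a CLI. For convenience, SDKs for Python, R and Java are also provided.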
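To make the REST claim concrete, here is a minimal sketch that builds (but does not send) a request to the tracking server's run-search endpoint using only the standard library. The base URL `http://localhost:5000` is an assumption (a tracking server or `mlflow ui` running locally on port 5000), and `build_search_runs_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_search_runs_request(base_url, experiment_ids, max_results=10):
    """Build a POST request for MLflow's REST endpoint that searches runs.

    base_url is assumed to point at a running MLflow tracking server.
    """
    payload = {"experiment_ids": experiment_ids, "max_results": max_results}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/2.0/mlflow/runs/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_search_runs_request("http://localhost:5000", ["0"])
    print(request.get_method(), request.full_url)
    # Sending the request needs a running server, so it is left commented out:
    # with urllib.request.urlopen(request) as response:
    #     runs = json.load(response).get("runs", [])
```

In the Python SDK the same query is a call to `mlflow.search_runs`, which is what main.py uses to list trained models.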
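Finally, to provide visualisation and an interactive UI, I used streamlit (Streamlit Website). The components in the streamlit library cover most developers' needs: when designing a web UI, a single function call is enough to create and render a page element, with no hand-written HTML.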
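First, use git clone to copy the project repository from Github to your local machine. Install the required Python packages with pip or Conda. Any Python IDE can be used to edit or debug the program. Use streamlit to run the project's main.py by entering **streamlit run main.py** in a terminal. When the terminal prints a Local URL (a Network URL is shown alongside it), port 8501 is open and you can open the UI page in a browser.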
concatFilePath(file_folder, file_selected)
# get the full path of the file, used to read the dataset
def concatFilePath(file_folder, file_selected):
    # add a separator only when the folder name does not already end with one
    if str(file_folder)[-1] != '/':
        fileSelectedPath = file_folder + '/' + file_selected
    else:
        fileSelectedPath = file_folder + file_selected
    return fileSelectedPath
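As an aside, the standard library already handles the trailing-separator case. The sketch below is an alternative, not the code used in main.py; `concat_file_path` is a hypothetical name chosen to avoid confusion with the original function.

```python
import os


def concat_file_path(file_folder, file_selected):
    """Join folder and file name; os.path.join adds a separator only if needed."""
    return os.path.join(file_folder, file_selected)


if __name__ == "__main__":
    print(concat_file_path("data", "train.csv"))
    print(concat_file_path("data/", "train.csv"))
```

Unlike the manual version, this does not fail on an empty folder string, where `str(file_folder)[-1]` would raise an IndexError.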
getModelTrainingLogs(n_lines = 10)
# read logs.log and return its last n_lines lines;
# the user can set the number of lines
def getModelTrainingLogs(n_lines = 10):
    file = open('logs.log', 'r')
    lines = file.read().splitlines()
    file.close()
    return lines[-n_lines:]
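For large log files, reading the whole file just to keep the last few lines is wasteful. A possible variant, sketched here under the assumption that the log is a UTF-8 text file, keeps only the last lines in memory with `collections.deque`; `tail_log` is a hypothetical name, and the demo writes a temporary file rather than touching the real logs.log.

```python
import os
import tempfile
from collections import deque


def tail_log(path="logs.log", n_lines=10):
    """Return the last n_lines lines of a text file without storing all lines."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # deque with maxlen discards older lines as new ones are read
        return [line.rstrip("\n") for line in deque(f, maxlen=n_lines)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "logs.log")
        with open(demo_path, "w", encoding="utf-8") as f:
            f.write("\n".join(f"line {i}" for i in range(1, 21)))
        print(tail_log(demo_path, n_lines=3))
```

The `with` statement also closes the file if reading fails, which the explicit `open`/`close` pair does not guarantee.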
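# load the data set, put the data set into the cache
@st.cache(suppress_st_warning=True)
def load_csv(file_selected_path, nrows):
    try:
        # nrows == -1 means "read the whole file"
        if nrows == -1:
            df = pd.read_csv(file_selected_path)
        else:
            df = pd.read_csv(file_selected_path, nrows=nrows)
    except Exception as ex:
        # on failure, return an empty frame and show the error on the page
        df = pd.DataFrame([])
        st.exception(ex)
    return df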
"""
RAInS Project: machine-learning analysis platform
Author: Junhao Song
Email: songjh.john@gmail.com
Website: http://junhaosong.com/
"""

import os
import mlflow
# use streamlit to achieve interactive use on the web side
import streamlit as st
import pandas as pd
# used to display the report in the web page
from streamlit_pandas_profiling import st_profile_report
# used to generate reports
from pandas_profiling import ProfileReport
# machine learning classification
import pycaret.classification as pc_cl
# machine learning regression
import pycaret.regression as pc_rg

# store some commonly used machine learning modeling techniques
ML_LIST = ['Regression', 'Classification']
RG_LIST = ['lr', 'svm', 'rf', 'xgboost', 'lightgbm']
CL_LIST = ['lr', 'dt', 'svm', 'rf', 'xgboost', 'lightgbm']


# list certain extension files in the folder
def listFiles(directory, extension):
    return [f for f in os.listdir(directory) if f.endswith('.' + extension)]


# read logs.log and return its last n_lines lines;
# the user can set the number of lines
def getModelTrainingLogs(n_lines = 10):
    file = open('logs.log', 'r')
    lines = file.read().splitlines()
    file.close()
    return lines[-n_lines:]


# get the full path of the file, used to read the dataset
def concatFilePath(file_folder, file_selected):
    if str(file_folder)[-1] != '/':
        fileSelectedPath = file_folder + '/' + file_selected
    else:
        fileSelectedPath = file_folder + file_selected
    return fileSelectedPath


# load the data set, put the data set into the cache
@st.cache(suppress_st_warning=True)
def load_csv(file_selected_path, nrows):
    try:
        if nrows == -1:
            df = pd.read_csv(file_selected_path)
        else:
            df = pd.read_csv(file_selected_path, nrows=nrows)
    except Exception as ex:
        df = pd.DataFrame([])
        st.exception(ex)
    return df


def app_main():
    st.title("Machine learning analysis platform")

    # step 1: choose the data folder, the CSV file and how many rows to read
    if st.sidebar.checkbox('Define Data Source'):
        filesFolder = st.sidebar.text_input('folder', value="data")
        dataList = listFiles(filesFolder, 'csv')
        if len(dataList) == 0:
            # keep the variables defined so the later sections do not fail
            file_selected_path = None
            nrows = 100
            st.warning('No data set available')
        else:
            file_selected = st.sidebar.selectbox(
                'Select a document', dataList)
            file_selected_path = concatFilePath(filesFolder, file_selected)
            nrows = st.sidebar.number_input('Number of lines', value=-1)
            n_rows_str = 'All' if nrows == -1 else str(nrows)
            # f-string so that the placeholders are actually filled in
            st.info(f'Selected file: {file_selected_path}, '
                    f'the number of rows read is {n_rows_str}')
    else:
        file_selected_path = None
        nrows = 100
        st.warning('The currently selected file is empty, please select:')

    # step 2: automated exploratory report
    if st.sidebar.checkbox('Exploratory Analysis'):
        if file_selected_path is not None:
            if st.sidebar.button('Report Generation'):
                df = load_csv(file_selected_path, nrows)
                pr = ProfileReport(df, explorative=True)
                st_profile_report(pr)
        else:
            st.info('No file selected, analysis cannot be performed')

    # step 3: choose a task, a model and a target column, then train
    if st.sidebar.checkbox('Modeling'):
        if file_selected_path is not None:
            task = st.sidebar.selectbox('Select Task', ML_LIST)
            if task == 'Regression':
                model = st.sidebar.selectbox('Select Model', RG_LIST)
            elif task == 'Classification':
                # classification offers its own model list (CL_LIST)
                model = st.sidebar.selectbox('Select Model', CL_LIST)
            df = load_csv(file_selected_path, nrows)
            try:
                cols = df.columns.to_list()
                target_col = st.sidebar.selectbox('Select Prediction Object', cols)
            except BaseException:
                st.sidebar.warning('The data format cannot be read correctly')
                target_col = None

            if target_col is not None and st.sidebar.button('Training Model'):
                if task == 'Regression':
                    st.success('Data preprocessing...')
                    pc_rg.setup(
                        df,
                        target=target_col,
                        log_experiment=True,
                        experiment_name='ml_',
                        log_plots=True,
                        silent=True,
                        verbose=False,
                        profile=True)
                    st.success('Data preprocessing is complete')
                    st.success('Training model...')
                    pc_rg.create_model(model, verbose=False)
                    st.success('The model training is complete...')
                    #pc_rg.finalize_model(model)
                    st.success('Model has been created')
                elif task == 'Classification':
                    st.success('Data preprocessing...')
                    pc_cl.setup(
                        df,
                        target=target_col,
                        fix_imbalance=True,
                        log_experiment=True,
                        experiment_name='ml_',
                        log_plots=True,
                        silent=True,
                        verbose=False,
                        profile=True)
                    st.success('Data preprocessing is complete.')
                    st.success('Training model...')
                    pc_cl.create_model(model, verbose=False)
                    st.success('The model training is complete...')
                    #pc_cl.finalize_model(model)
                    st.success('Model has been created')

    # step 4: show the tail of the system log
    if st.sidebar.checkbox('View System Log'):
        n_lines = st.sidebar.slider(label='Number of lines', min_value=3, max_value=50)
        if st.sidebar.button("Check View"):
            logs = getModelTrainingLogs(n_lines=n_lines)
            st.text('System log')
            st.write(logs)

    # step 5: list the runs tracked by mlflow and predict with a chosen model
    try:
        # experiment IDs are passed as a list of strings
        allOfRuns = mlflow.search_runs(experiment_ids=['0'])
    except Exception:
        allOfRuns = []
    if len(allOfRuns) != 0:
        if st.sidebar.checkbox('Preview model'):
            ml_logs = 'http://kubernetes.docker.internal:5000/ -->Open mlflow, enter the command line: mlflow ui'
            st.markdown(ml_logs)
            st.dataframe(allOfRuns)
        if st.sidebar.checkbox('Choose a model'):
            selected_run_id = st.sidebar.selectbox(
                'Choose from saved models',
                allOfRuns[allOfRuns['tags.Source'] == 'create_model']['run_id'].tolist())
            selected_run_info = allOfRuns[(
                allOfRuns['run_id'] == selected_run_id)].iloc[0, :]
            st.code(selected_run_info)
            if st.sidebar.button('Forecast data'):
                model_uri = 'runs:/' + selected_run_id + '/model/'
                model_loaded = mlflow.sklearn.load_model(model_uri)
                df = pd.read_csv(file_selected_path, nrows=nrows)
                #st.success('Model prediction...')
                pred = model_loaded.predict(df)
                pred_df = pd.DataFrame(pred, columns=['Predictive Data'])
                st.dataframe(pred_df)
                pred_df.plot()
                st.pyplot()
    else:
        st.sidebar.warning('Did not find a trained model')


if __name__ == '__main__':
    app_main()
Finally, I would like to thank my supervisor Wei Pang (Github) for academic guidance and Danny (Github) for technical help with this project.