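The requirement: the data analysis team has already implemented a model in Python, and it has to run on the Flink platform so that data can be processed by calling the model in real time.
Goal: on the Flink platform, call the existing Python model to make real-time predictions.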
Prerequisite: sklearn2pmml 0.14.0 or newer. Install it with:
pip install sklearn2pmml -i https://pypi.tuna.tsinghua.edu.cn/simple/
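To confirm that the installed package meets the 0.14.0 minimum, something like the following can be run after installation. This is a minimal sketch that relies only on the standard library (Python 3.8+ for importlib.metadata); version_tuple and meets_minimum are helper names made up for this example, and the comparison deliberately ignores pre-release suffixes.

```python
from importlib import metadata

MINIMUM = "0.14.0"  # minimum sklearn2pmml version stated above


def version_tuple(version):
    """Turn '0.14.0' into (0, 14, 0); non-numeric suffixes in a part are dropped."""
    parts = []
    for part in version.split("."):
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad to three components so that "0.14" compares equal to "0.14.0"
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum=MINIMUM):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("sklearn2pmml")
    status = "OK" if meets_minimum(installed) else "too old"
    print("sklearn2pmml", installed, status)
except metadata.PackageNotFoundError:
    print("sklearn2pmml is not installed")
```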
The test here is based on sklearn. For the PMML packages of other frameworks, see the examples on the author's GitHub.
Author's examples: https://github.com/jpmml/sklearn2pmml
A typical workflow can be summarized as follows:
1. Create a PMMLPipeline object, and populate it with pipeline steps as usual. Class sklearn2pmml.pipeline.PMMLPipeline extends class sklearn.pipeline.Pipeline with the following functionality:
   - If the PMMLPipeline.fit(X, y) method is invoked with pandas.DataFrame or pandas.Series object as an X argument, then its column names are used as feature names. Otherwise, feature names default to “x1”, “x2”, …, “x{number_of_features}”.
   - If the PMMLPipeline.fit(X, y) method is invoked with pandas.Series object as an y argument, then its name is used as the target name (for supervised models). Otherwise, the target name defaults to “y”.
2. Fit and validate the pipeline as usual.
3. Optionally, compute and embed verification data into the PMMLPipeline object by invoking PMMLPipeline.verify(X) method with a small but representative subset of training data.
4. Convert the PMMLPipeline object to a PMML file in local filesystem by invoking utility method sklearn2pmml.sklearn2pmml(pipeline, pmml_destination_path).
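The two naming rules in step 1 can be mimicked in a few lines of plain Python. This is only an illustration of the rule as quoted above, not sklearn2pmml's own code: resolve_feature_names and resolve_target_name are hypothetical helpers, and the sketch covers just DataFrame-like input (anything with a columns attribute) and plain nested lists.

```python
def resolve_feature_names(X):
    """Use the column names if X carries them, else fall back to x1..xN."""
    columns = getattr(X, "columns", None)
    if columns is not None:
        return [str(c) for c in columns]
    n_features = len(X[0])  # assumes a non-empty 2-D nested list
    return ["x{}".format(i) for i in range(1, n_features + 1)]


def resolve_target_name(y):
    """Use y.name if y carries one (as a pandas.Series does), else 'y'."""
    name = getattr(y, "name", None)
    return str(name) if name is not None else "y"


# Plain lists carry no names, so the defaults apply
rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
labels = [0, 0]
print(resolve_feature_names(rows))  # ['x1', 'x2', 'x3', 'x4']
print(resolve_target_name(labels))  # y
```

This is why the example in this post builds a pandas.DataFrame with readable column names before fitting: the names end up as the field names in the exported PMML.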
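In short:
1. Create a PMMLPipeline object and set its pipeline steps.
2. Fit the PMMLPipeline object.
3. Optionally, warm up (verify) the model.
4. Convert the PMMLPipeline object to a PMML file.
The author has two examples on GitHub: a decision tree classifier and a logistic regression classifier, both on the iris dataset. Only the decision tree example is shown here.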
import pandas
from sklearn.datasets import load_iris
# The author's GitHub example reads a CSV file; here the dataset bundled with sklearn is used instead
# iris_df = pandas.read_csv("Iris.csv")
# iris_X = iris_df[iris_df.columns.difference(["Species"])]
# iris_y = iris_df["Species"]
# Load the iris dataset that ships with sklearn
iris = load_iris()
# Build a DataFrame, using feature_names as the column names
iris_df = pandas.DataFrame(iris.data, columns=iris.feature_names)
# Store the target values in a "label" column
iris_df['label'] = iris.target
# Rename the columns (this drops the " (cm)" suffix from the feature names)
iris_df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
# Split into features and target; columns.difference returns the remaining names in sorted order
iris_X = iris_df[iris_df.columns.difference(["label"])]
iris_y = iris_df["label"]
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml.pipeline import PMMLPipeline
# Each pipeline step is a (name, object) tuple. For a classification model the step is named "classifier" here,
# following the names used in the author's examples: "selector" for feature selection, "mapper" for feature
# preprocessing, "pca" for PCA, "classifier" for a classifier, "regressor" for a regressor.
# See the README on GitHub for details.
pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
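# Train the pipeline
pipeline.fit(iris_X, iris_y)
from sklearn2pmml import sklearn2pmml
# Convert the fitted pipeline to a PMML file
# (raw string, so the backslash in the Windows path is not read as an escape sequence)
sklearn2pmml(pipeline, r"D:\DecisionTreeIris.pmml", with_repr = True)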
Note: the following warnings appear during execution and can be ignored.
D:\ITinstall\anaconda3\lib\subprocess.py:848: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stdout = io.open(c2pread, 'rb', bufsize)
D:\ITinstall\anaconda3\lib\subprocess.py:853: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stderr = io.open(errread, 'rb', bufsize)
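Once the export finishes, one quick way to confirm that the file contains the expected fields is to read its DataDictionary. The sketch below uses only the standard library (the {*} namespace wildcard needs Python 3.8+) and a small hand-written PMML fragment so that it runs without a real export. SAMPLE_PMML and list_data_fields are illustrative names, not part of sklearn2pmml; a real export contains more fields plus the model element, and its PMML namespace version may differ.

```python
import xml.etree.ElementTree as ET


def list_data_fields(pmml_root):
    """Return (name, optype, dataType) for every DataField in a PMML document."""
    return [
        (field.get("name"), field.get("optype"), field.get("dataType"))
        for field in pmml_root.iterfind(".//{*}DataField")
    ]


# Hand-written fragment for illustration only, not the output of a real export
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="label" optype="categorical" dataType="integer"/>
    <DataField name="petal length" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

fields = list_data_fields(ET.fromstring(SAMPLE_PMML))
for name, optype, data_type in fields:
    print(name, optype, data_type)
```

For the file written above, the same function can be called as list_data_fields(ET.parse(r"D:\DecisionTreeIris.pmml").getroot()).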
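Some further notes on constructing a PMMLPipeline follow.
Quite a few step names appear in the examples, but how to set up these (name, object) tuples is documented only through example code on GitHub. Many usage patterns are scattered across different corners of the project (mainly the README), so they take some effort to find. Most people probably dig through them bit by bit only when they need something, or simply ask the author in an issue.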