UIE is a model Baidu open-sourced this year that can be applied to zero-shot information extraction. It is powerful and easy to use; it may not usher NLP into a whole new era, but it does dramatically lower the engineering barrier for basic NLP tasks, making it a very practical tool.
The usage guide in the official GitHub repository shows how to run predictions with UIE in just a few lines of code:
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
>>> ie = Taskflow('information_extraction', schema=schema)
>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint
[{'时间': [{'end': 6,
          'probability': 0.9857378532924486,
          'start': 0,
          'text': '2月8日上午'}],
  '赛事名称': [{'end': 23,
            'probability': 0.8503089953268272,
            'start': 6,
            'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
  '选手': [{'end': 31,
          'probability': 0.8981548639781138,
          'start': 28,
          'text': '谷爱凌'}]}]
A single line of code creates a model ready for inference, which is very convenient when you have internet access. In some scenarios, however, we need to deploy in an offline environment, so this post explains how to use the UIE model without a network connection.
Although I rarely use paddle myself, I vaguely remembered that models in paddle can be created with a from_pretrained method, which made me suspect that paddle's code logic borrows from the transformers library. I know the code and logic of transformers inside and out, so I decided to apply the transformers mental model to paddle.
The basic logic is this: the model first assumes the input string is a local path and tries to load from that path; if that fails, it treats the string as a model name, builds a download URL from it, fetches the files into a local cache, and tags them with MD5 checksums.
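A minimal sketch of that resolution order (the names here are illustrative, not paddlenlp's actual internals): treat the string as a local directory first, and only fall back to constructing a download URL when no such directory exists.

```python
import os

# Illustrative base URL; the real per-model URLs appear later in this post.
BASE_URL = "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction"

def resolve_pretrained(name_or_path):
    """Mimic the path-first, then-download resolution described above."""
    if os.path.isdir(name_or_path):
        return ("local", name_or_path)  # load files straight from disk
    # Otherwise assume it is a model name and build a remote URL for it.
    remote_dir = name_or_path.replace("-", "_")
    return ("remote", f"{BASE_URL}/{remote_dir}/model_state.pdparams")
```

Passing a directory that exists short-circuits the download entirely, which is exactly the behavior we will exploit for offline use.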
First, install paddlenlp:
pip install paddlenlp==2.3.4
From the official snippet we know that paddlenlp creates the model through Taskflow. As the name suggests, this class wraps all kinds of models, UIE included, so our job is to pull UIE out of it. That boils down to two steps: download the model files to local storage, then instantiate the model.
We open the Taskflow class in taskflow.py and look for its constructor; whatever the class, the instance is necessarily built in the constructor, so let's examine its logic:
def __init__(self, task, model=None, mode=None, device_id=0, **kwargs):
    assert task in TASKS, "The task name:{} is not in Taskflow list, please check your task name.".format(
        task)
    self.task = task
    if self.task in ["word_segmentation", "ner"]:
        tag = "modes"
        ind_tag = "mode"
        self.model = mode
    else:
        tag = "models"
        ind_tag = "model"
        self.model = model
    if self.model is not None:
        assert self.model in set(TASKS[task][tag].keys(
        )), "The {} name: {} is not in task:[{}]".format(tag, model, task)
    else:
        self.model = TASKS[task]['default'][ind_tag]
    # ... (rest omitted)
So in the official example, the code takes the else branch of the second conditional and creates self.model with the help of the global variable TASKS. In the same .py file we can find TASKS, and in it the part corresponding to UIE:
'information_extraction': {
    "models": {
        "uie-base": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-base"
        },
        "uie-medium": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-medium"
        },
        "uie-mini": {
            "task_class": UIETask,
            "hidden_size": 384,
            "task_flag": "information_extraction-uie-mini"
        },
        "uie-micro": {
            "task_class": UIETask,
            "hidden_size": 384,
            "task_flag": "information_extraction-uie-micro"
        },
        "uie-nano": {
            "task_class": UIETask,
            "hidden_size": 312,
            "task_flag": "information_extraction-uie-nano"
        },
        "uie-tiny": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-tiny"
        },
        "uie-medical-base": {
            "task_class": UIETask,
            "hidden_size": 768,
            "task_flag": "information_extraction-uie-medical-base"
        },
    },
    "default": {
        "model": "uie-base"
    }
}
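The default-model selection in the constructor is then just a dictionary lookup. A stripped-down sketch of that branch, using a cut-down stand-in for the real registry rather than the actual TASKS dict:

```python
# Cut-down stand-in for the real TASKS registry shown above.
TASKS = {
    "information_extraction": {
        "models": {"uie-base": {}, "uie-nano": {}},
        "default": {"model": "uie-base"},
    }
}

def pick_model(task, model=None):
    """Validate an explicit model name, or fall back to the task default."""
    if model is not None:
        assert model in TASKS[task]["models"], f"unknown model: {model}"
        return model
    return TASKS[task]["default"]["model"]
```

This is why `Taskflow('information_extraction', schema=schema)` with no model argument silently gives you uie-base.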
Each pretrained model size is given its own parameters here. Following this clue to the UIETask class, we jump to information_extraction.py, where the class attributes immediately reveal the download links for every UIE model:
class UIETask(Task):
    """
    Universal Information Extraction Task.
    Args:
        task(string): The name of task.
        model(string): The model name in the task.
        kwargs (dict, optional): Additional keyword arguments passed along to the specific task.
    """

    resource_files_names = {
        "model_state": "model_state.pdparams",
        "model_config": "model_config.json",
        "vocab_file": "vocab.txt",
        "special_tokens_map": "special_tokens_map.json",
        "tokenizer_config": "tokenizer_config.json"
    }  # vocab.txt/special_tokens_map.json/tokenizer_config.json are common to the default model.
    resource_files_urls = {
        "uie-base": {
            "model_state": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base_v1.0/model_state.pdparams",
                "aeca0ed2ccf003f4e9c6160363327c9b"
            ],
            "model_config": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json",
                "a36c185bfc17a83b6cfef6f98b29c909"
            ],
            "vocab_file": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
                "1c1c1f4fd93c5bed3b4eebec4de976a8"
            ],
            "special_tokens_map": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
                "8b3fb1023167bb4ab9d70708eb05f6ec"
            ],
            "tokenizer_config": [
                "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
                "59acb0ce78e79180a2491dfd8382b28c"
            ]
        },
        # ... (remaining models omitted)
Using these links, we download all the related files and put them in one folder, naming the folder uie. This is the path we will pass to from_pretrained in a moment.
Looking at its constructor, we find that it does not create the model itself, so the model must be created in the parent class. Following that clue to the Task parent class in task.py, we go around in a circle and end up back in the subclass: the model is built in the overridden _construct_model method. So, back in UIETask, we find _construct_model and hard-code the name passed to from_pretrained as our local path, i.e. the directory we just filled with the model files:
def _construct_model(self, model):
    """
    Construct the inference model for the predictor.
    """
    # model_instance = UIE.from_pretrained(self._task_path)
    # Hard-code the directory to the folder with the downloaded model files:
    model_instance = UIE.from_pretrained("your_model_path")
    self._model = model_instance
    self._model.eval()
This way, once the model finds the files locally, it will not try to download them from the internet.
That should have been the end of it, but running the code still raised an error. Checking again, I found that paddle apparently never intended users to load local files: the Task class has a check mechanism that verifies the MD5 checksums. Since we paid no attention to that while downloading, the check failed, the library decided the files had not been downloaded, and it went off to the internet again. All we need to do is comment the check out.
In UIETask, comment out the check line:
def __init__(self, task, model, schema, **kwargs):
    super().__init__(task=task, model=model, **kwargs)
    self._schema_tree = None
    self.set_schema(schema)
    # This is the line to comment out:
    # self._check_task_files()
    self._construct_tokenizer()
    self._check_predictor_type()
    self._get_inference_model()
    self._usage = usage
    self._max_seq_len = self.kwargs[
        'max_seq_len'] if 'max_seq_len' in self.kwargs else 512
    self._batch_size = self.kwargs[
        'batch_size'] if 'batch_size' in self.kwargs else 64
    self._split_sentence = self.kwargs[
        'split_sentence'] if 'split_sentence' in self.kwargs else False
    self._position_prob = self.kwargs[
        'position_prob'] if 'position_prob' in self.kwargs else 0.5
    self._lazy_load = self.kwargs[
        'lazy_load'] if 'lazy_load' in self.kwargs else False
    self._num_workers = self.kwargs[
        'num_workers'] if 'num_workers' in self.kwargs else 0
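If you would rather not edit files inside site-packages, the same effect can presumably be achieved by patching the method out at runtime before constructing the Taskflow, e.g. `UIETask._check_task_files = lambda self: None` (an untested assumption on my part). The pattern itself, demonstrated with a stand-in class so it runs without paddlenlp installed:

```python
class FakeTask:
    """Stand-in for UIETask: its constructor runs the file check."""

    def __init__(self):
        self._check_task_files()

    def _check_task_files(self):
        # Stand-in for the md5 check that triggers a re-download when offline.
        raise RuntimeError("offline: would try to download")

# Patch the check out before instantiating, just as we would for UIETask.
FakeTask._check_task_files = lambda self: None
task = FakeTask()  # now constructs cleanly: no check, no network
```

The runtime patch survives package reinstalls, at the cost of having to run before every Taskflow construction.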
After that, the original statement creates the model in a fully offline environment:
ie = Taskflow('information_extraction', schema=schema)
The walkthrough above is rather wordy; if you would rather skip it, jump straight to the summary here.
Step one: create a directory and download the model files into it. Download only the model you want, but the file set must be complete: for example, to use the base model you need every file under uie-base, all placed in this one directory.
The download URLs are as follows:
resource_files_urls = {
    "uie-base": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base_v1.0/model_state.pdparams",
            "aeca0ed2ccf003f4e9c6160363327c9b"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json",
            "a36c185bfc17a83b6cfef6f98b29c909"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    },
    "uie-medium": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medium_v1.0/model_state.pdparams",
            "15874e4e76d05bc6de64cc69717f172e"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medium/model_config.json",
            "6f1ee399398d4f218450fbbf5f212b15"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    },
    "uie-mini": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_mini_v1.0/model_state.pdparams",
            "f7b493aae84be3c107a6b4ada660ce2e"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_mini/model_config.json",
            "9229ce0a9d599de4602c97324747682f"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    },
    "uie-micro": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_micro_v1.0/model_state.pdparams",
            "80baf49c7f853ab31ac67802104f3f15"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_micro/model_config.json",
            "07ef444420c3ab474f9270a1027f6da5"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    },
    "uie-nano": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_nano_v1.0/model_state.pdparams",
            "ba934463c5cd801f46571f2588543700"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_nano/model_config.json",
            "e3a9842edf8329ccdd0cf6039cf0a8f8"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    },
    # Rename to `uie-medium` and the name of `uie-tiny` will be deprecated in future.
    "uie-tiny": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_tiny_v0.1/model_state.pdparams",
            "15874e4e76d05bc6de64cc69717f172e"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_tiny/model_config.json",
            "6f1ee399398d4f218450fbbf5f212b15"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    },
    "uie-medical-base": {
        "model_state": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medical_base_v0.1/model_state.pdparams",
            "569b4bc1abf80eedcdad5a6e774d46bf"],
        "model_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json",
            "a36c185bfc17a83b6cfef6f98b29c909"],
        "vocab_file": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/vocab.txt",
            "1c1c1f4fd93c5bed3b4eebec4de976a8"],
        "special_tokens_map": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/special_tokens_map.json",
            "8b3fb1023167bb4ab9d70708eb05f6ec"],
        "tokenizer_config": [
            "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/tokenizer_config.json",
            "59acb0ce78e79180a2491dfd8382b28c"]
    }
}
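The URL table above can be turned into a small download script. A sketch for uie-base only, with URLs and checksums copied from the table; `download_uie_base` and its `uie` default folder name are my own naming, not part of paddlenlp:

```python
import hashlib
import os
from urllib.request import urlretrieve

PREFIX = "https://bj.bcebos.com/paddlenlp/taskflow/information_extraction"

# Target filename -> (URL, expected md5), from resource_files_urls["uie-base"].
UIE_BASE_FILES = {
    "model_state.pdparams": (PREFIX + "/uie_base_v1.0/model_state.pdparams",
                             "aeca0ed2ccf003f4e9c6160363327c9b"),
    "model_config.json": (PREFIX + "/uie_base/model_config.json",
                          "a36c185bfc17a83b6cfef6f98b29c909"),
    "vocab.txt": (PREFIX + "/uie_base/vocab.txt",
                  "1c1c1f4fd93c5bed3b4eebec4de976a8"),
    "special_tokens_map.json": (PREFIX + "/uie_base/special_tokens_map.json",
                                "8b3fb1023167bb4ab9d70708eb05f6ec"),
    "tokenizer_config.json": (PREFIX + "/uie_base/tokenizer_config.json",
                              "59acb0ce78e79180a2491dfd8382b28c"),
}

def md5_of(path, chunk_size=1 << 20):
    """md5 of a file, read in chunks to keep memory bounded for large weights."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def download_uie_base(target_dir="uie"):
    """Fetch every uie-base file into target_dir, skipping verified copies."""
    os.makedirs(target_dir, exist_ok=True)
    for name, (url, expected_md5) in UIE_BASE_FILES.items():
        dest = os.path.join(target_dir, name)
        if os.path.exists(dest) and md5_of(dest) == expected_md5:
            continue  # already downloaded and intact
        urlretrieve(url, dest)
    return target_dir
```

Running `download_uie_base()` on a connected machine and copying the resulting folder to the offline host completes step one, with the md5 check catching truncated downloads.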
We have now downloaded the model files into one folder. Remember that folder's path, then locate information_extraction.py in paddlenlp; it usually lives under the taskflow directory of the paddlenlp package in site-packages, and if you cannot find it, just search for the filename.
Around line 290, change the referenced directory to the path of the model folder we just created:
def _construct_model(self, model):
    """
    Construct the inference model for the predictor.
    """
    # model_instance = UIE.from_pretrained(self._task_path)
    # Hard-code the directory to the folder with the downloaded model files:
    model_instance = UIE.from_pretrained("your_model_path")
    self._model = model_instance
    self._model.eval()
Finally, still in information_extraction.py, find the constructor around line 248 and comment out self._check_task_files():
def __init__(self, task, model, schema, **kwargs):
    super().__init__(task=task, model=model, **kwargs)
    self._schema_tree = None
    self.set_schema(schema)
    # This is the line to comment out:
    # self._check_task_files()
    self._construct_tokenizer()
And with that, we're done.
If you have any questions, leave a comment. See you next time.