Paper: KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases
⭐⭐⭐⭐
Work by Yanghua Xiao's team at Fudan University.
KnowledGPT proposes a RAG framework that enhances LLM generation by retrieving knowledge from knowledge bases.
The knowledge base stores knowledge in three forms. One example:
["Socrates", "Military Service", "Socrates served as a Greek hoplite or heavy infantryman..."]
To support retrieval from this knowledge base, the work implements three query functions in advance:
- get_entity_info: takes an entity as input and returns a textual description of that entity
- find_entity_or_value: takes an entity and a relation as input and returns all related entities or values
- find_relationship: takes two entities as input and returns all relationships between them
Note that wherever an input is described as "an entity" or "a relation", what is actually passed is a list of aliases. For example, to input the relation "good at", you would actually pass ["be good at", "be expert in", "specialize in"], because we do not know in advance how that relation is concretely represented in the knowledge base.
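To make these interfaces concrete, here is a minimal sketch of the three query functions over a toy in-memory triple store. The store layout, sample data, and message formatting are illustrative assumptions; only the function names and the (result, message) return convention come from the paper.

```python
# Toy in-memory KB: (head, relation, tail) triples plus entity descriptions.
# Illustrative sketch only, not the paper's implementation.
TRIPLES = [
    ("Socrates", "occupation", "philosopher"),
    ("Socrates", "military service", "Greek hoplite"),
]
DESCRIPTIONS = {
    "Socrates": "Socrates served as a Greek hoplite or heavy infantryman...",
}

def get_entity_info(entity_aliases):
    """Return a text description for the first alias found in the KB."""
    for alias in entity_aliases:
        if alias in DESCRIPTIONS:
            return DESCRIPTIONS[alias], f"get_entity_info({entity_aliases}) -> found"
    return None, f"get_entity_info({entity_aliases}) -> not found"

def find_entity_or_value(entity_aliases, relation_aliases):
    """Return tails of triples whose head and relation match any alias."""
    results = [t for h, r, t in TRIPLES
               if h in entity_aliases and r in relation_aliases]
    return (results or None), f"find_entity_or_value -> {results}"

def find_relationship(entity1_aliases, entity2_aliases):
    """Return relations connecting any alias of entity1 to any alias of entity2."""
    rels = [r for h, r, t in TRIPLES
            if h in entity1_aliases and t in entity2_aliases]
    return (rels or None), f"find_relationship -> {rels}"
```

In the real system, these functions would be backed by actual KB lookups (and, as described later, by entity linking and embedding-based relation matching) rather than exact string matches.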
When answering a user question, KnowledGPT first has the LLM decide whether the knowledge base is needed. If so, the LLM uses the three pre-implemented query functions to generate a piece of Python code; this code is executed to retrieve knowledge from the knowledge base, and the retrieved results then assist in generating the answer to the user's question. This is the basic principle of KnowledGPT.
Beyond this, KnowledGPT includes two additional components:
KnowledGPT draws external knowledge from multiple knowledge bases (KBs) to supplement LLM generation. Its main innovations lie in completing two tasks:
We have already implemented the three KB query functions: get_entity_info, find_entity_or_value, and find_relationship. Through a prompt, the work has the LLM generate Python code built on these three functions (a single def search() function), which can then be executed to perform the knowledge retrieval.
The prompt that asks the LLM to generate the corresponding Python code, given the query functions, is as follows:
You are an awesome knowledge graph accessing agent that helps to RETRIEVE related knowledge about user queries via writing python codes to access external knowledge sources. Your python codes should implement a search function using exclusively built-in python functions and the provided functions listed below. ===PROVIDED FUNCTIONS=== 1. get_entity_info: obtain encyclopedic information about an entity from external sources, which is used to answer general queries like "Who is Steve Jobs". Args: "entity_aliases": a list of the entity's aliases, e.g. ['American', 'United States', 'U.S.'] for the entity 'American'. Return: two strings, 'result' and 'message'. 'result' is the encyclopedic information about the entity if retrieved, None otherwise. 'message' states this function call and its result. 2. find_entity_or_value: access knowledge graphs to answer factual queries like "Who is the founder of Microsoft?". Args: "entity_aliases": a list of the entity's aliases, "relation_aliases": a list of the relation's aliases. Return: two variables, 'result' and 'message'. 'result' is a list of entity names or attribute value to this query if retrieved, None otherwise. 'message' is a string states this function call and its result. 3. find_relationship: access knowledge graphs to predict the relationship between two entities, where the input query is like "What's the relationship between Steve Jobs and Apple Inc?". Args: "entity1_aliases": a list of entity1's aliases, "entity2_aliases": a list of entity2's aliases. Return: two strings, 'result' and 'message'. 'result' is the relationship between entity1 and entity2 if retrieved, None otherwise. 'message' states this function call and its result. ===REQUIREMENTS=== 1. [IMPORTANT] Always remember that your task is to retrieve related knowledge instead of answering the queries directly. Never try to directly answer user input in any form. Do not include your answer in your generated 'thought' and 'code'. 2. 
Exclusively use built-in python functions and the provided functions. 3. To better retrieve the intended knowledge, you should make necessary paraphrase and list several candidate aliases for entities and relations when calling the provided functions, sorted by the frequency of the alias. E.g., "Where is Donald Trump born" should be paraphrased as find_entity_or_value(["Donald Trump", "President Trump"], ["place of birth", "is born in"]). Avoid entity alias that may refer to other entities, such as 'Trump' for 'Donald Trump'. 4. When using find_entity_or_value, make sure the relation is a clear relation. Avoid vague and broad relation aliases like "information". Otherwise, use get_entity_info instead. For example, for the question 'Who is related to the Battle of Waterloo?', you should use get_entity_info(entity_aliases = ['the Battle of Waterloo']) instead of find_entity_or_value(entity_aliases = ['the Battle of Waterloo'], relation_aliases = ['related to']) since 'related to' is too vague to be searched. 5. The input can be in both English and Chinese. If the input language is NOT English, make sure the args of get_entity_info, find_entity_or_value and find_relationship is in the input language. 6. The queries may need multiple or nested searching. Use smart python codes to deal with them. Note that find_entity_or_value will return a list of results. 7. Think step by step. Firstly, you should determine whether the user input is a query that "need knowledge". If no, simply generate "no" and stop. Otherwise, generate "yes", and go through the following steps: First, Come up with a "thought" about how to find the knowledge related to the query step by step. Make sure your "thought" covers all the entities mentioned in the input. Then, implement your "thought" into "code", which is a python function with return. 
After that, make an "introspection" whether your "code" is problematic, including whether it can solve the query, can be executed, and whether it contradicts the requirements (especially whether it sticks to the RETRIEVE task or mistakenly tries to answer the question). Make sure "thought" and "introspection" are also in the same language as the query. Finally, set "ok" as "yes" if no problem exists, and "no" if your "introspection" shows there is any problem. 8. For every call of get_entity_info, find_entity_or_value and find_relationship, the return 'message' are recorded into a string named 'messages', which is the return value of search(). 9. Add necessary explanation to the 'messages' variable after running certain built-in python codes, such as, messages += f'{top_teacher} is the teacher with most citations'. 10. When the user query contains constraints like "first", "highest" or mathmatical operations like "average", "sum", handle them with built-in functions. 11. Response in json format. ===OUTPUT FORMAT=== { "need_knowledge": "<yes or no. If no, stop generating the following.>" "thought": "<Your thought here. Think how to find the answer to the query step by step. List possible aliases of entities and relations.>", "code": "def search():\\n\\tmessages = ''\\n\\t<Your code here. Implement your thought.>\\n\\treturn messages\\n", "introspection": "<Your introspection here.>", "ok": "<yes or no>" } ===EXAMPLES=== 1. Input: "Who are you?" Output: { "need_knowledge": "no" } 2. Input: “Who proposed the theory of evolution?" Output: { "need_knowledge": "yes", "thought": "The question is asking who proposed the theory of evolution. I need to search for the proponent of the theory of evolution. 
The possible expressions for the 'proponent' relationship include 'proposed', 'proponent', and 'discovered'.", "code": "def search():\\n\\tmessages = ''\\n\\tproposer, msg = find_entity_or_value(entity_aliases = ['theory of evolution'], relation_aliases = ['propose', 'proponent', 'discover'])\\n\\tmessages += msg\\n\\treturn messages\\n", "introspection": "The generated code meets the requirements.", "ok": "yes" } 3. Input: "what is one of the stars of 'The Newcomers' known for?" Output:{ "need_knowledge": "yes", "thought": "To answer this question, firstly we need to find the stars of 'The Newcomers'. The relation can be paraphrased as 'star in', 'act in' or 'cast in'. Then, we should select one of them. Finally, we should retrieve its encyclopedic information to know what he or she is known for. We should not treat 'known for' as a relation because it's too vague.", "code": "def search():\\n\\tmessages = ''\\n\\tstars, msg = find_entity_or_value(entity_aliases = ['The Newcomers'], relation_aliases = ['star in', 'act in', 'cast in'])\\n\\tmessages += msg\\n\\tstar = random.choice(stars)\\n\\tstar_info, msg = get_entity_info(entity_aliases = [star])\\n\\tmessages += msg\\n\\treturn messages\\n", "introspection": "The generated code is executable and matches user input. It adheres to the requirements. It finishes the retrieve task instead of answering the question directly.", "ok": "yes" }
Let's try this prompt with ChatGPT 3.5:
The LLM generates code like this:
def search():
    messages = ''
    li_bai_info, msg = get_entity_info(entity_aliases = ['Li Bai', 'Li Bo', 'Li Taibai'])
    messages += msg
    return messages
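Since the generated search() arrives as a string, one plausible way to run it (an assumption; the execution mechanism is not spelled out here) is exec() in a namespace that already binds the query functions. The stub get_entity_info below is a placeholder for the real KB query function.

```python
# Sketch: execute an LLM-generated `search` function in a controlled namespace.
# `get_entity_info` here is a stand-in stub, not the real KB lookup.
def get_entity_info(entity_aliases):
    return "stub description", f"get_entity_info({entity_aliases}) -> stub"

generated_code = (
    "def search():\n"
    "    messages = ''\n"
    "    info, msg = get_entity_info(entity_aliases=['Li Bai'])\n"
    "    messages += msg\n"
    "    return messages\n"
)

# exec() defines search() inside `namespace`, whose globals include the
# query functions, so the generated code can call them.
namespace = {"get_entity_info": get_entity_info}
exec(generated_code, namespace)
messages = namespace["search"]()
```

The returned messages string (the concatenated call logs) is what gets handed to the answer-generation prompt later.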
Trying a few more questions, the code the LLM generates is consistently stable and accurate.
The KB query functions must be implemented in advance; the paper divides them into two layers:
The KB-specific level depends on the particular knowledge base, so only the unified-level implementations are described here:
- entity_link: first uses _entity_linking to find all candidate entities, then uses _get_entity_info to fetch each candidate's description, and passes everything back to the LLM, which decides which candidate is the right one (e.g., whether "apple" refers to the fruit or to Apple Inc.).
- get_entity_info: first uses entity_link to determine the correct entity, then calls _get_entity_info to fetch its information.
- find_entity_or_value: more involved. It first uses entity_link to find the entity, then compares each of the entity's relations (drawn from all of its triples) against the input relation aliases, picks the closest relation r by embedding similarity, and returns the entities or values from the triples of r. The original paper gives the full algorithm, which is not expanded here.
- find_relationship: the algorithm is similar to find_entity_or_value, except that it compares the similarity of the entities in the triples and returns the corresponding relation.

After knowledge has been retrieved with the LLM-generated search function, the prompt that asks the LLM to generate the answer is as follows:
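The embedding-similarity matching step inside find_entity_or_value can be sketched as follows. The bag-of-words embed() below is a toy stand-in for the text-embedding-ada-002 vectors actually used, and best_relation is a hypothetical helper name.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for text-embedding-ada-002."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_relation(kb_relations, relation_aliases):
    """Pick the KB relation closest to any input alias by embedding similarity."""
    alias_vecs = [embed(a) for a in relation_aliases]
    scored = [(max(cosine(embed(r), v) for v in alias_vecs), r)
              for r in kb_relations]
    return max(scored)[1]

# Matching the aliases for the "good at" relation against an entity's relations:
r = best_relation(["place of birth", "be good at", "spouse"],
                  ["be expert in", "be good at"])
```

With real embeddings, near-synonyms like "be expert in" and "specialize in" would also score highly against the stored relation, which is exactly why the alias list is passed instead of a single string.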
You are an helpful and knowledgable AI assistant. The user has issued a query, and you are provided with some related knowledge. Now, you need to think step by step to answer the user input with the related knowledge. ===REQUIREMENTS=== 1. You should think step by step. First, think carefully whether you can answer this query without the provided knowledge. Second, consider how to use the related knowledge to answer the query. Then, tell me whether this query can be answered with your own knowledge and the provided knowledge. If so, answer this question. However, if the query involves a command or an assumption, you should always regard it as answerable. 2. When you are thinking, you can use and cite the provided knowledge. However, when you are generating the answer, you should pretend that you came up with the knowledge yourself, so you should not say things like "according to the provided knowledge from ..." in the "answer" part. 3. The user query and provided knowledge can be in both Chinese and English. Generate your "thought" and "answer" in the same language as the input. 4. Response in json format, use double quotes. ===INPUT FORMAT=== { "query": "<the user query that you need to answer>", "knowledge": "<the background knowledge that you are provided with>" } ===OUTPUT FORMAT=== { "thought": "<Your thought here. Think step by step as is required.>", "answerable": "<yes or no. Whether you can answer this question with your knowledge and the provided knowledge. 
If the query involves a command or an assumption, say 'yes'.>", "answer": "<Your answer here, if the query is answerable.>" } ===EXAMPLES=== Input:{ "query": "What is the motto of the school where Xia Mingyou graduated?", "knowledge": "[FROM CNDBPedia][find_entity_or_value(entity_aliases = ['Xia Mingyou'], relation_aliases = ['graduated from', 'school']) -> ] Xi Mingyou, school: Fudan University[find_entity_or_value(entity_aliases = ['Fudan University'], relation_aliases = ['motto']) -> ] Fudan University, motto: Rich in Knowledge and Tenacious of Purpose; Inquiring with Earnestness and Reflecting with Self-practice" } Output:{ "thought": "Based on the background knowledge from CNDBPedia, Xia Mingyou graduated from Fudan University, and the motto of Fudan University is 'Rich in Knowledge and Tenacious of Purpose; Inquiring with Earnestness and Reflecting with Self-practice '. So the answer is ' Rich in Knowledge and Tenacious of Purpose; Inquiring with Earnestness and Reflecting with Self-practice '. This question can be answered based on the provided knowledge.", "answerable": "yes", "answer": " Rich in Knowledge and Tenacious of Purpose; Inquiring with Earnestness and Reflecting with Self-practice " } Input:{ "query": "What is Liang Jiaqing's weapon?", "knowledge": "[FROM CNDBPEDIA] Liang Jiaqing: Liang Jiaqing, also known as Lu Yuan. A member of the Chinese Communist Party, born after the 1960s, with a university education. Specially appointed writer for 'Chinese Writers' magazine and 'Chinese Reportage Literature' magazine. Attributes: Author -> The Loyal Life of a Criminal Police Captain." } Output:{ "thought": "According to the knowledge provided by CNDBPedia, Liang Jiaqing is an author. The provided knowledge does not mention anything about Liang Jiaqing's weapon, and authors generally do not have weapons. The question cannot be answered based on the provided knowledge or my knowledge.", "answerable": "no" }
As mentioned earlier, besides the three query functions, an entity_link function must also be implemented: given a list of aliases, it finds the corresponding entity in the knowledge graph.
The original paper uses the apple-fruit vs. Apple Inc. example to illustrate why entity linking is needed.
In this paper, entity linking works as follows: first, candidate entities are selected from the KG based on the alias list; then the textual descriptions of these entities are fetched from the knowledge base; finally, all of this information is handed to the LLM, which decides which candidate is the correct one.
One caveat: we cannot simply take the top-ranked candidate as the answer, because the raw candidates returned by external entity-linking and search APIs are unordered, and may not even include the correct entity.
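The LLM-side disambiguation can be sketched as building a prompt that lists every candidate together with its description and asks the LLM to choose; the wording and the build_disambiguation_prompt helper below are illustrative assumptions, not the paper's code.

```python
def build_disambiguation_prompt(mention, candidates):
    """candidates: list of (entity_id, description) pairs fetched from the KB.

    Builds a prompt asking the LLM to pick the candidate that matches the
    mention; the exact wording is an illustrative assumption.
    """
    lines = [f"Which entity does '{mention}' refer to? Candidates:"]
    for i, (entity_id, desc) in enumerate(candidates, 1):
        lines.append(f"{i}. {entity_id}: {desc}")
    lines.append("Answer with the number of the correct candidate.")
    return "\n".join(lines)

prompt = build_disambiguation_prompt(
    "apple",
    [("Apple (fruit)", "an edible fruit of the apple tree"),
     ("Apple Inc.", "an American technology company")],
)
```

Because the candidate list is unordered and possibly incomplete, the prompt could also allow a "none of the above" answer, though that extension is not shown here.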
The paper also experiments with building a personal knowledge base (PKB): from user-specified documents, it extracts knowledge conforming to the KB's representation formats, thereby forming a personal knowledge base.
The extraction from documents is essentially done by prompting the LLM.
The paper mainly uses the following knowledge bases:
In practice, English queries use Wikipedia, Wikidata, and the personalized knowledge base; Chinese queries use CN-DBPedia and the personalized knowledge base.
For the language model, GPT-4 is used by default; the input consists of the prompt instructions, requirements, and in-context examples, and the LLM is required to output JSON. For sentence embeddings, the text-embedding-ada-002 model is used.
This experiment constructs 11 questions from CN-DBPedia, covering single-hop, multi-hop, and other kinds of relation queries. The results are as follows:
We can see that:
NLPCC-100 and NLPCC-MH-59 are used as test sets. NLPCC-100 consists of 100 samples from the test set of the NLPCC2016 KBQA dataset; NLPCC-MH-59 consists of 59 samples from the test set of NLPCC-MH, a multi-hop KBQA dataset.
For both NLPCC-100 and NLPCC-MH-59, the full NLPCC2016 KBQA knowledge base is used in this experiment.
Several modifications were made to KnowledGPT for this dataset and knowledge base; see the original paper for details.
For baselines, KnowledGPT is compared against the following methods:
The metric is average F1. In this dataset each sample has exactly one answer and one prediction, so average F1 is effectively equivalent to accuracy:
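A quick check of why average F1 reduces to accuracy here: with a single gold answer and a single prediction per sample, per-sample F1 is 1 on an exact match and 0 otherwise, so averaging it just counts the fraction of correct answers. The sample pairs below are made up for illustration.

```python
def f1_single(gold, pred):
    """F1 over answer sets; with singleton sets it is 1 iff gold == pred."""
    gold_set, pred_set = {gold}, {pred}
    overlap = len(gold_set & pred_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (gold, prediction) pairs: one correct, one wrong.
samples = [("Fudan University", "Fudan University"), ("Li Bai", "Du Fu")]
avg_f1 = sum(f1_single(g, p) for g, p in samples) / len(samples)
accuracy = sum(g == p for g, p in samples) / len(samples)
```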
From this, the following conclusions can be drawn:
Here KnowledGPT's task is to extract knowledge from the provided documents to build a PKB (personal knowledge base), and the study examines whether KnowledGPT can correctly answer the corresponding questions using the PKB.
Using the HotpotQA dataset, KnowledGPT answers almost all questions correctly; the few wrong answers stem from errors in the knowledge-retrieval or entity-linking steps. This experiment shows that a PKB is genuinely useful as a symbolic memory for the LLM.
The paper then goes further and studies KnowledGPT's knowledge-extraction coverage on 100 documents from HotpotQA, using word recall as the quantitative metric:
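Word recall can be read as the fraction of words in the source document that reappear in the extracted knowledge; the sketch below uses simple lower-cased whitespace tokenization, which is an assumption (the paper's exact tokenization may differ).

```python
def word_recall(document, extracted):
    """Fraction of (lower-cased, whitespace-split) document words that
    also appear in the extracted knowledge. Tokenization is an assumption."""
    doc_words = set(document.lower().split())
    ext_words = set(extracted.lower().split())
    return len(doc_words & ext_words) / len(doc_words) if doc_words else 0.0

# 3 of the 5 distinct document words survive into the extraction here.
r = word_recall("Socrates was a Greek philosopher",
                "Socrates philosopher Greek hoplite")
```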
The results are as follows:
From these, we can observe the following:
The proposed KnowledGPT still has the following limitations:
In summary, KnowledGPT presents a comprehensive framework for integrating LLMs with external knowledge bases, enabling LLMs to both retrieve from and store into them:
KnowledGPT addresses several difficulties inherent in integrating LLMs with knowledge bases, including complex question answering, ambiguity in entity linking, and limited knowledge representation formats. It is a paper well worth studying.