RAPTOR: RECURSIVE ABSTRACTIVE PROCESSING FOR TREE-ORGANIZED RETRIEVAL
Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy.
Large Language Models (LLMs) have emerged as transformative tools showing impressive performance on many tasks. With the growing size of LLMs, they can serve standalone as very effective knowledge stores, with facts encoded within their parameters (Petroni et al, 2019; Jiang et al, 2020; Talmor et al, 2020; Rae et al, 2021; Hoffmann et al, 2022; Chowdhery et al, 2022; Bubeck et al, 2023; Kandpal et al, 2023) and models can be further improved with fine-tuning on downstream tasks (Roberts et al, 2020). Nevertheless, even a large model does not contain sufficient domain-specific knowledge for particular tasks, and the world continues to change, invalidating facts in the LLM. Updating the knowledge of these models through additional fine-tuning or editing is difficult, particularly when dealing with vast text corpora (Lewis et al, 2020; Mitchell et al, 2022). An alternative approach, pioneered in open domain question answering systems (Chen et al, 2017; Yu et al, 2018), is to index large quantities of text, after splitting it into chunks (paragraphs), in a separate information retrieval system. Retrieved information is then presented to the LLM along with the question as context (“retrieval augmentation”, Lewis et al, 2020; Izacard et al, 2022; Min et al, 2023; Ram et al, 2023), making it easy to provide a system with current knowledge particular to some domain and enabling easy interpretability and provenance tracking, whereas the parametric knowledge of LLMs is opaque and difficult to trace back to its source (Akyurek et al, 2022).
Nevertheless, existing retrieval-augmented approaches also have flaws. The one we tackle is that most existing methods retrieve only a few short, contiguous text chunks, which limits their ability to represent and leverage large-scale discourse structure. This is particularly relevant for thematic questions that require integrating knowledge from multiple parts of a text, such as understanding an entire book, as in the NarrativeQA dataset (Kočiský et al, 2018). Consider the fairy tale of Cinderella, and the question “How did Cinderella reach her happy ending?”. The top-k retrieved short contiguous texts will not contain enough context to answer the question.
To address this, we design an indexing and retrieval system that uses a tree structure to capture both high-level and low-level details about a text. As shown in Figure 1, our system, RAPTOR, clusters chunks of text, generates text summaries of those clusters, and then repeats, generating a tree from the bottom up. This structure enables RAPTOR to load chunks representing the text at different levels into an LLM’s context, so that it can effectively and efficiently answer questions at different levels of abstraction.
Our main contribution is the idea of using text summarization to allow retrieval augmentation of context at different scales, and to show its effectiveness in experiments on collections of long documents. Controlled experiments with three language models (UnifiedQA (Khashabi et al, 2020), GPT-3 (Brown et al, 2020) and GPT-4 (OpenAI, 2023)) show that RAPTOR outperforms current retrieval augmentation. Moreover, RAPTOR coupled with GPT-4, and sometimes even with UnifiedQA, gives new state-of-the-art results on three QA tasks: free text response questions on books and movies (NarrativeQA, Kočiský et al 2018), full-text NLP papers (QASPER, Dasigi et al 2021), and multiple-choice questions based on medium-length passages (QuALITY, Pang et al 2022).
Figure 1: Tree construction process: RAPTOR recursively clusters chunks of text based on their vector embeddings and generates text summaries of those clusters, constructing a tree from the bottom up. Nodes clustered together are siblings; a parent node contains the text summary of that cluster.
Why Retrieval?
Recent advances in hardware and algorithms have indeed expanded the context lengths that models can handle, leading to questions about the need for retrieval systems (Dai et al, 2019; Dao et al, 2022; Liu et al, 2023). However, as Liu et al (2023) and Sun et al (2021) have noted, models tend to underutilize long-range context and see diminishing performance as context length increases, especially when pertinent information is embedded within a lengthy context.
Moreover, in practice, using long contexts is expensive and slow. This suggests that selecting the most relevant information for knowledge-intensive tasks is still crucial.
Retrieval Methods
Retrieval-augmented language models (RALMs) have seen improvements in various components: the retriever, the reader, and end-to-end system training. Retrieval methods have transitioned from traditional term-based techniques like TF-IDF (Spärck Jones, 1972) and BM25 (Robertson et al, 1995; Roberts et al, 2020) to deep learning–based strategies (Karpukhin et al, 2020; Khattab & Zaharia, 2020; Sachan et al, 2023). Some recent work proposes using large language models as retrievers due to their ability to memorize extensive knowledge (Yu et al, 2022; Sun et al, 2022). Research on the reader component includes Fusion-in-Decoder (FiD) (Izacard & Grave, 2022), which employs both DPR and BM25 for retrieval and processes passages independently in the encoder, and RETRO (Borgeaud et al, 2022; Wang et al, 2023), which utilizes cross-chunked attention and chunkwise retrieval to generate text grounded in retrieved context.
End-to-end system training work includes Atlas (Izacard et al, 2022), which fine-tunes an encoder-decoder model in conjunction with the retriever; REALM (Guu et al, 2020), a bidirectional, masked LM fine-tuned for open-domain question answering; and RAG (Retrieval-Augmented Generation) (Lewis et al, 2020), which integrates pre-trained sequence-to-sequence models with a neural retriever. Min et al (2021) introduced the Joint Passage Retrieval (JPR) model, which uses a tree-decoding algorithm to handle passage diversity and relevance in multi-answer retrieval. Dense Hierarchical Retrieval (DHR) and Hybrid Hierarchical Retrieval (HHR) represent advancements in retrieval accuracy by combining document- and passage-level retrievals and integrating sparse and dense retrieval methods, respectively (Liu et al, 2021; Arivazhagan et al, 2023).
Despite a diversity in methods, the retrieving components of models predominantly rely on standard approaches, i.e., chunking corpora and encoding with BERT-based retrievers. Although this approach is widely adopted, Nair et al (2023) highlight a potential shortcoming: contiguous segmentation might not capture the complete semantic depth of the text. Snippets extracted from technical or scientific documents may lack important context, making them difficult to read or even misleading (Cohan & Goharian, 2017; Newman et al, 2023; Zhang et al, 2023).
Recursive Summarization as Context
Summarization techniques provide a condensed view of documents, enabling more focused engagement with the content (Angelidis & Lapata, 2018). The summarization/snippet model by Gao et al (2023) uses summarizations and snippets of passages, which improves correctness on most datasets but can sometimes be a lossy means of compression.
The recursive-abstractive summarization model by Wu et al (2021) employs task decomposition to summarize smaller text chunks, which are later integrated to form summaries of larger sections. While this method is effective for capturing broader themes, it can miss granular details. LlamaIndex (Liu, 2022) mitigates this issue by similarly summarizing adjacent text chunks while also retaining intermediate nodes, thus storing varying levels of detail. However, both methods, because they rely on adjacency when grouping or summarizing nodes, may still overlook distant interdependencies within the text, which RAPTOR can find and group.
Overview of RAPTOR
Building on the idea that long texts often present subtopics and hierarchical structures (Cao & Wang, 2022; Dong et al, 2023b), RAPTOR addresses the issue of semantic depth and connection in reading by building a recursive tree structure that balances broader thematic comprehension with granular details, and which allows nodes to be grouped based on semantic similarity, not just order in the text.
Construction of the RAPTOR tree begins with segmenting the retrieval corpus into short, contiguous texts of length 100 tokens, similar to traditional retrieval augmentation techniques. If a sentence would exceed the 100-token limit, we move the entire sentence to the next chunk rather than cutting it mid-sentence. This preserves the contextual and semantic coherence of the text within each chunk. These texts are then embedded using SBERT, a BERT-based encoder (multi-qa-mpnet-base-cos-v1) (Reimers & Gurevych, 2019). The chunks and their corresponding SBERT embeddings form the leaf nodes of our tree structure.
To group similar text chunks, we employ a clustering algorithm. Once clustered, a language model is used to summarize the grouped texts. These summaries are then re-embedded, and the cycle of embedding, clustering, and summarization continues until further clustering becomes infeasible, resulting in a structured, multi-layered tree representation of the original documents. An important aspect of RAPTOR is its computational efficiency: the system scales linearly in both build time and token expenditure, making it suitable for processing large and complex corpora. For a comprehensive discussion of RAPTOR’s scalability, please refer to Appendix A.
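As a minimal sketch of this build loop (our illustration, not the reference implementation; chunk_text, cluster_fn, and summarize_fn are simplified stand-ins for the components described above):

from dataclasses import dataclass, field
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("multi-qa-mpnet-base-cos-v1")  # the SBERT encoder named above

@dataclass
class Node:
    text: str
    embedding: object          # numpy vector from SBERT
    children: list = field(default_factory=list)

def chunk_text(sentences, max_tokens=100):
    # Greedily pack whole sentences into ~100-token chunks; a sentence that would
    # overflow the limit starts the next chunk instead of being cut mid-sentence.
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude token count; a real tokenizer would be used here
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_tree(sentences, cluster_fn, summarize_fn):
    layer = [Node(t, encoder.encode(t)) for t in chunk_text(sentences)]  # leaf nodes
    tree = [layer]
    while len(layer) > 1:
        clusters = cluster_fn(layer)        # soft GMM clustering, sketched later
        if len(clusters) >= len(layer):     # further clustering is infeasible
            break
        new_layer = []
        for group in clusters:
            summary = summarize_fn(" ".join(n.text for n in group))
            new_layer.append(Node(summary, encoder.encode(summary), children=list(group)))
        layer = new_layer
        tree.append(layer)
    return tree  # layers from leaves (tree[0]) to root (tree[-1])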
For querying within this tree, we introduce two distinct strategies: tree traversal and collapsed tree.
The tree traversal method traverses the tree layer-by-layer, pruning and selecting the most relevant nodes at each level. The collapsed tree method evaluates nodes collectively across all layers to find the most relevant ones.
Clustering Algorithm
Clustering plays a key role in building the RAPTOR tree, organizing text segments into cohesive groups. This step groups related content together, which helps the subsequent retrieval process.
One unique aspect of our clustering approach is the use of soft clustering, where nodes can belong to multiple clusters without requiring a fixed number of clusters. This flexibility is essential because individual text segments often contain information relevant to various topics, warranting their inclusion in multiple summaries.
Our clustering algorithm is based on Gaussian Mixture Models (GMMs), an approach that offers both flexibility and a probabilistic framework. GMMs assume that data points are generated from a mixture of several Gaussian distributions.
Given a set of $N$ text segments, each represented as a $d$-dimensional dense vector embedding, the likelihood of a text vector $\mathbf{x}$, given its membership in the $k$-th Gaussian distribution, is denoted by $P(\mathbf{x} \mid k) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. The overall probability distribution is the weighted combination $P(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, where $\pi_k$ signifies the mixture weight for the $k$-th Gaussian distribution.
The high dimensionality of vector embeddings presents a challenge for traditional GMMs, as distance metrics may behave poorly when used to measure similarity in high-dimensional spaces (Aggarwal et al, 2001). To mitigate this, we employ Uniform Manifold Approximation and Projection (UMAP), a manifold learning technique for dimensionality reduction (McInnes et al, 2018). The nearest-neighbor parameter, n_neighbors, in UMAP determines the balance between the preservation of local and global structures. Our algorithm varies n_neighbors to create a hierarchical clustering structure: it first identifies global clusters and then performs local clustering within these global clusters. This two-step clustering process captures a broad spectrum of relationships among the text data, from broad themes to specific details.
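A sketch of this two-step reduction, assuming the umap-learn package; the target dimensionality and the square-root heuristic for the global n_neighbors are illustrative choices, not the paper’s settings:

import umap

def global_reduction(embeddings, dim=10):
    # A larger n_neighbors favors global structure for the first, coarse clustering pass.
    n_neighbors = int(max(2, (len(embeddings) - 1) ** 0.5))  # heuristic, an assumption
    return umap.UMAP(n_neighbors=n_neighbors, n_components=dim,
                     metric="cosine").fit_transform(embeddings)

def local_reduction(embeddings, dim=10, n_neighbors=10):
    # A smaller n_neighbors favors local structure when re-clustering inside a global cluster.
    return umap.UMAP(n_neighbors=n_neighbors, n_components=dim,
                     metric="cosine").fit_transform(embeddings)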
Should a local cluster’s combined context ever exceed the summarization model’s token threshold, our algorithm recursively applies clustering within the cluster, ensuring that the context remains within the token threshold.
To determine the optimal number of clusters, we employ the Bayesian Information Criterion (BIC) for model selection. BIC not only penalizes model complexity but also rewards goodness of fit (Schwarz, 1978). The BIC for a given GMM is $\text{BIC} = k \ln(N) - 2 \ln(\hat{L})$, where $N$ is the number of text segments (or data points), $k$ is the number of model parameters, and $\hat{L}$ is the maximized value of the model’s likelihood function. In the context of GMMs, the number of parameters $k$ is a function of the dimensionality of the input vectors and the number of clusters.
With the optimal number of clusters determined by BIC, the Expectation-Maximization algorithm is then used to estimate the GMM parameters, namely the means, covariances, and mixture weights.
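With scikit-learn, whose GaussianMixture fits means, covariances, and mixture weights via EM and exposes the BIC directly, the selection loop can be sketched as follows (the search range and the 0.1 posterior threshold for soft assignment are illustrative assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

def soft_gmm_clusters(points, max_k=50, threshold=0.1):
    # Pick the number of clusters that minimizes BIC; .fit() runs Expectation-Maximization.
    best, best_bic = None, np.inf
    for k in range(1, min(max_k, len(points)) + 1):
        gm = GaussianMixture(n_components=k, random_state=0).fit(points)
        bic = gm.bic(points)
        if bic < best_bic:
            best, best_bic = gm, bic
    # Soft assignment: a segment joins every cluster whose posterior exceeds the threshold,
    # so one text segment can contribute to multiple summaries.
    probs = best.predict_proba(points)
    return [np.where(probs[:, j] > threshold)[0] for j in range(best.n_components)]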
While the Gaussian assumption in GMMs may not perfectly align with the nature of text data, which often exhibits a sparse and skewed distribution, our empirical observations suggest that it offers an effective model for our purpose. We run an ablation comparing GMM Clustering with summarizing contiguous chunks and provide details in Appendix B.
Model-Based Summarization
After clustering the nodes using Gaussian Mixture Models, the nodes in each cluster are sent to a language model for summarization. This step allows the model to transform large chunks of text into concise, coherent summaries of the selected nodes. For our experiments, we use gpt-3.5-turbo to generate the summaries. The summarization step condenses the potentially large volume of retrieved information into a manageable size. We provide statistics on the compression achieved by summarization in Appendix C and the prompt used for summarization in Appendix D.
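A minimal sketch of this summarization call via the OpenAI API (assumes an API key in the environment); the instruction string below is a placeholder, and the exact prompt is the one listed in Appendix D:

from openai import OpenAI

client = OpenAI()

def summarize(cluster_text: str) -> str:
    # Placeholder instruction; see Appendix D for the actual summarization prompt.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Write a summary of the following, including as many key "
                              f"details as possible:\n\n{cluster_text}"}],
    )
    return response.choices[0].message.content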
While the summarization model generally produces reliable summaries, a focused annotation study revealed that about 4% of the summaries contained minor hallucinations. These did not propagate to parent nodes and had no discernible impact on question-answering tasks. For an in-depth analysis of hallucinations, refer to Appendix E.
Querying
In this section, we elaborate on the two querying mechanisms employed by RAPTOR: tree traversal and collapsed tree. These methods offer unique ways of traversing the multi-layered RAPTOR tree to retrieve relevant information, each with its own advantages and trade-offs. We provide pseudocode for both methods in Appendix F. Note that we embed all nodes using SBERT.
The tree traversal method first selects the top-k most relevant root nodes based on their cosine similarity to the query embedding. The children of these selected nodes are considered at the next layer, and the top-k nodes are selected from this pool, again based on their cosine similarity to the query vector. This process is repeated until we reach the leaf nodes. Finally, the text from all selected nodes is concatenated to form the retrieved context. The algorithm’s steps are outlined below, followed by a code sketch:
1. Start at the root layer of the RAPTOR tree. Compute the cosine similarity between the query embedding and the embeddings of all nodes present at this initial layer.
2. Choose the top-k nodes based on the highest cosine similarity scores, forming the set S1.
3. Proceed to the child nodes of the elements in set S1. Compute the cosine similarity between the query vector and the vector embeddings of these child nodes.
4. Select the top-k child nodes with the highest cosine similarity scores to the query, forming the set S2.
5. Continue this process recursively for d layers, producing sets S1, S2, ..., Sd.
6. Concatenate sets S1 through Sd to assemble the relevant context to the query.
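The steps above can be sketched in code as follows (Node and the layered tree come from the build sketch earlier; this is an illustration, not the released implementation):

import numpy as np

def cosine_similarities(query_emb, nodes):
    # Normalized dot products between the query embedding and each node embedding.
    embs = np.stack([n.embedding for n in nodes])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return embs @ (query_emb / np.linalg.norm(query_emb))

def tree_traversal(tree, query_emb, k=5):
    # Start at the root layer (tree[-1]) and descend, keeping the top-k nodes per layer.
    selected, layer = [], tree[-1]
    while layer:
        sims = cosine_similarities(query_emb, layer)
        top = [layer[i] for i in np.argsort(sims)[::-1][:k]]         # S1, S2, ..., Sd in turn
        selected.extend(top)
        layer = [child for node in top for child in node.children]   # next layer's candidate pool
    return " ".join(node.text for node in selected)                  # concatenated retrieved context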
Figure 2: Illustration of the tree traversal and collapsed tree retrieval mechanisms. Tree traversal starts at the root level of the tree and retrieves the top-k (here, top-1) node(s) based on cosine similarity to the query vector. At each level, it retrieves the top-k node(s) from the child nodes of the previous layer’s top-k. Collapsed tree collapses the tree into a single layer and retrieves nodes until a threshold number of tokens is reached, based on cosine similarity to the query vector. The nodes on which cosine similarity search is performed are highlighted in both illustrations.
By adjusting the depth d and the number of nodes k selected at each layer, the tree traversal method offers control over the specificity and breadth of the information retrieved. The algorithm starts with a broad outlook by considering the top layers of the tree and progressively focuses on finer details as it descends through the lower layers.
The collapsed tree approach offers a simpler way to search for relevant information by considering all nodes in the tree simultaneously, as depicted in Figure 2. Instead of going layer-by-layer, this method flattens the multi-layered tree into a single layer, essentially bringing all the nodes onto the same level for comparison. The steps for this method are outlined below, with a code sketch after the list:
1. First, collapse the entire RAPTOR tree into a single layer. This new set of nodes, denoted as C, contains nodes from every layer of the original tree.
2. Next, calculate the cosine similarity between the query embedding and the embeddings of all nodes present in the collapsed set C.
3. Finally, pick the top-k nodes that have the highest cosine similarity scores with the query, adding nodes to the result set until a predefined maximum number of tokens is reached, so as not to exceed the model’s input limitations.
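A sketch of collapsed-tree retrieval under a token budget, reusing cosine_similarities from the traversal sketch (the whitespace split is a stand-in for a real tokenizer):

import numpy as np

def collapsed_tree(tree, query_emb, max_tokens=2000):
    # Flatten every layer into one candidate set C, rank by cosine similarity,
    # and add whole nodes until the token budget would be exceeded.
    candidates = [node for layer in tree for node in layer]
    order = np.argsort(cosine_similarities(query_emb, candidates))[::-1]
    context, used = [], 0
    for i in order:
        n_tokens = len(candidates[i].text.split())  # crude count; real code would tokenize
        if used + n_tokens > max_tokens:
            break
        context.append(candidates[i].text)
        used += n_tokens
    return " ".join(context)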
We tested both approaches on 20 stories from the QASPER dataset. Figure 3 shows the performance of tree traversal with different top-k values and of the collapsed tree with different maximum token numbers.
The collapsed tree approach consistently performs better. We believe collapsed tree retrieval is better due to offering greater flexibility than tree traversal; i.e., by searching through all the nodes simultaneously, it retrieves information that is at the correct level of granularity for a given question.
In comparison, when using tree traversal with the same values of d and k, the ratio of nodes from each level of the tree will be constant. So, the ratio of higher-order thematic information to granular details will remain the same regardless of the question.
Figure 3: Comparison of querying methods. Results on 20 stories from the QASPER dataset using tree traversal with different top-k values, and collapsed tree with different context lengths. Collapsed tree with 2000 tokens produces the best results, so we use this querying strategy for our main results.
One drawback of the collapsed tree approach, however, is that it requires cosine similarity search to be performed on all nodes in the tree. This can be made more efficient with fast k-nearest-neighbor libraries such as FAISS (Johnson et al, 2019).
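For instance, an exact inner-product index over L2-normalized embeddings (equivalent to cosine similarity) could replace the brute-force scan; a sketch with FAISS:

import faiss
import numpy as np

def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    # Normalize so that inner product equals cosine similarity, then index all node embeddings.
    embs = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

# Querying: sims, ids = index.search(normalized_query.reshape(1, -1), k)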
Overall, given the collapsed tree approach’s greater flexibility and its superior performance on the subset of the QASPER dataset, this is the querying approach with which we proceed.
Specifically, we use the collapsed tree with 2000 maximum tokens, which approximately equates to retrieving the top-20 nodes. Using a token-based approach ensures the context does not exceed model context constraints as token counts can vary across nodes. For experiments with the UnifiedQA model, we provide 400 tokens of context, as UnifiedQA has a max context length of 512 tokens. We provide the same amount of tokens of context to RAPTOR and to the baselines.
Qualitative Study
We conduct a qualitative analysis to understand the benefits of RAPTOR’s retrieval process compared to Dense Passage Retrieval (DPR) methods. Our study focuses on thematic, multi-hop questions using a 1500-word Cinderella fairytale. As illustrated in Figure 4, RAPTOR’s tree-based retrieval allows it to choose nodes from different tree layers, matching the question’s detail level. This approach often yields more relevant and comprehensive information for downstream tasks than DPR. For a detailed discussion and examples, including the text retrieved by both RAPTOR and DPR for specific questions, please refer to Appendix G.
Figure 4: Querying Process: Illustration of how RAPTOR retrieves information for two questions about the Cinderella story: “What is the central theme of the story?” and “How did Cinderella find a happy ending?”. Highlighted nodes indicate RAPTOR’s selections, while arrows point to DPR’s leaf nodes. Notably, RAPTOR’s context often encompasses the information retrieved by DPR, either directly or within higher-layer summaries.
We measure RAPTOR’s performance across three question-answering datasets: NarrativeQA, QASPER, and QuALITY.
NarrativeQA is a dataset that comprises question-answer pairs based on the full texts of books and movie transcripts, totaling 1,572 documents (Kočiský et al, 2018; Wu et al, 2021). The NarrativeQA-Story task requires a comprehensive understanding of the entire narrative in order to accurately answer its questions, thus testing the model’s ability to comprehend longer texts in the literary domain. We measure performance on this dataset using the standard BLEU (B-1, B-4), ROUGE (R-L), and METEOR (M) metrics. Please see Appendix H for more details on the NarrativeQA evaluation script used in our experiments.
The QASPER dataset includes 5,049 questions across 1,585 NLP papers, with each question probing for information embedded within the full text (Dasigi et al, 2021). The answer types in QASPER are categorized as Answerable/Unanswerable, Yes/No, Abstractive, and Extractive. Accuracy is measured using standard F1.
Lastly, the QuALITY dataset consists of multiple-choice questions, each accompanied by context passages averaging approximately 5,000 tokens in length (Pang et al, 2022). This dataset calls for reasoning over the entire document for QA tasks, enabling us to measure the performance of our retrieval system on medium-length documents. The dataset includes a challenging subset, QuALITY-HARD, which contains questions that a majority of human annotators answered incorrectly in a speed setting. We report accuracies for both the entire test set and the HARD subset.
Controlled Baseline Comparisons
We first present controlled comparisons using UnifiedQA 3B as the reader, with SBERT (Reimers & Gurevych, 2019), BM25 (Robertson et al, 1995; 2009), and DPR (Karpukhin et al, 2020) as the embedding models, with and without the RAPTOR tree structure, on three datasets: QASPER, NarrativeQA, and QuALITY. As shown in Tables 1 and 2, our results demonstrate that RAPTOR, when combined with any retriever, consistently outperforms the respective retriever across all datasets. Since RAPTOR with SBERT has the best performance, we use it in all subsequent experiments.
Table 1: NarrativeQA Performance With + Without RAPTOR: Performance comparison of various retrieval methods (SBERT, BM25, DPR) with and without RAPTOR on the NarrativeQA dataset, using UnifiedQA-3B as the language model. RAPTOR outperforms baselines of each respective retrieval method.
Likewise, in the QuALITY dataset as shown in Table 4, RAPTOR achieves an accuracy of 62.4%, which is a 2% and 5.1% improvement over DPR and BM25. Similar trends are observed when UnifiedQA is employed, with RAPTOR outperforming DPR and BM25 by 2.7% and 6.7%, respectively.
Finally, in the NarrativeQA dataset, as presented in Table 6, RAPTOR excels across multiple metrics. For ROUGE-L, it surpasses BM25 and DPR by 7.3 and 2.7 points, respectively. In other metrics like BLEU-1, BLEU-4, and METEOR, RAPTOR outperforms BM25 and DPR by margins ranging from 1.7 to 5.8 and 0.7 to 2.1 points, respectively.
Table 2: QuALITY and QASPER Performance With + Without RAPTOR: Performance comparison across the QuALITY and QASPER datasets of various retrieval methods (SBERT, BM25, DPR) with and without RAPTOR. UnifiedQA-3B is used as the language model. RAPTOR outperforms baselines of each respective retrieval method for both datasets.
Table 3: Controlled comparison of F-1 scores on the QASPER dataset, using three different language models (GPT-3, GPT-4, UnifiedQA 3B) and various retrieval methods. The column “Title + Abstract” reflects performance when only the title and abstract of the papers are used for context. RAPTOR outperforms the established baselines BM25 and DPR across all tested language models. Specifically, RAPTOR’s F-1 scores are at least 1.8 percentage points higher than DPR and at least 5.3 percentage points higher than BM25.
Table 4: Comparison of accuracies on the QuALITY dev dataset for two different language models (GPT-3, UnifiedQA 3B) using various retrieval methods. RAPTOR outperforms the baselines of BM25 and DPR by at least 2.0% in accuracy.
Table 5: Results on F-1 Match scores of various models on the QASPER dataset.
Comparison to State-of-the-art Systems
Building upon our controlled comparisons, we examine RAPTOR’s performance relative to other state-of-the-art models. As shown in Table 5, RAPTOR with GPT-4 sets a new benchmark on QASPER, with a 55.7% F-1 score, surpassing CoLT5 XL’s score of 53.9%.
In the QuALITY dataset, as shown in Table 7, RAPTOR paired with GPT-4 sets a new state-of-the-art with an accuracy of 82.6%, surpassing the previous best result of 62.3%. In particular, it outperforms CoLISA by 21.5% on QuALITY-HARD, which represents questions that humans took unusually long to correctly answer, requiring rereading parts of the text, difficult reasoning, or both.
For the NarrativeQA dataset, as represented in Table 6, RAPTOR paired with UnifiedQA sets a new state-of-the-art METEOR score. When compared to the recursively summarizing model by Wu et al (2021), which also employs UnifiedQA, RAPTOR outperforms it on all metrics. While Wu et al (2021) rely solely on the summary in the top root node of the tree structure, RAPTOR benefits from its intermediate layers and clustering approaches, which allows it to capture a range of information, from general themes to specific details, contributing to its overall strong performance.
Contribution of the Tree Structure
Table 6: Performance comparison on the NarrativeQA dataset across multiple models, focusing on four metrics: ROUGE-L, BLEU-1, BLEU-4, and METEOR. RAPTOR, when paired with UnifiedQA 3B, not only surpasses retrieval methods like BM25 and DPR but also sets a new state-ofthe-art in the METEOR metric.
Table 7: Accuracies of the QuALITY dataset on both the overall test set and the more challenging hard subset. GPT-4 with RAPTOR sets a new state-of-the-art.
Table 8: Performance of RAPTOR when querying different tree layers for Story 1 from the QuALITY dataset. Columns represent different starting points (highest layer) and rows represent different numbers of layers queried.
We examine the contribution of each layer of nodes to RAPTOR’s retrieval capabilities. We hypothesized that upper nodes play a crucial role in handling thematic or multi-hop queries requiring a broader understanding of the text.
We validated this hypothesis both quantitatively and qualitatively. We present qualitative analysis in appendix G. To quantitatively understand the contribution of the upper-level nodes, we used stories from the QuALITY dataset. The RAPTOR tree is built for each of these stories, as described in Section 3. However, during retrieval, we limit the search to different subsets of layers. For example, we exclusively retrieve from the leaf nodes and each upper layer, as well as from different contiguous subsets of the layers. We show findings specific to one story in Table 8, revealing that a full-tree search, utilizing all layers, outperformed retrieval strategies that focused only on specific layers.
These findings highlight the importance of the full tree structure in RAPTOR. By providing both the original text and higher-level summaries for retrieval, RAPTOR can effectively handle a wider range of questions, from higher-order thematic queries to detail-oriented questions. Detailed results for additional stories and an ablation study on layer contributions can be found in Appendix I.
5 Conclusion
In this paper, we have presented RAPTOR, a novel tree-based retrieval system that augments the parametric knowledge of large language models with contextual information at various levels of abstraction. By employing recursive clustering and summarization techniques, RAPTOR creates a hierarchical tree structure that is capable of synthesizing information across various sections of the retrieval corpora. During the query phase, RAPTOR leverages this tree structure for more effective retrieval. Our controlled experiments demonstrated that RAPTOR not only outperforms traditional retrieval methods but also sets new performance benchmarks on several question-answering tasks.
6 Reproducibility Statement
Language Models for QA and Summarization Four language models are used in our RAPTOR experiments: GPT-3 and GPT-4 for QA tasks, and GPT-3.5-turbo for summarization. The gpt-3, gpt-4, and gpt-3.5-turbo models can be accessed via API calls (OpenAI API). UnifiedQA, which is used for QA tasks, is publicly available at Hugging Face.
Evaluation Datasets The three evaluation datasets used in our experiments—QuALITY, QASPER, and NarrativeQA—are all publicly accessible. These datasets ensure that the retrieval and QA tests conducted in this study can be replicated.
A Scalability and Computational Efficiency of the Tree-Building Process
To assess the computational efficiency and cost-effectiveness of RAPTOR’s tree-building process, we conducted experiments on a consumer-grade laptop, specifically an Apple M1 Mac with 16GB of RAM. These experiments aimed to demonstrate the scalability and feasibility of RAPTOR on typical hardware. We varied the context length from 12,500 to 78,000 tokens and measured both the token expenditure and the time required to complete the tree-building process, from initial splitting and embedding to the construction of the final root node.
Figure 5: Token cost as a function of document length for QASPER, NarrativeQA, and QuALITY. RAPTOR tree construction costs scale linearly with document length for each of the datasets.
Token Expenditure
We empirically investigated the relationship between the initial document length and the total number of tokens expended during the tree-building process, which includes both the prompt and completion tokens. The document lengths varied significantly across the three datasets examined: QuALITY, QASPER, and NarrativeQA. Figure 5 illustrates a clear linear correlation between the initial document length and the total token expenditure, emphasizing that RAPTOR maintains a linear token scaling regardless of document complexity or length.
Figure 6: Build time as a function of document length for documents of up to 80,000 tokens. RAPTOR tree construction time scales linearly with document length for each of the datasets.
Build Time
We also empirically observed a consistent linear trend between the document length and the build time, as shown in Figure 6. This suggests that RAPTOR scales linearly in terms of time, making it a viable solution for efficiently processing large corpora of varying lengths.
Conclusion
Overall, our empirical results indicate that RAPTOR scales both in terms of tokens expended and build time. Even as the complexity and volume of the input text grow, the cost of constructing the tree scales predictably and linearly. This demonstrates that RAPTOR is computationally efficient and well-suited for processing large and diverse corpora.
B Ablation Study on RAPTOR’s Clustering Mechanism
To assess the effectiveness of the clustering mechanism in our RAPTOR approach, we conducted an ablation study on the QuALITY dataset. This study compares RAPTOR’s standard clustering method against a balanced tree built by encoding and summarizing contiguous chunks.
B.1 Methodology
Both configurations in this ablation study utilized SBERT embeddings and UnifiedQA to maintain consistency in retrieval. For RAPTOR, we employed our typical clustering and summarization process. In contrast, the alternative setup involved creating a balanced tree by recursively encoding and summarizing contiguous text chunks. We determined the window size for this setup based on the average cluster size observed in RAPTOR, which is approximately 6.7 nodes; hence, we chose a window size of 7 nodes. The collapsed tree approach was applied for retrieval in both models.
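A sketch of this contiguous-window baseline (our illustration; summarize stands in for the same summarizer RAPTOR uses):

def build_balanced_tree(leaf_texts, summarize, window=7):
    # Recursively summarize windows of 7 adjacent nodes instead of GMM clusters,
    # keeping every layer so collapsed-tree retrieval can search all of them.
    layers = [list(leaf_texts)]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        layers.append([summarize(" ".join(prev[i:i + window]))
                       for i in range(0, len(prev), window)])
    return layers  # layers from leaves to root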
B.2 Results and Discussion
The results of the ablation study are presented in Table 9. They clearly indicate an improvement in accuracy when employing RAPTOR’s clustering mechanism over the recency-based tree approach. This finding substantiates our hypothesis that the clustering strategy in RAPTOR is more effective in capturing homogeneous content for summarization, thereby enhancing overall retrieval performance.
Table 9: Ablation study results comparing RAPTOR with a recency-based tree approach.
C Dataset Statistics and Compression Ratios
The average ratio of the summary length to the sum of child node lengths across all datasets is 0.28, indicating a 72% compression rate. On average, the summary length is 131 tokens, and the average child node length is 86 tokens. Below are the detailed statistics for all three datasets:
Table 10: Statistics of Average Summary Length and Child Node Length Across Datasets
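As a rough consistency check on these averages (our own back-of-the-envelope arithmetic, not a statistic reported above), the implied number of children per summarized cluster is

\[
n \approx \frac{131}{0.28 \times 86} \approx 5.4,
\]

which is broadly in line with the average cluster size of approximately 6.7 nodes noted in Appendix B.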
D Summarization Prompt
Table 11 shows the prompt used for summarization.
E Hallucination Analysis
To assess the quality and accuracy of the summarizations within our RAPTOR model, we conducted an analysis focusing on hallucinations in the generated summaries. The summaries were generated by gpt-3.5-turbo and subsequently annotated to quantify the rates of hallucination, to examine whether such inaccuracies propagate to parent nodes, and to evaluate their impact on question-answering (QA) tasks.
E.1 Methodology
We randomly sampled 150 nodes across 40 stories and evaluated them for hallucinations. This sampling strategy provides a broad view of the model’s performance across different contexts. Each node was annotated by hand to determine whether it contained a hallucination.
E.2 Findings
Out of the 150 nodes sampled, 4% (6 nodes) contained some form of hallucination. Most commonly, these hallucinations originated from the model adding minor information, possibly from its training data, that was not present in the text being summarized, or from incorrectly extrapolating some information when creating the summary.
Example:
Text of the child nodes:
“And you will come with me to my people? We may live here among them, and you will be a great warrior–oh, when Jor dies you may even be chief, for there is none so mighty as my warrior...” “But your father will not permit it–Jor, my father, High Chief of the Galus, will not permit it, for like me you are cos-ata-lo. Oh, Co-Tan, if we but could!...” Bradley noticed that she spoke in English–broken English like Co-Tan’s but equally appealing.
Summary found in the parent of that node:
The protagonist, Bradley, is being asked by Co-Tan to stay with her people and become a great warrior, but he refuses and must return to his own country. Tom Billings of Santa Monica arrives and tells them he came to search for a man named Bowen J. Tyler, Jr. Ajor, Co-Tan’s sister, is excited about the possibility of going to Tom’s country to see strange and wonderful things...
The hallucination here is that the summary states that Ajor and Co-Tan are sisters, which the source text does not explicitly mention or imply.
Upon reviewing all parent nodes, we found that hallucinations did not propagate to higher layers.
Generally, the hallucinations were minor and did not alter the thematic interpretation of the text.
E.3 Impact on QA Tasks
In our findings, hallucinations had no discernible impact on the performance of QA tasks. This suggests that hallucination is not a major concern for the summarization component in our RAPTOR architecture.
F Pseudocode for Retrieval Methods
G Qualitative Analysis
To qualitatively examine RAPTOR’s retrieval process, we test it on thematic, multi-hop questions about a 1500-word version of the fairytale Cinderella. We compare the context retrieved by RAPTOR with the context retrieved by Dense Passage Retrieval (DPR). Figure 4 in the main paper details the retrieval process within RAPTOR’s tree structure for two questions. The nodes that RAPTOR selects for each question are highlighted, while the leaf nodes that DPR selects for the same question are indicated with arrows. This comparison illustrates the advantage of RAPTOR’s tree structure.
RAPTOR selects nodes from different layers depending on the level of granularity required by the question at hand. Further, the information that would be retrieved by DPR is more often than not included in the context retrieved by RAPTOR, either directly as a leaf node or indirectly as part of a summary from a higher layer.
Table 12: Relevant excerpts from text retrieved by RAPTOR and DPR for the questions on the fairytale Cinderella.
Question: What is the central theme of the story?
Retrieved by RAPTOR:
Fairy Godmother helps Cinderella attend a ball by transforming her rags... Cinderella impresses the Prince at the ball... she loses track of time and has to run home alone in the darkness. The Prince is unable to find Cinderella and goes in search of her... She forgave her sisters, and treated them always very kindly, and the Prince had great cause to be glad that he had found the glass slipper.
Retrieved by DPR:
Two mice were turned into footmen; four grasshoppers into white horses. Next, the Fairy touched Cinderella’s rags, and they became rich satin robes, trimmed with point lace... Alas! the clock had struck Eleven. She found no coach, only a pumpkin, and the rat and mice ran quickly away when they saw her; while all her fine dress turned to rags, and she had to run home alone... They told her a beautiful Princess had been at the ball, with whom the Prince was delighted. They did not know it was Cinderella.
Question: How does Cinderella find a happy ending?
Retrieved by RAPTOR:
The story of Cinderella involves her godmother, a fairy, who transforms a pumpkin into a grand coach with her wand and allows Cinderella to attend the ball. However, Cinderella must return home before the clock strikes eleven or her dress will turn back into rags... Cinderella impresses the Prince at the ball but leaves before he can find out who she is... The Prince searched for the owner of a lost glass slipper and found it belonged to Cinderella. She forgave her sisters and the Prince was glad to have found her.
Retrieved by DPR:
the clock had struck Eleven... The Prince was very much surprised when he missed Cinderella again, and leaving the ball, went in search of her... Fairy touched Cinderella’s rags, and they became rich satin robes, trimmed with point lace... Her old shoes became a charming pair of glass slippers, which shone like diamonds. “Now go to the ball, my love,” she said, “and enjoy yourself. But remember, you must leave the room before the clock strikes eleven. If you do not your dress will return to its original rags.”
The first question we examine is “How does Cinderella find a happy ending?”, a multi-hop question best answered by synthesizing information from various text segments. To control for the language model’s potential familiarity with the Cinderella story, we instructed it to rely solely on the retrieved information for its answers. Table 13 shows the text retrieved by both RAPTOR and DPR for this question. RAPTOR’s context succinctly describes Cinderella’s journey to happiness, while DPR’s leaf nodes primarily focus on her initial transformation. The difference in retrieved information significantly impacts downstream tasks. When GPT-4 is provided with RAPTOR’s context, it generates a detailed answer: “Cinderella finds a happy ending when the Prince searches for the owner of the lost glass slipper and discovers it belongs to Cinderella. They eventually marry, transforming Cinderella’s life for the better.” In contrast, using DPR’s context, GPT-4 states: “Based on the given context, it is not possible to determine how Cinderella finds a happy ending, as the text lacks information about the story’s conclusion.”
The second question we examine is “What is the central theme of the story?”, a thematic question that requires holistic understanding of the entire text. The text retrieved by RAPTOR and DPR for this question is shown in Table 13. The text retrieved by RAPTOR contains short descriptions of all the major parts of the story, whereas the text retrieved by DPR contains detailed descriptions of a narrow subset of the story. Again, the difference in retrieval mechanisms affects the performance of GPT-4 when answering the question. Given DPR’s context, it outputs “The central theme of the story is transformation and the power of inner beauty, as Cinderella, a kind and humble girl, is magically transformed into a beautiful princess, capturing the attention and admiration of the Prince and others at the ball.” This answer only takes into account the first portion of the story, up until Cinderella first meets the prince. In contrast, given RAPTOR’s context, GPT-4 outputs “The central theme of the story is transformation and overcoming adversity, as Cinderella, with the help of her Fairy Godmother, transforms from a mistreated and downtrodden girl into a beautiful and confident young woman who ultimately finds happiness and love with the Prince.” This is a more complete answer, demonstrating a comprehensive understanding of the story.
This qualitative analysis indicates that RAPTOR outperforms prior retrieval mechanisms because the information that it retrieves is more relevant and exhaustive, allowing for better performance on downstream tasks.
We also created a 2600-word story along with questions about its narrative and theme. An excerpt from the story is present below and the full PDF of this story is linked here. For questions like “What is the central theme of the story?”, an upper-level node is retrieved which includes the sentence: “This story is about the power of human connection... inspiring and uplifting each other as they pursued their passions.” This summary, not explicitly present in the original text, almost directly answers the question.
我们考察的第二个问题是“故事的中心主题是什么?”,这是一个主题问题,需要对全文进行整体理解。RAPTOR和DPR为这个问题检索的文本如表13所示。RAPTOR检索的文本包含故事所有主要部分的简短描述,而DPR检索的文本包含故事的一个狭窄子集的详细描述。同样,检索机制的差异会影响GPT-4在回答问题时的表现。考虑到DPR的背景,它得出“故事的中心主题是转变和内在美的力量,灰姑娘,一个善良而谦逊的女孩,神奇地变成了一个美丽的公主,吸引了王子和其他人的注意和钦佩。”这个答案只考虑了故事的第一部分,直到灰姑娘第一次遇到王子。相比之下,在《猛禽》的背景下,GPT-4的评价是“故事的中心主题是转变和克服逆境,灰姑娘在仙女教母的帮助下,从一个受虐待和受压迫的女孩变成了一个美丽自信的年轻女子,最终与王子找到了幸福和爱情。”这是一个更完整的答案,展示了对故事的全面理解。
Excerpt from “The Eager Writer”:
“Ethan’s passion for writing had always been a part of him. As a child, he would often scribble stories and poems in his notebook, and as he grew older, his love for writing only intensified. His evenings were often spent in the dim light of his room, typing away at his laptop. He had recently taken a job as a content writer for an online marketing firm to pay the bills, but his heart still longed for the world of storytelling. However, like many aspiring writers, he struggled to find a foothold in the industry. He took a job as a content writer for an online marketing firm, but it was growing increasingly evident to him that this was not the path he wanted to pursue. It was during this time that he stumbled upon the Pathways app. The app offered a platform for people in similar professions to connect and share knowledge, and he saw it as an opportunity to finally connect with others who shared his passion for writing. Ethan saw an opportunity to meet others who shared his passion and could offer guidance and mentorship. He quickly signed up and was surprised by the number of writers he found on the platform, from well-established professionals to beginners just starting out in the business.”
H NarrativeQA Evaluation Script
We made several modifications to AllenNLP’s evaluation script to better fit our evaluation needs; a minimal sketch of the combined changes follows the list:
• Added Smoothing:
Smoothing was incorporated to handle cases where the BLEU score is zero because no n-gram matches occur in the reference text. A zero BLEU score skews the results, leading to an overly harsh evaluation of rare or novel phrasings. By adding a smoothing function, we prevent BLEU scores from dropping to zero, providing a fairer evaluation.
• Modified BLEU-4 Weighting:
The original script applied a weight of 1 to the highest-order n-gram (4-gram) and 0 to the rest in its BLEU-4 calculation (i.e., weights=(0, 0, 0, 1)). This approach may overly focus on 4-gram matches while neglecting lower-order matches. To provide a more balanced evaluation, we evenly distributed the weight across all n-gram levels, changing the BLEU-4 weights to (0.25, 0.25, 0.25, 0.25).
• Tokenization before Mapping in METEOR Calculation:
The original script used a simple split-and-map method for the METEOR calculation. We fixed this by first tokenizing the text and then mapping the tokens. This amendment improves the accuracy of the METEOR calculation by respecting the correct linguistic boundaries of words.
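For concreteness, the combined changes can be sketched with NLTK in place of AllenNLP’s internal utilities; the helper names below are our own illustration, not the authors’ released script:

# Minimal sketch of the three changes using NLTK (an assumption; the
# authors modified AllenNLP's script, not this code).
# Requires the NLTK data packages 'punkt' and 'wordnet'
# (nltk.download('punkt'); nltk.download('wordnet')).
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

def bleu4(reference, hypothesis):
    ref, hyp = word_tokenize(reference), word_tokenize(hypothesis)
    return sentence_bleu(
        [ref],
        hyp,
        # Balanced weights over 1- to 4-grams instead of (0, 0, 0, 1).
        weights=(0.25, 0.25, 0.25, 0.25),
        # Smoothing keeps the score above zero when some n-gram order
        # has no match in the reference.
        smoothing_function=SmoothingFunction().method1,
    )

def meteor(reference, hypothesis):
    # Tokenize first, then score, so METEOR aligns real word boundaries
    # rather than whitespace-split fragments.
    return meteor_score([word_tokenize(reference)], word_tokenize(hypothesis))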
Table 13: Relevant excerpts from the text retrieved by RAPTOR and DPR for the questions on the fairy tale Cinderella
Question: What is the central theme of the story?
Retrieved by RAPTOR:
Fairy Godmother helps Cinderella attend a ball by transforming her rags. . . Cinderella impresses the Prince at the ball. . . she loses track of time and has to run home alone in the darkness. The Prince is unable to find Cinderella and goes in search of her . . . She forgave her sisters, and treated them always very kindly, and the Prince had great cause to be glad that he had found the glass slipper.
Retrieved by DPR:
Two mice were turned into footmen; four grasshoppers into white horses. Next, the Fairy touched Cinderella’s rags, and they became rich satin robes, trimmed with point lace. . . . Alas! the clock had struck Eleven. She found no coach, only a pumpkin, and the rat and mice ran quickly away when they saw her; while all her fine dress turned to rags, and she had to run home alone. . . They told her a beautiful Princess had been at the ball, with whom the Prince was delighted. They did not know it was Cinderella.
Question: How does Cinderella find a happy ending?
Retrieved by RAPTOR:
The story of Cinderella involves her godmother, a fairy, who transforms a pumpkin into a grand coach with her wand and allows Cinderella to attend the ball. However, Cinderella must return home before the clock strikes eleven or her dress will turn back into rags. . . Cinderella impresses the Prince at the ball but leaves before he can find out who she is. . . The Prince searched for the owner of a lost glass slipper and found it belonged to Cinderella. She forgave her sisters and the Prince was glad to have found her.
Retrieved by DPR:
the clock had struck Eleven. . . The Prince was very much surprised when he missed Cinderella again, and leaving the ball, went in search of her. . . Fairy touched Cinderella’s rags, and they became rich satin robes, trimmed with point lace... Her old shoes became a charming pair of glass slippers, which shone like diamonds. “Now go to the ball, my love,” she said, “and enjoy yourself. But remember, you must leave the room before the clock strikes eleven. If you do not your dress will return to its original rags.”
I Analysis of Different Layers on RAPTOR’s Performance
I.1 How do different layers affect performance?
In this section, we present a detailed breakdown of RAPTOR’s retrieval performance when querying different layers of the hierarchical tree structure for various stories. These tables validate the utility of RAPTOR’s multi-layered structure for diverse query requirements.
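The per-layer numbers in Tables 14-17 come from restricting retrieval to a single level of the tree. A minimal sketch of such layer-restricted querying, assuming nodes are stored as (text, embedding, layer) triples with unit-normalized embeddings (our own illustrative layout, not the released implementation):

import numpy as np

def query_layer(query_emb, nodes, layer, top_k=5):
    # nodes: list of (text, embedding, layer) triples; layer 0 holds the
    # leaf chunks, higher layers hold progressively coarser summaries.
    candidates = [n for n in nodes if n[2] == layer]
    # Embeddings are assumed unit-normalized, so the dot product is
    # cosine similarity.
    candidates.sort(key=lambda n: float(query_emb @ n[1]), reverse=True)
    return candidates[:top_k]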
Table 14: Performance of RAPTOR when querying different layers of the tree for Story 2.
Figure 7: Histogram showing the percentage of nodes retrieved from different layers of the RAPTOR tree across three datasets (NarrativeQA, QuALITY, and QASPER) using three retrievers (SBERT, BM25, and DPR). The data indicate that a substantial portion of the nodes contributing to the final retrieval comes from non-leaf layers, with a notable percentage from the first and second layers, highlighting the importance of RAPTOR’s hierarchical summarization in the retrieval process.
Table 15: Performance of RAPTOR when querying different layers of the tree for Story 3.
Table 16: Performance of RAPTOR when querying different layers of the tree for Story 4.
Table 17: Performance of RAPTOR when querying different layers of the tree for Story 5.
I.2 Which layers do the retrieved nodes come from?
We further conduct an ablation study across all three datasets and all three retrievers, using RAPTOR with collapsed tree retrieval, to examine the layers from which the retrieved nodes originate. We observe that between 18.5% and 57% of the retrieved nodes come from non-leaf nodes.
As illustrated in Figure 7, the retrieval pattern across layers reveals the importance of RAPTOR’s multi-layered tree structure. Notably, a significant percentage of the nodes retrieved by RAPTOR using the DPR retriever for the NarrativeQA dataset come from the first and second layers of the tree, as opposed to the leaf nodes. This pattern is consistent across the other datasets and retrievers, albeit with varying percentages.
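As a sketch of how these statistics can be gathered: collapsed tree retrieval flattens the tree and scores every node, leaf or summary, against the query, and tallying the layer of each retrieved node yields the percentages reported below. The (text, embedding, layer) layout is again our own illustrative assumption:

import numpy as np

def collapsed_tree_retrieve(query_emb, nodes, top_k=10):
    # Collapsed tree retrieval: ignore the hierarchy and rank all nodes,
    # from leaf chunks to top-level summaries, by cosine similarity
    # (embeddings assumed unit-normalized).
    sims = np.array([float(query_emb @ emb) for _, emb, _ in nodes])
    top = np.argsort(sims)[::-1][:top_k]
    return [nodes[i] for i in top]

def layer_percentages(retrieved):
    # Tally which layer each retrieved node came from (cf. Figure 7 and
    # Tables 18-21).
    counts = {}
    for _, _, layer in retrieved:
        counts[layer] = counts.get(layer, 0) + 1
    return {layer: 100.0 * c / len(retrieved) for layer, c in counts.items()}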
Table 18: Percentage of nodes from non-leaf nodes across different datasets and retrievers
Table 19: Percentage of nodes from different layers with DPR as the retriever
Table 20: Percentage of nodes from different layers with SBERT as the retriever
Table 21: Percentage of nodes from different layers with BM25 as the retriever