This article is reposted and adapted from:
https://python.langchain.com.cn/docs/modules/data_connection/document_transformers/
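Once you have loaded documents, you will often want to transform them to better suit your application.
The simplest example is splitting a long document into smaller chunks that fit into your model's context window.
LangChain provides a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.
When you want to work with long pieces of text, it is necessary to split that text into chunks.
As simple as this sounds, there is a lot of potential complexity here.
Ideally, you want to keep semantically related pieces of text together.
What "semantically related" means can depend on the type of text. This notebook demonstrates several ways of doing that.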
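At a high level, text splitters work as follows:
1. Split the text into small, semantically meaningful pieces (often sentences).
2. Combine these small pieces into a larger chunk until it reaches a certain size (as measured by some function).
3. Once that size is reached, make the chunk its own piece of text and start a new chunk with some overlap, to keep context between chunks.
This means there are two different axes along which you can customize your text splitter:
1. How the text is split.
2. How the chunk size is measured.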
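The split-then-merge workflow and its two customization axes can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the function name `split_then_merge` and the sentence-splitting regex are assumptions made for the example, and a single sentence longer than `chunk_size` is simply kept whole.

```python
import re
from typing import Callable, List


def split_then_merge(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Split text into sentences, then greedily merge them into chunks."""
    # Axis 1: how the text is split (here: after sentence-ending punctuation).
    pieces = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        candidate = " ".join(current + [piece])
        # Axis 2: how the chunk size is measured (any callable returning an int).
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing pieces into the next chunk, up to chunk_overlap.
            overlap: List[str] = []
            for prev in reversed(current):
                if length_function(" ".join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Drop overlap that would push the next chunk over chunk_size.
            while overlap and length_function(" ".join(overlap + [piece])) > chunk_size:
                overlap.pop(0)
            current = overlap
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One. Two. Three. Four."
# Measure size in characters.
print(split_then_merge(sample, chunk_size=12, chunk_overlap=5))
# Measure size in whitespace-separated words instead.
print(split_then_merge(sample, 2, 1, length_function=lambda s: len(s.split())))
```

Swapping the regex changes how the text is split; swapping `length_function` changes how size is measured, for example from characters to words or tokens.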
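The recommended default text splitter is RecursiveCharacterTextSplitter.
This text splitter takes a list of characters.
It tries to create chunks by splitting on the first character, but if any chunk is too large, it moves on to the next character, and so on.
By default, the characters it tries to split on are ["\n\n", "\n", " ", ""]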
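The fallback from one separator to the next can be illustrated with a short recursive function. This is a simplified sketch under stated assumptions, not the library's code: `recursive_split` is a hypothetical name, length is measured with `len`, separators are discarded, and small pieces are not merged back together or overlapped.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split on the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; keep the oversized piece
    separator, rest = separators[0], separators[1:]
    if separator == "":
        # Last resort in this sketch: cut into fixed-size character slices.
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        pieces = text.split(separator)

    chunks: List[str] = []
    for piece in pieces:
        if not piece:
            continue  # skip empty strings produced by adjacent separators
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too large: try again with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee\nfff", chunk_size=7))
# ['aaa bbb', 'ccc', 'ddd', 'eee', 'fff']
```

The first paragraph fits and is kept whole, while the second is split on "\n" and then on " " because it is still too large after each coarser split.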
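Besides controlling which characters to split on, you can also control a few other things:
length_function: how the length of a chunk is calculated. The default just counts characters, but it is common to pass a token counter here.
chunk_size: the maximum size of a chunk (as measured by the length function).
chunk_overlap: the maximum overlap between chunks. Some overlap can help maintain continuity between chunks (for example, a sliding window).
add_start_index: whether to include each chunk's starting position within the original document in the metadata.
Load a long text: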
with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}
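The start_index values in the output above can be reproduced by searching for each chunk in the original text. The helper below is a sketch of that idea, with `find_start_indexes` as a hypothetical name; it is not presented as how the library computes the value.

```python
from typing import List


def find_start_indexes(text: str, chunks: List[str]) -> List[int]:
    """Locate each chunk in text, searching forward from the previous match."""
    indexes: List[int] = []
    search_from = 0
    for chunk in chunks:
        start = text.find(chunk, search_from)
        if start == -1:
            raise ValueError(f"chunk not found in text: {chunk!r}")
        indexes.append(start)
        # Overlapping chunks begin before the previous chunk ends,
        # so only advance past the previous chunk's start.
        search_from = start + 1
    return indexes


text = (
    "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. "
    "Members of Congress and the Cabinet. Justices of the Supreme Court. "
    "My fellow Americans."
)
chunks = [
    "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. "
    "Members of Congress and",
    "of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.",
]
print(find_start_indexes(text, chunks))  # [0, 82]
```

The second chunk starts at 82 rather than at the end of the first chunk because the two chunks share the overlapping text "of Congress and".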
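CharacterTextSplitter is the simplest method. It splits on a single separator (by default "\n\n") and measures chunk length by the number of characters.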
# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
...
He met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0
The following example passes metadata along with the documents. Note that the metadata is split along with the documents.
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
...
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0
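To make the metadata behavior concrete, the sketch below splits several texts and attaches a copy of each text's metadata to every chunk produced from it. The `Chunk` class and the `create_chunks` function are hypothetical stand-ins written for this illustration, not LangChain classes, and the splitting is a plain `str.split` with no size limit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def create_chunks(
    texts: List[str],
    metadatas: Optional[List[dict]] = None,
    separator: str = "\n\n",
) -> List[Chunk]:
    """Split each text and attach a copy of that text's metadata to every chunk."""
    metadatas = metadatas or [{} for _ in texts]
    if len(metadatas) != len(texts):
        raise ValueError("metadatas must have one entry per text")
    result: List[Chunk] = []
    for text, metadata in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece.strip():
                # Copy so that editing one chunk's metadata does not affect its siblings.
                result.append(Chunk(piece.strip(), dict(metadata)))
    return result


docs = create_chunks(["a\n\nb", "c"], metadatas=[{"document": 1}, {"document": 2}])
print([(d.page_content, d.metadata) for d in docs])
# [('a', {'document': 1}), ('b', {'document': 1}), ('c', {'document': 2})]
```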
text_splitter.split_text(state_of_the_union)[0]
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
...
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
CodeTextSplitter lets you split code, with support for multiple languages.
Import the Language enum and specify the language.
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)
# Full list of supported languages
[e.value for e in Language]
['cpp',
'go',
'java',
'js',
'php',
'proto',
'python',
'rst',
'ruby',
'rust',
'scala',
'swift',
'markdown',
'latex',
'html',
'sol',]
You can also see the separators used for a given language:
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
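Language-specific separators such as '\nclass ' and '\ndef ' work best when the keyword stays attached to the code that follows it. One way to get that effect is a regex lookahead, which splits in front of a separator without consuming it. The sketch below is an illustration only; `split_keeping_separator` is a hypothetical helper, and nothing here is claimed about how the library handles separators internally.

```python
import re
from typing import List


def split_keeping_separator(text: str, separators: List[str]) -> List[str]:
    """Split in front of any separator, keeping the separator with its piece."""
    # A lookahead matches the position before a separator without consuming it.
    pattern = "(?=" + "|".join(re.escape(s) for s in separators) + ")"
    pieces = re.split(pattern, text)
    # Trim surrounding newlines and drop pieces that are only whitespace.
    return [piece.strip("\n") for piece in pieces if piece.strip()]


code = "import os\n\nclass A:\n    pass\n\ndef f():\n    return 1\n"
for piece in split_keeping_separator(code, ["\nclass ", "\ndef "]):
    print(repr(piece))
# 'import os'
# 'class A:\n    pass'
# 'def f():\n    return 1'
```

Each class or function definition ends up in its own piece, which is the grouping the Python separators listed above are meant to encourage.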
Here is an example using the Python text splitter:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
[Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
Document(page_content='# Call the function\nhello_world()', metadata={})]
Here is an example using the JS text splitter:
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}', metadata={}),
Document(page_content='// Call the function\nhelloWorld();', metadata={})]
Here is an example using the Markdown text splitter:
markdown_text = """