
LangChain Tutorial | How to Use LangChain Document Loaders | The Document Loaders Collection


Tip:

        The official documentation covers many more built-in document loaders and third-party integrations, including a Bilibili loader, blockchain loaders, AssemblyAI audio transcripts, a Datadog logs loader, and more.

        This article collects and explains the loaders used in day-to-day work, which is enough for most AI development tasks: the CSV loader, text loader, Word loader, HTML loader, PDF loaders, file directory loader, JSON loader, and so on.

Overview

        Use document loaders to load data from a source as a Document. A Document is a piece of text plus associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text content of any web page, and even for loading the transcript of a YouTube video.

        Document loaders provide a load method for loading data as Documents from a configured source. They can optionally implement lazy_load, which loads data into memory lazily.
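        The load/lazy_load split can be sketched with a minimal, hypothetical loader written in plain Python (the ListLoader and Document classes below are illustrative stand-ins, not LangChain's actual implementation): load returns a full list, while lazy_load yields one Document at a time.

```python
from dataclasses import dataclass, field

# Minimal stand-in for LangChain's Document class (illustration only)
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

class ListLoader:
    """A toy loader that turns a list of strings into Documents."""
    def __init__(self, texts):
        self.texts = texts

    def lazy_load(self):
        # Yield Documents one by one instead of materializing them all
        for i, text in enumerate(self.texts):
            yield Document(page_content=text, metadata={"index": i})

    def load(self):
        # load() is just the eager version of lazy_load()
        return list(self.lazy_load())

docs = ListLoader(["hello", "world"]).load()
print(len(docs), docs[0].page_content)  # → 2 hello
```

        The real loaders follow the same contract, which is why lazy_load is the better choice when a source is large and you want to process Documents as they stream in.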

1. CSV Loader

        A CSV (comma-separated values) file is a delimited text file that uses commas to separate values. Each line of the file is a data record, and each record consists of one or more fields separated by commas.

        The CSV loader creates one Document per row of the CSV file.

    from langchain_community.document_loaders.csv_loader import CSVLoader

    loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
    data = loader.load()
    print(data)

        Output:

    [Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 0}, lookup_index=0), Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 1}, lookup_index=0), Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 2}, lookup_index=0), Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 3}, lookup_index=0), Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 4}, lookup_index=0), Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 5}, lookup_index=0), Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 6}, lookup_index=0), Document(page_content='Team: Orioles\n"Payroll (millions)": 81.43\n"Wins": 93', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 7}, lookup_index=0), Document(page_content='Team: Rays\n"Payroll (millions)": 64.17\n"Wins": 90', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 8}, lookup_index=0), Document(page_content='Team: Angels\n"Payroll (millions)": 154.49\n"Wins": 89', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 9}, lookup_index=0), Document(page_content='Team: Tigers\n"Payroll (millions)": 132.30\n"Wins": 88', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 10}, lookup_index=0), 
Document(page_content='Team: Cardinals\n"Payroll (millions)": 110.30\n"Wins": 88', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 11}, lookup_index=0), Document(page_content='Team: Dodgers\n"Payroll (millions)": 95.14\n"Wins": 86', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 12}, lookup_index=0), Document(page_content='Team: White Sox\n"Payroll (millions)": 96.92\n"Wins": 85', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 13}, lookup_index=0), Document(page_content='Team: Brewers\n"Payroll (millions)": 97.65\n"Wins": 83', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 14}, lookup_index=0), Document(page_content='Team: Phillies\n"Payroll (millions)": 174.54\n"Wins": 81', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 15}, lookup_index=0), Document(page_content='Team: Diamondbacks\n"Payroll (millions)": 74.28\n"Wins": 81', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 16}, lookup_index=0), Document(page_content='Team: Pirates\n"Payroll (millions)": 63.43\n"Wins": 79', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 17}, lookup_index=0), Document(page_content='Team: Padres\n"Payroll (millions)": 55.24\n"Wins": 76', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 18}, lookup_index=0), Document(page_content='Team: Mariners\n"Payroll (millions)": 81.97\n"Wins": 75', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 19}, lookup_index=0), Document(page_content='Team: Mets\n"Payroll (millions)": 93.35\n"Wins": 74', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 20}, lookup_index=0), Document(page_content='Team: Blue Jays\n"Payroll (millions)": 75.48\n"Wins": 73', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 21}, 
lookup_index=0), Document(page_content='Team: Royals\n"Payroll (millions)": 60.91\n"Wins": 72', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 22}, lookup_index=0), Document(page_content='Team: Marlins\n"Payroll (millions)": 118.07\n"Wins": 69', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 23}, lookup_index=0), Document(page_content='Team: Red Sox\n"Payroll (millions)": 173.18\n"Wins": 69', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 24}, lookup_index=0), Document(page_content='Team: Indians\n"Payroll (millions)": 78.43\n"Wins": 68', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 25}, lookup_index=0), Document(page_content='Team: Twins\n"Payroll (millions)": 94.08\n"Wins": 66', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 26}, lookup_index=0), Document(page_content='Team: Rockies\n"Payroll (millions)": 78.06\n"Wins": 64', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 27}, lookup_index=0), Document(page_content='Team: Cubs\n"Payroll (millions)": 88.19\n"Wins": 61', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 28}, lookup_index=0), Document(page_content='Team: Astros\n"Payroll (millions)": 60.65\n"Wins": 55', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 29}, lookup_index=0)]

① Customizing CSV parsing and loading

        See the csv module documentation for more information on which csv arguments are supported.
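        The csv_args dict is forwarded to Python's built-in csv.DictReader, so you can preview how a given combination of arguments behaves using the stdlib alone (the file content below is a made-up two-line sample):

```python
import csv
import io

# A tiny in-memory CSV with a quoted header field
raw = 'Team,"Payroll (millions)",Wins\nNationals,81.34,98\n'

# delimiter and quotechar here play the same role as in csv_args
reader = csv.DictReader(io.StringIO(raw), delimiter=',', quotechar='"')
rows = list(reader)
print(rows[0])  # {'Team': 'Nationals', 'Payroll (millions)': '81.34', 'Wins': '98'}
```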

        The comments in the code below explain a few of the most commonly used parameters:

    loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv', csv_args={
        # delimiter: a one-character string used to separate fields; defaults to ','
        'delimiter': ',',
        # quotechar: a one-character string used to quote fields containing special
        # characters (such as the delimiter or the quotechar) or newlines; defaults to '"'
        'quotechar': '"',
        # fieldnames: if not passed when the object is created, this attribute is
        # initialized on first access or when the first record is read from the file
        'fieldnames': ['MLB Team', 'Payroll in millions', 'Wins']
    })
    data = loader.load()
    print(data)

        Output:

    [Document(page_content='MLB Team: Team\nPayroll in millions: "Payroll (millions)"\nWins: "Wins"', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 0}, lookup_index=0), Document(page_content='MLB Team: Nationals\nPayroll in millions: 81.34\nWins: 98', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 1}, lookup_index=0), Document(page_content='MLB Team: Reds\nPayroll in millions: 82.20\nWins: 97', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 2}, lookup_index=0), Document(page_content='MLB Team: Yankees\nPayroll in millions: 197.96\nWins: 95', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 3}, lookup_index=0), Document(page_content='MLB Team: Giants\nPayroll in millions: 117.62\nWins: 94', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 4}, lookup_index=0), Document(page_content='MLB Team: Braves\nPayroll in millions: 83.31\nWins: 94', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 5}, lookup_index=0), Document(page_content='MLB Team: Athletics\nPayroll in millions: 55.37\nWins: 94', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 6}, lookup_index=0), Document(page_content='MLB Team: Rangers\nPayroll in millions: 120.51\nWins: 93', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 7}, lookup_index=0), Document(page_content='MLB Team: Orioles\nPayroll in millions: 81.43\nWins: 93', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 8}, lookup_index=0), Document(page_content='MLB Team: Rays\nPayroll in millions: 64.17\nWins: 90', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 9}, lookup_index=0), Document(page_content='MLB Team: Angels\nPayroll in millions: 154.49\nWins: 89', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 10}, 
lookup_index=0), Document(page_content='MLB Team: Tigers\nPayroll in millions: 132.30\nWins: 88', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 11}, lookup_index=0), Document(page_content='MLB Team: Cardinals\nPayroll in millions: 110.30\nWins: 88', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 12}, lookup_index=0), Document(page_content='MLB Team: Dodgers\nPayroll in millions: 95.14\nWins: 86', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 13}, lookup_index=0), Document(page_content='MLB Team: White Sox\nPayroll in millions: 96.92\nWins: 85', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 14}, lookup_index=0), Document(page_content='MLB Team: Brewers\nPayroll in millions: 97.65\nWins: 83', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 15}, lookup_index=0), Document(page_content='MLB Team: Phillies\nPayroll in millions: 174.54\nWins: 81', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 16}, lookup_index=0), Document(page_content='MLB Team: Diamondbacks\nPayroll in millions: 74.28\nWins: 81', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 17}, lookup_index=0), Document(page_content='MLB Team: Pirates\nPayroll in millions: 63.43\nWins: 79', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 18}, lookup_index=0), Document(page_content='MLB Team: Padres\nPayroll in millions: 55.24\nWins: 76', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 19}, lookup_index=0), Document(page_content='MLB Team: Mariners\nPayroll in millions: 81.97\nWins: 75', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 20}, lookup_index=0), Document(page_content='MLB Team: Mets\nPayroll in millions: 93.35\nWins: 74', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 
'row': 21}, lookup_index=0), Document(page_content='MLB Team: Blue Jays\nPayroll in millions: 75.48\nWins: 73', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 22}, lookup_index=0), Document(page_content='MLB Team: Royals\nPayroll in millions: 60.91\nWins: 72', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 23}, lookup_index=0), Document(page_content='MLB Team: Marlins\nPayroll in millions: 118.07\nWins: 69', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 24}, lookup_index=0), Document(page_content='MLB Team: Red Sox\nPayroll in millions: 173.18\nWins: 69', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 25}, lookup_index=0), Document(page_content='MLB Team: Indians\nPayroll in millions: 78.43\nWins: 68', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 26}, lookup_index=0), Document(page_content='MLB Team: Twins\nPayroll in millions: 94.08\nWins: 66', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 27}, lookup_index=0), Document(page_content='MLB Team: Rockies\nPayroll in millions: 78.06\nWins: 64', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 28}, lookup_index=0), Document(page_content='MLB Team: Cubs\nPayroll in millions: 88.19\nWins: 61', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 29}, lookup_index=0), Document(page_content='MLB Team: Astros\nPayroll in millions: 60.65\nWins: 55', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 30}, lookup_index=0)]

② Specifying a column to identify the document source

        Use the source_column argument to specify a source for the Document created from each row. Otherwise file_path is used as the source of all Documents created from the CSV file.

        This is useful when the Documents loaded from a CSV file are used in a chain that answers questions based on sources.

    loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv', source_column="Team")
    data = loader.load()
    print(data)

        Output:
    [Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', lookup_str='', metadata={'source': 'Nationals', 'row': 0}, lookup_index=0), Document(page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97', lookup_str='', metadata={'source': 'Reds', 'row': 1}, lookup_index=0), Document(page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95', lookup_str='', metadata={'source': 'Yankees', 'row': 2}, lookup_index=0), Document(page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94', lookup_str='', metadata={'source': 'Giants', 'row': 3}, lookup_index=0), Document(page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94', lookup_str='', metadata={'source': 'Braves', 'row': 4}, lookup_index=0), Document(page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94', lookup_str='', metadata={'source': 'Athletics', 'row': 5}, lookup_index=0), Document(page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93', lookup_str='', metadata={'source': 'Rangers', 'row': 6}, lookup_index=0), Document(page_content='Team: Orioles\n"Payroll (millions)": 81.43\n"Wins": 93', lookup_str='', metadata={'source': 'Orioles', 'row': 7}, lookup_index=0), Document(page_content='Team: Rays\n"Payroll (millions)": 64.17\n"Wins": 90', lookup_str='', metadata={'source': 'Rays', 'row': 8}, lookup_index=0), Document(page_content='Team: Angels\n"Payroll (millions)": 154.49\n"Wins": 89', lookup_str='', metadata={'source': 'Angels', 'row': 9}, lookup_index=0), Document(page_content='Team: Tigers\n"Payroll (millions)": 132.30\n"Wins": 88', lookup_str='', metadata={'source': 'Tigers', 'row': 10}, lookup_index=0), Document(page_content='Team: Cardinals\n"Payroll (millions)": 110.30\n"Wins": 88', lookup_str='', metadata={'source': 'Cardinals', 'row': 11}, lookup_index=0), Document(page_content='Team: Dodgers\n"Payroll (millions)": 95.14\n"Wins": 86', lookup_str='', metadata={'source': 'Dodgers', 'row': 12}, 
lookup_index=0), Document(page_content='Team: White Sox\n"Payroll (millions)": 96.92\n"Wins": 85', lookup_str='', metadata={'source': 'White Sox', 'row': 13}, lookup_index=0), Document(page_content='Team: Brewers\n"Payroll (millions)": 97.65\n"Wins": 83', lookup_str='', metadata={'source': 'Brewers', 'row': 14}, lookup_index=0), Document(page_content='Team: Phillies\n"Payroll (millions)": 174.54\n"Wins": 81', lookup_str='', metadata={'source': 'Phillies', 'row': 15}, lookup_index=0), Document(page_content='Team: Diamondbacks\n"Payroll (millions)": 74.28\n"Wins": 81', lookup_str='', metadata={'source': 'Diamondbacks', 'row': 16}, lookup_index=0), Document(page_content='Team: Pirates\n"Payroll (millions)": 63.43\n"Wins": 79', lookup_str='', metadata={'source': 'Pirates', 'row': 17}, lookup_index=0), Document(page_content='Team: Padres\n"Payroll (millions)": 55.24\n"Wins": 76', lookup_str='', metadata={'source': 'Padres', 'row': 18}, lookup_index=0), Document(page_content='Team: Mariners\n"Payroll (millions)": 81.97\n"Wins": 75', lookup_str='', metadata={'source': 'Mariners', 'row': 19}, lookup_index=0), Document(page_content='Team: Mets\n"Payroll (millions)": 93.35\n"Wins": 74', lookup_str='', metadata={'source': 'Mets', 'row': 20}, lookup_index=0), Document(page_content='Team: Blue Jays\n"Payroll (millions)": 75.48\n"Wins": 73', lookup_str='', metadata={'source': 'Blue Jays', 'row': 21}, lookup_index=0), Document(page_content='Team: Royals\n"Payroll (millions)": 60.91\n"Wins": 72', lookup_str='', metadata={'source': 'Royals', 'row': 22}, lookup_index=0), Document(page_content='Team: Marlins\n"Payroll (millions)": 118.07\n"Wins": 69', lookup_str='', metadata={'source': 'Marlins', 'row': 23}, lookup_index=0), Document(page_content='Team: Red Sox\n"Payroll (millions)": 173.18\n"Wins": 69', lookup_str='', metadata={'source': 'Red Sox', 'row': 24}, lookup_index=0), Document(page_content='Team: Indians\n"Payroll (millions)": 78.43\n"Wins": 68', lookup_str='', 
metadata={'source': 'Indians', 'row': 25}, lookup_index=0), Document(page_content='Team: Twins\n"Payroll (millions)": 94.08\n"Wins": 66', lookup_str='', metadata={'source': 'Twins', 'row': 26}, lookup_index=0), Document(page_content='Team: Rockies\n"Payroll (millions)": 78.06\n"Wins": 64', lookup_str='', metadata={'source': 'Rockies', 'row': 27}, lookup_index=0), Document(page_content='Team: Cubs\n"Payroll (millions)": 88.19\n"Wins": 61', lookup_str='', metadata={'source': 'Cubs', 'row': 28}, lookup_index=0), Document(page_content='Team: Astros\n"Payroll (millions)": 60.65\n"Wins": 55', lookup_str='', metadata={'source': 'Astros', 'row': 29}, lookup_index=0)]

2. File Directory Loader

        This section covers how to load all the documents in a directory.

        By default, it uses the UnstructuredLoader under the hood.

from langchain_community.document_loaders import DirectoryLoader

        We can use the glob parameter to control which files to load. Note that here it does not load .rst or .html files.

    loader = DirectoryLoader('../', glob="**/*.md")
    docs = loader.load()
    print(len(docs))

        Output:

    1
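        DirectoryLoader's glob filtering behaves like shell-style filename patterns. A stdlib sketch of which of a few hypothetical paths a "*.md" pattern selects (note that fnmatch's "*" also crosses directory separators, so "*.md" here plays the role of "**/*.md"):

```python
import fnmatch

# Hypothetical relative paths inside the target directory
files = ["README.md", "docs/guide.md", "docs/index.html", "notes.rst"]

# Keep only Markdown files, like DirectoryLoader('../', glob="**/*.md")
matched = [f for f in files if fnmatch.fnmatch(f, "*.md")]
print(matched)  # ['README.md', 'docs/guide.md']
```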

① Showing a progress bar

        By default, no progress bar is shown. To show one, install the tqdm library (e.g. pip install tqdm) and set the show_progress parameter to True.

    pip install tqdm

    loader = DirectoryLoader('../', glob="**/*.md", show_progress=True)
    docs = loader.load()

        Sample output:

    Requirement already satisfied: tqdm in /Users/jon/.pyenv/versions/3.9.16/envs/microbiome-app/lib/python3.9/site-packages (4.65.0)
    0it [00:00, ?it/s]

② Using multithreading

        By default, loading happens in a single thread. To use several threads, set the use_multithreading flag to True.

    loader = DirectoryLoader('../', glob="**/*.md", use_multithreading=True)
    docs = loader.load()

③ Changing the loader class

        By default, the UnstructuredLoader class is used. However, you can change the type of loader quite easily by specifying the loader_cls parameter.

    from langchain_community.document_loaders import TextLoader

    loader = DirectoryLoader('../', glob="**/*.md", loader_cls=TextLoader)
    docs = loader.load()
    len(docs)

        Output:

    1

        If you need to load Python source code files, use the PythonLoader.

    from langchain_community.document_loaders import PythonLoader

    loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
    docs = loader.load()
    len(docs)

        Output:

    691

④ Auto-detecting file encodings with TextLoader

        In this example we look at some strategies that are useful when loading a large number of arbitrary files with TextLoader.

        First, to illustrate the problem, let's try to load multiple text files with arbitrary encodings.

    path = '../../../../../tests/integration_tests/examples'
    loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)

        A. Default behavior

    loader.load()

        The file example-non-utf8.txt uses a different encoding, so load() fails with a helpful message indicating which file failed to decode.

        With TextLoader's default behavior, a failure to load any single document causes the entire loading process to fail, and no documents are loaded.
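        What trips the default TextLoader under the hood is strict decoding: bytes written in one encoding may simply not be valid UTF-8. A pure-Python illustration:

```python
# 'é' encoded as Latin-1 is the single byte 0xE9, which is invalid UTF-8 here
data = "café".encode("latin-1")  # b'caf\xe9'

try:
    data.decode("utf-8")  # strict decoding, like TextLoader's default
    failed = False
except UnicodeDecodeError:
    failed = True

print(failed)                   # True
print(data.decode("latin-1"))   # decoding with the right codec succeeds: café
```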

        B. Silent errors

        We can pass the silent_errors parameter to DirectoryLoader to skip the files that could not be loaded and continue the loading process.

    loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
    docs = loader.load()

        C. Auto-detecting encodings

        We can also ask TextLoader to auto-detect the file encoding before failing, by passing autodetect_encoding through to the loader class.

    text_loader_kwargs={'autodetect_encoding': True}
    loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    docs = loader.load()
    doc_sources = [doc.metadata['source'] for doc in docs]
    print(doc_sources)

        Output:

    ['../../../../../tests/integration_tests/examples/example-non-utf8.txt',
     '../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
     '../../../../../tests/integration_tests/examples/example-utf8.txt']

3. HTML Loader

HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.

        This section covers how to load HTML documents into a document format that we can use downstream.

    from langchain_community.document_loaders import UnstructuredHTMLLoader

    loader = UnstructuredHTMLLoader("example_data/fake-content.html")
    data = loader.load()
    print(data)

        Output:

    [Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]

① Loading HTML with BeautifulSoup4

        We can also load HTML documents with BeautifulSoup4 using the BSHTMLLoader. This extracts the text from the HTML into page_content and the page title into the title field of the metadata.

    from langchain_community.document_loaders import BSHTMLLoader

    loader = BSHTMLLoader("example_data/fake-content.html")
    data = loader.load()
    print(data)

        Output:
    [Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
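        BSHTMLLoader requires the beautifulsoup4 package, but the same title/text split can be approximated with the stdlib html.parser (a rough sketch of the idea, not the loader's actual implementation):

```python
from html.parser import HTMLParser

class TitleTextParser(HTMLParser):
    """Collects the <title> separately from the rest of the visible text."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.texts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.texts.append(data.strip())

parser = TitleTextParser()
parser.feed("<html><head><title>Test Title</title></head>"
            "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>")
print(parser.title)  # Test Title
print(parser.texts)  # ['My First Heading', 'My first paragraph.']
```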

4. JSON Loader

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays (or other serializable values).

JSON Lines is a file format in which each line is a valid JSON value.

JSONLoader parses a JSON file using a specified jq schema. It relies on the jq Python package. See the jq manual for detailed documentation of the jq syntax.

    pip install jq

    from langchain_community.document_loaders import JSONLoader

    import json
    from pathlib import Path
    from pprint import pprint

    file_path = './example_data/facebook_chat.json'
    data = json.loads(Path(file_path).read_text())
    pprint(data)

        Output:
    {'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
     'is_still_participant': True,
     'joinable_mode': {'link': '', 'mode': 1},
     'magic_words': [],
     'messages': [{'content': 'Bye!',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675597571851},
                  {'content': 'Oh no worries! Bye',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675597435669},
                  {'content': 'No Im sorry it was my mistake, the blue one is not '
                              'for sale',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675596277579},
                  {'content': 'I thought you were selling the blue one!',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675595140251},
                  {'content': 'Im not interested in this bag. Im interested in the '
                              'blue one!',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675595109305},
                  {'content': 'Here is $129',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675595068468},
                  {'photos': [{'creation_timestamp': 1675595059,
                               'uri': 'url_of_some_picture.jpg'}],
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675595060730},
                  {'content': 'Online is at least $100',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675595045152},
                  {'content': 'How much do you want?',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675594799696},
                  {'content': 'Goodmorning! $50 is too low.',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675577876645},
                  {'content': 'Hi! Im interested in your bag. Im offering $50. Let '
                              'me know if you are interested. Thanks!',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675549022673}],
     'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
     'thread_path': 'inbox/User 1 and User 2 chat',
     'title': 'User 1 and User 2 chat'}

① Using JSONLoader

        Suppose we want to extract the values of the content field under the messages key of the JSON data. This can be done with JSONLoader as shown below.

JSON file

    loader = JSONLoader(
        file_path='./example_data/facebook_chat.json',
        jq_schema='.messages[].content',
        text_content=False)

    data = loader.load()
    print(data)

        Output:
    [Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
     Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
     Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
     Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
     Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
     Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
     Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
     Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
     Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
     Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
     Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
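        The jq expression .messages[].content corresponds to a plain-Python traversal of the same structure. A sketch on a small made-up chat dict, which also shows why a photo-only message (no content key) yields an empty page_content:

```python
chat = {
    "messages": [
        {"content": "Bye!", "sender_name": "User 2"},
        {"content": "Oh no worries! Bye", "sender_name": "User 1"},
        {"sender_name": "User 2"},  # a photo message with no 'content' key
    ]
}

# .messages[].content — iterate the array and pull each content field;
# default to '' where the key is missing
contents = [m.get("content", "") for m in chat["messages"]]
print(contents)  # ['Bye!', 'Oh no worries! Bye', '']
```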

JSON Lines file

        If you want to load documents from a JSON Lines file, pass json_lines=True and specify a jq_schema that extracts page_content from a single JSON object.

    file_path = './example_data/facebook_chat_messages.jsonl'
    pprint(Path(file_path).read_text())

        Output:

    ('{"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}\n'
     '{"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no '
     'worries! Bye"}\n'
     '{"sender_name": "User 2", "timestamp_ms": 1675596277579, "content": "No Im '
     'sorry it was my mistake, the blue one is not for sale"}\n')
    loader = JSONLoader(
        file_path='./example_data/facebook_chat_messages.jsonl',
        jq_schema='.content',
        text_content=False,
        json_lines=True)

    data = loader.load()
    print(data)

        Output:
    [Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
     Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
     Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]
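        The JSON Lines parsing step itself is simple: one json.loads call per non-empty line. A stdlib sketch of what json_lines=True does before the jq schema is applied (the two-line sample below is made up):

```python
import json

jsonl = ('{"sender_name": "User 2", "content": "Bye!"}\n'
         '{"sender_name": "User 1", "content": "Oh no worries! Bye"}\n')

# Parse each non-empty line as an independent JSON object
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
contents = [r["content"] for r in records]
print(contents)  # ['Bye!', 'Oh no worries! Bye']
```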

Another option is to set jq_schema='.' and provide a content_key:

    loader = JSONLoader(
        file_path='./example_data/facebook_chat_messages.jsonl',
        jq_schema='.',
        content_key='sender_name',
        json_lines=True)

    data = loader.load()
    print(data)

        Output:
    [Document(page_content='User 2', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
     Document(page_content='User 1', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
     Document(page_content='User 2', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]

JSON file with a jq-parsable content_key

        To load documents from a JSON file using a content_key that is itself a jq expression, set is_content_key_jq_parsable=True. Make sure that content_key is compatible and can be parsed with jq.

    file_path = './sample.json'
    pprint(Path(file_path).read_text())

        Output:

    {"data": [
        {"attributes": {
            "message": "message1",
            "tags": [
                "tag1"]},
        "id": "1"},
        {"attributes": {
            "message": "message2",
            "tags": [
                "tag2"]},
        "id": "2"}]}
    loader = JSONLoader(
        file_path=file_path,
        jq_schema=".data[]",
        content_key=".attributes.message",
        is_content_key_jq_parsable=True,
    )

    data = loader.load()
    print(data)

        Output:
    [Document(page_content='message1', metadata={'source': '/path/to/sample.json', 'seq_num': 1}),
     Document(page_content='message2', metadata={'source': '/path/to/sample.json', 'seq_num': 2})]

5. PDF Loaders

There are many PDF loaders available. Below are a few popular ones; choose whichever fits your needs.

① Using PyPDF

        PyPDF is a full-featured library for reading, splitting, merging, and transforming PDFs. Its strengths are that it is lightweight and written in pure Python with no heavy dependencies, so it is easy to install and use, and it runs well on Windows, macOS, and Linux. However, it may not support some features introduced in PDF 1.7 and later, so it can be limited when handling recent PDFs with complex features.

Load a PDF with pypdf into an array of Documents, where each Document contains the page content and metadata with the page number.

    pip install pypdf

    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
    pages = loader.load_and_split()
    print(pages[0])

        Output:
    Document(page_content='LayoutParser : A Uni\x0ced Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\nfmelissadell,jacob carlson g@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model con\x0cgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\ne\x0borts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser , an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. 
We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io .\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\n·Character Recognition ·Open Source library ·Toolkit.\n1 Introduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classi\x0ccation [ 11,arXiv:2103.15348v2  [cs.CV]  21 Jun 2021', metadata={'source': 'example_data/layout-parser-paper.pdf', 'page': 0})

One advantage of this approach is that documents can be retrieved by page number.
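Since each Document carries its page number in metadata, pulling out a given page is a one-line filter. Sketched here with plain dicts standing in for the Document objects PyPDFLoader would return:

```python
# Hypothetical stand-ins for Documents returned by PyPDFLoader
pages = [
    {"page_content": "Introduction ...", "metadata": {"page": 0}},
    {"page_content": "Methods ...", "metadata": {"page": 1}},
    {"page_content": "Results ...", "metadata": {"page": 2}},
]

# Select page 1 by its metadata (pages are zero-indexed)
wanted = [d for d in pages if d["metadata"]["page"] == 1]
print(wanted[0]["page_content"])  # Methods ...
```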

Example application:

    import os
    import getpass

    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings

    faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
    docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
    for doc in docs:
        print(str(doc.metadata["page"]) + ":", doc.page_content[:300])

        Output:
9: 10 Z. Shen et al.
Fig. 4: Illustration of (a) the original historical Japanese document with layout
detection results and (b) a recreated version of the document image that achieves
much better character recognition recall. The reorganization algorithm rearranges
the tokens based on the their detect
3: 4 Z. Shen et al.
Efficient Data AnnotationC u s t o m i z e d M o d e l T r a i n i n gModel Cust omizationDI A Model HubDI A Pipeline SharingCommunity PlatformLa y out Detection ModelsDocument Images
T h e C o r e L a y o u t P a r s e r L i b r a r yOCR ModuleSt or age & VisualizationLa y ou
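Under the hood, `similarity_search` embeds the query and ranks the stored vectors by similarity. A toy sketch of that ranking step, with hand-made 3-dimensional vectors standing in for OpenAI embeddings (cosine similarity is one common choice; FAISS supports several metrics):

```python
import math

# What similarity_search does in miniature: rank stored vectors by
# cosine similarity against the query vector.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

index = {"doc_a": [1.0, 0.0, 0.0], "doc_b": [0.9, 0.1, 0.0], "doc_c": [0.0, 1.0, 0.0]}
query = [1.0, 0.05, 0.0]

best = max(index, key=lambda k: cosine(index[k], query))
print(best)  # -> doc_a
```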

② Using PyPDFium2

        PyPDFium2 provides Python bindings for PDFium, a fast, feature-rich PDF rendering engine developed by Google, so PyPDFium2 tends to perform well at PDF rendering and text extraction. For users mainly concerned with rendering PDFs and extracting their text, it can be an excellent choice.

from langchain_community.document_loaders import PyPDFium2Loader

loader = PyPDFium2Loader("example_data/layout-parser-paper.pdf")
data = loader.load()
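As the overview at the top notes, loaders also optionally implement lazy loading, which matters for large PDFs: pages are produced one at a time rather than materialized as a full list. The mechanism is an ordinary Python generator, sketched here without the library:

```python
# The "lazy load" idea from the overview, sketched with a plain generator:
# pages are yielded one at a time, which is what loader.lazy_load() offers
# instead of loading every page into memory at once.
def lazy_pages(n):
    for i in range(n):
        yield {"page": i, "page_content": f"page {i} text"}

gen = lazy_pages(1000)
first = next(gen)      # only one page has been produced at this point
print(first["page"])   # -> 0
```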

③ Using PDFMiner

        PDFMiner is a library focused on extracting text and metadata from PDF documents. Beyond basic text extraction, it offers advanced features such as table analysis and image handling. Its API is concise and easy to pick up, and it has good cross-platform compatibility. From simple text extraction to complex page-layout analysis, PDFMiner covers a wide range of needs.

from langchain_community.document_loaders import PDFMinerLoader

loader = PDFMinerLoader("example_data/layout-parser-paper.pdf")
data = loader.load()

④ Special case: PyPDF Directory

        Load PDFs from a directory.

from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
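The first thing such a directory loader has to do is collect the matching files; PyPDFDirectoryLoader then hands each hit to PyPDFLoader. A framework-free sketch of that collection step (the recursive glob shown here is an assumed behavior for illustration, not a guarantee of the library's exact default pattern):

```python
import tempfile
from pathlib import Path

# Collect every .pdf under a folder, recursively -- the first step a
# directory PDF loader performs before parsing each file.
def find_pdfs(root):
    return sorted(str(p) for p in Path(root).glob("**/*.pdf"))

# demo on a throwaway directory tree
root = tempfile.mkdtemp()
Path(root, "a.pdf").touch()
Path(root, "sub").mkdir()
Path(root, "sub", "b.pdf").touch()
Path(root, "notes.txt").touch()  # ignored: wrong extension

found = find_pdfs(root)
print(len(found))  # -> 2
```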

⑤ Special case: using Unstructured

        An unstructured PDF is one whose information is not organized by any fixed structure or format but appears in raw, unprocessed form. Such files have no predefined data model, do not map neatly onto the two-dimensional tables of a relational database, and are therefore awkward to extract and parse. As a result, unstructured PDFs can look messy, lacking a uniform structure and format, which makes it hard to pull out the information you need directly.

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("example_data/layout-parser-paper.pdf")
data = loader.load()
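By default the unstructured loader returns one Document per file. It also accepts mode="elements", which keeps each detected element (title, paragraph, table, and so on) as its own Document. The gist of that splitting, reduced to blank-line separation in plain Python:

```python
# Sketch of the "elements" idea: instead of one text blob per file, keep each
# block as its own record -- here blocks are simply split on blank lines.
raw = "Title\n\nFirst paragraph.\n\nSecond paragraph."
elements = [blk.strip() for blk in raw.split("\n\n") if blk.strip()]
print(len(elements))  # -> 3
```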

六、Word Loader (.doc and .docx)

In LangChain, Word files are handled by a single loader, the unstructured UnstructuredWordDocumentLoader.

Environment setup:

pip install unstructured
pip install python-doc
pip install python-docx

Sample code:

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader("example_data/layout-parser-paper.doc")
data = loader.load()

七、Text Loader (.txt)

In LangChain, .txt files are handled by a single loader, TextLoader.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("example_data/layout-parser-paper.txt")
data = loader.load()
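One practical note: TextLoader reads the file with a given encoding (UTF-8 by default) and exposes `encoding` and `autodetect_encoding` parameters for everything else. The pitfall those parameters address, shown with the standard library alone:

```python
import os
import tempfile

# A file saved in GBK cannot be read with the default UTF-8 codec --
# the same failure TextLoader hits unless encoding= (or
# autodetect_encoding=True) is supplied.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="gbk") as f:
    f.write("文件加载器")

try:
    open(path, encoding="utf-8").read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

text = open(path, encoding="gbk").read()
print(utf8_ok, text)  # -> False 文件加载器
```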

八、Complete Code

Below is a set of helper functions for the common file loaders, distilled from practical experience; they can be used as-is.

from langchain_community.document_loaders import (
    UnstructuredWordDocumentLoader,
    CSVLoader,
    PyPDFLoader,
    TextLoader,
    DirectoryLoader,
)
import os
from langchain_community.document_loaders.unstructured import UnstructuredFileLoader
from langchain_community.document_loaders.pdf import PyPDFDirectoryLoader


# load PDF files from directory (manual variant, kept for reference)
# def load_pdf_from_dir_2(directory_path):
#     data = []
#     for filename in os.listdir(directory_path):
#         if filename.endswith(".pdf"):
#             print(filename)  # print the file name
#             loader = PyPDFLoader(f'{directory_path}/{filename}')
#             print(loader)
#             data.append(loader.load())
#     return data


# load PDF files from directory
def load_pdf_from_dir(directory_path):
    loader = PyPDFDirectoryLoader(directory_path)
    data = loader.load()
    return data


# load PDF files from a single pdf file
def load_pdf_from_one(filepath):
    data = ''
    if filepath.endswith(".pdf"):
        print(filepath)  # print the file name
        loader = PyPDFLoader(filepath)
        print(loader)
        data = loader.load()
    return data


# load Word files (.doc/.docx) from directory
def load_word_from_dir(directory_path):
    data = []
    for filename in os.listdir(directory_path):
        # check if the file is a doc or docx file
        if filename.endswith(".doc") or filename.endswith(".docx"):
            # use LangChain's built-in Word loader
            loader = UnstructuredWordDocumentLoader(f'{directory_path}/{filename}')
            data.append(loader.load())
    return data


# load a single Word file (.doc/.docx)
def load_word_from_one(filename):
    data = ''
    if filename.endswith(".doc") or filename.endswith(".docx"):
        print(filename)  # print the file name
        loader = UnstructuredWordDocumentLoader(filename)
        print(loader)
        data = loader.load()
    return data


# load Text files (.txt) from directory
def load_txt_from_dir(directory_path):
    data = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".txt"):
            print(filename)
            loader = TextLoader(f'{directory_path}/{filename}')
            print(loader)
            data.append(loader.load())
    return data


# load a single Text file (.txt)
def load_text_from_one(filename):
    data = ''
    if filename.endswith(".txt"):
        print(filename)  # print the file name
        loader = TextLoader(filename)
        print(loader)
        data = loader.load()
    return data


# load CSV files (.csv) from directory
def load_csv_from_dir(directory_path):
    data = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".csv"):
            print(filename)
            loader = CSVLoader(f'{directory_path}/{filename}')
            print(loader)
            data.append(loader.load())
    return data


# load a single CSV file (.csv)
def load_csv_from_one(filename):
    data = ''
    if filename.endswith(".csv"):
        print(filename)  # print the file name
        loader = CSVLoader(filename)
        print(loader)
        data = loader.load()
    return data


# load all files from directory
# param glob = "**/*.<file extension>"  controls which files are loaded
# param show_progress = True            shows a progress bar
# param use_multithreading = True       loads with multiple threads
# param loader_cls = CSVLoader          picks the per-file loader | UnstructuredFileLoader
def load_all_from_dir(directory_path, glob, show_progress=False, use_multithreading=False, loader_cls=UnstructuredFileLoader):
    loader = DirectoryLoader(directory_path, glob=glob, show_progress=show_progress, use_multithreading=use_multithreading, loader_cls=loader_cls)
    data = loader.load()
    return data


if __name__ == '__main__':
    res = load_pdf_from_dir("./testdir")
    print(res)
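To tie the single-file helpers above together, here is one hypothetical convenience wrapper that routes a path to the right loader by extension (the mapping values name the functions defined in this section; the wrapper itself is an addition, not part of the original snippet):

```python
import os

# Hypothetical dispatch table: map a file extension to the name of the
# matching single-file loader function defined above.
EXT_DISPATCH = {
    ".pdf": "load_pdf_from_one",
    ".doc": "load_word_from_one",
    ".docx": "load_word_from_one",
    ".txt": "load_text_from_one",
    ".csv": "load_csv_from_one",
}

def pick_loader(filepath):
    """Return the loader function name for this path, or None if unsupported."""
    ext = os.path.splitext(filepath)[1].lower()
    return EXT_DISPATCH.get(ext)

print(pick_loader("report.PDF"))  # -> load_pdf_from_one
```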

Writing these up takes real effort, so a like, a save, and a follow would be much appreciated; your support is what keeps me (H-大叔) going.

If the code doesn't run for you, or you have other suggestions, please leave a comment below. I reply when I see them, so there's no need to message privately.

The column 人工智能 | 大模型 | 实战与教程 has more articles on AI and big data, continuously updated, so feel free to keep reading.
