Translation: Building a GitHub Commit History Chatbot with LLMs, Part 2: Using Timescale Vector, pgvector, and LlamaIndex

Author: 2023面试高手 | 2024-06-03 15:35:22


Continued from the previous post: Translation: Building a GitHub Commit History Chatbot with LLMs, Part 1: Using Timescale Vector, pgvector, and LlamaIndex

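The TSV Time Machine sample app has three pages:

Home: the app's landing page, which provides instructions for using the application.
Load Data: the page for loading the Git commit history of a repository you choose.
Time Machine Demo: the interface for chatting with any GitHub repository you have loaded.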

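Since the app is ~600 lines of code, we won't unpack it line by line (although you can ask ChatGPT to explain any tricky parts to you!). Let's look at the key code snippets involved in: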

Loading data from the GitHub repository you want to chat with
Improving the chat with time-aware retrieval augmented generation

Part 1: Loading time-based data with Timescale Vector and LlamaIndex

Enter the URL of the GitHub repository you want to load data for, and TSV Time Machine uses LlamaIndex to load the data, create vector embeddings for it, and store it in Timescale Vector.

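In the file 0_LoadData.py, we fetch data from the GitHub repository of your choice, create embeddings for it using OpenAI's text-embedding-ada-002 model and LlamaIndex, and store it in tables in Timescale Vector. These tables contain the vector embedding, the original text, and the metadata associated with each Git commit, including a UUID that reflects the commit's timestamp.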
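To make the idea of a UUID that reflects the commit timestamp more concrete, here is a minimal sketch that uses only Python's standard library. It builds a version 1 UUID whose time field is derived from a given datetime, then reads the datetime back out. The helper names uuid_from_datetime() and datetime_from_uuid() are hypothetical and are not taken from the sample app, which may build its UUIDs differently. The random clock sequence and node values are also assumptions made for illustration.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01), as used by version 1 UUIDs.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid_from_datetime(dt: datetime) -> uuid.UUID:
    """Build a version 1 UUID whose time field encodes `dt` (hypothetical helper).

    `dt` must be timezone-aware.
    """
    delta = dt - UNIX_EPOCH
    # Integer arithmetic avoids floating-point rounding of the timestamp.
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    timestamp = ticks + UUID_EPOCH_OFFSET

    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1

    # Assumption: random clock sequence and node, so two commits with the
    # same timestamp still get distinct IDs.
    clock_seq = random.getrandbits(14)
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    node = random.getrandbits(48) | (1 << 40)  # multicast bit marks a random node

    return uuid.UUID(
        fields=(time_low, time_mid, time_hi_version,
                clock_seq_hi_variant, clock_seq_low, node)
    )


def datetime_from_uuid(u: uuid.UUID) -> datetime:
    """Recover the timestamp stored in a version 1 UUID."""
    ticks = u.time - UUID_EPOCH_OFFSET
    return UNIX_EPOCH + timedelta(microseconds=ticks // 10)
```

Because the timestamp lives inside the ID itself, a store can place or filter rows by time using the ID alone, without a separate timestamp column.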

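First, we define a load_git_history() function. This function asks the user, through st.text_input elements, for the GitHub repository URL, the branch, and the number of commits to load. It then fetches the repository's Git commit history, uses LlamaIndex to embed the commit history text and turn it into LlamaIndex nodes, and inserts the embeddings and metadata into Timescale Vector: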


import streamlit as st

# Load git history into the database using LlamaIndex.
# get_history(), record_catalog_info() and load_into_db() are helper
# functions defined in the sample app repository.
def load_git_history():
    repo = st.text_input("Repo", "https://github.com/postgres/postgres")
    branch = st.text_input("Branch", "master")
    # The text input returns a string, so convert it to an integer
    limit = int(st.text_input("Limit number commits (0 for no limit)", "1000"))
    if st.button("Load data into the database"):
        df = get_history(repo, branch, limit)
        table_name = record_catalog_info(repo)
        load_into_db(table_name, df)

A function for loading the Git history from a user-defined URL. It defaults to the PostgreSQL project.
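One detail worth noting in the function above is that the commit limit arrives as free text, so int() raises an error on anything that is not a whole number. The sketch below shows one way to validate that input before using it. The helper name parse_commit_limit() is hypothetical, and mapping 0 to None is only an assumption about how "0 for no limit" could be represented; the sample app may handle it differently.

```python
from typing import Optional


def parse_commit_limit(raw: str) -> Optional[int]:
    """Return the commit limit, or None when the user asked for no limit.

    Hypothetical helper: not part of the sample app.
    """
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"Expected a whole number of commits, got {raw!r}") from None
    if value < 0:
        raise ValueError("The commit limit cannot be negative")
    # Assumption: 0 means "no limit", represented here as None.
    return None if value == 0 else value
```

In a Streamlit page, the ValueError could be caught and shown to the user with st.error() instead of letting the page crash.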

While the full code for the helper functions get_history(), record_catalog_info(), and load_into_db() is in the sample app repository, here is an overview:

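get_history(): fetches the repository's Git history and stores it in a Pandas DataFrame. We fetch the commit hash, author name, commit date, and commit message.
record_catalog_info(): creates a relational table in our Timescale Vector database that stores information about the GitHub repositories that have been loaded. The repository URL and the name of the table its commits are stored in are saved in the database.
load_into_db(): creates a TimescaleVectorStore in LlamaIndex to store the embeddings and metadata for the commit data.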
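As a rough illustration of the get_history() step described above, the following sketch collects the same four fields (commit hash, author name, commit date, commit message) using git log and the standard library. It is not the sample app's implementation: the function names parse_git_log() and read_git_history() are hypothetical, it returns a list of dictionaries rather than a Pandas DataFrame, it assumes the repository has already been cloned to a local path, and it keeps only the subject line of each commit message.

```python
import subprocess
from typing import Dict, List

# Unit and record separator characters are unlikely to appear in commit
# text, which makes the git log output easy to split reliably.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
# %H = commit hash, %an = author name, %aI = author date (strict ISO 8601),
# %s = subject line of the commit message.
GIT_FORMAT = "%H%x1f%an%x1f%aI%x1f%s%x1e"


def parse_git_log(output: str) -> List[Dict[str, str]]:
    """Turn raw `git log` output in GIT_FORMAT into a list of commit records."""
    commits = []
    for record in output.split(RECORD_SEP):
        record = record.strip()
        if not record:
            continue
        commit_hash, author, date, message = record.split(FIELD_SEP, 3)
        commits.append(
            {"commit": commit_hash, "author": author, "date": date, "message": message}
        )
    return commits


def read_git_history(repo_path: str, branch: str, limit: int = 0) -> List[Dict[str, str]]:
    """Read commit history from a local clone (hypothetical helper).

    A limit of 0 is treated as "no limit".
    """
    cmd = ["git", "-C", repo_path, "log", branch, f"--pretty=format:{GIT_FORMAT}"]
    if limit > 0:
        cmd.append(f"--max-count={limit}")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_git_log(result.stdout)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame() if a DataFrame is needed downstream.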
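In load_into_db(), we set the time_partition_interval parameter to 365 days. This parameter is the length of each interval used to partition the data by time: each partition holds the data for one interval of that length.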

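from datetime import timedelta

# Import path from older llama_index releases; newer releases may differ.
from llama_index.vector_stores import TimescaleVectorStore

# Create Timescale Vectorstore with partition interval of 1 year
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=st.secrets["TIMESCALE_SERVICE_URL"],
    table_name=table_name,
    time_partition_interval=timedelta(days=365),
)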
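To build some intuition for what a 365-day partition interval means, the sketch below assigns timestamps to fixed-length time buckets. It is purely illustrative: the real partitioning happens inside the database, and the partition_bounds() function, as well as the choice to align buckets to the Unix epoch, are assumptions made for this example rather than a description of how Timescale Vector picks its boundaries.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_bounds(ts: datetime, interval: timedelta) -> Tuple[datetime, datetime]:
    """Return the [start, end) bounds of the fixed-length bucket containing `ts`.

    Hypothetical helper; `ts` must be timezone-aware.
    """
    # Dividing one timedelta by another with // yields a whole number of intervals.
    index = (ts - UNIX_EPOCH) // interval
    start = UNIX_EPOCH + index * interval
    return start, start + interval


one_year = timedelta(days=365)
commit_time = datetime(2023, 11, 1, tzinfo=timezone.utc)
start, end = partition_bounds(commit_time, one_year)
```

With this kind of layout, a query restricted to a time range only needs to look at the partitions that overlap that range, which is the motivation for partitioning commit data by time.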

