The popularity of projects like PrivateGPT, llama.cpp, and GPT4All underscores the demand to run LLMs locally (on your own device).
This has at least two important benefits:
1. Privacy: your data is not sent to a third party, and it is not subject to the terms of service of a commercial service.
2. Cost: there is no inference fee, which is important for token-intensive applications (e.g., long-running simulations, summarization).
Running an LLM locally requires a few things:
1. Open-source LLM: an LLM that can be freely modified and shared.
2. Inference: the ability to run this LLM on your device with acceptable latency.
Users can now gain access to a rapidly growing set of open-source LLMs.
These LLMs can be assessed across at least two dimensions (see figure):
1. Base model: what is the base model and how was it trained?
2. Fine-tuning approach: was the base model fine-tuned and, if so, with what set of instructions?
The relative performance of these models can be assessed using several leaderboards, including LmSys, GPT4All, and HuggingFace.
A few frameworks have emerged to support inference of open-source LLMs on various devices:
1. llama.cpp: a C++ implementation of LLaMA inference code with weight optimization/quantization
2. gpt4all: an optimized C backend for inference
3. Ollama: bundles model weights and environment into an app that runs on-device and serves the LLM
In general, these frameworks do a few things:
1. Quantization: reduce the memory footprint of the raw model weights.
2. Efficient implementation for inference: support inference on consumer hardware (e.g., a CPU or laptop GPU).
With less precision, we radically decrease the memory needed to store the LLM in memory.
In addition, we can see the importance of GPU memory bandwidth!
With its larger GPU memory bandwidth, a Mac M2 Max runs inference 5-6x faster than an M1.
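To make these numbers concrete, here is a back-of-the-envelope sketch (our own illustration, not from the original guide; the 400 GB/s figure is Apple's published M2 Max spec, and the tokens/sec bound is a common rule of thumb, since generating each token reads roughly all of the weights once):

# Approximate weight-storage cost at different precisions, and a crude
# bandwidth-based upper bound on generation speed (illustrative only).
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (ignores activations and KV cache)."""
    return n_params * bits_per_param / 8 / 1e9

params_13b = 13e9
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(params_13b, bits):.1f} GB")
# 16-bit: ~26.0 GB, 8-bit: ~13.0 GB, 4-bit: ~6.5 GB

bandwidth_gb_s = 400  # Apple M2 Max memory bandwidth (approximate spec)
model_gb = weight_memory_gb(params_13b, 4)
print(f"~{bandwidth_gb_s / model_gb:.0f} tokens/sec upper bound")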
Ollama is one way to easily run inference on macOS.
The instructions here provide details, which we summarize:
1. Download and run the app.
2. From the command line, fetch a model, e.g., ollama pull llama2.
3. When the app is running, all models are automatically served on localhost:11434 (see the sketch after this list).
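Because the app serves models over a local REST endpoint, you can also call it directly; a minimal sketch using only the standard library (endpoint and payload shape per Ollama's API docs; it assumes ollama pull llama2 has been run):

# Call the local Ollama server directly over its REST API.
import json
import urllib.request

payload = json.dumps({
    "model": "llama2",
    "prompt": "The first man on the moon was ...",
    "stream": False,  # return a single JSON object instead of a token stream
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])

More commonly, though, you will use the LangChain wrapper: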
from langchain_community.llms import Ollama

llm = Ollama(model="llama2")
llm("The first man on the moon was …")
' The first man on the moon was Neil Armstrong, who landed on the moon on July 20, 1969 as part of the Apollo 11 mission. obviously.'
Stream tokens as they are being generated:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = Ollama(
    model="llama2", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)
llm("The first man on the moon was …")
The first man to walk on the moon was Neil Armstrong, an American astronaut who was part of the Apollo 11 mission in 1969. февруари 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon's surface, famously declaring "That's one small step for man, one giant leap for mankind" as he took his first steps. He was followed by fellow astronaut Edwin "Buzz" Aldrin, who also walked on the moon during the mission.
' The first man to walk on the moon was Neil Armstrong, an American astronaut who was part of the Apollo 11 mission in 1969. февруари 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon's surface, famously declaring "That's one small step for man, one giant leap for mankind" as he took his first steps. He was followed by fellow astronaut Edwin "Buzz" Aldrin, who also walked on the moon during the mission.'
Inference speed is a challenge when running models locally (see above).
To minimize latency, it is desirable to run models locally on a GPU, which ships with many consumer laptops (e.g., Apple devices).
And even with a GPU, the available GPU memory bandwidth (as noted above) is important.
Ollama will automatically utilize the GPU on Apple devices.
Other frameworks require the user to set up the environment to utilize the Apple GPU.
For example, the llama.cpp Python bindings can be configured to use the GPU via Metal.
Metal is a graphics and compute API created by Apple that provides near-direct access to the GPU.
See the llama.cpp setup instructions here to enable this.
In particular, ensure that conda is using the correct virtual environment that you created (miniforge3).
For example, for me:
conda activate /Users/rlm/miniforge3/envs/llama
With the above confirmed, then:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
There are various ways to gain access to quantized model weights.
With Ollama, fetch a model via ollama pull <model family>:<tag>:
from langchain_community.llms import Ollama

llm = Ollama(model="llama2:13b")
llm("The first man on the moon was … think step by step")
' Sure! Here's the answer, broken down step by step:\n\nThe first man on the moon was… Neil Armstrong.\n\nHere's how I arrived at that answer:\n\n1. The first manned mission to land on the moon was Apollo 11.\n2. The mission included three astronauts: Neil Armstrong, Edwin "Buzz" Aldrin, and Michael Collins.\n3. Neil Armstrong was the mission commander and the first person to set foot on the moon.\n4. On July 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon's surface, famously declaring "That's one small step for man, one giant leap for mankind."\n\nSo, the first man on the moon was Neil Armstrong!'
Llama.cpp is compatible with a broad set of models.
For example, below we run inference on llama2-13b with 4-bit quantization, downloaded from HuggingFace.
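The guide does not show the download step itself; one way to fetch quantized weights programmatically is the huggingface_hub package (the repo_id and filename below are placeholders for illustration, so substitute a real quantized artifact):

# Hypothetical sketch: download quantized GGUF weights from HuggingFace.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-GGUF",  # placeholder repo
    filename="llama-2-13b.Q4_0.gguf",     # placeholder 4-bit file
)
print(model_path)  # local cache path, usable as LlamaCpp(model_path=...)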
As noted above, see the API reference for the full set of parameters.
From the llama.cpp API reference, a few parameters are worth commenting on:
n_gpu_layers: the number of layers to load into GPU memory
n_batch: the number of tokens the model should process in parallel
n_ctx: the token context window
f16_kv: whether the model should use half-precision for the key/value cache
%env CMAKE_ARGS="-DLLAMA_METAL=on"
%env FORCE_CMAKE=1
%pip install --upgrade --quiet llama-cpp-python --no-cache-dir

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)
The console log will show the following to indicate that Metal was enabled properly by the steps above:
ggml_metal_init: allocating
ggml_metal_init: using MPS
llm("The first man on the moon was … Let's think step by step")
Llama.generate: prefix-match hit
llama_print_timings: load time = 9623.21 ms
llama_print_timings: sample time = 143.77 ms / 203 runs ( 0.71 ms per token, 1412.01 tokens per second)
llama_print_timings: prompt eval time = 485.94 ms / 7 tokens ( 69.42 ms per token, 14.40 tokens per second)
llama_print_timings: eval time = 6385.16 ms / 202 runs ( 31.61 ms per token, 31.64 tokens per second)
llama_print_timings: total time = 7279.28 ms
and use logical reasoning to figure out who the first man on the moon was.
Here are some clues:
1. The first man on the moon was an American.
2. He was part of the Apollo 11 mission.
3. He stepped out of the lunar module and became the first person to set foot on the moon's surface.
4. His last name is Armstrong.
Now, let's use our reasoning skills to figure out who the first man on the moon was. Based on clue #1, we know that the first man on the moon was an American. Clue #2 tells us that he was part of the Apollo 11 mission. Clue #3 reveals that he was the first person to set foot on the moon's surface. And finally, clue #4 gives us his last name: Armstrong.
Therefore, the first man on the moon was Neil Armstrong!
" and use logical reasoning to figure out who the first man on the moon was.\n\nHere are some clues:\n\n1. The first man on the moon was an American.\n2. He was part of the Apollo 11 mission.\n3. He stepped out of the lunar module and became the first person to set foot on the moon's surface.\n4. His last name is Armstrong.\n\nNow, let's use our reasoning skills to figure out who the first man on the moon was. Based on clue #1, we know that the first man on the moon was an American. Clue #2 tells us that he was part of the Apollo 11 mission. Clue #3 reveals that he was the first person to set foot on the moon's surface. And finally, clue #4 gives us his last name: Armstrong.\nTherefore, the first man on the moon was Neil Armstrong!"
We can use model weights downloaded from the GPT4All model explorer.
Similar to what is shown above, we can run inference and use the API reference to set parameters of interest.
%pip install gpt4all
from langchain_community.llms import GPT4All
llm = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin"
)
llm("The first man on the moon was … Let's think step by step")
".\n1) The United States decides to send a manned mission to the moon.2) They choose their best astronauts and train them for this specific mission.3) They build a spacecraft that can take humans to the moon, called the Lunar Module (LM).4) They also create a larger spacecraft, called the Saturn V rocket, which will launch both the LM and the Command Service Module (CSM), which will carry the astronauts into orbit.5) The mission is planned down to the smallest detail: from the trajectory of the rockets to the exact movements of the astronauts during their moon landing.6) On July 16, 1969, the Saturn V rocket launches from Kennedy Space Center in Florida, carrying the Apollo 11 mission crew into space.7) After one and a half orbits around the Earth, the LM separates from the CSM and begins its descent to the moon's surface.8) On July 20, 1969, at 2:56 pm EDT (GMT-4), Neil Armstrong becomes the first man on the moon. He speaks these"
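As with LlamaCpp, generation parameters can be set on the wrapper; a sketch assuming field names such as max_tokens and temp from the GPT4All API reference (verify the exact names against your installed version):

# Sketch: configuring GPT4All generation parameters (parameter names assumed
# from the API reference -- check your installed version).
llm = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin",
    max_tokens=256,  # cap on the number of generated tokens
    temp=0.2,        # lower temperature for more deterministic output
)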
Some LLMs will benefit from specific prompts.
For example, LLaMA will use special tokens.
We can use ConditionalPromptSelector to set a prompt based on the model type.
# Set our LLM
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)
Set the associated prompt based on the model version:
from langchain.chains import LLMChain
from langchain.chains.prompt_selector import ConditionalPromptSelector
from langchain.prompts import PromptTemplate
DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""<<SYS>> \n You are an assistant tasked with improving Google search \
results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that \
are similar to this question. The output should be a numbered list of questions \
and each should have a question mark at the end: \n\n {question} [/INST]""",
)

DEFAULT_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search \
results. Generate THREE Google search queries that are similar to \
this question. The output should be a numbered list of questions and each \
should have a question mark at the end: {question}""",
)
QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
default_prompt=DEFAULT_SEARCH_PROMPT,
conditionals=[(lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)
prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)
prompt
PromptTemplate(input_variables=['question'], output_parser=None, partial_variables={}, template='<<SYS>> \n You are an assistant tasked with improving Google search results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that are similar to this question. The output should be a numbered list of questions and each should have a question mark at the end: \n\n {question} [/INST]', template_format='f-string', validate_template=True)
# Chain
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year that Justin Bieber was born?"
llm_chain.run({"question": question})
Sure! Here are three similar search queries with a question mark at the end:
1. Which NBA team did LeBron James lead to a championship in the year he was drafted?
2. Who won the Grammy Awards for Best New Artist and Best Female Pop Vocal Performance in the same year that Lady Gaga was born?
3. What MLB team did Babe Ruth play for when he hit 60 home runs in a single season?
llama_print_timings: load time = 14943.19 ms
llama_print_timings: sample time = 72.93 ms / 101 runs ( 0.72 ms per token, 1384.87 tokens per second)
llama_print_timings: prompt eval time = 14942.95 ms / 93 tokens ( 160.68 ms per token, 6.22 tokens per second)
llama_print_timings: eval time = 3430.85 ms / 100 runs ( 34.31 ms per token, 29.15 tokens per second)
llama_print_timings: total time = 18578.26 ms
' Sure! Here are three similar search queries with a question mark at the end:\n\n1. Which NBA team did LeBron James lead to a championship in the year he was drafted?\n2. Who won the Grammy Awards for Best New Artist and Best Female Pop Vocal Performance in the same year that Lady Gaga was born?\n3. What MLB team did Babe Ruth play for when he hit 60 home runs in a single season?'
We can also use the LangChain Prompt Hub to fetch and/or store prompts that are model-specific.
This will work with your LangSmith API key.
For example, here is a RAG prompt with LLaMA-specific tokens.
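A sketch of pulling such a prompt from the hub (the rlm/rag-prompt-llama handle is the LLaMA RAG prompt referenced in LangChain's docs; this assumes the langchainhub package is installed and your LangSmith API key is configured):

# Pull a LLaMA-specific RAG prompt from the LangChain Prompt Hub.
from langchain import hub

rag_prompt_llama = hub.pull("rlm/rag-prompt-llama")
print(rag_prompt_llama)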
Given an llm created from one of the models above, you can use it for many use cases.
For example, here is a guide to RAG with local LLMs.
In general, use cases for local LLMs are driven by at least two factors:
1. Privacy: private data (e.g., journals) that a user does not want to share.
2. Cost: text preprocessing (extraction/tagging), summarization, and agent simulations are token-intensive tasks (see the sketch after this list).
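As a small illustration of the cost-driven case, a token-intensive task such as summarization can run entirely on-device with the llm defined above (a sketch reusing the LLMChain pattern from this guide; the input text is a placeholder):

# Sketch: on-device summarization with the local llm defined above --
# no per-token fees and no data leaving the machine.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

summary_prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize the following text in two sentences:\n\n{text}",
)
summarize_chain = LLMChain(prompt=summary_prompt, llm=llm)
summarize_chain.run({"text": "..."})  # substitute your private document text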