
AGI | Run Large Models Locally Without a GPU! An Ollama Tutorial (Docker Deployment)

This article shows how to run a large language model locally without a GPU: install Docker and Ollama, optionally set up Anaconda with a dedicated virtual environment, then start an OpenAI-compatible API service and test it to confirm the model runs efficiently and stably. Ollama's local deployment gives users without GPU resources a convenient way to run large models.

Contents

1. Implementation Steps

Install Docker (optional)

Install Ollama

Install Anaconda and create a virtual environment (optional)

2. API Service

3. Testing


1. Implementation Steps

Linux is recommended. On Windows, use WSL2 (WSL2 virtualizes a complete Linux kernel, so it behaves essentially like a native Linux system).

Install Docker (skip if already installed; the commands below target CentOS/RHEL-style systems using yum)

# Update the package index
yum -y update
yum install -y yum-utils
# Add the Docker repository
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# Install Docker
yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Start Docker
systemctl start docker
# Enable Docker at boot
systemctl enable docker
# Verify the installation
docker --version
# Docker version 25.0.1, build 29cf629

Install Ollama

# Start the ollama container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull and run a model; qwen:7b is used here
docker exec -itd ollama ollama run qwen:7b
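
Once the container is up, it is worth a quick smoke test against Ollama's native API on port 11434 before building the wrapper. Below is a minimal sketch using the ollama Python client (installed later in this guide with pip install ollama); by default it talks to http://localhost:11434, which matches the port mapping above:

# quick smoke test against the Ollama container (assumes qwen:7b has already been pulled)
import ollama

# non-streaming chat call; the client targets http://localhost:11434 by default
response = ollama.chat(
    model="qwen:7b",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response["message"]["content"])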

Install Anaconda and create a virtual environment (optional)

# Change to the installation directory
cd /opt
# Download Anaconda (install wget first if it is missing)
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
# Install Anaconda
bash Anaconda3-2023.09-0-Linux-x86_64.sh
# Create the ollama virtual environment
conda create -n ollama python=3.10
# Activate the virtual environment
conda activate ollama

2. API Service

Ollama ships with its own HTTP API, but its streaming behavior has some quirks; the Python client handles streaming fine. The api_demo below therefore wraps the Python client to expose an API aligned with the ChatGPT (OpenAI) chat-completions format.

Code adapted from: LLaMA-Factory/src/api_demo.py

# Install dependencies
pip install ollama sse_starlette fastapi uvicorn
# Create the api_demo.py file and paste in the code below
touch api_demo.py
vi api_demo.py
# Start the service
python api_demo.py

import asyncio
import json
import os
from typing import Any, Dict, Sequence
import ollama
from sse_starlette.sse import EventSourceResponse
from fastapi import FastAPI, HTTPException, status
from fastapi.middleware.cors import CORSMiddleware
import uvicorn
import time
from enum import Enum, unique
from typing import List, Optional
from pydantic import BaseModel, Field
from typing_extensions import Literal


@unique
class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"
    FUNCTION = "function"
    TOOL = "tool"
    OBSERVATION = "observation"


@unique
class Finish(str, Enum):
    STOP = "stop"
    LENGTH = "length"
    TOOL = "tool_calls"


class ModelCard(BaseModel):
    id: str
    object: Literal["model"] = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: Literal["owner"] = "owner"


class ModelList(BaseModel):
    object: Literal["list"] = "list"
    data: List[ModelCard] = []


class Function(BaseModel):
    name: str
    arguments: str


class FunctionCall(BaseModel):
    id: Literal["call_default"] = "call_default"
    type: Literal["function"] = "function"
    function: Function


class ChatMessage(BaseModel):
    role: Role
    content: str


class ChatCompletionMessage(BaseModel):
    role: Optional[Role] = None
    content: Optional[str] = None
    tool_calls: Optional[List[FunctionCall]] = None


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    tools: Optional[list] = []
    do_sample: bool = True
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    n: int = 1
    max_tokens: Optional[int] = None
    stream: bool = False


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatCompletionMessage
    finish_reason: Finish


class ChatCompletionResponseStreamChoice(BaseModel):
    index: int
    delta: ChatCompletionMessage
    finish_reason: Optional[Finish] = None


class ChatCompletionResponseUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class ChatCompletionResponse(BaseModel):
    id: Literal["chatcmpl-default"] = "chatcmpl-default"
    object: Literal["chat.completion"] = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseChoice]
    usage: ChatCompletionResponseUsage


class ChatCompletionStreamResponse(BaseModel):
    id: Literal["chatcmpl-default"] = "chatcmpl-default"
    object: Literal["chat.completion.chunk"] = "chat.completion.chunk"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseStreamChoice]


class ScoreEvaluationRequest(BaseModel):
    model: str
    messages: List[str]
    max_length: Optional[int] = None


class ScoreEvaluationResponse(BaseModel):
    id: Literal["scoreeval-default"] = "scoreeval-default"
    object: Literal["score.evaluation"] = "score.evaluation"
    model: str
    scores: List[float]


def dictify(data: "BaseModel") -> Dict[str, Any]:
    try:  # pydantic v2
        return data.model_dump(exclude_unset=True)
    except AttributeError:  # pydantic v1
        return data.dict(exclude_unset=True)


def jsonify(data: "BaseModel") -> str:
    try:  # pydantic v2
        return json.dumps(data.model_dump(exclude_unset=True), ensure_ascii=False)
    except AttributeError:  # pydantic v1
        return data.json(exclude_unset=True, ensure_ascii=False)


def create_app() -> "FastAPI":
    app = FastAPI()
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    semaphore = asyncio.Semaphore(int(os.environ.get("MAX_CONCURRENT", 1)))

    @app.get("/v1/models", response_model=ModelList)
    async def list_models():
        model_card = ModelCard(id="gpt-3.5-turbo")
        return ModelList(data=[model_card])

    @app.post("/v1/chat/completions", response_model=ChatCompletionResponse, status_code=status.HTTP_200_OK)
    async def create_chat_completion(request: ChatCompletionRequest):
        if len(request.messages) == 0 or request.messages[-1].role not in [Role.USER, Role.TOOL]:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid length")

        messages = [dictify(message) for message in request.messages]

        if len(messages) and messages[0]["role"] == Role.SYSTEM:
            system = messages.pop(0)["content"]
        else:
            system = None

        if len(messages) % 2 == 0:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Only supports u/a/u/a/u...")

        for i in range(len(messages)):
            if i % 2 == 0 and messages[i]["role"] not in [Role.USER, Role.TOOL]:
                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")
            elif i % 2 == 1 and messages[i]["role"] not in [Role.ASSISTANT, Role.FUNCTION]:
                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")
            elif messages[i]["role"] == Role.TOOL:
                messages[i]["role"] = Role.OBSERVATION

        tool_list = request.tools
        if len(tool_list):
            try:
                tools = json.dumps([tool_list[0]["function"]], ensure_ascii=False)
            except Exception:
                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid tools")
        else:
            tools = ""

        async with semaphore:
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None, chat_completion, messages, system, tools, request)

    def chat_completion(messages: Sequence[Dict[str, str]], system: str, tools: str, request: ChatCompletionRequest):
        if request.stream:
            generate = stream_chat_completion(messages, system, tools, request)
            return EventSourceResponse(generate, media_type="text/event-stream")

        responses = ollama.chat(
            model=request.model,
            messages=messages,
            options={
                "top_p": request.top_p,
                "temperature": request.temperature
            }
        )
        prompt_length, response_length = 0, 0
        choices = []
        result = responses['message']['content']
        response_message = ChatCompletionMessage(role=Role.ASSISTANT, content=result)
        finish_reason = Finish.STOP if responses.get("done", False) else Finish.LENGTH
        choices.append(
            ChatCompletionResponseChoice(index=0, message=response_message, finish_reason=finish_reason)
        )
        prompt_length = -1
        response_length += -1
        usage = ChatCompletionResponseUsage(
            prompt_tokens=prompt_length,
            completion_tokens=response_length,
            total_tokens=prompt_length + response_length,
        )

        return ChatCompletionResponse(model=request.model, choices=choices, usage=usage)

    def stream_chat_completion(
        messages: Sequence[Dict[str, str]], system: str, tools: str, request: ChatCompletionRequest
    ):
        choice_data = ChatCompletionResponseStreamChoice(
            index=0, delta=ChatCompletionMessage(role=Role.ASSISTANT, content=""), finish_reason=None
        )
        chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
        yield jsonify(chunk)

        for new_text in ollama.chat(
            model=request.model,
            messages=messages,
            stream=True,
            options={
                "top_p": request.top_p,
                "temperature": request.temperature
            }
        ):
            if len(new_text) == 0:
                continue

            choice_data = ChatCompletionResponseStreamChoice(
                index=0, delta=ChatCompletionMessage(content=new_text['message']['content']), finish_reason=None
            )
            chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
            yield jsonify(chunk)

        choice_data = ChatCompletionResponseStreamChoice(
            index=0, delta=ChatCompletionMessage(), finish_reason=Finish.STOP
        )
        chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
        yield jsonify(chunk)
        yield "[DONE]"

    return app


if __name__ == "__main__":
    app = create_app()
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("API_PORT", 8000)), workers=1)

3. Testing

curl --location 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen:7b",
  "messages": [{"role": "user", "content": "What is the OpenAI mission?"}],
  "stream": true,
  "temperature": 0.7,
  "top_p": 1
}'

In testing, the generation speed was around 8 tokens/s.
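
Because the demo mirrors the OpenAI chat-completions format, it can also be exercised with the official openai Python SDK instead of curl. Below is a minimal sketch, assuming openai >= 1.0 is installed (pip install openai); the api_key value is arbitrary because the demo performs no authentication:

# hypothetical client-side check against the local api_demo service
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

# streaming request against the qwen:7b model served by Ollama
stream = client.chat.completions.create(
    model="qwen:7b",
    messages=[{"role": "user", "content": "What is the OpenAI mission?"}],
    temperature=0.7,
    top_p=1,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()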

That wraps up this installment. Questions and comments are welcome!

Author: Xu Hui | Backend Development Engineer

For more AI tips, follow the "神州数码云基地" (Digital China Cloud Base) WeChat official account and reply "AI与数字化转型" to join the discussion group.

Copyright notice: this article was written and compiled by the 神州数码武汉云基地 (Digital China Wuhan Cloud Base) team based on hands-on practice. Please credit the source when reposting.
