
ollama qwen2:7b performance test (qwen2 concurrency test)

Deployment environment:

(base) root@alg-dev17:/opt# lscpu
Architecture:             x86_64
CPU op-mode(s):           32-bit, 64-bit
Address sizes:            45 bits physical, 48 bits virtual
Byte Order:               Little Endian
CPU(s):                   8
On-line CPU(s) list:      0-7
Vendor ID:                GenuineIntel
Model name:               Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
CPU family:               6
Model:                    85
Thread(s) per core:       1
Core(s) per socket:       1
Socket(s):                8
Stepping:                 4
BogoMIPS:                 4589.21
Flags:                    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:      VMware
  Virtualization type:    full
Caches (sum of all):
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     8 MiB (8 instances)
  L3:                     198 MiB (8 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:
  Gather data sampling:   Unknown: Dependent on hypervisor status
  Itlb multihit:          KVM: Mitigation: VMX unsupported
  L1tf:                   Mitigation; PTE Inversion
  Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:               Mitigation; IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Syscall hardening, KVM SW loop
  Srbds:                  Not affected
  Tsx async abort:        Not affected
(base) root@alg-dev17:/opt# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       2.4Gi        12Gi        18Mi       1.2Gi        10Gi
Swap:          3.8Gi          0B       3.8Gi
(base) root@alg-dev17:/opt# nvidia-smi
Fri Jun 28 09:17:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     On  |   00000000:0B:00.0 Off |                  N/A |
| 23%   28C    P8              8W / 250W  |      3MiB / 11264MiB   |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Model:

(base) root@alg-dev17:/opt# ollama list
NAME            ID              SIZE    MODIFIED     
qwen2:7b        e0d4e1163c58    4.4 GB  22 hours ago

ollama service configuration:

(base) root@alg-dev17:/opt# cat /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin"
Environment="OLLAMA_NUM_PARALLEL=16"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_HOST=0.0.0.0"
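The key setting above is OLLAMA_NUM_PARALLEL=16: each loaded model decodes at most 16 requests at once, so any client-side concurrency beyond that only queues inside the server. A minimal sketch of the resulting back-of-envelope wall-clock estimate (the 10-second per-request latency below is a made-up placeholder, not a measured value):

```python
import math

def estimated_wall_clock(total_requests: int,
                         client_concurrency: int,
                         server_parallel: int,
                         per_request_seconds: float) -> float:
    """Rough lower bound: requests complete in waves of size min(C, P)."""
    effective = min(client_concurrency, server_parallel)
    waves = math.ceil(total_requests / effective)
    return waves * per_request_seconds

# With the settings used later (C=50, N=2000) and OLLAMA_NUM_PARALLEL=16,
# only 16 requests decode at once; the other 34 in-flight requests queue.
print(estimated_wall_clock(2000, 50, 16, 10.0))  # 1250.0
```

This is only a lower-bound sketch; real per-request latency grows with batch size, which is why the measured totals below are larger.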


Test script (adapted from a script shared by a CSDN user):

import aiohttp
import asyncio
import time
from tqdm import tqdm
import random

questions = [
    "Why is the sky blue?", "Why do we dream?", "Why is the ocean salty?", "Why do leaves change color?",
    "Why do birds sing?", "Why do we have seasons?", "Why do stars twinkle?", "Why do we yawn?",
    "Why is the sun hot?", "Why do cats purr?", "Why do dogs bark?", "Why do fish swim?",
    "Why do we have fingerprints?", "Why do we sneeze?", "Why do we have eyebrows?", "Why do we have hair?",
    "Why do we have nails?", "Why do we have teeth?", "Why do we have bones?", "Why do we have muscles?",
    "Why do we have blood?", "Why do we have a heart?", "Why do we have lungs?", "Why do we have a brain?",
    "Why do we have skin?", "Why do we have ears?", "Why do we have eyes?", "Why do we have a nose?",
    "Why do we have a mouth?", "Why do we have a tongue?", "Why do we have a stomach?", "Why do we have intestines?",
    "Why do we have a liver?", "Why do we have kidneys?", "Why do we have a bladder?", "Why do we have a pancreas?",
    "Why do we have a spleen?", "Why do we have a gallbladder?", "Why do we have a thyroid?", "Why do we have adrenal glands?",
    "Why do we have a pituitary gland?", "Why do we have a hypothalamus?", "Why do we have a thymus?", "Why do we have lymph nodes?",
    "Why do we have a spinal cord?", "Why do we have nerves?", "Why do we have a circulatory system?", "Why do we have a respiratory system?",
    "Why do we have a digestive system?", "Why do we have an immune system?"
]

async def fetch(session, url):
    """
    Args:
        session (aiohttp.ClientSession): session used for the request.
        url (str): URL to send the request to.
    Returns:
        tuple: completion token count and request time.
    """
    start_time = time.time()
    # Pick a random question
    question = random.choice(questions)  # <--- exactly one of these two lines must be commented out
    # Fixed question
    # question = questions[0]  # <--- exactly one of these two lines must be commented out
    # Request payload
    json_payload = {
        "model": "qwen2:7b",
        "messages": [{"role": "user", "content": question}],
        "stream": False,
        "temperature": 0.7  # 0.7 so each run's output differs slightly
    }
    async with session.post(url, json=json_payload) as response:
        response_json = await response.json()
        print(f"{response_json}")
        end_time = time.time()
        request_time = end_time - start_time
        completion_tokens = response_json['usage']['completion_tokens']  # number of generated tokens, from the response
        return completion_tokens, request_time

async def bound_fetch(sem, session, url, pbar):
    # The semaphore caps concurrency so we never exceed the maximum number of in-flight requests
    async with sem:
        result = await fetch(session, url)
        pbar.update(1)
        return result

async def run(load_url, max_concurrent_requests, total_requests):
    """
    Run the benchmark by issuing many concurrent requests.
    Args:
        load_url (str): URL to send requests to.
        max_concurrent_requests (int): maximum number of concurrent requests.
        total_requests (int): total number of requests to send.
    Returns:
        tuple: total completion tokens and the list of response times.
    """
    # Semaphore limiting the number of concurrent requests
    sem = asyncio.Semaphore(max_concurrent_requests)
    # Async HTTP session shared by all requests
    async with aiohttp.ClientSession() as session:
        tasks = []
        # Progress bar to visualize request progress
        with tqdm(total=total_requests) as pbar:
            # Create one task per request, each bounded by the semaphore
            for _ in range(total_requests):
                task = asyncio.ensure_future(bound_fetch(sem, session, load_url, pbar))
                tasks.append(task)
            # Wait for all tasks to finish and collect their results
            results = await asyncio.gather(*tasks)
        # Sum the completion tokens across all results
        completion_tokens = sum(result[0] for result in results)
        # Extract the per-request response times
        response_times = [result[1] for result in results]
        return completion_tokens, response_times

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        print("Usage: python bench.py <C> <N>")
        sys.exit(1)
    C = int(sys.argv[1])  # maximum concurrency
    N = int(sys.argv[2])  # total number of requests
    # Both vllm and ollama expose an OpenAI-compatible API, which makes testing simpler
    url = 'http://10.1.9.167:11434/v1/chat/completions'
    start_time = time.time()
    completion_tokens, response_times = asyncio.run(run(url, C, N))
    end_time = time.time()
    # Total wall-clock time
    total_time = end_time - start_time
    # Average time per request
    avg_time_per_request = sum(response_times) / len(response_times)
    # Tokens generated per second
    tokens_per_second = completion_tokens / total_time
    print(f'Performance Results:')
    print(f'  Total requests            : {N}')
    print(f'  Max concurrent requests   : {C}')
    print(f'  Total time                : {total_time:.2f} seconds')
    print(f'  Average time per request  : {avg_time_per_request:.2f} seconds')
    print(f'  Tokens per second         : {tokens_per_second:.2f}')
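Stripped of the HTTP details, the script's concurrency control is just a semaphore wrapped around `asyncio.gather`. The self-contained sketch below (using `asyncio.sleep` in place of a real request) demonstrates that the number of in-flight tasks never exceeds the limit:

```python
import asyncio

async def fake_fetch(sem, active, peak):
    # Simulate one request with a short sleep instead of an HTTP call.
    async with sem:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)
        active[0] -= 1

async def demo(max_concurrent=4, total=20):
    sem = asyncio.Semaphore(max_concurrent)
    active, peak = [0], [0]
    await asyncio.gather(*(fake_fetch(sem, active, peak) for _ in range(total)))
    return peak[0]

peak = asyncio.run(demo())
print(peak)  # peak in-flight tasks never exceeds the semaphore limit of 4
```

The same pattern works for any rate-limited client workload; only `fake_fetch` needs replacing with the real request coroutine.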

Run 1 results:

  Performance Results:
  Total requests            : 2000
  Max concurrent requests   : 50
  Total time                : 8360.14 seconds
  Average time per request  : 206.93 seconds
  Tokens per second         : 83.43
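The three reported numbers are mutually consistent, which is a quick sanity check worth doing on any benchmark run. A small sketch of the arithmetic:

```python
# Cross-check of Run 1 (C=50, N=2000) using the numbers reported above.
total_time = 8360.14            # seconds
avg_time_per_request = 206.93   # seconds
tokens_per_second = 83.43

# Total generated tokens implied by the throughput figure:
total_tokens = tokens_per_second * total_time
print(round(total_tokens))  # 697486

# With 50 requests always in flight, 2000 requests take roughly 2000/50 "waves":
estimated_total = 2000 / 50 * avg_time_per_request
print(round(estimated_total, 2))  # 8277.2, close to the measured 8360.14 s
```

The small gap between the estimate and the measured total is expected: request latencies vary, and stragglers at the end of the run leave some concurrency slots idle.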

Run 2 results:




GPU memory usage during the test:

(base) root@alg-dev17:~# nvidia-smi
Thu Jun 27 16:21:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     On  |   00000000:0B:00.0 Off |                  N/A |
| 35%   64C    P2            212W / 250W  |    7899MiB / 11264MiB  |     83%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      9218      C   ...unners/cuda_v11/ollama_llama_server       7896MiB |
+-----------------------------------------------------------------------------------------+

For reference only; please credit the source when reposting!
