
Monitoring NVIDIA GPUs with docker-compose

How can a Java service running in a docker-compose container access nvidia-smi? The approach here is to run nvidia_gpu_exporter as its own container: it wraps nvidia-smi and exposes the readings as Prometheus metrics on port 9835, so any service (or Prometheus itself) can read GPU data over HTTP.

Option 1: run nvidia_smi_exporter with docker run

docker run -d --name nvidia_smi_exporter \
  --restart unless-stopped \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia1:/dev/nvidia1 \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
  -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
  -p 9835:9835 \
  utkuozdemir/nvidia_gpu_exporter:1.1.0
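This command assumes two GPUs (/dev/nvidia0 and /dev/nvidia1) and the Debian/Ubuntu library path /usr/lib/x86_64-linux-gnu; adjust the --device and -v flags to match your host. A quick way to check what is actually there before starting the container:

# List the GPU device nodes present on this host (one /dev/nvidiaN per GPU)
ls -l /dev/nvidia*
# Find where the NVML library actually lives (the path differs across distros)
find /usr -name 'libnvidia-ml.so*' 2>/dev/null
# Confirm nvidia-smi works on the host before mounting it into the container
nvidia-smi -L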

Option 2: run nvidia_smi_exporter with docker-compose (recommended)

version: '3'
services:
  nvidia_smi_exporter:
    image: utkuozdemir/nvidia_gpu_exporter:1.1.0
    container_name: nvidia_smi_exporter
    restart: unless-stopped
    devices:
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidia1:/dev/nvidia1
    volumes:
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi
    ports:
      - "9835:9835"
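Save this as docker-compose.yml and bring the exporter up (newer Docker installs use `docker compose`, older ones the standalone `docker-compose` binary):

# Start the exporter in the background
docker-compose up -d
# Check container status, and read the logs if it is not running
docker-compose ps
docker-compose logs nvidia_smi_exporter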

Verify the exporter output with curl:

curl 127.0.0.1:9835/metrics

nvidia_smi_temperature_memory{uuid="a52aea53-bc0c-61ca-8f99-df0926347ce2"} 57
# HELP nvidia_smi_utilization_gpu_ratio utilization.gpu [%]
# TYPE nvidia_smi_utilization_gpu_ratio gauge
nvidia_smi_utilization_gpu_ratio{uuid="9aad4dc0-0be0-c871-2f84-6b990152f5ec"} 0
nvidia_smi_utilization_gpu_ratio{uuid="a52aea53-bc0c-61ca-8f99-df0926347ce2"} 1
# HELP nvidia_smi_utilization_memory_ratio utilization.memory [%]
# TYPE nvidia_smi_utilization_memory_ratio gauge
nvidia_smi_utilization_memory_ratio{uuid="9aad4dc0-0be0-c871-2f84-6b990152f5ec"} 0
nvidia_smi_utilization_memory_ratio{uuid="a52aea53-bc0c-61ca-8f99-df0926347ce2"} 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 22.55
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.7002496e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.70476923874e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.40630528e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 1447
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
... (output truncated)
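The alert rules below reference the memory, temperature, power, and ECC series. Which of these your exporter actually exposes depends on the GPU model and driver, so it is worth filtering the scrape output for them first:

# Show only the series referenced by the alert rules below
curl -s 127.0.0.1:9835/metrics | grep -E 'nvidia_smi_(memory_(used|total)_bytes|temperature_gpu|power_(draw|limit)_watts|ecc_errors)'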

Alerting rules

groups:
  - name: example
    rules:
      - alert: GPUMemoryUsageHigh
        expr: (nvidia_smi_memory_used_bytes / ignoring(instance) nvidia_smi_memory_total_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory usage is above 90%"
          description: "GPU {{ $labels.uuid }} has been using more than 90% of its memory for over 5 minutes."
      - alert: GPUTemperatureHigh
        expr: nvidia_smi_temperature_gpu > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature is above 80 degrees"
          description: "GPU {{ $labels.uuid }} has been above 80°C for over 5 minutes."
      - alert: GPUPowerUsageHigh
        expr: nvidia_smi_power_draw_watts / nvidia_smi_power_limit_watts > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU power usage is above 90% of its limit"
          description: "GPU {{ $labels.uuid }} has been drawing more than 90% of its power limit for over 5 minutes."
      - alert: GPUECCErrors
        expr: increase(nvidia_smi_ecc_errors_corrected_volatile_total[1h]) > 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "GPU ECC errors increasing"
          description: "The number of corrected ECC errors on GPU {{ $labels.uuid }} has increased over the past hour."
      - alert: GPUPowerDrawHigh
        expr: nvidia_smi_power_draw_watts > 250
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU power draw exceeds 250 W"
          description: "GPU {{ $labels.uuid }} has been drawing more than 250 W for over 5 minutes."
      # - alert: GPUUtilizationHigh
      #   expr: nvidia_smi_utilization_gpu_ratio > 0.9
      #   for: 1h
      #   labels:
      #     severity: warning
      #   annotations:
      #     summary: "GPU utilization is high"
      #     description: "GPU utilization has been above 90% for more than 1 hour."
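Prometheus only evaluates these rules once they are listed under rule_files in prometheus.yml. A minimal sketch, assuming the rules above are saved as gpu_alerts.yml next to the main config (the filename is arbitrary):

# prometheus.yml fragment: load the GPU alert rules defined above
rule_files:
  - "gpu_alerts.yml"

If promtool is available, `promtool check rules gpu_alerts.yml` validates the file before you reload Prometheus.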

Prometheus configuration

- job_name: "A100 server"
  scrape_interval: 15s
  static_configs:
    - targets: ["IP:9835"]
      labels:
        instance: xxxxx
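After editing prometheus.yml, the change can be validated and picked up without restarting blindly. The commands below assume Prometheus listens on localhost:9090 and was started with --web.enable-lifecycle (without that flag, restart the service instead of calling /-/reload):

# Validate the main config and any rule files it references
promtool check config prometheus.yml
# Trigger a live configuration reload
curl -X POST http://localhost:9090/-/reload
# Confirm the nvidia_smi_exporter target is healthy
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'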

Grafana dashboard

Dashboard ID: 14574

Link: Nvidia GPU Metrics | Grafana Labs (in Grafana, use Dashboards → Import and enter the ID above).
