
Tool Series: Introduction to PyCaret - Building a Distributed Computing Cluster with Dask


I. Purpose

1. Single-machine capacity limits

A single machine typically has about 64 GB of memory, while the data needed for modeling can easily exceed 100 GB, which is beyond what one machine can hold. A distributed cluster is therefore needed to process data at that scale.

2. Making full use of existing compute resources

To improve computational efficiency and make full use of the compute resources already available, multiple servers and their cores can be used together to process large-scale data.

II. Architecture

Dask sits downstream of the resource manager: it combines the resources of the individual virtual machines into one distributed cluster, and PyCaret runs its machine-learning computation on top of Dask.
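As a rough sketch of how the pieces fit together (using the scheduler address tcp://172.20.12.148:8786 that is set up in section IV below), the client side only needs a dask.distributed.Client pointed at the scheduler; the scheduler then dispatches work to the registered workers:

from dask.distributed import Client

# Connect to the running scheduler (address taken from section IV below)
client = Client("tcp://172.20.12.148:8786")

# The scheduler knows every worker that has registered with it
print(client.scheduler_info()["workers"].keys())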

III. References

1. Dask introduction

https://www.dask.org

2. PyCaret introduction

http://www.pycaret.org/

IV. Deployment and installation

1. Install Dask

pip install "dask[complete]"
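
A quick, minimal check that the installation succeeded is to import the two packages and print their versions:

import dask
import distributed

# Confirm that both the core and distributed packages are importable
print(dask.__version__, distributed.__version__)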

2. Configure the Dask cluster

(1) Scheduler (primary) server configuration

# Run on the command line
dask-scheduler
root@notebook-rn-20231114171325952s4sc-2qr37-0:~/.local/bin# ./dask-scheduler
/root/.local/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
  warnings.warn(
2023-11-15 05:07:23,025 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-15 05:07:23,660 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-11-15 05:07:23,735 - distributed.scheduler - INFO - State start
2023-11-15 05:07:23,741 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-15 05:07:23,743 - distributed.scheduler - INFO - Scheduler at: tcp://172.20.12.148:8786
2023-11-15 05:07:23,743 - distributed.scheduler - INFO - dashboard at: http://172.20.12.148:8787/status
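
The log above shows the scheduler listening on tcp://172.20.12.148:8786 with the dashboard on port 8787. If those defaults clash with other services, the ports can be set explicitly; the line below is a sketch using the scheduler CLI's standard --port and --dashboard-address options (the newer `dask scheduler` spelling also avoids the deprecation warning shown above):

# Run on the command line: bind the scheduler to port 8786 and the dashboard to 8787
dask scheduler --port 8786 --dashboard-address :8787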

(2) Worker server configuration

# Run on the command line
dask-worker 172.20.12.148:8786
/root/.local/lib/python3.8/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
2023-11-15 05:07:27,093 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.12.178:34413'
2023-11-15 05:07:28,847 - distributed.worker - INFO - Start worker at: tcp://172.20.12.178:41461
2023-11-15 05:07:28,847 - distributed.worker - INFO - Listening to: tcp://172.20.12.178:41461
2023-11-15 05:07:28,847 - distributed.worker - INFO - dashboard at: 172.20.12.178:37669
2023-11-15 05:07:28,848 - distributed.worker - INFO - Waiting to connect to: tcp://172.20.12.148:8786
2023-11-15 05:07:28,848 - distributed.worker - INFO - -------------------------------------------------
2023-11-15 05:07:28,848 - distributed.worker - INFO - Threads: 2
2023-11-15 05:07:28,848 - distributed.worker - INFO - Memory: 8.00 GiB
2023-11-15 05:07:28,848 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-w6vw3yg0
2023-11-15 05:07:28,848 - distributed.worker - INFO - -------------------------------------------------
2023-11-15 05:07:30,199 - distributed.worker - INFO - Registered to: tcp://172.20.12.148:8786
2023-11-15 05:07:30,200 - distributed.worker - INFO - -------------------------------------------------
2023-11-15 05:07:30,203 - distributed.core - INFO - Starting established connection to tcp://172.20.12.148:8786
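
By default the worker above picked up 2 threads and an 8 GiB memory limit from the machine it runs on. These can be tuned per worker; the line below is a sketch assuming a recent distributed release where the `dask worker` command and the --nworkers option are available:

# Run on the command line: 2 worker processes, each with 2 threads and a 4 GiB memory limit
dask worker tcp://172.20.12.148:8786 --nworkers 2 --nthreads 2 --memory-limit 4GiB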

3. Install PyCaret

pip install "pycaret[full]"
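
As with Dask, a minimal import check confirms the installation on the client machine:

import pycaret

# Print the installed PyCaret version
print(pycaret.__version__)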

V. Functional testing

1. Test Dask

from dask.distributed import Client

# Connect to the scheduler started in section IV (tcp://172.20.12.148:8786)
client = Client('172.20.12.148:8786')

def square(x):
    return x ** 2

def neg(x):
    return -x

# Map both functions across the cluster, then sum the results remotely
A = client.map(square, range(10))
B = client.map(neg, A)
total = client.submit(sum, B)
total.result()
-285
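
The result is the negated sum of squares of 0 through 9, i.e. -(0 + 1 + 4 + ... + 81) = -285, so the round trip through the cluster worked. A quick way to confirm which workers are attached is to inspect them from the same client (standard distributed API):

# Thread count per registered worker, keyed by worker address
print(client.nthreads())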

2. Test PyCaret cluster computation

import pandas as pd

# Load the training data (a dataset with a 'Survived' target column)
df = pd.read_csv('train.csv')

# Initialize the PyCaret classification experiment
from pycaret.classification import *
clf1 = setup(data=df, target='Survived', n_jobs=-1)

# Import the parallel back-end and compare models through Dask
from pycaret.parallel import FugueBackend
compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)
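
Note that FugueBackend("dask") by itself attaches to whatever Dask environment exists on the machine running the notebook. To direct compare_models at the cluster built in section IV, one option is to create a dask.distributed.Client against the remote scheduler first and hand it to the backend; this sketch assumes (as suggested by PyCaret's parallel-processing guide and Fugue's Dask engine) that FugueBackend accepts a live Client:

from dask.distributed import Client
from pycaret.parallel import FugueBackend

# Connect to the remote scheduler started in section IV
# (assumption: the Fugue Dask engine will use this client when it is passed in)
client = Client("tcp://172.20.12.148:8786")

# Re-run the comparison against the remote cluster; setup() from the block above
# must already have been called in this session
best_models = compare_models(n_select=3, parallel=FugueBackend(client), verbose=True)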
