A single machine typically has around 64 GB of memory, but modeling workloads can involve data at the 100 GB-plus scale, beyond the single-machine limit, so a distributed cluster is needed to process data of that size.
To improve computational efficiency and make full use of existing resources, multiple multi-core servers can be enlisted to process the data in parallel.
Dask is an application that sits downstream of the resource manager: it consolidates the machines' resources into a distributed cluster, and PyCaret runs its machine-learning computation through Dask. Install Dask with all of its optional components on every node:
pip install "dask[complete]"
(1) Scheduler (master) configuration
- # Run from the command line on the scheduler host
- dask-scheduler
- root@notebook-rn-20231114171325952s4sc-2qr37-0:~/.local/bin# ./dask-scheduler
- /root/.local/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
- warnings.warn(
- 2023-11-15 05:07:23,025 - distributed.scheduler - INFO - -----------------------------------------------
- 2023-11-15 05:07:23,660 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
- 2023-11-15 05:07:23,735 - distributed.scheduler - INFO - State start
- 2023-11-15 05:07:23,741 - distributed.scheduler - INFO - -----------------------------------------------
- 2023-11-15 05:07:23,743 - distributed.scheduler - INFO - Scheduler at: tcp://172.20.12.148:8786
- 2023-11-15 05:07:23,743 - distributed.scheduler - INFO - dashboard at: http://172.20.12.148:8787/status
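As the FutureWarning above indicates, the dask-scheduler entry point is deprecated; on recent releases the same process is launched through the dask CLI. A sketch with the ports from the log made explicit:
- # Newer CLI form; ports match the scheduler log above
- dask scheduler --port 8786 --dashboard-address :8787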
(2) Worker configuration
- # Run from the command line on each worker host, pointing at the scheduler address
- dask-worker 172.20.12.148:8786
- /root/.local/lib/python3.8/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
- warnings.warn(
- 2023-11-15 05:07:27,093 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.12.178:34413'
- 2023-11-15 05:07:28,847 - distributed.worker - INFO - Start worker at: tcp://172.20.12.178:41461
- 2023-11-15 05:07:28,847 - distributed.worker - INFO - Listening to: tcp://172.20.12.178:41461
- 2023-11-15 05:07:28,847 - distributed.worker - INFO - dashboard at: 172.20.12.178:37669
- 2023-11-15 05:07:28,848 - distributed.worker - INFO - Waiting to connect to: tcp://172.20.12.148:8786
- 2023-11-15 05:07:28,848 - distributed.worker - INFO - -------------------------------------------------
- 2023-11-15 05:07:28,848 - distributed.worker - INFO - Threads: 2
- 2023-11-15 05:07:28,848 - distributed.worker - INFO - Memory: 8.00 GiB
- 2023-11-15 05:07:28,848 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-w6vw3yg0
- 2023-11-15 05:07:28,848 - distributed.worker - INFO - -------------------------------------------------
- 2023-11-15 05:07:30,199 - distributed.worker - INFO - Registered to: tcp://172.20.12.148:8786
- 2023-11-15 05:07:30,200 - distributed.worker - INFO - -------------------------------------------------
- 2023-11-15 05:07:30,203 - distributed.core - INFO - Starting established connection to tcp://172.20.12.148:8786
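The log shows this worker auto-detected 2 threads and 8 GiB of memory. Both can also be pinned explicitly, which helps when several workers share one host; a sketch using the newer dask worker CLI against the same scheduler:
- # Cap the worker at 2 threads and 8 GiB, matching the auto-detected values above
- dask worker 172.20.12.148:8786 --nthreads 2 --memory-limit 8GiB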
With the cluster running, install PyCaret with its optional dependencies on the client machine:
pip install "pycaret[full]"
A quick end-to-end check that the client can reach the scheduler and fan work out across the workers:
- from dask.distributed import Client
- # Connect to the scheduler (172.20.12.148:8786, per the scheduler log above)
- client = Client('172.20.12.148:8786')
- def square(x):
-     return x ** 2
- def neg(x):
-     return -x
- # Map both functions across the cluster, then reduce with a single submit
- A = client.map(square, range(10))
- B = client.map(neg, A)
- total = client.submit(sum, B)
- total.result()
-285
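The same client also powers dask.dataframe, the natural fit for the 100 GB-scale data motivating this setup: a CSV is split into pandas partitions processed across the workers, and nothing runs until .compute(). A minimal sketch, assuming (hypothetically) the training data is reachable from every worker as files matching 'train-*.csv' with the same Survived column used below:
- import dask.dataframe as dd
- # Lazily partition the CSV files; each partition is a pandas DataFrame
- ddf = dd.read_csv('train-*.csv')
- # Triggers distributed execution across the workers
- print(ddf['Survived'].mean().compute())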
- import pandas as pd
- df = pd.read_csv('train.csv')
- # init setup: initialize the PyCaret classification experiment
- from pycaret.classification import *
- clf1 = setup(data=df, target='Survived', n_jobs=-1)
- # import the parallel back-end and run model comparison on the Dask cluster
- from pycaret.parallel import FugueBackend
- top3 = compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)
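compare_models with n_select=3 returns the three best estimators, which then plug into the usual PyCaret flow; a short sketch using standard PyCaret calls, independent of the Dask back-end:
- # Score the best model on the hold-out split, then retrain it on all data
- best = top3[0]
- predict_model(best)
- final = finalize_model(best)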