
Slurm Cluster Deployment on CentOS 7 (Test Environment)

Main use cases:

  1. Multiple local servers can be combined into a cluster for compute jobs; Slurm provides the job scheduling and orchestration.
  2. Local and cloud servers can form a hybrid cloud, and Slurm can be used there as well, although with a hybrid cloud you can usually get technical support directly from the vendor.
  3. Cloud HPC platforms and supercomputing centers also use Slurm for job scheduling. In the cloud there is no need to deploy it yourself, you simply use it, and because it is maintained by dedicated teams it tends to be more stable.

Steps:

  1. Prepare three nodes (i.e. three virtual machines): configure the network and time, and set the hostname on each.

IP                                hostname

192.168.60.24                      master

192.168.60.58                      slurm-node1

192.168.60.60                      slurm-node2

The nodes must be able to reach each other over the network. If no DNS server is available, add the mappings to /etc/hosts; this must be done on every node:

vim /etc/hosts

192.168.60.24                      master

192.168.60.58                      slurm-node1

192.168.60.60                      slurm-node2

ntpdate -u ntp.api.bz     # synchronize the clock; run this on all three nodes
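
A quick sanity check before moving on (a minimal sketch; it assumes the hostnames above are already in /etc/hosts on the node you run it from):

for h in master slurm-node1 slurm-node2; do ping -c 1 $h; done   # every name should resolve and answer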

  2. Disable the firewall and SELinux

systemctl stop firewalld

systemctl disable firewalld

setenforce 0

getenforce

vim /etc/selinux/config

SELINUX=permissive
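
The same persistent change can also be made non-interactively (a sketch that simply rewrites the SELINUX= line in the file edited above):

sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config   # setenforce only affects the running system; this survives reboots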

  3. Set up SSH keys (this step is not strictly required; it just makes remote administration easier)

Generate an SSH key pair on the control node:

ssh-keygen -t rsa

Press Enter through all the prompts; this creates the private key id_rsa and the public key id_rsa.pub in the $HOME/.ssh directory.

Copy the public key into the .ssh/authorized_keys file on each managed node to enable password-less login:

ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.60.58

ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.60.60
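
To confirm that password-less login works (a minimal check, run from the master node):

for h in slurm-node1 slurm-node2; do ssh root@$h hostname; done   # should print each hostname without asking for a password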

4. Install dependencies (on all three nodes)

yum install -y epel-release

yum install -y gtk2 gtk2-devel munge munge-devel python python3

yum clean all

yum makecache
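
Optionally verify that the key packages landed (a quick check against the packages installed above):

rpm -q munge munge-devel gtk2   # each should report a version rather than "is not installed"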

5. Configure MUNGE (required on all three machines; Slurm uses MUNGE for authentication)

chown root:root /etc/munge

chown root:root /var/run/munge

chown root:root /var/lib/munge

chown root:root /var/log/munge

create-munge-key   # this step only needs to be done on the master node

chown root:root /etc/munge/munge.key

scp /etc/munge/munge.key slurm-node1:/etc/munge/

scp /etc/munge/munge.key slurm-node2:/etc/munge/

munged   # start the munged daemon (on every node)
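
Before testing, it is worth confirming that the key really is identical on every node (a minimal check using md5sum):

md5sum /etc/munge/munge.key
ssh slurm-node1 md5sum /etc/munge/munge.key
ssh slurm-node2 md5sum /etc/munge/munge.key   # all three checksums must match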

Check that MUNGE authentication works across nodes:

munge -n | unmunge              # local check on the current node

munge -n | ssh master unmunge   # cross-node check: run from a compute node against the master

6. Install Slurm

yum install -y slurm*

The wildcard matches every package whose name begins with slurm. Only a handful of these packages are strictly required; the rest can be installed as needed.
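
To see exactly what was pulled in (a quick check):

rpm -qa | grep -i slurm   # list every slurm-related package that was installed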

7. Edit the configuration file

vim /etc/slurm/slurm.conf

(The complete configuration file is shown below. In practice only a few lines need to be changed from the defaults, chiefly ControlMachine and the node and partition definitions at the end; adjust the rest as needed.)

#

# See the slurm.conf man page for more information.

#

ControlMachine=master

#ControlAddr=127.0.0.1

#BackupController=

#BackupAddr=

#

AuthType=auth/munge

#CheckpointType=checkpoint/none

CryptoType=crypto/munge

#DisableRootJobs=NO

#EnforcePartLimits=NO

#Epilog=

#EpilogSlurmctld=

#FirstJobId=1

#MaxJobId=999999

#GresTypes=

#GroupUpdateForce=0

#GroupUpdateTime=600

#JobCheckpointDir=/var/slurm/checkpoint

#JobCredentialPrivateKey=

#JobCredentialPublicCertificate=

#JobFileAppend=0

#JobRequeue=1

#JobSubmitPlugins=

#KillOnBadExit=0

#LaunchType=launch/slurm

#Licenses=foo*4,bar

#MailProg=/bin/true

#MaxJobCount=5000

#MaxStepCount=40000

#MaxTasksPerNode=128

MpiDefault=none

#MpiParams=ports=#-#

#PluginDir=

#PlugStackConfig=

#PrivateData=jobs

ProctrackType=proctrack/cgroup

#Prolog=

#PrologFlags=

#PrologSlurmctld=

#PropagatePrioProcess=0

#PropagateResourceLimits=

#PropagateResourceLimitsExcept=

#RebootProgram=

ReturnToService=1

#SallocDefaultCommand=

SlurmctldPidFile=/var/run/slurm/slurmctld.pid

SlurmctldPort=6817

SlurmdPidFile=/var/run/slurm/slurmd.pid

SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurm/d

SlurmUser=root

#SlurmdUser=root

#SrunEpilog=

#SrunProlog=

StateSaveLocation=/var/spool/slurm/ctld

SwitchType=switch/none

#TaskEpilog=

TaskPlugin=task/none

#TaskPluginParam=

#TaskProlog=

#TopologyPlugin=topology/tree

#TmpFS=/tmp

#TrackWCKey=no

#TreeWidth=

#UnkillableStepProgram=

#UsePAM=0

#

#

# TIMERS

#BatchStartTimeout=10

#CompleteWait=0

#EpilogMsgTime=2000

#GetEnvTimeout=2

#HealthCheckInterval=0

#HealthCheckProgram=

InactiveLimit=0

KillWait=30

#MessageTimeout=10

#ResvOverRun=0

MinJobAge=300

#OverTimeLimit=0

SlurmctldTimeout=120

SlurmdTimeout=300

#UnkillableStepTimeout=60

#VSizeFactor=0

Waittime=0

#

#

# SCHEDULING

#DefMemPerCPU=0

#FastSchedule=1

#MaxMemPerCPU=0

#SchedulerTimeSlice=30

SchedulerType=sched/backfill

SelectType=select/linear

#SelectTypeParameters=

#

#

# JOB PRIORITY

#PriorityFlags=

#PriorityType=priority/basic

#PriorityDecayHalfLife=

#PriorityCalcPeriod=

#PriorityFavorSmall=

#PriorityMaxAge=

#PriorityUsageResetPeriod=

#PriorityWeightAge=

#PriorityWeightFairshare=

#PriorityWeightJobSize=

#PriorityWeightPartition=

#PriorityWeightQOS=

#

#

# LOGGING AND ACCOUNTING

#AccountingStorageEnforce=0

#AccountingStorageHost=

#AccountingStorageLoc=

#AccountingStoragePass=

#AccountingStoragePort=

AccountingStorageType=accounting_storage/none

#AccountingStorageUser=

AccountingStoreJobComment=YES

ClusterName=cluster

#DebugFlags=

#JobCompHost=

#JobCompLoc=

#JobCompPass=

#JobCompPort=

JobCompType=jobcomp/none

#JobCompUser=

#JobContainerType=job_container/none

JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=3

#SlurmctldLogFile=

SlurmdDebug=3

#SlurmdLogFile=

#SlurmSchedLogFile=

#SlurmSchedLogLevel=

#

#

# POWER SAVE SUPPORT FOR IDLE NODES (optional)

#SuspendProgram=

#ResumeProgram=

#SuspendTimeout=

#ResumeTimeout=

#ResumeRate=

#SuspendExcNodes=

#SuspendExcParts=

#SuspendRate=

#SuspendTime=

#

#

# COMPUTE NODES

NodeName=master CPUs=4 Sockets=1 CoresPerSocket=4 RealMemory=2000 State=UNKNOWN

PartitionName=control Nodes=master Default=NO MaxTime=INFINITE State=UP

NodeName=slurm-node1,slurm-node2 CPUs=4 Sockets=1 CoresPerSocket=4 RealMemory=2000 State=UNKNOWN

PartitionName=compute Nodes=slurm-node1,slurm-node2 Default=YES MaxTime=INFINITE State=UP

#NodeName=slurm-node2 CPUs=4 Sockets=1 CoresPerSocket=4 RealMemory=2000 State=UNKNOWN

#PartitionName=compute Nodes=slurm-node2 Default=YES MaxTime=INFINITE State=UP
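
slurm.conf must be identical on every node, so copy it to the compute nodes after editing. The slurmd -C command prints the CPUs/Sockets/RealMemory values Slurm detects on a node, which is a convenient way to fill in or double-check the NodeName lines above. A sketch (the directories are the ones referenced in this slurm.conf; depending on how the packages were built they may need to be created by hand):

slurmd -C                                            # print this node's detected hardware in slurm.conf syntax
scp /etc/slurm/slurm.conf slurm-node1:/etc/slurm/
scp /etc/slurm/slurm.conf slurm-node2:/etc/slurm/
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/run/slurm   # state dir (master), spool and PID dirs (all nodes)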

8. Start Slurm

On the master node, run:

slurmctld -c   # start the controller daemon (-c discards any previously saved controller state)

slurmd -c      # the master also runs slurmd because it is defined as a compute node above (-c clears stale locks)

On each compute node, run:

slurmd -c      # start the compute-node daemon

Run sinfo to check the cluster status.

Debugging:

If a daemon fails to start, run it in the foreground with verbose logging to see what is going wrong:

slurmctld -DVVVVV

slurmd -DVVVVV
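
A common case is a node that stays in the down or drained state even though slurmd is running; as a sketch, it can be inspected and returned to service with scontrol (node names are the ones defined in slurm.conf):

scontrol show node slurm-node1                       # the Reason= field explains why the node was marked down
scontrol update NodeName=slurm-node1 State=RESUME    # return a down/drained node to service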

A simple example of submitting a job is shown below; more detailed usage documentation is in a separate document.
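
(A minimal sketch; the script name test_job.sh and its #SBATCH options are illustrative, not part of the original guide.)

srun -N2 hostname              # run a trivial command on two nodes interactively

cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --nodes=1
hostname
sleep 30
EOF

sbatch test_job.sh             # submit the script as a batch job
squeue                         # the job should appear in the queue
sinfo                          # nodes running the job change from idle to alloc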
