Slurm cluster deployment on CentOS 7 (test environment)

Main use cases:

Steps:

1. Hosts and name resolution

IP              hostname
192.168.60.24   master
192.168.60.58   slurm-node1
192.168.60.60   slurm-node2
All nodes must be able to reach each other over the network. If no DNS server is set up, add the entries to the hosts file on every node:
vim /etc/hosts
192.168.60.24 master
192.168.60.58 slurm-node1
192.168.60.60 slurm-node2
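Editing /etc/hosts by hand on each node is error-prone; as a sketch, the same entries can be written once and then appended on every machine. The fragment below writes to a temp file so nothing on the current host is touched; point it at /etc/hosts when running for real.

```shell
# Build the hosts fragment once; append it to /etc/hosts on every node.
# Written to a temp file here so this sketch is safe to run anywhere.
HOSTS_FRAG=$(mktemp)
cat > "$HOSTS_FRAG" <<'EOF'
192.168.60.24 master
192.168.60.58 slurm-node1
192.168.60.60 slurm-node2
EOF
wc -l < "$HOSTS_FRAG"    # 3 lines, one per node
```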
2. Time sync, firewall, and SELinux

ntpdate -u ntp.api.bz    # synchronize the clock; run on all three nodes
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
getenforce
vim /etc/selinux/config
SELINUX=permissive
3. Passwordless SSH

Generate an SSH key pair on the control host (master):
ssh-keygen -t rsa
Press Enter through every prompt; this creates the private key id_rsa and the public key id_rsa.pub under $HOME/.ssh.
Copy the public key into .ssh/authorized_keys on each client node so the management host can log in without a password:
ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.60.58
ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.60.60
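With more nodes this gets tedious; a hedged sketch of the same step as a loop over the node list from the table above. It is shown as a dry run via echo — remove the echo to actually push the key.

```shell
# Compute-node IPs from the table above; adjust to your environment.
NODES="192.168.60.58 192.168.60.60"
for ip in $NODES; do
    # Dry run: prints the command instead of executing it.
    echo ssh-copy-id -i ~/.ssh/id_rsa.pub "root@$ip"
done
```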
4. Install dependencies
yum install -y epel-release
yum install -y gtk2 gtk2-devel munge munge-devel python python3
yum clean all
yum makecache
5. Configure munge    # on all three nodes (Slurm uses munge for authentication)
chown root:root /etc/munge
chown root:root /var/run/munge
chown root:root /var/lib/munge
chown root:root /var/log/munge
create-munge-key    # only run this on the master node
chown root:root /etc/munge/munge.key
scp /etc/munge/munge.key slurm-node1:/etc/munge/
scp /etc/munge/munge.key slurm-node2:/etc/munge/
munged    # start the munge daemon on every node
Check that munge credentials decode correctly:
munge -n | unmunge
munge -n | ssh master unmunge
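To test every node in one pass, a small loop (using the node names defined earlier) can round-trip a credential through each host. Shown as a dry run; drop the echo once passwordless SSH is in place.

```shell
# Round-trip a munge credential through every node (run from master).
# Dry run via echo; remove it to actually execute each pipeline.
for host in master slurm-node1 slurm-node2; do
    echo "munge -n | ssh $host unmunge"
done
```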
6. Install Slurm
yum install -y slurm*
The * glob matches every package whose name starts with slurm; only a few of them are strictly required, and the rest can be installed as needed.
7. Configure slurm.conf
vim /etc/slurm/slurm.conf
(The complete configuration file follows. In practice only a handful of lines need to change from the defaults; everything else can be adjusted as needed.)
#
# See the slurm.conf man page for more information.
#
ControlMachine=master
#ControlAddr=127.0.0.1
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/true
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=master CPUs=4 Sockets=1 CoresPerSocket=4 RealMemory=2000 State=UNKNOWN
PartitionName=control Nodes=master Default=NO MaxTime=INFINITE State=UP
# Only one partition may be Default=YES; compute (below) is the default here.
NodeName=slurm-node1,slurm-node2 CPUs=4 Sockets=1 CoresPerSocket=4 RealMemory=2000 State=UNKNOWN
PartitionName=compute Nodes=slurm-node1,slurm-node2 Default=YES MaxTime=INFINITE State=UP
#NodeName=slurm-node2 CPUs=4 Sockets=1 CoresPerSocket=4 RealMemory=2000 State=UNKNOWN
#PartitionName=compute Nodes=slurm-node2 Default=YES MaxTime=INFINITE State=UP
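One step the walkthrough leaves implicit: slurm.conf must be identical on the controller and every compute node, so copy it out after editing. A sketch, dry-run via echo:

```shell
# slurm.conf must match byte-for-byte across the whole cluster.
for node in slurm-node1 slurm-node2; do
    echo scp /etc/slurm/slurm.conf "$node:/etc/slurm/"
done
```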
8. Start Slurm
On the master node, run:
slurmctld -c
slurmd -c
On each compute node, run:
slurmd -c
Run sinfo to check the cluster state.
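Starting the daemons by hand is fine for a first test, but the EPEL slurm packages also ship systemd units; assuming those are present, the services can be managed the usual way (dry run shown):

```shell
# On the master (controller plus its own slurmd in this layout):
echo systemctl enable --now slurmctld slurmd
# On each compute node:
echo systemctl enable --now slurmd
```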
Debugging:

If a daemon fails to start, run it in the foreground with maximum verbosity to see what is wrong:
slurmctld -DVVVVV
slurmd -DVVVVV
A simple example follows so you can see a job being created; more detailed usage documentation is in a separate document.
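As a minimal first job (the file name hello.sbatch is illustrative), a batch script that runs hostname on both compute nodes:

```shell
# hello.sbatch: run `hostname` on two nodes of the cluster.
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello-%j.out
srun hostname
EOF
echo "sbatch hello.sbatch    # submit, then inspect the queue with squeue"
```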