Updates to use Amazon Linux 2023 AMI and reflect the actual RealMemory of the compute nodes #34
base: plugin-v2
@@ -30,7 +30,7 @@ Parameters:
   LatestAmiId:
     Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
-    Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2
+    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 # AL2023 ami
   HeadNodeInstanceType:
     Type: String
@@ -47,6 +47,12 @@ Parameters:
     Default: 2
     Description: Number of vCPUs for the compute node instance type
+  ComputeNodeMemory:
+    Type: Number
+    Default: 4
+    Description: Amount of memory for the compute instance type in GB
 Metadata:
   AWS::CloudFormation::Interface:
     ParameterGroups:
@@ -62,6 +68,7 @@ Metadata:
           - HeadNodeInstanceType
           - ComputeNodeInstanceType
           - ComputeNodeCPUs
+          - ComputeNodeMemory
           - KeyPair
           - LatestAmiId
       - Label:
@@ -82,10 +89,12 @@ Metadata:
        default: Compute Node Instance Type
      ComputeNodeCPUs:
        default: Compute Node vCPUs
+     ComputeNodeMemory:
+       default: Compute Node memory
      KeyPair:
        default: Key Pair
      LatestAmiId:
-       default: Latest Amazon Linux 2 AMI ID
+       default: Latest Amazon Linux 2023 AMI ID
      SlurmPackageUrl:
        default: Slurm Package URL
      PluginPrefixUrl:
@@ -188,8 +197,10 @@ Resources:
        Fn::Base64:
          !Sub |
            #!/bin/bash -x
-           amazon-linux-extras install epel -y
-           yum install munge munge-libs munge-devel -y
+           # Install packages
+           dnf update -y
+           dnf install nfs-utils python3 python3-pip -y
+           dnf install munge munge-libs munge-devel -y

            echo "welcometoslurmamazonuserwelcometoslurmamazonuserwelcometoslurmamazonuser" | tee /etc/munge/munge.key
            chown munge:munge /etc/munge/munge.key
@@ -200,7 +211,7 @@ Resources:
            systemctl start munge
            sleep 15

-           yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad rpm-build -y
+           dnf install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad rpm-build -y

            mkdir -p /nfs
            mount -t nfs ${HeadNodeNetworkInterface.PrimaryPrivateIpAddress}:/nfs /nfs
@@ -229,20 +240,19 @@ Resources:
          !Sub |
            #!/bin/bash -x
            # Install packages
-           yum update -y
-           yum install nfs-utils python2 python2-pip python3 python3-pip -y
-           amazon-linux-extras install epel -y
-           yum install munge munge-libs munge-devel openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad rpm-build libyaml http-parser-devel json-c-devel perl-devel -y
-           yum groupinstall "Development Tools" -y
+           dnf update -y
+           dnf install nfs-utils python3 python3-pip -y
+           dnf install munge munge-libs munge-devel openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad rpm-build libyaml json-c-devel perl-devel -y
+           dnf groupinstall "Development Tools" -y
            pip3 install boto3
            pip3 install awscli

            # Configure NFS share
            mkdir -p /nfs
            echo "/nfs *(rw,async,no_subtree_check,no_root_squash)" | tee /etc/exports
-           systemctl enable nfs
-           systemctl start nfs
-           exportfs -av
+           systemctl enable --now nfs-server rpcbind
+           systemctl restart nfs-server
+           sudo exportfs -arv

            # Configure Munge
            echo "welcometoslurmamazonuserwelcometoslurmamazonuserwelcometoslurmamazonuser" | tee /etc/munge/munge.key
@@ -257,10 +267,10 @@ Resources:
            # Install Slurm
            cd /home/ec2-user/
            wget -q ${SlurmPackageUrl}
-           tar -xvf /home/ec2-user/slurm-*.tar.bz2 -C /home/ec2-user
-           cd /home/ec2-user/slurm-*
-           /home/ec2-user/slurm-*/configure --prefix=/nfs/slurm
-           make -j 4
+           tar -xf slurm-*.tar.bz2
+           cd "$(ls -d /home/ec2-user/slurm-* | grep -v '.tar.bz2')"
+           ./configure --prefix=/nfs/slurm
+           make -j $(nproc)
            make install
            make install-contrib
            sleep 5

Review comments on `tar -xf slurm-*.tar.bz2`:

Reviewer: What if SlurmPackageUrl is not .tar.bz2?

Author: @bollig - if it changes, the script will break. It's an upstream packaging decision that devs normally don't change on a whim; the archive extension appears to be consistently .tar.bz2 (see https://download.schedmd.com/slurm/). We could try to handle some other formats, but it would not be foolproof.

Author: Something ugly like this

Reviewer: Fair. This is OK as is; just thinking about people who may roll/distribute their own patch-fixed version of Slurm.
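The "something ugly" idea from the thread above could be sketched as a small helper that maps the archive suffix to tar flags. This is purely illustrative: the function name and usage lines are not from the PR, and GNU tar's plain `-xf` already auto-detects gzip/bzip2/xz compression for local files, so the PR's `tar -xf slurm-*.tar.bz2` would keep working for those formats anyway.

```shell
# Hypothetical helper (not in the PR): choose tar flags by archive suffix.
extract_flags() {
  case "$1" in
    *.tar.bz2) echo "xjf" ;;   # bzip2-compressed tarball (SchedMD's default)
    *.tar.gz)  echo "xzf" ;;   # gzip-compressed tarball
    *.tar.xz)  echo "xJf" ;;   # xz-compressed tarball
    *)         echo "" ;;      # unknown suffix: caller should abort
  esac
}

# Illustrative usage inside the head-node user-data script:
pkg="slurm-23.02.7.tar.bz2"    # placeholder for $(basename "$SlurmPackageUrl")
flags=$(extract_flags "$pkg")
if [ -z "$flags" ]; then
  echo "unsupported archive format: $pkg" >&2
fi
```

As the thread concludes, this mainly matters for users distributing their own repackaged Slurm; for stock SchedMD tarballs the existing one-liner is sufficient.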
@@ -308,7 +318,8 @@ Resources:
              "MaxNodes": 100,
              "Region": "${AWS::Region}",
              "SlurmSpecifications": {
-               "CPUs": "${ComputeNodeCPUs}"
+               "CPUs": "${ComputeNodeCPUs}",
+               "RealMemory": "${ComputeNodeMemory}"
              },
              "PurchasingOption": "on-demand",
              "OnDemandOptions": {
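One detail worth checking with this change: slurm.conf specifies RealMemory in megabytes, while the new ComputeNodeMemory parameter is documented in GB. Assuming generate_conf.py copies SlurmSpecifications entries into the node definition verbatim (not verified here), the generated line might resemble the illustrative fragment below, in which case a GB-to-MB conversion would be needed somewhere; the NodeName and State values are placeholders, not taken from the plugin.

```
# Hypothetical slurm.conf node line produced from the settings above
NodeName=aws-compute-[1-100] CPUs=2 RealMemory=4 State=CLOUD
```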
@@ -377,8 +388,12 @@ Resources:
            # Configure the plugin
            $SLURM_HOME/etc/aws/generate_conf.py
            cat $SLURM_HOME/etc/aws/slurm.conf.aws >> $SLURM_HOME/etc/slurm.conf
-           cp $SLURM_HOME/etc/aws/gres.conf.aws $SLURM_HOME/etc/gres.conf
+           cp $SLURM_HOME/etc/aws/gres.conf.aws $SLURM_HOME/etc/gres.conf # GPU's
+
+           # install cronie package
+           dnf install cronie -y
+           systemctl enable crond.service
+           systemctl start crond.service
            crontab -l > mycron
            cat > mycron <<EOF
            * * * * * $SLURM_HOME/etc/aws/change_state.py &>/dev/null
Review comments on `"RealMemory": "${ComputeNodeMemory}"`:

Reviewer: Why statically define this? Why not auto-detect RealMemory and have users provide a SchedulableMemory percentage (see ParallelCluster for example)?

Author: SchedulableMemory will need further research and is for my next PR. :) AFAIK, SchedulableMemory is a Slurm configuration that's shipped with AWS ParallelCluster, and we are not dealing with ParallelCluster here. As for auto-detecting RealMemory: we would need an EC2 describe call for the specified instance type, which is a broader change beyond the scope of this PR.
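The auto-detect approach deferred here could be sketched with the AWS CLI's `ec2 describe-instance-types`, which reports memory in MiB, the same unit slurm.conf's RealMemory uses. The variable names and the 95% headroom factor below are illustrative, not from the plugin, and the fallback value only exists so the snippet runs without AWS credentials.

```shell
# Hypothetical sketch (not in the PR): derive RealMemory from the instance
# type instead of a static parameter. Requires ec2:DescribeInstanceTypes.
INSTANCE_TYPE="c5.large"   # placeholder for ${ComputeNodeInstanceType}

if command -v aws >/dev/null 2>&1; then
  MEM_MIB=$(aws ec2 describe-instance-types \
    --instance-types "$INSTANCE_TYPE" \
    --query 'InstanceTypes[0].MemoryInfo.SizeInMiB' \
    --output text 2>/dev/null) || MEM_MIB=""
fi
# Fallback so the sketch is self-contained offline; c5.large has 4096 MiB.
MEM_MIB="${MEM_MIB:-4096}"

# Leave ~5% headroom for the OS rather than scheduling all physical memory.
REAL_MEMORY=$(( MEM_MIB * 95 / 100 ))
echo "RealMemory=$REAL_MEMORY"
```

This would also address the percentage-of-memory idea raised by the reviewer, at the cost of an extra API call and IAM permission per node launch.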