Updates to use Amazon Linux 2023 AMI and reflect the actual RealMemory of the compute nodes #34

Open
wants to merge 2 commits into base: plugin-v2
53 changes: 34 additions & 19 deletions template.yaml
@@ -30,7 +30,7 @@ Parameters:

LatestAmiId:
Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2
Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 # AL2023 ami

HeadNodeInstanceType:
Type: String
@@ -47,6 +47,12 @@ Parameters:
Default: 2
Description: Number of vCPUs for the compute node instance type

ComputeNodeMemory:

Why statically define this? Why not auto-detect RealMemory and have users provide a SchedulableMemory percentage (see ParallelCluster for example)?

Author
SchedulableMemory will need further research; I'll take it up in my next PR. :)

AFAIK, SchedulableMemory is a Slurm configuration that ships with AWS ParallelCluster, and we are not dealing with ParallelCluster here. As for auto-detecting RealMemory, we would need an EC2 describe call for the specified instance type. That would also be a broader change, which is beyond the scope of this PR.
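
For reference, a rough sketch of what that auto-detection could look like (not part of this PR; the hard-coded instance type, the 90% schedulable figure, and the ec2:DescribeInstanceTypes permission on the node role are all assumptions):

# Hypothetical sketch only: derive RealMemory from the instance type instead of a static parameter
INSTANCE_TYPE="c5.large"      # would come from the ComputeNodeInstanceType parameter
SCHEDULABLE_PERCENT=90        # ParallelCluster-style schedulable-memory percentage

# Memory EC2 reports for this instance type, in MiB
TOTAL_MIB=$(aws ec2 describe-instance-types \
    --instance-types "$INSTANCE_TYPE" \
    --query 'InstanceTypes[0].MemoryInfo.SizeInMiB' \
    --output text)

# Keep headroom for the OS and daemons; give the rest to Slurm as RealMemory (MB)
REAL_MEMORY=$(( TOTAL_MIB * SCHEDULABLE_PERCENT / 100 ))
echo "RealMemory=$REAL_MEMORY"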

Type: Number
Default: 4
Description: Amount of memory for the compute instance type in GB


Metadata:
AWS::CloudFormation::Interface:
ParameterGroups:
@@ -62,6 +68,7 @@ Metadata:
- HeadNodeInstanceType
- ComputeNodeInstanceType
- ComputeNodeCPUs
- ComputeNodeMemory
- KeyPair
- LatestAmiId
- Label:
@@ -82,10 +89,12 @@ Metadata:
default: Compute Node Instance Type
ComputeNodeCPUs:
default: Compute Node vCPUs
ComputeNodeMemory:
default: Compute Node memory
KeyPair:
default: Key Pair
LatestAmiId:
default: Latest Amazon Linux 2 AMI ID
default: Latest Amazon Linux 2023 AMI ID
SlurmPackageUrl:
default: Slurm Package URL
PluginPrefixUrl:
@@ -188,8 +197,10 @@ Resources:
Fn::Base64:
!Sub |
#!/bin/bash -x
amazon-linux-extras install epel -y
yum install munge munge-libs munge-devel -y
# Install packages
dnf update -y
dnf install nfs-utils python3 python3-pip -y
dnf install munge munge-libs munge-devel -y

echo "welcometoslurmamazonuserwelcometoslurmamazonuserwelcometoslurmamazonuser" | tee /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
@@ -200,7 +211,7 @@ Resources:
systemctl start munge
sleep 15

yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad rpm-build -y
dnf install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad rpm-build -y

mkdir -p /nfs
mount -t nfs ${HeadNodeNetworkInterface.PrimaryPrivateIpAddress}:/nfs /nfs
@@ -229,20 +240,19 @@ Resources:
!Sub |
#!/bin/bash -x
# Install packages
yum update -y
yum install nfs-utils python2 python2-pip python3 python3-pip -y
amazon-linux-extras install epel -y
yum install munge munge-libs munge-devel openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad rpm-build libyaml http-parser-devel json-c-devel perl-devel -y
yum groupinstall "Development Tools" -y
dnf update -y
dnf install nfs-utils python3 python3-pip -y
dnf install munge munge-libs munge-devel openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad rpm-build libyaml json-c-devel perl-devel -y
dnf groupinstall "Development Tools" -y
pip3 install boto3
pip3 install awscli

# Configure NFS share
mkdir -p /nfs
echo "/nfs *(rw,async,no_subtree_check,no_root_squash)" | tee /etc/exports
systemctl enable nfs
systemctl start nfs
exportfs -av
systemctl enable --now nfs-server rpcbind
systemctl restart nfs-server
sudo exportfs -arv

# Configure Munge
echo "welcometoslurmamazonuserwelcometoslurmamazonuserwelcometoslurmamazonuser" | tee /etc/munge/munge.key
@@ -257,10 +267,10 @@ Resources:
# Install Slurm
cd /home/ec2-user/
wget -q ${SlurmPackageUrl}
tar -xvf /home/ec2-user/slurm-*.tar.bz2 -C /home/ec2-user
cd /home/ec2-user/slurm-*
/home/ec2-user/slurm-*/configure --prefix=/nfs/slurm
make -j 4
tar -xf slurm-*.tar.bz2

What if SlurmPackageUrl is not .tar.bz2?

Author
@bollig - if it changes, the script will break. It's an upstream packaging decision that devs don't normally change on a whim; the archive extension appears to consistently be .tar.bz2: https://download.schedmd.com/slurm/.

We could try to handle some of the other extensions, but it wouldn't be foolproof.

Author
Something ugly like this:

wget -q ${SlurmPackageUrl}
# Extract based on file extension
if ls slurm-*.tar.gz >/dev/null 2>&1; then
    tar -xzf slurm-*.tar.gz
elif ls slurm-*.tar.bz2 >/dev/null 2>&1; then
    tar -xjf slurm-*.tar.bz2
elif ls slurm-*.tgz >/dev/null 2>&1; then
    tar -xzf slurm-*.tgz
elif ls slurm-*.tar >/dev/null 2>&1; then
    tar -xf slurm-*.tar
else
    echo "No recognized Slurm archive found"
    exit 1
fi

# Change to the extracted directory, excluding any archive files
cd "$(ls -d /home/ec2-user/slurm-* | grep -v -E '\.tar\.gz$|\.tar\.bz2$|\.tgz$|\.tar$')"

Fair, this is OK as-is; just thinking about people who may roll/distribute their own patch-fixed version of Slurm.

cd "$(ls -d /home/ec2-user/slurm-* | grep -v '.tar.bz2')"
./configure --prefix=/nfs/slurm
make -j $(nproc)
make install
make install-contrib
sleep 5
@@ -308,7 +318,8 @@ Resources:
"MaxNodes": 100,
"Region": "${AWS::Region}",
"SlurmSpecifications": {
"CPUs": "${ComputeNodeCPUs}"
"CPUs": "${ComputeNodeCPUs}",
"RealMemory": "${ComputeNodeMemory}"
},
"PurchasingOption": "on-demand",
"OnDemandOptions": {
@@ -377,8 +388,12 @@ Resources:
# Configure the plugin
$SLURM_HOME/etc/aws/generate_conf.py
cat $SLURM_HOME/etc/aws/slurm.conf.aws >> $SLURM_HOME/etc/slurm.conf
cp $SLURM_HOME/etc/aws/gres.conf.aws $SLURM_HOME/etc/gres.conf
cp $SLURM_HOME/etc/aws/gres.conf.aws $SLURM_HOME/etc/gres.conf # GPU's

# install cronie package
dnf install cronie -y
systemctl enable crond.service
systemctl start crond.service
crontab -l > mycron
cat > mycron <<EOF
* * * * * $SLURM_HOME/etc/aws/change_state.py &>/dev/null