I'm trying to run the multi-node Ray example on LUMI, but there seems to be an issue with the script as-is and it doesn't start up. Do you have an idea what the problem could be?
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2025-03-26_16-52-45_422621_52670/sockets/plasma_store in the list of object store socket names.
Initializing ray cluster on head node uan04
Usage stats collection is disabled.
Local node IP: 193.167.209.166
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/node.py", line 342, in __init__
    ray._private.services.wait_for_node(
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/services.py", line 471, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2025-03-26_16-52-45_422621_52670/sockets/plasma_store in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2672, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2668, in main
    return cli()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 850, in start
    node = ray._private.node.Node(
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/node.py", line 347, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.
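When Ray times out like this, the raylet and plasma store logs under the session directory usually say which process failed to come up. A quick way to check (a sketch, assuming Ray's default temp dir `/tmp/ray`; `session_latest` is a symlink Ray keeps pointing at the most recent session):

```shell
# Inspect the most recent Ray session's logs (assumes the default --temp-dir).
log_dir=/tmp/ray/session_latest/logs
if [ -d "$log_dir" ]; then
    # raylet.out / raylet.err usually contain the reason startup failed
    tail -n 50 "$log_dir/raylet.out" "$log_dir/raylet.err"
else
    echo "no Ray session found under /tmp/ray"
fi
```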
Here's the complete script for reference:
#!/bin/bash
#SBATCH --account=project_xxx
#SBATCH --partition=dev-g
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --cpus-per-task=56
#SBATCH --gpus-per-node=8
#SBATCH --mem=480G
#SBATCH --time=30
export NUMEXPR_MAX_THREADS=16
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=ALL
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export HF_HOME=/scratch/project_xxx/xxx/hf-cache
# Where to store the vLLM server log
VLLM_LOG=/scratch/project_xxx/xxx/vllm-logs/${SLURM_JOB_ID}.log
mkdir -p "$(dirname "$VLLM_LOG")"
MODEL=".."
RAY="python3 -m ray.scripts.scripts"
RAY_PORT=6379
HEAD_NODE=$(hostname)
# Needed on AMD at least, see https://github.com/vllm-project/vllm/issues/3818
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
# Using the socket network, rather than infiniband et al, not ideal
# but works for now
export NCCL_NET=Socket
# Load the modules
module purge
module use /appl/local/csc/modulefiles/
module load pytorch/2.5
#source venv/bin/activate
# Start Ray on the head node
echo "Initializing ray cluster on head node $HEAD_NODE"
$RAY start --head --port=${RAY_PORT} --disable-usage-stats
# Make sure head node has started properly
sleep 30
while ! $RAY status >/dev/null 2>&1
do
sleep 5
done
WORKER_NNODES=$(( SLURM_NNODES - 1 ))
echo "Start the $WORKER_NNODES worker node(s)"
srun --ntasks=$WORKER_NNODES --nodes=$WORKER_NNODES --exclude=$HEAD_NODE $RAY start --block --address=$HEAD_NODE:${RAY_PORT} &
# Wait until all worker nodes have checked in
sleep 10
while [ "$($RAY status 2>/dev/null | grep -c node_)" -ne "$SLURM_NNODES" ]
do
sleep 5
done
$RAY status
echo "Starting VLLM"
python -m vllm.entrypoints.openai.api_server \
    --distributed-executor-backend=ray \
    --model=$MODEL \
    --dtype=auto \
    --tensor-parallel-size=8 \
    --pipeline-parallel-size=2 \
    --gpu-memory-utilization=0.95 \
    --trust-remote-code \
    --enforce-eager > "$VLLM_LOG" 2>&1 &
# Wait until vLLM is running properly
sleep 20
while ! curl localhost:8000 >/dev/null 2>&1
do
sleep 10
done
curl localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"prompt": "I finally got vLLM working on multiple nodes on LUMI", "temperature": 0, "max_tokens": 100, "model": "xxx"}' | json_pp
# If you want to keep vLLM running you need to add a "wait" here, otherwise the job will stop when the above line is done.
# wait
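As a side note, the script's three sleep/poll loops are unbounded, so if Ray or vLLM never comes up the job just spins until the SLURM time limit. A bounded variant of that pattern (a sketch; `wait_for` is a hypothetical helper, not part of the original script):

```shell
#!/bin/bash
# wait_for TIMEOUT CMD...: retry CMD every second until it succeeds,
# giving up with a non-zero exit after roughly TIMEOUT seconds.
wait_for() {
    local timeout=$1; shift
    local waited=0
    until "$@"; do
        waited=$((waited + 1))
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
}

# e.g. in place of the unbounded head-node loop:
#   wait_for 120 $RAY status >/dev/null 2>&1 || exit 1
wait_for 5 test -d /tmp && echo "ready"
```

Failing fast this way also frees the allocation instead of burning the full `--time` budget on a cluster that never formed.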