
Multi-node Ray LUMI - Did not find socket name .. in the list of object store socket names. #1

Closed
kaiserdan opened this issue Mar 26, 2025 · 2 comments

@kaiserdan

I'm trying to run the multi-node Ray example on LUMI, but it seems there is some issue in the script as it is and it doesn't start up. Do you have an idea what the problem could be?

TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2025-03-26_16-52-45_422621_52670/sockets/plasma_store in the list of object store socket names.


Initializing ray cluster on head node uan04
Usage stats collection is disabled.

Local node IP: 193.167.209.166
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/node.py", line 342, in __init__
    ray._private.services.wait_for_node(
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/services.py", line 471, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2025-03-26_16-52-45_422621_52670/sockets/plasma_store in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2672, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2668, in main
    return cli()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 850, in start
    node = ray._private.node.Node(
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/node.py", line 347, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

Here's the complete script for reference:

#!/bin/bash
#SBATCH --account=project_xxx
#SBATCH --partition=dev-g
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --cpus-per-task=56
#SBATCH --gpus-per-node=8
#SBATCH --mem=480G
#SBATCH --time=30

export NUMEXPR_MAX_THREADS=16
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=ALL

export VLLM_WORKER_MULTIPROC_METHOD=spawn
export HF_HOME=/scratch/project_xxx/xxx/hf-cache

# Where to store the vLLM server log
VLLM_LOG=/scratch/project_xxx/xxx/vllm-logs/${SLURM_JOB_ID}.log
mkdir -p "$(dirname "$VLLM_LOG")"

MODEL=".."

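# Invoke the Ray CLI as a Python module; this avoids relying on the 'ray'
# console script being on the PATH inside the container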
RAY="python3 -m ray.scripts.scripts"
RAY_PORT=6379
HEAD_NODE=$(hostname)

# Needed on AMD at least, see https://github.com/vllm-project/vllm/issues/3818
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

# Use the socket network rather than InfiniBand etc.; not ideal,
# but it works for now
export NCCL_NET=Socket

# Load the modules
module purge
module use /appl/local/csc/modulefiles/
module load pytorch/2.5
#source venv/bin/activate

# Start Ray on the head node
echo "Initializing ray cluster on head node $HEAD_NODE"
$RAY start --head --port=${RAY_PORT} --disable-usage-stats

# Make sure head node has started properly
sleep 30
while ! $RAY status >/dev/null 2>&1
do
    sleep 5
done

WORKER_NNODES=$(( SLURM_NNODES - 1 ))
echo "Start the $WORKER_NNODES worker node(s)"
srun --ntasks=$WORKER_NNODES --nodes=$WORKER_NNODES --exclude=$HEAD_NODE $RAY start --block --address=$HEAD_NODE:${RAY_PORT} &

# Wait until all worker nodes have checked in
sleep 10
while [ $($RAY status 2>/dev/null | grep -c node_) -ne $SLURM_NNODES ]
do
    sleep 5
done
$RAY status

echo "Starting VLLM"
python -m vllm.entrypoints.openai.api_server \
            --distributed-executor-backend=ray \
            --model=$MODEL \
            --dtype=auto \
            --tensor-parallel-size=8 \
            --pipeline-parallel-size=2 \
            --gpu-memory-utilization=0.95 \
            --trust-remote-code \
            --enforce-eager > $VLLM_LOG &

# Wait until vLLM is running properly
sleep 20
while ! curl localhost:8000 >/dev/null 2>&1
do
    sleep 10
done

curl localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"prompt": "I finally got vLLM working on multiple nodes on LUMI", "temperature": 0, "max_tokens": 100, "model": "xxx"}' | json_pp

# If you want to keep vLLM running you need to add a "wait" here, otherwise the job will stop when the above line is done.
# wait

@mvsjober
Member

You need to submit the job with Slurm, e.g.:

sbatch run-vllm-ray.sh

From your log it looks like the script was run directly on a login node (the head node is uan04, a LUMI user access node), so the Slurm environment the script relies on was never set up.
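Once the job is running you can check that it was placed on compute nodes and follow the vLLM server log (path as set in your script), for example:

squeue --me
tail -f /scratch/project_xxx/xxx/vllm-logs/<jobid>.log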

@kaiserdan
Author

Yes, I can confirm that this was the issue. It's working as it should now! Thank you very much for the quick help!
