Data parallel inference #1237

Closed
kevinhu opened this issue Sep 30, 2023 · 18 comments
Labels
feature request

Comments

@kevinhu

kevinhu commented Sep 30, 2023

Is there a recommended way to run data-parallel inference (i.e. a copy of the model on each GPU)? It's possible by hacking CUDA_VISIBLE_DEVICES, but I was wondering if there's a cleaner method.

import multiprocessing
import os

from vllm import LLM, SamplingParams


def worker(worker_idx):
    # Pin this worker's process to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)


if __name__ == "__main__":
    # One worker per GPU, each loading its own copy of the model.
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))
@viktor-ferenczi
Contributor

viktor-ferenczi commented Oct 1, 2023

This approach should result in a more scalable (and perhaps cleaner) architecture:

Run a vLLM API server for each GPU, serving on different ports. Then use those API endpoints to schedule generations on the vLLM backends from a centralized process. If you want to do it from Python, try the vllm-client package (also installable with pip); it supports async, which keeps your logic simpler.

This also allows restarting your "control" process during development (or on upgrades) without having to reload the model into the vLLM backends.

There is a slight overhead due to API access, but it is amortized by the parallel execution of generations.

Make sure to keep the vLLM backend processes alive. Restart them if they crash or if they repeatedly fail client requests.
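
A minimal sketch of this pattern, assuming one API server per GPU has already been started (e.g. CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server --port 8000, CUDA_VISIBLE_DEVICES=1 ... --port 8001, and so on) and that each server exposes the demo /generate endpoint. It uses plain aiohttp rather than vllm-client, and the endpoint path and payload shape are assumptions to adapt to your setup:

import asyncio
import itertools

import aiohttp

# One endpoint per GPU-pinned vLLM API server (ports are an assumption).
ENDPOINTS = [f"http://localhost:{8000 + i}/generate" for i in range(4)]

async def generate(session, url, prompt):
    # Payload shape mirrors the demo api_server; adjust to your server's schema.
    payload = {"prompt": prompt, "temperature": 0.8, "top_p": 0.95}
    async with session.post(url, json=payload) as resp:
        return await resp.json()

async def main(prompts):
    # Round-robin prompts across the backends; all requests run concurrently.
    async with aiohttp.ClientSession() as session:
        tasks = [
            generate(session, url, prompt)
            for url, prompt in zip(itertools.cycle(ENDPOINTS), prompts)
        ]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    prompts = ["Hello, my name is", "The capital of France is"]
    print(asyncio.run(main(prompts)))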

@viktor-ferenczi
Contributor

Feature request: Allow for data-parallel execution on multiple (sets of) GPUs with the same model, served from the same API, so no external scheduler is required.

@brucechin

brucechin commented Oct 9, 2023

(quoting @viktor-ferenczi's suggestion above)

Hi @viktor-ferenczi, @LiuXiaoxuanPKU assigned this issue to me to offload some work. After reading your comments, I think I can implement data-parallel inference on multiple GPUs with the same model along the lines of your suggestion above:

  1. A centralized scheduler process starts multiple local vLLM API servers, one per GPU, each on a different port.
  2. Implement scheduling policies, and restart an API server whenever a failure is detected.
  3. If possible, support a multi-server setup (each server with multiple GPUs) to further improve scalability.

I plan to add a new class, DataParallelScheduler, which can start multiple vLLM API servers, manage them, and schedule incoming requests behind the same generate interface. In vllm/entrypoints/api_server.py, I would add an option for data-parallel inference; when it is enabled, instead of starting engine = AsyncLLMEngine.from_engine_args(engine_args), we would initialize a DataParallelScheduler instance to serve requests.

I will ensure that my change will not affect the old execution flow when the data-parallel inference option is disabled. I will also add tests to check the robustness of the scheduler I am going to add.

Please let me know if I missed anything here. I would like to add support for this feature in my free time.
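
For illustration, a minimal sketch of what such a DataParallelScheduler could look like. The class name comes from the plan above; the subprocess invocation, port assignment, and round-robin policy are assumptions, and failure detection/restart is omitted:

import itertools
import os
import subprocess
import sys

class DataParallelScheduler:
    """Start one vLLM API server per GPU and round-robin requests across them."""

    def __init__(self, model: str, num_gpus: int, base_port: int = 8000):
        self.ports = [base_port + i for i in range(num_gpus)]
        self.procs = []
        for gpu, port in enumerate(self.ports):
            # Pin each server process to a single GPU via CUDA_VISIBLE_DEVICES.
            self.procs.append(subprocess.Popen(
                [sys.executable, "-m", "vllm.entrypoints.api_server",
                 "--model", model, "--port", str(port)],
                env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)},
            ))
        self._next_port = itertools.cycle(self.ports)

    def pick_endpoint(self) -> str:
        # Simple round-robin load balancing; a smarter policy could track queue depth.
        return f"http://localhost:{next(self._next_port)}/generate"

    def shutdown(self):
        for proc in self.procs:
            proc.terminate()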

@viktor-ferenczi
Contributor

I'm not 100% sure that this functionality belongs in the vLLM engine project itself, because it is only a layer on top of it. Using an existing external tool/framework to verify service health and restart the vLLM instances when required might be enough. All it needs is to run a short generation as a health check once a minute (for example), so that broken/frozen processes can be identified and restarted automatically.

Feel free to go ahead and make a PR for the solution you described.
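
A minimal sketch of such a health check, assuming the backends expose the demo /generate endpoint (the path, payload, and restart hook are assumptions; a process supervisor such as systemd or supervisord could perform the actual restart):

import time

import requests

ENDPOINTS = ["http://localhost:8000/generate", "http://localhost:8001/generate"]

def healthy(url: str, timeout: float = 10.0) -> bool:
    # A tiny generation doubles as a liveness check and a "not frozen" check.
    try:
        resp = requests.post(url, json={"prompt": "ping", "max_tokens": 1}, timeout=timeout)
        return resp.ok
    except requests.RequestException:
        return False

while True:
    for url in ENDPOINTS:
        if not healthy(url):
            # Hook your restart mechanism here (e.g. relaunch the server process).
            print(f"{url} failed the health check; restarting its backend")
    time.sleep(60)  # once a minute, as suggested above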

@SunLemuria

I think fastchat supports this feature: fastchat scalability

@anisingh1

Hi @brucechin, are you working on implementing this request, or has it been deferred?

@AjayP13

AjayP13 commented Feb 6, 2024

This is possible with our DataDreamer package, which can load vLLM instances in parallel (different models on different GPUs). It does this by always instantiating vLLM in a background process and communicating with it. See ParallelLLM in the package for wrapping multiple VLLM objects under a single LLM object.

@andakai

andakai commented Mar 25, 2024

(quoting @brucechin's implementation plan above)

Hi @brucechin, how is this work going? I am fascinated by this idea.

@AmoghM

AmoghM commented May 4, 2024

+1 for this feature.

@zemerov

zemerov commented May 7, 2024

+1 for this feature.

@WanBenLe

+1 for this feature. DataDreamer doesn't seem to improve the inference speed of a single model across multiple GPUs (with one model copy per GPU).

@ifromeast

+1 for this feature. @WoosukKwon

@DarkLight1337 added the feature request label May 31, 2024
@GritLs

GritLs commented Jun 1, 2024

+1 for this feature

@mangomatrix

Need this feature too.

@kota-iizuka
Contributor

There is an example of data-parallel offline inference at https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_distributed.py (changing num_instances also adjusts the CUDA_VISIBLE_DEVICES environment variable appropriately).

On the other hand, since the above example is batch inference, I think there is still a need for a method of online inference (with proper load balancing) and a simple method for parallel inference of multiple models. (It is probably possible to achieve this with a single script in the examples/ directory, but it is important to make it easy to use.)
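
For reference, a rough sketch of the pattern that example follows: Ray Data maps batches of prompts over several LLM replicas, one model copy per GPU. The linked script is the authoritative version; the map_batches arguments below (concurrency, num_gpus) assume a recent Ray release:

import ray
from vllm import LLM, SamplingParams

class LLMPredictor:
    def __init__(self):
        # One model copy is loaded per Ray actor (i.e. per GPU).
        self.llm = LLM(model="facebook/opt-125m")
        self.sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["text"]), self.sampling_params)
        batch["generated_text"] = [out.outputs[0].text for out in outputs]
        return batch

prompts = ["Hello, my name is", "The capital of France is"]
ds = ray.data.from_items([{"text": p} for p in prompts])
ds = ds.map_batches(
    LLMPredictor,
    concurrency=4,  # number of replicas, analogous to num_instances in the example
    num_gpus=1,     # GPUs reserved per replica
    batch_size=16,
)
print(ds.take_all())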

@shizhediao

+1 for this feature.

@zhaochenyang20

https://github.com/zhaochenyang20/ModelServer

Could you please check this? It is one I wrote locally.

@youkaichao
Member

I'm going to close this issue, as vLLM does not plan to support this feature.

Users should seek third-party support (which should be pretty easy to set up), e.g.:

https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model

or the solutions mentioned in the discussion above.
