🤗 Support request for a new model from huggingface: embedding model with rotary embeddings not supported #10970

cosmic-chichu · 2024-12-06T23:44:21Z

Your current environment

The output of `python collect_env.py`

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.227-219.884.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.85
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 550.127.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               8
On-line CPU(s) list:                  0-7
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family:                           6
Model:                                79
Thread(s) per core:                   2
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             1
CPU max MHz:                          3000.0000
CPU min MHz:                          1200.0000
BogoMIPS:                             4600.01
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Hypervisor vendor:                    Xen
Virtualization type:                  full
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             1 MiB (4 instances)
L3 cache:                             45 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-7
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-7	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=GPU-3835f1aa-d70a-51ab-2903-518f94d2b012
NVIDIA_REQUIRE_CUDA=cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551
NCCL_VERSION=2.22.3-1
CUDA_DOCKER_ARCH=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=12.6.0
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

problem

I'm unable to use the jinaai/jina-embeddings-v3 embedding model with vLLM. When served through the OpenAI-style OpenAIServingEmbedding, it fails with the error `No CUDA GPUs are available`. I also tried running it with the LLM class, which fails with the error shown below.

Does this mean the model is not yet supported, or is this a different issue?

code

from vllm import LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create an LLM.
model = LLM(model="jinaai/jina-embeddings-v3", enforce_eager=True, trust_remote_code=True)
# Generate embedding. The output is a list of PoolingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # embedding vector as a list of floats

error

WARNING 12-06 23:15:27 config.py:503] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 12-06 23:15:27 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='jinaai/jina-embeddings-v3', speculative_config=None, tokenizer='jinaai/jina-embeddings-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8194, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=jinaai/jina-embeddings-v3, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type='MEAN', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None))
INFO 12-06 23:15:28 selector.py:135] Using Flash Attention backend.
INFO 12-06 23:15:28 model_runner.py:1072] Starting to load model jinaai/jina-embeddings-v3...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/one-flow/serve-llama/test.py", line 12, in <module>
[rank0]:     model = LLM(model="jinaai/jina-embeddings-v3", enforce_eager=True, trust_remote_code=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1028, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 210, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 585, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 347, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 36, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 152, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1074, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 332, in load_model
[rank0]:     model = _initialize_model(vllm_config=vllm_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 100, in _initialize_model
[rank0]:     return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/bert.py", line 387, in __init__
[rank0]:     self.model = self._build_model(vllm_config=vllm_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/roberta.py", line 84, in _build_model
[rank0]:     return BertModel(vllm_config=vllm_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/bert.py", line 317, in __init__
[rank0]:     self.embeddings = embedding_class(config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/roberta.py", line 36, in __init__
[rank0]:     raise ValueError("Only 'absolute' position_embedding_type" +
[rank0]: ValueError: Only 'absolute' position_embedding_type is supported
[rank0]:[W1206 23:15:29.942436995 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

error with serving embedding

  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 279, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 411, in _run_workers
    self.driver_worker.execute_method(method, *driver_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 481, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 472, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 135, in init_device
    torch.cuda.set_device(self.device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 478, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
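
The `No CUDA GPUs are available` error above is raised from `torch.cuda.set_device` inside a Ray worker, so one thing worth checking is what that worker process sees in `CUDA_VISIBLE_DEVICES`. The following is a minimal, stdlib-only sketch of such a check. `describe_gpu_visibility` is a hypothetical helper, not part of vLLM, Ray, or PyTorch, and it only inspects the environment variable rather than querying the driver.

```python
import os
from typing import Mapping, Optional


def describe_gpu_visibility(env: Optional[Mapping[str, str]] = None) -> str:
    """Summarize what CUDA_VISIBLE_DEVICES allows this process to see.

    Hypothetical diagnostic helper: it reads only the environment variable,
    does not validate device IDs, and does not talk to the CUDA driver.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Unset: CUDA enumerates every GPU the driver exposes.
        return "unset: all driver-visible GPUs are available"
    devices = [d.strip() for d in value.split(",") if d.strip()]
    if not devices:
        # Set but empty: CUDA sees no devices at all.
        return "empty: no GPUs are visible to this process"
    return f"restricted to {len(devices)} device(s): {', '.join(devices)}"


if __name__ == "__main__":
    print(describe_gpu_visibility())
```

Running this inside the worker process rather than the driver is the point of the exercise, since the two can have different environments. An "empty" result would be consistent with the error above, although it does not by itself prove that this is the cause.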

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.


DarkLight1337 · 2024-12-07T02:15:43Z

ValueError: Only 'absolute' position_embedding_type is supported

It seems that your model isn't supported yet. cc @flaviabeo
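
For readers following the traceback, the failure comes from a guard on `position_embedding_type` in the embedding layer's constructor. Below is a minimal sketch of that kind of guard, reconstructed only from the error message. The function name and the `"rotary"` value are assumptions for illustration, and this is not vLLM's actual code.

```python
from types import SimpleNamespace


def require_absolute_positions(config) -> str:
    """Reject configs whose position embeddings are not 'absolute'.

    Illustrative only: `config` is any object with an optional
    `position_embedding_type` attribute; this is not vLLM's implementation.
    """
    kind = getattr(config, "position_embedding_type", "absolute")
    if kind != "absolute":
        # Same message text as the ValueError in the traceback.
        raise ValueError("Only 'absolute' position_embedding_type is supported")
    return kind


if __name__ == "__main__":
    # 'rotary' is an assumed value standing in for the model's config.
    rotary_like = SimpleNamespace(position_embedding_type="rotary")
    try:
        require_absolute_positions(rotary_like)
    except ValueError as exc:
        print(f"rejected: {exc}")
```

A guard like this fails at model construction time, before any weights are loaded, which matches where the traceback stops.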

flaviabeo · 2024-12-09T15:24:59Z

I can look into this @DarkLight1337
cc: @maxdebayser

maxdebayser · 2024-12-11T00:49:43Z

This model comes with custom code to add rotary embeddings: https://huggingface.co/jinaai/xlm-roberta-flash-implementation. This means that even with the transformers library it's not natively supported and depends on the trust_remote_code=True flag to download and execute this extra code. So I would say that this is a model support request and not a bug.
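
One way to tell in advance that a checkpoint depends on remote code is to look for an `auto_map` entry in its `config.json`, which Hugging Face uses to point the `Auto*` classes at custom modules. The sketch below assumes that convention. `needs_remote_code` is a hypothetical helper, and the sample config is a made-up stand-in, not this model's real config.

```python
import json


def needs_remote_code(config: dict) -> bool:
    """Return True if a config.json dict maps Auto* classes to custom code."""
    auto_map = config.get("auto_map")
    return isinstance(auto_map, dict) and len(auto_map) > 0


if __name__ == "__main__":
    # Made-up example; the module and class names are placeholders.
    sample = json.loads(
        '{"model_type": "xlm-roberta",'
        ' "auto_map": {"AutoModel": "some_repo--modeling_custom.CustomModel"}}'
    )
    print(needs_remote_code(sample))
```

A positive result only says that custom code is involved. Whether a serving framework can run the model still depends on that framework implementing the architecture itself.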

cosmic-chichu added the "bug" (Something isn't working) label Dec 6, 2024

cosmic-chichu changed the title from "[Bug]: embedding model not supported" to "🤗 Support request for a new model from huggingface: embedding model with rotary embeddings not supported" Dec 11, 2024

richardliaw added the "ray" (anything related with ray) label Dec 11, 2024

ruisearch42 removed the "ray" (anything related with ray) label Dec 12, 2024

noooop mentioned this issue Apr 6, 2025

[New Model]: jinaai/jina-embeddings-v3 #16120

Merged

vllm-bot closed this as completed in #16120 Apr 8, 2025
