RuntimeError: attn_bias is not correctly aligned #407

Closed

zieen opened this issue Jul 9, 2023 · 3 comments · Fixed by #834
Labels
bug Something isn't working

Comments

zieen commented Jul 9, 2023

Unable to handle request for model mosaicml/mpt-30b-chat

INFO 07-09 00:50:38 llm_engine.py:131] # GPU blocks: 716, # CPU blocks: 195
INFO:     Started server process [89934]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-09 00:50:42 async_llm_engine.py:117] Received request cmpl-41fa40b022f54beaa423ec71c5c090e9: prompt: 'hello', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=False, max_tokens=7, logprobs=None), prompt token ids: None.
INFO 07-09 00:50:42 scheduler.py:269] Throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 07-09 00:50:42 async_llm_engine.py:196] Aborted request cmpl-41fa40b022f54beaa423ec71c5c090e9.
INFO:     8.218.79.36:49514 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 481, in create_completion
    async for res in result_generator:
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 151, in generate
    raise e
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 148, in generate
    await self.engine_step(request_id)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 74, in engine_step
    request_outputs = self.engine.step()
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 242, in step
    output = self._run_workers(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 330, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 284, in execute_model
    output = self.model(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 233, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 201, in forward
    hidden_states = block(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 152, in forward
    x = self.attn(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 101, in forward
    attn_output = self.attn(q, k, v, k_cache, v_cache, input_metadata,
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 170, in forward
    self.multi_query_kv_attention(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 352, in multi_query_kv_attention
    out = xops.memory_efficient_attention_forward(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 213, in memory_efficient_attention_forward
    return _memory_efficient_attention_forward(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 310, in _memory_efficient_attention_forward
    out, *_ = op.apply(inp, needs_gradient=False)
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 186, in apply
    out, lse, rng_seed, rng_offset = cls.OPERATOR(
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: attn_bias is not correctly aligned
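For context (editor's note, not part of the original report): the error is raised by xformers' cutlass forward op, which, as far as I understand, rejects an attn_bias whose rows are not 16-byte aligned. A minimal sketch of that condition is below; the helper name is mine and the alignment of 8 elements assumes fp16/bf16 inputs (8 × 2 bytes = 16 bytes).

    import torch

    def bias_rows_are_aligned(attn_bias: torch.Tensor, alignment: int = 8) -> bool:
        # The last dimension must be contiguous and the row stride a multiple
        # of `alignment` elements (assumed 8 for fp16/bf16, i.e. 16 bytes).
        return attn_bias.stride(-1) == 1 and attn_bias.stride(-2) % alignment == 0

With a short prompt, a contiguous ALiBi bias of shape (num_heads, seq_len, seq_len) where seq_len is not a multiple of 8 has a row stride that is not a multiple of 8, which would fail a check of this kind.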

Here is my xFormers info:

python -m xformers.info
xFormers 0.0.21+55a4798.d20230709
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   2.0.1+cu118
pytorch.cuda:                                      available
gpu.compute_capability:                            9.0
gpu.name:                                          NVIDIA H100 PCIe
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.10.12
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    9.0
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   None
build.nvcc_version:                                11.8.89
source.privacy:                                    open source

PyTorch version:

Collecting environment information...
PyTorch version: 2.0.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.31

Python version: 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-73-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          26
On-line CPU(s) list:             0-25
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       26
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8480+
Stepping:                        8
CPU MHz:                         2000.000
BogoMIPS:                        4000.00
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       832 KiB
L1i cache:                       832 KiB
L2 cache:                        104 MiB
L3 cache:                        416 MiB
NUMA node0 CPU(s):               0-25
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.1
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118
[conda] numpy                     1.25.1                   pypi_0    pypi
[conda] torch                     2.0.1+cu118              pypi_0    pypi
[conda] torchaudio                2.0.2+cu118              pypi_0    pypi
[conda] torchvision               0.15.2+cu118             pypi_0    pypi
WoosukKwon added the bug (Something isn't working) label on Jul 13, 2023

WoosukKwon (Collaborator) commented:

Hi @zieen, thanks for reporting the bug and letting us know your environment. It seems the bug occurs because the xformers library does not fully support the H100 GPU at the moment. We have a related issue: #199. For now, please use a different type of GPU.
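Editor's note for readers hitting this on other GPUs as well (see the comments below): a commonly used workaround is to build the ALiBi bias on a buffer whose last dimension is padded up to a multiple of 8 and then slice it back to the real sequence length, so the row stride stays aligned. The sketch below is illustrative only; make_aligned_alibi_bias and its shapes are mine, not vLLM's actual code.

    import torch

    def make_aligned_alibi_bias(alibi_slopes: torch.Tensor, seq_len: int,
                                dtype: torch.dtype = torch.float16) -> torch.Tensor:
        # Allocate the bias with the last dimension padded to a multiple of 8,
        # then slice back to seq_len.  The slice keeps a row stride that is a
        # multiple of 8 elements, which satisfies the 16-byte alignment the
        # cutlass kernel appears to require for fp16/bf16.
        num_heads = alibi_slopes.shape[0]
        padded_len = (seq_len + 7) // 8 * 8
        device = alibi_slopes.device
        pos = torch.arange(seq_len, dtype=dtype, device=device)
        # Relative-position term of ALiBi: (j - i) for key position j, query position i.
        rel = pos[None, :] - pos[:, None]
        bias = torch.empty(num_heads, seq_len, padded_len,
                           dtype=dtype, device=device)[:, :, :seq_len]
        bias.copy_(rel)                                  # broadcast over heads
        bias.mul_(alibi_slopes[:, None, None].to(dtype)) # scale per head
        return bias

The resulting tensor can then be used as the attn_bias (directly, or wrapped in one of xformers' tensor-bias mask classes) instead of a contiguous bias whose row stride depends on the prompt length.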

MM-IR commented Jul 15, 2023

That is very weird, given that I also hit this error when playing with MPT-7B on my A5000 GPUs. TOT

ZhengJun-AI commented:

Me too, when playing with Baichuan-13B on a V100 32GB.

WoosukKwon linked a pull request on Aug 23, 2023 that will close this issue.
groenenboomj pushed a commit to opendatahub-io/vllm that referenced this issue on Feb 27, 2025.