Met segment fault while running Whisper on Arc #13001


Closed
Ruoyu-y opened this issue Mar 25, 2025 · 16 comments · May be fixed by #13049

@Ruoyu-y

Ruoyu-y commented Mar 25, 2025

Configuration:

OS: Ubuntu 24.04
CPU: 12th Gen Intel(R) Core(TM) i9-12900K
Memory: 16G
GPU: 04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08)
Software:
    torch                          2.1.0a0+cxx11.abi
    intel-extension-for-pytorch    2.1.10+xpu
    ipex-llm                       2.2.0b20250322
    bigdl-core-xe-21               2.6.0b20250322

Issue:
Running Whisper with python ./recognize.py crashes with a segmentation fault.

Logs:

$ python recognize.py
/home/cloud/ruoyu/miniforge3/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/cloud/ruoyu/miniforge3/envs/llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2025-03-25 09:18:43,572 - INFO - intel_extension_for_pytorch auto imported
2025-03-25 09:18:43,855 - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
step1:
/home/cloud/ruoyu/miniforge3/envs/llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
2025-03-25 09:18:46,419 - INFO - Converting the current model to sym_int4 format......

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [12th Gen Intel(R) Core(TM) i9-12900K]
Registry and code: 13 MB
Command: python recognize.py
Uptime: 3.432546 s
Segmentation fault
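A bare "Segmentation fault" from a Python process gives no hint of where the crash happened. One generic first step (independent of ipex-llm) is the standard-library faulthandler module, which installs a SIGSEGV handler that prints the Python-level traceback to stderr before the process dies; it can also be enabled without code changes by setting PYTHONFAULTHANDLER=1. A self-contained demonstration that forces a crash in a child process:

```python
import subprocess
import sys
import textwrap

# Child process: enable faulthandler early, then trigger a real SIGSEGV
# via ctypes so faulthandler can print the Python traceback on the way down.
child = textwrap.dedent("""
    import faulthandler, ctypes
    faulthandler.enable()
    ctypes.string_at(0)   # NULL dereference -> segmentation fault
""")

proc = subprocess.run([sys.executable, "-c", child],
                      capture_output=True, text=True)

# faulthandler reports the fatal signal and the offending Python line on stderr.
print("Fatal Python error" in proc.stderr)
```

Adding faulthandler.enable() at the top of recognize.py would show which Python line triggers the native crash, which narrows down whether it happens during model conversion or during the first XPU call.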

@Ruoyu-y
Author

Ruoyu-y commented Mar 25, 2025

Any hints on this issue, or a recommended configuration?

@hkvision
Contributor

Hi,

May I ask whether this segmentation fault occurs only with Whisper, or also when running other models from https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM?

Also, you may run our environment-check script so that we can better diagnose the issue: https://github.com/intel/ipex-llm/tree/main/python/llm/scripts#usage

@Ruoyu-y
Author

Ruoyu-y commented Mar 26, 2025

> May I ask whether this segmentation fault occurs only with Whisper, or also when running other models from https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM?
>
> Also, you may run our environment-check script so that we can better diagnose the issue: https://github.com/intel/ipex-llm/tree/main/python/llm/scripts#usage

Other LLMs also hit the segmentation fault, but they work inside a Docker container. Here's the output of the environment-check script:

$ bash env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.11
-----------------------------------------------------------------
transformers=4.36.2
-----------------------------------------------------------------
torch=2.1.0a0+cxx11.abi
-----------------------------------------------------------------
ipex-llm Version: 2.2.0b20250322
-----------------------------------------------------------------
ipex=2.1.10+xpu
-----------------------------------------------------------------
CPU Information:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               24
On-line CPU(s) list:                  0-23
Vendor ID:                            GenuineIntel
Model name:                           12th Gen Intel(R) Core(TM) i9-12900K
CPU family:                           6
Model:                                151
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             2
CPU(s) scaling MHz:                   22%
CPU max MHz:                          5200.0000
CPU min MHz:                          800.0000
-----------------------------------------------------------------
Total CPU Memory: 15.3286 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 24.04 LTS \n \l

-----------------------------------------------------------------
Linux cloudgpu 6.8.0-52-generic #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.39.20240906
    Build ID: 11f3c29a

Service:
    Version: 1.2.39.20240906
    Build ID: 11f3c29a
    Level Zero Version: 1.17.0
-----------------------------------------------------------------
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
  Driver Version                                  2023.16.12.0.12_195853.xmain-hotfix
-----------------------------------------------------------------
Driver related package version:
ii  intel-fw-gpu                                     2024.17.5-329~22.04                      all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu                             1.3.29735.27-914~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-level-zero-gpu-raytracing                  1.0.0-60~u22.04                          amd64        Level Zero Ray Tracing Support library
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
No device discovered
GPU0 Memory ize=256M
-----------------------------------------------------------------
04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd DG2 [Arc A770]
        Flags: bus master, fast devsel, latency 0, IRQ 234, IOMMU group 20
        Memory at 86000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4050000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 87000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915, xe
-----------------------------------------------------------------

Is there anything wrong with the configuration?

@Ruoyu-y
Author

Ruoyu-y commented Mar 26, 2025

To provide more details: on the same machine, I can run the inference service in Docker following the guide https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md, but I cannot run Whisper or the other LLMs under the python/llm/example/GPU/HuggingFace/LLM folder on my host. I also tried running the Whisper Python file inside a Docker container brought up following that guide, and it failed as well. Please take a look @hkvision, thanks a lot!

@hkvision
Contributor

Hi, we checked your environment, and the following part might be the problem.

-----------------------------------------------------------------
No device discovered
GPU0 Memory ize=256M

Could you run sycl-ls and xpu-smi discovery to confirm whether the Arc device is properly detected? Thanks!

@Ruoyu-y
Author

Ruoyu-y commented Mar 27, 2025

> xpu-smi discovery

xpu-smi discovery returns "No device discovered", but I can find the Arc card using lspci. I am using the in-tree driver on Ubuntu 24.04; could that be causing the issue? @hkvision

@hkvision
Contributor

hkvision commented Mar 27, 2025

From your lspci result below, the 256M memory size does not look correct; it should be 16G. Could you check whether the card is set up properly (e.g. Resizable BAR enabled)? Also, is the output of sycl-ls as expected on your machine?

04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd DG2 [Arc A770]
        Flags: bus master, fast devsel, latency 0, IRQ 234, IOMMU group 20
        Memory at 86000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4050000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 87000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915, xe
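The prefetchable BAR in that listing is the CPU-visible window into VRAM; on an A770 16GB with Resizable BAR enabled it typically reads [size=16G], so [size=256M] suggests ReBAR is disabled in firmware. A small sanity check over the pasted output (an illustrative helper, not part of any Intel tool; note the regex must not accidentally match the non-prefetchable BAR line):

```python
import re

# lspci -v output for the A770, abridged from the listing above.
LSPCI = """\
04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08)
        Memory at 86000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4050000000 (64-bit, prefetchable) [size=256M]
"""

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def prefetchable_bar_bytes(text: str) -> int:
    """Size of the prefetchable BAR, i.e. the CPU-visible VRAM aperture."""
    # ", prefetchable)" avoids matching the "non-prefetchable" BAR line.
    m = re.search(r", prefetchable\) \[size=(\d+)([KMG])\]", text)
    if m is None:
        raise ValueError("no prefetchable BAR found in lspci output")
    size, unit = m.groups()
    return int(size) * UNITS[unit]

vram = 16 * UNITS["G"]          # A770 16GB variant
bar = prefetchable_bar_bytes(LSPCI)
print(f"aperture {bar >> 20}M, full VRAM visible: {bar >= vram}")
# -> aperture 256M, full VRAM visible: False
```

A 256M aperture is the legacy (pre-ReBAR) size, which is consistent with the card being poorly detected by the compute stack.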

@Ruoyu-y
Author

Ruoyu-y commented Mar 28, 2025

> From your lspci result below, the 256M memory size does not look correct; it should be 16G. Could you check whether the card is set up properly (e.g. Resizable BAR enabled)? Also, is the output of sycl-ls as expected on your machine?

No Arc device shows up in the sycl-ls result. How should I fix this? When I previously ran ipex-llm inside Docker, sycl-ls did find the Arc card.

@hkvision
Contributor

We suspect this is not an ipex-llm issue but is probably caused by the driver-related packages.
You may refer to https://dgpu-docs.intel.com/driver/client/overview.html#installing-client-gpus-on-ubuntu-desktop-24-04-lts for the driver guide.
The Docker environment (Ubuntu 22.04) is defined here: https://github.com/intel/ipex-llm/blob/main/docker/llm/serving/xpu/docker/Dockerfile

@Ruoyu-y
Author

Ruoyu-y commented Apr 3, 2025

> We suspect this is not an ipex-llm issue but is probably caused by the driver-related packages. You may refer to https://dgpu-docs.intel.com/driver/client/overview.html#installing-client-gpus-on-ubuntu-desktop-24-04-lts for the driver guide. The Docker environment (Ubuntu 22.04) is defined here: https://github.com/intel/ipex-llm/blob/main/docker/llm/serving/xpu/docker/Dockerfile

I followed the guide you provided and reinstalled the driver. Using the clinfo | grep "770" command given at the end of the tutorial, I can now see the device. I then installed the other dependencies following https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-oneapi, and everything seemed fine, but in the end I still hit the segmentation fault. Any other suggestions?

@Ruoyu-y
Author

Ruoyu-y commented Apr 3, 2025

Or is there an example of running Whisper in a Docker container?

@Ruoyu-y
Author

Ruoyu-y commented Apr 3, 2025

It would also work for me to run it in Docker. Here's the error message I got when running the example inside a Docker container:
(screenshot)

After installing the missing dependency with pip install trl, I got another error:
(screenshot)

@Ruoyu-y
Author

Ruoyu-y commented Apr 4, 2025

Thanks for the guidance. The issue has been resolved.

@Ruoyu-y Ruoyu-y closed this as completed Apr 4, 2025
@hkvision
Contributor

hkvision commented Apr 7, 2025

Synced offline, pip install trl==0.11.0 solves the problem.
Feel free to tell us if there are further issues later :)
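Since the crash came from an unpinned dependency, a small startup guard can fail fast with a readable message instead of a confusing error later. This is an illustrative sketch, not part of the ipex-llm examples; the version comes from the comment above:

```python
from importlib.metadata import PackageNotFoundError, version

# The thread above found that newer trl releases break the example;
# 0.11.0 is the version that worked.
REQUIRED_TRL = "0.11.0"

try:
    installed = version("trl")
except PackageNotFoundError:
    installed = None

if installed == REQUIRED_TRL:
    print("trl version OK")
else:
    print(f"warning: expected trl=={REQUIRED_TRL}, found {installed}")
```

Equivalently, the pin can simply go into the example's requirements (pip install trl==0.11.0) so a fresh environment never pulls in an incompatible release.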

@jason-dai
Contributor

> Synced offline, pip install trl==0.11.0 solves the problem. Feel free to tell us if there are further issues later :)

Shall we update the example readme?

@hkvision
Contributor

hkvision commented Apr 7, 2025

> > Synced offline, pip install trl==0.11.0 solves the problem. Feel free to tell us if there are further issues later :)
>
> Shall we update the example readme?

Sure :)
