docs/source/getting_started/troubleshooting.md
38 additions & 2 deletions
@@ -22,9 +22,9 @@ It'd be better to store the model in a local disk. Additionally, have a look at
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
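For example, a minimal sketch of this check with the offline `LLM` API (the model name is a placeholder; the equivalent server flag is `--load-format dummy`):

```python
from vllm import LLM

# Hedged sketch: skip loading the real model weights (random initialization
# instead), to check whether weight download/loading is the startup bottleneck.
# The model name below is a placeholder.
llm = LLM(model="facebook/opt-125m", load_format="dummy")
```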
## Out of memory
If the model is too large to fit on a single GPU, you will get an out-of-memory (OOM) error. Consider [using tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the tensor parallel size). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but afterwards you can load the sharded checkpoint much faster, and the model loading time should remain constant regardless of the tensor parallel size.
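As a rough sketch of what loading such a sharded checkpoint can look like (the path and parallel size below are placeholders, and assume the checkpoint was saved with the same tensor parallel size):

```python
from vllm import LLM

# Hedged sketch: after converting the checkpoint with
# examples/offline_inference/save_sharded_state.py, point vLLM at the sharded
# output directory. The path and tensor_parallel_size are placeholders and
# must match how the checkpoint was sharded.
llm = LLM(
    model="/path/to/sharded/checkpoint",
    load_format="sharded_state",
    tensor_parallel_size=2,
)
```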
## Enable more logging
@@ -218,6 +218,42 @@ print(f(x))
If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with your version of PyTorch. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example.
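A quick, hedged way to check which `triton` and `torch` builds are actually being picked up (plain package inspection, nothing vLLM-specific):

```python
# Hedged diagnostic sketch: print the installed torch and triton versions and
# where triton is imported from, to spot a custom triton build shadowing the
# one that ships with PyTorch.
import torch
import triton

print("torch  :", torch.__version__)
print("triton :", triton.__version__)
print("triton imported from:", triton.__file__)
```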
## Model failed to be inspected
If you see an error like:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
227
+
raise ValueError(
228
+
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.
```
It means that vLLM failed to import the model file.
Usually, it is related to missing dependencies or outdated binaries in the vLLM build.
Please read the logs carefully to determine the root cause of the error.
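One way to surface the underlying `ImportError` is to import the model's module directly. A minimal sketch, assuming a Llama-family architecture; the module path is internal to vLLM and may differ for your model or version:

```python
# Hedged sketch: importing the model module directly usually prints the real
# root cause (e.g. a missing optional dependency or an ABI mismatch) instead
# of the generic "failed to be inspected" message. The module below is an
# example for Llama-family models; substitute the one for your architecture.
from vllm.model_executor.models.llama import LlamaForCausalLM  # noqa: F401

print("model module imported successfully")
```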
## Model not supported
If you see an error like:
```text
Traceback (most recent call last):
...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable
```
or:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
251
+
raise ValueError(
252
+
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```
but you are sure that the model is in the [list of supported models](#supported-models), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
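As a hedged sketch of that workaround (the model/architecture pairing below is only illustrative; see the model resolution docs for the authoritative description), you can override the architecture reported by the checkpoint's config:

```python
from vllm import LLM

# Hedged sketch: explicitly pin the implementation vLLM should use when the
# checkpoint's config.json reports a missing or unrecognized architecture.
# The model name and architecture below are illustrative placeholders.
llm = LLM(
    model="your-org/your-gpt2-style-model",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},
)
```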
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).