
[Contributions Welcome] Add Fast Image Processors #36978


Open
43 of 69 tasks
yonigozlan opened this issue Mar 25, 2025 · 53 comments · May be fixed by #37168, #37481, #37804, #38502 or #37210
Labels
contributions-welcome · Good First Issue · Good Second Issue · Processing · Vision

Comments

@yonigozlan
Member

yonigozlan commented Mar 25, 2025

Community contributions: Add Fast Image Processors

Fast image processors have been rolling out progressively for a while. Now that the BaseImageProcessorFast, from which all fast image processors inherit, is in a more stable state, I'm opening this issue to encourage contributors to add fast image processors for models that still only have a "slow" image processor.

How to implement a Fast Image Processor

The core principle of fast image processors is to use torch and torchvision functions for image transformations instead of PIL or numpy. Among other performance benefits, this enables processing images on GPU, significantly improving inference speed.

Another key difference compared to slow image processors is that, unlike BaseImageProcessor, which provides only a minimal skeleton, BaseImageProcessorFast includes all the fundamental functionality needed for a basic image processor. This allows optimizations made in BaseImageProcessorFast to propagate to its subclasses. Additionally, most of the repetitive logic for image loading and argument handling is managed within BaseImageProcessorFast. Except in rare cases, subclasses do not need to handle image loading, conversion, or retrieving arguments from class attributes in the call/preprocess function; all of this is handled in BaseImageProcessorFast.

Getting Started

Run the following command:

transformers-cli add-fast-image-processor --model-name model_name

where model_name is the name of the model (as found in its folder under transformers/src/transformers/models) for which you're adding the fast image processor.

This command will handle all necessary imports and generate a basic fast image processor, which will look similar to this example for Beit:

# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Image processor class for Beit."""

from ...image_processing_utils_fast import BASE_IMAGE_PROCESSOR_FAST_DOCSTRING, BaseImageProcessorFast
from ...image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD, PILImageResampling
from ...utils import add_start_docstrings


@add_start_docstrings(
    "Constructs a fast Beit image processor.",
    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
)
class BeitImageProcessorFast(BaseImageProcessorFast):
    # This generated class can be used as a starting point for the fast image processor.
    # if the image processor is only used for simple augmentations, such as resizing, center cropping, rescaling, or normalizing,
    # only the default values should be set in the class.
    # If the image processor requires more complex augmentations, methods from BaseImageProcessorFast can be overridden.
    # In most cases, only the `_preprocess` method should be overridden.

    # For an example of a fast image processor requiring more complex augmentations, see `LlavaNextImageProcessorFast`.

    # Default values should be checked against the slow image processor
    # None values left after checking can be removed
    resample = PILImageResampling.BICUBIC
    image_mean = IMAGENET_STANDARD_MEAN
    image_std = IMAGENET_STANDARD_STD
    size = {"height": 256, "width": 256}
    default_to_square = None
    crop_size = {"height": 224, "width": 224}
    do_resize = True
    do_center_crop = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = None


__all__ = ["BeitImageProcessorFast"]

As explained in the generated file, if the image processor only performs basic augmentations such as resizing, center cropping, rescaling, and normalizing, the generated file might be sufficient for a working fast image processor. The class attributes, such as resample and image_mean, are automatically parsed from the slow image processor when running the script above. However, you should verify their correctness and check for any missing or incorrectly assigned values.

Customizing the Image Processor

If the image processor requires additional functionalities beyond the basic augmentations, you will need to override the _preprocess function in BaseImageProcessorFast. Check the _preprocess implementation in BaseImageProcessorFast for reference. Notably, it leverages group_images_by_shape and reorder_images to enable batch processing, significantly increasing processing speed, particularly on GPUs. If you create new image processing functions, ensure they support batch processing by utilizing group_images_by_shape and reorder_images where possible.
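To make the batching pattern concrete, here is a simplified, self-contained sketch of the idea behind group_images_by_shape and reorder_images. The actual helpers in transformers.image_processing_utils_fast have richer signatures; this is only an illustration of the technique, with a trivial rescale standing in for real processing:

```python
from collections import defaultdict

import torch


def group_images_by_shape(images):
    """Bucket images by shape so each bucket can be stacked and processed as one batch."""
    grouped = defaultdict(list)
    index = []  # remembers where each image went: (shape, position in its bucket)
    for img in images:
        shape = tuple(img.shape)
        index.append((shape, len(grouped[shape])))
        grouped[shape].append(img)
    # Stack each bucket into a single (N, C, H, W) tensor for batched ops.
    return {shape: torch.stack(imgs) for shape, imgs in grouped.items()}, index


def reorder_images(processed, index):
    """Restore the original ordering after per-bucket batched processing."""
    return [processed[shape][pos] for shape, pos in index]


# Example: rescale every image with one batched op per shape bucket.
images = [torch.ones(3, 2, 2), torch.ones(3, 4, 4), torch.ones(3, 2, 2)]
grouped, index = group_images_by_shape(images)
processed = {shape: batch * 0.5 for shape, batch in grouped.items()}
out = reorder_images(processed, index)
```

On GPU this matters because each shape bucket becomes a single kernel launch instead of one launch per image.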

If your image processor requires additional kwargs not present in DefaultFastImageProcessorKwargs, you must create a ModelNameFastImageProcessorKwargs class that inherits from DefaultFastImageProcessorKwargs and defines the new kwargs. Additionally, you should document the added kwargs in the class and the preprocess function using add_start_docstrings. (This documentation process may be simplified soon, but for now it is necessary to generate correct documentation.)
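The kwargs pattern can be sketched as follows. DefaultFastImageProcessorKwargs is redefined locally here as a stand-in so the snippet is self-contained, and do_pad/pad_size are hypothetical model-specific kwargs:

```python
from typing import Optional, TypedDict


class DefaultFastImageProcessorKwargs(TypedDict, total=False):
    # Local stand-in for the class of the same name in
    # transformers.image_processing_utils_fast (abridged).
    do_resize: Optional[bool]
    do_rescale: Optional[bool]


class MyModelFastImageProcessorKwargs(DefaultFastImageProcessorKwargs, total=False):
    # Hypothetical model-specific kwargs added on top of the defaults.
    do_pad: Optional[bool]
    pad_size: Optional[dict]


# All keys are optional (total=False), so callers pass only what they need.
kwargs: MyModelFastImageProcessorKwargs = {"do_resize": True, "do_pad": True}
```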

For an example of handling custom kwargs and documentation, refer to LlavaNextImageProcessorFast.

Important Notes

  • In nearly all cases, _preprocess is the only function in BaseImageProcessorFast that needs to be overridden.
  • The _preprocess function does not require default values for its arguments, as they are automatically derived from class attributes if not explicitly provided.
  • Even if PIL images or numpy arrays are passed to the image processor, the images argument in _preprocess will always be a list of tensors, with the channel dimension first.
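The second note above can be illustrated with a toy stand-in class (TinyFastProcessor is hypothetical, not a transformers API): kwargs that are not passed explicitly fall back to class attributes.

```python
class TinyFastProcessor:
    # Class-level defaults, analogous to the attributes set on a fast image processor.
    do_resize = True
    size = {"height": 224, "width": 224}

    def preprocess(self, images, **kwargs):
        # Any kwarg not provided by the caller falls back to the class attribute.
        for name in ("do_resize", "size"):
            kwargs.setdefault(name, getattr(self, name))
        return kwargs  # a real processor would forward these to _preprocess


proc = TinyFastProcessor()
resolved = proc.preprocess([], size={"height": 256, "width": 256})
```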

Handling Edge Cases

  • Nested Images: If images are provided as nested lists (e.g., [[image1, image2], [image3]]), they will be flattened to [image1, image2, image3] by default before being passed to _preprocess. This behavior can be modified by overriding _prepare_images_structure, though flattening is generally recommended.
  • Formatting Custom Kwargs: If any custom kwargs require formatting before _preprocess, override _further_process_kwargs.
  • Validating Custom Kwargs: If additional validation is needed for custom kwargs or existing ones, override _validate_preprocess_kwargs.
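The default flattening behavior for nested images can be sketched as follows (flatten_nested_images is an illustrative helper, not the actual _prepare_images_structure implementation):

```python
def flatten_nested_images(images):
    """Flatten one level of nesting: [[img1, img2], [img3]] -> [img1, img2, img3]."""
    if images and isinstance(images[0], (list, tuple)):
        return [img for group in images for img in group]
    # Already flat: return a shallow copy unchanged.
    return list(images)
```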

Testing

If the model already has a test_image_processing_model_name.py file under transformers/tests/models/model_name, the script you ran earlier should have imported the fast image processor into that file and added it as the fast_image_processing_class attribute of the ModelNameImageProcessingTest class.
However, this is not enough to get all the tests to run on the fast image processor. In every test function under ModelNameImageProcessingTest, you need to replace image_processing = self.image_processing_class(**self.image_processor_dict) with a loop over self.image_processor_list.

For example, the test_image_processor_properties test in test_image_processing_beit.py which looks like this:

    def test_image_processor_properties(self):
        image_processing = self.image_processing_class(**self.image_processor_dict)
        self.assertTrue(hasattr(image_processing, "do_resize"))
        self.assertTrue(hasattr(image_processing, "size"))
        self.assertTrue(hasattr(image_processing, "do_center_crop"))
        self.assertTrue(hasattr(image_processing, "center_crop"))
        self.assertTrue(hasattr(image_processing, "do_normalize"))
        self.assertTrue(hasattr(image_processing, "image_mean"))
        self.assertTrue(hasattr(image_processing, "image_std"))
        self.assertTrue(hasattr(image_processing, "do_reduce_labels"))

should be changed to this:

    def test_image_processor_properties(self):
        for image_processing_class in self.image_processor_list:
            image_processing = image_processing_class(**self.image_processor_dict)
            self.assertTrue(hasattr(image_processing, "do_resize"))
            self.assertTrue(hasattr(image_processing, "size"))
            self.assertTrue(hasattr(image_processing, "do_center_crop"))
            self.assertTrue(hasattr(image_processing, "center_crop"))
            self.assertTrue(hasattr(image_processing, "do_normalize"))
            self.assertTrue(hasattr(image_processing, "image_mean"))
            self.assertTrue(hasattr(image_processing, "image_std"))
            self.assertTrue(hasattr(image_processing, "do_reduce_labels"))

In the case where no image processing test file is present, now is a great time to add one! You can have a look at the CLIP image processing test file to use as a simple starting point.

Don't hesitate to add model-specific tests if you feel like there are some non-standard image processing techniques in the processor :).

To run the tests, use this command:

RUN_SLOW=1 python -m pytest tests/models/model_name/test_image_processing_model_name.py

Choosing an Image Processor to Implement

The difficulty of implementing a fast image processor varies by model. If this is your first issue, consider starting with an easier one!

Happy coding!

Here is the list of fast image processors left to implement:

@yonigozlan added the contributions-welcome, Good First Issue, Good Second Issue, Vision, and Processing labels Mar 25, 2025
@MinJu-Ha
Contributor

Hey! I'd like to work on this issue with MobileViT 😊

@edgarriba

@yonigozlan Have you considered adopting kornia for that? We have been curating algorithms (700+ ops) for several years, covering image processing and low-level vision using exclusively PyTorch.

@Knight7561

I would love to pick one and start contributing. Good task for this week!

@yonigozlan
Member Author

yonigozlan commented Mar 26, 2025

@edgarriba I love kornia! But for image processors at inference time, it might be a bit overkill since 90% of the time, we only need a mix of resizing, normalizing, padding, and cropping, combined with some model-specific logic. I’ve found that torch/torchvision functional transforms usually cover these needs well, and pipelines like kornia ImageSequential or torchvision Compose aren’t always a good fit because some models require additional processing steps or custom logic in between. We also wanted to avoid adding an extra dependency to Transformers for fast image processors.

That said, I do think kornia could be valuable down the line, especially for batch processing. I'm still exploring how to optimize batch processing performance on both GPU and CPU, and kornia likely handles this more efficiently than our current approach.

@edgarriba

The core of the library is purely functional; that has always been the scope. The top layers you mention were added later, purely for convenience in the case of augmentations, but a big part of the library has been designed as free functions for exactly the purposes you mention. In any case, we are always open to collaborations and improvements.

@goravaa

goravaa commented Mar 26, 2025

Hi! I'd like to start with YOLOS 🚀

@capnmav77

capnmav77 commented Mar 27, 2025

Hi! I'd like to work on Segformer 😊. Any thoughts on this? It's my first contribution.
Here's my draft PR, thank you!

@mariorch22

I'd like to start with mllama

@zshn25
Contributor

zshn25 commented Mar 27, 2025

I would like to start with EfficientNet, but some tests don't pass. It would be great if someone could have a look:

#37055

@Yann-CV
Contributor

Yann-CV commented Mar 28, 2025

@zshn25 Sadly, I have also worked on EfficientNet... I managed to fix the tests, so you can have a look. Anyway, I will let the maintainers decide which pull request to keep.

@JaiJoshi123

Hi! I'd like to work on this issue with ImageGPT 🤗

@samrae7

samrae7 commented Mar 29, 2025

Hi. I would like to do this for ZoeDepth if that's ok?

@RaghavPrabhakar66
Contributor

Hi, I would like to work on LayoutLMv3.

@henrikm11
Contributor

henrikm11 commented Apr 10, 2025

Working on ViTMatte, should be able to create the PR sometime this weekend.

UPDATE: This will take a bit longer, since it appears that the original preprocessing may have a bug when the input format is ChannelDimension.FIRST, which makes it hard to compare the performance on torch.Tensor...

@arkhamHack
Contributor

@yonigozlan Hi, I would like to work on Superpoint; will raise a PR soon.

@Kim-Ju-won
Contributor

Kim-Ju-won commented Apr 14, 2025

@yonigozlan Hi, I would like to work on TVP; will raise a PR soon! Thanks

@olccihyeon

@yonigozlan Hi, I would like to work on instructblipvideo; will raise a PR! Thank you

@Rishik00

Hi @yonigozlan I'd like to work on the image processor for mobilenet. Will raise a PR! Thanks

@NahieliV
Contributor

Hi! @yonigozlan Here is the PR for Nougat #37661.

@arkhamHack arkhamHack linked a pull request Apr 26, 2025 that will close this issue
@Shoumik-Gandre

@yonigozlan
I need help with OneFormer: it has a kwarg called max_size and I am unsure how to handle this scenario.
The LlavaNextImageProcessorFast example does not shed any light on it.

@Kim-Ju-won
Contributor

Kim-Ju-won commented May 5, 2025

Hi @yonigozlan @zucchini-nlp,
I've been working on a fast processor for the TVP model. However, after reviewing issue #37611, I noticed that the InstructBLIP model was removed from the list. Based on the issue discussion, I understand that fast image processors are not needed for video-only models.

That said, I also see that the TVP and VideoMAE models — which are video-only — are still included on the list, and a PR for VideoMAE has been opened, though it may not be merged based on recent discussions.

Would it be appropriate to open a draft PR for the TVP model?
I’d like to check in and get your thoughts before opening PR.

Thank you!

@henrikm11
Contributor

henrikm11 commented May 6, 2025

I can also do ZoeDepth sometime soon as it seems that's up for grabs again, may take a week or two though.
Update: Been busier than expected, but I am almost there, PR will be raised very soon.

@jgyasu

jgyasu commented May 7, 2025

Hi @yonigozlan , I would love to work on vivit! Will raise a PR soon :)

@aryanchauhan31

Hi! I'd like to work on glpn. Let me know if it's still available!

@Ajaykashela

Ajaykashela commented May 20, 2025

> Hi! I'd like to work on glpn. Let me know if it's still available!

Heya @aryanchauhan31, I was originally working on GLPN, but due to other commitments, I haven't been able to dedicate enough time to it. Please feel free to take it over. If I can be of any help, please feel free to reach out.

@aryanchauhan31

> Hi! I'd like to work on glpn. Let me know if it's still available!
>
> Heya @aryanchauhan31, I was originally working on GLPN, but due to other commitments, I haven't been able to dedicate enough time to it. Please feel free to take it over. If I can be of any help please feel free to reach out.

Thanks. Sure, I'll let you know.

@AnimeshMaheshwari22

Hi. I'd like to work on VitPose

@Ishubhammohole

Hi @yonigozlan 👋,

I'd like to contribute by implementing the Fast Image Processor for Pix2Struct.
This seems like a valuable opportunity to contribute to a multimodal model, and I’m excited to dive in.

Please let me know if this model is still unassigned or if there’s anything specific I should be aware of before getting started.

Thanks!
— Shubham
