Fix `from_args_and_dict` in `ProcessorMixin` #38296
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I am not against it, but my memory tells me we do the same thing elsewhere (i.e. kwargs are not used to init objects, but to set values afterwards). For example,
So it's not clear to me:
And @zucchini-nlp could also be helpful on this PR, I believe :-)
A concrete example given by @Cyrilvallez in the Gemma3 processor:

```python
class Gemma3Processor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    valid_kwargs = ["chat_template", "image_seq_length"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(
        self,
        image_processor,
        tokenizer,
        chat_template=None,
        image_seq_length: int = 256,
        **kwargs,
    ):
        print(image_seq_length)
        self.image_seq_length = image_seq_length
        self.image_token_id = tokenizer.image_token_id
        self.boi_token = tokenizer.boi_token
        self.image_token = tokenizer.image_token
        image_tokens_expanded = "".join([tokenizer.image_token] * image_seq_length)
        print(image_tokens_expanded)
        self.full_image_sequence = f"\n\n{tokenizer.boi_token}{image_tokens_expanded}{tokenizer.eoi_token}\n\n"
        super().__init__(
            image_processor=image_processor,
            tokenizer=tokenizer,
            chat_template=chat_template,
            **kwargs,
        )
```

If we instantiate a processor like this:

```python
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it", image_seq_length=5)
```
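The failure mode can be reproduced with a toy sketch (a hypothetical class, not the real `ProcessorMixin` code): when a kwarg is only applied as an attribute after `__init__` has run, any derived state computed inside `__init__` goes stale.

```python
# Toy sketch of the bug (hypothetical class, not the real ProcessorMixin):
# derived state computed in __init__ goes stale when a kwarg is only
# set as an attribute after initialization.
class ToyProcessor:
    def __init__(self, image_seq_length: int = 256):
        self.image_seq_length = image_seq_length
        # derived attribute, computed once from image_seq_length at init time
        self.full_image_sequence = "<image>" * image_seq_length

# old behavior: __init__ runs with the default, the kwarg is set afterwards
processor = ToyProcessor()
processor.image_seq_length = 5

print(processor.image_seq_length)                      # 5
print(processor.full_image_sequence.count("<image>"))  # still 256, not 5
```

This mirrors the Gemma3 case above: `full_image_sequence` is built from `image_seq_length` inside `__init__`, so overriding the attribute afterwards silently leaves the expanded sequence at its default length.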
I also just changed the PR to completely remove the manual
Yeah, I remember this issue. IMO the case is specific to Gemma3 and is better fixed by deleting
This seems more aligned with what would be expected when adding kwargs to
Oh yeah, let's make sure all processors have
`transformers/src/transformers/processing_utils.py`, lines 986 to 996 in 0704e51
Since here we would be passing only "valid" kwargs, based on the `__init__` signature, when instantiating the processor, we wouldn't raise an error even if an invalid kwarg is passed to `from_pretrained`.
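That silent-drop behavior can be sketched with a minimal helper (hypothetical name, not the actual transformers implementation): splitting kwargs by the `__init__` signature means a misspelled kwarg is quietly ignored rather than raising.

```python
import inspect

# Hypothetical helper (not the actual transformers code) sketching the
# concern above: only kwargs matching the __init__ signature reach the
# constructor, so an invalid kwarg is silently dropped instead of erroring.
def split_init_kwargs(cls, kwargs):
    params = inspect.signature(cls.__init__).parameters
    accepted = {name for name in params if name != "self"}
    valid = {k: v for k, v in kwargs.items() if k in accepted}
    ignored = {k: v for k, v in kwargs.items() if k not in accepted}
    return valid, ignored

class DemoProcessor:
    def __init__(self, chat_template=None, image_seq_length=256):
        self.chat_template = chat_template
        self.image_seq_length = image_seq_length

valid, ignored = split_init_kwargs(
    DemoProcessor, {"image_seq_length": 5, "image_seq_len": 7}
)
print(valid)    # {'image_seq_length': 5}
print(ignored)  # {'image_seq_len': 7} -- the typo goes unnoticed
```

(Note that a processor whose `__init__` accepts `**kwargs`, like Gemma3 above, would accept everything under this scheme, which is part of why validation would have to happen elsewhere.)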
Very nice example. And OK if you want to deal with the other mixins separately.
Super nice to remove the `valid_kwargs` attribute, as it was making the code unnecessarily complicated! 🤗 Glad that the dangerous behavior of setting the attrs after initialization is removed as well 🤗
However, to my understanding, args should take precedence!
```python
# remove args that are in processor_dict to avoid duplicate arguments
args_to_remove = [i for i, arg in enumerate(accepted_args_and_kwargs) if arg in processor_dict]
args = [arg for i, arg in enumerate(args) if i not in args_to_remove]
```
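Run against toy values (all names and values below are made up for illustration), those two lines behave like this: any positional arg whose name also appears in `processor_dict` is dropped, so the `processor_dict` value is the one that reaches `__init__`.

```python
# Toy run of the diff above (example values are made up): positional args
# whose names also appear in processor_dict are dropped, so the value
# from processor_dict wins.
accepted_args_and_kwargs = ["image_processor", "tokenizer", "chat_template"]
args = ["image_processor_from_checkpoint", "tokenizer_from_checkpoint"]
processor_dict = {"tokenizer": "tokenizer_from_dict", "chat_template": "saved_template"}

# remove args that are in processor_dict to avoid duplicate arguments
args_to_remove = [i for i, arg in enumerate(accepted_args_and_kwargs) if arg in processor_dict]
args = [arg for i, arg in enumerate(args) if i not in args_to_remove]

print(args_to_remove)  # [1, 2]
print(args)            # ['image_processor_from_checkpoint']
```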
Here I think `args` should take precedence over `processor_dict`, no? As they are necessarily passed by the user directly, no? And `processor_dict` may come from the config and not necessarily from the merged kwargs.
So I think we should never remove `args`, but instead remove the corresponding values from `processor_dict` if they were saved.
The "args" are given by `_get_arguments_from_pretrained`, and are not really args given by the user, but attributes retrieved from a checkpoint, then reconstructed as an "args" list based on the `attributes` attribute of processors (again, there seems to be some redundancy here that we could solve by inspecting the signature; I could have a look next :) ).

`transformers/src/transformers/processing_utils.py`, lines 1203 to 1205 in f530727:

```python
args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
processor_dict, kwargs = cls.get_processor_dict(pretrained_model_name_or_path, **kwargs)
return cls.from_args_and_dict(args, processor_dict, **kwargs)
```

So what can be specified by the user directly here can only be the kwargs, so I think they should take precedence, especially because of use cases such as this one:

```python
processor = AutoProcessor.from_pretrained("checkpoint_path", image_processor=custom_image_processor)
```

where we want to get a processor from a checkpoint but only modify one of its "attributes", e.g. image_processor, tokenizer or feature_extractor. This is currently not supported, and the kwarg will just be silently ignored.
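That precedence boils down to a plain dict merge (hypothetical helper below, not the transformers API): values reconstructed from the checkpoint act as defaults, and anything the user passes explicitly wins.

```python
# Sketch of the intended precedence (hypothetical helper, not the real
# transformers code): components reconstructed from the checkpoint are
# defaults, and kwargs passed explicitly by the user override them.
def resolve_processor_components(checkpoint_components, user_kwargs):
    resolved = dict(checkpoint_components)
    resolved.update(user_kwargs)  # user-supplied kwargs take precedence
    return resolved

resolved = resolve_processor_components(
    {"image_processor": "image_processor_from_checkpoint",
     "tokenizer": "tokenizer_from_checkpoint"},
    {"image_processor": "custom_image_processor"},
)
print(resolved["image_processor"])  # custom_image_processor
print(resolved["tokenizer"])        # tokenizer_from_checkpoint
```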
Ha ok, alright then! Indeed, the TL;DR is: whatever the user explicitly passes should take precedence, of course!
Alright, thanks for clarifying! LGTM!
What does this PR do?
Fix an issue where kwargs given to `from_pretrained` for processors would not be taken into account in the processor's `__init__` directly, but would override the processor's attributes after the fact, resulting in unexpected behaviors.

This might be slightly breaking: before, the kwargs, and even the `processor_dict` items, would be used even if not valid, but I guess this shouldn't be the case :)