Using same model for vision/text embeddings #31
Oh, thanks for the nice words @karlomikus. I understand why you'd want to use the same model for generating both vision and text embeddings, especially with models like CLIP and SigLIP. However, let me clarify a few things about how these models and pipelines work in TransformersPHP.

CLIP and similar models are designed as multimodal models, which means they can handle both image and text inputs. However, when you use a pipeline like feature-extraction, only text inputs are prepared and passed to the model; the current structure of the feature extraction pipeline doesn't account for models that might expect an image input as well. So, while CLIP and other similar models are multimodal, using them for text-only feature extraction isn't straightforward or even possible right now.

I'll look into supporting these types of models in the near future and will reference this issue if that becomes part of the library. In the meantime, I recommend exploring smaller models trained specifically for text feature extraction for your application. I hope this helps! Let me know if you have any other questions.
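For reference, the text-only route suggested here would look roughly like the sketch below. It assumes the pipeline() helper function exported by the library, and uses Xenova/all-MiniLM-L6-v2 purely as an example of a small text-only embedding model; any similar model should work.

use function Codewithkyrian\Transformers\Pipelines\pipeline;

// A text-only embedding model never asks for "pixel_values",
// so the feature-extraction pipeline works with it as-is.
$extractor = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Token-level embeddings for the query text.
$embeddings = $extractor('A man with a hat');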
Thanks for the response, makes sense. Looking forward to more features in the future. Here's what I had so far, which kinda "works", in case someone stumbles upon this:

// Model config needs the "processor_class" key removed to fall back to the default image processor
$modelName = 'Xenova/siglip-base-patch16-224';
$config = AutoConfig::fromPretrained($modelName);
$image = Image::read('/var/www/app/src/test.jpg');

// Text branch: tokenize the query and run it through the SigLIP text tower
$textModel = SiglipTextModel::fromPretrained($modelName, false, $config);
$textTokenizer = PretrainedTokenizer::fromPretrained($modelName);
$textInputs = $textTokenizer('A man with a hat', padding: true, truncation: true);
$textOutputs = $textModel($textInputs);
$textEmbeddings = $textOutputs["last_hidden_state"] ?? $textOutputs["logits"];

// Vision branch: preprocess the image and run it through the SigLIP vision tower
$visionModel = SiglipVisionModel::fromPretrained($modelName, false, $config);
$visionProcessor = AutoProcessor::fromPretrained($modelName);
$visionInputs = $visionProcessor($image);
$visionOutput = $visionModel($visionInputs);
$visionEmbeddings = $visionOutput['last_hidden_state']->toArray();
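One note on those outputs: last_hidden_state is a sequence of per-token (text) or per-patch (vision) vectors, so for similarity search you usually reduce it to a single fixed-size vector first. Below is a minimal plain-PHP sketch of mean pooling over nested arrays; the meanPool helper is hypothetical, not part of TransformersPHP, and assumes the hidden states have already been converted with toArray().

// Hypothetical helper: mean-pool hidden states shaped [tokens][hiddenSize]
// (as nested PHP arrays) into a single vector of length hiddenSize.
function meanPool(array $hiddenStates): array
{
    $count = count($hiddenStates);
    $size = count($hiddenStates[0]);
    $pooled = array_fill(0, $size, 0.0);

    foreach ($hiddenStates as $tokenVector) {
        foreach ($tokenVector as $i => $value) {
            $pooled[$i] += $value / $count;
        }
    }

    return $pooled;
}

// e.g. $textVector = meanPool($textEmbeddings->toArray()[0]);
//      $visionVector = meanPool($visionEmbeddings[0]);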
Great! This is a perfect use case for using the models directly. The pipelines are just there to make these steps easy, albeit sacrificing some flexibility.
It would be great if a (near) future update of TransformersPHP supported this two-way CLIP functionality. In my case, I am trying to get a CLIP vector (CLS) from a text like "A cat playing with a ball" so I can search a database full of such vectors (using cosine similarity, for example), and my code so far for getting the text vector hits exactly this 'Undefined array key "pixel_values"' error.
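For the database search step mentioned above, the cosine similarity between a query vector and a stored vector can be computed in plain PHP. This is only a minimal sketch with a hypothetical helper, assuming both embeddings are plain float arrays of equal length:

// Plain-PHP cosine similarity between two equal-length float vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// e.g. $score = cosineSimilarity($textVector, $storedImageVector);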
@karlomikus |
Your question
Hello,
Is it possible to use the same model to generate both vision and text embeddings? It seems like models such as CLIP and SigLIP should support this, but using the feature-extraction pipeline returns an error:
Warning: Undefined array key "pixel_values" in /var/www/app/vendor/codewithkyrian/transformers/src/Models/ModelArchitecture.php on line 86
Excellent package btw.
Context (optional)
No response
Reference (optional)
No response