Using same model for vision/text embeddings #31


Closed

karlomikus opened this issue May 7, 2024 · 5 comments
Labels
question (Further information is requested)

Comments

@karlomikus

Your question

Hello,

Is it possible to use the same model to generate both vision and text embeddings? It seems like models such as CLIP and SigLIP should support this, but using a pipeline like this:

<?php

use function Codewithkyrian\Transformers\Pipelines\pipeline;

$modelName = 'Xenova/clip-vit-base-patch32';

// Feature-extraction pipeline pointed at a CLIP checkpoint
$extractor = pipeline('feature-extraction', $modelName);
$embeddings = $extractor('A man with a hat');

returns an error: Warning: Undefined array key "pixel_values" in /var/www/app/vendor/codewithkyrian/transformers/src/Models/ModelArchitecture.php on line 86

Excellent package btw.


karlomikus added the question (Further information is requested) label on May 7, 2024
@CodeWithKyrian
Owner

Oh, thanks for the nice words @karlomikus. I understand why you'd want to use the same model to generate both vision and text embeddings, especially with models like CLIP and SigLIP. However, let me clarify a few things about how these models and pipelines work in TransformersPHP.

CLIP and similar models are multimodal, which means they can handle both image and text inputs. The feature-extraction pipeline, however, is designed specifically for text-only models. Because the CLIP model expects both kinds of input, it complains about the missing pixel_values, which is the error you're seeing.

The current structure of the feature-extraction pipeline doesn't account for models that also expect an image input. So while CLIP and similar models are multimodal, using them for text-only feature extraction through the pipeline isn't possible right now.

I'll look into supporting these kinds of models in the near future and will reference this issue if it becomes part of the library. In the meantime, I recommend exploring smaller models trained specifically for text feature extraction for your application.

I hope this helps! Let me know if you have any other questions.
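
For anyone landing here later, a minimal sketch of that suggestion, assuming the same pipeline() helper used in the snippet above ('Xenova/all-MiniLM-L6-v2' is just one example of a text-only embedding checkpoint, not one prescribed in this thread):

<?php

use function Codewithkyrian\Transformers\Pipelines\pipeline;

// A text-only sentence-embedding model works with the feature-extraction
// pipeline as-is, since there is no image branch to trip over.
$extractor = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
$embeddings = $extractor('A man with a hat');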

@karlomikus
Author

Thanks for the response, that makes sense. Looking forward to more features in the future.

Here's what I had so far, which kinda "works" if someone stumbles upon this.

// Model config needs the "processor_class" key removed so it falls back to the default image processor
$modelName = 'Xenova/siglip-base-patch16-224';
$config = AutoConfig::fromPretrained($modelName);
$image = Image::read('/var/www/app/src/test.jpg');

// Text branch: tokenize the query and run it through the SigLIP text tower
$textModel = SiglipTextModel::fromPretrained($modelName, false, $config);
$textTokenizer = PretrainedTokenizer::fromPretrained($modelName);
$textInputs = $textTokenizer('A man with a hat', padding: true, truncation: true);
$textOutputs = $textModel($textInputs);
$textEmbeddings = $textOutputs["last_hidden_state"] ?? $textOutputs["logits"];

// Vision branch: preprocess the image and run it through the SigLIP vision tower
$visionModel = SiglipVisionModel::fromPretrained($modelName, false, $config);
$visionProcessor = AutoProcessor::fromPretrained($modelName);
$visionInputs = $visionProcessor($image);
$visionOutput = $visionModel($visionInputs);
$visionEmbeddings = $visionOutput['last_hidden_state']->toArray();
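
Building on the snippet above: last_hidden_state is a per-token tensor (roughly [batch, tokens, hidden]), so getting a single fixed-length vector per text or image still needs a pooling step. Below is a rough plain-PHP sketch using mean pooling over the tower's hidden states. Note that CLIP- and SigLIP-style models normally apply their own pooling/projection heads to map both towers into the shared embedding space, so treat the comparability of these pooled vectors as an assumption, not a guarantee.

<?php

// Assumes the variables from the snippet above, and that ->toArray() on the
// last_hidden_state tensor yields a nested [batch][tokens][hidden] array.

function meanPool(array $tokenVectors): array
{
    $hidden = count($tokenVectors[0]);
    $sum = array_fill(0, $hidden, 0.0);
    foreach ($tokenVectors as $vector) {
        for ($i = 0; $i < $hidden; $i++) {
            $sum[$i] += $vector[$i];
        }
    }
    $count = count($tokenVectors);
    return array_map(fn ($value) => $value / $count, $sum);
}

function l2Normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn ($value) => $value * $value, $vector)));
    return $norm > 0.0 ? array_map(fn ($value) => $value / $norm, $vector) : $vector;
}

// One pooled, normalized vector per modality (first item in the batch).
$textVector = l2Normalize(meanPool($textEmbeddings->toArray()[0]));
$imageVector = l2Normalize(meanPool($visionEmbeddings[0]));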

@CodeWithKyrian
Owner

Great! This is a perfect use case for using the models directly. The pipelines are just there to make these steps easy, albeit at the cost of some flexibility.

@BillyGeat

BillyGeat commented Jan 31, 2025

It would be great if a (near) future update of TransformersPHP supported this two-way CLIP functionality.

In my case, I'm trying to get a CLIP (CLS) vector from a text like "A cat playing with a ball" so I can search a database full of such vectors (using cosine similarity, for example), and I'm hitting exactly this undefined array key "pixel_values" error.

My code so far to get the text vector:

require './vendor/autoload.php';
use Codewithkyrian\Transformers\Models\Auto\AutoModel;
use Codewithkyrian\Transformers\PreTrainedTokenizers\CLIPTokenizer;

$model = AutoModel::fromPretrained('Xenova/clip-vit-base-patch32');
$tokenizer = CLIPTokenizer::fromPretrained('Xenova/clip-vit-base-patch32');

$inputs = 'A cat playing with a ball';
$encodedInput = $tokenizer($inputs, padding: true, truncation: true);

// Pass the tokenized text through the CLIP model
$outputs = $model->forward($encodedInput);

// Extract the last hidden state of the transformer
$lastHiddenState = $outputs['last_hidden_state'];

// The CLS token vector is the first element of the last hidden state
$textEmbedding = $lastHiddenState[0][0];
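
A couple of caveats on this sketch: Hugging Face's CLIP text encoder takes its pooled sentence embedding from the end-of-sequence token rather than from the token at position 0, so which slice of last_hidden_state to use is worth double-checking against the exported model. For the cosine-similarity search itself, a small plain-PHP helper (the function name below is just illustrative) might look like this:

<?php

// Cosine similarity between two equal-length embedding arrays.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Example usage, assuming ->toArray() flattens the tensor slice from above
// and $storedImageVector comes from your database:
// $score = cosineSimilarity($textEmbedding->toArray(), $storedImageVector);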

@BillyGeat

BillyGeat commented Jan 31, 2025

Here's what I had so far, which kinda "works" if someone stumbles upon this.

// Model config needs "processor_class" key removed to fallback to default image processor
$modelName = 'Xenova/siglip-base-patch16-224';
[...]

@karlomikus
Have you managed to get equal-length vectors (768, for example) for both text and image so you can search via cosine similarity? Did this work? I used your code (thank you for showing "the right way"!), but I can't manage to get the right (CLS) vectors out of them...
