Using same model for vision/text embeddings #31


Closed

karlomikus opened this issue May 7, 2024 · 5 comments
Labels
question (Further information is requested)

Comments

@karlomikus

Your question

Hello,

Is it possible to use the same model to generate both vision and text embeddings? It seems like models such as CLIP and SigLIP should support this, but using a pipeline like this:

<?php

use function Codewithkyrian\Transformers\Pipelines\pipeline;

$modelName = 'Xenova/clip-vit-base-patch32';

// Feature-extraction pipeline pointed at a CLIP checkpoint
$extractor = pipeline('feature-extraction', $modelName);
$embeddings = $extractor('A man with a hat');

returns an error: Warning: Undefined array key "pixel_values" in /var/www/app/vendor/codewithkyrian/transformers/src/Models/ModelArchitecture.php on line 86

Excellent package btw.


karlomikus added the question (Further information is requested) label on May 7, 2024
@CodeWithKyrian
Owner

Oh, thanks for the nice words @karlomikus. I understand why you'd want to use the same model to generate both vision and text embeddings, especially with models like CLIP and SigLIP. However, let me clarify a few things about how these models and pipelines work in TransformersPHP.

CLIP and similar models are multimodal, which means they can handle both image and text inputs. The feature-extraction pipeline, however, is designed specifically for text-only models. Because the CLIP model expects both kinds of input, it complains about the missing pixel_values, which is the error you're seeing.

The current structure of the feature-extraction pipeline doesn't account for models that also expect an image input. So while CLIP and similar models are multimodal, using them for text-only feature extraction through the pipeline isn't possible right now.

I'll look into supporting these kinds of models in the near future and will reference this issue if it becomes part of the library. In the meantime, I recommend exploring smaller models trained specifically for text feature extraction for your application.

I hope this helps! Let me know if you have any other questions.
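
For anyone landing here later, a minimal sketch of that suggestion, assuming the same pipeline() helper used in the snippet above ('Xenova/all-MiniLM-L6-v2' is just one example of a text-only embedding checkpoint, not one prescribed in this thread):

<?php

use function Codewithkyrian\Transformers\Pipelines\pipeline;

// A text-only sentence-embedding model works with the feature-extraction
// pipeline as-is, since there is no image branch to trip over.
$extractor = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
$embeddings = $extractor('A man with a hat');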

@karlomikus
Author

Thanks for the response, that makes sense. Looking forward to more features in the future.

Here's what I had so far, which kinda "works" if someone stumbles upon this.

// Model config needs the "processor_class" key removed so it falls back to the default image processor
$modelName = 'Xenova/siglip-base-patch16-224';
$config = AutoConfig::fromPretrained($modelName);
$image = Image::read('/var/www/app/src/test.jpg');

// Text branch: tokenize the query and run it through the SigLIP text tower
$textModel = SiglipTextModel::fromPretrained($modelName, false, $config);
$textTokenizer = PretrainedTokenizer::fromPretrained($modelName);
$textInputs = $textTokenizer('A man with a hat', padding: true, truncation: true);
$textOutputs = $textModel($textInputs);
$textEmbeddings = $textOutputs["last_hidden_state"] ?? $textOutputs["logits"];

// Vision branch: preprocess the image and run it through the SigLIP vision tower
$visionModel = SiglipVisionModel::fromPretrained($modelName, false, $config);
$visionProcessor = AutoProcessor::fromPretrained($modelName);
$visionInputs = $visionProcessor($image);
$visionOutput = $visionModel($visionInputs);
$visionEmbeddings = $visionOutput['last_hidden_state']->toArray();
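
Building on the snippet above: last_hidden_state is a per-token tensor (roughly [batch, tokens, hidden]), so getting a single fixed-length vector per text or image still needs a pooling step. Below is a rough plain-PHP sketch using mean pooling over the tower's hidden states. Note that CLIP- and SigLIP-style models normally apply their own pooling/projection heads to map both towers into the shared embedding space, so treat the comparability of these pooled vectors as an assumption, not a guarantee.

<?php

// Assumes the variables from the snippet above, and that ->toArray() on the
// last_hidden_state tensor yields a nested [batch][tokens][hidden] array.

function meanPool(array $tokenVectors): array
{
    $hidden = count($tokenVectors[0]);
    $sum = array_fill(0, $hidden, 0.0);
    foreach ($tokenVectors as $vector) {
        for ($i = 0; $i < $hidden; $i++) {
            $sum[$i] += $vector[$i];
        }
    }
    $count = count($tokenVectors);
    return array_map(fn ($value) => $value / $count, $sum);
}

function l2Normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn ($value) => $value * $value, $vector)));
    return $norm > 0.0 ? array_map(fn ($value) => $value / $norm, $vector) : $vector;
}

// One pooled, normalized vector per modality (first item in the batch).
$textVector = l2Normalize(meanPool($textEmbeddings->toArray()[0]));
$imageVector = l2Normalize(meanPool($visionEmbeddings[0]));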

@CodeWithKyrian
Owner

Great! This is a perfect use case for using the models directly. The pipelines are just there to make these steps easy, albeit at the cost of some flexibility.

@BillyGeat

BillyGeat commented Jan 31, 2025

It would be great if a (near) future update of TransformersPHP supported this two-way CLIP functionality.

In my case, I'm trying to get a CLIP (CLS) vector from a text like "A cat playing with a ball" so I can search a database full of such vectors (using cosine similarity, for example), and I'm hitting exactly this undefined array key "pixel_values" error.

My code so far to get the text vector:

require './vendor/autoload.php';
use Codewithkyrian\Transformers\Models\Auto\AutoModel;
use Codewithkyrian\Transformers\PreTrainedTokenizers\CLIPTokenizer;

$model = AutoModel::fromPretrained('Xenova/clip-vit-base-patch32');
$tokenizer = CLIPTokenizer::fromPretrained('Xenova/clip-vit-base-patch32');

$inputs = 'A cat playing with a ball';
$encodedInput = $tokenizer($inputs, padding: true, truncation: true);

// Pass the tokenized text through the CLIP model
$outputs = $model->forward($encodedInput);

// Extract the last hidden state of the transformer
$lastHiddenState = $outputs['last_hidden_state'];

// The CLS token vector is the first element of the last hidden state
$textEmbedding = $lastHiddenState[0][0];
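
A couple of caveats on this sketch: Hugging Face's CLIP text encoder takes its pooled sentence embedding from the end-of-sequence token rather than from the token at position 0, so which slice of last_hidden_state to use is worth double-checking against the exported model. For the cosine-similarity search itself, a small plain-PHP helper (the function name below is just illustrative) might look like this:

<?php

// Cosine similarity between two equal-length embedding arrays.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Example usage, assuming ->toArray() flattens the tensor slice from above
// and $storedImageVector comes from your database:
// $score = cosineSimilarity($textEmbedding->toArray(), $storedImageVector);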

@BillyGeat

BillyGeat commented Jan 31, 2025

Here's what I had so far, which kinda "works" if someone stumbles upon this.

// Model config needs "processor_class" key removed to fallback to default image processor
$modelName = 'Xenova/siglip-base-patch16-224';
[...]

@karlomikus
Have you managed to get equal-length vectors (768, for example) for both text and image so you can search via cosine similarity? Did this work? I used your code (thank you for showing "the right way"!), but I can't manage to get the right (CLS) vectors out of them...
