How to extract encoder representation? #1489

colinator · 2023-11-14T23:26:57Z

colinator
Nov 14, 2023

After running audio through the model, I would like to extract the representation of the final encoder output. I'm curious as to whether this representation would contain enough information to perform transfer learning, to detect other things (maybe sentiment or something).

Anyway, not sure how to do it. What I was thinking:

I have the whisper_context
From that, get the whisper_state (this isn't yet exposed)
From that, get "embd_enc", which is declared as:

// result of the encoder
struct ggml_tensor * embd_conv = nullptr;
struct ggml_tensor * embd_enc = nullptr;

From embd_enc, I can get the shape. In my case (using the "base" model), it's (512, 1500).
From embd_enc, I can get the actual data: "void * data;", and static_cast it to float*.

I don't think it's working. I can do this, but then if I visualize it like an image, it shows almost no change over time. I suspect that "void* data" isn't actually the data. Could it be coming instead from "struct ggml_backend_buffer * buffer;"?
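One thing worth ruling out when the picture looks flat is the indexing of the copied buffer. The sketch below assumes the tensor is a contiguous float tensor with the shape (512, 1500) mentioned above, and that dimension 0 (the 512 features) is the contiguous one, as is the convention in ggml. `EncoderView` is a hypothetical helper written for this illustration, not something whisper.cpp provides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of whisper.cpp): a read-only view over a flat
// float buffer copied out of a 2-D tensor with shape (n_state, n_ctx).
// Assumes dimension 0 (n_state) is contiguous in memory.
struct EncoderView {
    const std::vector<float> & data;
    std::size_t n_state; // 512 for the "base" model
    std::size_t n_ctx;   // 1500 time steps

    // Feature i of time step t.
    float at(std::size_t t, std::size_t i) const {
        return data[t * n_state + i];
    }

    // Copy of the full feature vector for time step t.
    std::vector<float> frame(std::size_t t) const {
        auto first = data.begin() + static_cast<std::ptrdiff_t>(t * n_state);
        return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(n_state));
    }
};
```

With a view like `EncoderView v{buf, 512, 1500};`, `v.frame(t)` would give the 512 features for time step `t`. Reading the buffer with the two dimensions swapped would smear features across time in any plot.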

Edit: nope, "embd_enc" is not the final encoder layer output, it's the output of mel->conv->mlp before it hits the encoder stack. And "void * data" is the weights. I believe.

Edit2: er, now I'm not so sure. At the end of whisper_build_graph_encoder we see "wstate.embd_enc = cur;" where 'cur' is the final tensor output of the encoder_layers. So maybe it is the encoder stack output, not the audio encoder output?

ggerganov · 2023-11-16T08:05:00Z

ggerganov
Nov 16, 2023
Maintainer

I suspect that "void* data" isn't actually the data.

Yes, you need to extract the data with:

https://github.com/ggerganov/whisper.cpp/blob/ccc85b4ff8d250d0f25ebcac2be0e4a23401c885/ggml-backend.h#L47

Something like (not tested):

std::vector<float> my_buf(ggml_nelements(embd_enc));
ggml_backend_tensor_get(embd_enc, my_buf.data(), 0, ggml_nbytes(embd_enc));

So maybe it is the encoder stack output, not the audio encoder output?

It is the encoder stack output with normalization applied to it:

https://github.com/ggerganov/whisper.cpp/blob/ccc85b4ff8d250d0f25ebcac2be0e4a23401c885/whisper.cpp#L2009-L2023

This is what comes out of the Encoder and is used as input to the cross-attention in the Decoder.

4 replies

colinator Nov 16, 2023
Author

Ah, many thanks. A few things:

The output, after the ln_f_g*cur + ln_f_b step, is no longer normalized between 0 and 1. Rough values as I see them are something like min element=-17.4844, max element=14.6094.
I tried just casting void * data to float *, and that gave me the exact same result as ggml_backend_tensor_get().
Sometimes when I run, the first time through the backend is not set, and I get an assertion failure. It's rare though. Weird.
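For context on the value range: a layer-norm step followed by a gain and a bias does not bound values to any fixed interval. It centers each feature vector, scales it to unit variance, and then applies a learned per-feature scale and shift. Below is a minimal standalone sketch of that operation, not the ggml implementation. The function name is hypothetical and `eps = 1e-5f` is an assumed value.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Layer normalization of one feature vector x, followed by an element-wise
// affine transform with gain g and bias b (the role ln_f_g and ln_f_b play).
// eps = 1e-5f is an assumed value for this sketch.
std::vector<float> layer_norm_affine(const std::vector<float> & x,
                                     const std::vector<float> & g,
                                     const std::vector<float> & b,
                                     float eps = 1e-5f) {
    const std::size_t n = x.size();
    double mean = 0.0;
    for (float v : x) mean += v;
    mean /= static_cast<double>(n);

    double var = 0.0;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<double>(n);

    const double inv_std = 1.0 / std::sqrt(var + eps);
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = static_cast<float>((x[i] - mean) * inv_std * g[i] + b[i]);
    }
    return y;
}
```

With unit gain and zero bias the output has mean close to 0 and both positive and negative entries. A large learned gain stretches it further, so a range like -17 to 14 is plausible for such a step.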

Figured I'd start with visualization... Want to see the result? This is the 512x1500 embedding normalized and converted to 0-255 uint8, and rendered in grayscale. Time goes to the right. I talked in English into a microphone for about 10 seconds, so roughly the first third has embeddings that look like, well, something I guess. You can distinctly see the positional encoding coming through to the last encoder output.
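A rendering like the one described could be produced roughly as follows. This is a sketch under assumptions, not the code used for the image: it assumes a flat float buffer with the 512 features contiguous per time step, uses global min-max scaling, and writes a binary PGM file because that format needs no image library. `to_gray` and `write_pgm` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Min-max scale a (n_state, n_ctx) buffer to 0..255 and transpose it so that
// each image row is one feature and each image column is one time step.
// Returns an all-zero image if the buffer size does not match the shape.
std::vector<std::uint8_t> to_gray(const std::vector<float> & buf,
                                  std::size_t n_state, std::size_t n_ctx) {
    std::vector<std::uint8_t> img(n_state * n_ctx, 0);
    if (buf.empty() || buf.size() != n_state * n_ctx) return img;
    const auto mm = std::minmax_element(buf.begin(), buf.end());
    const float lo = *mm.first;
    const float range = *mm.second - lo;
    for (std::size_t t = 0; t < n_ctx; ++t) {
        for (std::size_t i = 0; i < n_state; ++i) {
            const float v = buf[t * n_state + i];
            const float s = range > 0.0f ? (v - lo) / range : 0.0f;
            img[i * n_ctx + t] = static_cast<std::uint8_t>(s * 255.0f + 0.5f);
        }
    }
    return img;
}

// Write a binary PGM (P5) file; width = n_ctx, height = n_state.
bool write_pgm(const std::string & path, const std::vector<std::uint8_t> & img,
               std::size_t width, std::size_t height) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out << "P5\n" << width << " " << height << "\n255\n";
    out.write(reinterpret_cast<const char *>(img.data()),
              static_cast<std::streamsize>(img.size()));
    return static_cast<bool>(out);
}
```

Usage would be along the lines of `write_pgm("enc.pgm", to_gray(buf, 512, 1500), 1500, 512);`. Global min-max scaling lets a few extreme values compress the rest of the image, so per-feature scaling may show more structure.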

ggerganov Nov 16, 2023
Maintainer

Very interesting stuff! We can probably build tools for extracting this kind of data from the model and its processing steps.

I tried just casting void * data to float *, and that gave me the exact same result as ggml_backend_tensor_get()

Yes, it works at the moment, but will likely stop working in the future when we make some changes to the Metal backend.
Using ggml_backend_tensor_get() would also work with the CUDA backend.

Sometimes when I run, the first time through the backend is not set, and I get an assertion failure. It's rare though. Weird.

The backend member should always be set. Likely this is a side-effect from your modifications, unless you observe this on master too?

colinator Nov 16, 2023
Author

Hm - I'm not making any changes to ggml - I only add this to whisper.cpp:

struct ggml_tensor * whisper_embd_enc(struct whisper_context * ctx) {
    return ctx->state->embd_enc;
}

and then

ggml_tensor * tensor = whisper_embd_enc(ctx);
std::vector<float> tensor_data(ggml_nelements(tensor));
ggml_backend_tensor_get(tensor, tensor_data.data(), 0, ggml_nbytes(tensor));

I'm pretty sure that none of my modifications could have this side-effect. I synced against master a couple of days ago. Roughly one run in ten, I get whisper.cpp/ggml-backend.c:137: backend != NULL && "tensor backend not set" on the first pass through the model. Without the call to ggml_backend_tensor_get(), it doesn't crash though - it all works fine. So I guess the backend isn't needed? Or maybe there's some predicate about data length or something. Or maybe a race condition? I'll try to dig a bit.

colinator Nov 16, 2023
Author

And a thought - it'd be cool if I could just run the encoder, not the decoder. Furthermore, the encoder doesn't actually need 30 seconds of data, right? I'm a bit unclear on that bit. Can I just give it 25 ms of data at a time?
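The time axis of the 1500-column output can be sanity-checked with a little arithmetic. The sketch below does not settle whether the encoder can be run on shorter input. It only relates seconds of audio to encoder positions, assuming 16 kHz audio, a hop of 160 samples per mel frame, and a stride-2 convolution in front of the encoder stack. The constant and function names are hypothetical.

```cpp
#include <cstddef>

// Values mirrored from Whisper's audio front end; treated as assumptions here.
constexpr std::size_t kSampleRate = 16000; // samples per second
constexpr std::size_t kHopLength  = 160;   // samples per mel frame (10 ms)
constexpr std::size_t kConvStride = 2;     // mel frames per encoder position

// Number of encoder positions covered by `seconds` of audio.
constexpr std::size_t encoder_positions(std::size_t seconds) {
    return seconds * kSampleRate / kHopLength / kConvStride;
}

// Milliseconds of audio represented by one encoder position.
constexpr std::size_t ms_per_position() {
    return 1000 * kHopLength * kConvStride / kSampleRate;
}
```

Under these assumptions a 30-second window maps to 1500 positions of 20 ms each, and 10 seconds of speech covers the first 500 positions, which matches the "first third" seen in the visualization.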
