-
Yes, you need to copy the data out of the backend buffer rather than reading "void * data" directly. Something like this (not tested):
std::vector<float> my_buf(ggml_nelements(embd_enc));
ggml_backend_tensor_get(embd_enc, my_buf.data(), 0, ggml_nbytes(embd_enc));
It is the encoder stack output with normalization applied to it. This is what comes out of the encoder and is used as input for the cross-attention in the decoder.
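Once the values are in a host buffer, indexing them is plain arithmetic. Here is a minimal sketch, not tested against whisper.cpp itself. It assumes the tensor is contiguous F32 with ne[0] = 512 (the feature dimension, fastest-varying) and ne[1] = 1500 (time frames), as reported for the "base" model. The helper names `enc_at` and `enc_mean_pool` are made up for illustration and are not part of ggml or whisper.cpp. Mean pooling over time is just one common way to get a fixed-size vector for a downstream classifier, not something the library prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: buf was filled by ggml_backend_tensor_get from a contiguous F32
// tensor with ne[0] = n_state (fastest-varying) and ne[1] = n_ctx.

// Hypothetical helper: value of feature i at time frame t.
inline float enc_at(const std::vector<float> & buf, std::size_t n_state,
                    std::size_t i, std::size_t t) {
    assert(i < n_state);
    assert((t + 1) * n_state <= buf.size());
    return buf[t * n_state + i];
}

// Hypothetical helper: average all frames into a single n_state-sized vector.
std::vector<float> enc_mean_pool(const std::vector<float> & buf,
                                 std::size_t n_state, std::size_t n_ctx) {
    assert(n_ctx > 0);
    assert(buf.size() == n_state * n_ctx);
    std::vector<float> out(n_state, 0.0f);
    for (std::size_t t = 0; t < n_ctx; ++t) {
        for (std::size_t i = 0; i < n_state; ++i) {
            out[i] += buf[t * n_state + i];
        }
    }
    for (float & v : out) {
        v /= static_cast<float>(n_ctx);
    }
    return out;
}
```

With the buffer from above, `enc_at(my_buf, 512, i, t)` reads one value and `enc_mean_pool(my_buf, 512, 1500)` gives a 512-element summary. If you plot the buffer as an image, remember that each run of 512 consecutive floats is one time frame, so mixing up the two axes will produce a scrambled picture.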
-
After running audio through the model, I would like to extract the representation of the final encoder output. I'm curious as to whether this representation would contain enough information to perform transfer learning, to detect other things (maybe sentiment or something).
Anyway, not sure how to do it. What I was thinking:
I have the whisper_context
From that, get the whisper_state (this isn't yet exposed)
From that, get "embd_enc", which is
// result of the encoder
struct ggml_tensor * embd_conv = nullptr;
struct ggml_tensor * embd_enc = nullptr;
From embd_enc, I can get the shape. In my case (using the "base" model), it's (512, 1500)
From embd_enc, I can get the actual data: "void * data;", and static_cast it to float*.
I don't think it's working. I can do this, but then if I visualize it like an image, it shows almost no change over time. I suspect that "void* data" isn't actually the data. Could it be coming instead from "struct ggml_backend_buffer * buffer;"?
Edit: nope, "embd_enc" is not the final encoder layer output, it's the output of mel->conv->mlp before it hits the encoder stack. And "void * data" is the weights. I believe.
Edit2: er, now I'm not so sure. At the end of whisper_build_graph_encoder we see "wstate.embd_enc = cur;" where 'cur' is the final tensor output of the encoder_layers. So maybe it is the encoder stack output, not the audio encoder output?