TSNE vis: update the model & embeddings #102

bzz · 2023-04-30T17:01:05Z

Some improvements in visualisation relevant to #62.

It's using the 'all-MiniLM-L6-v2' model (87 MB instead of 418 MB, the same 512 context size), which is faster & seems to provide a better visualisation.

Before: the previous model

The previous model using both abstracts and titles.

The new model (batch size 1, abstracts only)

After: the new model + titles + batched

I tried UMAP for it and the results seem less interesting (but I didn't experiment much).

UMAP

I also tried using a larger 420 MB model fine-tuned on scientific papers from SciRepEval: allenai/specter2 with the proximity adapter, which takes 1.5 min vs. 30 sec for the above. It can't be switched through the CLI only, as it requires loading an adapter.

specter2 TSNE

Let me know what you think and which one you prefer!

Use a smaller model that is fast and provides better quality, 'all-MiniLM-L6-v2' from https://www.sbert.net/docs/pretrained_models.html
Use title as well as abstract for paper embeddings.
Encode & avg. in batches.
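The "encode & avg. in batches" step could look roughly like the sketch below. This is a minimal illustration, not the code from this PR: `toy_encode` is a hypothetical stand-in for a real sentence-embedding model, and `batched`, `embed_papers`, and the `batch_size` default are names assumed here for the example.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]
Encoder = Callable[[Sequence[str]], list[Vector]]


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_papers(papers: Sequence[dict], encode: Encoder, batch_size: int = 8) -> list[Vector]:
    """Embed each paper as the element-wise mean of its title and abstract vectors."""
    embeddings: list[Vector] = []
    for batch in batched(papers, batch_size):
        # Encode titles and abstracts of the current batch separately.
        title_vecs = encode([p["title"] for p in batch])
        abstract_vecs = encode([p["abstract"] for p in batch])
        for t, a in zip(title_vecs, abstract_vecs):
            embeddings.append([(x + y) / 2 for x, y in zip(t, a)])
    return embeddings


def toy_encode(texts: Sequence[str]) -> list[Vector]:
    """Hypothetical stand-in encoder: maps each text to [char count, word count]."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


# Made-up papers, for illustration only.
papers = [
    {"title": "Code search", "abstract": "We study neural code search."},
    {"title": "Type inference", "abstract": "We predict types."},
    {"title": "Bug repair", "abstract": "We fix bugs automatically."},
]
vectors = embed_papers(papers, toy_encode, batch_size=2)
```

With a real model, `toy_encode` would be replaced by the model's own batch-encoding call; the batch size then bounds how many texts are held in memory per forward pass.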

mallamanis · 2023-05-02T20:24:49Z

Thanks for looking into this @bzz! All these options seem quite interesting, yet it's hard to decide without looking at what each point represents.

May I suggest that you sample 3-4 papers that you know and try to see which of the visualizations gets "reasonable" neighbors? I'd be happy to go with whatever you find more useful, in that sense 👓
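One way to run such a spot check is to list the closest papers for a few known titles. The sketch below is only an illustration under assumptions: it ranks neighbours by cosine similarity in the embedding space (checking the 2-D plot itself would use the projected coordinates instead), and the titles, vectors, and the `nearest_neighbors` helper are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, embeddings, k=3):
    """Return the k titles most similar to `query`, excluding the query itself."""
    scored = [
        (cosine(embeddings[query], vec), title)
        for title, vec in embeddings.items()
        if title != query
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]


# Made-up 2-D vectors, for illustration only.
toy_embeddings = {
    "Neural code completion": [0.9, 0.1],
    "Completion with transformers": [0.8, 0.2],
    "Bug localization": [0.1, 0.9],
    "Automatic program repair": [0.2, 0.8],
}
for paper in ["Neural code completion", "Bug localization"]:
    print(paper, "->", nearest_neighbors(paper, toy_embeddings, k=1))
```

Running the same check against each candidate model's embeddings gives a quick, if subjective, way to compare them on papers whose neighbours are known in advance.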

bzz · 2023-05-03T22:00:05Z

Indeed, that was exactly what I did; I apologize for missing this crucial information :)

Here are interactive visualisations of embeddings + metadata for all 4 models, which should be easy to explore, sorted by my subjective perception of "cluster quality":

T-SNE GPT-3, model (text-embedding-ada-002)
T-SNE sentence-transformers/all-MiniLM-L6-v2, model (new model + titles + batched)
T-SNE allenai/specter2, model (specter2)
T-SNE deepset/sentence_bert, model (the previous model)

The best results were achieved with these T-SNE hyperparameters: perplexity 10-20, learning rate 0.01, 1-2k steps.
It's also better to switch Label By to Title.

Clearly identifiable clusters

Types
Completion
Search
Summarization
Bugs (localization, fix/repair)
(OpenAI) Commit messages, Decompilation/Obfuscation, ...

Let me know what you think!

mallamanis · 2023-05-11T20:52:08Z

This looks great! Thanks a lot for this 💯

mallamanis · 2023-05-14T21:35:44Z

Hi @bzz, it seems that the Action fails with this change. I suspect this is due to some restriction on GitHub Actions (memory?). Do you maybe have time to investigate?

https://github.com/ml4code/ml4code.github.io/actions/runs/4952439871

For now, I've reverted this PR in #103

bzz · 2023-05-15T20:56:39Z

Oh, my! From a quick glance, it may also have to do with some changes in the CI runner image (actions/runner-images#7188) 🤔

I'll set up CI on my fork to try to reproduce and report back.

bzz · 2023-05-28T11:31:11Z

> I suspect this is due to some restriction on GitHub Actions (memory?)

You are right: I missed that the CI had failed repeatedly, and it's easily reproducible.

Here is the RAM profile across different batch sizes (1 to 512, doubling every ~20 sec) 🙄

So the VM gets OOM-killed after exceeding the 7 GB RAM limit.
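A profile like this can be approximated locally by doubling the batch size and recording peak memory at each step. The sketch below is an assumption-laden illustration, not the script used for the plot above: `toy_encode_batch` is a hypothetical workload standing in for the real encoder, and `tracemalloc` only sees allocations made through Python's allocator, so native memory used by a deep-learning framework would have to be measured another way (for example, process RSS).

```python
import tracemalloc


def peak_bytes_for(batch_size, work):
    """Run `work(batch_size)` and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        work(batch_size)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def toy_encode_batch(batch_size, dim=384, seq_len=8):
    """Hypothetical workload whose memory use grows linearly with batch size."""
    return [[0.0] * dim for _ in range(batch_size * seq_len)]


batch_sizes = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
profile = {bs: peak_bytes_for(bs, toy_encode_batch) for bs in batch_sizes}
for bs, peak in profile.items():
    print(f"batch size {bs:>3}: peak {peak / 2**20:.2f} MiB")
```

Comparing the peaks against the runner's memory limit suggests the largest batch size that is still safe to use in CI.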

tsne vis: change the model & embeddings

e67bf6a

Use a smaller model that is fast and provides better quality, 'all-MiniLM-L6-v2' from https://www.sbert.net/docs/pretrained_models.html
Use title as well as abstract for paper embeddings.
Encode & avg. in batches.

mallamanis merged commit edb4eb5 into ml4code:source May 11, 2023

mallamanis mentioned this pull request May 14, 2023

Revert "TSNE vis: update the model & embeddings" #103

Merged

bzz deleted the source-new-emb branch May 28, 2023 11:31

bzz mentioned this pull request May 28, 2023

TSNE vis: update the model & embeddings (small batch) #104

Merged
