
support new marian models #15831


Merged: 19 commits, Mar 10, 2022

Conversation

patil-suraj (Contributor)

What does this PR do?

This PR updates the Marian model to:

  1. Allow not sharing the embeddings between the encoder and decoder.
  2. Allow tying only the decoder embeddings with the lm_head.
  3. Support two separate vocabularies in the tokenizer, one for the source and one for the target language.

To support this, the PR introduces the following new methods:

  • get_decoder_input_embeddings and set_decoder_input_embeddings
    To get and set the decoder embeddings when the embeddings are not shared. These methods raise an error if the embeddings are shared.
  • resize_decoder_token_embeddings
    To resize only the decoder embeddings. Raises an error if the embeddings are shared.

This PR also adds two new config attributes to MarianConfig:

  • share_encoder_decoder_embeddings: indicates whether the encoder and decoder embeddings should be shared.
  • decoder_vocab_size: specifies the decoder vocabulary size when the embeddings are not shared.

To support these changes, the following methods from the PreTrainedModel class are overridden (see the usage sketch below):

  • tie_weights
  • _resize_token_embeddings

Fixes #15109
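
A minimal usage sketch of the new configuration attributes and decoder-embedding methods described above. This is not code from the PR; the vocabulary sizes are made up for illustration, and the exact values depend on the tokenizer:

from transformers import MarianConfig, MarianMTModel

# Hypothetical vocabulary sizes, for illustration only.
config = MarianConfig(
    vocab_size=60000,                        # encoder (source) vocabulary
    decoder_vocab_size=59000,                # separate decoder (target) vocabulary
    share_encoder_decoder_embeddings=False,  # keep encoder and decoder embeddings separate
)
model = MarianMTModel(config)

# These accessors only work when the embeddings are not shared;
# with shared embeddings they raise an error, as described above.
decoder_embeddings = model.get_decoder_input_embeddings()
model.resize_decoder_token_embeddings(59500)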

Comment on lines +1285 to +1293
# if word embeddings are not tied, make sure that lm head is resized as well
if (
    self.config.share_encoder_decoder_embeddings
    and self.get_output_embeddings() is not None
    and not self.config.tie_word_embeddings
):
    old_lm_head = self.get_output_embeddings()
    new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
    self.set_output_embeddings(new_lm_head)
patil-suraj (Contributor, Author)

This will only resize the lm_head if the encoder and decoder embeddings are shared.

Comment on lines +1353 to +1355
# if embeddings are shared this will return shared embeddings otherwise decoder embed_tokens
word_embeddings = self.get_decoder().get_input_embeddings()
self._tie_or_clone_weights(output_embeddings, word_embeddings)
patil-suraj (Contributor, Author)

We always return the decoder embeddings here. This works for both cases, shared or not shared.

@HuggingFaceDocBuilder

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.


def get_decoder_input_embeddings(self):
    if self.config.share_encoder_decoder_embeddings:
        raise ValueError(
Contributor

Why raise an error here? It's totally fine to just return self.get_input_embeddings() in this case, no?

Contributor

Still don't think we need to raise here ;-)
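
For clarity, a rough sketch of the alternative the reviewer is suggesting (not what was merged): instead of raising when the embeddings are shared, the method could fall back to the shared embeddings. Written as a standalone function over a model purely for illustration:

def get_decoder_input_embeddings(model):
    # Suggested behavior: no error in the shared case.
    if model.config.share_encoder_decoder_embeddings:
        # Shared case: the decoder uses the same matrix as the encoder,
        # so simply return the shared input embeddings.
        return model.get_input_embeddings()
    # Separate case: return the decoder's own embedding table.
    return model.get_decoder().get_input_embeddings()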

patrickvonplaten (Contributor) left a comment

Overall, I'm in favor of adding the new Marian checkpoints the way it is shown here. The change from a Marian model that always force-ties the encoder and decoder embeddings to one that can switch between force-tied and untied encoder input embeddings and output embeddings is the better option here IMO, even though it goes a bit against our philosophy of not changing existing model code.

The main reasons why I'm in favor of the approach as it's implemented now are (with the feedback given below):

  • All the changes in this PR are also applicable to existing Marian V1 checkpoints. More specifically, all Marian V1 checkpoints can be loaded here with share_encoder_decoder_embeddings=False and then fine-tuned with the embeddings not being tied (see the sketch below).
  • Marian V2 comes from the exact same library as Marian V1 and is the same model. Creating a new name here (Marian V2) could confuse users.
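
A rough sketch of that first point, assuming the new config attribute can be overridden at load time through from_pretrained; the checkpoint name is an existing Marian V1 model used only as an example:

from transformers import MarianMTModel

# Load an existing Marian V1 checkpoint with embedding sharing turned off,
# so the encoder and decoder embeddings can then be fine-tuned separately.
model = MarianMTModel.from_pretrained(
    "Helsinki-NLP/opus-mt-en-de",
    share_encoder_decoder_embeddings=False,
)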

Thoughts @LysandreJik @sgugger ?

sgugger (Collaborator) left a comment

Ok for me. It's really pushing the test for a new model to its limit, but I understand the arguments to keep it in the same model.

patil-suraj changed the title from "[WiP] support new marian models" to "support new marian models" on Mar 10, 2022
patrickvonplaten (Contributor) left a comment

Looks good to me in general.
Left a couple of comments.

Also, given that a bunch of new model checkpoints will be added here, let's maybe add a slow integration test as well?
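
For reference, such a slow integration test might look roughly like the skeleton below. The checkpoint name is a placeholder (an existing Marian model rather than one of the new checkpoints), and no exact translation output is asserted since that depends on the checkpoint:

import unittest

import torch

from transformers import MarianMTModel, MarianTokenizer
from transformers.testing_utils import require_torch, slow


@require_torch
class NewMarianIntegrationTest(unittest.TestCase):
    @slow
    def test_forward_and_generate(self):
        # Placeholder checkpoint; the new checkpoints would be substituted here.
        checkpoint = "Helsinki-NLP/opus-mt-en-de"
        tokenizer = MarianTokenizer.from_pretrained(checkpoint)
        model = MarianMTModel.from_pretrained(checkpoint)

        inputs = tokenizer(["I love machine translation."], return_tensors="pt")
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=32)

        # Only check that decoding yields a non-empty string; exact outputs
        # depend on the checkpoint and are not asserted here.
        text = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
        self.assertGreater(len(text), 0)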


Successfully merging this pull request may close these issues:

  • Why is Marian to Torch converter hardcoded for tied vocab? (#15109)

4 participants