
support new marian models #15831


Merged: 19 commits, Mar 10, 2022

Conversation

patil-suraj (Contributor)

What does this PR do?

This PR updates the Marian model to:

  1. Allow not sharing the embeddings between the encoder and decoder.
  2. Allow tying only the decoder embeddings with the lm_head.
  3. Support two separate vocabularies in the tokenizer, one for the source and one for the target language.

To support this, the PR introduces the following new methods:

  • get_decoder_input_embeddings and set_decoder_input_embeddings
    To get and set the decoder embeddings when the embeddings are not shared. These methods raise an error if the embeddings are shared.
  • resize_decoder_token_embeddings
    To resize only the decoder embeddings. Raises an error if the embeddings are shared.

This PR also adds two new config attributes to MarianConfig:

  • share_encoder_decoder_embeddings: indicates whether the encoder and decoder embeddings should be shared.
  • decoder_vocab_size: specifies the decoder vocabulary size when the embeddings are not shared.

To support these changes, the following methods from the PreTrainedModel class are overridden (see the usage sketch below):

  • tie_weights
  • _resize_token_embeddings

Fixes #15109
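
A minimal usage sketch of the new configuration attributes and decoder-embedding methods described above. This is not code from the PR; the vocabulary sizes are made up for illustration, and the exact values depend on the tokenizer:

from transformers import MarianConfig, MarianMTModel

# Hypothetical vocabulary sizes, for illustration only.
config = MarianConfig(
    vocab_size=60000,                        # encoder (source) vocabulary
    decoder_vocab_size=59000,                # separate decoder (target) vocabulary
    share_encoder_decoder_embeddings=False,  # keep encoder and decoder embeddings separate
)
model = MarianMTModel(config)

# These accessors only work when the embeddings are not shared;
# with shared embeddings they raise an error, as described above.
decoder_embeddings = model.get_decoder_input_embeddings()
model.resize_decoder_token_embeddings(59500)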

Comment on lines +1285 to +1293
# if word embeddings are not tied, make sure that lm head is resized as well
if (
    self.config.share_encoder_decoder_embeddings
    and self.get_output_embeddings() is not None
    and not self.config.tie_word_embeddings
):
    old_lm_head = self.get_output_embeddings()
    new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
    self.set_output_embeddings(new_lm_head)
patil-suraj (Contributor, Author)

This will only resize the lm_head if the encoder and decoder embeddings are shared.

Comment on lines +1353 to +1355
# if embeddings are shared this will return shared embeddings otherwise decoder embed_tokens
word_embeddings = self.get_decoder().get_input_embeddings()
self._tie_or_clone_weights(output_embeddings, word_embeddings)
patil-suraj (Contributor, Author)

We always return the decoder embeddings here. This works for both cases, shared or not shared.

@HuggingFaceDocBuilder

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.


def get_decoder_input_embeddings(self):
    if self.config.share_encoder_decoder_embeddings:
        raise ValueError(
Contributor

Why raise an error here? It's totally fine to just return self.get_input_embeddings() in this case, no?

Contributor

Still don't think we need to raise here ;-)
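
For clarity, a rough sketch of the alternative the reviewer is suggesting (not what was merged): instead of raising when the embeddings are shared, the method could fall back to the shared embeddings. Written as a standalone function over a model purely for illustration:

def get_decoder_input_embeddings(model):
    # Suggested behavior: no error in the shared case.
    if model.config.share_encoder_decoder_embeddings:
        # Shared case: the decoder uses the same matrix as the encoder,
        # so simply return the shared input embeddings.
        return model.get_input_embeddings()
    # Separate case: return the decoder's own embedding table.
    return model.get_decoder().get_input_embeddings()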

patrickvonplaten (Contributor) left a comment

Overall, I'm in favor of adding the new Marian checkpoints the way it is shown here. The change from a Marian model that always force-ties the encoder and decoder embeddings to one that can switch between force-tied and untied encoder input embeddings and output embeddings is the better option here IMO, even though it goes a bit against our philosophy of not changing existing model code.

The main reasons why I'm in favor of the approach as it's implemented now are (with the feedback given below):

  • All the changes in this PR are also applicable to existing Marian V1 checkpoints. More specifically, all Marian V1 checkpoints can be loaded here with share_encoder_decoder_embeddings=False and then fine-tuned with the embeddings not being tied (see the sketch below).
  • Marian V2 comes from the exact same library as Marian V1 and is the same model. Creating a new name here (Marian V2) could confuse users.
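
A rough sketch of that first point, assuming the new config attribute can be overridden at load time through from_pretrained; the checkpoint name is an existing Marian V1 model used only as an example:

from transformers import MarianMTModel

# Load an existing Marian V1 checkpoint with embedding sharing turned off,
# so the encoder and decoder embeddings can then be fine-tuned separately.
model = MarianMTModel.from_pretrained(
    "Helsinki-NLP/opus-mt-en-de",
    share_encoder_decoder_embeddings=False,
)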

Thoughts @LysandreJik @sgugger ?

sgugger (Collaborator) left a comment

Ok for me. It's really pushing the test for a new model to its limit, but I understand the arguments to keep it in the same model.

patil-suraj changed the title from "[WiP] support new marian models" to "support new marian models" on Mar 10, 2022
patrickvonplaten (Contributor) left a comment

Looks good to me in general.
Left a couple of comments.

Also, given that a bunch of new model checkpoints will be added here, let's maybe add a slow integration test as well?
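
For reference, such a slow integration test might look roughly like the skeleton below. The checkpoint name is a placeholder (an existing Marian model rather than one of the new checkpoints), and no exact translation output is asserted since that depends on the checkpoint:

import unittest

import torch

from transformers import MarianMTModel, MarianTokenizer
from transformers.testing_utils import require_torch, slow


@require_torch
class NewMarianIntegrationTest(unittest.TestCase):
    @slow
    def test_forward_and_generate(self):
        # Placeholder checkpoint; the new checkpoints would be substituted here.
        checkpoint = "Helsinki-NLP/opus-mt-en-de"
        tokenizer = MarianTokenizer.from_pretrained(checkpoint)
        model = MarianMTModel.from_pretrained(checkpoint)

        inputs = tokenizer(["I love machine translation."], return_tensors="pt")
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=32)

        # Only check that decoding yields a non-empty string; exact outputs
        # depend on the checkpoint and are not asserted here.
        text = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
        self.assertGreater(len(text), 0)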


Successfully merging this pull request may close these issues:

  • Why is Marian to Torch converter hardcoded for tied vocab? (#15109)

4 participants