support new marian models #15831
Conversation
# if word embeddings are not tied, make sure that lm head is resized as well
if (
    self.config.share_encoder_decoder_embeddings
    and self.get_output_embeddings() is not None
    and not self.config.tie_word_embeddings
):
    old_lm_head = self.get_output_embeddings()
    new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
    self.set_output_embeddings(new_lm_head)
This will only resize the lm_head if embeddings are shared.
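For context, the non-shared case is handled by the new resize_decoder_token_embeddings method listed in the PR description. The snippet below is only a rough sketch of that complementary path, not the exact PR code; everything except the method names already mentioned in this PR is an assumption.

def resize_decoder_token_embeddings(self, new_num_tokens):
    # Sketch only: when embeddings are NOT shared, the decoder's embed_tokens and the
    # (untied) lm_head are resized together here, which is why the branch above can
    # restrict itself to the shared case.
    if self.config.share_encoder_decoder_embeddings:
        raise ValueError(
            "`resize_decoder_token_embeddings` should not be called when "
            "`config.share_encoder_decoder_embeddings` is True; use "
            "`resize_token_embeddings` instead."
        )

    old_embeddings = self.get_decoder().get_input_embeddings()
    new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
    self.get_decoder().set_input_embeddings(new_embeddings)

    # keep the untied lm_head in sync with the new decoder vocab size
    if self.get_output_embeddings() is not None and not self.config.tie_word_embeddings:
        old_lm_head = self.get_output_embeddings()
        new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
        self.set_output_embeddings(new_lm_head)

    self.config.decoder_vocab_size = new_num_tokens
    return self.get_decoder().get_input_embeddings()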
# if embeddings are shared this will return shared embeddings otherwise decoder embed_tokens
word_embeddings = self.get_decoder().get_input_embeddings()
self._tie_or_clone_weights(output_embeddings, word_embeddings)
We always return decoder embeddings here. This should work for both cases, shared or not shared.
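To make the quoted fragment easier to place, here is a hedged sketch of how a tie_weights override along these lines could look; the guard on config.tie_word_embeddings is an assumption about the surrounding code, not a quote from the PR.

def tie_weights(self):
    # Sketch only: tie the lm_head to whatever the decoder uses as input embeddings.
    # When share_encoder_decoder_embeddings is True these are the shared embeddings;
    # otherwise they are the decoder's own embed_tokens.
    output_embeddings = self.get_output_embeddings()
    if output_embeddings is not None and self.config.tie_word_embeddings:
        word_embeddings = self.get_decoder().get_input_embeddings()
        self._tie_or_clone_weights(output_embeddings, word_embeddings)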
def get_decoder_input_embeddings(self):
    if self.config.share_encoder_decoder_embeddings:
        raise ValueError(
Why raise an error here? It's totally fine to just return self.get_input_embeddings() in this case, no?
Still don't think we need to raise here ;-)
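For reference, the alternative being suggested would look roughly like this (a sketch, not the merged code):

def get_decoder_input_embeddings(self):
    # Suggested alternative: fall back to the shared embeddings instead of raising.
    if self.config.share_encoder_decoder_embeddings:
        return self.get_input_embeddings()
    return self.get_decoder().get_input_embeddings()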
Overall, I'm in favor of adding the new Marian checkpoints the way it is shown here. The change from having a Marian model that always force-tied encoder and decoder embeddings to a Marian model that can switch between force-tied and not-tied encoder input embeddings and encoder output embeddings is the better option here IMO, even though it does go a bit against our philosophy of not changing existing model code.
The main reasons why I'm in favor of the approach as it's implemented now are (with the feedback given below):
- All the changes of this PR are also applicable to existing Marian V1 checkpoints. More specifically, all Marian V1 checkpoints can be loaded here with share_encoder_decoder_embeddings=False and then fine-tuned with embeddings not being tied (see the sketch below).
- Marian V2 comes from the exact same library as Marian V1 and is the same model. Creating a new name here (Marian V2) could confuse users.
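A minimal sketch of the first point, assuming the usual from_pretrained behaviour of forwarding unused kwargs to the config; the checkpoint name is just an example of an existing Marian V1 checkpoint:

from transformers import MarianMTModel

# Load an existing Marian V1 checkpoint with untied encoder/decoder embeddings
# so they can be fine-tuned independently.
model = MarianMTModel.from_pretrained(
    "Helsinki-NLP/opus-mt-en-de",
    share_encoder_decoder_embeddings=False,
)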
Thoughts @LysandreJik @sgugger ?
Ok for me. It's really pushing the test for a new model to its limit, but I understand the arguments to keep it in the same model.
Looks good to me in general.
Left a couple of comments.
Also given that a bunch of new model checkpoints will be added here - let's maybe add a slow integration test as well?
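Something along these lines, perhaps; the checkpoint name and the assertion are placeholders, not values from this PR:

import unittest

from transformers import MarianMTModel, MarianTokenizer
from transformers.testing_utils import require_sentencepiece, require_torch, slow


@require_torch
@require_sentencepiece
class NewMarianIntegrationTest(unittest.TestCase):
    @slow
    def test_batch_generation(self):
        checkpoint = "org/new-marian-checkpoint"  # placeholder name
        tokenizer = MarianTokenizer.from_pretrained(checkpoint)
        model = MarianMTModel.from_pretrained(checkpoint)

        inputs = tokenizer(["Hello, how are you?"], return_tensors="pt", padding=True)
        generated = model.generate(**inputs)
        decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)

        # a real test would compare against a pinned reference translation
        self.assertEqual(len(decoded), 1)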
What does this PR do?
This PR updates the Marian model:
- to support checkpoints that do not share (tie) the embeddings between the encoder, the decoder, and the lm_head
- to support two vocabs in the tokenizer, one for the src and one for the tgt language

To support this, the PR introduces the following new methods:
- get_decoder_input_embeddings and set_decoder_input_embeddings: to get and set the decoder embeddings when the embeddings are not shared. These methods will raise an error if the embeddings are shared.
- resize_decoder_token_embeddings: to only resize the decoder embeddings. Will raise an error if the embeddings are shared.

This PR also adds two new config attributes to MarianConfig:
- share_encoder_decoder_embeddings: to indicate if embeddings should be shared or not
- decoder_vocab_size: to specify the vocab size for the decoder when embeddings are not shared.

And the following methods from the PreTrainedModel class are overridden to support these changes:
- tie_weights
- _resize_token_embeddings

A short usage sketch of the new options follows below.
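A hedged usage sketch of these options (sizes and special-token ids are arbitrary; this builds a randomly initialised model, not one of the new checkpoints):

from transformers import MarianConfig, MarianMTModel

config = MarianConfig(
    vocab_size=32000,                      # encoder / src vocab
    decoder_vocab_size=28000,              # decoder / tgt vocab, used when not shared
    share_encoder_decoder_embeddings=False,
    pad_token_id=1,
    eos_token_id=0,
    decoder_start_token_id=1,
)
model = MarianMTModel(config)

# encoder and decoder now hold separate embedding tables
print(model.get_input_embeddings().weight.shape)                # (32000, d_model)
print(model.get_decoder().get_input_embeddings().weight.shape)  # (28000, d_model)

# resize only the decoder side (raises if embeddings are shared)
model.resize_decoder_token_embeddings(30000)
print(model.get_decoder().get_input_embeddings().weight.shape)  # (30000, d_model)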
Fixes #15109