This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Enhance WMT17 En-Zh task with full dataset. #461

Merged: 2 commits, Jan 9, 2018


twairball
Contributor

This PR adds the full dataset to TranslateEnzhWmt when it is available.
The UN Parallel Corpus and the CWMT corpus must be downloaded from their official websites after registering. This PR adds instructions for placing the downloaded archives manually in the data-generation temp directory (e.g. /tmp/t2t_datagen/dataset.tgz), and adds code that appends them to the full training dataset when they are present.
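
The "append if available" behavior described above could be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `collect_training_datasets`, the archive names, and the inner file paths are all hypothetical placeholders.

```python
import os

# Hypothetical placeholder entries: (archive filename, (source file, target file)).
# The real archive names and inner paths are defined by the PR and the corpus sites.
BASE_DATASETS = [
    ("base_corpus.tgz", ("base/train.en", "base/train.zh")),
]
MANUAL_DATASETS = [
    ("un_corpus.tgz", ("un/train.en", "un/train.zh")),
    ("cwmt_corpus.tgz", ("cwmt/train.en", "cwmt/train.zh")),
]


def collect_training_datasets(tmp_dir):
    """Return the base datasets plus any manual archives found in tmp_dir."""
    datasets = list(BASE_DATASETS)
    for archive, files in MANUAL_DATASETS:
        path = os.path.join(tmp_dir, archive)
        if os.path.exists(path):
            # The archive was placed here by hand, so include it in training.
            datasets.append((archive, files))
        else:
            print("Skipping %s: not found at %s" % (archive, path))
    return datasets
```

With this shape, data generation still works when only the freely downloadable data is present, and picks up the larger corpora once the archives are copied into the temp directory.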

Fixes #446: adds `file_size_budget` as an argument to `get_or_generate_vocab`.
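
As a rough illustration of what a per-file size budget can do during vocabulary generation, here is a minimal sketch. It assumes the budget caps how many bytes of each file are read. The helper `sample_lines_within_budget` is hypothetical and is not the library's implementation of `get_or_generate_vocab`.

```python
def sample_lines_within_budget(lines, file_size_budget):
    """Yield stripped lines until about file_size_budget bytes are consumed.

    Hypothetical helper: the budget is checked before each line, so the
    final yielded line may overshoot the budget slightly.
    """
    remaining = file_size_budget
    for line in lines:
        if remaining <= 0:
            break
        remaining -= len(line.encode("utf-8"))
        yield line.strip()
```

Exposing the budget as an argument lets a problem with very large corpora read more (or less) text per file when building its vocabulary, instead of relying on a fixed constant.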

Fix tensorflow#446 Added `file_size_budget` as argument to `get_or_generate_vocab`.
Contributor

@rsepassi rsepassi left a comment

Thank you!


@registry.register_problem
class TranslateEnzhWmt8k(translate.TranslateProblem):

Let's keep an 8k version and add a 32k version. The target vocab size should be part of the problem name.

@@ -72,17 +169,32 @@ def source_vocab_name(self):
  @property
  def target_vocab_name(self):
    return "vocab.enzh-zh.%d" % self.targeted_vocab_size

  def get_training_dataset(self, tmp_dir):
    """UN Parallel Corpus and CWMT Corpus need to be downloaded manually.

Could you provide instructions somewhere for how to download these manually?

- Added TranslateEnzhWmt8k problem.
- Renamed to TranslateEnzhWmt32k, to reflect target vocab in problem name
- Added instructions for manually downloading full dataset.
@twairball
Contributor Author

@rsepassi I've made the requested changes, hope this is ok!

@rsepassi rsepassi merged commit 92267e8 into tensorflow:master Jan 9, 2018
@rsepassi
Contributor

rsepassi commented Jan 9, 2018

Thank you!
