Enhance WMT17 En-Zh task with full dataset. #461

twairball · 2017-12-05T07:36:49Z

This PR adds the full dataset to TranslateEnzhWmt when it is available.
The UN Parallel Corpus and the CWMT corpus must be downloaded from their official websites after registering. We add instructions for placing the downloaded datasets manually at e.g. /tmp/t2t_datagen/dataset.tgz, and add code that appends them to the full training dataset if they are present.
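
The "append if available" behaviour described here can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `collect_training_datasets`, the archive file names, and the dataset entries are hypothetical placeholders.

```python
import os


def collect_training_datasets(tmp_dir, base_datasets, optional_archives):
  """Returns base_datasets plus entries for optional archives found in tmp_dir.

  optional_archives maps a (hypothetical) archive file name to the dataset
  entry that should be appended when that archive exists on disk.
  """
  datasets = list(base_datasets)  # copy so the caller's list is left untouched
  for archive_name, entry in optional_archives.items():
    archive_path = os.path.join(tmp_dir, archive_name)
    if os.path.exists(archive_path):
      datasets.append(entry)
    else:
      # Manually downloaded corpora are optional, so a missing file is not an error.
      print("Optional dataset not found, skipping: %s" % archive_path)
  return datasets
```

With this shape, data generation falls back to the freely downloadable data when the registration-only corpora have not been placed in the temporary directory.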

Fix tensorflow#446 Added `file_size_budget` as argument to `get_or_generate_vocab`.
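
As a rough illustration of what a file size budget means when building a vocabulary, the sketch below reads lines from a corpus file only until a byte budget is used up. It is a stand-alone example under assumptions, not the body of `get_or_generate_vocab`: the helper name `read_lines_within_budget` and its default budget are hypothetical.

```python
def read_lines_within_budget(path, file_size_budget=1000000):
  """Yields stripped lines from path until about file_size_budget bytes are read.

  The line that crosses the budget is still yielded, so the total can
  slightly exceed the budget.
  """
  remaining = file_size_budget
  with open(path, "rb") as f:
    for raw_line in f:
      if remaining <= 0:
        break
      remaining -= len(raw_line)
      yield raw_line.decode("utf-8").strip()
```

Exposing the budget as an argument lets a problem with large corpora sample more (or less) text per file than a fixed default would allow.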

rsepassi

Thank you!

rsepassi · 2017-12-15T23:50:03Z

tensor2tensor/data_generators/translate_enzh.py


 @registry.register_problem
-class TranslateEnzhWmt8k(translate.TranslateProblem):


Let's keep an 8k version and add a 32k version. The target vocab size should be part of the problem name.

rsepassi · 2017-12-19T01:44:50Z

tensor2tensor/data_generators/translate_enzh.py

@@ -72,17 +169,32 @@ def source_vocab_name(self):
  @property
  def target_vocab_name(self):
    return "vocab.enzh-zh.%d" % self.targeted_vocab_size
+
+  def get_training_dataset(self, tmp_dir):
+    """UN Parallel Corpus and CWMT Corpus need to be downloaded manually.


Could you provide instructions somewhere for how to download these manually?

twairball · 2017-12-24T07:46:15Z

@rsepassi I've made the requested changes, hope this is ok!

rsepassi · 2018-01-09T00:01:00Z

Thank you!

Enhance WMT17 En-Zh task with full dataset.

4d7db48

Fix tensorflow#446 Added `file_size_budget` as argument to `get_or_generate_vocab`.

rsepassi suggested changes Dec 19, 2017

Made requested Fixes:

f9c8a95

- Added TranslateEnzhWmt8k problem.
- Renamed to TranslateEnzhWmt32k, to reflect target vocab in problem name.
- Added instructions for manually downloading full dataset.

rsepassi approved these changes Jan 9, 2018

rsepassi merged commit 92267e8 into tensorflow:master Jan 9, 2018
