Skip to content
This repository was archived by the owner on Jul 7, 2023. It is now read-only.

t2t-decoder with --checkpoint_path #524

Merged

Conversation

martinpopel
Copy link
Contributor

  • --checkpoint_path allows t2t-decoder to load any model, not only the latest one
  • --keep_timestamp sets the translated file's mtime according to the checkpoint (it is by default on, but active only with --checkpoint_path and --decode_to_file)
  • prevent tf.gfile.Glob crashes due to concurrent filesystem edits

All this is needed for t2t-translate-all + t2t-bleu to work as expected (I forgot to add this commit to #488).
In addition I am trying to solve #523.

and allow keeping timestamp in that case.
This is needed for t2t-translate-all + t2t-bleu to work
as expected (I forgot to add this commit to tensorflow#488).
tf.gfile.Glob may crash with
 tensorflow.python.framework.errors_impl.NotFoundError:
 xy/model.ckpt-1130761_temp_9cb4cb0b0f5f4382b5ea947aadfb7a40; No such file or directory
Let's use standard glob.glob instead, it seems to be more reliable.
this is needed for **locals() to work
As I think about it, I would prefer my original implementation
https://github.com/tensorflow/tensor2tensor/blob/bb1173adce940e62c840970fa0f06f69fd9398db/tensor2tensor/utils/bleu_hook.py#L147-L157
But it seems there are some T2T/Google internal Python guidelines forbidding this,
so we have to live with the singleton.
@@ -67,6 +67,7 @@


def main(_):
FLAGS = flags.FLAGS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FLAGS defined above. rm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can delete the current line 41 (I hope), but not this one.
FLAGS need to be a local variable because they are used few lines below -- https://github.com/martinpopel/tensor2tensor/blob/61005b0b1e063e3e26cb403cf6e8cdea3509c3b5/tensor2tensor/bin/t2t_translate_all.py#L94
Thanks to **locals(), the users can provide e.g --decoder_command='qsub my-decoder.sh {params} --my_param={FLAGS.my_param}' and I think it is more user-friendly than forcing them to write --my_param={flags.FLAGS.my_param}.

BTW: Last time someone deleted this line during an internal-Google refactor, but didn't noticed me within PR review as you do now. It was one of the reasons why t2t_translate_all.py didn't work (another reason was I was lazy to write a test).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. How about this?:

locals_and_flags = dict()
local_and_flags.update(FLAGS.__dict__)
locals_and_flags.update(locals())
...
.format(**locals_and_flags)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. If you feel it's more elegant, I will do it this way (and test).

@@ -158,6 +159,7 @@ def property_chars(self, prefix):
return "".join(six.unichr(x) for x in range(sys.maxunicode)
if unicodedata.category(six.unichr(x)).startswith(prefix))

uregex = UnicodeRegex()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this top level now? would prefer to leave it where it is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating the UnicodeRegex instance is quite slow because I need to iterate twice over all Unicode characters.
It is really a bad idea to create a new instance for each line of the test set to be tokenized.
It caused the whole t2t-bleu.py to be about 10000 times slower (depending on the test set size).
As I note in my commit log, I would slightly prefer my original implementation with a class with static methods only (the class was there rather as a namespace).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok. thanks. fine as is then.

# tensorflow.python.framework.errors_impl.NotFoundError:
# xy/model.ckpt-1130761_temp_9cb4cb0b0f5f4382b5ea947aadfb7a40; No such file or directory
# Let's use standard glob.glob instead.
for filename in glob.glob(path_prefix + '*-[0-9]*' + path_suffix):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer tf.gfile.Glob. why is it crashing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When saving a checkpoint, TensorFlow always creates a temp file first (which unfortunately matches the given wildcard pattern, but this is not important here I think).
This temporary file is deleted soon, but if this happens when tf.gfile.Glob is being called, it crashes.
It crashes within the tf.gfile.Glob (i.e. not within the for-cycle in my code) with tensorflow.python.framework.errors_impl.NotFoundError, which is difficult to handle (I could catch the exception and try the same call again, but that's too hacky).
According to my extensive tests, glob.glob does not have this non-deterministic bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the benefit of tf.gfile is that it will support many underlying filesystems, including Google cloud storage and internal filesystems. glob won't know what to do with a path that starts with gs://.

would updating the pattern to 'ckpt-[0-9]*.index' work?
i'm thinking maybe not because it's probably failing on a IsDirectory call.

i'm inclined to advocate for a retrying_glob function to maintain portability across filesystems though i agree it's hacky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the benefit of tf.gfile and this was the reason why I tried it first.
I would prefer a different solution:

  • import tf.gfile wrapper once which overrides glob.glob, and keep glob.glob everywhere in the code
  • make the wrapper detect special systems (gs://), but otherwise fallback to the standard (and reliable) glob.glob

If this is not acceptable, I will catch the exception and call again.
Even if the bug is rare, it is annoying when running many week-long experiments and some of the evaluation jobs crash.

would updating the pattern to 'ckpt-[0-9]*.index' work

No. I use stepfiles_iterator also in t2t-bleu, where the files being iterated are not checkpoints, but translations of the test set, which I decided to have no extension (but path_suffix can be e.g. .txt as well). For this reason I call it stepfiles and not checkpoints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have the retrying tf.gfile.Glob with the exception catching. Monkey-patching a lib would be really confusing for people reading the code and not understanding why something is happening.

@lkluo
Copy link

lkluo commented Jan 25, 2018

I encountered t2t-translate-all error:

tensor2tensor/bin/t2t_translate_all.py", line 99, in main
).format(**locals())
KeyError: 'FLAGS'

I copied several lines of code up to line 99 from t2t_translate_all.py, and printed out "params", turning out no error.

The functionality was broken during the adoption of TPU trainer_lib.py
instead of the original trainer_utils.py.
Currently, the default is to save checkpoints each 2000 steps,
while in previous T2T versions the default was each 10 minutes.
@martinpopel
Copy link
Contributor Author

@rsepassi: I hope all comments from the review are now fixed.
@lkluo: Yes, this is one of the errors fixed by this PR.

"""
try:
return tf.gfile.Glob(pattern)
except tensorflow.python.framework.errors_impl.NotFoundError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tf.errors.NotFoundError

@rsepassi rsepassi merged commit b6a57e7 into tensorflow:master Jan 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants