-
Notifications
You must be signed in to change notification settings - Fork 3.6k
t2t-decoder with --checkpoint_path #524
t2t-decoder with --checkpoint_path #524
Conversation
and allow keeping timestamp in that case. This is needed for t2t-translate-all + t2t-bleu to work as expected (I forgot to add this commit to tensorflow#488).
tf.gfile.Glob may crash with tensorflow.python.framework.errors_impl.NotFoundError: xy/model.ckpt-1130761_temp_9cb4cb0b0f5f4382b5ea947aadfb7a40; No such file or directory Let's use standard glob.glob instead, it seems to be more reliable.
this is needed for **locals() to work
As I think about it, I would prefer my original implementation https://github.com/tensorflow/tensor2tensor/blob/bb1173adce940e62c840970fa0f06f69fd9398db/tensor2tensor/utils/bleu_hook.py#L147-L157 But it seems there are some T2T/Google internal Python guidelines forbidding this, so we have to live with the singleton.
@@ -67,6 +67,7 @@ | |||
|
|||
|
|||
def main(_): | |||
FLAGS = flags.FLAGS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FLAGS defined above. rm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can delete the current line 41 (I hope), but not this one.
FLAGS need to be a local variable because they are used few lines below -- https://github.com/martinpopel/tensor2tensor/blob/61005b0b1e063e3e26cb403cf6e8cdea3509c3b5/tensor2tensor/bin/t2t_translate_all.py#L94
Thanks to **locals()
, the users can provide e.g --decoder_command='qsub my-decoder.sh {params} --my_param={FLAGS.my_param}'
and I think it is more user-friendly than forcing them to write --my_param={flags.FLAGS.my_param}
.
BTW: Last time someone deleted this line during an internal-Google refactor, but didn't noticed me within PR review as you do now. It was one of the reasons why t2t_translate_all.py
didn't work (another reason was I was lazy to write a test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. How about this?:
locals_and_flags = dict()
local_and_flags.update(FLAGS.__dict__)
locals_and_flags.update(locals())
...
.format(**locals_and_flags)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. If you feel it's more elegant, I will do it this way (and test).
@@ -158,6 +159,7 @@ def property_chars(self, prefix): | |||
return "".join(six.unichr(x) for x in range(sys.maxunicode) | |||
if unicodedata.category(six.unichr(x)).startswith(prefix)) | |||
|
|||
uregex = UnicodeRegex() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this top level now? would prefer to leave it where it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating the UnicodeRegex
instance is quite slow because I need to iterate twice over all Unicode characters.
It is really a bad idea to create a new instance for each line of the test set to be tokenized.
It caused the whole t2t-bleu.py
to be about 10000 times slower (depending on the test set size).
As I note in my commit log, I would slightly prefer my original implementation with a class with static methods only (the class was there rather as a namespace).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok. thanks. fine as is then.
tensor2tensor/utils/bleu_hook.py
Outdated
# tensorflow.python.framework.errors_impl.NotFoundError: | ||
# xy/model.ckpt-1130761_temp_9cb4cb0b0f5f4382b5ea947aadfb7a40; No such file or directory | ||
# Let's use standard glob.glob instead. | ||
for filename in glob.glob(path_prefix + '*-[0-9]*' + path_suffix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would prefer tf.gfile.Glob. why is it crashing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When saving a checkpoint, TensorFlow always creates a temp file first (which unfortunately matches the given wildcard pattern, but this is not important here I think).
This temporary file is deleted soon, but if this happens when tf.gfile.Glob
is being called, it crashes.
It crashes within the tf.gfile.Glob
(i.e. not within the for-cycle in my code) with tensorflow.python.framework.errors_impl.NotFoundError, which is difficult to handle (I could catch the exception and try the same call again, but that's too hacky).
According to my extensive tests, glob.glob
does not have this non-deterministic bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the benefit of tf.gfile is that it will support many underlying filesystems, including Google cloud storage and internal filesystems. glob won't know what to do with a path that starts with gs://.
would updating the pattern to 'ckpt-[0-9]*.index' work?
i'm thinking maybe not because it's probably failing on a IsDirectory call.
i'm inclined to advocate for a retrying_glob function to maintain portability across filesystems though i agree it's hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the benefit of tf.gfile and this was the reason why I tried it first.
I would prefer a different solution:
- import tf.gfile wrapper once which overrides glob.glob, and keep glob.glob everywhere in the code
- make the wrapper detect special systems (
gs://
), but otherwise fallback to the standard (and reliable)glob.glob
If this is not acceptable, I will catch the exception and call again.
Even if the bug is rare, it is annoying when running many week-long experiments and some of the evaluation jobs crash.
would updating the pattern to 'ckpt-[0-9]*.index' work
No. I use stepfiles_iterator
also in t2t-bleu
, where the files being iterated are not checkpoints, but translations of the test set, which I decided to have no extension (but path_suffix
can be e.g. .txt
as well). For this reason I call it stepfiles
and not checkpoints
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have the retrying tf.gfile.Glob with the exception catching. Monkey-patching a lib would be really confusing for people reading the code and not understanding why something is happening.
I encountered t2t-translate-all error: tensor2tensor/bin/t2t_translate_all.py", line 99, in main I copied several lines of code up to line 99 from t2t_translate_all.py, and printed out "params", turning out no error. |
The functionality was broken during the adoption of TPU trainer_lib.py instead of the original trainer_utils.py. Currently, the default is to save checkpoints each 2000 steps, while in previous T2T versions the default was each 10 minutes.
tensor2tensor/utils/bleu_hook.py
Outdated
""" | ||
try: | ||
return tf.gfile.Glob(pattern) | ||
except tensorflow.python.framework.errors_impl.NotFoundError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tf.errors.NotFoundError
--checkpoint_path
allowst2t-decoder
to load any model, not only the latest one--keep_timestamp
sets the translated file's mtime according to the checkpoint (it is by default on, but active only with--checkpoint_path
and--decode_to_file
)tf.gfile.Glob
crashes due to concurrent filesystem editsAll this is needed for
t2t-translate-all
+t2t-bleu
to work as expected (I forgot to add this commit to #488).In addition I am trying to solve #523.