This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Proper BLEU evaluation #436

Merged: 10 commits, Dec 1, 2017

Conversation

martinpopel
Contributor

@martinpopel martinpopel commented Nov 22, 2017

This PR adds a t2t-bleu script which computes "real" BLEU (giving the same result as sacreBLEU with --tokenization intl and as mteval-v14.pl with --international-tokenization).
It can be used in two ways:

  • To evaluate an already translated file: t2t-bleu --translation=my-wmt13.de --reference=wmt13_deen.de
  • To evaluate all checkpoints in a given directory:
t2t-bleu \
  --model_dir=t2t_train \
  --data_dir=t2t_data \
  --translations_dir=my-translations \
  --problems=translate_ende_wmt32k \
  --hparams_set=transformer_big_single_gpu \
  --source=wmt13_deen.en \
  --reference=wmt13_deen.de
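
The "international" tokenization that makes these scores comparable can be sketched roughly as follows. This is a simplified illustration of the idea (splitting off Unicode punctuation and symbols before counting n-grams), not the actual mteval-v14.pl rule set, and the function name is hypothetical:

```python
import unicodedata

def intl_tokenize(text):
    """Rough sketch of 'international' tokenization: split Unicode
    punctuation/symbol characters off as separate tokens, then split
    on whitespace. mteval-v14.pl applies a more detailed rule set."""
    out = []
    for ch in text:
        if unicodedata.category(ch)[0] in ("P", "S"):
            out.append(" " + ch + " ")  # isolate punctuation/symbols
        else:
            out.append(ch)
    return "".join(out).split()

print(intl_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```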

I find the second way more useful than t2t-trainer --schedule=continuous_eval because:

  • It reports the real BLEU (not approx_bleu), which I can directly compare with papers and e.g. http://matrix.statmt.org/ and http://wmt.ufal.cz/.
  • It reports both case-sensitive and case-insensitive BLEU (both are used in papers, usually without this being said explicitly).
  • It can be used either concurrently with training (waiting for new checkpoints) or ex post, to evaluate all checkpoints in a given directory.
  • Either way, it starts from the oldest checkpoint (or from the last evaluated one), unlike --schedule=continuous_eval, which always uses the newest checkpoint.
  • It stores the time when each checkpoint was created, not when it was evaluated. It also reports a [0,0] point in the TensorBoard graph, so the relative time axis is correct.
  • As a by-product, each checkpoint gets a file with a translation of the test set.
  • Any decoding parameters can be set, most notably beam size and alpha (while approx_bleu always uses greedy decoding).
  • The evaluation can be run multiple times (e.g. with different beam sizes), storing the resulting events file in a different directory each time.
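
The oldest-first checkpoint ordering and the wait-for-new-checkpoint behaviour described above can be sketched like this. These are illustrative helpers with hypothetical names, not the actual t2t-bleu code; they assume the standard `model.ckpt-<step>.index` file naming:

```python
import os
import re
import time

def checkpoint_steps(model_dir):
    """List checkpoint step numbers found in model_dir, oldest first.
    Assumes the conventional 'model.ckpt-<step>.index' naming."""
    steps = set()
    for name in os.listdir(model_dir):
        m = re.match(r"model\.ckpt-(\d+)\.index$", name)
        if m:
            steps.add(int(m.group(1)))
    return sorted(steps)

def wait_for_new_checkpoint(model_dir, last_step, poll_secs=60):
    """Block until a checkpoint newer than last_step appears, then
    return the oldest unevaluated step (not the newest one)."""
    while True:
        newer = [s for s in checkpoint_steps(model_dir) if s > last_step]
        if newer:
            return newer[0]
        time.sleep(poll_secs)
```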

Now, I have rebased this PR to v1.3.0.

@colmantse

Thank you, this would be really useful. I was thinking tokenization would be an issue, but Moses tokenization is fine. I don't remember whether Moses supports Chinese, though.

@martinpopel
Contributor Author

@colmantse: there is a Chinese-tokenization version of BLEU (used in recent WMT editions): https://github.com/mjpost/sacreBLEU/blob/b38690e1537cd4719c3517ef77c8255c5a107cc8/sacrebleu.py#L414-L528
and a related improvement: mjpost/sacrebleu#5
It could easily be added here, but the question is whether it would not be better to integrate sacreBLEU as a dependency to prevent code duplication; I don't have time for that.
My current plan is to fix the failing Travis tests (updating the bleu_hook test so it takes the smoothing into account) and to add an option to wait for new checkpoints being created.

@twairball
Contributor

The WMT17 homepage references the same Chinese character n-gram tokenization here: http://www.statmt.org/wmt17/tokenizeChinese.py
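
A character-level tokenization of the kind used by tokenizeChinese.py can be sketched as follows. This is an illustrative simplification (it only handles the basic CJK Unified Ideographs block), not the WMT script itself:

```python
def tokenize_chinese(text):
    """Sketch of WMT-style Chinese preprocessing for BLEU: put spaces
    around each CJK character so n-grams are computed over characters.
    The range check is simplified compared to tokenizeChinese.py."""
    out = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # basic CJK Unified Ideographs block
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return " ".join("".join(out).split())

print(tokenize_chinese("我们test了"))  # 我 们 test 了
```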

Commit messages in this PR:

  • Fix BLEU computation for the edge case of no matching 4-gram (or trigram, ...). Smoothing is the default in the official BLEU implementation:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L843-L885
    (Smoothing is not present in multi-bleu.perl, but that script explicitly says it is for internal purposes only and recommends using mteval-v14.pl instead.)
  • ... so one can compute real BLEU on two files (MT translation=hypothesis and reference). I've tested this on a few datasets and it seems to agree with the official implementation mteval-v14.pl.
  • ... not necessarily the latest one in a given output_dir, so it can be used for continuous evaluation or for resuming an older evaluation from a checkpoint with a given number of steps. It is also possible to specify the name of the events subdirectory and the tag suffix.
  • Fix for the case when evaluating one checkpoint takes longer than creating a new checkpoint.
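
The mteval-v14-style smoothing referred to in the commit messages can be sketched like this. It is an illustrative re-implementation with a hypothetical function name, not the actual bleu_hook code: whenever an n-gram order has zero matches, a geometrically decaying pseudo-precision is substituted instead of letting log(0) force BLEU to zero. The brevity penalty is omitted here:

```python
import math

def smoothed_bleu(match_counts, possible_counts, max_order=4):
    """Geometric mean of n-gram precisions with mteval-v14-style
    smoothing for orders that have zero matches. Brevity penalty
    intentionally omitted in this sketch."""
    precisions = []
    invcnt = 1.0
    for n in range(max_order):
        if possible_counts[n] == 0:
            continue  # sentence shorter than n; skip this order
        if match_counts[n] > 0:
            precisions.append(match_counts[n] / possible_counts[n])
        else:
            invcnt *= 2.0  # each smoothed order gets half the weight
            precisions.append(1.0 / (invcnt * possible_counts[n]))
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))
```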
@Gldkslfmsd

Hello,
I worry the bleu_hook is not division-by-zero-proof:

Traceback (most recent call last):
  File "t2t-bleu", line 242, in <module>
    tf.app.run()
  File "/lnet/spec/work/people/machacek/t2t-second/p3-t2t-gpu-fork/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "t2t-bleu", line 232, in main
    count_bleu(FLAGS.reference, out_file, ptag="")
  File "t2t-bleu", line 225, in count_bleu
    bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, out_file, case_sensitive=False)
  File "/lnet/spec/work/people/machacek/t2t-second/p3-t2t-gpu-fork/lib/python3.4/site-packages/tensor2tensor-1.2.9-py3.4.egg/tensor2tensor/utils/bleu_hook.py", line 197, in bleu_wrapper
    return compute_bleu(ref_tokens, hyp_tokens)
  File "/lnet/spec/work/people/machacek/t2t-second/p3-t2t-gpu-fork/lib/python3.4/site-packages/tensor2tensor-1.2.9-py3.4.egg/tensor2tensor/utils/bleu_hook.py", line 117, in compute_bleu
    bp = math.exp(1 - 1. / ratio) if ratio < 1.0 else 1.0
ZeroDivisionError: float division by zero

@martinpopel
Contributor Author

@Gldkslfmsd thanks for the bug report, but I believe this is not related to this PR (although we could fix it here): I have not changed the way the brevity penalty is computed.
I believe this can happen only if the translation_length is zero, i.e. no translation was produced for the whole corpus. Can you confirm this is the case?
I agree the script should not fail, but produce BLEU=0 and also a warning.

@Gldkslfmsd

> I believe this can happen only if the translation_length is zero, i.e. no translation was produced for the whole corpus. Can you confirm this is the case?

Yes. And maybe it can happen if the reference file is empty.

> I agree the script should not fail, but produce BLEU=0 and also a warning.

+1

... with a clear error message instead of a misleading "division by zero"
@martinpopel
Contributor Author

@Gldkslfmsd: your error was caused by an empty translation; an empty reference would fail elsewhere.
Either way, I changed my mind: I now think this situation is very suspicious, so we should fail with a clear error
rather than return BLEU=0. Thus I have added two assertions.
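
The guarded brevity penalty that the two assertions amount to can be sketched as follows. The helper name is hypothetical; the actual code lives in bleu_hook.py:

```python
import math

def brevity_penalty(translation_length, reference_length):
    """Brevity penalty with the two guards discussed above: fail
    loudly on empty inputs instead of raising a bare
    ZeroDivisionError on 1/ratio."""
    assert reference_length > 0, "empty reference corpus"
    assert translation_length > 0, "empty translation corpus (did decoding fail?)"
    ratio = translation_length / reference_length
    return math.exp(1 - 1.0 / ratio) if ratio < 1.0 else 1.0
```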

@Gldkslfmsd

Gldkslfmsd commented Nov 30, 2017 via email

Contributor

@lukaszkaiser lukaszkaiser left a comment


Thanks!

@lukaszkaiser lukaszkaiser merged commit bb1173a into tensorflow:master Dec 1, 2017
@Gldkslfmsd

FYI:
Gldkslfmsd@697e506

@noe
Contributor

noe commented Jan 5, 2018

@martinpopel @lukaszkaiser it seems that the changes of this PR were removed from master in 20c7e41.

@martinpopel
Contributor Author

@noe: Yes. It was not an intentional removal within that commit; it is just a consequence of the chosen merging strategy between GitHub and the internal Google repo. This PR has not been fully integrated into the internal repo. @lukaszkaiser wrote it was because of Python 2 problems, but I am not aware of any (those I marked in the code with comments, i.e. the encoding parameter for open, were solved when merging). Anyway, see #488 for a new version of this PR.
