This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Proper BLEU evaluation #436

Merged: 10 commits, Dec 1, 2017

Conversation

martinpopel
Contributor

@martinpopel martinpopel commented Nov 22, 2017

This PR adds a t2t-bleu script which computes "real" BLEU (giving the same result as sacreBLEU with --tokenization intl and as mteval-v14.pl with --international-tokenization).
It can be used in two ways:

  • To evaluate an already translated file: t2t-bleu --translation=my-wmt13.de --reference=wmt13_deen.de
  • To evaluate all checkpoints in a given directory:
t2t-bleu \
  --model_dir=t2t_train \
  --data_dir=t2t_data \
  --translations_dir=my-translations \
  --problems=translate_ende_wmt32k \
  --hparams_set=transformer_big_single_gpu \
  --source=wmt13_deen.en \
  --reference=wmt13_deen.de
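
The "international" tokenization that makes these scores comparable can be sketched roughly as follows. This is a simplified illustration of the idea (splitting off Unicode punctuation and symbols before counting n-grams), not the actual mteval-v14.pl rule set, and the function name is hypothetical:

```python
import unicodedata

def intl_tokenize(text):
    """Rough sketch of 'international' tokenization: split Unicode
    punctuation/symbol characters off as separate tokens, then split
    on whitespace. mteval-v14.pl applies a more detailed rule set."""
    out = []
    for ch in text:
        if unicodedata.category(ch)[0] in ("P", "S"):
            out.append(" " + ch + " ")  # isolate punctuation/symbols
        else:
            out.append(ch)
    return "".join(out).split()

print(intl_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```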

I find the second way more useful than t2t-trainer --schedule=continuous_eval because:

  • It reports the real BLEU (not approx_bleu), which I can directly compare with papers and e.g. http://matrix.statmt.org/ and http://wmt.ufal.cz/.
  • It reports both case-sensitive and case-insensitive BLEU (both are used in papers, usually without this being said explicitly).
  • It can be used either concurrently with training (waiting for new checkpoints) or ex post, to evaluate all checkpoints in a given directory.
  • Either way, it starts from the oldest checkpoint (or from the last evaluated one), unlike --schedule=continuous_eval, which always uses the newest checkpoint.
  • It stores the time when each checkpoint was created, not when it was evaluated. It also reports a [0,0] point in the TensorBoard graph, so the relative time axis is correct.
  • As a by-product, each checkpoint gets a file with a translation of the test set.
  • Any decoding parameters can be set, most notably beam size and alpha (while approx_bleu always uses greedy decoding).
  • The evaluation can be run multiple times (e.g. with different beam sizes), storing the resulting events file in a different directory each time.
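
The oldest-first checkpoint ordering and the wait-for-new-checkpoint behaviour described above can be sketched like this. These are illustrative helpers with hypothetical names, not the actual t2t-bleu code; they assume the standard `model.ckpt-<step>.index` file naming:

```python
import os
import re
import time

def checkpoint_steps(model_dir):
    """List checkpoint step numbers found in model_dir, oldest first.
    Assumes the conventional 'model.ckpt-<step>.index' naming."""
    steps = set()
    for name in os.listdir(model_dir):
        m = re.match(r"model\.ckpt-(\d+)\.index$", name)
        if m:
            steps.add(int(m.group(1)))
    return sorted(steps)

def wait_for_new_checkpoint(model_dir, last_step, poll_secs=60):
    """Block until a checkpoint newer than last_step appears, then
    return the oldest unevaluated step (not the newest one)."""
    while True:
        newer = [s for s in checkpoint_steps(model_dir) if s > last_step]
        if newer:
            return newer[0]
        time.sleep(poll_secs)
```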

Now, I have rebased this PR to v1.3.0.

@colmantse

Thank you, this would be really useful. I was thinking tokenization would be an issue, but Moses tokenization is fine. I don't remember whether Moses supports Chinese, though.

@martinpopel
Contributor Author

@colmantse: there is a Chinese-tokenization version of BLEU (used in recent WMT editions): https://github.com/mjpost/sacreBLEU/blob/b38690e1537cd4719c3517ef77c8255c5a107cc8/sacrebleu.py#L414-L528
and a related improvement: mjpost/sacrebleu#5
It could easily be added here, but the question is whether it would not be better to integrate sacreBLEU as a dependency to prevent code duplication; I don't have time for that.
My current plan is to fix the failing Travis tests (updating the bleu_hook test so it takes the smoothing into account) and to add an option to wait for new checkpoints being created.

@twairball
Contributor

The WMT17 homepage references the same Chinese character n-gram tokenization here: http://www.statmt.org/wmt17/tokenizeChinese.py
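
A character-level tokenization of the kind used by tokenizeChinese.py can be sketched as follows. This is an illustrative simplification (it only handles the basic CJK Unified Ideographs block), not the WMT script itself:

```python
def tokenize_chinese(text):
    """Sketch of WMT-style Chinese preprocessing for BLEU: put spaces
    around each CJK character so n-grams are computed over characters.
    The range check is simplified compared to tokenizeChinese.py."""
    out = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # basic CJK Unified Ideographs block
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return " ".join("".join(out).split())

print(tokenize_chinese("我们test了"))  # 我 们 test 了
```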

Commit messages in this PR:

  • Fix BLEU computation for the edge case of no matching 4-gram (or trigram, ...). Smoothing is the default in the official BLEU implementation:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L843-L885
    (Smoothing is not present in multi-bleu.perl, but that script explicitly says it is for internal purposes only and recommends using mteval-v14.pl instead.)
  • ... so one can compute real BLEU on two files (MT translation=hypothesis and reference). I've tested this on a few datasets and it seems to agree with the official implementation mteval-v14.pl.
  • ... not necessarily the latest one in a given output_dir, so it can be used for continuous evaluation or for resuming an older evaluation from a checkpoint with a given number of steps. It is also possible to specify the name of the events subdirectory and the tag suffix.
  • Fix for the case when evaluating one checkpoint takes longer than creating a new checkpoint.
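
The mteval-v14-style smoothing referred to in the commit messages can be sketched like this. It is an illustrative re-implementation with a hypothetical function name, not the actual bleu_hook code: whenever an n-gram order has zero matches, a geometrically decaying pseudo-precision is substituted instead of letting log(0) force BLEU to zero. The brevity penalty is omitted here:

```python
import math

def smoothed_bleu(match_counts, possible_counts, max_order=4):
    """Geometric mean of n-gram precisions with mteval-v14-style
    smoothing for orders that have zero matches. Brevity penalty
    intentionally omitted in this sketch."""
    precisions = []
    invcnt = 1.0
    for n in range(max_order):
        if possible_counts[n] == 0:
            continue  # sentence shorter than n; skip this order
        if match_counts[n] > 0:
            precisions.append(match_counts[n] / possible_counts[n])
        else:
            invcnt *= 2.0  # each smoothed order gets half the weight
            precisions.append(1.0 / (invcnt * possible_counts[n]))
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))
```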
@Gldkslfmsd

Hello,
I worry the bleu_hook is not division-by-zero-proof:

Traceback (most recent call last):
  File "t2t-bleu", line 242, in <module>
    tf.app.run()
  File "/lnet/spec/work/people/machacek/t2t-second/p3-t2t-gpu-fork/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "t2t-bleu", line 232, in main
    count_bleu(FLAGS.reference, out_file, ptag="")
  File "t2t-bleu", line 225, in count_bleu
    bleu = 100 * bleu_hook.bleu_wrapper(FLAGS.reference, out_file, case_sensitive=False)
  File "/lnet/spec/work/people/machacek/t2t-second/p3-t2t-gpu-fork/lib/python3.4/site-packages/tensor2tensor-1.2.9-py3.4.egg/tensor2tensor/utils/bleu_hook.py", line 197, in bleu_wrapper
    return compute_bleu(ref_tokens, hyp_tokens)
  File "/lnet/spec/work/people/machacek/t2t-second/p3-t2t-gpu-fork/lib/python3.4/site-packages/tensor2tensor-1.2.9-py3.4.egg/tensor2tensor/utils/bleu_hook.py", line 117, in compute_bleu
    bp = math.exp(1 - 1. / ratio) if ratio < 1.0 else 1.0
ZeroDivisionError: float division by zero

@martinpopel
Contributor Author

@Gldkslfmsd thanks for the bug report, but I believe this is not related to this PR (although we could fix it here): I have not changed the way the brevity penalty is computed.
I believe this can happen only if the translation_length is zero, i.e. no translation was produced for the whole corpus. Can you confirm this is the case?
I agree the script should not fail, but produce BLEU=0 and also a warning.

@Gldkslfmsd

> I believe this can happen only if the translation_length is zero, i.e. no translation was produced for the whole corpus. Can you confirm this is the case?

Yes. And maybe it can happen if the reference file is empty.

> I agree the script should not fail, but produce BLEU=0 and also a warning.

+1

... with a clear error message instead of a misleading "division by zero"
@martinpopel
Contributor Author

@Gldkslfmsd: your error was caused by an empty translation; an empty reference would fail elsewhere.
Either way, I changed my mind: I now think this situation is very suspicious, so we should fail with a clear error
rather than return BLEU=0. Thus I have added two assertions.
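
The guarded brevity penalty that the two assertions amount to can be sketched as follows. The helper name is hypothetical; the actual code lives in bleu_hook.py:

```python
import math

def brevity_penalty(translation_length, reference_length):
    """Brevity penalty with the two guards discussed above: fail
    loudly on empty inputs instead of raising a bare
    ZeroDivisionError on 1/ratio."""
    assert reference_length > 0, "empty reference corpus"
    assert translation_length > 0, "empty translation corpus (did decoding fail?)"
    ratio = translation_length / reference_length
    return math.exp(1 - 1.0 / ratio) if ratio < 1.0 else 1.0
```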

@Gldkslfmsd

Gldkslfmsd commented Nov 30, 2017 via email

Contributor

@lukaszkaiser lukaszkaiser left a comment


Thanks!

@lukaszkaiser lukaszkaiser merged commit bb1173a into tensorflow:master Dec 1, 2017
@Gldkslfmsd

FYI:
Gldkslfmsd@697e506

@noe
Contributor

noe commented Jan 5, 2018

@martinpopel @lukaszkaiser it seems that the changes of this PR were removed from master in 20c7e41.

@martinpopel
Contributor Author

@noe: Yes. It was not an intentional removal within that commit; it is just a consequence of the chosen merging strategy between GitHub and the internal Google repo. This PR has not been fully integrated into the internal repo. @lukaszkaiser wrote it was because of Python 2 problems, but I am not aware of any (those I marked in the code with comments, i.e. the encoding parameter for open, were solved when merging). Anyway, see #488 for a new version of this PR.
