This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Could you provide a consistent way to calculate BLEU? #405

Closed
skyw opened this issue Nov 6, 2017 · 7 comments

Comments


skyw commented Nov 6, 2017

It takes several steps to calculate BLEU, and it is not entirely clear how BLEU should be calculated from the decoded text.

It would be nice to have an eval function that just calculates BLEU in the standard, correct way against a given target.

Contributor

vince62s commented Nov 6, 2017

Just how you tokenize your text can lead to different values.
Google for "mt eval nist" or "multibleu" and you will find what you're looking for.
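As a toy illustration of vince62s's point (hypothetical tokenizers, not the actual mteval/multi-bleu code): the same sentence yields different token sequences, hence different n-gram counts and a different BLEU score.

```python
import re

sentence = 'A so-called "state-of-the-art" result.'

# Naive whitespace tokenization keeps punctuation glued to words:
# ['A', 'so-called', '"state-of-the-art"', 'result.']
whitespace_tokens = sentence.split()

# Punctuation-splitting tokenization (roughly mteval-style) separates it:
# ['A', 'so', '-', 'called', '"', 'state', '-', 'of', '-', 'the',
#  '-', 'art', '"', 'result', '.']
mteval_tokens = re.findall(r"\w+|[^\w\s]", sentence)
```

With different token sequences, the n-gram precisions (and so BLEU) computed against a reference will generally differ, which is why comparing BLEU scores across papers requires identical tokenization.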

@lukaszkaiser
Contributor

Our BLEU functions are correct; use the utils/get...bleu.sh script for results comparable with publications.

@martinpopel
Contributor

I think no one uses the get_ende_bleu.sh script because two lines are hard-wrapped, so it cannot be executed as is:

perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.tok > $decodes_file.atat

It would be nice to have a way to compute the official BLEU (both case-sensitive and case-insensitive) for a beam-searched translation (of a possibly avg_checkpointed model) and see this curve in TensorBoard. Now that beam-search decoding is fast (4 minutes for 3000 sentences), it seems doable (I use --save_checkpoints_secs=3600, that is, one checkpoint & evaluation per hour).
It's on my todo list (together with character-based metrics, e.g. chrF3 or characTER), but unfortunately at the bottom of the list, which is rather a wish list :-).

@martinpopel
Contributor

I plan to make a PR with a script as described in my previous post.
So far I have a Python-only (no Perl) function which computes BLEU (including tokenization):
https://github.com/martinpopel/tensor2tensor/tree/bleu
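The linked branch contains the actual implementation; purely as a rough sketch of what such a Python-only BLEU function computes (hypothetical names, single reference per segment, no smoothing), corpus-level BLEU is the geometric mean of clipped n-gram precisions times a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of the given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU-4 (single reference, no smoothing), as a percentage."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngrams(hyp, n)
            # Counter intersection implements "clipping" against the reference.
            matches[n - 1] += sum((hyp_ngrams & ngrams(ref, n)).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: exp(1 - ref_len/hyp_len) when the hypothesis is shorter.
    brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return 100.0 * math.exp(log_precision + brevity)
```

A perfect hypothesis scores 100.0; the real work in a "correct" BLEU script is the tokenization that precedes this arithmetic, which is exactly what the thread is about.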

@lukaszkaiser
Contributor

Would be great to have as a metric! We're also thinking about reporting all future results (on test) with https://github.com/mjpost/sacreBLEU -- what do you think, guys?

@martinpopel
Contributor

I will send a PR soon (just tidying and testing).
Thanks for pointing to sacreBLEU -- the idea of auto-downloading test sets and reproducible BLEU is great. There is even a PR with chrF3.
I am just missing international tokenization there (without it, the correlation of BLEU with human judgments is much lower for languages with non-ASCII characters, which includes even English with “typographic” quotes), but I can discuss this with @mjpost on sacreBLEU's GitHub.
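To illustrate the international-tokenization point with a hypothetical toy (not the actual mteval or sacreBLEU tokenizers): a tokenizer that only splits ASCII punctuation leaves typographic quotes glued to words, while a Unicode-aware one separates them, changing the n-grams that BLEU counts.

```python
import re

sentence = "He said \u201chello\u201d."  # He said “hello”.

# ASCII-only punctuation splitting: “hello” stays one token.
ascii_tokens = re.findall(r"[^\s!-/:-@\[-`{-~]+|[!-/:-@\[-`{-~]", sentence)

# Unicode-aware splitting: the typographic quotes become separate tokens.
intl_tokens = re.findall(r"\w+|[^\w\s]", sentence)
```

With the ASCII-only scheme, a hypothesis using straight quotes would never match the reference token “hello”, so BLEU silently penalizes a purely typographic difference.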

reporting all future results (on test) with https://github.com/mjpost/sacreBLEU

Yes. Even if sacreBLEU is integrated into T2T, I will still need most of my new code, which evaluates all checkpoints in a directory and stores the curve in a TensorBoard events file (--schedule=continuous_eval evaluates just the last checkpoint and then waits for new ones; it also cannot use beam search or model averaging, and I guess it would be difficult to make it use proper BLEU de/tokenization).
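The checkpoint-enumeration part of such a script can be sketched in plain Python (a hypothetical helper, assuming TensorFlow's model.ckpt-&lt;step&gt;.index file naming; the actual code lives in the linked branch):

```python
import os
import re

def list_checkpoints(model_dir):
    """Return checkpoint prefixes found in model_dir, ordered by step.

    TensorFlow writes one 'model.ckpt-<step>.index' file per checkpoint;
    we parse the step number out of the filename and sort numerically,
    so each checkpoint can be evaluated in training order.
    """
    steps = set()
    for name in os.listdir(model_dir):
        match = re.match(r"model\.ckpt-(\d+)\.index$", name)
        if match:
            steps.add(int(match.group(1)))
    return ["model.ckpt-%d" % step for step in sorted(steps)]
```

Iterating over this list, decoding with beam search, and logging one BLEU scalar per step would produce exactly the TensorBoard curve described above.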

@martinpopel
Contributor

I did the PR: #436
