This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Could you provide a consistent way to calculate BLEU? #405

Closed
skyw opened this issue Nov 6, 2017 · 7 comments

Comments


skyw commented Nov 6, 2017

It takes several steps to calculate BLEU, and it is not entirely clear how BLEU should be calculated from the decoded text.

It would be nice to have an eval function that just calculates BLEU in the standard, correct way against a given target.

Contributor

vince62s commented Nov 6, 2017

Just how you tokenize your text can lead to different values.
Google for "mt eval nist" or "multibleu" and you will find what you're looking for.
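As a toy illustration of vince62s's point (hypothetical tokenizers, not the actual mteval/multi-bleu code): the same sentence yields different token sequences, hence different n-gram counts and a different BLEU score.

```python
import re

sentence = 'A so-called "state-of-the-art" result.'

# Naive whitespace tokenization keeps punctuation glued to words:
# ['A', 'so-called', '"state-of-the-art"', 'result.']
whitespace_tokens = sentence.split()

# Punctuation-splitting tokenization (roughly mteval-style) separates it:
# ['A', 'so', '-', 'called', '"', 'state', '-', 'of', '-', 'the',
#  '-', 'art', '"', 'result', '.']
mteval_tokens = re.findall(r"\w+|[^\w\s]", sentence)
```

With different token sequences, the n-gram precisions (and so BLEU) computed against a reference will generally differ, which is why comparing BLEU scores across papers requires identical tokenization.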

@lukaszkaiser
Contributor

Our BLEU functions are correct; use the utils/get...bleu.sh script for results comparable with publications.

@martinpopel
Contributor

I think no one uses the get_ende_bleu.sh script because two lines are hard-wrapped, so it cannot be executed as is:

perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.tok > $decodes_file.atat

It would be nice to have a way to compute the official BLEU (both case-sensitive and case-insensitive) for a beam-searched translation (of a possibly avg_checkpointed model) and see this curve in TensorBoard. Now that beam-search decoding is fast (4 minutes for 3000 sentences), it seems doable (I use --save_checkpoints_secs=3600, that is, one checkpoint & evaluation per hour).
It's on my todo list (together with character-based metrics, e.g. chrF3 or characTER), but unfortunately at the bottom of the list, which is rather a wish list :-).

@martinpopel
Contributor

I plan to make a PR with a script as described in my previous post.
So far I have a Python-only (no Perl) function which computes BLEU (including tokenization):
https://github.com/martinpopel/tensor2tensor/tree/bleu
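The linked branch contains the actual implementation; purely as a rough sketch of what such a Python-only BLEU function computes (hypothetical names, single reference per segment, no smoothing), corpus-level BLEU is the geometric mean of clipped n-gram precisions times a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of the given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU-4 (single reference, no smoothing), as a percentage."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngrams(hyp, n)
            # Counter intersection implements "clipping" against the reference.
            matches[n - 1] += sum((hyp_ngrams & ngrams(ref, n)).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: exp(1 - ref_len/hyp_len) when the hypothesis is shorter.
    brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return 100.0 * math.exp(log_precision + brevity)
```

A perfect hypothesis scores 100.0; the real work in a "correct" BLEU script is the tokenization that precedes this arithmetic, which is exactly what the thread is about.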

@lukaszkaiser
Contributor

Would be great to have as a metric! We're also thinking about reporting all future results (on test) with https://github.com/mjpost/sacreBLEU -- what do you think, guys?

@martinpopel
Contributor

I will send a PR soon (just tidying and testing).
Thanks for pointing to sacreBLEU -- the idea of auto-downloading test sets and reproducible BLEU is great. There is even a PR with chrF3.
I am just missing international tokenization there (without it, the correlation of BLEU with human judgments is much lower for languages with non-ASCII characters, which includes even English with “typographic” quotes), but I can discuss this with @mjpost on sacreBLEU's GitHub.
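To illustrate the international-tokenization point with a hypothetical toy (not the actual mteval or sacreBLEU tokenizers): a tokenizer that only splits ASCII punctuation leaves typographic quotes glued to words, while a Unicode-aware one separates them, changing the n-grams that BLEU counts.

```python
import re

sentence = "He said \u201chello\u201d."  # He said “hello”.

# ASCII-only punctuation splitting: “hello” stays one token.
ascii_tokens = re.findall(r"[^\s!-/:-@\[-`{-~]+|[!-/:-@\[-`{-~]", sentence)

# Unicode-aware splitting: the typographic quotes become separate tokens.
intl_tokens = re.findall(r"\w+|[^\w\s]", sentence)
```

With the ASCII-only scheme, a hypothesis using straight quotes would never match the reference token “hello”, so BLEU silently penalizes a purely typographic difference.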

reporting all future results (on test) with https://github.com/mjpost/sacreBLEU

Yes. Even if sacreBLEU is integrated into T2T, I will still need most of my new code, which evaluates all checkpoints in a directory and stores the curve in a TensorBoard events file (--schedule=continuous_eval evaluates just the last checkpoint and then waits for new ones; it also cannot use beam search or model averaging, and I guess it would be difficult to make it use proper BLEU de/tokenization).
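The checkpoint-enumeration part of such a script can be sketched in plain Python (a hypothetical helper, assuming TensorFlow's model.ckpt-&lt;step&gt;.index file naming; the actual code lives in the linked branch):

```python
import os
import re

def list_checkpoints(model_dir):
    """Return checkpoint prefixes found in model_dir, ordered by step.

    TensorFlow writes one 'model.ckpt-<step>.index' file per checkpoint;
    we parse the step number out of the filename and sort numerically,
    so each checkpoint can be evaluated in training order.
    """
    steps = set()
    for name in os.listdir(model_dir):
        match = re.match(r"model\.ckpt-(\d+)\.index$", name)
        if match:
            steps.add(int(match.group(1)))
    return ["model.ckpt-%d" % step for step in sorted(steps)]
```

Iterating over this list, decoding with beam search, and logging one BLEU scalar per step would produce exactly the TensorBoard curve described above.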

@martinpopel
Contributor

I did the PR: #436
