Sentences longer than the max_length parameter are excluded from training, so lowering this parameter helps to prevent OOM errors and allows using a higher batch_size, which makes it quite useful.
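For reference, here is a minimal sketch of how these two hyperparameters can be overridden in tensor2tensor, assuming the usual transformer_base setup (the concrete values are just the ones discussed in this issue; the same effect can be obtained on the command line with --hparams='max_length=70,batch_size=2000'):

```python
# Hedged sketch: a custom hparams set that lowers max_length and raises batch_size.
# transformer_base() and the registry decorator are the standard tensor2tensor APIs;
# the name of the hparams set is made up for this example.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_base_maxlen70():
  hparams = transformer.transformer_base()
  hparams.max_length = 70    # training sentences longer than 70 subwords are dropped
  hparams.batch_size = 2000  # measured in subword tokens, not sentences (if I read the batching code correctly)
  return hparams
```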
Unfortunately, setting this parameter too low results in low BLEU and slower-converging learning curves. The graph below shows the curves (evaluated on the dev set) for max_length 25, 50, 70, 150, 200 and 400:
There are two possible explanations, but I think both of them are false:
Setting max_length too low makes the training data smaller. However, with max_length=70 only 2.1% of my training sentences are excluded. Moreover, the "70" BLEU curve is already decreasing after the first hour of training, while processing the whole training data (one epoch) takes more than two days, so a slightly smaller dataset cannot explain a difference that appears so early.
A model trained only on short sentences does not achieve good results when applied to long sentences. However, only 2.2% of the sentences in my dev set are longer than 70 subwords (and only 0.3% are longer than 100 subwords), so this does not seem to be the cause either.
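(As a side note, here is a rough sketch of how such fractions can be counted, in case anyone wants to reproduce the percentages on their own data; the vocab and corpus file names are placeholders, and SubwordTextEncoder is assumed to match the problem's vocabulary:)

```python
# Hedged sketch: estimate what fraction of sentences exceeds a given subword length.
from tensor2tensor.data_generators import text_encoder

encoder = text_encoder.SubwordTextEncoder("vocab.subwords")  # placeholder vocab file


def fraction_longer_than(corpus_path, max_len):
  longer = total = 0
  with open(corpus_path) as f:
    for line in f:
      total += 1
      if len(encoder.encode(line.strip())) > max_len:
        longer += 1
  return float(longer) / total


# The issue above reports roughly 0.021 for the training data and 0.022 for the dev set at 70 subwords.
print(fraction_longer_than("train.src", 70))
```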
When I increased the batch_size from 1500 to 2000, the results improved: the "25" and "50" curves were still lagging behind, but "70" and higher achieved the same result as training without any max_length restriction.
Can someone explain this? Or even fix it if it is a bug?
@martinpopel are these numbers from tensor2tensor 1.2.9 or from a more recent version? (I ask this in relation to bug #529, as 1.2.9 is the version some of us are working with.)
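(To check which version is installed, something like this should print it; pkg_resources is just one generic way to query the installed distribution:)

```python
# Print the installed tensor2tensor version.
import pkg_resources
print(pkg_resources.get_distribution("tensor2tensor").version)
```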