1 file changed: +9 −2 lines changed
@@ -38,8 +38,15 @@
  # of the word (see the next paragraph for more details). The
  # ``nn.TransformerEncoder`` consists of multiple layers of
  # `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
- # To produce a probability distribution over output words, the output of
- # the ``nn.TransformerEncoder`` model is passed through a linear layer.
+ # Along with the input sequence, a square attention mask is required because the
+ # self-attention layers in ``nn.TransformerDecoder`` are only allowed to attend
+ # the earlier positions in the sequence. For the language modeling task, any
+ # tokens on the future positions should be masked. To produce a probability
+ # distribution over output words, the output of the ``nn.TransformerEncoder``
+ # model is passed through a linear layer to output unnormalized logits.
+ # The log-softmax function isn't applied here due to the later use of
+ # `CrossEntropyLoss <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html>`__,
+ # which requires the inputs to be unnormalized logits.
  #

  import math
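
The hunk above makes two technical points: the language-modeling setup needs a square (causal) attention mask so each position can only attend to itself and earlier positions, and the final linear layer emits unnormalized logits because ``CrossEntropyLoss`` applies the log-softmax internally. Below is a minimal sketch of both points, not taken from the tutorial itself; the helper name, module layout, and sizes are illustrative assumptions.

import torch
import torch.nn as nn

def generate_square_subsequent_mask(sz):
    # Hypothetical helper: -inf above the diagonal, 0.0 on and below it.
    # The -inf entries are suppressed by the attention softmax, so each
    # token can only attend to itself and earlier positions.
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

seq_len, batch_size, ntoken, d_model = 10, 2, 100, 16    # illustrative sizes
embedding = nn.Embedding(ntoken, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
linear = nn.Linear(d_model, ntoken)                       # final projection to vocabulary size

src = torch.randint(0, ntoken, (seq_len, batch_size))     # token ids, shape (seq_len, batch)
src_mask = generate_square_subsequent_mask(seq_len)       # (seq_len, seq_len) causal mask

hidden = encoder(embedding(src), mask=src_mask)           # (seq_len, batch, d_model)
logits = linear(hidden)                                    # unnormalized logits; no log-softmax here

targets = torch.randint(0, ntoken, (seq_len, batch_size))
# CrossEntropyLoss applies log-softmax itself, so it expects the raw logits.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, ntoken), targets.reshape(-1))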