Merge branch 'master' into master

brianjo · web-flow · commit 496327c177ce · 2021-04-19T10:49:21.000-04:00
diff --git a/advanced_source/cpp_export.rst b/advanced_source/cpp_export.rst
@@ -115,7 +115,7 @@ If you need to exclude some methods in your ``nn.Module``
 because they use Python features that TorchScript doesn't support yet,
 you could annotate those with ``@torch.jit.ignore``
 
-``my_module`` is an instance of
+``sm`` is an instance of
 ``ScriptModule`` that is ready for serialization.
 
 Step 2: Serializing Your Script Module to a File
@@ -132,7 +132,7 @@ on the module and pass it a filename::
   traced_script_module.save("traced_resnet_model.pt")
 
 This will produce a ``traced_resnet_model.pt`` file in your working directory.
-If you also would like to serialize ``my_module``, call ``my_module.save("my_module_model.pt")``
+If you also would like to serialize ``sm``, call ``sm.save("my_module_model.pt")``
 We have now officially left the realm of Python and are ready to cross over to the sphere
 of C++.
 
diff --git a/advanced_source/cpp_extension.rst b/advanced_source/cpp_extension.rst
@@ -115,13 +115,13 @@ PyTorch has no knowledge of the *algorithm* you are implementing. It knows only
 of the individual operations you use to compose your algorithm. As such, PyTorch
 must execute your operations individually, one after the other. Since each
 individual call to the implementation (or *kernel*) of an operation, which may
-involve launch of a CUDA kernel, has a certain amount of overhead, this overhead
-may become significant across many function calls. Furthermore, the Python
-interpreter that is running our code can itself slow down our program.
+involve the launch of a CUDA kernel, has a certain amount of overhead, this
+overhead may become significant across many function calls. Furthermore, the
+Python interpreter that is running our code can itself slow down our program.
 
 A definite method of speeding things up is therefore to rewrite parts in C++ (or
 CUDA) and *fuse* particular groups of operations. Fusing means combining the
-implementations of many functions into a single functions, which profits from
+implementations of many functions into a single function, which profits from
 fewer kernel launches as well as other optimizations we can perform with
 increased visibility of the global flow of data.
 
@@ -509,12 +509,12 @@ and with our new C++ version::
   Forward: 349.335 us | Backward 443.523 us
 
 We can already see a significant speedup for the forward function (more than
-30%). For the backward function a speedup is visible, albeit not major one. The
-backward pass I wrote above was not particularly optimized and could definitely
-be improved. Also, PyTorch's automatic differentiation engine can automatically
-parallelize computation graphs, may use a more efficient flow of operations
-overall, and is also implemented in C++, so it's expected to be fast.
-Nevertheless, this is a good start.
+30%). For the backward function, a speedup is visible, albeit not a major one.
+The backward pass I wrote above was not particularly optimized and could
+definitely be improved. Also, PyTorch's automatic differentiation engine can
+automatically parallelize computation graphs, may use a more efficient flow of
+operations overall, and is also implemented in C++, so it's expected to be
+fast. Nevertheless, this is a good start.
 
 Performance on GPU Devices
 **************************
@@ -571,7 +571,7 @@ And C++/ATen::
 
 That's a great overall speedup compared to non-CUDA code. However, we can pull
 even more performance out of our C++ code by writing custom CUDA kernels, which
-we'll dive into soon. Before that, let's dicuss another way of building your C++
+we'll dive into soon. Before that, let's discuss another way of building your C++
 extensions.
 
 JIT Compiling Extensions
@@ -851,7 +851,7 @@ and ``Double``), you can use ``AT_DISPATCH_ALL_TYPES``.
 
 Note that we perform some operations with plain ATen. These operations will
 still run on the GPU, but using ATen's default implementations. This makes
-sense, because ATen will use highly optimized routines for things like matrix
+sense because ATen will use highly optimized routines for things like matrix
 multiplies (e.g. ``addmm``) or convolutions which would be much harder to
 implement and improve ourselves.
 
@@ -903,7 +903,7 @@ You can see in the CUDA kernel that we work directly on pointers with the right
 type. Indeed, working directly with high level type agnostic tensors inside cuda
 kernels would be very inefficient.
 
-However, this comes at a cost of ease of use and readibility, especially for
+However, this comes at a cost of ease of use and readability, especially for
 highly dimensional data. In our example, we know for example that the contiguous
 ``gates`` tensor has 3 dimensions:
 
@@ -920,7 +920,7 @@ arithmetic.
   gates.data<scalar_t>()[n*3*state_size + row*state_size + column]
 
 
-In addition to being verbose, this expression needs stride to be explicitely
+In addition to being verbose, this expression needs stride to be explicitly
 known, and thus passed to the kernel function within its arguments. You can see
 that in the case of kernel functions accepting multiple tensors with different
 sizes you will end up with a very long list of arguments.
@@ -1101,7 +1101,7 @@ on it:
     const int threads = 1024;
     const dim3 blocks((state_size + threads - 1) / threads, batch_size);
 
-    AT_DISPATCH_FLOATING_TYPES(X.type(), "lltm_forward_cuda", ([&] {
+    AT_DISPATCH_FLOATING_TYPES(X.type(), "lltm_backward_cuda", ([&] {
       lltm_cuda_backward_kernel<scalar_t><<<blocks, threads>>>(
           d_old_cell.packed_accessor32<scalar_t,2,torch::RestrictPtrTraits>(),
           d_gates.packed_accessor32<scalar_t,3,torch::RestrictPtrTraits>(),
diff --git a/beginner_source/basics/optimization_tutorial.py b/beginner_source/basics/optimization_tutorial.py
@@ -12,7 +12,7 @@
 Optimizing Model Parameters
 ===========================
 
-Now that we have a model and data it's time to train, validate and test our model by optimizing it's parameters on 
+Now that we have a model and data it's time to train, validate and test our model by optimizing its parameters on 
 our data. Training a model is an iterative process; in each iteration (called an *epoch*) the model makes a guess about the output, calculates 
 the error in its guess (*loss*), collects the derivatives of the error with respect to its parameters (as we saw in 
 the `previous section  <autograd_tutorial.html>`_), and **optimizes** these parameters using gradient descent. For a more 
diff --git a/beginner_source/blitz/cifar10_tutorial.py b/beginner_source/blitz/cifar10_tutorial.py
@@ -43,15 +43,15 @@
 
 We will do the following steps in order:
 
-1. Load and normalizing the CIFAR10 training and test datasets using
+1. Load and normalize the CIFAR10 training and test datasets using
    ``torchvision``
 2. Define a Convolutional Neural Network
 3. Define a loss function
 4. Train the network on the training data
 5. Test the network on the test data
 
-1. Loading and normalizing CIFAR10
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+1. Load and normalize CIFAR10
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Using ``torchvision``, it’s extremely easy to load CIFAR10.
 """
diff --git a/beginner_source/blitz/neural_networks_tutorial.py b/beginner_source/blitz/neural_networks_tutorial.py
@@ -58,7 +58,7 @@ def __init__(self):
     def forward(self, x):
         # Max pooling over a (2, 2) window
         x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
-        # If the size is a square you can only specify a single number
+        # If the size is a square, you can specify with a single number
         x = F.max_pool2d(F.relu(self.conv2(x)), 2)
         x = x.view(-1, self.num_flat_features(x))
         x = F.relu(self.fc1(x))
@@ -176,7 +176,7 @@ def num_flat_features(self, x):
 #           -> loss
 #
 # So, when we call ``loss.backward()``, the whole graph is differentiated
-# w.r.t. the loss, and all Tensors in the graph that has ``requires_grad=True``
+# w.r.t. the loss, and all Tensors in the graph that have ``requires_grad=True``
 # will have their ``.grad`` Tensor accumulated with the gradient.
 #
 # For illustration, let us follow a few steps backward:
diff --git a/beginner_source/dcgan_faces_tutorial.py b/beginner_source/dcgan_faces_tutorial.py
@@ -610,10 +610,10 @@ def forward(self, input):
         output = netD(fake.detach()).view(-1)
         # Calculate D's loss on the all-fake batch
         errD_fake = criterion(output, label)
-        # Calculate the gradients for this batch
+        # Calculate the gradients for this batch, accumulated (summed) with previous gradients
         errD_fake.backward()
         D_G_z1 = output.mean().item()
-        # Add the gradients from the all-real and all-fake batches
+        # Compute error of D as sum over the fake and the real batches
         errD = errD_real + errD_fake
         # Update D
         optimizerD.step()
diff --git a/beginner_source/nlp/README.txt b/beginner_source/nlp/README.txt
@@ -14,9 +14,9 @@ Deep Learning for NLP with Pytorch
 	https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
 
 4. sequence_models_tutorial.py
-	Sequence Models and Long-Short Term Memory Networks
+	Sequence Models and Long Short-Term Memory Networks
 	https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
 
 5. advanced_tutorial.py
 	Advanced: Making Dynamic Decisions and the Bi-LSTM CRF
-	https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html
+	https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html
diff --git a/beginner_source/nlp/sequence_models_tutorial.py b/beginner_source/nlp/sequence_models_tutorial.py
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 r"""
-Sequence Models and Long-Short Term Memory Networks
+Sequence Models and Long Short-Term Memory Networks
 ===================================================
 
 At this point, we have seen various feed-forward networks. That is,
diff --git a/beginner_source/nlp/word_embeddings_tutorial.py b/beginner_source/nlp/word_embeddings_tutorial.py
@@ -268,6 +268,8 @@ def forward(self, inputs):
     losses.append(total_loss)
 print(losses)  # The loss decreased every iteration over the training data!
 
+# To get the embedding of a particular word, e.g. "beauty"
+print(model.embeddings.weight[word_to_ix["beauty"]])
 
 ######################################################################
 # Exercise: Computing Word Embeddings: Continuous Bag-of-Words
@@ -277,7 +279,7 @@ def forward(self, inputs):
 # learning. It is a model that tries to predict words given the context of
 # a few words before and a few words after the target word. This is
 # distinct from language modeling, since CBOW is not sequential and does
-# not have to be probabilistic. Typcially, CBOW is used to quickly train
+# not have to be probabilistic. Typically, CBOW is used to quickly train
 # word embeddings, and these embeddings are used to initialize the
 # embeddings of some more complicated model. Usually, this is referred to
 # as *pretraining embeddings*. It almost always helps performance a couple
diff --git a/intermediate_source/reinforcement_q_learning.py b/intermediate_source/reinforcement_q_learning.py
@@ -134,7 +134,7 @@ def __len__(self):
 
 
 ######################################################################
-# Now, let's define our model. But first, let quickly recap what a DQN is.
+# Now, let's define our model. But first, let's quickly recap what a DQN is.
 #
 # DQN algorithm
 # -------------
diff --git a/intermediate_source/spatial_transformer_tutorial.py b/intermediate_source/spatial_transformer_tutorial.py
@@ -176,7 +176,7 @@ def train(epoch):
                 epoch, batch_idx * len(data), len(train_loader.dataset),
                 100. * batch_idx / len(train_loader), loss.item()))
 #
-# A simple test procedure to measure STN the performances on MNIST.
+# A simple test procedure to measure the STN performances on MNIST.
 #
 
 
