PyTorch Tutorial
Tensors
```python
from __future__ import print_function
import torch
```
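The rest of this code block is truncated in the source. As a minimal sketch of the tensor constructors this part of the tutorial typically walks through (the specific sizes here are illustrative assumptions):

```python
from __future__ import print_function
import torch

x = torch.empty(5, 3)                      # uninitialized 5x3 matrix
x = torch.rand(5, 3)                       # randomly initialized matrix
x = torch.zeros(5, 3, dtype=torch.long)    # matrix of zeros, dtype long
x = torch.tensor([5.5, 3])                 # tensor constructed directly from data
print(x)
print(x.size())                            # torch.Size is in fact a tuple
```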
Operations
```python
# Addition: syntax 1
```
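The block above is cut off after its first comment. A self-contained sketch of the addition syntaxes it refers to:

```python
import torch

x = torch.rand(5, 3)
y = torch.rand(5, 3)

# Addition: syntax 1
print(x + y)

# Addition: syntax 2
print(torch.add(x, y))

# Addition: providing an output tensor as argument
result = torch.empty(5, 3)
torch.add(x, y, out=result)
print(result)

# Addition: in-place (any method ending in _ mutates the tensor)
y.add_(x)
print(y)
```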
Read later:
100+ Tensor operations, including transposing, indexing, slicing, mathematical operations, linear algebra, random numbers, etc., are described in the official PyTorch documentation.
NumPy Bridge
Converting a Torch Tensor to a NumPy array and vice versa is a breeze.
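The accompanying code is truncated; here is a minimal sketch of the conversion in both directions. Note that on CPU the Torch Tensor and the NumPy array share the same underlying memory, so changing one changes the other:

```python
import numpy as np
import torch

# Torch Tensor -> NumPy array
a = torch.ones(5)
b = a.numpy()
a.add_(1)
print(a)    # tensor([2., 2., 2., 2., 2.])
print(b)    # [2. 2. 2. 2. 2.]  (b changed too -- shared memory)

# NumPy array -> Torch Tensor
c = np.ones(5)
d = torch.from_numpy(c)
np.add(c, 1, out=c)
print(d)    # tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
```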
CUDA Tensors
Tensors can be moved onto any device using the .to method.
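The code block is truncated; a minimal sketch of moving tensors between devices with .to (guarded so it only runs when CUDA is available):

```python
import torch

x = torch.rand(4, 4)

if torch.cuda.is_available():
    device = torch.device("cuda")            # a CUDA device object
    y = torch.ones_like(x, device=device)    # directly create a tensor on the GPU
    x = x.to(device)                         # or move an existing tensor with .to
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))         # .to can also change the dtype
```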
AUTOGRAD: AUTOMATIC DIFFERENTIATION
Central to all neural networks in PyTorch is the autograd package. Let's first briefly visit this, and we will then go to training our first neural network.

The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

Let us see this in simpler terms with some examples.
Tensor
torch.Tensor is the central class of the package. If you set its attribute .requires_grad as True, it starts to track all operations on it. When you finish your computation you can call .backward() and have all the gradients computed automatically. The gradient for this tensor will be accumulated into its .grad attribute.
To stop a tensor from tracking history, you can call .detach() to detach it from the computation history and to prevent future computation from being tracked.

To prevent tracking history (and using memory), you can also wrap the code block in with torch.no_grad():. This can be particularly helpful when evaluating a model, because the model may have trainable parameters with requires_grad=True, but for which we don't need the gradients.
There's one more class which is very important for the autograd implementation - a Function.

Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of computation. Each tensor has a .grad_fn attribute that references a Function that has created the Tensor (except for Tensors created by the user - their grad_fn is None).
If you want to compute the derivatives, you can call .backward() on a Tensor. If the Tensor is a scalar (i.e. it holds a single element of data), you don't need to specify any arguments to backward(); however, if it has more elements, you need to specify a gradient argument that is a tensor of matching shape.
```python
x = torch.ones(2, 2, requires_grad=True)
```
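The rest of this block is truncated. A sketch of what operation tracking looks like in practice, using only the attributes described above:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

# Operations on x produce tensors that remember their creator in .grad_fn
y = x + 2
print(y.grad_fn)            # e.g. <AddBackward0 object at ...>
print(x.grad_fn)            # None -- x was created by the user

# Detaching, or wrapping code in torch.no_grad(), stops tracking
z = y.detach()
print(z.requires_grad)      # False
with torch.no_grad():
    print((x * 2).requires_grad)    # False
```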
Gradients
Let's backprop now. Because out contains a single scalar, out.backward() is equivalent to out.backward(torch.tensor(1.)).
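The original gradients block is truncated. Here is a sketch under the assumption that out is a scalar built from x as in the standard autograd walkthrough (the intermediate operations y and z below are assumptions); it also shows the gradient argument needed for a non-scalar output:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()              # out holds a single scalar

out.backward()              # equivalent to out.backward(torch.tensor(1.))
print(x.grad)               # d(out)/dx, here 4.5 for every element

# For a non-scalar output, backward() needs a gradient argument
# whose shape matches the output (a vector-Jacobian product).
v = torch.randn(3, requires_grad=True)
w = v * 2
w.backward(torch.tensor([0.1, 1.0, 0.0001]))
print(v.grad)
```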
Read Later: Documentation of autograd and Function is at https://pytorch.org/docs/autograd
NEURAL NETWORKS
Neural networks can be constructed using the torch.nn package.

Now that you have had a glimpse of autograd, nn depends on autograd to define models and differentiate them. An nn.Module contains layers, and a method forward(input) that returns the output.
For example, look at this network that classifies digit images:
Define the network
Let’s define this network:
```python
import torch
```
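The listing is truncated after the first import. Below is a sketch of the LeNet-style network this section describes; the exact kernel and layer sizes are assumptions chosen for a 1-channel 32x32 input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 convolution
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # affine operations: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)   # 5*5 from the image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # max pooling over a (2, 2) window after each convolution
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
print(net)
```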
You just have to define the forward function, and the backward function (where gradients are computed) is automatically defined for you using autograd. You can use any of the Tensor operations in the forward function.
The learnable parameters of a model are returned by net.parameters():
```python
params = list(net.parameters())
```
Let's try a random 32x32 input. Note: the expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.
```python
input = torch.randn(1, 1, 32, 32)
```
Zero the gradient buffers of all parameters and backprop with random gradients:
```python
net.zero_grad()
```
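The three snippets above are truncated in the source. A sketch of those steps, assuming the net defined in the sketch above:

```python
# Learnable parameters of the model
params = list(net.parameters())
print(len(params))
print(params[0].size())     # conv1's .weight

# Forward pass with a random 32x32 input
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

# Zero the gradient buffers of all parameters, then
# backprop with a random gradient (the output has 10 elements)
net.zero_grad()
out.backward(torch.randn(1, 10))
```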
Notice:

torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample.

For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.

If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
Before proceeding further, let’s recap all the classes you’ve seen so far.
Recap:
- torch.Tensor - A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor.
- nn.Module - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.
- nn.Parameter - A kind of Tensor that is automatically registered as a parameter when assigned as an attribute to a Module.
- autograd.Function - Implements forward and backward definitions of an autograd operation. Every Tensor operation creates at least a single Function node that connects to functions that created a Tensor and encodes its history.
At this point, we covered:
- Defining a neural network
- Processing inputs and calling backward
Still Left:
- Computing the loss
- Updating the weights of the network
Loss Function
A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.
There are several different loss functions under the nn package. A simple loss is nn.MSELoss, which computes the mean-squared error between the input and the target.
For example:
```python
output = net(input)
```
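The rest of the block is truncated; a sketch using a dummy target of the right shape, assuming net and input from the sketches above:

```python
output = net(input)
target = torch.randn(10)        # a dummy target, for example
target = target.view(1, -1)     # reshape it to match the output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)
```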
Now, if you follow loss in the backward direction, using its .grad_fn attribute, you will see a graph of computations that looks like this:
input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d -> view -> linear -> relu -> linear -> relu -> linear -> MSELoss -> loss
So, when we call loss.backward(), the whole graph is differentiated w.r.t. the loss, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.
For illustration, let us follow a few steps backward:
```python
print(loss.grad_fn)  # MSELoss
```
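The remaining lines of this block are truncated. Following the graph a couple of steps back via grad_fn.next_functions would look roughly like this (the node names in the comments are assumptions about the graph built above):

```python
print(loss.grad_fn)                                            # MSELoss node
print(loss.grad_fn.next_functions[0][0])                       # the last Linear layer
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # the ReLU before it
```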
Backprop
To backpropagate the error, all we have to do is call loss.backward(). You need to clear the existing gradients though, else gradients will be accumulated to the existing gradients.
Now we shall call loss.backward(), and have a look at conv1's bias gradients before and after the backward pass.
```python
net.zero_grad()  # zeroes the gradient buffers of all parameters
```
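The block is truncated after its first line; a sketch of the before/after comparison it describes, assuming net and loss from the sketches above:

```python
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)
```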
Now, we have seen how to use loss functions.
Read Later:
The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is in the torch.nn reference.
The only thing left to learn is:
Updating the weights of the network
Update the weights
The simplest update rule used in practice is Stochastic Gradient Descent (SGD): weight = weight - learning_rate * gradient.
```python
learning_rate = 0.01
```
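The block is truncated; a minimal sketch of implementing this update rule by hand over the network's parameters (assuming net and its gradients from above):

```python
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)   # weight -= learning_rate * gradient
```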
However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small package, torch.optim, that implements all these methods. Using it is very simple:
```python
import torch.optim as optim
```
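The block is truncated after the import; a sketch of typical torch.optim usage, assuming net, input, target and criterion from the earlier sketches (the learning rate is an assumption):

```python
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()    # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()         # does the update
```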
Observe how the gradient buffers had to be manually set to zero using optimizer.zero_grad(). This is because gradients are accumulated, as explained in the Backprop section.
TRAINING A CLASSIFIER
This is it. You have seen how to define neural networks, compute loss and make updates to the weights of the network.
Now you might be thinking,
What about data?
Generally, when you have to deal with image, text, audio or video data, you can use standard Python packages that load data into a NumPy array. Then you can convert this array into a torch.*Tensor.
- For images, packages such as Pillow, OpenCV are useful
- For audio, packages such as scipy and librosa
- For text, either raw Python or Cython based loading, or NLTK and SpaCy are useful
Specifically for vision, we have created a package called torchvision, which has data loaders for common datasets such as ImageNet, CIFAR10, MNIST, etc., and data transformers for images, viz., torchvision.datasets and torch.utils.data.DataLoader.
This provides a huge convenience and avoids writing boilerplate code.
For this tutorial, we will use the CIFAR10 dataset. It has the classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.
Training an image classifier
We will do the following steps in order:
- Load and normalize the CIFAR10 training and test datasets using torchvision
- Define a Convolutional Neural Network
- Define a loss function
- Train the network on the training data
- Test the network on the test data
1. Loading and normalizing CIFAR10
Using torchvision, it's extremely easy to load CIFAR10.
```python
import torch
import torchvision
import torchvision.transforms as transforms
```
The output of torchvision datasets consists of PILImage images in the range [0, 1]. We transform them to Tensors of normalized range [-1, 1].
```python
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
```
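The rest of the loading code is truncated. Continuing from the imports and transform above, a sketch of the usual torchvision pipeline (the batch size and number of workers are assumptions):

```python
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
```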
Let us show some of the training images, for fun.
```python
import matplotlib.pyplot as plt
```
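The plotting code is truncated; a sketch of a small helper that un-normalizes and shows a grid of images (the helper name imshow is an assumption), using trainloader and classes from the sketch above:

```python
import matplotlib.pyplot as plt
import numpy as np
import torchvision

def imshow(img):
    img = img / 2 + 0.5     # un-normalize from [-1, 1] back to [0, 1]
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show the images and print their labels
imshow(torchvision.utils.make_grid(images))
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))
```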
2. Define a Convolutional Neural Network
Copy the neural network from the Neural Networks section before and modify it to take 3-channel images (instead of 1-channel images as it was defined).
```python
import torch.nn as nn
```
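The network definition is truncated after the first import. A sketch of the same LeNet-style network as before, with conv1 changed to take 3-channel images (the layer sizes are the same assumptions as earlier):

```python
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)     # 3 input channels instead of 1
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
```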
3. Define a Loss function and optimizer
Let’s use a Classification Cross-Entropy loss and SGD with momentum.
```python
import torch.optim as optim
```
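The block is truncated after the import; a sketch of the loss and optimizer described above (the learning rate and momentum values are typical choices, not taken from the source):

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
```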
4. Train the network
This is when things start to get interesting. We simply have to loop over our data iterator, and feed the inputs to the network and optimize.
```python
for epoch in range(2):  # loop over the dataset multiple times
```
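Only the first line of the training loop survives; a sketch of how the loop typically continues, assuming trainloader, net, criterion and optimizer from the sketches above (the logging interval is an assumption):

```python
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
```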
5. Test the network on the test data
We have trained the network for 2 passes over the training dataset. But we need to check if the network has learnt anything at all.
We will check this by predicting the class label that the neural network outputs, and checking it against the ground-truth. If the prediction is correct, we add the sample to the list of correct predictions.
Okay, first step. Let us display an image from the test set to get familiar.
```python
dataiter = iter(testloader)
```
Okay, now let us see what the neural network thinks these examples above are:
```python
outputs = net(images)
```
The outputs are energies for the 10 classes. The higher the energy for a class, the more the network thinks that the image is of the particular class. So, let’s get the index of the highest energy:
```python
_, predicted = torch.max(outputs, 1)
```
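The three snippets above are truncated; a sketch that ties them together, assuming testloader, classes and net from the sketches above:

```python
dataiter = iter(testloader)
images, labels = next(dataiter)

# ground-truth labels for this test batch
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

# run the network and take the class with the highest energy
outputs = net(images)
_, predicted = torch.max(outputs, 1)
print('Predicted:   ', ' '.join('%5s' % classes[predicted[j]] for j in range(4)))
```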
The results seem pretty good.
Let us look at how the network performs on the whole dataset.
```python
correct = 0
```
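The accuracy loop is truncated after its first line; a sketch of counting correct predictions over the whole test set, assuming testloader and net from above:

```python
correct = 0
total = 0
with torch.no_grad():               # no gradients needed for evaluation
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
```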
That looks waaay better than chance, which is 10% accuracy (randomly picking a class out of 10 classes). Seems like the network learnt something.
Hmmm, what are the classes that performed well, and the classes that did not perform well:
```python
class_correct = list(0. for i in range(10))
```
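The per-class accuracy block is truncated; a sketch, assuming testloader, classes and net from above and the batch size of 4 used in the loading sketch:

```python
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):          # batch size of 4 assumed
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1

for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))
```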
Okay, so what next?
How do we run these neural networks on the GPU?
Training on GPU
Just like how you transfer a Tensor onto the GPU, you transfer the neural net onto the GPU.
Let’s first define our device as the first visible cuda device if we have CUDA available:
```python
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
The rest of this section assumes that device is a CUDA device.
Then these methods will recursively go over all modules and convert their parameters and buffers to CUDA tensors:
```python
net.to(device)
```
Remember that you will have to send the inputs and targets at every step to the GPU too:
```python
inputs, labels = inputs.to(device), labels.to(device)
```
Why don't I notice MASSIVE speedup compared to CPU? Because your network is really small.
Exercise: Try increasing the width of your network (argument 2 of the first nn.Conv2d, and argument 1 of the second nn.Conv2d – they need to be the same number), and see what kind of speedup you get.
Goals achieved:
- Understanding PyTorch’s Tensor library and neural networks at a high level.
- Training a small neural network to classify images
Training on multiple GPUs
If you want to see even more MASSIVE speedup using all of your GPUs, please check out Optional: Data Parallelism.
Where do I go next?
- Train neural nets to play video games
- Train a state-of-the-art ResNet network on imagenet
- Train a face generator using Generative Adversarial Networks
- Train a word-level language model using Recurrent LSTM networks
- More examples
- More tutorials
- Discuss PyTorch on the Forums
- Chat with other users on Slack
Save Model
In PyTorch, the learnable parameters (i.e. weights and biases) of a torch.nn.Module model are contained in the model's parameters (accessed with model.parameters()). A state_dict is simply a Python dictionary object that maps each layer to its parameter tensor. Note that only layers with learnable parameters (convolutional layers, linear layers, etc.) and registered buffers (batchnorm's running_mean) have entries in the model's state_dict. Optimizer objects (torch.optim) also have a state_dict, which contains information about the optimizer's state, as well as the hyperparameters used.
Because state_dict objects are Python dictionaries, they can be easily saved, updated, altered, and restored, adding a great deal of modularity to PyTorch models and optimizers.
Example:
Let’s take a look at the state_dict from the simple model used in the Training a classifier tutorial.
```python
# Define model
```
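The model definition is truncated; a minimal stand-in sketch so the state_dict output has something to show. The layers inside TheModelClass below are placeholder assumptions, not the model from the source:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define model (the layers here are placeholder assumptions)
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.fc1 = nn.Linear(6 * 28 * 28, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = x.view(-1, 6 * 28 * 28)
        return self.fc1(x)

model = TheModelClass()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Print each parameter tensor's name and shape
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# The optimizer has a state_dict too
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])
```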
Save:
```python
torch.save(model.state_dict(), PATH)
```
Load:
```python
model = TheModelClass(*args, **kwargs)
```
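Both snippets are truncated to their first line; a sketch of the full save/load round trip for a state_dict, assuming the model sketch above (PATH is a placeholder you would set yourself):

```python
PATH = "model_weights.pth"     # an example path; .pt or .pth are the usual extensions

# Save only the learned parameters
torch.save(model.state_dict(), PATH)

# Load: re-create the model, then load the saved state_dict into it
model = TheModelClass()
model.load_state_dict(torch.load(PATH))
model.eval()    # set dropout / batch norm layers to evaluation mode
```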
When saving a model for inference, it is only necessary to save the trained model's learned parameters. Saving the model's state_dict with the torch.save() function will give you the most flexibility for restoring the model later, which is why it is the recommended method for saving models.

A common PyTorch convention is to save models using either a .pt or .pth file extension.
Remember that you must call model.eval() to set dropout and batch normalization layers to evaluation mode before running inference. Failing to do this will yield inconsistent inference results.
NOTE

Notice that the load_state_dict() function takes a dictionary object, NOT a path to a saved object. This means that you must deserialize the saved state_dict before you pass it to the load_state_dict() function. For example, you CANNOT load using model.load_state_dict(PATH).
Save/Load Entire Model
Save:
```python
torch.save(model, PATH)
```
Load:
```python
# Model class must be defined somewhere
```
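Both snippets are truncated; a sketch of saving and loading the entire module with pickle, assuming model and PATH from above:

```python
# Save the entire module (architecture + parameters) via pickle
torch.save(model, PATH)

# Load: the model class must be defined (importable) in the loading script
model = torch.load(PATH)
model.eval()
```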
This save/load process uses the most intuitive syntax and involves the least amount of code. Saving a model in this way will save the entire module using Python’s pickle module. The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model is saved. The reason for this is because pickle does not save the model class itself. Rather, it saves a path to the file containing the class, which is used during load time. Because of this, your code can break in various ways when used in other projects or after refactors.
A common PyTorch convention is to save models using either a .pt or .pth file extension.
Remember that you must call model.eval() to set dropout and batch normalization layers to evaluation mode before running inference. Failing to do this will yield inconsistent inference results.