This tutorial will walk you through building a handwritten digits classifier using the MNIST dataset, arguably the “Hello World” of neural networks. More tutorials and examples can be found in the Lasagne Recipes repository.

Before we start

The tutorial assumes that you are somewhat familiar with neural networks and Theano (the library which Lasagne is built on top of). You can try to learn both at once from the Deeplearning Tutorial.

For a slower-paced introduction to artificial neural networks, we recommend Convolutional Neural Networks for Visual Recognition by Andrej Karpathy et al., Neural Networks and Deep Learning by Michael Nielsen, or a standard textbook such as “Machine Learning” by Tom Mitchell.

To learn more about Theano, have a look at the Theano tutorial. You will not need all of it, but a basic understanding of how Theano works is required to be able to use Lasagne. If you’re new to Theano, going through that tutorial up to (and including) “More Examples” should get you covered! Graph Structures is a good extra read if you’re curious about its inner workings.

Run the MNIST example

In this first part of the tutorial, we will just run the MNIST example that’s included in the source distribution of Lasagne.

We assume that you have already run through the Installation. If you haven’t done so already, get a copy of the source tree of Lasagne, and navigate to the folder in a terminal window. Enter the examples folder and run the mnist.py example script:

cd examples
python mnist.py

If everything is set up correctly, you will get an output like the following:

Using gpu device 0: GeForce GT 640
Loading data...
Downloading train-images-idx3-ubyte.gz
Downloading train-labels-idx1-ubyte.gz
Downloading t10k-images-idx3-ubyte.gz
Downloading t10k-labels-idx1-ubyte.gz
Building model and compiling functions...
Starting training...

Epoch 1 of 500 took 1.858s
  training loss:                1.233348
  validation loss:              0.405868
  validation accuracy:          88.78 %
Epoch 2 of 500 took 1.845s
  training loss:                0.571644
  validation loss:              0.310221
  validation accuracy:          91.24 %
Epoch 3 of 500 took 1.845s
  training loss:                0.471582
  validation loss:              0.265931
  validation accuracy:          92.35 %
Epoch 4 of 500 took 1.847s
  training loss:                0.412204
  validation loss:              0.238558
  validation accuracy:          93.05 %

The example script allows you to try three different models, selected via the first command line argument. Run the script with python mnist.py --help for more information and feel free to play around with it some more before we have a look at the implementation.

Understand the MNIST example

Let’s now investigate what’s needed to make that happen! To follow along, open up the source code in your favorite editor (or online: mnist.py).


The first thing you might notice is that besides Lasagne, we also import numpy and Theano:

import numpy as np
import theano
import theano.tensor as T

import lasagne

While Lasagne is built on top of Theano, it is meant as a supplement helping with some tasks, not as a replacement. You will always mix Lasagne with some vanilla Theano code.

Loading data

The first piece of code defines a function load_dataset(). Its purpose is to download the MNIST dataset (if it hasn’t been downloaded yet) and return it in the form of regular numpy arrays. There is no Lasagne involved at all, so for the purpose of this tutorial, we can regard it as:

def load_dataset():
    ...  # download and parse the MNIST files (details omitted here)
    return X_train, y_train, X_val, y_val, X_test, y_test

X_train.shape is (50000, 1, 28, 28), to be interpreted as 50,000 images of 1 channel, 28 rows and 28 columns each. The number of channels is 1 because the input is monochrome; color images would have 3 channels, while spectrograms would also have a single channel. y_train.shape is simply (50000,): it is a vector of the same length as X_train, giving an integer class label for each image, namely the digit between 0 and 9 depicted in the image (according to the human annotator who drew that digit).
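To make the array layout concrete, here is a small sketch that uses a synthetic stand-in for the data (five blank images instead of 50,000). The names X_demo and y_demo and the chosen dtypes are assumptions made for this illustration, not something returned by load_dataset():

```python
import numpy as np

# Synthetic stand-in for the MNIST arrays: 5 images, 1 channel, 28x28 pixels.
X_demo = np.zeros((5, 1, 28, 28), dtype=np.float32)
# One integer class label (a digit from 0 to 9) per image.
y_demo = np.array([3, 1, 4, 1, 5], dtype=np.int32)

num_images, channels, rows, cols = X_demo.shape
print(X_demo.shape)   # (batch, channels, rows, columns)
print(y_demo.shape)   # one label per image

# Each image holds channels * rows * cols values.
values_per_image = channels * rows * cols
print(values_per_image)
```

The last number, 784, is the input dimensionality a fully-connected layer sees once each image is flattened.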

Building the model

This is where Lasagne steps in. It allows you to define an arbitrarily structured neural network by creating and stacking or merging layers. Since every layer knows its immediate incoming layers, the output layer (or output layers) of a network double as a handle to the network as a whole, so usually this is the only thing we will pass on to the rest of the code.

As mentioned above, mnist.py supports three types of models, and we implement that via three easily exchangeable functions of the same interface. First, we’ll define a function that creates a Multi-Layer Perceptron (MLP) of a fixed architecture, explaining all the steps in detail. We’ll then present a function generating an MLP of a custom architecture. Finally, we’ll show how to create a Convolutional Neural Network (CNN).

Multi-Layer Perceptron (MLP)

The first function, build_mlp(), creates an MLP of two hidden layers of 800 units each, followed by a softmax output layer of 10 units. It applies 20% dropout to the input data and 50% dropout to the hidden layers. It is similar, but not fully equivalent to the smallest MLP in [Hinton2012] (that paper uses different nonlinearities, weight initialization and training).

The foundation of each neural network in Lasagne is an InputLayer instance (or multiple of those) representing the input data that will subsequently be fed to the network. Note that the InputLayer is not tied to any specific data yet, but only holds the shape of the data that will be passed to the network. In addition, it creates or can be linked to a Theano variable that will represent the network input in the Theano graph we’ll build from the network later. Thus, our function starts like this:

def build_mlp(input_var=None):
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                     input_var=input_var)

The four numbers in the shape tuple represent, in order: (batchsize, channels, rows, columns). Here we’ve set the batchsize to None, which means the network will accept input data of arbitrary batchsize after compilation. If you know the batchsize beforehand and do not need this flexibility, you should give the batchsize here: especially for convolutional layers, this can allow Theano to apply some optimizations.

input_var denotes the Theano variable we want to link the network’s input layer to. If it is omitted (or set to None), the layer will just create a suitable variable itself, but it can be handy to link an existing variable to the network at construction time, especially if you’re creating networks with multiple input layers. Here, we link it to a variable given as an argument to the build_mlp() function.

Before adding the first hidden layer, we’ll apply 20% dropout to the input data. This is realized via a DropoutLayer instance:

l_in_drop = lasagne.layers.DropoutLayer(l_in, p=0.2)

Note that the first constructor argument is the incoming layer, such that l_in_drop is now stacked on top of l_in. All layers work this way, except for layers that merge multiple inputs: those accept a list of incoming layers as their first constructor argument instead.
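Conceptually, dropout zeroes each input value with probability p during training and scales the surviving values by 1/(1-p), so that the expected activation stays the same. The following plain NumPy sketch illustrates that idea; it is not Lasagne's implementation, and the function name dropout_sketch, the rescaling choice and the deterministic flag are assumptions of this illustration:

```python
import numpy as np

def dropout_sketch(x, p, rng, deterministic=False):
    """Zero each entry of x with probability p and rescale the rest."""
    if deterministic or p == 0:
        # At test time the input passes through unchanged.
        return x
    keep_prob = 1.0 - p
    mask = rng.uniform(size=x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.RandomState(0)
x = np.ones((4, 1, 28, 28), dtype=np.float32)

dropped = dropout_sketch(x, p=0.2, rng=rng)
unchanged = dropout_sketch(x, p=0.2, rng=rng, deterministic=True)

print("fraction zeroed:", np.mean(dropped == 0))
```

With p=0.2, roughly a fifth of the entries end up as zero and the remaining ones are scaled up from 1.0 to 1.25.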

We’ll proceed with the first fully-connected hidden layer of 800 units. Note that when stacking a DenseLayer on higher-order input tensors, they will be flattened implicitly so we don’t need to care about that. In this case, the input will be flattened from 1x28x28 images to 784-dimensional vectors.

l_hid1 = lasagne.layers.DenseLayer(
        l_in_drop, num_units=800,
        nonlinearity=lasagne.nonlinearities.rectify,
        W=lasagne.init.GlorotUniform())

Again, the first constructor argument means that we’re stacking l_hid1 on top of l_in_drop. num_units simply gives the number of units for this fully-connected layer. nonlinearity takes a nonlinearity function, several of which are defined in lasagne.nonlinearities. Here we’ve chosen the linear rectifier, so we’ll obtain ReLUs. Finally, lasagne.init.GlorotUniform() gives the initializer for the weight matrix W. This particular initializer samples weights from a uniform distribution of a carefully chosen range. Other initializers are available in lasagne.init, and alternatively, W could also have been initialized from a Theano shared variable or numpy array of the correct shape (784x800 in this case, as the input to this layer has 1*28*28=784 dimensions). Note that lasagne.init.GlorotUniform() is the default, so we’ll omit it from here – we just wanted to highlight that there is a choice.
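What such a layer computes can be sketched in plain NumPy: flatten each image, multiply by a 784x800 weight matrix, add a bias and apply the rectifier. This is only an illustration under stated assumptions, not Lasagne code: the helper names are made up, the bias is initialized to zero, and the sampling range sqrt(6 / (fan_in + fan_out)) is the usual Glorot uniform range with a gain of 1:

```python
import numpy as np

rng = np.random.RandomState(0)

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix uniformly from [-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out)), limit

def dense_relu(x, W, b):
    """Flatten all trailing axes, apply an affine map, then the rectifier."""
    flat = x.reshape(x.shape[0], -1)   # (batch, 1, 28, 28) -> (batch, 784)
    return np.maximum(0.0, flat.dot(W) + b)

W, limit = glorot_uniform(1 * 28 * 28, 800, rng)
b = np.zeros(800)

batch = rng.uniform(size=(3, 1, 28, 28))
hidden = dense_relu(batch, W, b)
print(W.shape, hidden.shape)
```

The weight matrix has one row per input value and one column per hidden unit, which is why a NumPy array passed as W would need the shape 784x800.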

We’ll now add dropout of 50%, another 800-unit dense layer and 50% dropout again:

l_hid1_drop = lasagne.layers.DropoutLayer(l_hid1, p=0.5)

l_hid2 = lasagne.layers.DenseLayer(
        l_hid1_drop, num_units=800,
        nonlinearity=lasagne.nonlinearities.rectify)

l_hid2_drop = lasagne.layers.DropoutLayer(l_hid2, p=0.5)

Finally, we’ll add the fully-connected output layer. The main difference is that it uses the softmax nonlinearity, as we’re planning to solve a 10-class classification problem with this network.

l_out = lasagne.layers.DenseLayer(
        l_hid2_drop, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax)
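For readers unfamiliar with the softmax nonlinearity: it turns a row of arbitrary scores into non-negative values that sum to one, so each output unit can be read as a class probability. A minimal NumPy sketch (three classes instead of ten to keep it readable; the function is an illustration, not the Lasagne implementation):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))
```

Equal scores give a uniform distribution, and the largest score always receives the largest probability.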

As mentioned above, each layer is linked to its incoming layer(s), so we only need the output layer(s) to access a network in Lasagne:

return l_out

Custom MLP

The second function has a slightly more extensive signature:

def build_custom_mlp(input_var=None, depth=2, width=800, drop_input=.2,
                     drop_hidden=.5):

By default, it creates the same network as build_mlp() described above, but it can be customized with respect to the number and size of hidden layers, as well as the amount of input and hidden dropout. This demonstrates how creating a network in Python code can be a lot more flexible than a configuration file. See for yourself:

# Input layer and dropout (with shortcut `dropout` for `DropoutLayer`):
network = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                    input_var=input_var)
if drop_input:
    network = lasagne.layers.dropout(network, p=drop_input)
# Hidden layers and dropout:
nonlin = lasagne.nonlinearities.rectify
for _ in range(depth):
    network = lasagne.layers.DenseLayer(
            network, width, nonlinearity=nonlin)
    if drop_hidden:
        network = lasagne.layers.dropout(network, p=drop_hidden)
# Output layer:
softmax = lasagne.nonlinearities.softmax
network = lasagne.layers.DenseLayer(network, 10, nonlinearity=softmax)
return network

With two if clauses and a for loop, this network definition allows varying the architecture in a way that would be impossible for a .yaml file in Pylearn2 or a .cfg file in cuda-convnet.

Note that to keep the code simple, all the layers are just called network here. There is no need to give them different names if all we return is the last one we created anyway; we only used different names before for clarity.
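To see which stack of layers a given set of arguments produces without building anything in Theano, the same control flow can be replayed in pure Python. The function describe_custom_mlp and its string descriptions are hypothetical helpers invented for this sketch:

```python
def describe_custom_mlp(depth=2, width=800, drop_input=.2, drop_hidden=.5):
    """List the layers the loop-based builder would stack, in order."""
    layers = ["input (None, 1, 28, 28)"]
    if drop_input:
        layers.append("dropout p={}".format(drop_input))
    for _ in range(depth):
        layers.append("dense {} rectify".format(width))
        if drop_hidden:
            layers.append("dropout p={}".format(drop_hidden))
    layers.append("dense 10 softmax")
    return layers

default_stack = describe_custom_mlp()
small_stack = describe_custom_mlp(depth=1, width=100,
                                  drop_input=0, drop_hidden=0)

for name in default_stack:
    print(name)
```

Setting a dropout probability to 0 makes the corresponding if clause skip the dropout layer entirely, which is what the small_stack call demonstrates.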

Convolutional Neural Network (CNN)

Finally, the build_cnn() function creates a CNN of two convolution and pooling stages, a fully-connected hidden layer and a fully-connected output layer. The function begins like the others:

def build_cnn(input_var=None):
    network = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                        input_var=input_var)

We don’t apply dropout to the inputs, as this tends to work less well for convolutional layers. Instead of a DenseLayer, we now add a Conv2DLayer with 32 filters of size 5x5 on top:

network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.rectify,
        W=lasagne.init.GlorotUniform())

The nonlinearity and weight initializer can be given just as for the DenseLayer (and again, GlorotUniform() is the default, so we’ll omit it from now on). Strided and padded convolutions are supported as well; see the Conv2DLayer docstring.


For experts: Conv2DLayer will create a convolutional layer using T.nnet.conv2d, Theano’s default convolution. On compilation for GPU, Theano replaces this with a cuDNN-based implementation if available, otherwise falls back to a gemm-based implementation. For details on this, please see the Theano convolution documentation.

Lasagne also provides convolutional layers that directly enforce a specific implementation: lasagne.layers.dnn.Conv2DDNNLayer to enforce cuDNN, lasagne.layers.corrmm.Conv2DMMLayer to enforce the gemm-based one, and lasagne.layers.cuda_convnet.Conv2DCCLayer for Krizhevsky’s cuda-convnet.

We then apply max-pooling of factor 2 in both dimensions, using a MaxPool2DLayer instance:

network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))
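Max-pooling of factor 2 keeps only the largest value in each non-overlapping 2x2 block, halving both spatial dimensions. A small NumPy sketch of that operation follows; it assumes even spatial sizes and is meant as an illustration, not as the code Theano runs:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pooling over the last two axes."""
    batch, channels, rows, cols = x.shape
    # Split rows and columns into (block index, position within block).
    blocks = x.reshape(batch, channels, rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(3, 5))

feature_map = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)
pooled = max_pool_2x2(feature_map)

print(feature_map[0, 0])
print(pooled[0, 0])
```

The 4x4 input shrinks to 2x2, and each output value is the maximum of one 2x2 block of the input.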

We add another convolution and pooling stage like the ones before:

network = lasagne.layers.Conv2DLayer(
        network, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.rectify)
network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

Then a fully-connected layer of 256 units with 50% dropout on its inputs (using the lasagne.layers.dropout shortcut directly inline):

network = lasagne.layers.DenseLayer(
        lasagne.layers.dropout(network, p=.5),
        num_units=256,
        nonlinearity=lasagne.nonlinearities.rectify)

And finally a 10-unit softmax output layer, again with 50% dropout:

network = lasagne.layers.DenseLayer(
        lasagne.layers.dropout(network, p=.5),
        num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax)

return network
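It can help to trace how the spatial size of the feature maps shrinks through this network. The arithmetic below assumes unpadded convolutions with stride 1 and pooling whose stride equals the pool size; the two helper functions are made up for this sketch:

```python
def conv_output_size(size, filter_size):
    """Output length of an unpadded, stride-1 convolution."""
    return size - filter_size + 1

def pool_output_size(size, pool_size):
    """Output length of non-overlapping pooling (remainder dropped)."""
    return size // pool_size

size = 28
size = conv_output_size(size, 5)   # first 5x5 convolution
size = pool_output_size(size, 2)   # first 2x2 max-pooling
size = conv_output_size(size, 5)   # second 5x5 convolution
size = pool_output_size(size, 2)   # second 2x2 max-pooling

num_filters = 32
dense_inputs = num_filters * size * size
print(size, dense_inputs)
```

Under these assumptions, the 256-unit dense layer receives 32 feature maps of 4x4 values each, flattened to 512 inputs per image.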

Training the model

The remaining part of the mnist.py script deals with setting up and running a training loop over the MNIST dataset.

Dataset iteration

It first defines a short helper function for synchronously iterating over two numpy arrays of input data and targets, respectively, in mini-batches of a given number of items. For the purpose of this tutorial, we can shorten it to:

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    if shuffle:
        ...
    for ...:
        yield inputs[...], targets[...]

All that’s relevant is that it is a generator function that serves one batch of inputs and targets at a time until the given dataset (in inputs and targets) is exhausted, either in sequence or in random order. Below we will plug this function into our training loop, validation loop and test loop.
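One possible way to fill in the elided parts is sketched below. It is a runnable illustration of the described behavior, not necessarily identical to the helper in mnist.py; in particular, skipping leftover items that do not fill a whole batch is an assumption of this sketch:

```python
import numpy as np

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    """Yield aligned (inputs, targets) mini-batches of a fixed size."""
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    if shuffle:
        np.random.shuffle(indices)
    # Leftover items that do not fill a whole batch are skipped (assumption).
    for start in range(0, len(inputs) - batchsize + 1, batchsize):
        excerpt = indices[start:start + batchsize]
        yield inputs[excerpt], targets[excerpt]

# Toy data in which every input row equals its target, so that the
# alignment of inputs and targets is easy to check after shuffling.
inputs = np.arange(10).reshape(10, 1)
targets = np.arange(10)

batches = list(iterate_minibatches(inputs, targets, 4, shuffle=True))
for batch_inputs, batch_targets in batches:
    print(batch_inputs[:, 0], batch_targets)
```

Because the same index array selects from both arrays, each input stays paired with its target even in random order.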


Let’s now focus on the main() function. A bit simplified, it begins like this:

# Load the dataset
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()
# Prepare Theano variables for inputs and targets
input_var = T.tensor4('inputs')
target_var = T.ivector('targets')
# Create neural network model
network = build_mlp(input_var)

The first line loads the inputs and targets of the MNIST dataset as numpy arrays, split into training, validation and test data. The next two statements define symbolic Theano variables that will represent a mini-batch of inputs and targets in all the Theano expressions we will generate for network training and inference. They are not tied to any data yet, but their dimensionality and data type are fixed already and match the actual inputs and targets we will process later. Finally, we call one of the three functions for building the Lasagne network, depending on the first command line argument (we’ve just removed command line handling here for clarity). Note that we hand the symbolic input variable to build_mlp() so it will be linked to the network’s input layer.

Loss and update expressions

Continuing, we create a loss expression to be minimized in training:

prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var)
loss = loss.mean()

The first step generates a Theano expression for the network output given the input variable linked to the network’s input layer(s). The second step defines a Theano expression for the categorical cross-entropy loss between said network output and the targets. Finally, as we need a scalar loss, we simply take the mean over the mini-batch. Depending on the problem you are solving, you will need different loss functions, see lasagne.objectives for more.
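Numerically, the categorical cross-entropy for integer targets is the negative log of the probability the network assigns to the correct class, computed per example and then averaged. The NumPy sketch below works on concrete arrays rather than symbolic Theano expressions; the function is a stand-in written for this illustration:

```python
import numpy as np

def categorical_crossentropy(predictions, targets):
    """Per-example loss: minus the log-probability of the true class."""
    rows = np.arange(len(targets))
    return -np.log(predictions[rows, targets])

# Two examples, three classes; each row is a probability distribution.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])
targets = np.array([0, 1])

per_example = categorical_crossentropy(predictions, targets)
loss = per_example.mean()   # scalar loss over the mini-batch
print(per_example, loss)
```

The more probability the network puts on the correct class, the closer the loss for that example gets to zero.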

Having the model and the loss function defined, we create update expressions for training the network. An update expression describes how to change the trainable parameters of the network at each presented mini-batch. We will use Stochastic Gradient Descent (SGD) with Nesterov momentum here, but the lasagne.updates module offers several others you can plug in instead:

params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(
        loss, params, learning_rate=0.01, momentum=0.9)

The first step collects all Theano SharedVariable instances making up the trainable parameters of the network, and the second step generates an update expression for each parameter.
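To give a feeling for what such an update rule does, here is one common formulation of a Nesterov momentum step applied to a single scalar parameter. Treat it as a sketch of the idea rather than a transcription of lasagne.updates.nesterov_momentum; the function name and the toy objective are assumptions of this example:

```python
def nesterov_step(param, velocity, grad, learning_rate=0.01, momentum=0.9):
    """One update in a common Nesterov-momentum formulation."""
    velocity = momentum * velocity - learning_rate * grad
    param = param + momentum * velocity - learning_rate * grad
    return param, velocity

# Toy problem (an assumption of this sketch): minimize f(w) = 0.5 * w ** 2,
# whose gradient with respect to w is simply w.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = nesterov_step(w, v, grad=w)

print(w)
```

Each parameter carries its own velocity term, which is why an update rule returns expressions for more variables than just the parameters themselves.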

For monitoring progress during training, after each epoch, we evaluate the network on the validation set. We need a slightly different loss expression for that:

test_prediction = lasagne.layers.get_output(network, deterministic=True)
test_loss = lasagne.objectives.categorical_crossentropy(test_prediction,
                                                        target_var)
test_loss = test_loss.mean()

The crucial difference is that we pass deterministic=True to the get_output call. This causes all nondeterministic layers to switch to a deterministic implementation, so in our case, it disables the dropout layers. As an additional monitoring quantity, we create an expression for the classification accuracy:

test_acc = T.mean(T.eq(T.argmax(test_prediction, axis=1), target_var),
                  dtype=theano.config.floatX)

It also builds on the deterministic test_prediction expression.
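The same accuracy computation can be tried out on concrete NumPy arrays: pick the class with the highest predicted probability for each example, compare it with the target, and average the matches. The numbers below are made up for illustration:

```python
import numpy as np

# Four examples, three classes; each row is a predicted distribution.
predictions = np.array([[0.1, 0.9, 0.0],
                        [0.8, 0.1, 0.1],
                        [0.2, 0.3, 0.5],
                        [0.6, 0.3, 0.1]])
targets = np.array([1, 0, 1, 0])

predicted_classes = predictions.argmax(axis=1)
accuracy = np.mean(predicted_classes == targets)
print(predicted_classes, accuracy)
```

Here three of the four predictions match their targets.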


Equipped with all the necessary Theano expressions, we’re now ready to compile a function performing a training step:

train_fn = theano.function([input_var, target_var], loss, updates=updates)

This tells Theano to generate and compile a function taking two inputs – a mini-batch of images and a vector of corresponding targets – and returning a single output: the training loss. Additionally, each time it is invoked, it applies all parameter updates in the updates dictionary, thus performing a gradient descent step with Nesterov momentum.

For validation, we compile a second function:

val_fn = theano.function([input_var, target_var], [test_loss, test_acc])

This one also takes a mini-batch of images and targets, then returns the (deterministic) loss and classification accuracy, not performing any updates.

Training loop

We’re finally ready to write the training loop. In essence, we just need to do the following:

for epoch in range(num_epochs):
    for batch in iterate_minibatches(X_train, y_train, 500, shuffle=True):
        inputs, targets = batch
        train_fn(inputs, targets)

This uses our dataset iteration helper function to iterate over the training data in random order, in mini-batches of 500 items each, for num_epochs epochs, and calls the training function we compiled to perform an update step of the network parameters.

But to be able to monitor the training progress, we capture the training loss, compute the validation loss and print some information to the console every time an epoch finishes:

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    train_err = 0
    train_batches = 0
    start_time = time.time()
    for batch in iterate_minibatches(X_train, y_train, 500, shuffle=True):
        inputs, targets = batch
        train_err += train_fn(inputs, targets)
        train_batches += 1

    # And a full pass over the validation data:
    val_err = 0
    val_acc = 0
    val_batches = 0
    for batch in iterate_minibatches(X_val, y_val, 500, shuffle=False):
        inputs, targets = batch
        err, acc = val_fn(inputs, targets)
        val_err += err
        val_acc += acc
        val_batches += 1

    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss:\t\t{:.6f}".format(train_err / train_batches))
    print("  validation loss:\t\t{:.6f}".format(val_err / val_batches))
    print("  validation accuracy:\t\t{:.2f} %".format(
        val_acc / val_batches * 100))

At the very end, we re-use the val_fn() function to compute the loss and accuracy on the test set, finishing the script.
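That final evaluation follows the same pattern as the validation pass. The sketch below wraps the pattern in a helper and exercises it with a stand-in for the compiled function; both evaluate() and fake_val_fn() are hypothetical names introduced here, and the toy batches replace real MNIST data:

```python
def evaluate(val_fn, batches):
    """Average the loss and accuracy that val_fn returns over all batches."""
    total_err = 0.0
    total_acc = 0.0
    num_batches = 0
    for inputs, targets in batches:
        err, acc = val_fn(inputs, targets)
        total_err += err
        total_acc += acc
        num_batches += 1
    return total_err / num_batches, total_acc / num_batches

# Hypothetical stand-in for the compiled Theano function: it "predicts" that
# every input is its own label, just so the helper can be exercised.
def fake_val_fn(inputs, targets):
    correct = sum(1 for x, t in zip(inputs, targets) if x == t)
    acc = correct / float(len(targets))
    return 1.0 - acc, acc

test_batches = [([0, 1, 2, 3], [0, 1, 2, 0]),
                ([4, 5, 6, 7], [4, 5, 6, 7])]
test_err, test_acc = evaluate(fake_val_fn, test_batches)
print("  test loss:\t\t\t{:.6f}".format(test_err))
print("  test accuracy:\t\t{:.2f} %".format(test_acc * 100))
```

In the actual script, the batches would come from iterate_minibatches(X_test, y_test, 500, shuffle=False) and the function would be val_fn.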

Where to go from here

This finishes our introductory tutorial. For more information on what you can do with Lasagne’s layers, just continue reading through Layers and Creating custom layers. More tutorials, examples and code snippets can be found in the Lasagne Recipes repository. Finally, the reference lists and explains all layers (lasagne.layers), weight initializers (lasagne.init), nonlinearities (lasagne.nonlinearities), loss expressions (lasagne.objectives), training methods (lasagne.updates) and regularizers (lasagne.regularization) included in the library, and should also make it simple to create your own.

[Hinton2012] Improving neural networks by preventing co-adaptation of feature detectors. http://arxiv.org/abs/1207.0580