# Recurrent layers

Layers to construct recurrent networks. Recurrent layers can be used similarly to feed-forward layers except that the input shape is expected to be (batch_size, sequence_length, num_inputs). The CustomRecurrentLayer can also support more than one “feature” dimension (e.g. using convolutional connections), but for all other layers, dimensions trailing the third dimension are flattened.

The following recurrent layers are implemented:

| Layer | Description |
| --- | --- |
| CustomRecurrentLayer | A layer which implements a recurrent connection. |
| RecurrentLayer | Dense recurrent neural network (RNN) layer. |
| LSTMLayer | A long short-term memory (LSTM) layer. |
| GRULayer | Gated Recurrent Unit (GRU) layer. |

For recurrent layers with gates we use a helper class to set up the parameters in each gate:

| Class | Description |
| --- | --- |
| Gate | Simple class to hold the parameters for a gate connection. |

Please refer to that class if you need to modify initial conditions of gates.

Recurrent layers and feed-forward layers can be combined in the same network by using a few reshape operations; please refer to the example below.

## Examples

The following example demonstrates how recurrent layers can be easily mixed with feed-forward layers using ReshapeLayer and how to build a network with variable batch size and number of time steps.

>>> from lasagne.layers import InputLayer, LSTMLayer, ReshapeLayer, DenseLayer
>>> num_inputs, num_units, num_classes = 10, 12, 5
>>> # By setting the first two dimensions as None, we are allowing them to vary
>>> # They correspond to batch size and sequence length, so we will be able to
>>> # feed in batches of varying size with sequences of varying length.
>>> l_inp = InputLayer((None, None, num_inputs))
>>> # We can retrieve symbolic references to the input variable's shape, which
>>> # we will later use in reshape layers.
>>> batchsize, seqlen, _ = l_inp.input_var.shape
>>> l_lstm = LSTMLayer(l_inp, num_units=num_units)
>>> # In order to connect a recurrent layer to a dense layer, we need to
>>> # flatten the first two dimensions (our "sample dimensions"); this will
>>> # cause each time step of each sequence to be processed independently
>>> l_shp = ReshapeLayer(l_lstm, (-1, num_units))
>>> l_dense = DenseLayer(l_shp, num_units=num_classes)
>>> # To reshape back to our original shape, we can use the symbolic shape
>>> # variables we retrieved above.
>>> l_out = ReshapeLayer(l_dense, (batchsize, seqlen, num_classes))
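
The shape bookkeeping in this example can be checked without Theano. The following is a minimal NumPy sketch, not Lasagne code: the random array `lstm_out` stands in for the LSTM output, the matrix `W` stands in for the dense layer's weights, and the concrete sizes are assumptions chosen for illustration.

```python
import numpy as np

batch_size, seq_len, num_units, num_classes = 4, 7, 12, 5
rng = np.random.default_rng(0)

# Stand-in for the recurrent layer's output: one num_units vector per time step.
lstm_out = rng.standard_normal((batch_size, seq_len, num_units))

# Flatten the two sample dimensions, mirroring the target shape (-1, num_units).
flat = lstm_out.reshape(-1, num_units)

# A dense layer (without bias or nonlinearity) is a matrix product applied to
# every row independently.
W = rng.standard_normal((num_units, num_classes))
dense_out = flat @ W

# Restore the (batch, time, class) layout.
out = dense_out.reshape(batch_size, seq_len, num_classes)

# Entry (b, t) of the result is the dense map of time step t of sequence b.
assert np.allclose(out[2, 3], lstm_out[2, 3] @ W)
```

Because the reshape only merges and then splits the leading dimensions, each time step of each sequence keeps its position in the final output.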


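class lasagne.layers.CustomRecurrentLayer(incoming, input_to_hidden, hidden_to_hidden, nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)

A layer which implements a recurrent connection.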

This layer allows you to specify custom input-to-hidden and hidden-to-hidden connections by instantiating lasagne.layers.Layer instances and passing them on initialization. Note that these connections can consist of multiple layers chained together. The output shape for the provided input-to-hidden and hidden-to-hidden connections must be the same. If you are looking for a standard, densely-connected recurrent layer, please see RecurrentLayer. The output is computed by

$h_t = \sigma(f_i(x_t) + f_h(h_{t-1}))$

Parameters:

- incoming : a lasagne.layers.Layer instance or a tuple
  The layer feeding into this layer, or the expected input shape.
- input_to_hidden : lasagne.layers.Layer
  lasagne.layers.Layer instance which connects input to the hidden state ($$f_i$$). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as incoming, except for the first dimension: When precompute_input == True (the default), it must be incoming.output_shape[0]*incoming.output_shape[1] or None; when precompute_input == False, it must be incoming.output_shape[0] or None.
- hidden_to_hidden : lasagne.layers.Layer
  Layer which connects the previous hidden state to the new state ($$f_h$$). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as hidden_to_hidden's output shape.
- nonlinearity : callable or None
  Nonlinearity to apply when computing new state ($$\sigma$$). If None is provided, no nonlinearity will be applied.
- hid_init : callable, np.ndarray, theano.shared or Layer
  Initializer for initial hidden state ($$h_0$$).
- backwards : bool
  If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from $$x_1$$ to $$x_n$$.
- learn_init : bool
  If True, initial hidden values are learned.
- gradient_steps : int
  Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
- grad_clipping : float
  If nonzero, the gradient messages are clipped to the given value during the backward pass. See [1] (p. 6) for further explanation.
- unroll_scan : bool
  If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
- precompute_input : bool
  If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
- mask_input : lasagne.layers.Layer
  Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
- only_return_final : bool
  If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
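
To make the recurrence and the backwards option concrete, here is a minimal NumPy sketch for a single sequence. It is an illustration of the formula, not the layer's implementation: the helper `custom_recurrence`, the dense stand-ins for $$f_i$$ and $$f_h$$, and all sizes are assumptions introduced for this example.

```python
import numpy as np

def custom_recurrence(x, f_i, f_h, sigma, h0, backwards=False):
    """Apply h_t = sigma(f_i(x_t) + f_h(h_{t-1})) to one sequence.

    x has shape (n_steps, n_in); the result has shape (n_steps, n_hid).
    """
    # With backwards=True the sequence is consumed last step first.
    steps = x[::-1] if backwards else x
    h, outputs = h0, []
    for x_t in steps:
        h = sigma(f_i(x_t) + f_h(h))
        outputs.append(h)
    hs = np.stack(outputs)
    # Reverse again so that row t of the output always lines up with x[t].
    return hs[::-1] if backwards else hs

rng = np.random.default_rng(0)
n_steps, n_in, n_hid = 3, 4, 5
W_in = rng.standard_normal((n_in, n_hid))
W_hid = rng.standard_normal((n_hid, n_hid))

f_i = lambda x_t: x_t @ W_in          # hypothetical dense input-to-hidden map
f_h = lambda h: h @ W_hid             # hypothetical dense hidden-to-hidden map
rectify = lambda z: np.maximum(z, 0.0)

x = rng.standard_normal((n_steps, n_in))
h0 = np.zeros(n_hid)

hs_fwd = custom_recurrence(x, f_i, f_h, rectify, h0)
hs_bwd = custom_recurrence(x, f_i, f_h, rectify, h0, backwards=True)
```

In the forward pass the first output row has seen only `x[0]`; in the backward pass it is the last row that has seen only `x[-1]`, while the first row summarizes the whole sequence.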

References

 [1] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).

Examples

The following example constructs a simple CustomRecurrentLayer which has dense input-to-hidden and hidden-to-hidden connections.

>>> import lasagne
>>> n_batch, n_steps, n_in = (2, 3, 4)
>>> n_hid = 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_in_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_in)), n_hid)
>>> l_hid_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_hid)), n_hid)
>>> l_rec = lasagne.layers.CustomRecurrentLayer(l_in, l_in_hid, l_hid_hid)


The CustomRecurrentLayer can also support “convolutional recurrence”, as is demonstrated below.

>>> n_batch, n_steps, n_channels, width, height = (2, 3, 4, 5, 6)
>>> n_out_filters = 7
>>> filter_shape = (3, 3)
>>> l_in = lasagne.layers.InputLayer(
...     (n_batch, n_steps, n_channels, width, height))
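>>> # pad='same' keeps the spatial size unchanged, so the input-to-hidden and
>>> # hidden-to-hidden connections produce the same output shape, as required.
>>> l_in_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer((None, n_channels, width, height)),
...     n_out_filters, filter_shape, pad='same')
>>> l_hid_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer(l_in_to_hid.output_shape),
...     n_out_filters, filter_shape, pad='same')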
>>> l_rec = lasagne.layers.CustomRecurrentLayer(
...     l_in, l_in_to_hid, l_hid_to_hid)

get_output_for(inputs, **kwargs)

Compute this layer’s output function given a symbolic input variable.

Parameters:

- inputs : list of theano.TensorType
  inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where, with zero-based time index j, mask[i, j] = 1 when j < (length of sequence i) and mask[i, j] = 0 when j >= (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. hid_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with.

Returns:

- layer_output : theano.TensorType
  Symbolic output variable.
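
A mask of the shape described above can be built from a list of sequence lengths. The following NumPy sketch is one way to do it; the helper name `make_mask` and the example lengths are assumptions for illustration, and time steps are indexed from zero, so a sequence of length n has ones in its first n positions.

```python
import numpy as np

def make_mask(lengths, n_time_steps):
    """Return a (n_batch, n_time_steps) float32 matrix with ones on valid steps."""
    lengths = np.asarray(lengths)
    steps = np.arange(n_time_steps)
    # Broadcasting compares every time index against every sequence length.
    return (steps[None, :] < lengths[:, None]).astype('float32')

# Three sequences of lengths 3, 1 and 4, padded to 4 time steps.
mask = make_mask([3, 1, 4], n_time_steps=4)
```

Each row sums to the length of the corresponding sequence, which is a quick sanity check before feeding the mask to a network.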

class lasagne.layers.RecurrentLayer(incoming, num_units, W_in_to_hid=lasagne.init.Uniform(), W_hid_to_hid=lasagne.init.Uniform(), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)

Dense recurrent neural network (RNN) layer

A “vanilla” RNN layer, which has dense input-to-hidden and hidden-to-hidden connections. The output is computed as

$h_t = \sigma(x_t W_x + h_{t-1} W_h + b)$

Parameters:

- incoming : a lasagne.layers.Layer instance or a tuple
  The layer feeding into this layer, or the expected input shape.
- num_units : int
  Number of hidden units in the layer.
- W_in_to_hid : Theano shared variable, numpy array or callable
  Initializer for input-to-hidden weight matrix ($$W_x$$).
- W_hid_to_hid : Theano shared variable, numpy array or callable
  Initializer for hidden-to-hidden weight matrix ($$W_h$$).
- b : Theano shared variable, numpy array, callable or None
  Initializer for bias vector ($$b$$). If None is provided there will be no bias.
- nonlinearity : callable or None
  Nonlinearity to apply when computing new state ($$\sigma$$). If None is provided, no nonlinearity will be applied.
- hid_init : callable, np.ndarray, theano.shared or Layer
  Initializer for initial hidden state ($$h_0$$).
- backwards : bool
  If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from $$x_1$$ to $$x_n$$.
- learn_init : bool
  If True, initial hidden values are learned.
- gradient_steps : int
  Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
- grad_clipping : float
  If nonzero, the gradient messages are clipped to the given value during the backward pass. See [1] (p. 6) for further explanation.
- unroll_scan : bool
  If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
- precompute_input : bool
  If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
- mask_input : lasagne.layers.Layer
  Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
- only_return_final : bool
  If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
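
The idea behind precompute_input can be illustrated with plain NumPy: since $$x_t W_x + b$$ does not depend on the hidden state, it can be computed for all time steps in one matrix product before the loop. This is a sketch of the principle only, not the layer's Theano code; the function names, weight ranges and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_batch, n_steps, n_in, n_hid = 2, 3, 4, 5
x = rng.standard_normal((n_batch, n_steps, n_in))
W_x = rng.uniform(-0.5, 0.5, (n_in, n_hid))
W_h = rng.uniform(-0.5, 0.5, (n_hid, n_hid))
b = np.zeros(n_hid)

def rectify(z):
    return np.maximum(z, 0.0)

def rnn_per_step(x):
    """h_t = rectify(x_t W_x + h_{t-1} W_h + b), input product inside the loop."""
    h, out = np.zeros((n_batch, n_hid)), []
    for t in range(n_steps):
        h = rectify(x[:, t] @ W_x + h @ W_h + b)
        out.append(h)
    return np.stack(out, axis=1)

def rnn_precomputed(x):
    """Same recurrence, with the input product done once for all time steps."""
    xw = (x.reshape(-1, n_in) @ W_x + b).reshape(n_batch, n_steps, n_hid)
    h, out = np.zeros((n_batch, n_hid)), []
    for t in range(n_steps):
        h = rectify(xw[:, t] + h @ W_h)
        out.append(h)
    return np.stack(out, axis=1)

hs_a = rnn_per_step(x)
hs_b = rnn_precomputed(x)
```

Both functions return the same hidden states; the precomputed variant holds the full (n_batch, n_steps, n_hid) projection in memory, which is the memory cost the parameter description refers to.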

References

 [1] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).

class lasagne.layers.LSTMLayer(incoming, num_units, ingate=lasagne.layers.Gate(), forgetgate=lasagne.layers.Gate(), cell=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), outgate=lasagne.layers.Gate(), nonlinearity=lasagne.nonlinearities.tanh, cell_init=lasagne.init.Constant(0.), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, peepholes=True, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)

A long short-term memory (LSTM) layer.

Includes optional “peephole connections” and a forget gate. Based on the definition in [1], which is the current common definition. The output is computed by

$\begin{split}i_t &= \sigma_i(x_t W_{xi} + h_{t-1} W_{hi} + w_{ci} \odot c_{t-1} + b_i)\\ f_t &= \sigma_f(x_t W_{xf} + h_{t-1} W_{hf} + w_{cf} \odot c_{t-1} + b_f)\\ c_t &= f_t \odot c_{t - 1} + i_t \odot \sigma_c(x_t W_{xc} + h_{t-1} W_{hc} + b_c)\\ o_t &= \sigma_o(x_t W_{xo} + h_{t-1} W_{ho} + w_{co} \odot c_t + b_o)\\ h_t &= o_t \odot \sigma_h(c_t)\end{split}$

Parameters:

- incoming : a lasagne.layers.Layer instance or a tuple
  The layer feeding into this layer, or the expected input shape.
- num_units : int
  Number of hidden/cell units in the layer.
- ingate : Gate
  Parameters for the input gate ($$i_t$$): $$W_{xi}$$, $$W_{hi}$$, $$w_{ci}$$, $$b_i$$, and $$\sigma_i$$.
- forgetgate : Gate
  Parameters for the forget gate ($$f_t$$): $$W_{xf}$$, $$W_{hf}$$, $$w_{cf}$$, $$b_f$$, and $$\sigma_f$$.
- cell : Gate
  Parameters for the cell computation ($$c_t$$): $$W_{xc}$$, $$W_{hc}$$, $$b_c$$, and $$\sigma_c$$.
- outgate : Gate
  Parameters for the output gate ($$o_t$$): $$W_{xo}$$, $$W_{ho}$$, $$w_{co}$$, $$b_o$$, and $$\sigma_o$$.
- nonlinearity : callable or None
  The nonlinearity that is applied to the output ($$\sigma_h$$). If None is provided, no nonlinearity will be applied.
- cell_init : callable, np.ndarray, theano.shared or Layer
  Initializer for initial cell state ($$c_0$$).
- hid_init : callable, np.ndarray, theano.shared or Layer
  Initializer for initial hidden state ($$h_0$$).
- backwards : bool
  If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from $$x_1$$ to $$x_n$$.
- learn_init : bool
  If True, initial hidden values are learned.
- peepholes : bool
  If True, the LSTM uses peephole connections. When False, ingate.W_cell, forgetgate.W_cell and outgate.W_cell are ignored.
- gradient_steps : int
  Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
- grad_clipping : float
  If nonzero, the gradient messages are clipped to the given value during the backward pass. See [1] (p. 6) for further explanation.
- unroll_scan : bool
  If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
- precompute_input : bool
  If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
- mask_input : lasagne.layers.Layer
  Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
- only_return_final : bool
  If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.
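
The LSTM update equations above translate line by line into NumPy. The sketch below computes a single time step under the default nonlinearities (sigmoid gates, tanh for the cell input and output); it is an illustration of the equations rather than the layer's implementation, and the parameter dictionary `p`, its key names and the sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p, peepholes=True):
    """One LSTM step; p maps names such as 'W_xi', 'W_hi', 'w_ci', 'b_i' to arrays."""
    def pre(g):
        # Shared linear part: x_t W_xg + h_{t-1} W_hg + b_g
        return x_t @ p['W_x' + g] + h_prev @ p['W_h' + g] + p['b_' + g]

    def peep(g, c):
        # Peephole term w_cg * c (element-wise), or nothing when disabled.
        return p['w_c' + g] * c if peepholes else 0.0

    i = sigmoid(pre('i') + peep('i', c_prev))
    f = sigmoid(pre('f') + peep('f', c_prev))
    c = f * c_prev + i * np.tanh(pre('c'))
    o = sigmoid(pre('o') + peep('o', c))   # the output gate peeks at the new cell state
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_batch, n_in, n_hid = 2, 4, 5
p = {}
for g in 'ifco':
    p['W_x' + g] = 0.1 * rng.standard_normal((n_in, n_hid))
    p['W_h' + g] = 0.1 * rng.standard_normal((n_hid, n_hid))
    p['b_' + g] = np.zeros(n_hid)
for g in 'ifo':
    p['w_c' + g] = 0.1 * rng.standard_normal(n_hid)

x_t = rng.standard_normal((n_batch, n_in))
h_prev = np.zeros((n_batch, n_hid))
c_prev = np.zeros((n_batch, n_hid))
h_t, c_t = lstm_step(x_t, h_prev, c_prev, p)
```

Note that the peephole weights are vectors applied element-wise, which is why the cell Gate carries no W_cell entry while the other three gates do.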

References

 [1] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).

get_output_for(inputs, **kwargs)

Compute this layer’s output function given a symbolic input variable.

Parameters:

- inputs : list of theano.TensorType
  inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where, with zero-based time index j, mask[i, j] = 1 when j < (length of sequence i) and mask[i, j] = 0 when j >= (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. hid_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with. When the cell state of this layer is to be pre-filled (i.e. cell_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the cell state to prefill with. When both the cell state and the hidden state are being pre-filled, inputs[-2] is the hidden state, while inputs[-1] is the cell state.

Returns:

- layer_output : theano.TensorType
  Symbolic output variable.

class lasagne.layers.GRULayer(incoming, num_units, resetgate=lasagne.layers.Gate(W_cell=None), updategate=lasagne.layers.Gate(W_cell=None), hidden_update=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=0, unroll_scan=False, precompute_input=True, mask_input=None, only_return_final=False, **kwargs)

Gated Recurrent Unit (GRU) Layer

Implements the recurrent step proposed in [1], which computes the output by

$\begin{split}r_t &= \sigma_r(x_t W_{xr} + h_{t - 1} W_{hr} + b_r)\\ u_t &= \sigma_u(x_t W_{xu} + h_{t - 1} W_{hu} + b_u)\\ c_t &= \sigma_c(x_t W_{xc} + r_t \odot (h_{t - 1} W_{hc}) + b_c)\\ h_t &= (1 - u_t) \odot h_{t - 1} + u_t \odot c_t\end{split}$

Parameters:

- incoming : a lasagne.layers.Layer instance or a tuple
  The layer feeding into this layer, or the expected input shape.
- num_units : int
  Number of hidden units in the layer.
- resetgate : Gate
  Parameters for the reset gate ($$r_t$$): $$W_{xr}$$, $$W_{hr}$$, $$b_r$$, and $$\sigma_r$$.
- updategate : Gate
  Parameters for the update gate ($$u_t$$): $$W_{xu}$$, $$W_{hu}$$, $$b_u$$, and $$\sigma_u$$.
- hidden_update : Gate
  Parameters for the hidden update ($$c_t$$): $$W_{xc}$$, $$W_{hc}$$, $$b_c$$, and $$\sigma_c$$.
- hid_init : callable, np.ndarray, theano.shared or Layer
  Initializer for initial hidden state ($$h_0$$).
- backwards : bool
  If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from $$x_1$$ to $$x_n$$.
- learn_init : bool
  If True, initial hidden values are learned.
- gradient_steps : int
  Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
- grad_clipping : float
  If nonzero, the gradient messages are clipped to the given value during the backward pass. See [3] (p. 6) for further explanation.
- unroll_scan : bool
  If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
- precompute_input : bool
  If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
- mask_input : lasagne.layers.Layer
  Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
- only_return_final : bool
  If True, only return the final sequential output (e.g. for tasks where a single target value for the entire sequence is desired). In this case, Theano makes an optimization which saves memory.

Notes

An alternate update for the candidate hidden state is proposed in [2]:

$\begin{split}c_t &= \sigma_c(x_t W_{xc} + (r_t \odot h_{t - 1})W_{hc} + b_c)\\\end{split}$

We use the formulation from [1] because it allows us to do all matrix operations in a single dot product.
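
The single-dot-product remark can be illustrated in NumPy. In the formulation used here, the reset gate multiplies $$h_{t-1} W_{hc}$$ after the matrix product, so all three hidden-to-hidden projections depend only on $$h_{t-1}$$ and can be obtained from one product with a column-stacked weight matrix. In the alternate update, $$(r_t \odot h_{t-1}) W_{hc}$$ cannot be formed until $$r_t$$ is known. The sketch below assumes sigmoid gates and a tanh candidate; the function names, weight dictionaries and sizes are illustrative assumptions, not the layer's actual code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_batch, n_in, n_hid = 2, 4, 5
W_x = {g: 0.1 * rng.standard_normal((n_in, n_hid)) for g in 'ruc'}
W_h = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in 'ruc'}
b = {g: np.zeros(n_hid) for g in 'ruc'}

def gru_step_separate(x_t, h_prev):
    """GRU step with one hidden-to-hidden product per gate."""
    r = sigmoid(x_t @ W_x['r'] + h_prev @ W_h['r'] + b['r'])
    u = sigmoid(x_t @ W_x['u'] + h_prev @ W_h['u'] + b['u'])
    c = np.tanh(x_t @ W_x['c'] + r * (h_prev @ W_h['c']) + b['c'])
    return (1 - u) * h_prev + u * c

# Column-stack the three hidden-to-hidden matrices: shape (n_hid, 3 * n_hid).
W_h_stacked = np.concatenate([W_h['r'], W_h['u'], W_h['c']], axis=1)

def gru_step_stacked(x_t, h_prev):
    """Same step, with a single hidden-to-hidden product split afterwards."""
    hr, hu, hc = np.split(h_prev @ W_h_stacked, 3, axis=1)
    r = sigmoid(x_t @ W_x['r'] + hr + b['r'])
    u = sigmoid(x_t @ W_x['u'] + hu + b['u'])
    c = np.tanh(x_t @ W_x['c'] + r * hc + b['c'])
    return (1 - u) * h_prev + u * c

x_t = rng.standard_normal((n_batch, n_in))
h_prev = rng.standard_normal((n_batch, n_hid))
h_a = gru_step_separate(x_t, h_prev)
h_b = gru_step_stacked(x_t, h_prev)
```

Both functions produce the same hidden state; the stacked version replaces three small matrix products with one larger one.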

References

 [1] Cho, Kyunghyun, et al.: “On the properties of neural machine translation: Encoder-decoder approaches.” arXiv preprint arXiv:1409.1259 (2014).
 [2] Chung, Junyoung, et al.: “Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014).
 [3] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).

get_output_for(inputs, **kwargs)

Compute this layer’s output function given a symbolic input variable.

Parameters:

- inputs : list of theano.TensorType
  inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. was instantiated with mask_input != None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. mask should be a matrix of shape (n_batch, n_time_steps) where, with zero-based time index j, mask[i, j] = 1 when j < (length of sequence i) and mask[i, j] = 0 when j >= (length of sequence i). When the hidden state of this layer is to be pre-filled (i.e. hid_init was set to a Layer instance), inputs should have length at least 2, and inputs[-1] is the hidden state to prefill with.

Returns:

- layer_output : theano.TensorType
  Symbolic output variable.

class lasagne.layers.Gate(W_in=lasagne.init.Normal(0.1), W_hid=lasagne.init.Normal(0.1), W_cell=lasagne.init.Normal(0.1), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.sigmoid)

Simple class to hold the parameters for a gate connection. We define a gate loosely as something which computes the linear mix of two inputs, optionally computes an element-wise product with a third, adds a bias, and applies a nonlinearity.

Parameters:

- W_in : Theano shared variable, numpy array or callable
  Initializer for input-to-gate weight matrix.
- W_hid : Theano shared variable, numpy array or callable
  Initializer for hidden-to-gate weight matrix.
- W_cell : Theano shared variable, numpy array, callable, or None
  Initializer for cell-to-gate weight vector. If None, no cell-to-gate weight vector will be stored.
- b : Theano shared variable, numpy array or callable
  Initializer for the gate's bias vector.
- nonlinearity : callable or None
  The nonlinearity that is applied to the gate activation. If None is provided, no nonlinearity will be applied.
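
The loose definition of a gate given above can be written out as a small NumPy function. Gate itself only holds parameters; the function `gate_activation` below is a hypothetical helper introduced here to show how a recurrent layer might combine them, with argument names chosen to mirror the Gate parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_activation(x, h, c, W_in, W_hid, W_cell=None, b=0.0,
                    nonlinearity=sigmoid):
    """Linear mix of input and hidden state, optional cell term, bias, nonlinearity."""
    z = x @ W_in + h @ W_hid + b
    if W_cell is not None:
        # Cell-to-gate weights form a vector, applied element-wise.
        z = z + W_cell * c
    return z if nonlinearity is None else nonlinearity(z)

n_batch, n_in, n_hid = 2, 4, 5
x = np.ones((n_batch, n_in))
h = np.zeros((n_batch, n_hid))
c = np.zeros((n_batch, n_hid))
W_in = np.zeros((n_in, n_hid))
W_hid = np.zeros((n_hid, n_hid))

# With all-zero weights the pre-activation is just the bias.
half_open = gate_activation(x, h, c, W_in, W_hid, W_cell=np.zeros(n_hid))
linear = gate_activation(x, h, c, W_in, W_hid, b=2.0, nonlinearity=None)
```

Passing W_cell=None drops the element-wise cell term, which corresponds to a gate without a peephole connection.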

References

 [1] Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM.” Neural computation 12.10 (2000): 2451-2471.

Examples

For LSTMLayer, the bias of the forget gate is often initialized to a large positive value to encourage the layer to initially remember the cell value; see e.g. [1], page 15.

>>> import lasagne
>>> from lasagne.layers import Gate, LSTMLayer
>>> forget_gate = Gate(b=lasagne.init.Constant(5.0))
>>> l_lstm = LSTMLayer((10, 20, 30), num_units=10,
...                    forgetgate=forget_gate)
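
The effect of the bias can be estimated with a short calculation. The sketch below assumes a sigmoid forget gate whose weighted inputs are close to zero at initialization, so the gate value is approximately the sigmoid of the bias; the ten-step horizon and the variable names are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forget gate value when the pre-activation is dominated by the bias.
f_zero_bias = sigmoid(0.0)   # 0.5
f_bias_five = sigmoid(5.0)   # roughly 0.9933

# Fraction of the initial cell value left after 10 steps if the gate stays
# at that value and nothing new is written to the cell.
retained_zero_bias = f_zero_bias ** 10   # roughly 0.001
retained_bias_five = f_bias_five ** 10   # roughly 0.935
```

With a zero bias the cell content decays to about a thousandth within ten steps, whereas a bias of 5 keeps most of it, which is the behaviour the initialization is meant to encourage.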