Recurrent layers
Layers to construct recurrent networks. Recurrent layers can be used similarly to feed-forward layers except that the input shape is expected to be (batch_size, sequence_length, num_inputs). The CustomRecurrentLayer can also support more than one “feature” dimension (e.g. using convolutional connections), but for all other layers, dimensions trailing the third dimension are flattened.
The following recurrent layers are implemented:
CustomRecurrentLayer | A layer which implements a recurrent connection. |
RecurrentLayer | Dense recurrent neural network (RNN) layer |
LSTMLayer | A long short-term memory (LSTM) layer. |
GRULayer | Gated Recurrent Unit (GRU) Layer |
For recurrent layers with gates we use a helper class to set up the parameters in each gate:
Gate | Simple class to hold the parameters for a gate connection. |
Please refer to that class if you need to modify initial conditions of gates.
Recurrent layers and feed-forward layers can be combined in the same network by using a few reshape operations; please refer to the example below.
Examples
The following example demonstrates how recurrent layers can be easily mixed with feed-forward layers using ReshapeLayer and how to build a network with variable batch size and number of time steps.
>>> from lasagne.layers import *
>>> num_inputs, num_units, num_classes = 10, 12, 5
>>> # By setting the first two dimensions as None, we are allowing them to vary
>>> # They correspond to batch size and sequence length, so we will be able to
>>> # feed in batches of varying size with sequences of varying length.
>>> l_inp = InputLayer((None, None, num_inputs))
>>> # We can retrieve symbolic references to the input variable's shape, which
>>> # we will later use in reshape layers.
>>> batchsize, seqlen, _ = l_inp.input_var.shape
>>> l_lstm = LSTMLayer(l_inp, num_units=num_units)
>>> # In order to connect a recurrent layer to a dense layer, we need to
>>> # flatten the first two dimensions (our "sample dimensions"); this will
>>> # cause each time step of each sequence to be processed independently
>>> l_shp = ReshapeLayer(l_lstm, (-1, num_units))
>>> l_dense = DenseLayer(l_shp, num_units=num_classes)
>>> # To reshape back to our original shape, we can use the symbolic shape
>>> # variables we retrieved above.
>>> l_out = ReshapeLayer(l_dense, (batchsize, seqlen, num_classes))
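To make the two reshape steps concrete, here is a minimal pure-Python sketch of what flattening and then restoring the first two dimensions does to the data layout. It does not use Lasagne or Theano: the nested lists and the `flatten_samples` and `restore_samples` helpers are hypothetical stand-ins for the tensors and ReshapeLayer calls above, and row-major ordering is assumed.

```python
# Illustrative stand-in for a (batch_size, seq_len, num_units) tensor,
# built from nested lists instead of Theano variables.
batch_size, seq_len, num_units = 2, 3, 4
activations = [[[100 * b + 10 * t + u for u in range(num_units)]
                for t in range(seq_len)]
               for b in range(batch_size)]


def flatten_samples(tensor):
    """Merge the batch and time axes, like reshaping to (-1, num_units)."""
    return [step for sequence in tensor for step in sequence]


def restore_samples(rows, batch_size, seq_len):
    """Split the merged axis back into (batch_size, seq_len, ...)."""
    return [rows[b * seq_len:(b + 1) * seq_len] for b in range(batch_size)]


flat = flatten_samples(activations)      # 6 rows of 4 features each
restored = restore_samples(flat, batch_size, seq_len)
```

Each row of `flat` is one time step of one sequence, which is why a dense layer applied to it treats every time step independently.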
- class lasagne.layers.CustomRecurrentLayer(incoming, input_to_hidden, hidden_to_hidden, nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=False, unroll_scan=False, precompute_input=True, mask_input=None, **kwargs)
A layer which implements a recurrent connection.
This layer allows you to specify custom input-to-hidden and hidden-to-hidden connections by instantiating lasagne.layers.Layer instances and passing them on initialization. Note that these connections can consist of multiple layers chained together. The output shape for the provided input-to-hidden and hidden-to-hidden connections must be the same. If you are looking for a standard, densely-connected recurrent layer, please see RecurrentLayer. The output is computed by
\[h_t = \sigma(f_i(x_t) + f_h(h_{t-1}))\]
Parameters: incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
input_to_hidden : lasagne.layers.Layer
lasagne.layers.Layer instance which connects input to the hidden state (\(f_i\)). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as incoming.
hidden_to_hidden : lasagne.layers.Layer
Layer which connects the previous hidden state to the new state (\(f_h\)). This layer may be connected to a chain of layers, which must end in a lasagne.layers.InputLayer with the same input shape as hidden_to_hidden's output shape.
nonlinearity : callable or None
Nonlinearity to apply when computing new state (\(\sigma\)). If None is provided, no nonlinearity will be applied.
hid_init : callable, np.ndarray, theano.shared or TensorVariable
Initializer for initial hidden state (\(h_0\)). If a TensorVariable (Theano expression) is supplied, it will not be learned regardless of the value of learn_init.
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned. If hid_init is a TensorVariable then the TensorVariable is used and learn_init is ignored.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : False or float
If a float is provided, the gradient messages are clipped during the backward pass. If False, the gradients will not be clipped. See [R25] (p. 6) for further explanation.
unroll_scan : bool
If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup, but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
References
[R25] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
Examples
The following example constructs a simple CustomRecurrentLayer which has dense input-to-hidden and hidden-to-hidden connections.
>>> import lasagne
>>> n_batch, n_steps, n_in = (2, 3, 4)
>>> n_hid = 5
>>> l_in = lasagne.layers.InputLayer((n_batch, n_steps, n_in))
>>> l_in_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_in)), n_hid)
>>> l_hid_hid = lasagne.layers.DenseLayer(
...     lasagne.layers.InputLayer((None, n_hid)), n_hid)
>>> l_rec = lasagne.layers.CustomRecurrentLayer(l_in, l_in_hid, l_hid_hid)
The CustomRecurrentLayer can also support “convolutional recurrence”, as is demonstrated below.
>>> n_batch, n_steps, n_channels, width, height = (2, 3, 4, 5, 6)
>>> n_out_filters = 7
>>> filter_shape = (3, 3)
>>> l_in = lasagne.layers.InputLayer(
...     (n_batch, n_steps, n_channels, width, height))
>>> l_in_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer((None, n_channels, width, height)),
...     n_out_filters, filter_shape, pad='same')
>>> l_hid_to_hid = lasagne.layers.Conv2DLayer(
...     lasagne.layers.InputLayer(l_in_to_hid.output_shape),
...     n_out_filters, filter_shape, pad='same')
>>> l_rec = lasagne.layers.CustomRecurrentLayer(
...     l_in, l_in_to_hid, l_hid_to_hid)
- get_output_for(inputs, **kwargs)
Compute this layer’s output function given a symbolic input variable.
Parameters: inputs : list of theano.TensorType
inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. it was instantiated with a mask_input other than None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. The mask should be a matrix of shape (n_batch, n_time_steps), with zero-based indices, where mask[i, j] = 1 when j < (length of sequence i) and mask[i, j] = 0 when j >= (length of sequence i).
Returns: layer_output : theano.TensorType
Symbolic output variable.
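The mask layout described above can be sketched in plain Python. This is only an illustration under stated assumptions: `make_mask` is a hypothetical helper, not part of Lasagne, and it uses zero-based time indices so that the first `length` entries of each row are 1.

```python
def make_mask(lengths, n_time_steps):
    """Build a (n_batch, n_time_steps) mask as nested lists of 0.0 / 1.0.

    Row i has ones for the time steps that belong to sequence i and
    zeros for the padding that follows it.
    """
    return [[1.0 if j < length else 0.0 for j in range(n_time_steps)]
            for length in lengths]


# Three sequences of lengths 3, 1 and 2, padded to 3 time steps.
lengths = [3, 1, 2]
mask = make_mask(lengths, n_time_steps=3)
# mask == [[1.0, 1.0, 1.0],
#          [1.0, 0.0, 0.0],
#          [1.0, 1.0, 0.0]]
```

In a real network the mask values would be fed, as an array, to the layer supplied through mask_input.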
- class lasagne.layers.RecurrentLayer(incoming, num_units, W_in_to_hid=lasagne.init.Uniform(), W_hid_to_hid=lasagne.init.Uniform(), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.rectify, hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, gradient_steps=-1, grad_clipping=False, unroll_scan=False, precompute_input=True, mask_input=None, **kwargs)
Dense recurrent neural network (RNN) layer
A “vanilla” RNN layer, which has dense input-to-hidden and hidden-to-hidden connections. The output is computed as
\[h_t = \sigma(x_t W_x + h_{t-1} W_h + b)\]
Parameters: incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
num_units : int
Number of hidden units in the layer.
W_in_to_hid : Theano shared variable, numpy array or callable
Initializer for input-to-hidden weight matrix (\(W_x\)).
W_hid_to_hid : Theano shared variable, numpy array or callable
Initializer for hidden-to-hidden weight matrix (\(W_h\)).
b : Theano shared variable, numpy array, callable or None
Initializer for bias vector (\(b\)). If None is provided there will be no bias.
nonlinearity : callable or None
Nonlinearity to apply when computing new state (\(\sigma\)). If None is provided, no nonlinearity will be applied.
hid_init : callable, np.ndarray, theano.shared or TensorVariable
Initializer for initial hidden state (\(h_0\)). If a TensorVariable (Theano expression) is supplied, it will not be learned regardless of the value of learn_init.
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned. If hid_init is a TensorVariable then learn_init is ignored.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : False or float
If a float is provided, the gradient messages are clipped during the backward pass. If False, the gradients will not be clipped. See [R26] (p. 6) for further explanation.
unroll_scan : bool
If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup, but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
References
[R26] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
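The RecurrentLayer recurrence \(h_t = \sigma(x_t W_x + h_{t-1} W_h + b)\) can be traced by hand with a small pure-Python sketch. This is not Lasagne's implementation: `matvec`, `rectify` and `rnn_forward` are hypothetical helpers, a single sequence is processed (no batch dimension), and the toy weights are arbitrary.

```python
def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def rectify(v):
    """Rectified linear nonlinearity, applied to one scalar."""
    return max(0.0, v)


def rnn_forward(xs, W_x, W_h, b, h0, nonlinearity=rectify):
    """Apply h_t = nonlinearity(x_t W_x + h_{t-1} W_h + b) over a sequence."""
    h, outputs = h0, []
    for x_t in xs:
        pre = [a + r + bias for a, r, bias
               in zip(matvec(x_t, W_x), matvec(h, W_h), b)]
        h = [nonlinearity(v) for v in pre]
        outputs.append(h)
    return outputs


# Toy problem: 2 inputs, 2 hidden units, 2 time steps.
W_x = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, -1.0]
hidden_states = rnn_forward([[1.0, 2.0], [3.0, -4.0]], W_x, W_h, b,
                            h0=[0.0, 0.0])
# hidden_states == [[1.0, 1.0], [3.5, 0.0]]
```

The second hidden unit is clipped to 0.0 at the last step because its pre-activation (-4.5) is negative and the nonlinearity is the rectifier.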
- class lasagne.layers.LSTMLayer(incoming, num_units, ingate=lasagne.layers.Gate(), forgetgate=lasagne.layers.Gate(), cell=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), outgate=lasagne.layers.Gate(), nonlinearity=lasagne.nonlinearities.tanh, cell_init=lasagne.init.Constant(0.), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=False, peepholes=True, gradient_steps=-1, grad_clipping=False, unroll_scan=False, precompute_input=True, mask_input=None, **kwargs)
A long short-term memory (LSTM) layer.
Includes optional “peephole connections” and a forget gate. Based on the definition in [R27], which is the current common definition. The output is computed by
\[\begin{split}
i_t &= \sigma_i(x_t W_{xi} + h_{t-1} W_{hi} + w_{ci} \odot c_{t-1} + b_i)\\
f_t &= \sigma_f(x_t W_{xf} + h_{t-1} W_{hf} + w_{cf} \odot c_{t-1} + b_f)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \sigma_c(x_t W_{xc} + h_{t-1} W_{hc} + b_c)\\
o_t &= \sigma_o(x_t W_{xo} + h_{t-1} W_{ho} + w_{co} \odot c_t + b_o)\\
h_t &= o_t \odot \sigma_h(c_t)
\end{split}\]
Parameters: incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
num_units : int
Number of hidden/cell units in the layer.
ingate : Gate
Parameters for the input gate (\(i_t\)): \(W_{xi}\), \(W_{hi}\), \(w_{ci}\), \(b_i\), and \(\sigma_i\).
forgetgate : Gate
Parameters for the forget gate (\(f_t\)): \(W_{xf}\), \(W_{hf}\), \(w_{cf}\), \(b_f\), and \(\sigma_f\).
cell : Gate
Parameters for the cell computation (\(c_t\)): \(W_{xc}\), \(W_{hc}\), \(b_c\), and \(\sigma_c\).
outgate : Gate
Parameters for the output gate (\(o_t\)): \(W_{xo}\), \(W_{ho}\), \(w_{co}\), \(b_o\), and \(\sigma_o\).
nonlinearity : callable or None
The nonlinearity that is applied to the output (\(\sigma_h\)). If None is provided, no nonlinearity will be applied.
cell_init : callable, np.ndarray, theano.shared or TensorVariable
Initializer for initial cell state (\(c_0\)). If a TensorVariable (Theano expression) is supplied, it will not be learned regardless of the value of learn_init.
hid_init : callable, np.ndarray, theano.shared or TensorVariable
Initializer for initial hidden state (\(h_0\)). If a TensorVariable (Theano expression) is supplied, it will not be learned regardless of the value of learn_init.
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned. If hid_init or cell_init are TensorVariables then the TensorVariable is used and learn_init is ignored for that initial state.
peepholes : bool
If True, the LSTM uses peephole connections. When False, ingate.W_cell, forgetgate.W_cell and outgate.W_cell are ignored.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : False or float
If a float is provided, the gradient messages are clipped during the backward pass. If False, the gradients will not be clipped. See [R27] (p. 6) for further explanation.
unroll_scan : bool
If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup, but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
References
[R27] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
- get_output_for(inputs, **kwargs)
Compute this layer’s output function given a symbolic input variable.
Parameters: inputs : list of theano.TensorType
inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. it was instantiated with a mask_input other than None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. The mask should be a matrix of shape (n_batch, n_time_steps), with zero-based indices, where mask[i, j] = 1 when j < (length of sequence i) and mask[i, j] = 0 when j >= (length of sequence i).
Returns: layer_output : theano.TensorType
Symbolic output variable.
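A single step of the LSTMLayer update equations (with peephole connections) can be followed in a small pure-Python sketch. It is an illustration only, not Lasagne's implementation: `gate`, `lstm_step` and `zero_gate` are hypothetical helpers, each gate is a plain dictionary, the output nonlinearity is fixed to tanh, and the all-zero toy weights are chosen so the result is easy to check by hand.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def gate(x, h, c, p, peephole):
    """nonlinearity(x W_in + h W_hid [+ w_cell * c] + b) for one gate."""
    pre = [a + r + bias for a, r, bias
           in zip(matvec(x, p["W_in"]), matvec(h, p["W_hid"]), p["b"])]
    if peephole:
        pre = [v + w * cv for v, w, cv in zip(pre, p["w_cell"], c)]
    return [p["nonlinearity"](v) for v in pre]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns the new hidden state and cell state."""
    i_t = gate(x_t, h_prev, c_prev, params["ingate"], peephole=True)
    f_t = gate(x_t, h_prev, c_prev, params["forgetgate"], peephole=True)
    g_t = gate(x_t, h_prev, c_prev, params["cell"], peephole=False)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # The output gate's peephole looks at the new cell state c_t.
    o_t = gate(x_t, h_prev, c_t, params["outgate"], peephole=True)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_gate(n_in, n_hid, nonlinearity):
    """Toy gate with all-zero parameters (hypothetical helper)."""
    return {"W_in": [[0.0] * n_hid for _ in range(n_in)],
            "W_hid": [[0.0] * n_hid for _ in range(n_hid)],
            "w_cell": [0.0] * n_hid,
            "b": [0.0] * n_hid,
            "nonlinearity": nonlinearity}


params = {"ingate": zero_gate(2, 1, sigmoid),
          "forgetgate": zero_gate(2, 1, sigmoid),
          "cell": zero_gate(2, 1, math.tanh),
          "outgate": zero_gate(2, 1, sigmoid)}
h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], params)
```

With all-zero parameters every sigmoid gate evaluates to 0.5 and the cell candidate to 0, so the cell state is halved (2.0 becomes 1.0) and the hidden state is 0.5 * tanh(1.0).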
- class lasagne.layers.GRULayer(incoming, num_units, resetgate=lasagne.layers.Gate(W_cell=None), updategate=lasagne.layers.Gate(W_cell=None), hidden_update=lasagne.layers.Gate(W_cell=None, nonlinearity=lasagne.nonlinearities.tanh), hid_init=lasagne.init.Constant(0.), backwards=False, learn_init=True, gradient_steps=-1, grad_clipping=False, unroll_scan=False, precompute_input=True, mask_input=None, **kwargs)
Gated Recurrent Unit (GRU) Layer
Implements the recurrent step proposed in [R28], which computes the output by
\[\begin{split}
r_t &= \sigma_r(x_t W_{xr} + h_{t-1} W_{hr} + b_r)\\
u_t &= \sigma_u(x_t W_{xu} + h_{t-1} W_{hu} + b_u)\\
c_t &= \sigma_c(x_t W_{xc} + r_t \odot (h_{t-1} W_{hc}) + b_c)\\
h_t &= (1 - u_t) \odot h_{t-1} + u_t \odot c_t
\end{split}\]
Parameters: incoming : a lasagne.layers.Layer instance or a tuple
The layer feeding into this layer, or the expected input shape.
num_units : int
Number of hidden units in the layer.
resetgate : Gate
Parameters for the reset gate (\(r_t\)): \(W_{xr}\), \(W_{hr}\), \(b_r\), and \(\sigma_r\).
updategate : Gate
Parameters for the update gate (\(u_t\)): \(W_{xu}\), \(W_{hu}\), \(b_u\), and \(\sigma_u\).
hidden_update : Gate
Parameters for the hidden update (\(c_t\)): \(W_{xc}\), \(W_{hc}\), \(b_c\), and \(\sigma_c\).
hid_init : callable, np.ndarray, theano.shared or TensorVariable
Initializer for initial hidden state (\(h_0\)). If a TensorVariable (Theano expression) is supplied, it will not be learned regardless of the value of learn_init.
backwards : bool
If True, process the sequence backwards and then reverse the output again such that the output from the layer is always from \(x_1\) to \(x_n\).
learn_init : bool
If True, initial hidden values are learned. If hid_init is a TensorVariable then the TensorVariable is used and learn_init is ignored.
gradient_steps : int
Number of timesteps to include in the backpropagated gradient. If -1, backpropagate through the entire sequence.
grad_clipping : False or float
If a float is provided, the gradient messages are clipped during the backward pass. If False, the gradients will not be clipped. See [R30] (p. 6) for further explanation.
unroll_scan : bool
If True, the recursion is unrolled instead of using scan. For some graphs this gives a significant speedup, but it might also consume more memory. When unroll_scan is True, backpropagation always includes the full sequence, so gradient_steps must be set to -1 and the input sequence length must be known at compile time (i.e., cannot be given as None).
precompute_input : bool
If True, precompute input_to_hid before iterating through the sequence. This can result in a speedup at the expense of an increase in memory usage.
mask_input : lasagne.layers.Layer
Layer which allows for a sequence mask to be input, for when sequences are of variable length. Default None, which means no mask will be supplied (i.e. all sequences are of the same length).
Notes
An alternate update for the candidate hidden state is proposed in [R29]:
\[c_t = \sigma_c(x_t W_{xc} + (r_t \odot h_{t-1}) W_{hc} + b_c)\]
We use the formulation from [R28] because it allows us to do all matrix operations in a single dot product.
References
[R28] Cho, Kyunghyun, et al.: “On the properties of neural machine translation: Encoder-decoder approaches.” arXiv preprint arXiv:1409.1259 (2014).
[R29] Chung, Junyoung, et al.: “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” arXiv preprint arXiv:1412.3555 (2014).
[R30] Graves, Alex: “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
- get_output_for(inputs, **kwargs)
Compute this layer’s output function given a symbolic input variable.
Parameters: inputs : list of theano.TensorType
inputs[0] should always be the symbolic input variable. When this layer has a mask input (i.e. it was instantiated with a mask_input other than None, indicating that the lengths of sequences in each batch vary), inputs should have length 2, where inputs[1] is the mask. The mask should be supplied as a Theano variable denoting whether each time step in each sequence in the batch is part of the sequence or not. The mask should be a matrix of shape (n_batch, n_time_steps), with zero-based indices, where mask[i, j] = 1 when j < (length of sequence i) and mask[i, j] = 0 when j >= (length of sequence i).
Returns: layer_output : theano.TensorType
Symbolic output variable.
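One step of the GRULayer update equations can likewise be traced in pure Python. This is a sketch under stated assumptions rather than Lasagne's implementation: `affine_parts` and `gru_step` are hypothetical helpers, gates are plain dictionaries, the candidate nonlinearity is fixed to tanh, and the toy weights are arbitrary.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def affine_parts(x, h, p):
    """Return the three terms (x W_in, h W_hid, b) of one gate."""
    return matvec(x, p["W_in"]), matvec(h, p["W_hid"]), p["b"]


def gru_step(x_t, h_prev, params):
    """One GRU step using c_t = tanh(x W_xc + r * (h W_hc) + b_c)."""
    r_t = [sigmoid(a + r + bias) for a, r, bias
           in zip(*affine_parts(x_t, h_prev, params["resetgate"]))]
    u_t = [sigmoid(a + r + bias) for a, r, bias
           in zip(*affine_parts(x_t, h_prev, params["updategate"]))]
    x_part, h_part, b_c = affine_parts(x_t, h_prev, params["hidden_update"])
    # The reset gate scales the hidden-to-hidden term after the dot product.
    c_t = [math.tanh(a + rt * r + bias)
           for a, rt, r, bias in zip(x_part, r_t, h_part, b_c)]
    return [(1.0 - ut) * hp + ut * ct
            for ut, hp, ct in zip(u_t, h_prev, c_t)]


# Toy problem: 1 input, 1 hidden unit. Only W_hc is non-zero.
zero = {"W_in": [[0.0]], "W_hid": [[0.0]], "b": [0.0]}
params = {"resetgate": zero,
          "updategate": zero,
          "hidden_update": {"W_in": [[0.0]], "W_hid": [[1.0]], "b": [0.0]}}
h_t = gru_step([3.0], [2.0], params)
```

Here both gates evaluate to 0.5, the candidate is tanh(0.5 * 2.0), and the new state is the even mix 0.5 * 2.0 + 0.5 * tanh(1.0).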
- class lasagne.layers.Gate(W_in=lasagne.init.Normal(0.1), W_hid=lasagne.init.Normal(0.1), W_cell=lasagne.init.Normal(0.1), b=lasagne.init.Constant(0.), nonlinearity=lasagne.nonlinearities.sigmoid)
Simple class to hold the parameters for a gate connection. We define a gate loosely as something which computes the linear mix of two inputs, optionally computes an element-wise product with a third, adds a bias, and applies a nonlinearity.
Parameters: W_in : Theano shared variable, numpy array or callable
Initializer for input-to-gate weight matrix.
W_hid : Theano shared variable, numpy array or callable
Initializer for hidden-to-gate weight matrix.
W_cell : Theano shared variable, numpy array, callable, or None
Initializer for cell-to-gate weight vector. If None, no cell-to-gate weight vector will be stored.
b : Theano shared variable, numpy array or callable
Initializer for the gate's bias vector.
nonlinearity : callable or None
The nonlinearity that is applied to the gate activation. If None is provided, no nonlinearity will be applied.
References
[R31] Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM.” Neural computation 12.10 (2000): 2451-2471.
Examples
For LSTMLayer, the bias of the forget gate is often initialized to a large positive value to encourage the layer to initially remember the cell value; see e.g. [R31], page 15.
>>> import lasagne
>>> forget_gate = lasagne.layers.Gate(b=lasagne.init.Constant(5.0))
>>> l_lstm = lasagne.layers.LSTMLayer((10, 20, 30), num_units=10,
...                                   forgetgate=forget_gate)
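A quick calculation shows why a forget-gate bias of 5.0 encourages remembering. The sketch below is plain Python and rests on a simplifying assumption, labeled in the code: at initialization the weighted inputs are close to zero, so the gate's pre-activation is roughly equal to its bias.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# Assumption: at initialization the weighted inputs are close to zero,
# so the forget gate's pre-activation is roughly its bias.
default_gate = sigmoid(0.0)   # bias 0.0, the Gate default
biased_gate = sigmoid(5.0)    # bias 5.0, as in the example above

# Fraction of a cell value that survives 10 time steps if the forget
# gate stays at its initial value and nothing new is written.
kept_default = default_gate ** 10
kept_biased = biased_gate ** 10
```

Under this assumption the default gate keeps half of the cell value per step (under 0.1% after 10 steps), whereas the biased gate keeps about 99.3% per step (roughly 93% after 10 steps).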