Normalization layers

The LocalResponseNormalization2DLayer implementation contains code from pylearn2, which is covered by the following license:

Copyright (c) 2011–2014, Université de Montréal
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

class lasagne.layers.LocalResponseNormalization2DLayer(incoming, alpha=0.0001, k=2, beta=0.75, n=5, **kwargs)[source]

Cross-channel Local Response Normalization for 2D feature maps.

Aggregation is purely across channels, not within channels, and performed “pixelwise”.

If the value of the \(i\)-th channel is \(x_i\), the output \(y_i\) is

\[y_i = \frac{x_i}{ \left( k + \alpha \sum_j x_j^2 \right)^\beta }\]

where the sum runs over the \(n\) adjacent channels at the same spatial position.

Parameters:
incoming : a Layer instance or a tuple

The layer feeding into this layer, or the expected input shape. Must follow BC01 layout, i.e., (batchsize, channels, rows, columns).

alpha : float scalar

coefficient, see equation above

k : float scalar

offset, see equation above

beta : float scalar

exponent, see equation above

n : int

number of adjacent channels to normalize over, must be odd

Notes

This code is adapted from pylearn2. See the module docstring for license information.
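
As a minimal usage sketch (not part of the original docstring; the input shape is illustrative), the layer is applied directly to a BC01 tensor and leaves the shape unchanged:

>>> from lasagne.layers import InputLayer, LocalResponseNormalization2DLayer
>>> l_in = InputLayer((None, 64, 56, 56))  # (batchsize, channels, rows, columns)
>>> l_lrn = LocalResponseNormalization2DLayer(l_in, alpha=1e-4, k=2, beta=0.75, n=5)
>>> l_lrn.output_shape
(None, 64, 56, 56)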

class lasagne.layers.BatchNormLayer(incoming, axes='auto', epsilon=1e-4, alpha=0.1, beta=lasagne.init.Constant(0), gamma=lasagne.init.Constant(1), mean=lasagne.init.Constant(0), inv_std=lasagne.init.Constant(1), **kwargs)[source]

Batch Normalization

This layer implements batch normalization of its inputs, following [1]:

\[y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \gamma + \beta\]

That is, the input is normalized to zero mean and unit variance, and then linearly transformed. The crucial part is that the mean and variance are computed across the batch dimension, i.e., over examples, not per example.

During training, \(\mu\) and \(\sigma^2\) are defined to be the mean and variance of the current input mini-batch \(x\), and during testing, they are replaced with average statistics over the training data. Consequently, this layer has four stored parameters: \(\beta\), \(\gamma\), and the averages \(\mu\) and \(\sigma^2\) (nota bene: instead of \(\sigma^2\), the layer actually stores \(1 / \sqrt{\sigma^2 + \epsilon}\), for compatibility with cuDNN). By default, this layer learns the average statistics as exponential moving averages computed during training, so it can be plugged into an existing network without any changes to the training procedure (see Notes).

Parameters:
incoming : a Layer instance or a tuple

The layer feeding into this layer, or the expected input shape

axes : ‘auto’, int or tuple of int

The axis or axes to normalize over. If 'auto' (the default), normalize over all axes except for the second: this will normalize over the minibatch dimension for dense layers, and additionally over all spatial dimensions for convolutional layers.

epsilon : scalar

Small constant \(\epsilon\) added to the variance before taking the square root and dividing by it, to avoid numerical problems

alpha : scalar

Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training; the closer to one, the more it will depend on the last batches seen

beta : Theano shared variable, expression, numpy array, callable or None

Initial value, expression or initializer for \(\beta\). Must match the incoming shape, skipping all axes in axes. Set to None to fix it to 0.0 instead of learning it. See lasagne.utils.create_param() for more information.

gamma : Theano shared variable, expression, numpy array, callable or None

Initial value, expression or initializer for \(\gamma\). Must match the incoming shape, skipping all axes in axes. Set to None to fix it to 1.0 instead of learning it. See lasagne.utils.create_param() for more information.

mean : Theano shared variable, expression, numpy array, or callable

Initial value, expression or initializer for \(\mu\). Must match the incoming shape, skipping all axes in axes. See lasagne.utils.create_param() for more information.

inv_std : Theano shared variable, expression, numpy array, or callable

Initial value, expression or initializer for \(1 / \sqrt{ \sigma^2 + \epsilon}\). Must match the incoming shape, skipping all axes in axes. See lasagne.utils.create_param() for more information.

**kwargs

Any additional keyword arguments are passed to the Layer superclass.

See also

batch_norm
Convenience function to apply batch normalization to a layer

Notes

This layer should be inserted between a linear transformation (such as a DenseLayer or Conv2DLayer) and its nonlinearity. The convenience function batch_norm() modifies an existing layer to insert batch normalization in front of its nonlinearity.

The behavior can be controlled by passing keyword arguments to lasagne.layers.get_output() when building the output expression of any network containing this layer.

During training, [1] normalize each input mini-batch by its statistics and update an exponential moving average of the statistics to be used for validation. This can be achieved by passing deterministic=False. For validation, [1] normalize each input mini-batch by the stored statistics. This can be achieved by passing deterministic=True.

For more fine-grained control, batch_norm_update_averages can be passed to update the exponential moving averages (True) or not (False), and batch_norm_use_averages can be passed to use the exponential moving averages for normalization (True) or normalize each mini-batch by its own statistics (False). These settings override deterministic.
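
For illustration, a hedged sketch (layer sizes arbitrary, not taken from the original docstring) of building separate training and evaluation expressions with these keyword arguments:

>>> from lasagne.layers import InputLayer, DenseLayer, BatchNormLayer, get_output
>>> l = BatchNormLayer(DenseLayer(InputLayer((None, 100)), num_units=50,
...                               nonlinearity=None))
>>> train_out = get_output(l, deterministic=False)  # batch statistics, updates the averages
>>> eval_out = get_output(l, deterministic=True)    # uses the stored average statistics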

Note that for testing a model after training, [1] replace the stored exponential moving average statistics by fixing all network weights and re-computing average statistics over the training data in a layerwise fashion. This is not part of the layer implementation.

In case you set axes to not include the batch dimension (the first axis, usually), normalization is done per example, not across examples. This does not require any averages, so you can pass batch_norm_update_averages and batch_norm_use_averages as False in this case.

References

[1] Ioffe, Sergey and Szegedy, Christian (2015): Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. http://arxiv.org/abs/1502.03167.
lasagne.layers.batch_norm(layer, **kwargs)[source]

Apply batch normalization to an existing layer. This is a convenience function modifying an existing layer to include batch normalization: It will steal the layer’s nonlinearity if there is one (effectively introducing the normalization right before the nonlinearity), remove the layer’s bias if there is one (because it would be redundant), and add a BatchNormLayer and NonlinearityLayer on top.

Parameters:
layer : A Layer instance

The layer to apply the normalization to; note that it will be irreversibly modified as specified above

**kwargs

Any additional keyword arguments are passed on to the BatchNormLayer constructor.

Returns:
BatchNormLayer or NonlinearityLayer instance

A batch normalization layer stacked on the given modified layer, or a nonlinearity layer stacked on top of both if layer was nonlinear.

Examples

Just wrap any layer into a batch_norm() call on creating it:

>>> from lasagne.layers import InputLayer, DenseLayer, batch_norm
>>> from lasagne.nonlinearities import tanh
>>> l1 = InputLayer((64, 768))
>>> l2 = batch_norm(DenseLayer(l1, num_units=500, nonlinearity=tanh))

This introduces batch normalization right before its nonlinearity:

>>> from lasagne.layers import get_all_layers
>>> [l.__class__.__name__ for l in get_all_layers(l2)]
['InputLayer', 'DenseLayer', 'BatchNormLayer', 'NonlinearityLayer']
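
Since any additional keyword arguments are passed on to the BatchNormLayer constructor, non-default settings such as epsilon or alpha can also be given through the wrapper (a sketch continuing the example above):

>>> l3 = batch_norm(DenseLayer(l1, num_units=500, nonlinearity=tanh),
...                 epsilon=1e-3, alpha=0.05)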
class lasagne.layers.StandardizationLayer(incoming, axes='auto', epsilon=0.0001, **kwargs)[source]

Standardize inputs to zero mean and unit variance:

\[y_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}\]

The mean \(\mu_i\) and variance \(\sigma_i^2\) are computed and shared across a given set of axes. In contrast to batch normalization, these axes usually do not include the batch dimension, so each example is normalized independently from other examples in the minibatch, both during training and testing.

The StandardizationLayer can be employed to realize instance normalization [1] and layer normalization [2], for both of which convenience functions (instance_norm() and layer_norm()) are available.

Parameters:
incoming : a Layer instance or a tuple

The layer feeding into this layer, or the expected input shape

axes : ‘auto’, ‘spatial’, ‘features’, int or tuple of int

The axis or axes to normalize over. If 'auto' (the default), two-dimensional inputs are normalized over the last dimension (i.e., this will normalize over units for dense layers), while input tensors with more than two dimensions are normalized over all but the first two dimensions (i.e., this will normalize over all spatial dimensions for convolutional layers). If 'spatial', will normalize over all but the first two dimensions. If 'features', will normalize over all but the first dimension.

epsilon : scalar

Small constant \(\epsilon\) added to the variance before taking the square root and dividing by it, to avoid numerical problems

**kwargs

Any additional keyword arguments are passed to the Layer superclass.

See also

instance_norm
Convenience function to apply instance normalization
layer_norm
Convenience function to apply layer normalization to a layer

References

[1] Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016): Instance Normalization: The Missing Ingredient for Fast Stylization. https://arxiv.org/abs/1607.08022.
[2] Ba, J., Kiros, J., & Hinton, G. (2016): Layer Normalization. https://arxiv.org/abs/1607.06450.
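
As a minimal sketch (the input shape is illustrative, not from the original docstring), standardizing a convolutional feature map over its spatial axes normalizes each channel of each example independently and leaves the shape unchanged:

>>> from lasagne.layers import InputLayer, StandardizationLayer
>>> l_in = InputLayer((10, 3, 28, 28))
>>> l_std = StandardizationLayer(l_in, axes='spatial')  # per example, per channel
>>> l_std.output_shape
(10, 3, 28, 28)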
lasagne.layers.instance_norm(layer, learn_scale=True, learn_bias=True, **kwargs)[source]

Apply instance normalization to an existing layer. This is a convenience function modifying an existing layer to include instance normalization: It will steal the layer’s nonlinearity if there is one (effectively introducing the normalization right before the nonlinearity), remove the layer’s bias if there is one (because it would have no effect), and add a StandardizationLayer and NonlinearityLayer on top. Depending on the given arguments, an additional ScaleLayer and BiasLayer will be inserted in between.

In effect, it will separately standardize each feature map of each input example, followed by an optional scale and shift learned per channel, followed by the original nonlinearity, as proposed in [1].

Parameters:
layer : A Layer instance

The layer to apply the normalization to; note that it will be irreversibly modified as specified above

learn_scale : bool (default: True)

Whether to add a ScaleLayer after the StandardizationLayer

learn_bias : bool (default: True)

Whether to add a BiasLayer after the StandardizationLayer (or the optional ScaleLayer)

**kwargs

Any additional keyword arguments are passed on to the StandardizationLayer constructor.

Returns:
StandardizationLayer, ScaleLayer, BiasLayer, or NonlinearityLayer instance

The last layer stacked on top of the given modified layer to implement instance normalization with optional scaling and shifting.

References

[1] Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016): Instance Normalization: The Missing Ingredient for Fast Stylization. https://arxiv.org/abs/1607.08022.

Examples

Just wrap any layer into an instance_norm() call on creating it:

>>> from lasagne.layers import InputLayer, Conv2DLayer, instance_norm
>>> from lasagne.nonlinearities import rectify
>>> l1 = InputLayer((10, 3, 28, 28))
>>> l2 = instance_norm(Conv2DLayer(l1, num_filters=64, filter_size=3,
...                                nonlinearity=rectify))

This introduces instance normalization right before its nonlinearity:

>>> from lasagne.layers import get_all_layers
>>> [l.__class__.__name__ for l in get_all_layers(l2)]
['InputLayer', 'Conv2DLayer', 'StandardizationLayer', 'ScaleLayer', 'BiasLayer', 'NonlinearityLayer']
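
If the learned scale and shift are not wanted, the corresponding layers are simply not inserted (a sketch based on the learn_scale and learn_bias parameters above, continuing the example):

>>> l3 = instance_norm(Conv2DLayer(l1, num_filters=64, filter_size=3,
...                                nonlinearity=rectify),
...                    learn_scale=False, learn_bias=False)
>>> [l.__class__.__name__ for l in get_all_layers(l3)]
['InputLayer', 'Conv2DLayer', 'StandardizationLayer', 'NonlinearityLayer']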
lasagne.layers.layer_norm(layer, **kwargs)[source]

Apply layer normalization to an existing layer. This is a convenience function modifying an existing layer to include layer normalization: It will steal the layer’s nonlinearity if there is one (effectively introducing the normalization right before the nonlinearity), remove the layer’s bias if there is one, and add a StandardizationLayer, ScaleLayer, BiasLayer, and NonlinearityLayer on top.

In effect, it will standardize each input example across the feature and spatial dimensions (if any), followed by a scale and shift learned per feature, followed by the original nonlinearity, as proposed in [1].

Parameters:
layer : A Layer instance

The layer to apply the normalization to; note that it will be irreversibly modified as specified above

**kwargs

Any additional keyword arguments are passed on to the StandardizationLayer constructor.

Returns:
BiasLayer or NonlinearityLayer instance

The last layer stacked on top of the given modified layer to implement layer normalization with feature-wise scaling and shifting.

References

[1] Ba, J., Kiros, J., & Hinton, G. (2016): Layer Normalization. https://arxiv.org/abs/1607.06450.

Examples

Just wrap any layer into a layer_norm() call on creating it:

>>> from lasagne.layers import InputLayer, DenseLayer, layer_norm
>>> from lasagne.nonlinearities import rectify
>>> l1 = InputLayer((10, 28))
>>> l2 = layer_norm(DenseLayer(l1, num_units=64, nonlinearity=rectify))

This introduces layer normalization right before its nonlinearity:

>>> from lasagne.layers import get_all_layers
>>> [l.__class__.__name__ for l in get_all_layers(l2)]
['InputLayer', 'DenseLayer', 'StandardizationLayer', 'ScaleLayer', 'BiasLayer', 'NonlinearityLayer']