
# Batch Normalization

## Why Batch Normalization

### Normalization

Normalization is a widely used pre-processing trick in machine learning. Unbalanced feature scales make the learning rate hard to choose: the same learning rate may cause the updates for some features to explode while the updates for other features vanish.

After normalization, the same learning rate has a comparable effect on all features, which makes it easier to choose; regularization tools can then be used to select the preferred features.
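
As a concrete illustration, here is a minimal sketch of per-feature standardization with NumPy; the feature values are made up purely for illustration.

```python
import numpy as np

# Toy design matrix with two features on very different scales
# (invented numbers, e.g. "age" and "income").
X = np.array([[25.0,  30_000.0],
              [40.0,  90_000.0],
              [31.0,  55_000.0],
              [52.0, 120_000.0]])

# Standardize each feature to zero mean and unit variance, so that a
# single learning rate affects every feature comparably.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```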

### Internal Covariate Shift

The topic was introduced in the paper *Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift*.

The paper defines Internal Covariate Shift as "the change in the distribution of network activations due to the change in network parameters during training."

An intuitive explanation is that the parameters of layer Hn+1 are updated based on the distribution of the outputs of layer Hn, while back-propagation proceeds from Hn+1 to Hn. The ongoing updates of different layers may therefore conflict with each other or cancel each other's effect.

(Figure: an example of covariate shift)

The figure above shows a simple case.

Once batch normalization is introduced, the output of each layer stays centered at a fixed point with a limited scale, and the network only needs to adjust the shape of the distribution to fit the target.
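
A minimal sketch of this effect in NumPy (with the learnable γ and β omitted for clarity; the numbers are made up): however the incoming activations shift or scale, the normalized output stays centered near zero with roughly unit spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are a layer's pre-activations whose distribution has
# drifted during training: large offset and large spread.
h = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

# Batch normalization over the mini-batch (gamma/beta omitted).
mu = h.mean(axis=0)
var = h.var(axis=0)
h_norm = (h - mu) / np.sqrt(var + 1e-5)

print(h_norm.mean(axis=0))  # ~0 for every feature
print(h_norm.std(axis=0))   # ~1 for every feature
```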

Other papers have argued that internal covariate shift may not be the root cause of the improved performance, for example *Understanding Batch Normalization*.

## Other Observations

The following sequence figure is from *Batch Norm Paper Reading*.

(Figure: batch normalization forward pass)

## Backward

Follow the sequence shown in the figure above, in (reverse) topological order:

  1. ∂l/∂x_hat_i = ∂l/∂y_i * ∂y_i/∂x_hat_i = ∂l/∂y_i * γ
  2. ∂l/∂γ = Σ_i (∂l/∂y_i * ∂y_i/∂γ) = Σ_i (∂l/∂y_i * x_hat_i)
  3. ∂l/∂β = Σ_i (∂l/∂y_i * ∂y_i/∂β) = Σ_i ∂l/∂y_i
  4. ∂l/∂σ² = Σ_i (∂l/∂x_hat_i * ∂x_hat_i/∂σ²) = Σ_i (∂l/∂x_hat_i * (-1/2) * (x_i - μ) * (σ² + ε)^(-3/2))
  5. ∂σ²/∂μ = (-2/m) * Σ_i (x_i - μ)
  6. ∂x_hat_i/∂μ = -(σ² + ε)^(-1/2)
  7. ∂x_hat_i/∂x_i = (σ² + ε)^(-1/2)
  8. ∂μ/∂x_i = 1/m
  9. ∂σ²/∂x_i = (2/m) * (x_i - μ)

The batch-norm output, from which the derivatives above follow, is

y = γ * x_hat + β = γ * (x - μ) * (σ² + ε)^(-1/2) + β
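
Putting the forward definition and the chain-rule steps above together, here is a minimal NumPy sketch of a batch-norm layer's forward and backward pass (training mode only; the running statistics used at inference are omitted). The function names and the cache layout are my own choices for illustration, not from the paper.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature (biased) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    y = gamma * x_hat + beta               # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)
    return y, cache

def batchnorm_backward(dy, cache):
    """dy: (m, d) gradient of the loss w.r.t. y."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)

    dgamma = np.sum(dy * x_hat, axis=0)                            # step 2
    dbeta = np.sum(dy, axis=0)                                     # step 3
    dx_hat = dy * gamma                                            # step 1
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * inv_std**3, axis=0)   # step 4
    dmu = np.sum(-dx_hat * inv_std, axis=0) \
          + dvar * np.mean(-2.0 * (x - mu), axis=0)                # steps 5, 6
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m    # steps 7-9
    return dx, dgamma, dbeta
```

A quick way to sanity-check such a sketch is to compare dx, dgamma, and dbeta against numerical gradients on a small random batch.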

## Benefits

Notes on the paper

## Other References

- Covariate Shift
- Batch Norm (Chinese)
- Batch Normalization (Andrew Ng)