What Happens in the dl4j LenetMnist Example
Class Diagram
Sequence
CNN Structure
Code Sequence
Initialization
Fit
Computation
Initialization
Parameters
Count the total number of parameters:
MultiLayerNetwork.init
for (int i = 0; i < nLayers; i++) {
    NeuralNetConfiguration conf = layerWiseConfigurations.getConf(i);
    nParamsPerLayer[i] = conf.getLayer().initializer().numParams(conf);
    paramLength += nParamsPerLayer[i];
}
Take the convolution layer as an example:
ConvolutionParamInitializer.numParams
public int numParams(Layer l) {
    org.deeplearning4j.nn.conf.layers.ConvolutionLayer layerConf =
            (org.deeplearning4j.nn.conf.layers.ConvolutionLayer) l;
    int[] kernel = layerConf.getKernelSize();
    int nIn = layerConf.getNIn();
    int nOut = layerConf.getNOut();
    return nIn * nOut * kernel[0] * kernel[1] + (layerConf.hasBias() ? nOut : 0);
}
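As a quick sanity check, here is the same formula evaluated in plain Java for a hypothetical first convolution layer with a 5x5 kernel, nIn = 1 input channel and nOut = 20 output channels (typical LeNet-on-MNIST values, assumed here rather than taken from the example configuration):
class ConvParamCount {
    public static void main(String[] args) {
        int nIn = 1, nOut = 20, kH = 5, kW = 5;   // assumed layer configuration
        boolean hasBias = true;
        // Same formula as ConvolutionParamInitializer.numParams above.
        int numParams = nIn * nOut * kH * kW + (hasBias ? nOut : 0);
        System.out.println(numParams);            // prints 520
    }
}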
Allocate one large one-dimensional block of memory for all parameters:
MultiLayerNetwork.init
if (parameters != null) {
    // ...
} else {
    flattenedParams = Nd4j.create(1, paramLength);
    initializeParams = true;
}
Assign a slice of this block to each layer in sequence:
MultiLayerNetwork.init
int paramCountSoFar = 0;
for (int i = 0; i < nLayers; i++) {
    INDArray paramsView;
    if (nParamsPerLayer[i] > 0) {
        //...
    } else {
        paramsView = null;
    }
    paramCountSoFar += nParamsPerLayer[i];
    NeuralNetConfiguration conf = layerWiseConfigurations.getConf(i);
    layers[i] = conf.getLayer().instantiate(conf, listeners, i, paramsView, initializeParams);
    layerMap.put(conf.getLayer().getLayerName(), layers[i]);
}
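The elided branch takes a sub-range view of flattenedParams for the current layer, so each layer reads and writes the shared flat buffer rather than its own copy. A minimal plain-Java sketch of that idea, using invented per-layer counts and treating a "view" as just an (offset, length) pair into one array:
class FlatParamViews {
    public static void main(String[] args) {
        int[] nParamsPerLayer = {520, 25050, 400500, 5010}; // hypothetical per-layer counts
        int paramLength = 0;
        for (int n : nParamsPerLayer) paramLength += n;
        float[] flattenedParams = new float[paramLength];   // one contiguous buffer for all layers
        int paramCountSoFar = 0;
        for (int i = 0; i < nParamsPerLayer.length; i++) {
            // A "view" here is just the half-open range [offset, offset + length) of the shared buffer.
            int offset = paramCountSoFar, length = nParamsPerLayer[i];
            System.out.println("layer " + i + " -> [" + offset + ", " + (offset + length) + ")");
            paramCountSoFar += length;
        }
    }
}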
Input/Label
For each mini-batch, the input is organized into an INDArray of shape batchSize x imageBytes; the labels are organized similarly.
MnistDataFetcher.fetch
public void fetch(int numExamples) {
    //...
    float[][] featureData = new float[numExamples][0];
    float[][] labelData = new float[numExamples][0];
    int actualExamples = 0;
    byte[] working = null;
    for (int i = 0; i < numExamples; i++, cursor++) {
        if (!hasMore())
            break;
        byte[] img = man.readImageUnsafe(order[cursor]);
        //...
        int label = man.readLabel(order[cursor]);
        //...
        float[] featureVec = new float[img.length];
        featureData[actualExamples] = featureVec;
        labelData[actualExamples] = new float[numOutcomes];
        labelData[actualExamples][label] = 1.0f;
        //...
        actualExamples++;
    }
    //...
    INDArray features = Nd4j.create(featureData);
    INDArray labels = Nd4j.create(labelData);
    curr = new DataSet(features, labels);
}
The shape is set to {data.length, data[0].length}:
public BaseNDArray(float[][] data, char ordering) {
    this(Nd4j.createBuffer(ordering == 'c' ? ArrayUtil.flatten(data) : ArrayUtil.flattenF(data)),
            new int[] {data.length, data[0].length},
            Nd4j.getStrides(new int[] {data.length, data[0].length}, ordering), 0, ordering);
}
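ArrayUtil.flatten concatenates the rows of the 2-d Java array into one 1-d buffer ('c' order), while flattenF lays it out column-first ('f' order); the shape {data.length, data[0].length} is what turns the flat buffer back into a matrix. A minimal plain-Java sketch of the row-major ('c') case:
class FlattenC {
    public static void main(String[] args) {
        float[][] data = {{1, 2, 3}, {4, 5, 6}};              // shape {2, 3}
        float[] flat = new float[data.length * data[0].length];
        for (int i = 0; i < data.length; i++)                  // rows are laid out one after another
            System.arraycopy(data[i], 0, flat, i * data[0].length, data[0].length);
        System.out.println(java.util.Arrays.toString(flat));   // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    }
}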
Forward Propagation
Input
The input of each layer is set in
AbstractLayer.java
@Override
public INDArray activate(INDArray input, boolean training) {
    setInput(input);
    return activate(training);
}
The input type is set in the MultiLayerConfiguration builder:
LenetMnist.java
.setInputType(InputType.convolutionalFlat(inputImageSize, inputImageSize, channelNum))
A preprocessor is then created:
MultiLayerConfiguration.Builder.build
if (inputType != null) {
    InputType currentInputType = inputType;
    for (int i = 0; i < confs.size(); i++) {
        Layer l = confs.get(i).getLayer();
        if (inputPreProcessors.get(i) == null) {
            //Don't override preprocessor setting, but set preprocessor if required...
            InputPreProcessor inputPreProcessor = l.getPreProcessorForInputType(currentInputType);
            if (inputPreProcessor != null) {
                inputPreProcessors.put(i, inputPreProcessor);
            }
        }
        //...
    }
    //...
}
For the CNN layer, a default FeedForwardToCnnPreProcessor is created:
@JsonCreator
public FeedForwardToCnnPreProcessor(@JsonProperty("inputHeight") int inputHeight,
                @JsonProperty("inputWidth") int inputWidth, @JsonProperty("numChannels") int numChannels) {
    this.inputHeight = inputHeight;
    this.inputWidth = inputWidth;
    this.numChannels = numChannels;
}
The preprocessor is executed before activate() for each layer; MultiLayerNetwork orchestrates this, so it is not embedded in the ConvolutionLayer class.
public INDArray activationFromPrevLayer(int curr, INDArray input, boolean training) {
    // ...
    input = getLayerWiseConfigurations().getInputPreProcess(curr).preProcess(input, getInputMiniBatchSize());
    // ...
    INDArray ret = layers[curr].activate(input, training);
    return ret;
}
The 2-dimensional buffer is reshaped into a 4-dimensional buffer:
FeedForwardToCnnPreProcessor.java
@Override
public INDArray preProcess(INDArray input, int miniBatchSize) {
    //...
    return input.reshape('c', input.size(0), numChannels, inputHeight, inputWidth);
}
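The reshape does not move any data; it only reinterprets the flat row-major buffer, so element (n, c, h, w) of the 4-d view sits at flat index ((n * numChannels + c) * inputHeight + h) * inputWidth + w. A small plain-Java sketch of that index arithmetic, with MNIST-like sizes assumed:
class CnnIndexing {
    public static void main(String[] args) {
        int numChannels = 1, inputHeight = 28, inputWidth = 28; // MNIST-like sizes, assumed
        int n = 1, c = 0, h = 3, w = 5;                          // arbitrary example element
        // Row-major ('c' order) offset of element (n, c, h, w) in the flat buffer.
        int flatIndex = ((n * numChannels + c) * inputHeight + h) * inputWidth + w;
        System.out.println(flatIndex);                           // 784 + 3*28 + 5 = 873
    }
}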
Convolution Layer
Suppose batchSize = 2; the layer then looks like the following figure, and the numbers below use the values from that figure.
ConvolutionLayer.preOutput
INDArray bias = getParamWithNoise(ConvolutionParamInitializer.BIAS_KEY, training);
INDArray weights = getParamWithNoise(ConvolutionParamInitializer.WEIGHT_KEY, training);
weights has shape {nOut, nIn, kernelHeight, kernelWidth} and input has shape {batchSize, numChannels, inputHeight, inputWidth}. Here nIn, "channel depth", and "number of channels" all refer to the same quantity.
INDArray col = Nd4j.createUninitialized(new int[] {miniBatch, outH, outW, inDepth, kH, kW}, 'c');
INDArray col2 = col.permute(0, 3, 4, 5, 1, 2);
This creates an uninitialized INDArray of shape (miniBatch, outH, outW, inDepth, kH, kW) and permutes it into shape (miniBatch, inDepth, kH, kW, outH, outW).
Convolution.im2col(input, kH, kW, strides[0], strides[1], pad[0], pad[1], dilation[0], dilation[1],
convolutionMode == ConvolutionMode.Same, col2);
INDArray im2col2d = Shape.newShapeNoCopy(col, new int[] {miniBatch * outH * outW, inDepth * kH * kW}, false);
im2col expands the input into a 2-dimensional matrix: each block that will be convolved with a kernel is extracted into one row of the matrix,
INDArray permutedW = weights.permute(3, 2, 1, 0);
INDArray reshapedW = permutedW.reshape('f', kW * kH * inDepth, outDepth);
and each set of kernels is extracted into one column of the weight matrix.
z = im2col2d.mmul(reshapedW);
As in the following figure, multiplying them gives z, the output before activation.
z = Shape.newShapeNoCopy(z, new int[] {outW, outH, miniBatch, outDepth}, true);
z = z.permute(2, 3, 1, 0);
Reshape z into shape {miniBatch, outDepth, outH, outW}. It is permute(2, 3, 1, 0) instead of (2, 3, 0, 1) because consecutive blocks are counted by row number in the input matrix (im2col2d) but by column number in the output matrix.
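To make the im2col view of preOutput concrete, here is a tiny plain-Java demo with invented sizes: a single-channel 3x3 input, one 2x2 kernel, stride 1, no padding. im2col2d is then 4x4, the reshaped weight matrix is a 4x1 column, and their product is the 2x2 pre-activation map. This is a sketch of the idea only; the real code handles multiple channels, batches and orderings as shown above.
class Im2ColDemo {
    public static void main(String[] args) {
        float[][] input = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};    // 3x3 single-channel image
        float[] kernel = {1, 0, 0, -1};                          // one 2x2 kernel, flattened row-major
        int kH = 2, kW = 2, outH = 2, outW = 2;
        // im2col: each 2x2 patch becomes one row of an (outH*outW) x (kH*kW) matrix.
        float[][] im2col2d = new float[outH * outW][kH * kW];
        for (int oh = 0; oh < outH; oh++)
            for (int ow = 0; ow < outW; ow++)
                for (int r = 0; r < kH; r++)
                    for (int c = 0; c < kW; c++)
                        im2col2d[oh * outW + ow][r * kW + c] = input[oh + r][ow + c];
        // Convolution = matrix product of im2col2d with the flattened kernel column.
        for (int p = 0; p < outH * outW; p++) {
            float z = 0;
            for (int k = 0; k < kH * kW; k++) z += im2col2d[p][k] * kernel[k];
            System.out.print(z + (p % outW == outW - 1 ? "\n" : " "));
        }
    }
}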
ConvolutionLayer.activate()
INDArray activation = afn.getActivation(z, training);
ActivationReLU.getActivation()
@Override
public INDArray getActivation(INDArray in, boolean training) {
    Nd4j.getExecutioner().execAndReturn(new RectifedLinear(in));
    return in;
}
The output has the same shape as z (a rank-4 array) and is sent to the subsampling layer.
Subsampling Layer
Not confirmed; the guess is that max pooling is executed through the following chain:
SubsamplingLayer.activate
--> Convolution.pooling2D
--> Nd4jCpu.double_maxpool2d
--> libnd4j maxpool2d.cpp
--> OpenBlas
The further guess is that OpenBLAS does not do anything particularly clever here: it is a SIMD version of the naive implementation, much like Caffe's (see Caffe CNN Layer).
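For reference, here is what a naive max-pooling forward pass (the kind of loop such a kernel would be vectorizing) looks like in plain Java; 2x2 window, stride 2, single channel, all values invented:
class NaiveMaxPool {
    public static void main(String[] args) {
        float[][] in = {{1, 3, 2, 0}, {4, 2, 1, 5}, {0, 1, 3, 2}, {2, 0, 1, 4}}; // 4x4 feature map
        int k = 2, stride = 2, outH = 2, outW = 2;
        float[][] out = new float[outH][outW];
        for (int oh = 0; oh < outH; oh++)
            for (int ow = 0; ow < outW; ow++) {
                float max = Float.NEGATIVE_INFINITY;
                for (int r = 0; r < k; r++)                       // scan the k x k window
                    for (int c = 0; c < k; c++)
                        max = Math.max(max, in[oh * stride + r][ow * stride + c]);
                out[oh][ow] = max;
            }
        System.out.println(java.util.Arrays.deepToString(out));  // [[4.0, 5.0], [2.0, 4.0]]
    }
}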
Dense Layer
BaseLayer.preOutput()
INDArray ret = input.mmul(W);
BaseLayer.activate()
INDArray ret = layerConf().getActivationFn().getActivation(z, training);
Output Layer
Set the input of the output layer:
MultiLayerNetwork.computeGradientAndScore()
INDArray actSecondLastLayer = activations.get(activations.size() - 1);
getOutputLayer().setInput(actSecondLastLayer);
backprop();
Set the labels of the output layer:
MultiLayerNetwork.calcBackpropGradients()
outputLayer.setLabels(labels);
currPair = outputLayer.backpropGradient(null);
Get the output
LossMCXENT.computeGradient()
INDArray output = activationFn.getActivation(preOutput.dup(), true);
The activation is then computed by ActivationSoftmax.getActivation().
Annotation in ActivationSoftmax.java
/**
* f_i(x) = exp(x_i - shift) / sum_j exp(x_j - shift)
* where shift = max_i(x_i)
*/
The shift by max_i(x_i) is a numerical-stability normalization; it does not change the softmax output (see the check below).
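A small plain-Java check that the shift only stabilizes the computation and does not change the result (values invented):
class SoftmaxShift {
    public static void main(String[] args) {
        double[] x = {2.0, 1.0, 0.1};
        double shift = Math.max(x[0], Math.max(x[1], x[2]));     // shift = max_i(x_i)
        double sumShifted = 0, sumPlain = 0;
        for (double v : x) { sumShifted += Math.exp(v - shift); sumPlain += Math.exp(v); }
        for (double v : x)
            // Both expressions give the same probability; the shifted one avoids overflow for large x.
            System.out.println(Math.exp(v - shift) / sumShifted + " vs " + Math.exp(v) / sumPlain);
    }
}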
Back Propagation
Output Layer
The loss function is created without weight parameters:
OutputLayer.Builder
public Builder(LossFunction lossFunction) {
    super.lossFunction(lossFunction);
}
The gradient is computed as:
LossMCXENT.computeGradient()
grad = output.subi(labels); //labels is one-hot vector
See the softmax gradient references (English and Chinese). The output layer is as in the following figure:
Σ(k) means sum over index k
sigma(i, j) = 1 if i == j, else 0 (the Kronecker delta)
loss = l = -log(zy), labels(y) = 1
zi = exp(xi - xt) / Σ(k)(exp(xk - xt)), xt = max(xk)
dzi/dxj = (sigma(i, j) * exp(xi - xt) / Σ) + (exp(xi - xt) * (-1) * (Σ ^ (-2)) * exp(xj - xt))
= (exp(xi - xt) / Σ) * (sigma(i, j) - exp(xj - xt) / Σ)
= zi * (sigma(i, j) - zj)
dl/dzi = sigma(y, i) * (-1) * (1 / zi)
dl/dxj = Σ(i)((dl/dzi) * (dzi/dxj))
= Σ(i)(sigma(y, i) * (1 / zi) * zi * (zj - sigma(i, j)))
= Σ(i)(sigma(y, i) * (zj - sigma(i, j)))
As sigma(y, i) == 0 when (y != i)
--> dl/dxj = zj - sigma(y, j)
And sigma(y, j) == labels(j)
--> dL/dX = output - labels
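The conclusion dL/dX = output - labels can be checked numerically with a finite-difference approximation of the softmax cross-entropy loss. A self-contained plain-Java sketch (values invented):
class SoftmaxCeGradCheck {
    static double loss(double[] x, int y) {
        double max = Double.NEGATIVE_INFINITY, sum = 0;
        for (double v : x) max = Math.max(max, v);
        for (double v : x) sum += Math.exp(v - max);
        return -(x[y] - max - Math.log(sum));                    // -log(softmax(x)[y])
    }
    public static void main(String[] args) {
        double[] x = {0.3, -1.2, 2.0};
        int y = 2;                                               // index where labels == 1
        double max = Double.NEGATIVE_INFINITY, sum = 0;
        for (double v : x) max = Math.max(max, v);
        for (double v : x) sum += Math.exp(v - max);
        for (int j = 0; j < x.length; j++) {
            double analytic = Math.exp(x[j] - max) / sum - (j == y ? 1 : 0); // z_j - label_j
            double[] xp = x.clone(); xp[j] += 1e-6;
            double numeric = (loss(xp, y) - loss(x, y)) / 1e-6;              // finite difference
            System.out.println(analytic + " ~ " + numeric);      // the two columns should agree closely
        }
    }
}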
Dense Layer
Let f = do / dz (the activation derivative)
and epsilon = dl / do (the gradient arriving from the next layer)
dl/dwij = (dl/doi) * (doi/dzi) * (dzi / dwij)
= epsilon(i) * f(i) * xj
dl/dxi = Σ(k)((dl/dok) * (dok/dzk) * (dzk/dxi))
= Σ(k)(epsilon(k) * f(k) * w(k, i))
The backpropagation happens in
DenseLayer --> BaseLayer.backpropGradient()
@Override
public Pair<Gradient, INDArray> backpropGradient(INDArray epsilon) {
    //get z
    INDArray z = preOutput(true);
    //get f based on value of z
    //and get (f muli epsilon)
    INDArray delta = layerConf().getActivationFn().backprop(z, epsilon).getFirst(); //TODO handle activation function params
    Gradient ret = new DefaultGradient();
    INDArray weightGrad = gradientViews.get(DefaultParamInitializer.WEIGHT_KEY); //f order
    //weightGrad = input.transpose() * delta = input.transpose() * (f muli epsilon)
    Nd4j.gemm(input, delta, weightGrad, true, false, 1.0, 0.0);
    ret.gradientForVariable().put(DefaultParamInitializer.WEIGHT_KEY, weightGrad);
    //End of w gradient
    if (hasBias()) {
        INDArray biasGrad = gradientViews.get(DefaultParamInitializer.BIAS_KEY);
        delta.sum(biasGrad, 0); //biasGrad is initialized/zeroed first
        ret.gradientForVariable().put(DefaultParamInitializer.BIAS_KEY, biasGrad);
    }
    //get w
    INDArray W = getParamWithNoise(DefaultParamInitializer.WEIGHT_KEY, true);
    //epsilonNext = delta * W.transpose(), computed as (W * delta.transpose()).transpose()
    INDArray epsilonNext = W.mmul(delta.transpose()).transpose();
    //End of x gradient
    weightNoiseParams.clear();
    return new Pair<>(ret, epsilonNext);
}
Subsampling Layer
SubsamplingLayer.backpropGradient()
col6d = Nd4j.create(new int[] {miniBatch, inDepth, outH, outW, kernel[0], kernel[1]}, 'c');
epsilon1d = epsilon.reshape('c', ArrayUtil.prod(epsilon.length()), 1); //zero copy reshape
Convolution.im2col(input, kernel[0], kernel[1], strides[0], strides[1], pad[0], pad[1], dilation[0], dilation[1],
convolutionMode == ConvolutionMode.Same, col6dPermuted);
INDArray isMax = Nd4j.getExecutioner().execAndReturn(new IsMax(col2d, 1));
isMax.muliColumnVector(epsilon1d);
Convolution.col2im(col6dPermuted, outEpsilon, strides[0], strides[1], pad[0], pad[1], inputHeight, inputWidth, dilation[0], dilation[1]);
Finally, col2im maps the masked gradients back onto the input positions, giving outEpsilon, the gradient with respect to the layer input; a tiny sketch of the masking step follows.
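The net effect is that each epsilon value is routed back only to the position that produced the maximum in its pooling window. A single-window plain-Java sketch of that masking step (values invented):
class MaxPoolBackprop {
    public static void main(String[] args) {
        float[] window = {1, 4, 2, 3};     // one flattened 2x2 pooling window (one row of col2d)
        float epsilon = 0.5f;              // gradient arriving for this window's single output
        float[] grad = new float[window.length];
        int argMax = 0;
        for (int i = 1; i < window.length; i++) if (window[i] > window[argMax]) argMax = i;
        grad[argMax] = epsilon;            // isMax mask times epsilon: only the max position gets gradient
        System.out.println(java.util.Arrays.toString(grad)); // [0.0, 0.5, 0.0, 0.0]
    }
}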
Convolution Layer
There are many good documents about convolution backpropagation, so here we only present the im2col/col2im view as implemented in dl4j. In the forward phase, a given weight contributes to the output through its corresponding input entries:
preOutput(i, j) = Σ(k)(X(i, k) * W(k, j))
d(preOutput(i, j))/d(W(k, j)) = X(i, k)
d(loss)/d(W(k, j)) = Σ(i)((d(preOutput(i, j))/d(W(k, j))) * (d(l)/d(preOutput(i, j))))
= Σ(i)(X(i, k) * epsilon(i, j) * df(i, j))
= Σ(i)(X.transpose()(k, i) * epsilon(i, j) * df(i, j))
= X.transpose().row(k) * (epsilon muli df).col(j)
ConvolutionLayer.backpropGradient()
// p.first is of preOutputShape
// afn.backprop computes df/dpreOutput and multiplies it element-wise with epsilon
delta = afn.backprop(p.getFirst(), epsilon).getFirst(); //TODO handle activation function params
delta = delta.permute(1, 0, 2, 3); //To shape: [outDepth,miniBatch,outH,outW]
INDArray delta2d = delta.reshape('c', new int[] {outDepth, miniBatch * outH * outW}); //Shape.newShapeNoCopy(delta,new int[]{outDepth,miniBatch*outH*outW},false);
//im2col2d is as described in previous figure
INDArray im2col2d = p.getSecond(); //Re-use im2col2d array from forward pass if available; recalculate if not
//get im2col2d and delta2d both transposed
Nd4j.gemm(im2col2d, delta2d, weightGradView2df, true, true, 1.0, 0.0);
Similarly, for the gradient with respect to the input:
preOutput(i, j) = Σ(k)(X(i, k) * W(k, j))
d(preOutput(i, j))/d(X(i, k)) = W(k, j)
d(l)/d(X(i, k)) = Σ(j)(W(k, j) * epsilon(i, j) * df(i, j))
As preOutput2d is of shape (batch * outH * outW, outDepth)
And delta2d is of shape (outDepth, batch * outH * outW)
-->
d(l)/d(X(i, k)) = Σ(j)(W(k, j) * delta2d(j, i))
= W.row(k) * delta2d.col(i)
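A tiny dense-matrix sketch of this input-gradient product, with shapes and values invented (W plays the role of the reshaped weights, delta2d the role of the reshaped epsilon ⊙ df); the real code would then use col2im to scatter these patch gradients back onto the input image:
class ConvInputGrad {
    public static void main(String[] args) {
        double[][] W = {{0.5, -1.0}, {2.0, 0.25}};               // (kernel elements k) x (out channels j)
        double[][] delta2d = {{0.1, 0.3, 0.5}, {0.2, 0.4, 0.6}}; // (out channels j) x (patches i)
        double[][] dX = new double[3][2];                        // (patches i) x (kernel elements k)
        for (int i = 0; i < 3; i++)
            for (int k = 0; k < 2; k++)
                for (int j = 0; j < 2; j++)
                    dX[i][k] += W[k][j] * delta2d[j][i];         // d(l)/d(X(i, k)) = Σ(j) W(k, j) * delta2d(j, i)
        System.out.println(java.util.Arrays.deepToString(dX));
        // col2im would then accumulate each patch gradient back into the input feature map.
    }
}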
Update
StochasticGradientDescent.optimize()
//Update all parameters by configured updater
stepFunction.step(params, gradient.gradient());
//Set the updated parameters
model.setParams(params);
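Conceptually the step is just a vectorized in-place update of the flat parameter buffer. Assuming (this is an assumption about this code path) that the updater has already applied the learning rate and any schedules to the gradient before the step function runs, the step reduces to a subtraction, as in this plain-Java sketch with invented values:
class SgdStep {
    public static void main(String[] args) {
        float[] params = {0.5f, -0.2f, 1.0f};
        float[] scaledGradient = {0.05f, -0.01f, 0.2f};   // gradient assumed already scaled by the updater
        for (int i = 0; i < params.length; i++)
            params[i] -= scaledGradient[i];               // step in the negative gradient direction
        System.out.println(java.util.Arrays.toString(params)); // [0.45, -0.19, 0.8]
    }
}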
Score
The score computation sequence is as follows: it first computes the L1 and L2 regularization terms (there is no such regularization in this example), and then computes the score in LossMCXENT.scoreArray:
INDArray output = activationFn.getActivation(preOutput.dup(), true);
INDArray scoreArr = Transforms.log(output, false).muli(labels);
This simply applies the softmax function, takes the log, and multiplies element-wise by the labels, which picks out the entry whose label == 1.
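So for a single example the score is essentially the negative log-probability that the softmax assigns to the true class (the exact sign and averaging conventions depend on the loss implementation). A small plain-Java sketch with invented values:
class McxentScore {
    public static void main(String[] args) {
        double[] output = {0.1, 0.7, 0.2};   // softmax output for one example
        double[] labels = {0, 1, 0};         // one-hot label
        double scoreArr = 0;
        for (int i = 0; i < output.length; i++)
            scoreArr += Math.log(output[i]) * labels[i];   // log(output) muli labels, then summed
        System.out.println(-scoreArr);       // negative log-likelihood, roughly 0.357
    }
}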