
What Happens in the dl4j LenetMnist Example

Class Diagram

Class

Sequence

CNN Structure

Code Sequence

Initialization

InitSequence

Fit

fitsequence

Computation

Initialization

Parameters

Inputsequence

Count the number of parameters in all layers:

MultiLayerNetwork.init
    for (int i = 0; i < nLayers; i++) {
        NeuralNetConfiguration conf = layerWiseConfigurations.getConf(i);
        nParamsPerLayer[i] = conf.getLayer().initializer().numParams(conf);
        paramLength += nParamsPerLayer[i];
    }

Take the convolution layer as an example:

ConvolutionParamInitializer.numParams
    public int numParams(Layer l) {
        org.deeplearning4j.nn.conf.layers.ConvolutionLayer layerConf =
                        (org.deeplearning4j.nn.conf.layers.ConvolutionLayer) l;

        int[] kernel = layerConf.getKernelSize();
        int nIn = layerConf.getNIn();
        int nOut = layerConf.getNOut();
        return nIn * nOut * kernel[0] * kernel[1] + (layerConf.hasBias() ? nOut : 0);
    }
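As a concrete check, here is a plain-Java sketch of the same formula. The layer sizes below (1 -> 20 and 20 -> 50 channels with 5x5 kernels) are the usual LeNet-on-MNIST configuration and are assumptions, not values read from the example source:

    public class ConvParamCount {
        // Same formula as ConvolutionParamInitializer.numParams:
        // one kH x kW kernel per (input channel, output channel) pair, plus one bias per output channel.
        static int numParams(int nIn, int nOut, int kH, int kW, boolean hasBias) {
            return nIn * nOut * kH * kW + (hasBias ? nOut : 0);
        }

        public static void main(String[] args) {
            // Assumed LeNet-style first conv layer: 1 input channel, 20 output channels, 5x5 kernel
            System.out.println(numParams(1, 20, 5, 5, true));   // 1*20*5*5 + 20 = 520
            // Assumed second conv layer: 20 -> 50 channels, 5x5 kernel
            System.out.println(numParams(20, 50, 5, 5, true));  // 20*50*5*5 + 50 = 25050
        }
    }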

A single large one-dimensional buffer is created to hold all parameters:

MultiLayerNetwork.init
    if (parameters != null) {
        // ...
    } else {
        flattenedParams = Nd4j.create(1, paramLength);
        initializeParams = true;
    }

A slice (view) of this buffer is then assigned to each layer in order:

MultiLayerNetwork.init
     int paramCountSoFar = 0;
     for (int i = 0; i < nLayers; i++) {
        INDArray paramsView;
        if (nParamsPerLayer[i] > 0) {
            //...
        } else {
            paramsView = null;
        }
        paramCountSoFar += nParamsPerLayer[i];

        NeuralNetConfiguration conf = layerWiseConfigurations.getConf(i);
        layers[i] = conf.getLayer().instantiate(conf, listeners, i, paramsView, initializeParams);
        layerMap.put(conf.getLayer().getLayerName(), layers[i]);
     }
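The "one flat buffer plus per-layer views" idea can be sketched in plain Java with offsets into a single array (a simplification: ND4J hands each layer an actual sub-array view of flattenedParams, not a copy; the per-layer counts below are hypothetical):

    public class FlatParams {
        public static void main(String[] args) {
            int[] nParamsPerLayer = {520, 25050, 400500, 5010}; // hypothetical per-layer counts
            int paramLength = 0;
            for (int n : nParamsPerLayer) paramLength += n;

            float[] flattenedParams = new float[paramLength];   // the single 1-d buffer

            // Each layer gets the range [offset, offset + nParams) of the flat buffer.
            int paramCountSoFar = 0;
            for (int i = 0; i < nParamsPerLayer.length; i++) {
                int offset = paramCountSoFar;
                int length = nParamsPerLayer[i];
                System.out.println("layer " + i + " view = flattenedParams[" + offset + ".." + (offset + length) + ")");
                paramCountSoFar += length;
            }
        }
    }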

Input/Label

For each mini-batch, the input is organized into an INDArray of shape batchSize x imageLength (one row of flattened image bytes per example). The labels are handled similarly.

MnistDataFetcher.fetch
    public void fetch(int numExamples) {
        //...
        float[][] featureData = new float[numExamples][0];
        float[][] labelData = new float[numExamples][0];

        int actualExamples = 0;
        byte[] working = null;
        for (int i = 0; i < numExamples; i++, cursor++) {
            if (!hasMore())
                break;

            byte[] img = man.readImageUnsafe(order[cursor]);

            //...

            int label = man.readLabel(order[cursor]);

            //...

            float[] featureVec = new float[img.length];
            featureData[actualExamples] = featureVec;
            labelData[actualExamples] = new float[numOutcomes];
            labelData[actualExamples][label] = 1.0f;

            //...

            actualExamples++;
        }

        //...

        INDArray features = Nd4j.create(featureData);
        INDArray labels = Nd4j.create(labelData);
        curr = new DataSet(features, labels);
    }

The shape is set to {data.length, data[0].length}:

BaseNDArray.java
    public BaseNDArray(float[][] data, char ordering) {
        this(Nd4j.createBuffer(ordering == 'c' ? ArrayUtil.flatten(data) : ArrayUtil.flattenF(data)),
                        new int[] {data.length, data[0].length},
                        Nd4j.getStrides(new int[] {data.length, data[0].length}, ordering), 0, ordering);
    }

Forward Propagation

Input

The input of each layer is set in:

AbstractLayer.java
    @Override
    public INDArray activate(INDArray input, boolean training) {
        setInput(input);
        return activate(training);
    }

The input type is set in the MultiLayerConfiguration builder:

LenetMnist.java
    .setInputType(InputType.convolutionalFlat(inputImageSize, inputImageSize, channelNum))

A PreProcessor was created:

MultiLayerConfiguration.Builder.build
           if (inputType != null) {
                InputType currentInputType = inputType;
                for (int i = 0; i < confs.size(); i++) {
                    Layer l = confs.get(i).getLayer();
                    if (inputPreProcessors.get(i) == null) {
                        //Don't override preprocessor setting, but set preprocessor if required...
                        InputPreProcessor inputPreProcessor = l.getPreProcessorForInputType(currentInputType);
                        if (inputPreProcessor != null) {
                            inputPreProcessors.put(i, inputPreProcessor);
                        }
                    }
                    //...
                }
                //...
           }

For a CNN layer, a default FeedForwardToCnnPreProcessor is created:

    @JsonCreator
    public FeedForwardToCnnPreProcessor(@JsonProperty("inputHeight") int inputHeight,
                    @JsonProperty("inputWidth") int inputWidth, @JsonProperty("numChannels") int numChannels) {
        this.inputHeight = inputHeight;
        this.inputWidth = inputWidth;
        this.numChannels = numChannels;
    }

The preprocessor is executed before activate() for each layer; MultiLayerNetwork applies it, so it is not embedded in the ConvolutionLayer class:

    public INDArray activationFromPrevLayer(int curr, INDArray input, boolean training) {
        // ...
        input = getLayerWiseConfigurations().getInputPreProcess(curr).preProcess(input, getInputMiniBatchSize());
        // ...

        INDArray ret = layers[curr].activate(input, training);
        return ret;
    }

The 2-dimensional buffer is reshaped into a 4-dimensional one:

FeedForwardToCnnPreProcessor.java
    @Override
    public INDArray preProcess(INDArray input, int miniBatchSize) {
        //...
        return input.reshape('c', input.size(0), numChannels, inputHeight, inputWidth);
    }
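The reshape is only a reinterpretation of the same row-major ('c' order) buffer. A small sketch of the index arithmetic, assuming 28x28 single-channel MNIST input:

    public class FlatToCnnIndex {
        public static void main(String[] args) {
            int channels = 1, height = 28, width = 28;   // assumed MNIST layout
            int example = 3;                              // row in the 2-d (batch, 784) input
            int flatColumn = 100;                         // column index within that row

            // In 'c' order, flatColumn = (c * height + h) * width + w
            int c = flatColumn / (height * width);
            int h = (flatColumn % (height * width)) / width;
            int w = flatColumn % width;

            System.out.printf("2-d (%d, %d) -> 4-d (%d, %d, %d, %d)%n",
                    example, flatColumn, example, c, h, w);   // (3, 100) -> (3, 0, 3, 16)
        }
    }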

Convolution Layer

Suppose batchSize = 2. The layer looks like the following figure; the numbers below follow the figure.

Test

ConvolutionLayer.preOutput
        INDArray bias = getParamWithNoise(ConvolutionParamInitializer.BIAS_KEY, training);
        INDArray weights = getParamWithNoise(ConvolutionParamInitializer.WEIGHT_KEY, training);

weights has shape {nOut, nIn, kernelHeight, kernelWidth} and input has shape {batchSize, numChannels, inputHeight, inputWidth}. Here "nIn", "channel depth", and "number of channels" all refer to the same thing.

        INDArray col = Nd4j.createUninitialized(new int[] {miniBatch, outH, outW, inDepth, kH, kW}, 'c');
        INDArray col2 = col.permute(0, 3, 4, 5, 1, 2);

This creates an INDArray with shape {miniBatch, outH, outW, inDepth, kH, kW} and permutes it into shape {miniBatch, inDepth, kH, kW, outH, outW}.

        Convolution.im2col(input, kH, kW, strides[0], strides[1], pad[0], pad[1], dilation[0], dilation[1],
                        convolutionMode == ConvolutionMode.Same, col2);
        INDArray im2col2d = Shape.newShapeNoCopy(col, new int[] {miniBatch * outH * outW, inDepth * kH * kW}, false);

im2col expands the input into a 2-dimensional matrix: each block that would be convolved with a kernel is extracted into one row of the matrix.

        INDArray permutedW = weights.permute(3, 2, 1, 0);
        INDArray reshapedW = permutedW.reshape('f', kW * kH * inDepth, outDepth);

This extracts each set of kernels (one set per output channel) as a column of a matrix.

            z = im2col2d.mmul(reshapedW);

Just like the following figure

im2col

Multiplying them gives z, the output before activation.

        z = Shape.newShapeNoCopy(z, new int[] {outW, outH, miniBatch, outDepth}, true);
        z = z.permute(2, 3, 1, 0);

z is reshaped without copying into shape {outW, outH, miniBatch, outDepth} ('f' order) and then permuted into {miniBatch, outDepth, outH, outW}. The permutation is (2, 3, 1, 0) instead of (2, 3, 0, 1) because the rows of the 2-d result run over (miniBatch, outH, outW) in 'c' order, which reads as (outW, outH, miniBatch) when the same buffer is viewed in 'f' order.
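Putting the pieces together, here is a minimal plain-Java sketch of the same im2col + matrix-multiply view of convolution (stride 1, no padding, no bias; self-consistent but with its own column ordering, not the exact dl4j layout):

    public class Im2ColConv {
        // input:   [miniBatch][inDepth][inH][inW]
        // weights: [outDepth][inDepth][kH][kW]
        // returns: [miniBatch][outDepth][outH][outW], stride 1, no padding
        static float[][][][] conv(float[][][][] input, float[][][][] weights) {
            int miniBatch = input.length, inDepth = input[0].length;
            int inH = input[0][0].length, inW = input[0][0][0].length;
            int outDepth = weights.length, kH = weights[0][0].length, kW = weights[0][0][0].length;
            int outH = inH - kH + 1, outW = inW - kW + 1;

            // im2col2d: one row per (example, output position), one column per (channel, kernel offset)
            int rows = miniBatch * outH * outW, cols = inDepth * kH * kW;
            float[][] im2col2d = new float[rows][cols];
            for (int m = 0; m < miniBatch; m++)
                for (int oh = 0; oh < outH; oh++)
                    for (int ow = 0; ow < outW; ow++) {
                        int r = (m * outH + oh) * outW + ow;
                        int c = 0;
                        for (int d = 0; d < inDepth; d++)
                            for (int i = 0; i < kH; i++)
                                for (int j = 0; j < kW; j++)
                                    im2col2d[r][c++] = input[m][d][oh + i][ow + j];
                    }

            // reshapedW: one column per output channel
            float[][] reshapedW = new float[cols][outDepth];
            for (int o = 0; o < outDepth; o++) {
                int c = 0;
                for (int d = 0; d < inDepth; d++)
                    for (int i = 0; i < kH; i++)
                        for (int j = 0; j < kW; j++)
                            reshapedW[c++][o] = weights[o][d][i][j];
            }

            // z2d = im2col2d * reshapedW, then map each row back to (miniBatch, outH, outW)
            float[][][][] z = new float[miniBatch][outDepth][outH][outW];
            for (int r = 0; r < rows; r++)
                for (int o = 0; o < outDepth; o++) {
                    float sum = 0;
                    for (int c = 0; c < cols; c++) sum += im2col2d[r][c] * reshapedW[c][o];
                    int m = r / (outH * outW), oh = (r / outW) % outH, ow = r % outW;
                    z[m][o][oh][ow] = sum;
                }
            return z;
        }

        public static void main(String[] args) {
            float[][][][] input = new float[2][1][5][5];   // batchSize 2, 1 channel, 5x5 image
            float[][][][] weights = new float[3][1][2][2]; // 3 output channels, 2x2 kernel
            input[0][0][1][1] = 1f; weights[0][0][0][0] = 2f;
            float[][][][] z = conv(input, weights);
            System.out.println(z[0][0][1][1]);              // 2.0: kernel(0,0) applied over input(1,1)
        }
    }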

ConvolutionLayer.activate()
        INDArray activation = afn.getActivation(z, training);
ActivationReLU.getActivation()
    @Override
    public INDArray getActivation(INDArray in, boolean training) {
        Nd4j.getExecutioner().execAndReturn(new RectifedLinear(in));
        return in;
    }

The output has the same shape as z, a rank-4 array, and is sent to the subsampling layer.

Subsampling Layer

This is not confirmed, but max pooling appears to be executed through the following call chain:

SubsamplingLayer.activate
--> Convolution.pooling2D
--> Nd4jCpu.double_maxpool2d
--> libnd4j maxpool2d.cpp
--> OpenBlas

Presumably OpenBLAS does not do anything special here: it is a SIMD version of the naive implementation, much as Caffe does (see Caffe CNN Layer).
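For reference, the naive max-pooling forward pass looks like the following plain-Java sketch (2x2 window, stride 2, one channel; the window/stride sizes are assumptions, and this is illustrative rather than the libnd4j code):

    public class NaiveMaxPool {
        // input: [h][w], pooled with a k x k window and stride k
        static float[][] maxPool(float[][] input, int k) {
            int outH = input.length / k, outW = input[0].length / k;
            float[][] out = new float[outH][outW];
            for (int oh = 0; oh < outH; oh++)
                for (int ow = 0; ow < outW; ow++) {
                    float max = Float.NEGATIVE_INFINITY;
                    for (int i = 0; i < k; i++)
                        for (int j = 0; j < k; j++)
                            max = Math.max(max, input[oh * k + i][ow * k + j]);
                    out[oh][ow] = max;
                }
            return out;
        }

        public static void main(String[] args) {
            float[][] input = {{1, 2, 5, 6}, {3, 4, 7, 8}, {9, 1, 2, 3}, {1, 1, 4, 4}};
            float[][] out = maxPool(input, 2);
            System.out.println(out[0][0] + " " + out[0][1] + " " + out[1][0] + " " + out[1][1]); // 4.0 8.0 9.0 4.0
        }
    }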

Dense Layer

BaseLayer.preOutput()
        INDArray ret = input.mmul(W);
BaseLayer.activate()
        INDArray ret = layerConf().getActivationFn().getActivation(z, training);

Output Layer

Set the input of the output layer:

MultiLayerNetwork.computeGradientAndScore()
                INDArray actSecondLastLayer = activations.get(activations.size() - 1);
                getOutputLayer().setInput(actSecondLastLayer);
                backprop();

Set the labels of the output layer:

MultiLayerNetwork.calcBackpropGradients()
            outputLayer.setLabels(labels);
            currPair = outputLayer.backpropGradient(null);

Get the output

LossMCXENT.computeGradient()
        INDArray output = activationFn.getActivation(preOutput.dup(), true);

It is then computed by ActivationSoftmax.getActivation().

From the comment in ActivationSoftmax.java:
/**
 * f_i(x) = exp(x_i - shift) / sum_j exp(x_j - shift)
 * where shift = max_i(x_i)
 */

This is the standard softmax: subtracting the maximum (the shift) keeps exp from overflowing without changing the result, and the division normalizes the outputs into a probability distribution over the classes.
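A minimal plain-Java sketch of the same formula:

    public class SoftmaxSketch {
        static double[] softmax(double[] x) {
            double shift = Double.NEGATIVE_INFINITY;
            for (double v : x) shift = Math.max(shift, v);        // shift = max_i(x_i), for numerical stability
            double sum = 0;
            double[] z = new double[x.length];
            for (int i = 0; i < x.length; i++) { z[i] = Math.exp(x[i] - shift); sum += z[i]; }
            for (int i = 0; i < x.length; i++) z[i] /= sum;       // now the z_i sum to 1
            return z;
        }

        public static void main(String[] args) {
            double[] z = softmax(new double[]{1.0, 2.0, 3.0});
            System.out.println(z[0] + " " + z[1] + " " + z[2]);   // ~0.09, 0.24, 0.67; sums to 1
        }
    }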

Back Propagation

Output Layer

The loss function is created without per-class weight parameters:

OutputLayer.Builder
        public Builder(LossFunction lossFunction) {
            super.lossFunction(lossFunction);
        }

The gradient was computed as:

LossMCXENT.computeGradient()
                grad = output.subi(labels); //labels is one-hot vector

Derivations of the softmax gradient are available elsewhere (in English and in Chinese). The output layer is as in the following figure:

softmax

Σ(k) means sum over index k
sigma(i, j) = (i == j)
loss = l = -log(zy), labels(y) = 1
zi = exp(xi - xt) / Σ(k)(exp(xk - xt)), xt = max(xk)

dzi/dxj = (sigma(i, j) * exp(xi - xt) / Σ) + (exp(xi - xt) * (-1) * (Σ ^ (-2)) * exp(xj - xt))
        = (exp(xi - xt) / Σ) * (sigma(i, j) - exp(xj - xt) / Σ)
        = zi * (sigma(i, j) - zj)

dl/dzi = sigma(y, i) * (-1) * (1 / zi)

dl/dxj = Σ(i)((dl/dzi) * (dzi/dxj))
       = Σ(i)(sigma(y, i) * (1 / zi) * zi * (zj - sigma(i, j)))
       = Σ(i)(sigma(y, i) * (zj - sigma(i, j)))
As sigma(y, i) == 0 when (y != i)
--> dl/dxj = zj - sigma(y, j)
And sigma(y, j) == labels
--> dL/dX = output - labels
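The result dL/dX = output - labels can be checked numerically with finite differences (a small self-contained sketch; the softmax and loss here are written out directly, not taken from dl4j):

    public class SoftmaxGradCheck {
        static double[] softmax(double[] x) {
            double shift = Double.NEGATIVE_INFINITY;
            for (double v : x) shift = Math.max(shift, v);
            double sum = 0;
            double[] z = new double[x.length];
            for (int i = 0; i < x.length; i++) { z[i] = Math.exp(x[i] - shift); sum += z[i]; }
            for (int i = 0; i < x.length; i++) z[i] /= sum;
            return z;
        }

        // l = -log(z_y) for the true class y
        static double loss(double[] x, int y) { return -Math.log(softmax(x)[y]); }

        public static void main(String[] args) {
            double[] x = {0.3, -1.2, 2.0};
            int y = 2;                                               // one-hot label at index 2
            double[] z = softmax(x);
            double eps = 1e-6;
            for (int j = 0; j < x.length; j++) {
                double[] xp = x.clone(); xp[j] += eps;
                double numeric = (loss(xp, y) - loss(x, y)) / eps;   // finite-difference dl/dx_j
                double analytic = z[j] - (j == y ? 1.0 : 0.0);       // output - labels
                System.out.println(numeric + " vs " + analytic);     // should match closely
            }
        }
    }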

Dense Layer

DenseLayer

f = do / dz
epsilon = dl / do

dl/dwij = (dl/doi) * (doi/dzi) * (dzi / dwij)
        = epsilon(i) * f(i) * xj

dl/dxi = Σ(k)((dl/dok) * (dok/dzk) * (dzk/dxi))
       = Σ(k)(epsilon(k) * f(k) * w(k, i))

The back propagation happens in

DenseLayer --> BaseLayer.backpropGradient()
    @Override
    public Pair<Gradient, INDArray> backpropGradient(INDArray epsilon) {
        //get z
        INDArray z = preOutput(true);
        //get f based on value of z
        //and get (f muli epsilon)
        INDArray delta = layerConf().getActivationFn().backprop(z, epsilon).getFirst(); //TODO handle activation function params

        Gradient ret = new DefaultGradient();

        INDArray weightGrad = gradientViews.get(DefaultParamInitializer.WEIGHT_KEY); //f order
        //weightGrad = input * delta = input * (f muli epsilon)
        Nd4j.gemm(input, delta, weightGrad, true, false, 1.0, 0.0);
        ret.gradientForVariable().put(DefaultParamInitializer.WEIGHT_KEY, weightGrad);
        //End of w gradient

        if(hasBias()){
            INDArray biasGrad = gradientViews.get(DefaultParamInitializer.BIAS_KEY);
            delta.sum(biasGrad, 0); //biasGrad is initialized/zeroed first
            ret.gradientForVariable().put(DefaultParamInitializer.BIAS_KEY, biasGrad);
        }

        //get w
        INDArray W = getParamWithNoise(DefaultParamInitializer.WEIGHT_KEY, true);
        //get W * (f muli epsilon)
        INDArray epsilonNext = W.mmul(delta.transpose()).transpose();
        //End of x gradient

        weightNoiseParams.clear();

        return new Pair<>(ret, epsilonNext);
    }

Subsampling Layer

SubsamplingLayer.backpropGradient()

subsamp init

            col6d = Nd4j.create(new int[] {miniBatch, inDepth, outH, outW, kernel[0], kernel[1]}, 'c');

subsamp con6d

            epsilon1d = epsilon.reshape('c', ArrayUtil.prod(epsilon.length()), 1); //zero copy reshape

subsamp epsilon1d

                Convolution.im2col(input, kernel[0], kernel[1], strides[0], strides[1], pad[0], pad[1], dilation[0], dilation[1],
                                convolutionMode == ConvolutionMode.Same, col6dPermuted);

subsamp im2col

                INDArray isMax = Nd4j.getExecutioner().execAndReturn(new IsMax(col2d, 1));

subsamp isMax

                isMax.muliColumnVector(epsilon1d);

subsamp epsilon

        Convolution.col2im(col6dPermuted, outEpsilon, strides[0], strides[1], pad[0], pad[1], inputHeight, inputWidth, dilation[0], dilation[1]);

subsamp c6permuted

Get the output

subsamp output
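Conceptually, the backward pass of max pooling just routes each epsilon value back to the position that held the maximum in the forward pass. A minimal plain-Java sketch (2x2 window, stride 2; illustrative only, not the actual im2col/IsMax/col2im code):

    public class MaxPoolBackprop {
        // input: [h][w]; epsilon: [h/k][w/k] = dL/d(pooled output); returns dL/d(input)
        static float[][] backprop(float[][] input, float[][] epsilon, int k) {
            float[][] outEpsilon = new float[input.length][input[0].length];
            for (int oh = 0; oh < epsilon.length; oh++)
                for (int ow = 0; ow < epsilon[0].length; ow++) {
                    int maxI = oh * k, maxJ = ow * k;               // position of the max in this window
                    for (int i = 0; i < k; i++)
                        for (int j = 0; j < k; j++)
                            if (input[oh * k + i][ow * k + j] > input[maxI][maxJ]) {
                                maxI = oh * k + i; maxJ = ow * k + j;
                            }
                    outEpsilon[maxI][maxJ] += epsilon[oh][ow];      // the "isMax * epsilon" of the snippets above
                }
            return outEpsilon;
        }

        public static void main(String[] args) {
            float[][] input = {{1, 2}, {3, 4}};
            float[][] epsilon = {{5}};
            float[][] g = backprop(input, epsilon, 2);
            System.out.println(g[1][1]);                            // 5.0: all gradient goes to the max (value 4)
        }
    }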

Convolution Layer

There are many good documents about convolution back propagation, so here we only describe the im2col/col2im view as implemented in dl4j. In the forward phase we have:

convback11

Each weight contributes to the output through the corresponding input entries.

preOutput(i, j) = Σ(k)(X(i, k) * W(k, j))
d(preOutput(i, j))/d(W(k, j)) = X(i, k)

d(loss)/d(w(k, j)) = Σ(i)((d(preOutput(i, j)/d(W(k, j))) * (d(l)/d(preOutput(i, j))))
                   = Σ(i)(X(i, k) * epsilon(i, j) * df(i, j))
                   = Σ(i)((X.transpose()(k, i)) * epsilon(i, j) * df(i, j))
                   = X.transpose().row(k) * (epsilon dot df).col(j)
ConvolutionLayer.backpropGradient()
        // p.first is of preOutputShape
        // afn.backprop computes df/dpreOutput and multiplies it elementwise by epsilon
        delta = afn.backprop(p.getFirst(), epsilon).getFirst(); //TODO handle activation function params

        delta = delta.permute(1, 0, 2, 3); //To shape: [outDepth,miniBatch,outH,outW]
        INDArray delta2d = delta.reshape('c', new int[] {outDepth, miniBatch * outH * outW}); //Shape.newShapeNoCopy(delta,new int[]{outDepth,miniBatch*outH*outW},false);

        //im2col2d is as described in previous figure
        INDArray im2col2d = p.getSecond(); //Re-use im2col2d array from forward pass if available; recalculate if not

        //get im2col2d and delta2d both transposed
        Nd4j.gemm(im2col2d, delta2d, weightGradView2df, true, true, 1.0, 0.0);

convback12

Similarly
preOutput(i, j) = Σ(k)(X(i, k) * W(k, j))
d(preOutput(i, j))/d(X(i, k)) = W(k, j)

d(l)/d(X(i, k)) = Σ(j)(W(k, j) * epsilon(i, j) * df(i, j))
As preOutput2d is of shape (batch * outH * outW, outDepth)
And delta2d is of shape (outDepth, batch * outH * outW)
-->
d(l)/d(X(i, k)) = Σ(j)(W(k, j) * delta2d(j, i))
                = W.row(k) * delta2d.col(i)

convback13

Update

StochasticGradientDescent.optimize()
        //Update all parameters by configured updater
        stepFunction.step(params, gradient.gradient());
        //Set the updated parameters
        model.setParams(params);
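The step itself is a plain element-wise update on the flat parameter view: params = params - learningRate * gradient. A minimal sketch (the explicit learning rate below is a simplification: in dl4j the updater has usually already scaled the gradient before step() runs):

    public class SgdStep {
        public static void main(String[] args) {
            float[] params = {0.5f, -0.2f, 1.0f};       // flattened parameter view
            float[] gradient = {0.1f, -0.4f, 0.0f};     // gradient for the same flat view
            float learningRate = 0.01f;                  // assumed value for illustration

            // params <- params - learningRate * gradient
            for (int i = 0; i < params.length; i++)
                params[i] -= learningRate * gradient[i];

            System.out.println(params[0] + " " + params[1] + " " + params[2]); // ~0.499 -0.196 1.0
        }
    }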

Score

The score sequence is like:

score

It first computes the L1 and L2 regularization terms (there is no such regularization in this example), and then computes the score in LossMCXENT.scoreArray:

        INDArray output = activationFn.getActivation(preOutput.dup(), true);
        INDArray scoreArr = Transforms.log(output, false).muli(labels);

This applies the softmax function, takes its log, and multiplies by the labels, which picks out the log-probability of the class whose label == 1.
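In other words, the per-example score is the multi-class cross entropy -Σ(i)(labels(i) * log(softmax(preOutput)(i))), which reduces to -log of the predicted probability of the true class. A minimal sketch:

    public class McxentScore {
        static double[] softmax(double[] x) {
            double shift = Double.NEGATIVE_INFINITY;
            for (double v : x) shift = Math.max(shift, v);
            double sum = 0;
            double[] z = new double[x.length];
            for (int i = 0; i < x.length; i++) { z[i] = Math.exp(x[i] - shift); sum += z[i]; }
            for (int i = 0; i < x.length; i++) z[i] /= sum;
            return z;
        }

        public static void main(String[] args) {
            double[] preOutput = {0.2, 1.5, -0.3};
            double[] labels = {0, 1, 0};                  // one-hot: true class is index 1
            double[] output = softmax(preOutput);

            // scoreArr = log(output) * labels, summed; the loss is its negation
            double score = 0;
            for (int i = 0; i < output.length; i++) score += Math.log(output[i]) * labels[i];
            System.out.println(-score);                   // equals -log(output[1])
        }
    }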