
RNN Operation

Codes

Overview

rnn.cpp defines three models, and each model has both a run net and an init net.

The run net is shared by the training model and the prediction model.

From the model/net point of view, the architecture looks like this:

net_arch

rnn_net

The RNN net is defined as an operator, in AddLSTM and AddRecurrentNetworkOp.

An overview of the LSTM looks like this:

rnn_overview

AddLSTM

		AddFcOps(input_blob, scope + "/i2h", vocab_size, 4 * hidden_size, 2);

This line of code computes (Wz * x) in the picture above. It does not depend on the recurrence, so the FcOp processes all time steps from 0 to SeqLen in a single pass.
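For reference, the gate pre-activation can be written as below; this is a sketch in standard LSTM notation, with z_t loosely denoting the stacked pre-activations of all four gates (which is why the FC output size is 4 * hidden_size):

    \[
      z_t \;=\; \underbrace{W\,x_t}_{\text{precomputed i2h, all $t$ at once}}
             \;+\; \underbrace{R\,y_{t-1}}_{\text{computed inside the step net}}
             \;+\; b
    \]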

		*hidden_output = scope + "/hidden_t_last";
		*cell_state = scope + "/cell_t_last";

This part defines the outputs of the recurrent network.

		initModel.AddXavierFillOp({4 * hidden_size, hidden_size},
				scope + "/gates_t_w");
		trainModel.AddInput(scope + "/gates_t_w");
		initModel.AddConstantFillOp({4 * hidden_size}, scope + "/gates_t_b");
		trainModel.AddInput(scope + "/gates_t_b");

This part initializes the recurrent weights R (R_z in the figure above) and their bias; the blocks for all four gates are stacked, hence the 4 * hidden_size rows.
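Combined with the i2h FC above (vocab_size inputs, 4 * hidden_size outputs), the parameter shapes work out to the following (writing H for hidden_size and V for vocab_size):

    \[
      W \in \mathbb{R}^{4H \times V}, \qquad
      R \in \mathbb{R}^{4H \times H}, \qquad
      b_R \in \mathbb{R}^{4H}
    \]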

AddRecurrentNetworkOp

AddRecurrentNetworkOp defines a simple network for the step operation of the recurrent network.

Workspaces

The RNN network is built as a single operator, and the step network is passed to it as an argument. To keep the per-step workspaces coherent between the forward and backward passes, they are stored in a dedicated blob, scope + “/step_workspaces” (the last output of RecurrentNetworkOp and the last input of RecurrentNetworkGradientOp):

		auto op = trainModel.AddOp("RecurrentNetwork",
					{scope + "/i2h", hidden_init, cell_init, scope + "/gates_t_w",
						scope + "/gates_t_b", seq_lengths},
					{scope + "/hidden_t_all", hidden_output, scope + "/cell_t_all",
						cell_state, scope + "/step_workspaces"}
		);

//recurrent_network_op.h
template <class Context>
class RecurrentNetworkGradientOp final : public Operator<Context> {
 public:
  USE_OPERATOR_CONTEXT_FUNCTIONS;
  RecurrentNetworkGradientOp(const OperatorDef& operator_def, Workspace* ws)
      : Operator<Context>(operator_def, ws),
        sharedWs_(ws),

The RecurrentNetworkOp and GradientOp assign a collection of workspaces, one for each time step.

//DoRunWithType
    const detail::ScratchWorkspaces& scratch =
        this->template Input<detail::ScratchWorkspaces>(InputSize() - 1);
    const std::vector<std::shared_ptr<Workspace>>& stepWorkspaces =
        scratch.stepWorkspaces;
    CAFFE_ENFORCE_GE(stepWorkspaces.size(), seqLen);
    Workspace& sharedBlobsWs = *scratch.sharedBlobsWs.get();

Workspaces are organized as an inheritance tree: sharedBlobsWs, created from the operator's own workspace, is the parent shared by all time-step workspaces.
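The parent relation means that a blob looked up in a step workspace falls back to the shared workspace when it is not found locally. A minimal sketch of this behaviour, assuming Caffe2's Workspace API from caffe2/core/workspace.h (the blob names are illustrative):

    #include <iostream>
    #include "caffe2/core/workspace.h"

    int main() {
      caffe2::Workspace sharedBlobsWs;              // parent, shared by all steps
      sharedBlobsWs.CreateBlob("LSTM/gates_t_w");   // e.g. a blob every step needs

      caffe2::Workspace step0(&sharedBlobsWs);      // one child workspace per step
      step0.CreateBlob("LSTM/gates_t");             // step-local intermediate blob

      std::cout << step0.HasBlob("LSTM/gates_t_w")        // 1: falls back to parent
                << sharedBlobsWs.HasBlob("LSTM/gates_t")  // 0: parent cannot see it
                << std::endl;
      return 0;
    }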

//DoRunWithType()
    for (auto t = 0; t < seqLen; ++t) {
      auto& currentStepWorkspace =
          (has_backward_pass ? stepWorkspaces[t] :
              stepWorkspaces[t % num_workspaces_on_fwd_only]);
      if (!currentStepWorkspace) {
        currentStepWorkspace = std::make_shared<Workspace>(sharedBlobsWs.get());
      }

The StepNetwork is created in the corresponding workspace:

    for (auto t = 0; t < seqLen; ++t) {
//...
      if (rnnExecutor_) {
        if (!has_backward_pass) {
          // Need to limit timestep parallelism because we cycle over workspaces
          rnnExecutor_->SetMaxParallelTimesteps(num_workspaces_on_fwd_only);
        }
        rnnExecutor_->EnsureTimestepInitialized(
            t, currentStepWorkspace.get(), this->observers_list_);
      } else {
        detail::UpdateTimestepBlob(currentStepWorkspace.get(), timestep_, t);
        auto* stepNet = currentStepWorkspace->GetNet(stepNetDef_.name());
        if (stepNet == nullptr) {
          stepNet = currentStepWorkspace->CreateNet(stepNetDef_);
        }
      }
    }

Why should a workspace be assigned to each step? The step nets below reuse the same blob names (input_t, LSTM/gates_t, LSTM/hidden_t, ...) at every time step, so per-step workspaces keep these intermediate values separate and keep them available for the backward pass.

The forward step network is defined as follows:

		NetUtil forward(scope);
		forward.SetType("rnn");
		forward.AddInput("input_t");
		forward.AddInput("timestep");
		forward.AddInput(scope + "/hidden_t_prev");
		forward.AddInput(scope + "/cell_t_prev");
		forward.AddInput(scope + "/gates_t_w");
		forward.AddInput(scope + "/gates_t_b");

Calculate R * y(t - 1) and add it to the precomputed W * x(t) (the slice input_t of i2h):

		auto fc = forward.AddFcOp(scope + "/hidden_t_prev", scope + "/gates_t_w",
									scope + "/gates_t_b", scope + "/gates_t", 2);
		auto sum = forward.AddSumOp({scope + "/gates_t", "input_t"}, scope + "/gates_t");

Add seq_lengths for the recurrent calculation and add the LSTMUnitOp:

		forward.AddInput(seq_lengths);
		auto lstm = forward.AddLSTMUnitOp({scope + "/hidden_t_prev", scope + "/cell_t_prev", scope + "/gates_t", seq_lengths, "timestep"},
											{scope + "/hidden_t", scope + "/cell_t"});

Add the outputs of the LSTM step (not of the recurrent network):

		forward.AddOutput(scope + "/hidden_t");
		forward.AddOutput(scope + "/cell_t");
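For reference, LSTMUnitOp implements the standard LSTM cell update. A sketch in the usual notation, where gates_t is sliced into the input, forget, output and candidate parts i, f, o, g (the exact slice order and the optional forget bias are details of the operator and omitted here):

    \[
    \begin{aligned}
      c_t &= \sigma(f_t)\,c_{t-1} + \sigma(i_t)\,\tanh(g_t) \\
      y_t &= \sigma(o_t)\,\tanh(c_t)
    \end{aligned}
    \]
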
Backward Pass Step Network

Create the backward step network:

		NetUtil backward("RecurrentBackwardStep");
		backward.SetType("simple");
		backward.AddGradientOp(*lstm);

Replace the output “LSTM/hidden_t_prev_grad” of FcGradient with “LSTM/hidden_t_prev_grad_split”, since the name “LSTM/hidden_t_prev_grad” conflicts with an output of LSTMUnitGradient.

		auto grad = backward.AddGradientOp(*fc);
		grad->set_output(2, scope + "/hidden_t_prev_grad_split");

Add the FcGradient contribution to δy(t - 1):

		backward.AddSumOp(
				{scope + "/hidden_t_prev_grad", scope + "/hidden_t_prev_grad_split"},
					scope + "/hidden_t_prev_grad");
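In other words, the total gradient flowing into y(t - 1) is the sum of the LSTMUnitGradient contribution and the recurrent FC contribution. A sketch, writing z_t for gates_t and R for gates_t_w:

    \[
      \delta y_{t-1} \;=\;
      \underbrace{\delta y_{t-1}^{\mathrm{LSTMUnitGradient}}}_{\texttt{hidden\_t\_prev\_grad}}
      \;+\;
      \underbrace{R^{\top}\,\delta z_{t}}_{\texttt{hidden\_t\_prev\_grad\_split}}
    \]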

Define the inputs of the backward network:

		backward.AddInput(scope + "/gates_t");
		backward.AddInput(scope + "/hidden_t_grad");
		backward.AddInput(scope + "/cell_t_grad");
		backward.AddInput("input_t");
		backward.AddInput("timestep");
		backward.AddInput(scope + "/hidden_t_prev");
		backward.AddInput(scope + "/cell_t_prev");
		backward.AddInput(scope + "/gates_t_w");
		backward.AddInput(scope + "/gates_t_b");
		backward.AddInput(seq_lengths);
		backward.AddInput(scope + "/hidden_t");
		backward.AddInput(scope + "/cell_t");
Recurrent Network

Create the recurrent network with the RecurrentNetwork operator:

		auto op = trainModel.AddOp("RecurrentNetwork",
					{scope + "/i2h", hidden_init, cell_init, scope + "/gates_t_w",
						scope + "/gates_t_b", seq_lengths},
					{scope + "/hidden_t_all", hidden_output, scope + "/cell_t_all",
						cell_state, scope + "/step_workspaces"}
		);

What is the function of LSTM/step_workspaces? As described in the Workspaces section above, it is the blob that holds the per-step scratch workspaces, so the gradient op can reuse the activations computed during the forward pass.

Set up the links for the recurrent operation at time step t:

		  net_add_arg(*op, "link_internal",
		              std::vector<std::string>{
		                  scope + "/hidden_t_prev",
						  scope + "/hidden_t",
		                  scope + "/cell_t_prev",
						  scope + "/cell_t",
						  "input_t"});
		  net_add_arg(*op, "link_external",
		              std::vector<std::string>{
		                  scope + "/" + scope + "/hidden_t_prev_states",
		                  scope + "/" + scope + "/hidden_t_prev_states",
		                  scope + "/" + scope + "/cell_t_prev_states",
		                  scope + "/" + scope + "/cell_t_prev_states",
						  scope + "/i2h"});
		  net_add_arg(*op, "link_offset", std::vector<int>{0, 1, 0, 1, 0});

The link operation is defined in recurrent_network_op.h as RNNApplyLinkOp, and the links are added by AddApplyLinkOps in recurrent_network_op.cc.

It maps part of the external blob into the internal blob as a pointer (a reference, not a copy):

    const auto& t0 = this->template Input<Tensor>(0, CPU);
    const auto t = t0.template data<int32_t>()[0];
    auto& external = Input(1);

    auto* internal_out = Output(0);
    auto* external_out = Output(1);

    CAFFE_ENFORCE_GT(external.numel(), 0);
    const int64_t externalTimestepSize = external.numel() / external.size(0);
    auto* externalData = external_out->template mutable_data<T>() +
        (t + offset_) * externalTimestepSize;
    auto internalDims = external_out->sizes().vec();
    internalDims[0] = window_;

    internal_out->Resize(internalDims);
    internal_out->ShareExternalPointer(
        externalData, externalTimestepSize * window_);

The link operation works like this:

rnn_link

So, in the links above, taking cell_t as an example: at time t, LSTM/LSTM/cell_t_prev_states(t + 1) is filled in after the step-t calculation, ready to be read as cell_t_prev by step (t + 1).
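To make the pointer arithmetic concrete, here is a minimal plain-C++ sketch (no Caffe2, illustrative names only) of what a link with window 1 does:

    #include <cstdio>
    #include <vector>

    struct View {
      float* data;   // points into the external buffer, no copy
      int rows;      // window (number of time steps exposed)
      int rowSize;   // timestep size (elements per time step)
    };

    // Emulates the link: the internal blob is a window of `window` time steps
    // inside the external buffer, starting at row (t + offset).
    View link(std::vector<float>& external, int timestepSize,
              int t, int offset, int window) {
      return {external.data() + (t + offset) * timestepSize, window, timestepSize};
    }

    int main() {
      const int T = 4, D = 3;                       // (T + 1) states of size D each
      std::vector<float> states((T + 1) * D, 0.0f); // e.g. LSTM/LSTM/cell_t_prev_states
      const int t = 1;
      View cell_t_prev = link(states, D, t, /*offset=*/0, /*window=*/1); // reads step t
      View cell_t      = link(states, D, t, /*offset=*/1, /*window=*/1); // writes step t + 1
      cell_t.data[0] = 42.0f;                       // a write through the link ...
      std::printf("%.1f %.1f\n", states[(t + 1) * D], cell_t_prev.data[0]);
      return 0;                                     // ... lands in the external blob
    }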

Alias Operation

Then the aliases are set up for the forward network:

		  net_add_arg(
		      *op, "alias_src",
		      std::vector<std::string>{scope + "/" + scope + "/hidden_t_prev_states",
		                               scope + "/" + scope + "/hidden_t_prev_states",
		                               scope + "/" + scope + "/cell_t_prev_states",
		                               scope + "/" + scope + "/cell_t_prev_states"});
		  net_add_arg(*op, "alias_dst",
		              std::vector<std::string>{
			  	  	  	  scope + "/hidden_t_all",
						  hidden_output,
		                  scope + "/cell_t_all",
						  cell_state});
		  net_add_arg(*op, "alias_offset", std::vector<int>{1, -1, 1, -1});

The alias operation is similar to the link operation, in that it shares part of the source data into the destination blob. The difference is that aliases are set up once for the whole sequence rather than re-pointed at every time step; alias_offset selects which slice of the source is exposed, with a negative offset counted from the end (so offset 1 exposes all T time steps, while offset -1 exposes only the last one).

The recompute_blobs_on_backward argument is left empty here. Blobs listed in it are created in the shared workspace (sharedBlobsWs) instead of the per-step workspaces, so that they can be recomputed during the backward pass instead of being stored per step:

    net_add_arg(*op, "recompute_blobs_on_backward");
		  
  void initializeBlobsToRecomputeOnBackward(Workspace* sharedBlobsWs) {
    std::vector<std::string> v;
    const auto& blobs = this->template GetRepeatedArgument<std::string>(
        "recompute_blobs_on_backward", v);
    for (const auto& b : blobs) {
      // Note: if the blob already was created, this is a no-op.
      sharedBlobsWs->CreateBlob(b);
    }
  }		  

In the backward pass, recurrent_states is bound to alias_src and alias_offset. In constructRecurrentGradients, the gradient blob corresponding to each aliased recurrent state is found; the result is the mapping LSTM/LSTM/hidden_t_prev_states_grad ==> LSTM/hidden_t_all_grad.

AddGradientInputAccumulationOps defines the accumulation operations that, at each time step, add in the gradient passed down from the upper layer. After the gradients of all time steps have been processed, the accumulated gradient state is exposed as the gradient of the corresponding initial-state input of the forward pass:

   for (int i = 0; i < recurrentInputIds_.size(); ++i) {
      // See GetRecurrentNetworkGradient to understand offseting here
      // Outputs of the gradient are inputs of the forward pass.
      // So we need to offset on all inputs that go before recurrent
      // initial ones
      auto outputIdx = i + params_.size() + numSequences_;
      // because first gradInputs_.size() inputs are from GO
      int inputId = recurrentInputIds_[i] + gradInputs_.size();
      VLOG(1) << "Resetting output " << this->debug_def().output(outputIdx)
              << " like input " << this->debug_def().input(inputId);
      Output(outputIdx)->ResizeLike(Input(inputId));
      T* output_data = Output(outputIdx)->template mutable_data<T>();
      auto pBlob = sharedWs_->GetBlob(recurrentGradients_[i].grad);
      CAFFE_ENFORCE(pBlob);
      auto* p = BlobGetMutableTensor(pBlob, Context::GetDeviceType());

      if (Input(inputId).dim() >= 2) {
        // Gradient states blob should live. And if it gets changed by the
        // backward pass, then output should be changed as well. Thus it should
        // be okay to share data here
        Output(outputIdx)->template ShareExternalPointer<T>(
            p->template mutable_data<T>());
      }
    }
Step Network

Define the forward network and the backward network used for each step's execution:

		  net_add_arg(*op, "step_net", forward.Proto());
		  net_add_arg(*op, "backward_step_net", backward.Proto());

In RecurrentNetworkOp::DoRunWithType:

    for (auto t = 0; t < seqLen; ++t) {
      auto& currentStepWorkspace =
          (has_backward_pass ? stepWorkspaces[t] :
              stepWorkspaces[t % num_workspaces_on_fwd_only]);
      if (!currentStepWorkspace) {
        currentStepWorkspace = std::make_shared<Workspace>(sharedBlobsWs.get());
      }

      if (rnnExecutor_) {
    	 //...
      } else {
        // Use plain Caffe2 nets
        detail::UpdateTimestepBlob(currentStepWorkspace.get(), timestep_, t);
        auto* stepNet = currentStepWorkspace->GetNet(stepNetDef_.name());
        if (stepNet == nullptr) {
          stepNet = currentStepWorkspace->CreateNet(stepNetDef_);
        }
        CAFFE_ENFORCE(stepNet, "Step Net construction failure");
        // Since we have a SimpleNet, there are no races here.
        stepNet->RunAsync();
      }
    }

The same pattern is used in RecurrentNetworkGradientOp::DoRunWithType.

Sequence

Forward Pass

forward pass

  1. Calculate (W * x) for all time steps
  2. Copy the initial value into hidden_t_prev_states at time step 0
  3. Copy the initial value into cell_t_prev_states at time step 0
  4. Run the recurrent network step at time t
    • 4.0: update the timestep blob to t
    • 4.1: LSTM/hidden_t_prev points to the LSTM/LSTM/hidden_t_prev_states values at t
    • 4.2: LSTM/hidden_t points to the LSTM/LSTM/hidden_t_prev_states values at (t + 1), to store the LSTMUnitOp result
    • 4.3: LSTM/cell_t_prev points to the LSTM/LSTM/cell_t_prev_states values at t
    • 4.4: LSTM/cell_t points to the LSTM/LSTM/cell_t_prev_states values at (t + 1), to store the LSTMUnitOp result
    • 4.5: input_t points to the LSTM/i2h values at t
    • 4.6: compute (R * y(t - 1)) at time step t
    • 4.7: compute (R * y(t - 1) + W * x(t)) at time step t
    • 4.8: feed (R * y(t - 1) + W * x(t)), c(t - 1) and y(t - 1) into LSTMUnitOp for all the gates and outputs
    • 4.9: get the outputs c(t) and y(t)
  5. Alias LSTM/LSTM/hidden_t_prev_states to LSTM/hidden_t_all, i.e. y(t) for {t | 0 <= t < sequenceLen}
  6. Alias LSTM/LSTM/hidden_t_prev_states (offset -1, the last time step) to LSTM/hidden_t_last for the backward pass
  7. Alias LSTM/LSTM/cell_t_prev_states to LSTM/cell_t_all (its further use is unclear from this example)
  8. Alias LSTM/LSTM/cell_t_prev_states to LSTM/cell_t_last for the backward pass
  9. Calculate o(t) from y(t) for {t | 0 <= t < sequenceLen}; the function is FC + Softmax (see the sketch after this list)
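Putting the forward pass together, one step can be summarized as below. This is a sketch; W_o and b_o stand for the parameters of the output FC layer and are illustrative names, not blob names from rnn.cpp:

    \[
    \begin{aligned}
      z_t &= W x_t + R\,y_{t-1} + b \\
      (y_t,\,c_t) &= \mathrm{LSTMUnit}(z_t,\,y_{t-1},\,c_{t-1}) \\
      o_t &= \mathrm{softmax}(W_o\,y_t + b_o)
    \end{aligned}
    \]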

Backward Pass

Backward propagation:

backward prop

The output of SoftmaxGradient is the accumulation of the loss gradients over all time steps.

  1. Get δy for all the time steps
  2. Recurrent gradient at time step t
    • 2.0 update the timestep blob to t
    • 2.1 Get y(t)
    • 2.2 Get y(t + 1)
    • 2.3 Get c(t)
    • 2.4 Get c(t + 1)
    • 2.5 Get δy(t + 1)
    • 2.6 Get the address of δy(t)
    • 2.7 Get δc(t + 1)
    • 2.8 Get the address of δc(t)
    • 2.9 Get the address of δz(t)
    • 2.10 Get the part of δy(t + 1) coming from the upper layer and add it in
    • 2.11 LSTMUnitGradient outputs δy(t), δc(t) and δz(t)
    • 2.12 Get the part of δy(t) coming from δ(R * y)/δy (the FcGradient)
    • 2.13 Add it to the original δy(t) and map it into LSTM/LSTM/hidden_t_prev_states_grad for step (t - 1)
    • 2.14 Compute δR at step t and accumulate it
    • 2.15 Compute δRb at step t and accumulate it
  3. Once all δz(t) have been calculated, compute δW and δWb (see the sketch after this list)
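
Summed over time, the accumulated weight gradients from steps 2.14, 2.15 and 3 take the usual BPTT form. A sketch, using the pre-activation z(t) = W x(t) + R y(t - 1) + b from the forward pass:

    \[
    \begin{aligned}
      \delta R &= \sum_{t} \delta z_t \, y_{t-1}^{\top}, & \delta R_b &= \sum_{t} \delta z_t, \\
      \delta W &= \sum_{t} \delta z_t \, x_t^{\top},     & \delta W_b &= \sum_{t} \delta z_t
    \end{aligned}
    \]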

Update

update