
MuZero

Ideas

pUCT

When selecting among expanded children, the rule is:

$$a^k = \arg\max_a \left[ Q(s,a) + P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \cdot \left( c_1 + \log\frac{\sum_b N(s,b) + c_2 + 1}{c_2} \right) \right]$$

During backup along the search path, the cumulative return and the edge statistics are updated as:

$$G^k = \sum_{\tau=0}^{l-1-k} \gamma^{\tau} r_{k+1+\tau} + \gamma^{l-k} v^l$$

$$\mathrm{QSum}(s^{k-1}, a^k) \mathrel{+}= G^k$$

$$N(s^{k-1}, a^k) \mathrel{+}= 1$$

$$Q(s^{k-1}, a^k) := \frac{\mathrm{QSum}(s^{k-1}, a^k)}{N(s^{k-1}, a^k)}$$
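Below is a minimal Python sketch of these two rules, assuming the edge statistics P, N, and QSum are stored on the child nodes and using c1 = 1.25 and c2 = 19652 as in the MuZero paper; the class and function names are illustrative, not the official implementation.

```python
import math

class Node:
    """One child/edge in the search tree; stores P(s, a), N(s, a), and QSum(s, a)."""

    def __init__(self, prior):
        self.prior = prior          # P(s, a)
        self.visit_count = 0        # N(s, a)
        self.q_sum = 0.0            # QSum(s, a)
        self.children = {}          # action -> Node

    def q_value(self):
        # Q(s, a) := QSum(s, a) / N(s, a), defaulting to 0 for unvisited edges
        return self.q_sum / self.visit_count if self.visit_count > 0 else 0.0


def select_child(node, c1=1.25, c2=19652):
    """a^k = argmax_a [Q + P * sqrt(sum_b N) / (1 + N) * (c1 + log((sum_b N + c2 + 1) / c2))]."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        exploration = (child.prior
                       * math.sqrt(total_visits) / (1 + child.visit_count)
                       * (c1 + math.log((total_visits + c2 + 1) / c2)))
        score = child.q_value() + exploration
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def backup(search_path, rewards, leaf_value, gamma=0.997):
    """Update edge statistics along the path s^0 ... s^l with G^k.

    search_path: nodes [s^0 (root), s^1, ..., s^l]; s^k was reached via edge a^k
    rewards:     rewards[k] = r^k received on entering s^k (rewards[0] is unused)
    leaf_value:  v^l predicted at the leaf
    """
    l = len(search_path) - 1
    for k in range(l, 0, -1):
        # G^k = sum_{tau=0}^{l-1-k} gamma^tau * r_{k+1+tau} + gamma^(l-k) * v^l
        g = sum(gamma ** tau * rewards[k + 1 + tau] for tau in range(l - k)) \
            + gamma ** (l - k) * leaf_value
        # QSum(s^{k-1}, a^k) += G^k and N(s^{k-1}, a^k) += 1, stored on the child node
        search_path[k].q_sum += g
        search_path[k].visit_count += 1
```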

Temperature

The estimate of π is based on N^{1/T} instead of the raw visit counts N.

$$P_a = \frac{N(a)^{1/T}}{\sum_b N(b)^{1/T}}$$

The larger T is, the less greedy and less deterministic π becomes.
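A minimal sketch of this temperature-scaled sampling with illustrative visit counts; the function name and data layout are assumptions.

```python
import numpy as np

def policy_from_visits(visit_counts, T=1.0):
    """P_a = N(a)^(1/T) / sum_b N(b)^(1/T); T = 0 is treated as fully greedy."""
    actions = list(visit_counts.keys())
    counts = np.array([visit_counts[a] for a in actions], dtype=np.float64)
    if T == 0:
        probs = (counts == counts.max()).astype(np.float64)  # all mass on the max-count action(s)
    else:
        probs = counts ** (1.0 / T)
    probs /= probs.sum()
    return actions, probs

# Larger T flattens the distribution (less greedy, less deterministic):
actions, probs = policy_from_visits({0: 30, 1: 10, 2: 10}, T=1.0)   # ~[0.60, 0.20, 0.20]
actions, probs = policy_from_visits({0: 30, 1: 10, 2: 10}, T=2.0)   # flatter, ~[0.46, 0.27, 0.27]
action = np.random.choice(actions, p=probs)
```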

Network Training

There are three networks:

  • Representation network h: maps an observation to a hidden state.
  • Dynamics network g: maps a hidden state and an action to the next hidden state and a reward.
  • Prediction network f: maps a hidden state to a value and a policy.

The state generated by the Representation network or the Dynamics network is not used for input compression, and it is never decoded back into an observation. It only serves as features that feed the value/policy prediction, so there is no loss term between the raw observation and the state. The loss function is then:

$$l_t(\theta) = \sum_{k=0}^{K} \left[ L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) \right] + \mathrm{Reg}_{l2}, \qquad K = \text{unroll steps}$$
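A minimal PyTorch-style sketch of this unrolled loss, assuming `representation`, `dynamics`, and `prediction` are the three networks as callable modules, value/reward targets are scalars, and policy targets are probability vectors; the names, shapes, and the alignment of the reward term with the unrolled steps are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def muzero_loss(representation, dynamics, prediction,
                obs_t, actions, u_targets, z_targets, pi_targets,
                l2_weight=1e-4):
    """Unrolled loss for one sample t; K = len(actions) unroll steps."""
    K = len(actions)
    state = representation(obs_t)                    # s^0 = h(obs_t)
    loss = 0.0
    for k in range(K + 1):
        value, policy_logits = prediction(state)     # (v_t^k, p_t^k) = f(s^k)
        loss = loss + (value - z_targets[k]).pow(2).mean()        # L^v
        log_probs = F.log_softmax(policy_logits, dim=-1)
        loss = loss - (pi_targets[k] * log_probs).sum()           # L^p (cross-entropy)
        if k < K:
            state, reward = dynamics(state, actions[k])           # g(s^k, a) -> s^{k+1}, r
            loss = loss + (reward - u_targets[k]).pow(2).mean()   # L^r
    # L2 regularization over all parameters (the Reg_l2 term)
    for module in (representation, dynamics, prediction):
        for p in module.parameters():
            loss = loss + l2_weight * p.pow(2).sum()
    return loss
```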

Target

  1. Select a sample at time step t; let K = unroll steps.
  2. [obs_{t:t+K}, action_{t:t+K}, u_{t:t+K}] has been stored in the replay buffer in the Data Collection phase.
  3. Compute $z_t = \sum_{i=t}^{t+K-1} \gamma^{\,i-t} u_i + \gamma^K v_{t+K}$. The bootstrap term $\gamma^K v_{t+K}$ is looked up from the transition and reward tables computed in the Data Collection phase. See the sketch after this list.
  4. u_t has been collected in the Data Collection phase.
  5. π_t := the ratio of visit counts collected in the Data Collection phase.
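A minimal sketch of step 3, assuming `rewards` holds the stored u_t … u_{t+K-1} and `bootstrap_value` is the v_{t+K} recorded in the Data Collection phase; the names are illustrative.

```python
def value_target(rewards, bootstrap_value, gamma=0.997):
    """z_t = sum_{i=t}^{t+K-1} gamma^(i-t) * u_i + gamma^K * v_{t+K}."""
    K = len(rewards)                      # rewards = [u_t, ..., u_{t+K-1}]
    z = sum(gamma ** i * u for i, u in enumerate(rewards))
    return z + gamma ** K * bootstrap_value

# Example: K = 3 stored rewards plus the bootstrap value from the search at t+K.
z_t = value_target(rewards=[1.0, 0.0, 2.0], bootstrap_value=5.0, gamma=0.997)
```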

Reanalysis

MuZero Reanalyze revisits past time steps and re-executes its search with the latest model parameters. That is, it selects samples (Obs_t, Reward_t, Action_t) and their rollout steps (Obs_{t+i}, Reward_{t+i}, Action_{t+i}: i = 0…K) from the replay buffer, then recalculates their network outputs and re-estimates the corresponding target values. For the target values (a minimal sketch of the re-search loop follows the list below):

  • Reward (u) remains untouched, as there is no actual replay in the real environment.
  • Value (z) is recalculated, with the bootstrap value v_{t+K} replaced by the output of the network with the current parameters.
  • Policy (π) is recalculated by the same algorithm executed in Data Collection, but with the updated networks. Each time step's input (Obs_{t+i}) is processed end to end and independently, i.e. the sequence obs_{t+i} → state_{t+i} → MCTS search_{t+i} → target policy output_{t+i}.
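A minimal sketch of this re-search loop, assuming `traj` is a sampled trajectory slice from the replay buffer, `networks` holds the latest parameters, and `run_mcts`, `visit_distribution`, and `value` are hypothetical helpers standing in for the search used during Data Collection.

```python
def reanalyze_sample(traj, networks, run_mcts):
    """Re-run the search on each stored observation with the latest networks."""
    fresh_policies, fresh_root_values = [], []
    for obs in traj.observations:         # Obs_t, ..., Obs_{t+K}
        # Each step is processed end to end and independently:
        # obs_{t+i} -> state_{t+i} -> MCTS search_{t+i} -> target policy output_{t+i}
        state = networks.representation(obs)
        root = run_mcts(state, networks)              # same search as in Data Collection
        fresh_policies.append(root.visit_distribution())
        fresh_root_values.append(root.value())        # fresher bootstrap value for z
    # u stays as stored; z is rebuilt with the fresh bootstrap values; pi is replaced
    # by the visit-count distributions from the re-executed searches.
    return traj.rewards, fresh_root_values, fresh_policies
```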

This is obviously off-policy training, as:

  1. The values have been calculated by the updated network.
  2. The rollout steps were not generated by the current policy; they come from episodes conducted by networks different from the current ones. If the part of an episode starting at Obs_t were replayed under the guidance of the current network (θ), the resulting Obs_{t+1}, Obs_{t+2}, … might be far from those extracted from the replay buffer. That introduces extra loss because:
    • The algorithm trains the dynamics network indirectly through the policy/value loss.
    • The algorithm considers both one-step and multi-step predictions.

But it arguably still pays off, because:

  1. It is data efficient.
  2. There is a hidden policy, the rollout policy, and reanalysis may be good for training it.
  3. The rollout policy is of lesser importance; rollouts often help regardless of the policy.
  4. It helps avoid overfitting.

Difference from RNN

It is not a complicated Recurrent State-Space Model (RSSM), and it is simpler than an LSTM/GRU. It is just a plain recurrent process, as the sketch below illustrates.
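A minimal sketch of that recurrence, assuming `representation` and `dynamics` are the h and g networks; there are no gates or cell state, only repeated application of the dynamics network.

```python
def unroll(representation, dynamics, obs_t, actions):
    """Plain recurrence: no gates, no memory cell, just repeated application of g."""
    state = representation(obs_t)             # s^0 = h(obs_t)
    states = [state]
    for a in actions:
        state, _reward = dynamics(state, a)   # s^{k+1}, r^{k+1} = g(s^k, a^{k+1})
        states.append(state)
    return states
```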
