MuZero
Ideas
- Learn a model (MDP) and the corresponding value/policy networks; it is an end-to-end algorithm.
- A representation network is learned to encode raw observations.
- A model of transitions and rewards is learned on top of the representation network's output.
- The learned model is not for decoding back to the original observation; its purpose is to make the value/policy network better.
- The representation is built from a sequence of observations, so it suits both attention-based and recurrent state encodings.
- It treats illegal actions and terminal states uniformly during learning.
- It is promising for learning in real-world environments and in partially observable problems.
Training
Data Collection
Details
pUCT
When selecting among expanded children, the rule is (a code sketch follows the notes below):
$$a^k = \arg\max_a \left[ Q(s,a) + P(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\left( c_1 + \log\frac{\sum_b N(s,b) + c_2 + 1}{c_2} \right) \right]$$
- Both Q and P carry weight: if either one looks promising, the action deserves more trials.
- Q is an unbounded scalar while $0 \le P \le 1$, so the P term has to be scaled to the range of Q for it to have an effect.
- If N(s,a) is relatively small, the action deserves more exploration.
- It is in fact $c_1$ that dominates the scale of the P term.
- $\log\frac{\sum_b N(s,b) + c_2 + 1}{c_2} = \log\left(\frac{\sum_b N(s,b)}{c_2} + 1 + \frac{1}{c_2}\right) > 0$. The larger $\sum_b N(s,b)$ is, the more often s has been visited, and the more trust is placed in P.
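A minimal sketch of this selection rule in Python. The node attribute names (`visit_count`, `prior`, `q_value`) are assumptions; the `c1`/`c2` defaults follow the MuZero paper.

```python
import math

# A minimal sketch of the pUCT selection rule above. Node attributes
# (visit_count, prior, q_value) are assumed names; c1/c2 defaults follow
# the MuZero paper (c1 = 1.25, c2 = 19652).

def puct_score(parent_visits, child_visits, prior, q_value, c1=1.25, c2=19652):
    # P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
    exploration = prior * math.sqrt(parent_visits) / (1 + child_visits)
    # ... * (c1 + log((sum_b N(s,b) + c2 + 1) / c2))
    exploration *= c1 + math.log((parent_visits + c2 + 1) / c2)
    return q_value + exploration

def select_child(children):
    """children: dict mapping action -> node with the attributes above."""
    parent_visits = sum(c.visit_count for c in children.values())
    return max(
        children,
        key=lambda a: puct_score(parent_visits,
                                 children[a].visit_count,
                                 children[a].prior,
                                 children[a].q_value),
    )
```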
Dirichlet Noise
- The sum of P is still 1 after adding the noise.
- It favors no particular action (see the sketch after this list).
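A minimal sketch of mixing Dirichlet noise into the root priors, assuming the priors are stored as a dict from action to P(s,a); the α and mixing fraction defaults are the board-game settings from the paper and are environment-dependent.

```python
import numpy as np

# A minimal sketch: mix Dirichlet noise into the root priors.
# `priors` is assumed to be a dict action -> P(s,a) that sums to 1.

def add_dirichlet_noise(priors, alpha=0.3, frac=0.25):
    actions = list(priors.keys())
    noise = np.random.dirichlet([alpha] * len(actions))
    # Both the priors and the noise sum to 1, so the convex mixture does too,
    # and the noise is symmetric across actions (no action is favored a priori).
    return {a: (1 - frac) * priors[a] + frac * n
            for a, n in zip(actions, noise)}
```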
TD(lambda)
In the backup phase, Q accumulates a TD(λ)-style return (sketched in code after the equations):
$$G^k = \sum_{\tau=0}^{l-1-k} \gamma^{\tau} r_{k+1+\tau} + \gamma^{l-k} v^l$$
$$Q_{\mathrm{sum}}(s^{k-1}, a^k) \mathrel{+}= G^k$$
$$N(s^{k-1}, a^k) \mathrel{+}= 1$$
$$Q(s^{k-1}, a^k) := Q_{\mathrm{sum}}(s^{k-1}, a^k) \,/\, N(s^{k-1}, a^k)$$
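A minimal sketch of this backup, assuming each node on the search path stores the predicted reward of the transition into it plus running `value_sum`/`visit_count` statistics; the discount value is illustrative.

```python
# A minimal sketch of the backup phase, mirroring the equations above.
# Assumes each node in `search_path` (root .. leaf) stores the reward predicted
# on the transition into it, plus running value_sum / visit_count statistics.

def backup(search_path, leaf_value, discount=0.997):
    value = leaf_value                           # G^l = v^l at the leaf
    for node in reversed(search_path):
        node.value_sum += value                  # Q_sum(s^{k-1}, a^k) += G^k
        node.visit_count += 1                    # N(s^{k-1}, a^k) += 1
        # Q(s^{k-1}, a^k) is read back as node.value_sum / node.visit_count.
        value = node.reward + discount * value   # G^{k-1} = r_k + gamma * G^k
```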
Temperature
The estimate of π is based on $N^{1/T}$ rather than the raw visit count N:
$$P_a = \frac{N(a)^{1/T}}{\sum_b N(b)^{1/T}}$$
The larger T is, the less greedy and less deterministic π becomes.
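A minimal sketch of temperature-based action selection from root visit counts; the dict layout is an assumption.

```python
import numpy as np

# A minimal sketch: sample an action from visit counts softened by temperature T.
# `visit_counts` is assumed to be a dict action -> N(a).

def select_action(visit_counts, temperature=1.0):
    actions = list(visit_counts.keys())
    counts = np.array([visit_counts[a] for a in actions], dtype=np.float64)
    if temperature == 0:
        return actions[int(np.argmax(counts))]    # fully greedy
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()                          # P_a = N(a)^{1/T} / sum_b N(b)^{1/T}
    return actions[np.random.choice(len(actions), p=probs)]
```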
Network Training
There are three networks (a minimal interface sketch follows the list):
- Representation: Observation -> State
- Dynamics: (Previous State, Action) -> (Reward, Next State)
- Value/Policy: State -> (Policy, Value)
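A minimal PyTorch sketch of the three interfaces as simple MLPs. The layer sizes, the `MuZeroNets` class name, and the one-hot action encoding are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# A minimal sketch of the three networks as simple MLPs. Layer sizes, the
# class name, and the one-hot action encoding are illustrative assumptions.

class MuZeroNets(nn.Module):
    def __init__(self, obs_dim, action_dim, state_dim=64):
        super().__init__()
        # Representation: Observation -> State
        self.representation = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, state_dim))
        # Dynamics: (State, Action) -> (Reward, Next State)
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim + 1))
        # Value/Policy: State -> (Policy logits, Value)
        self.prediction = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim + 1))

    def initial_inference(self, obs):
        state = self.representation(obs)
        out = self.prediction(state)
        return state, out[..., :-1], out[..., -1]            # state, policy, value

    def recurrent_inference(self, state, action_onehot):
        out = self.dynamics(torch.cat([state, action_onehot], dim=-1))
        next_state, reward = out[..., :-1], out[..., -1]
        pred = self.prediction(next_state)
        return next_state, reward, pred[..., :-1], pred[..., -1]
```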
The State produced by the Representation network or the Dynamics network is not meant for input compression, and it is not decoded back into an Observation. It is used to extract the features that matter for Value/Policy prediction, so there is no loss term between the raw Observation and the State. The loss function is:
$$l_t(\theta) = \sum_{k=0}^{K} \left[ L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) \right] + \mathrm{Reg}_{L2}, \quad K = \text{unroll steps}$$
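A minimal sketch of the unrolled loss, assuming the `MuZeroNets` interface sketched above. Plain MSE is used for the reward/value terms for simplicity (the paper uses categorical representations), and the L2 term is left to the optimizer's weight decay.

```python
import torch.nn.functional as F

# A minimal sketch of the unrolled loss for one batch, assuming the MuZeroNets
# interface sketched above. target_u, target_z, target_pi are indexed by unroll
# step k = 0..K; actions[k] is the one-hot action taken at step k.

def policy_loss(logits, target_pi):
    # Cross-entropy against the visit-count distribution pi.
    return -(target_pi * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def muzero_loss(nets, obs, actions, target_u, target_z, target_pi, K):
    state, logits, value = nets.initial_inference(obs)
    loss = F.mse_loss(value, target_z[0]) + policy_loss(logits, target_pi[0])
    for k in range(1, K + 1):
        state, reward, logits, value = nets.recurrent_inference(state, actions[k - 1])
        loss = loss + F.mse_loss(reward, target_u[k]) \
                    + F.mse_loss(value, target_z[k]) \
                    + policy_loss(logits, target_pi[k])
    # The L2 regularizer is typically handled via the optimizer's weight_decay.
    return loss
```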
Target
- Select a sample at time step t; K = unroll steps (a code sketch follows this list).
- $[obs_{t:t+K}, action_{t:t+K}, u_{t:t+K}]$ was stored in the replay buffer during the Data Collection phase.
- Get $z_t = \sum_{i=0}^{K-1} \gamma^i u_{t+i} + \gamma^K v_{t+K}$. The $\gamma^K v_{t+K}$ term is looked up from the transition and reward records calculated in the Data Collection phase.
- $u_t$ was collected in the Data Collection phase.
- $\pi_t$ := the ratio of visit counts collected in the Data Collection phase.
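A minimal sketch of the value target for one sample, assuming `rewards[t + i]` holds $u_{t+i}$ and `root_values[t + K]` holds the value recorded at Data Collection time; the discount value is illustrative.

```python
# A minimal sketch of the value target z_t, assuming rewards[t + i] = u_{t+i}
# and root_values[t + K] = v_{t+K} recorded during Data Collection.

def value_target(rewards, root_values, t, K, discount=0.997):
    z = sum((discount ** i) * rewards[t + i] for i in range(K))
    z += (discount ** K) * root_values[t + K]     # bootstrapped tail value
    return z
```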
Output
Reanalysis
MuZero Reanalyze revisits past time steps and re-executes its search with the latest model parameters. That is, it selects samples $(Obs_t, Reward_t, Action_t)$ and their rollout steps $(Obs_{t+i}, Reward_{t+i}, Action_{t+i}: i = 0 \dots K)$ from the replay buffer, then re-computes their network outputs and re-estimates the corresponding target values. For the target values (a sketch in code follows the list):
- Reward (u) remains untouched, since there is no actual replay in the real environment.
- Value (z) is re-calculated: the truncated bootstrap $v_{t+K}$ is replaced by the output of the network with current parameters.
- Policy (π) is re-calculated by the same algorithm used in Data Collection, but with the updated networks. Each time step's input ($Obs_{t+i}$) is processed end-to-end independently: $obs_{t+i} \rightarrow state_{t+i} \rightarrow \text{MCTS search}_{t+i} \rightarrow \text{target policy}_{t+i}$.
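A minimal sketch of reanalysis for one buffered sample, assuming a hypothetical `run_mcts(nets, obs)` helper that returns the root visit-count distribution, plus the `MuZeroNets` sketch above; the sample layout is an assumption.

```python
# A minimal sketch of reanalysis for one buffered sample. run_mcts(nets, obs)
# is a hypothetical helper returning the root visit-count distribution; nets
# follows the MuZeroNets sketch above.

def reanalyze(sample, nets, K, discount=0.997):
    # Rewards u_{t..t+K} stay untouched (no real replay in the environment).
    # Policy targets: re-run the search on each stored observation independently.
    fresh_pi = [run_mcts(nets, obs) for obs in sample.observations[:K + 1]]
    # Value target: rebuild z_t, replacing the truncated bootstrap v_{t+K}
    # with the current network's prediction.
    _, _, v_boot = nets.initial_inference(sample.observations[K])
    z_t = sum((discount ** i) * u for i, u in enumerate(sample.rewards[:K]))
    z_t = z_t + (discount ** K) * v_boot
    return fresh_pi, z_t
```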
This is clearly off-policy training, because:
- Values are calculated by the updated network.
- The rollout steps were not generated by the current policy; they come from episodes played by networks different from the current ones.
If the part of the episode starting at $Obs_t$ were played under the guidance of the current network (θ), the resulting $Obs_{t+1}, Obs_{t+2}, \dots$ could be far from those stored in the replay buffer. This introduces extra loss because:
- The algorithm trains the dynamics network indirectly through the policy/value loss.
- The algorithm considers both one-step and multi-step prediction.
But it is presumably still worthwhile, because:
- It is data-efficient.
- There is a hidden policy, the rollout policy, and this may benefit rollout-policy training.
- The rollout policy matters less; rollouts often help regardless of which policy is used.
- It helps avoid overfitting.
Difference from RNN
It is not a complicated recurrent state-space model (RSSM), and it is simpler than an LSTM/GRU. It is just a plain recurrent process.
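A minimal sketch of that recurrence, assuming the `MuZeroNets` interface above: the dynamics network is simply applied to its own output step by step, with no gating as in an LSTM/GRU.

```python
# A minimal sketch: the "simple recurrent process" is just the dynamics network
# applied to its own output step by step (no gates as in LSTM/GRU).
# Assumes the MuZeroNets interface sketched earlier; actions are one-hot tensors.

def unroll(nets, obs, actions):
    state, _, _ = nets.initial_inference(obs)
    states = [state]
    for a in actions:
        state, reward, policy, value = nets.recurrent_inference(state, a)
        states.append(state)
    return states
```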
Reference
- Dirichlet noise
- Julian Schrittwieser
- MuZero Rollout
- MuZero-Pytorch
- Introduction by Sutton
- [UPenn CIS 522 Guest Lecture April 2020](https://www.seas.upenn.edu/~cis522/slides/CIS522_Lecture12T.pdf)