
# Chapter 03

3.3

Several factors:

For the episodic case, T is finite and the return G is therefore bounded, so γ can be equal to 1.
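As a quick worked sum in the book's notation: with γ = 1 the return is Gt = Rt+1 + Rt+2 + … + RT, a sum of only T − t terms, so it is finite whenever T is finite and the rewards are bounded.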

3.5

The robot will have a hard time learning from its local environment, because every action yields a reward of 0 and hence a return of 0. It will take the exploration mechanism a long time to lead it out of the maze; in the worst case, the robot gets stuck at the local optimum value of 0.

The drawback of this reward design is that the robot is never told in time that a solution with a good reward exists.
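As a rough illustration of how slow undirected exploration can be, here is a random walk on a hypothetical n × n open grid with the goal in the far corner; the grid size and episode cap are arbitrary choices, and every return before the goal is reached is exactly 0:

```python
import random

def random_walk_steps(n=8, max_steps=100_000):
    """Steps a uniformly random policy needs to reach the corner goal of an n x n grid."""
    x, y = 0, 0                          # start in one corner
    goal = (n - 1, n - 1)                # goal in the opposite corner
    for t in range(1, max_steps + 1):
        dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x = min(max(x + dx, 0), n - 1)   # bump into the walls instead of leaving the grid
        y = min(max(y + dy, 0), n - 1)
        if (x, y) == goal:
            return t                     # reward +1 only now; every earlier return was 0
    return max_steps

episodes = [random_walk_steps() for _ in range(200)]
print("average steps to reach the goal:", sum(episodes) / len(episodes))
```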

3.6

It depends on the rules of the game. If everything relevant to the outcome is captured by the state in front of the vision system, it can be an MDP.

While the vision system is broken there is no state signal, so it is not an MDP. Once it is repaired and turned back on, the agent can again interact with it as an MDP.

3.8

qπ(s, a) = E[Gt | St = s, At = a]

= E[Rt+1 + γ * Gt+1 | St = s, At = a]
= ∑s’,r p(s’, r | s, a) * [r + γ * E[Gt+1 | St+1 = s’]]
= ∑s’,r p(s’, r | s, a) * [r + γ * vπ(s’)]
= ∑s’,r p(s’, r | s, a) * [r + γ * ∑a’ π(a’ | s’) * qπ(s’, a’)]
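A quick numeric sanity check of the final line, assuming a made-up two-state, two-action MDP (the dynamics P, rewards R, policy pi, and seed below are arbitrary, not from the book): a Monte Carlo estimate of E[Gt | St = s, At = a] should agree with qπ obtained by iterating the Bellman backup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] = p(s' | s, a)
R = rng.normal(size=(n_s, n_a))                    # expected reward r(s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi[s, a] = π(a | s)

# Solve q_π by iterating the Bellman backup until it converges.
q = np.zeros((n_s, n_a))
for _ in range(1000):
    v = (pi * q).sum(axis=1)       # v_π(s') = Σ_a' π(a'|s') q_π(s', a')
    q = R + gamma * P @ v          # q_π(s,a) = r(s,a) + γ Σ_s' p(s'|s,a) v_π(s')

# Monte Carlo estimate of E[G_t | S_t = 0, A_t = 0] under π.
def rollout(s, a, horizon=200):
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s, a]
        s = rng.choice(n_s, p=P[s, a])
        a = rng.choice(n_a, p=pi[s])
        discount *= gamma
    return g

mc = np.mean([rollout(0, 0) for _ in range(5000)])
print("Bellman q_π(0,0):", q[0, 0], " Monte Carlo estimate:", mc)
```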

3.9

vπ(s) = ∑aπ(a|s) * ∑s’,rp(s’, r| s, a) * [r + γ * vπ(s’)]

π(a | s) = 0.25 ∀a

r = 0 ∀(s,a)

p(s’, r | s, a) = 1 for the neighbour s’ that action a deterministically leads to

vπ(s’) = {2.3, 0.7, -0.4, 0.4}

γ = 0.9

v(s) = π(up | s) * 1 * (0 + 0.9 * v(up)) + … + π(right | s) * 1 * (0 + 0.9 * v(right))

= 0.25 * 0.9 * (2.3 + 0.7 − 0.4 + 0.4) = 0.25 * 0.9 * 3 = 0.675 ≈ 0.7

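The same arithmetic in a few lines of Python, using the neighbour values and assumptions listed above (γ = 0.9, equiprobable actions, deterministic moves, zero reward):

```python
neighbours = [2.3, 0.7, -0.4, 0.4]   # values of the four neighbouring states
gamma, pi = 0.9, 0.25                # discount and equiprobable action probability
v_center = sum(pi * (0 + gamma * v) for v in neighbours)
print(v_center)                      # 0.675, which rounds to the +0.7 shown in the figure
```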
3.10

Gt = ∑k γ^k * (Rt+k+1 + c)

= ∑k γ^k * Rt+k+1 + c * ∑k γ^k

= ∑k γ^k * Rt+k+1 + c / (1 − γ)

Every return, and hence the value of every state and action, is therefore shifted by the same constant vc = c / (1 − γ). The constant c has no effect on the relative values of v or q and does not change which actions are preferred.
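A quick numeric check of that claim on a made-up Markov chain (the dynamics P, rewards r, and the choice c = 5 below are arbitrary): adding c to every reward shifts every state value by exactly c / (1 − γ).

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, gamma, c = 4, 0.9, 5.0

# State-to-state dynamics and expected rewards under some fixed policy (made up).
P = rng.dirichlet(np.ones(n_s), size=n_s)      # P[s, s'] = p(s' | s) under the policy
r = rng.normal(size=n_s)                       # expected one-step reward in each state

def values(rewards):
    # v = r + γ P v  =>  v = (I - γP)^(-1) r
    return np.linalg.solve(np.eye(n_s) - gamma * P, rewards)

shift = values(r + c) - values(r)
print(shift, "vs", c / (1 - gamma))            # every entry equals c / (1 - γ) = 50.0
```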

3.11

It affects the values relative to their initial values. Take the maze case as an example: if the original rewards at state s are {-1, -2, -3, -4} and the initial action-value estimates are {0, 0, 0, 0}, then in the early rounds the agent tries all actions and finds -1 to be the best choice. If we add 4 to all rewards, the rewards at s become {3, 2, 1, 0}; the agent is then more likely to get stuck on the actions with rewards {0, 1, 2}, and will only find the best choice, {3}, after a relatively long run with exploration.

It may also lengthen episodes: in the episodic case the constant adds c * (T − t) to the return, which grows with the number of remaining steps, so a positive c makes prolonging the episode look more attractive.
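A tiny illustration of that last point; the two hypothetical episodes and the constant c = 4 are made up to mirror the maze example:

```python
def episodic_return(rewards, c=0.0):
    """Total return of one episode (gamma = 1) with a constant c added to every reward."""
    return sum(r + c for r in rewards)

short_path = [-1, -1]        # reach the goal in 2 steps at -1 per step
long_path  = [-1] * 10       # wander for 10 steps before reaching the goal

for c in (0.0, 4.0):
    print(f"c={c}: short={episodic_return(short_path, c)}, long={episodic_return(long_path, c)}")
# c=0.0: short=-2, long=-10  -> the short path is preferred
# c=4.0: short=6,  long=30   -> adding the constant now favours the long path
```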

3.12

v(s) = ∑aπ(a | s) * q(s, a)

3.13

qπ(s, a) = ∑s’ p(s’ | s, a) * (r(s, a, s’) + γ * vπ(s’))
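The formulas in 3.12 and 3.13 can be cross-checked on a small made-up MDP (all numbers below are arbitrary): compute qπ from vπ via 3.13, plug it back into 3.12, and recover the same vπ.

```python
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s' | s, a)
R = rng.normal(size=(n_s, n_a))                    # expected reward r(s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # π(a | s)

# Policy evaluation: v_π = (I - γ P_π)^(-1) r_π
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

q = R + gamma * P @ v              # 3.13: q_π(s,a) = Σ_s' p(s'|s,a) (r + γ v_π(s'))
v_back = (pi * q).sum(axis=1)      # 3.12: v_π(s)  = Σ_a π(a|s) q_π(s,a)
print(np.allclose(v, v_back))      # True
```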

3.14

According to Figure 3.6:

The optimal value at each state is that of the better action, i.e., we take the action, and use the value, of whichever action is better there.

3.15

Same as above

3.16

q*(s, a) = ∑s’ p(s’ | s, a) * [r + γ * maxa’ q*(s’, a’)]  (in this case, r is determined once s’ is determined)

S = {h, l}

A = {s, w, r}

| state | action | next state | probability | reward |
| --- | --- | --- | --- | --- |
| h | s | h | p0 | r0 |
| h | s | l | 1 − p0 | r1 |
| h | w | h | p1 | r2 |
| h | w | l | 1 − p1 | r3 |
| h | r | h | p2 | r4 |
| h | r | l | 1 − p2 | r5 |
| l | s | h | p3 | r6 |
| l | s | l | 1 − p3 | r7 |
| l | w | h | p4 | r8 |
| l | w | l | 1 − p4 | r9 |
| l | r | h | p5 | r10 |
| l | r | l | 1 − p5 | r11 |

q*(h, s) = p0 * [r0 + γ * max(q*(h, s), q*(h, w), q*(h, r))] + (1 − p0) * [r1 + γ * max(q*(l, s), q*(l, w), q*(l, r))]

q*(l, w) = p4 * [r8 + γ * max(q*(h, s), q*(h, w), q*(h, r))] + (1 − p4) * [r9 + γ * max(q*(l, s), q*(l, w), q*(l, r))]
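As a sketch of how these equations pin q* down numerically, here is value iteration over the table above, with placeholder numbers substituted for p0…p5 and r0…r11 (every numeric value below is an arbitrary assumption, not the book's parameters):

```python
# Value iteration for the generic recycling-robot table above.
# All probabilities p0..p5 and rewards r0..r11 are placeholder guesses.
gamma = 0.9
states, actions = ['h', 'l'], ['s', 'w', 'r']

# p[(state, action)] = probability that the next state is 'h'  (p0..p5)
p = {('h', 's'): 0.7, ('h', 'w'): 0.9, ('h', 'r'): 1.0,
     ('l', 's'): 0.3, ('l', 'w'): 0.1, ('l', 'r'): 1.0}
# r[(state, action, next state)] = expected reward  (r0..r11)
r = {('h', 's', 'h'): 2, ('h', 's', 'l'): 2, ('h', 'w', 'h'): 1, ('h', 'w', 'l'): 1,
     ('h', 'r', 'h'): 0, ('h', 'r', 'l'): 0, ('l', 's', 'h'): -3, ('l', 's', 'l'): 2,
     ('l', 'w', 'h'): 1, ('l', 'w', 'l'): 1, ('l', 'r', 'h'): 0, ('l', 'r', 'l'): 0}

q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(500):                      # iterate the Bellman optimality backup
    q = {(s, a): sum(
            (p[s, a] if s2 == 'h' else 1 - p[s, a])
            * (r[s, a, s2] + gamma * max(q[s2, a2] for a2 in actions))
            for s2 in states)
         for s in states for a in actions}

for (s, a), val in q.items():
    print(f"q*({s},{a}) = {val:.2f}")
```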

3.17

π*(s) = argmaxa ∑s’,r p(s’, r | s, a) * (r + γ * v*(s’))
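Read as code, the equation is a one-step lookahead followed by an argmax over actions; the tiny MDP below is made up only to have something to run it on (v* is first computed by value iteration):

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s' | s, a)
R = rng.normal(size=(n_s, n_a))                    # expected reward r(s, a)

# Value iteration to obtain v*.
v = np.zeros(n_s)
for _ in range(1000):
    v = (R + gamma * P @ v).max(axis=1)

# π*(s) = argmax_a Σ_{s',r} p(s', r | s, a) (r + γ v*(s'))
pi_star = (R + gamma * P @ v).argmax(axis=1)
print("optimal action in each state:", pi_star)
```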