
# Chapter04 SARSA

4.1

qπ(s,a) = ∑s’p(s’|s,a) * [r + γ * vπ(s’)]

p(s’|s,a) = 1, r = -1, γ = 1

qπ(11, down) = -1 + vπ(T) = -1 + 0 = -1

qπ(7, down) = -1 + vπ(11) = -1 + (-14) = -15
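As a numerical check, the following sketch runs iterative policy evaluation on the 4x4 gridworld under the equiprobable random policy and then reads off the two action values. The helper names (`step`, `evaluate_random_policy`, `cell_of`, `q`) are illustrative only, and it assumes the state numbers map to cells row by row with the terminal state in the top-left and bottom-right corners.

```python
# Iterative policy evaluation for the 4x4 gridworld
# (equiprobable random policy, reward -1 per step, gamma = 1).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINALS = {(0, 0), (3, 3)}


def step(cell, action):
    """Deterministic transition; a move off the grid leaves the state unchanged."""
    r, c = cell
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < 4 and 0 <= nc < 4):
        nr, nc = r, c
    return (nr, nc)


def evaluate_random_policy(theta=1e-10):
    """Sweep in place until the largest change in a sweep is below theta."""
    v = {(r, c): 0.0 for r in range(4) for c in range(4)}
    while True:
        delta = 0.0
        for s in v:
            if s in TERMINALS:
                continue
            new = sum(0.25 * (-1 + v[step(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            return v


def cell_of(state):
    """Map a state number 1..14 to its (row, col) cell (assumed row-major)."""
    return divmod(state, 4)


def q(v, state, action):
    """q(s, a) = r + gamma * v(s') for a deterministic step with r = -1, gamma = 1."""
    return -1 + v[step(cell_of(state), action)]


v = evaluate_random_policy()
print(round(q(v, 11, "down"), 3))  # close to -1
print(round(q(v, 7, "down"), 3))   # close to -15
```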

4.2

| S | 1 | 2 | 3 |
|---|---|---|---|
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | T |
| X | 15 | X | X |

v(s) = ∑aπ(a|s) ∑s’p(s’|s,a)[r + γ * vπ(s’)]

π(a|s) = 1/4, p(s’|s,a) = 1, r = -1, γ = 1

v(15) = 1/4 * [(-1 + v(12)) + (-1 + v(13)) + (-1 + v(14)) + (-1 + v(15))]

= -1 + [v(12) + v(13) + v(14) + v(15)] / 4

v(13) = -1 + [v(9) + v(12) + v(14) + v(15)] / 4

Solving these equations, with v(9) = -20, v(12) = -22 and v(14) = -14 taken from the original gridworld, gives:

v(13) = -20, v(15) = -20
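The result can be checked numerically. The sketch below is one possible way to do it, not the author's code: it builds the gridworld with the extra state 15, once with the original dynamics of state 13 and once with `down` from 13 leading to 15, and evaluates the random policy in both cases. The function names and the `"T"` label for the terminal state are assumptions made for illustration.

```python
# Gridworld plus the extra state 15 below state 13.
# States are numbered 1..15; "T" stands for the terminal state.
def build_transitions(redirect_13_down):
    """Return {state: [successor for up, down, left, right]}."""
    trans = {}
    for s in range(1, 15):
        r, c = divmod(s, 4)
        succ = []
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < 4 and 0 <= nc < 4):
                nr, nc = r, c              # bumping into a wall: stay put
            n = nr * 4 + nc
            succ.append("T" if n in (0, 15) else n)  # corner cells are terminal
        trans[s] = succ
    trans[15] = [13, 15, 12, 14]           # up, down, left, right from state 15
    if redirect_13_down:
        trans[13][1] = 15                  # second case: down from 13 leads to 15
    return trans


def evaluate(trans, theta=1e-10):
    """Iterative policy evaluation: random policy, r = -1, gamma = 1."""
    v = {s: 0.0 for s in trans}
    v["T"] = 0.0
    while True:
        delta = 0.0
        for s, succ in trans.items():
            new = sum(0.25 * (-1 + v[n]) for n in succ)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            return v


results = {}
for redirect in (False, True):
    v = evaluate(build_transitions(redirect))
    results[redirect] = (v[13], v[15])
    print(redirect, round(v[13], 3), round(v[15], 3))
```

Both cases should give values close to -20 for states 13 and 15, which matches the hand calculation.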

4.3

qπ(s,a) = Eπ[R + γG’ | s,a]

= Eπ[R + γvπ(s’) | s,a]

= Eπ[R + γ∑a’π(a’|s’) * qπ(s’,a’) | s,a]

= ∑s’p(s’|s,a) * [r + γ∑a’π(a’|s’) * qπ(s’,a’)]
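Turning the last line into an update rule gives the corresponding iterative policy-evaluation step for action values. The following is a sketch in LaTeX using the same simplified notation as above (the reward r is treated as determined by the transition); the iteration index k is introduced here for illustration.

```latex
q_{k+1}(s,a) = \sum_{s'} p(s' \mid s,a)
  \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_k(s',a') \Big]
```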

4.4

At s = 50, staking 50 gives the best value because the gambler can win in a single step with probability ph. Similarly, at s = 25, staking 25 gives the best value. Because 25 is odd (25 % 2 == 1), it cannot be halved into an integer capital, so there is no similar spike at s = 12 or s = 13.

For s = 51, the best choice is to bet 1. If the bet wins, the player moves up to s = 52; if it loses, the player falls back to s = 50, which is a good state. For s = 52, 53, …, V(s) reflects a trade-off between falling back to s = 50 after a loss and moving forward to relatively better states.
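The claims about s = 50 and s = 25 can be checked with a short value-iteration sketch. It is an illustration under stated assumptions rather than the linked code: ph = 0.4 is chosen here as an example value, the goal is 100, and the +1 reward for reaching the goal is encoded by fixing V[100] = 1. The names `gambler_value_iteration` and `stake_values` are hypothetical.

```python
def gambler_value_iteration(ph, goal=100, theta=1e-10):
    """Value iteration for the gambler's problem; returns V[0..goal]."""
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # stands in for the +1 reward on reaching the goal
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(
                ph * V[s + a] + (1 - ph) * V[s - a]
                for a in range(1, min(s, goal - s) + 1)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V


def stake_values(V, s, ph, goal=100):
    """Expected value of each admissible stake a at capital s."""
    return {
        a: ph * V[s + a] + (1 - ph) * V[s - a]
        for a in range(1, min(s, goal - s) + 1)
    }


ph = 0.4  # assumed example value
V = gambler_value_iteration(ph)
at_50 = stake_values(V, 50, ph)
print(round(V[50], 4), round(V[25], 4))
print(round(at_50[50], 4), round(max(at_50.values()), 4))
```

With ph = 0.4, V[50] should come out close to ph and V[25] close to ph squared, and staking 50 at s = 50 should match the best stake value up to the convergence tolerance.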

ref1

ref2

4.9

codes

Figure for ph = 0.25

Figure for ph = 0.55

4.10

q_{k+1}(s,a) = ∑s’p(s’|s,a) * [r + γ maxa’ q_k(s’,a’)]
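To see this update at work, here is a minimal sketch that applies it to the 4x4 gridworld used earlier in the chapter (deterministic moves, r = -1, γ = 1). The gridworld setup and the names `step` and `q_value_iteration` are assumptions for illustration; the sweep is synchronous, so each new table is computed entirely from the previous one.

```python
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 0), (3, 3)}


def step(cell, move):
    """Deterministic transition; a move off the grid leaves the state unchanged."""
    r, c = cell[0] + move[0], cell[1] + move[1]
    return (r, c) if 0 <= r < 4 and 0 <= c < 4 else cell


def q_value_iteration(gamma=1.0, theta=1e-10):
    """Repeat q_{k+1}(s,a) = r + gamma * max_a' q_k(s',a') until it stops changing."""
    cells = [(r, c) for r in range(4) for c in range(4)]
    q = {(s, a): 0.0 for s in cells for a in range(4)}
    while True:
        new_q = {}
        for s in cells:
            for a in range(4):
                if s in TERMINALS:
                    new_q[(s, a)] = 0.0  # terminal state keeps value 0
                    continue
                s2 = step(s, ACTIONS[a])
                new_q[(s, a)] = -1 + gamma * max(q[(s2, b)] for b in range(4))
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < theta:
            return q


q_star = q_value_iteration()
# Greedy state values: minus the number of steps to the nearest terminal corner.
for cell in [(0, 1), (1, 1), (1, 2)]:
    print(cell, max(q_star[(cell, a)] for a in range(4)))
```

The greedy values for the three printed cells should be -1, -2 and -3, the negated shortest-path lengths to a terminal corner.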