Engineering, 07.03.2020 02:46 lukeperry

Show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

Answers: 2

Answers

Answer from: vondah4014

Explanation:

MDPs

MDPs can be formulated with a reward function R(s) that depends only on the state, R(s, a) that also depends on the action taken, or R(s, a, s’) that depends on the action taken and the outcome state.
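For reference, the three reward conventions lead to Bellman equations of the following standard forms. This is a sketch to make the targets of the transformations explicit; T is the transition model, γ the discount, and U the utility, matching the symbols used in the rest of this answer.

```latex
\begin{align*}
R(s):\quad & U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s') \\
R(s, a):\quad & U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big] \\
R(s, a, s'):\quad & U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma\, U(s') \big]
\end{align*}
```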

The task is to show how an MDP with a reward function R(s, a, s’) can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

One solution is to define a ’pre-state’ pre(s, a, s’) for every s, a, s’ such that executing a in s leads not to s’ but to pre(s, a, s’). From the pre-state there is only one action b, which always leads to s’. Let the new MDP have transition model T’, reward R’, and discount γ’:

$$T'(s, a, \mathrm{pre}(s, a, s')) = T(s, a, s')$$

$$T'(\mathrm{pre}(s, a, s'), b, s') = 1$$

$$R'(s, a) = 0$$

$$R'(\mathrm{pre}(s, a, s'), b) = \gamma^{-1/2} R(s, a, s')$$

$$\gamma' = \gamma^{1/2}$$
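The construction can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the dictionary layout, the function name `add_pre_states`, the tuple encoding of pre-states, and the toy numbers are all assumptions made for the example. Pre-states for transitions with zero probability are skipped, since they can never be reached.

```python
from itertools import product


def add_pre_states(states, actions, T, R, gamma):
    """Build the pre-state MDP from T[(s, a, s2)] and R[(s, a, s2)].

    Returns (T_new, R_new, gamma_new); R_new is keyed by (state, action).
    The label "b" for the single pre-state action is a hypothetical name.
    """
    gamma_new = gamma ** 0.5                      # gamma' = gamma^(1/2)
    T_new, R_new = {}, {}
    for s, a in product(states, actions):
        R_new[(s, a)] = 0.0                       # R'(s, a) = 0
        for s2 in states:
            p = T.get((s, a, s2), 0.0)
            if p == 0.0:
                continue                          # unreachable pre-state, skip it
            pre = ("pre", s, a, s2)               # hypothetical encoding of pre(s, a, s')
            T_new[(s, a, pre)] = p                # T'(s, a, pre) = T(s, a, s')
            T_new[(pre, "b", s2)] = 1.0           # T'(pre, b, s') = 1
            R_new[(pre, "b")] = R[(s, a, s2)] / gamma_new  # gamma^(-1/2) R(s, a, s')
    return T_new, R_new, gamma_new


# Toy two-state MDP; the numbers are made up for illustration.
states, actions = ["x", "y"], ["go", "stay"]
T = {("x", "go", "y"): 0.8, ("x", "go", "x"): 0.2, ("x", "stay", "x"): 1.0,
     ("y", "go", "x"): 1.0, ("y", "stay", "y"): 1.0}
R = {("x", "go", "y"): 5.0, ("x", "go", "x"): -1.0, ("x", "stay", "x"): 0.0,
     ("y", "go", "x"): 1.0, ("y", "stay", "y"): 2.0}
T_new, R_new, gamma_new = add_pre_states(states, actions, T, R, gamma=0.81)
```

Every reward in the new MDP is indexed by a (state, action) pair only, which is the form the question asks for.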

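Then, using pre as shorthand for pre(s, a, s’), the Bellman equation of the new MDP, expanded over two steps, is:

$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \Big( \max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s')\, U(s') \Big] \Big) \Big]$$

Substituting R’(s, a) = 0, T’(s, a, pre) = T(s, a, s’), T’(pre, b, s’) = 1 and R’(pre, b) = γ^(−1/2) R(s, a, s’) recovers the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]$$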
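The equivalence derived above can also be checked numerically. The sketch below runs value iteration on a toy MDP in both forms and compares the utilities and greedy policies of the original states. The function names, the nested-dictionary layout, and the toy numbers are assumptions made for illustration only.

```python
def solve_original(S, A, T, R, gamma, iters=500):
    """Value iteration for U(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma U(s')]."""
    U = {s: 0.0 for s in S}
    for _ in range(iters):
        Q = {s: {a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * U[s2]) for s2 in S)
                 for a in A} for s in S}
        U = {s: max(Q[s].values()) for s in S}
    return U, {s: max(A, key=Q[s].get) for s in S}


def solve_pre_state(S, A, T, R, gamma, iters=500):
    """Value iteration in the pre-state MDP: R'(s,a) = 0, discount gamma^(1/2)."""
    g = gamma ** 0.5
    U = {s: 0.0 for s in S}
    for _ in range(iters):
        # Pre-state: single action b, reward gamma^(-1/2) R(s,a,s'), then s' for sure.
        V = {(s, a, s2): R[s][a][s2] / g + g * U[s2]
             for s in S for a in A for s2 in S}
        # Original state: zero reward, then pre(s,a,s') with probability T(s,a,s').
        Q = {s: {a: 0.0 + g * sum(T[s][a][s2] * V[(s, a, s2)] for s2 in S)
                 for a in A} for s in S}
        U = {s: max(Q[s].values()) for s in S}
    return U, {s: max(A, key=Q[s].get) for s in S}


# Toy two-state MDP; the numbers are made up for illustration.
S, A = ["x", "y"], ["go", "stay"]
T = {"x": {"go": {"x": 0.2, "y": 0.8}, "stay": {"x": 1.0, "y": 0.0}},
     "y": {"go": {"x": 1.0, "y": 0.0}, "stay": {"x": 0.0, "y": 1.0}}}
R = {"x": {"go": {"x": -1.0, "y": 5.0}, "stay": {"x": 0.0, "y": 0.0}},
     "y": {"go": {"x": 1.0, "y": 0.0}, "stay": {"x": 0.0, "y": 2.0}}}

U_orig, pi_orig = solve_original(S, A, T, R, gamma=0.81)
U_pre, pi_pre = solve_pre_state(S, A, T, R, gamma=0.81)
```

On this example the two solvers should agree on the utilities of the original states up to floating-point rounding, and on the greedy action in each state.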

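Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

Similar to the pre-state construction, create a state post(s, a) for every s, a such that executing a in s always leads to post(s, a), and the single action b available in post(s, a) leads to s’ with probability T(s, a, s’):

$$T'(s, a, \mathrm{post}(s, a)) = 1$$

$$T'(\mathrm{post}(s, a), b, s') = T(s, a, s')$$

$$R'(s) = 0$$

$$R'(\mathrm{post}(s, a)) = \gamma^{-1/2} R(s, a)$$

$$\gamma' = \gamma^{1/2}$$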

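Then, using post as shorthand for post(s, a), the Bellman equation of the new MDP, expanded over two steps, is:

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \Big( R'(\mathrm{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathrm{post}, b, s')\, U(s') \Big] \Big) \Big]$$

Substituting R’(s) = 0, T’(s, a, post) = 1, R’(post) = γ^(−1/2) R(s, a) and T’(post, b, s’) = T(s, a, s’) recovers the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]$$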
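As with the pre-state case, the post-state construction can be checked numerically. The sketch below is an illustration under assumed names and toy numbers (none of them come from the original answer): it compares value iteration on an R(s, a) MDP with value iteration on the transformed MDP, whose rewards depend on the state only.

```python
def solve_rsa(S, A, T, R, gamma, iters=500):
    """Value iteration for U(s) = max_a [R(s,a) + gamma sum_s' T(s,a,s') U(s')]."""
    U = {s: 0.0 for s in S}
    for _ in range(iters):
        Q = {s: {a: R[s][a] + gamma * sum(T[s][a][s2] * U[s2] for s2 in S)
                 for a in A} for s in S}
        U = {s: max(Q[s].values()) for s in S}
    return U, {s: max(A, key=Q[s].get) for s in S}


def solve_post_state(S, A, T, R, gamma, iters=500):
    """Value iteration in the post-state MDP: R'(s) = 0, discount gamma^(1/2)."""
    g = gamma ** 0.5
    U = {s: 0.0 for s in S}
    for _ in range(iters):
        # Post-state: state reward gamma^(-1/2) R(s,a); the single action b
        # then samples s' from T(s,a,.).
        W = {s: {a: R[s][a] / g + g * sum(T[s][a][s2] * U[s2] for s2 in S)
                 for a in A} for s in S}
        # Original state: zero reward; action a leads to post(s,a) with certainty.
        U = {s: 0.0 + g * max(W[s].values()) for s in S}
    return U, {s: max(A, key=W[s].get) for s in S}


# Toy two-state MDP; the numbers are made up for illustration.
S, A = ["x", "y"], ["go", "stay"]
T = {"x": {"go": {"x": 0.2, "y": 0.8}, "stay": {"x": 1.0, "y": 0.0}},
     "y": {"go": {"x": 1.0, "y": 0.0}, "stay": {"x": 0.0, "y": 1.0}}}
R = {"x": {"go": 3.8, "stay": 0.0}, "y": {"go": 1.0, "stay": 2.0}}

U_rsa, pi_rsa = solve_rsa(S, A, T, R, gamma=0.81)
U_post, pi_post = solve_post_state(S, A, T, R, gamma=0.81)
```

The utilities of the original states and the greedy policies should again coincide up to floating-point rounding.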

Answer from: Quest

How is this a question? In order for me to answer it, I have to understand the question.
