The agent and the environment

In a finite Markov Decision Process (MDP), there are three sets: a set of states, a set of actions, and a set of rewards.

The learner or decision maker is called the agent, and the outside system that the agent interacts with is called the environment. Every period, the agent takes an action, and the environment reacts by presenting a new state to the agent.

Each period, the environment presents a state \(S_t\) from a set \(S\). The agent selects an action \(A_t\) from the set \(A(s)\) of actions available in that state. At the next time step, the agent receives a reward \(R_{t+1}\) for the action it just took and encounters a new state \(S_{t+1}\), which is the environment's response to the agent's previous state and action.

The MDP thus gives rise to the following sequence, or trajectory:

$$ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \cdots $$
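To make the loop concrete, here is a minimal Python sketch of the agent-environment interaction; the two-state environment, its dynamics, and the random agent are all hypothetical, chosen only to show how the trajectory above is generated.

```python
import random

# Hypothetical toy environment: two states, two actions (illustrative only).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

def env_step(state, action):
    """Environment's reaction: return (next_state, reward) to the agent."""
    next_state = random.choice(STATES)          # toy transition dynamics
    reward = 1.0 if action == "right" else 0.0  # toy reward signal
    return next_state, reward

def random_agent(state):
    """Agent's policy: pick an action uniformly at random."""
    return random.choice(ACTIONS)

# Generate the trajectory S0, A0, R1, S1, A1, R2, ...
state = "s0"
trajectory = [state]
for t in range(3):
    action = random_agent(state)
    next_state, reward = env_step(state, action)
    trajectory += [action, reward, next_state]
    state = next_state

print(trajectory)  # e.g. ['s0', 'right', 1.0, 's1', 'left', 0.0, ...]
```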

State-transition probabilities

Suppose that \(S_t\) and \(R_t\), the state and reward at a given time \(t\), have well-defined discrete probability distributions that depend only on the preceding state and action. Let particular values of the random variables \(S_t\) and \(R_t\) be \(s^\prime\) and \(r\). Then the probability of observing those particular state and reward values, conditional on the preceding state and action, can be written as follows:

$$ p(s^\prime, r | s, a) =\Pr(S_t=s^\prime, R_t=r | S_{t-1}=s, A_{t-1}=a) \newline \text{for all } s^\prime, s \in S, r \in R, \text{ and } a \in A(s), \newline \text{ where } p : S \times R \times S \times A \rightarrow [0, 1]. $$

Note that $$ \sum_{s^\prime \in S} \sum_{r \in R} p(s^\prime, r | s, a) = 1, \newline \text{for all } s \in S, \text{ and } a \in A(s). $$

This simply means that for each choice of \(s\) and \(a\), the probabilities of landing in state \(s^\prime\) with reward \(r\) sum to 1 over all possible \(s^\prime\) and \(r\). The Markov property requires that the environment's response depend only on the immediately preceding state and action, not on the entire history of states and actions.
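As a rough illustration, the four-argument function \(p(s^\prime, r | s, a)\) can be stored as a table keyed by \((s, a)\); the toy dynamics below are made up, and the assertion simply checks the normalization condition above.

```python
# Hypothetical dynamics p(s', r | s, a) for a two-state MDP (illustrative only).
# Keys are (s, a) pairs; values map (s', r) outcomes to probabilities.
dynamics = {
    ("s0", "left"):  {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 0.5, ("s0", 1.0): 0.5},
}

# For every (s, a), the probabilities over (s', r) outcomes must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(outcomes.values())
    assert abs(total - 1.0) < 1e-12, f"p(.|{s},{a}) sums to {total}"
```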

From the four-argument dynamics function \(p(s^\prime, r | s, a)\), the state-transition probabilities, denoted \(p(s^\prime | s, a)\) with \(p: S \times S \times A \rightarrow [0,1]\), can be computed:

$$ p(s^\prime | s, a) = \Pr\left[S_t = s ^\prime | S_{t-1}=s, A_{t-1}=a\right] = \sum_{r \in R} p(s ^\prime, r | s, a) $$

This implies that $$ \sum_{s^\prime \in S} p(s^\prime | s, a) = 1. $$
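A small sketch of this marginalization, again over a made-up dynamics table: summing the probabilities over \(r\) for each \((s^\prime, s, a)\) yields \(p(s^\prime | s, a)\), and the result still sums to 1 over \(s^\prime\).

```python
from collections import defaultdict

# Hypothetical four-argument dynamics for a single (s, a) pair (illustrative).
dynamics = {
    ("s0", "right"): {("s1", 1.0): 0.8, ("s1", 0.0): 0.1, ("s0", 0.0): 0.1},
}

# Marginalize over the reward r to get p(s' | s, a).
transition = defaultdict(float)
for (s, a), outcomes in dynamics.items():
    for (s_next, r), prob in outcomes.items():
        transition[(s_next, s, a)] += prob

print(transition[("s1", "s0", "right")])  # 0.9 = 0.8 + 0.1
print(sum(prob for (s_next, s, a), prob in transition.items()
          if (s, a) == ("s0", "right")))  # 1.0, as required
```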

Reference

Richard S. Sutton and Andrew G. Barto, *Reinforcement Learning: An Introduction*, Second Edition, MIT Press, 2018.