The goal of the agent is to maximize the cumulative sum of rewards over the long run. A reward is a number, chosen by the designer, that summarizes how one wants the agent to behave for a specific state, action, and subsequent state.

The reward function, formally written as \(R(s)\), \(R(s,a)\), or \(R(s, a, s^\prime)\), can depend on the current state, the action the agent takes in that state, and the subsequent state.

Suppose the sequence of rewards received at each time step from \(t+1\) until \(T\) is represented as follows: $$ R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}, \cdots, R_T $$

The agent’s objective is to choose an action at each time step, given the current state, so as to maximize the overall cumulative reward; the rule it uses to choose actions is called a “policy”. Let \(G_t\) represent the sum of the future rewards from \(t+1\) onwards.

$$ G_t =R_{t+1}+ R_{t+2}+ R_{t+3}+ R_{t+4}+ \cdots $$

Taking into account the discount factor \(\gamma \in [0,1]\), which weights rewards that arrive later less heavily, $$ \begin{align*} G_t &=R_{t+1}+ \gamma R_{t+2}+ \gamma^2R_{t+3}+ \gamma^3R_{t+4}+ \cdots \newline &= R_{t+1} + \gamma G_{t+1} \end{align*} $$
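
The recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) suggests computing returns backwards from the end of an episode. The following is a minimal sketch, assuming a finite episode whose rewards are stored in a plain list; the function name `discounted_returns` and the sample rewards are illustrative and not from the text above.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every step of a finite episode.

    rewards[t] holds R_{t+1}, the reward received after acting at time t.
    Returns a list whose entry t is G_t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: nothing is received after the final step
    # Walk backwards so that G_{t+1} is already known when G_t is needed.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns


# Illustrative three-step episode (made-up numbers).
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

Setting `gamma=1.0` recovers the undiscounted sum, while smaller values make the agent more short-sighted.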

Policies and Value Functions

Value functions are functions of states that summarize how good it is for the agent to be in a given state. The measure of how good is the expected return from that state, that is, the expected discounted sum of future rewards.

A policy is defined by the probability distribution of a discrete random variable \(A_t\) conditional on a given state \(S_t\). It assigns a probability to every available action in each state, \(\pi : S \times A \rightarrow [0,1]\), where the probabilities in each state sum to one. A deterministic policy maps from the state space to the action space, while a stochastic policy maps from the state space to probability distributions over the action space.

If the agent follows policy \(\pi \) at time \(t\), \(\pi(a|s)\) represents the probability that the agent chooses action \(a\in A(s)\) given state \(s\in S\).
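
To make the distinction concrete, here is a small sketch of how the two kinds of policy could be stored and sampled. The two states, two actions, and all probabilities are hypothetical, and the helper names `sample_action` and `as_stochastic` are assumptions made for illustration.

```python
import random

# Hypothetical stochastic policy: pi(a|s) stored as state -> {action: probability}.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

# Hypothetical deterministic policy: state -> single action.
deterministic_policy = {"s0": "right", "s1": "left"}


def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a|state)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]


def as_stochastic(det_policy, actions):
    """Express a deterministic policy as a stochastic one with 0/1 probabilities."""
    return {
        s: {a: 1.0 if a == chosen else 0.0 for a in actions}
        for s, chosen in det_policy.items()
    }


rng = random.Random(0)  # seeded generator so repeated runs draw the same actions
print(sample_action(stochastic_policy, "s0", rng))
print(as_stochastic(deterministic_policy, ["left", "right"]))
```

Treating a deterministic policy as a special case of a stochastic one lets the same \(\pi(a|s)\) notation cover both.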

The value function of state \(s\) under policy \(\pi \) is then the expected return of the agent when it starts in state \(s\) and follows policy \(\pi\) thereafter.

$$ v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s] $$

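Using the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) and averaging over both the policy and the environment dynamics \(p(s^\prime, r | s, a)\), this expectation can be written recursively, which gives the Bellman equation for \(v_\pi\): $$ \begin{aligned} v_\pi(s) &= \sum_a\pi(a|s) \sum_{s^\prime, r}p(s^\prime, r | s, a)[r + \gamma v_\pi(s^\prime)], \text{ for all } s \in S \end{aligned} $$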

Note that \(r\) is the realized value of \(R(s, a, s^\prime)\).
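
One way to see the recursive equation for \(v_\pi\) at work is to apply its right-hand side repeatedly as an update until the values stop changing. This procedure is commonly called iterative policy evaluation and is not described in the text above. The sketch below assumes a tiny made-up MDP with two states and two actions; the `dynamics` table, the uniform policy, and the function name `evaluate_policy` are all illustrative assumptions.

```python
def evaluate_policy(policy, dynamics, gamma, theta=1e-10):
    """Approximate v_pi by sweeping the Bellman update until it converges.

    policy[s][a]     -> pi(a|s)
    dynamics[(s, a)] -> list of (probability, next_state, reward) triples,
                        a tabular stand-in for p(s', r | s, a)
    """
    states = list(policy.keys())
    v = {s: 0.0 for s in states}  # arbitrary initial guess
    while True:
        delta = 0.0
        for s in states:
            # Right-hand side of the Bellman equation for state s.
            new_v = sum(
                pi_a * sum(p * (r + gamma * v[s_next])
                           for p, s_next, r in dynamics[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:  # stop once a full sweep barely changes any value
            return v


# Hypothetical MDP: every transition here happens to be deterministic.
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
uniform_policy = {
    "A": {"stay": 0.5, "go": 0.5},
    "B": {"stay": 0.5, "go": 0.5},
}

values = evaluate_policy(uniform_policy, dynamics, gamma=0.9)
print({s: round(x, 4) for s, x in values.items()})  # {'A': 7.25, 'B': 7.75}
```

For this example the two Bellman equations can also be solved by hand, giving \(v_\pi(A) = 7.25\) and \(v_\pi(B) = 7.75\), which matches the iterative result.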


Richard S. Sutton and Andrew G. Barto, “Reinforcement learning: An introduction”, Second Edition, MIT Press