Value Iteration & Q-Iteration
I’m currently self-studying reinforcement learning (RL), and I often find it challenging to retain everything I learn. To help with this, I’m keeping notes as I go.
The code examples and learning material I reference are based on the book: Deep Reinforcement Learning Hands-On 3rd edition.
I do not claim ownership of the original content—these notes are strictly for personal learning and reference purposes.
Policy
Policies are denoted by \(\pi_{\theta}\), and actions are sampled from the distribution the policy defines: \(a \sim \pi_{\theta}\). In deep RL, policies are typically represented by neural networks and can be categorical (discrete action space) or Gaussian (continuous action space).
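As a quick illustration (my own sketch, not from the book; the layer sizes and class names are arbitrary), here is what a categorical and a Gaussian policy might look like in PyTorch:
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs logits over the action set."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Distribution:
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs a mean, with a learned log-std."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Distribution:
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())

# a ~ pi_theta(.|s): sample an action from the policy's distribution
obs = torch.randn(4)
action = CategoricalPolicy(obs_dim=4, n_actions=2)(obs).sample()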
Trajectories
Trajectories are denoted by \(\tau\) and consist of a sequence of states and actions: \(\tau = (s_0, a_0, s_1, a_1, \dots)\).
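For concreteness, a trajectory can be collected in gymnasium like this (a minimal sketch of mine, with a random policy standing in for \(\pi_{\theta}\)):
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, _ = env.reset()
trajectory = []  # tau = (s_0, a_0, s_1, a_1, ...)
while True:
    action = env.action_space.sample()  # a_t ~ pi(.|s_t), here a random policy
    trajectory.extend([state, action])
    state, reward, done, trunc, _ = env.step(action)
    if done or trunc:
        break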
Bellman Expectation vs. Bellman Optimality/Equality
The Bellman expectation equations evaluate the expected reward under a specific policy \(\pi\), while the Bellman optimality equations give the value under the assumption of the optimal policy \(\pi^{*}\). I'll discuss both in more detail, but keep the difference between the two in mind.
Bellman Expectation
\[V^{\pi}(s) = \mathbb{E}_{a \sim \pi} \left[ r(s,a) + \gamma V^{\pi}(s') \right]\]The value function gives the expected reward when following the current policy \(\pi\) starting at state \(s\).
Each value \(V^{\pi}(s)\) is defined recursively: the immediate reward of the action the policy takes plus the discounted value of the next state \(s'\), averaged over the policy's action choices. (The version that picks the best action at each state is the optimality equation below.)
\[Q^{\pi}(s,a) = \mathbb{E}_{s' \sim S} \left[ r(s,a) + \gamma \mathbb{E}_{a' \sim \pi}[Q^{\pi}(s', a')]\right]\]The state-action value function gives the expected reward of taking action \(a\) in state \(s\) and then following the current policy.
Bellman Optimality/Equality
\[V^{*}(s) = \max\limits_{a}\mathbb{E}_{s' \sim S} \left[ r(s,a) + \gamma V^{*}(s') \right ]\]The Bellman optimality equation gives the expected reward when following the optimal policy: at each state we take the action with the highest expected return.
Similarly, we can write the optimal state-action value function as follows.
\[Q^{*}(s,a) = \mathbb{E}_{s' \sim S} \left[ r(s,a) + \gamma \max\limits_{a'}Q^{*}(s', a')\right]\]The only difference between the Bellman expectation and the Bellman optimality equations is the \(\max\) operator, which selects the action that leads to the highest discounted reward instead of averaging over the policy.
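As a side note (my addition, though it is a standard identity), the optimal value function is simply the best action value at each state:
\[V^{*}(s) = \max\limits_{a} Q^{*}(s,a)\]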
Also note that because the value function takes an expectation over actions at each state, we can rewrite it in terms of the state-action value function:
\[V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s,a)]\]
Value Iteration in Practice (Frozen Lake)
Value iteration boils down to three steps:
- Initialize the value table \(V(s)\) for all states \(s \in S\)
- For each \(s\), apply the Bellman optimality update: \(V(s) \leftarrow \max\limits_{a}\mathbb{E}_{s' \sim S} \left[ r(s,a) + \gamma V(s') \right]\)
- Repeat step 2 until convergence (a toy sketch of the full loop follows this list).
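Here is a self-contained toy sketch of those three steps on a tiny two-state MDP with known transition probabilities (the MDP itself is made up purely for illustration):
# Toy MDP: P[s][a] = list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.5)]},
}
gamma = 0.9

# Step 1: initialize the value table for all states
V = {s: 0.0 for s in P}

# Steps 2-3: apply the Bellman optimality update until the values stop changing
while True:
    new_V = {
        s: max(
            sum(prob * (reward + gamma * V[s2]) for prob, s2, reward in transitions)
            for transitions in P[s].values()
        )
        for s in P
    }
    if max(abs(new_V[s] - V[s]) for s in P) < 1e-6:
        break
    V = new_V

print(V)
With the idea in place, here is the Frozen Lake implementation from the book, starting with the types and the Agent class: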
import typing as tt
from collections import defaultdict, Counter
import gymnasium as gym

ENV_NAME = "FrozenLake-v1"  # the Frozen Lake environment
GAMMA = 0.9                 # discount factor

State = int
Action = int
RewardKey = tt.Tuple[State, Action, State]
TransitKey = tt.Tuple[State, Action]

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)  # Frozen Lake
        self.state, _ = self.env.reset()
        self.rewards: tt.Dict[RewardKey, float] = defaultdict(float)        # rewards table
        self.transits: tt.Dict[TransitKey, Counter] = defaultdict(Counter)  # transitions table (counts of next states)
        self.values: tt.Dict[State, float] = defaultdict(float)             # value table (state -> value)
First we initialize the dictionaries that hold our tables. Note that the expectation over next states \(s'\) is estimated from empirical transition probabilities, which we obtain from the visit counts stored in the self.transits dictionary.
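To make that concrete, here is a tiny illustration of mine (with made-up numbers) of how the nested defaultdict(Counter) turns visit counts into transition probabilities:
from collections import defaultdict, Counter

transits = defaultdict(Counter)
# pretend the agent took action 1 in state 0 ten times ...
for next_state in [1] * 8 + [2] * 2:  # ... landing in state 1 eight times, state 2 twice
    transits[(0, 1)][next_state] += 1

counts = transits[(0, 1)]    # Counter({1: 8, 2: 2})
total = sum(counts.values()) # 10
probs = {s2: c / total for s2, c in counts.items()}
print(probs)                 # {1: 0.8, 2: 0.2}
Next, the method that gathers this experience through random play: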
    def play_n_random_steps(self, n: int):
        for _ in range(n):
            action = self.env.action_space.sample()
            new_state, reward, is_done, is_trunc, _ = self.env.step(action)
            rw_key = (self.state, action, new_state)
            self.rewards[rw_key] = float(reward)
            tr_key = (self.state, action)
            self.transits[tr_key][new_state] += 1  # count the visit to new_state under (state, action)
            if is_done or is_trunc:
                self.state, _ = self.env.reset()
            else:
                self.state = new_state
We then define the random steps the agent takes in the environment, which follows the usual gymnasium loop. For every (s, a, s') triple we store the reward, and for every (s, a) pair we increment the count of the next state reached. Each (s, a, s') entry in the rewards table keeps a fixed value because rewards in the environment are deterministic for a given transition. This is how the rewards and transition tables get populated.
    def calc_action_value(self, state: State, action: Action) -> float:
        target_counts = self.transits[(state, action)]
        total = sum(target_counts.values())
        action_value = 0.0
        for tgt_state, count in target_counts.items():  # for each observed next state
            rw_key = (state, action, tgt_state)
            reward = self.rewards[rw_key]
            val = reward + GAMMA * self.values[tgt_state]
            action_value += (count / total) * val
        return action_value
Now we define a function that calculates the value of an action. target_counts holds the transition counts for a (state, action) pair. For example, if the pair (s=1, a='move right') has the counter {2: 8, 3: 2}, it means the agent moved to state 2 eight times and to state 3 two times, giving empirical probabilities of 0.8 and 0.2. For each possible next state we look up the immediate reward in the self.rewards table and the next state's value in the self.values table, which gives \(r(s,a) + \gamma V(s')\). The count/total factor then supplies the probability used to form the expectation. In summary, this function returns the action value for a given state and action: \(Q(s,a) = \mathbb{E}_{s' \sim S} \left[ r(s,a) + \gamma V(s') \right]\).
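Continuing the made-up numbers from the paragraph above (counts {2: 8, 3: 2}, \(V(2)=5\), \(V(3)=3\), and assuming an immediate reward of 0 for both transitions), the action value works out to:
\[Q(s{=}1, a{=}\text{right}) = 0.8\,(0 + \gamma \cdot 5) + 0.2\,(0 + \gamma \cdot 3) = 4.6\,\gamma\]With \(\gamma = 0.9\), this gives 4.14.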
    def select_action(self, state: State) -> Action:
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.calc_action_value(state, action)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action
Here we implement the \(\max\) part of \(V^{*}(s) = \max\limits_{a}\mathbb{E}_{s' \sim S} \left[ r(s,a) + \gamma V^{*}(s') \right]\): since calc_action_value gives us \(Q(s,a)\), we iterate over all actions for the state and pick the one with the highest value.
    def play_episode(self, env: gym.Env) -> float:
        total_reward = 0.0
        state, _ = env.reset()
        while True:
            action = self.select_action(state)  # choose the action with the highest value for this state
            new_state, reward, is_done, is_trunc, _ = env.step(action)
            rw_key = (state, action, new_state)
            self.rewards[rw_key] = float(reward)
            tr_key = (state, action)
            self.transits[tr_key][new_state] += 1
            total_reward += reward
            if is_done or is_trunc:
                break
            state = new_state
        return total_reward
This is equivalent to play_n_random_steps with one change: instead of sampling random actions, we choose the action that maximizes the value at each state. Keep in mind that the rewards and transition tables are still updated during this process.
    def value_iteration(self):
        for state in range(self.env.observation_space.n):
            state_values = [
                self.calc_action_value(state, action)
                for action in range(self.env.action_space.n)
            ]
            self.values[state] = max(state_values)
Lastly we perform the value iteration itself, which updates the value of each state to the maximum action value across all its actions (the Bellman optimality update). The training loop looks as follows:
test_env = gym.make(ENV_NAME)
agent = Agent()
best_reward = 0.0

while True:
    agent.play_n_random_steps(100)
    agent.value_iteration()
    reward = 0.0
    for _ in range(20):
        reward += agent.play_episode(test_env)
    reward /= 20
    if reward > best_reward:
        best_reward = reward
    if reward > 0.80:
        break
- We first play 100 random steps, updating the rewards and transition tables; note that the value table is not updated during this step.
- We then run value iteration, updating the state values from the data collected in the rewards and transition tables.
- We play 20 test episodes and measure the average reward.
- If the average reward exceeds 0.80, we stop.
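The title also mentions Q-iteration: the same recipe, but with a table of Q(s, a) values instead of V(s), updated with the Bellman optimality equation for \(Q^{*}\) shown earlier. The sketch below is my own reconstruction of how select_action and value_iteration would change (the subclass name is mine; everything else stays as above):
class QIterationAgent(Agent):
    """Q-iteration variant: self.values maps (state, action) -> value."""

    def select_action(self, state: State) -> Action:
        # No one-step lookahead needed: read the best action straight from the Q-table.
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

    def value_iteration(self):
        # Q(s, a) = E_{s'}[ r(s, a) + gamma * max_a' Q(s', a') ]
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                action_value = 0.0
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                for tgt_state, count in target_counts.items():
                    reward = self.rewards[(state, action, tgt_state)]
                    best_next = max(
                        self.values[(tgt_state, a)]
                        for a in range(self.env.action_space.n)
                    )
                    action_value += (count / total) * (reward + GAMMA * best_next)
                self.values[(state, action)] = action_value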