
Policy and Value Functions in RL: REINFORCE and SARSA

Introduction

Jeffrey Ng · Published in Analytics Vidhya · Nov 7, 2021


In Reinforcement Learning (RL) there are two important function types that dictate an agent’s action in its environment. The agent’s action can be controlled using a policy function or a value function. A policy-based RL system uses state-to-action probabilities to select its actions in the environment, whereas a value-based RL system uses estimates of reward (a value function) to determine its action in an environment. We will conceptually discuss these two algorithm types and go through some light math involved in two representative algorithms, REINFORCE and SARSA respectively. NOTE: These concepts are covered thoroughly in “Foundations of Deep Reinforcement Learning: Theory and Practice in Python” by Laura Graesser and Wah Loon Keng.

REINFORCE Algorithm

In Reinforcement Learning (RL), an agent interacts with its environment through a policy function. The policy dictates what the agent will do in its environment. In our case, the policy π_θ (pi, parameterized by theta) maps states to action probabilities. The first element to introduce in understanding the policy function is the objective function. An objective function tells us the goal of the RL system, such as getting the highest possible score or winning the game. In this RL system, the objective function is:
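A standard way to write it, following the book’s notation, is the expected return over trajectories τ collected by acting with π_θ, where r_t is the reward at step t and γ is a discount factor:

J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t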

Essentially, to maximize the objective function with gradient ascent we need its gradient with respect to theta, known as the policy gradient, which looks like:
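In its standard REINFORCE form, with R_t(τ) denoting the return from time step t onward, the gradient is:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} R_t(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]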

The policy gradient is the mechanism by which the action probabilities produced by the policy are changed. Although the math is complex, understand that differentiating the objective with respect to theta gives us the direction in which to adjust the parameters so that the agent moves toward its goal: actions that were followed by high returns become more likely, and actions that were followed by low returns become less likely.
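As a concrete illustration, here is a minimal sketch of a single REINFORCE update for a linear softmax policy over discrete actions, written in plain NumPy. The parameter shapes, learning rate, and function names are illustrative assumptions, not the book’s code.

import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE (policy-gradient) step for a linear softmax policy.

    theta: array of shape (n_features, n_actions)
    episode: list of (state, action, reward) collected with the current policy,
             where each state is a NumPy feature vector of length n_features.
    """
    states, actions, rewards = zip(*episode)

    # Discounted return-to-go R_t for every time step of the episode.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Accumulate R_t * grad_theta log pi_theta(a_t | s_t) over the episode.
    grad = np.zeros_like(theta)
    for s, a, R_t in zip(states, actions, returns):
        probs = softmax(theta.T @ s)      # action probabilities under pi_theta
        dlog = -np.outer(s, probs)        # gradient of log-softmax w.r.t. theta
        dlog[:, a] += s                   # extra term for the action actually taken
        grad += R_t * dlog

    return theta + lr * grad              # gradient *ascent* on J(pi_theta)

Each update pushes up the log-probability of actions that were followed by high returns, which is exactly what the policy-gradient expression above prescribes.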

SARSA Algorithm

Next, we will talk about the value function in RL, which can be written as:
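For SARSA the relevant value function is the action-value, or Q, function: the reward the agent can expect to collect if it takes action a in state s and follows its policy π afterwards. SARSA learns it with a temporal-difference update, where α is a learning rate:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{T} \gamma^{t} r_t \,\Big|\, s_0 = s,\ a_0 = a \Big]

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]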

The value function estimates the reward an RL system can expect to collect, and the agent maximizes its reward by choosing the actions with the highest estimated value. The branch of RL that deals with learning value functions or Q functions is known as value-based RL, with Q-learning and SARSA as its best-known algorithms.
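To make this concrete, here is a minimal tabular SARSA sketch. It assumes a small discrete environment with a hypothetical interface in which env.reset() returns a state index and env.step(action) returns (next_state, reward, done); Q is a NumPy array indexed by (state, action).

import numpy as np

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode of tabular SARSA, updating Q in place."""

    def epsilon_greedy(state):
        if np.random.rand() < epsilon:              # explore with probability epsilon
            return np.random.randint(Q.shape[1])
        return int(np.argmax(Q[state]))             # otherwise act greedily

    state = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(next_state)
        # SARSA target: uses the next action actually chosen by the policy, not the max.
        td_target = reward + gamma * Q[next_state, next_action] * (not done)
        Q[state, action] += alpha * (td_target - Q[state, action])
        state, action = next_state, next_action
    return Q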

There are a few concepts that follow from the Q function. First, the 𝛾 (gamma) term, the discount factor, which determines how much the system values future rewards. Second, the exploration vs. exploitation tradeoff and the 𝜺 (epsilon)-greedy algorithm.

Gamma term

Notice in the Q function there is a 𝛾 term, also called the discount factor. It is a multiplier applied to future rewards. If 𝛾 is small, the RL system only cares about rewards in the near future, whereas a large 𝛾 makes the RL system care about the ultimate goal, i.e. rewards many steps into the future. A small 𝛾 can make learning faster but more short-sighted. A 𝛾 of 0.99 is a good default value for a wide range of applications.
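Concretely, 𝛾 scales a reward that arrives k steps in the future by 𝛾^k in the discounted return:

G_t = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}

With 𝛾 = 0.5, a reward 10 steps away is worth about 0.001 of an immediate reward, while with 𝛾 = 0.99 it is still worth about 0.9, which is why the larger value keeps the agent focused on long-term goals.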

Exploration vs Exploitation Tradeoff & 𝜺 (epsilon) Greedy Algorithm

Because the agent normally picks the action that maximizes the Q function, we say the algorithm is “greedy.” The 𝜺 (epsilon) is a factor that determines how greedy the agent is: with probability 1 - 𝜺 it picks the greedy (highest-Q) action, and with probability 𝜺 it picks a random action instead. If 𝜺 is high (up to a value of 1.0), the behavior becomes less deterministic; the Q values are effectively ignored and the RL system chooses randomly, causing it to “explore.” A low 𝜺 makes the agent rely on its current Q estimates for decision making, leading to “exploitation.”
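In code, this choice is just a coin flip between a random action and the argmax action (the same epsilon_greedy helper used inside the SARSA sketch above). A standalone sketch with illustrative Q values:

import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon, else the highest-Q action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action

q = np.array([0.1, 0.5, 0.2])                     # illustrative Q values for 3 actions
print(epsilon_greedy_action(q, epsilon=1.0))      # pure exploration: a random action
print(epsilon_greedy_action(q, epsilon=0.0))      # pure exploitation: always action 1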

Conclusion

An RL system can be controlled using a policy-based or a value-based algorithm (REINFORCE and SARSA respectively). Policy algorithms use the objective function and its gradient to dictate behavior, while value-based algorithms use learned reward estimates (Q values) to dictate behavior. Q-learning has been applied to computer systems whose configurations need to be optimized. It can be applied to news recommendation systems where the approach to the user’s tastes is dynamic (the system takes in real-time user preferences and adapts using Q-learning). Q-learning can also be used to optimize traffic light control, learning traffic congestion patterns and timing the lights accordingly. In more intricate, complicated tasks, policy-based algorithms tend to work better because they can learn stochastic policies, while Q-learning is better suited to discrete action spaces where a single best action exists. An example of how stochastic policies outperform Q-learning is rock/paper/scissors. A Q-learning agent ends up playing the same “best” move every time, which an opponent can learn to beat, whereas a stochastic policy can randomize its output.

A combination of these approaches can be used, and research seems to point in that direction. I hope this brief walk-through of RL algorithms sheds some light on the subject. Have a good day!
