Deep Reinforcement Learning: A Quick Overview
Deep Reinforcement Learning (RL) can be broken down into two parts. The “deep” refers to the use of neural networks as function approximators, while reinforcement learning (RL) refers to the use of rewards to reinforce a behavior. It is an active branch of AI with new developments every few months. In this article, I will only explain the concepts behind RL and only get into the informal mathematics and general functions that describe an RL model. Deep RL is applied to playing video games, sports, driving, and robotics. In these cases, there is a goal, and the machine receives continuous feedback from the environment before taking the next action, and so forth. There is a continuous feedback loop between an agent and an environment. Almost all of the notions in this article can be referenced to “Foundations of Deep Reinforcement Learning: Theory and Practice in Python” by Laura Graesser and Wah Loon Keng.
As mentioned before, there is an agent, or subject if you will, that acts on an environment. Each time step in the environment is described by a state. This can be imagined as a slice in time or a frame in an animation sequence. The agent observes the state and selects an action, which brings it to another state. This state transition is associated with a reward. It’s called RL because reward is what drives the model through this continuous loop: the system has an objective, and rewards reinforce good actions, which help the system reach that objective.
A simple illustration of this concept is the game of CartPole, in which a cart moves along a track trying to balance a pole on top. The objective is to keep the pole balanced for 200 time steps.

The state, or information about the environment, can be a tuple that describes [cart position, cart velocity, pole angle, pole angular velocity]. The action is a number, 0 or 1, that tells the cart to move left (0) or right (1). The game terminates when the pole falls (more than 12 degrees from vertical) or when the pole stays up for ≥200 time steps.
There is a state space (all the possible states of the game, an individual state written s), an action space (all the possible actions, an individual action written a), and a reward function with three parameters (the state at time t, the action at time t, and the state at time t+1), written R(s_t, a_t, s_{t+1}). Together, these describe the basic unit of information of the RL system.
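As a minimal sketch, this agent-environment loop looks roughly like the following in code. It assumes the Gymnasium library (the maintained successor to OpenAI Gym) and its CartPole-v1 environment (v1 raises the time limit to 500 steps; the 200-step description above matches the older v0), and the “agent” here simply picks random actions:

```python
import gymnasium as gym  # assumes the Gymnasium package is installed

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)  # state = [cart position, cart velocity, pole angle, pole angular velocity]

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # 0 = push left, 1 = push right (a random "policy")
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # +1 for every time step the pole stays up
    done = terminated or truncated      # pole fell, cart left the track, or the time limit was hit

print(f"episode return: {total_reward}")
env.close()
```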
Given the above information, our first function, the Markov Probability Rule (the Markov property), is very useful.
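In symbols, using the notation above, the standard form of this rule is:

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_0, a_0, s_1, a_1, …, s_t, a_t)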

This rule states that the probability of arriving in the next state is determined only by the current state and action, not by the full history: the previous time step is what determines the next time step. Translated to our CartPole example, the next state of the cart and pole is determined by the state that came before it (and the action taken). Consequently, inputting s and a gives us a new state, which drives the algorithm forward. An important note, as we will see, is how s and a relate to the equations below.
Next, an important equation to be aware of is:
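Reconstructed here in standard notation (details may differ slightly from the book’s exact presentation), the policy and the objective it is trained to maximize are:

π_θ(a_t | s_t): the probability of taking action a_t in state s_t
R(τ) = Σ_{t=0..T} γ^t r_t: the discounted return of a trajectory τ, with discount factor 0 ≤ γ ≤ 1
J(π_θ) = E_{τ ~ π_θ}[ R(τ) ]: the objective, the expected return over trajectories generated by the policy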

A variation of the above function is:
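In standard notation, these are the state-value and action-value functions:

V^π(s) = E_{τ ~ π}[ R(τ) | s_0 = s ]  (how good it is to be in state s and then follow π)
Q^π(s, a) = E_{τ ~ π}[ R(τ) | s_0 = s, a_0 = a ]  (how good it is to take action a in state s and then follow π)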

The three equations above correspond to the three main families of algorithms in deep RL. The first function is an environment (transition) model, the second is a policy function, and the third is a value function.
The first function, the Markov transition model, is what an environment model is built on. Environment models are used in AI for games such as chess, Go, and backgammon. The model maps out or “imagines” sequences of states and actions using the Markov transition probabilities before it makes an optimal move. An example of this idea is Monte Carlo Tree Search (MCTS), where sampled sequences of actions, called Monte Carlo rollouts, are explored and assigned estimated values. The computer then selects the best course of action given these rollouts.
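Here is a minimal sketch of the rollout idea, using flat Monte Carlo rollouts (a much-simplified cousin of full MCTS) on a hypothetical toy model: a point on a number line that is rewarded for reaching +5. The model_step function is an invented stand-in for a real environment model:

```python
import random

def model_step(state, action):
    """Toy transition model P(s'|s,a): action 0 moves left, action 1 moves right."""
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state == 5 else 0.0  # reward only for reaching the goal at +5
    done = abs(next_state) >= 5               # episode ends at either end of the line
    return next_state, reward, done

def rollout(state, max_depth=20):
    """Play random actions inside the model and return the total reward collected."""
    total = 0.0
    for _ in range(max_depth):
        state, reward, done = model_step(state, random.choice([0, 1]))
        total += reward
        if done:
            break
    return total

def best_action(state, actions=(0, 1), n_rollouts=500):
    """Estimate each action's value by averaging random rollouts, then pick the best one."""
    values = {}
    for a in actions:
        returns = []
        for _ in range(n_rollouts):
            s, r, done = model_step(state, a)
            returns.append(r if done else r + rollout(s))
        values[a] = sum(returns) / n_rollouts
    return max(values, key=values.get)

print(best_action(state=0))  # usually prints 1, i.e. step toward the +5 goal
```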
Let us explain the second function. Tau refers to the trajectory of an episode. An episode is the collection of time steps from t = 0 to the termination of the environment, and the trajectory (tau) is the sequence s, a, r, s, a, r, … collected over those time steps. The importance of the policy function (pi) is that it drives the model forward by choosing actions, and at the same time maximizing the expected return of its trajectories is what produces the “learning” of the RL system. NOTE: in this case, the maximization is carried out by parameterizing the policy with weights theta (a neural network), sampling many runs of s and a under the current policy, and using gradient ascent on theta to push the objective higher.
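A standard form of that gradient-ascent step (the REINFORCE policy-gradient estimator) is:

∇_θ J(π_θ) = E_{τ ~ π_θ}[ Σ_t R_t(τ) ∇_θ log π_θ(a_t | s_t) ]

where R_t(τ) is the return from time t onward, and the parameters are updated as θ ← θ + α ∇_θ J(π_θ) with learning rate α.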
Immediately related to this equation is the final function, the value function, which helps an RL model learn by estimating the reward it can expect to collect. It measures how “good” or “bad” a state or state-action pair is, and learning an accurate estimate (and then acting greedily with respect to it) is also what makes the RL model “learn.” As above, in deep RL the value function is approximated by a neural network with parameters theta that are adjusted iteratively.
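As one concrete (and classic) example of how such a value estimate is learned, the tabular Q-learning update is:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

where α is a learning rate and the bracketed term is the error between the current estimate and the reward actually observed plus the discounted value of the best next action. Deep Q-learning replaces the table with a neural network trained to shrink this same error.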
Then, finally, there are combined methods that use two or more of the above function types. The Actor-Critic algorithm uses a policy function to act and a value function to critique the actions. Algorithms that fall into this category include A3C, Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC), with PPO being the most widely used. Methods using an environment model together with policy and/or value functions include AlphaGo and Dyna-Q.
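In a typical actor-critic setup, the critic’s value estimate enters the actor’s policy-gradient update through the advantage:

A^π(s, a) = Q^π(s, a) − V^π(s)
∇_θ J(π_θ) ≈ E[ ∇_θ log π_θ(a | s) A^π(s, a) ]

so actions that did better than the critic expected are made more likely, and worse-than-expected actions less likely.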
Another grouping for the algorithms is off-policy vs. on-policy. On-policy algorithms only use data generated by the current policy, and that data is discarded after each update, while off-policy algorithms allow data to be reused. Off-policy algorithms are therefore typically more memory intensive.
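The data reuse in off-policy methods is usually implemented with an experience replay buffer; here is a minimal sketch (the class name and sizes are just illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so an off-policy agent can reuse them many times."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # A random mini-batch of old experience, possibly collected under earlier policies.
        return random.sample(list(self.buffer), batch_size)

# An on-policy algorithm, by contrast, collects a fresh batch with its current policy,
# performs its update, and then throws that batch away.
```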
We have now gone through an overview of the present state of RL. The “deep” learning aspect of RL comes from using neural networks to approximate our three function types, and the Markov Rule is an essential component that makes these functions work; much research is being conducted on these models. Reinforcement learning is being used in video games, robotics, and NLP. Chatbots, text summarization, text translation, gaming, and robotics, including self-driving cars, are all exciting applications of RL. It achieves this by mimicking how humans learn: through feedback, trial, and error. Although this is only a light dive into this AI subfield, I hope it has inspired readers and made them aware of the possibilities and future developments of RL.