Today's post introduces the model-free methods of Reinforcement Learning (RL). To build a model of the environment we need to store all the states and actions, so we are bound by the limits of our infrastructure; when the environment is too big and has too many variables, we need a different approach. Model-free approaches can handle problems where the world is too big to fit our infrastructure. To make this concrete, the Q-Learning algorithm will be presented and explained.
Robot in an infinite environment - Photo by Dominik Scythe on Unsplash
Terminology
Figure 1 - Agent-environment interaction
Agent — The learner and the one that takes the actions. The agent's goal is to maximise the cumulative reward over a sequence of actions and states.
Action — A move the agent can perform. Different environments allow the agent to perform distinct kinds of actions. The set of all valid actions in a given environment is usually called the action space.
State — The situation the agent is currently in, i.e. a complete description of the environment at a given moment.
Reward — For each action the agent takes in a state, the environment returns a reward, usually a scalar value.
Environment — Where the agent learns and chooses what actions to perform. The environment is the world where the agent lives and interacts.
Policy — The rule the agent uses to decide which action to take. It can be deterministic or stochastic.
Value Function — Maps a state to a value. It tells us how good it is to be in that state by evaluating the "future": the value represents the cumulative reward obtained by starting in that state and following a given policy.
Model — The agent's representation of the environment. It maps state-action pairs to a distribution over next states. Not available in every RL method.
Markov decision process (MDP) — A probabilistic model of a decision problem, where the current state and selected action determine a probability distribution on future states.
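To make these terms concrete, here is a minimal sketch of how a tiny MDP could be written down in Python. The states, actions, probabilities, and rewards are illustrative (loosely inspired by the B state of Figure 2 below), not taken from any specific environment:

```python
import random

# A tiny hand-made MDP; every state name, probability and reward is illustrative.
# transitions[state][action] -> list of (probability, next_state, reward) tuples.
transitions = {
    "A": {"go": [(1.0, "B", -2)]},
    "B": {"go": [(0.2, "D", 20), (0.8, "E", 0)]},
    "D": {},  # terminal state
    "E": {},  # terminal state
}

def step(state, action):
    """Sample the environment's response to an action, as an MDP does."""
    outcomes = transitions[state][action]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

print(step("A", "go"))  # e.g. ('B', -2)
```

The key point is that the current state and the chosen action fully determine the probability distribution over what happens next.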
When using a model-free RL method, we do not have a model of the environment with which to simulate its responses/rewards and plan the best action to perform. The agent uses the experience of all the actions it has performed and all the states it has visited to learn and achieve its goal. Thus, the agent relies on trial-and-error experience to find the optimal policy. In model-based RL, on the other hand, we try to model the environment and then choose the optimal policy based on what the agent learned from simulating that model.
In RL, each state yields a reward from the environment. Note that in model-free RL approaches the environment's rewards are estimated and updated as the agent explores/exploits the world. Although RL optimises the total cumulative reward, sometimes the immediate reward is low while the states further ahead yield the best cumulative reward. We therefore compute the value of a state, which can be seen as its future utility, or how good it is to be in that state. There are several ways to do this; one is to average the values of the future states/actions reachable from the state we are evaluating. This way we can choose a state with a low immediate reward because it gives us access to better rewards later on. That is why the trade-off between exploration and exploitation is so debated. To better understand the impact of immediate versus future rewards, let's analyse a Markov Decision Process with the two approaches.
Figure 2 shows how relying on immediate or on future rewards changes the value of each state. The first MDP (a) does not yet show the value of its states; it only shows the reward obtained by moving to each state (represented by R) and the probability of the agent choosing a given action (represented by P). Using the Bellman equation to compute the average value of each state, we can choose to rely on immediate or on future rewards through the equation's discount parameter. The discount γ belongs to the interval [0, 1]. When γ is set close to 1 we take into account the rewards of future states, not only the current one. On the other hand, choosing a γ close to 0 makes our value function rely on immediate rewards.
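Written out, the averaging used in the figure takes the following simplified form, where R(s) is the immediate reward of state s, P(s → s') is the probability of moving from s to s' under the agent's policy, and γ is the discount:

$$V(s) = R(s) + \gamma \sum_{s'} P(s \to s')\, V(s')$$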
To calculate the value of each state we add to the immediate reward the values of the next states, each multiplied by the probability of the agent choosing the action that leads to it, discounted by γ. As an example, in Figure 2, image (b), where the discount is 1 (relying on future rewards), the value of state B is computed as -2 + 1 * (0.2 * 20 + 0.8 * 0) = 2. In the third image (c) the calculation for state B's value is -2 + 0.2 * (0.2 * 20 + 0.8 * 0) = -1.2. In this case we can see that favouring immediate rewards (c) is not the best approach, as the state with the highest value is a final state with a reward of 0. Favouring future rewards (b), we end up in a final state with a higher cumulative reward.
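The same calculation fits in a couple of lines of code. This is a minimal sketch using only the numbers given above for state B (the -2 reward and the 0.2/0.8 transitions to states worth 20 and 0):

```python
def state_value(immediate_reward, transitions, gamma):
    """One Bellman backup: immediate reward plus the discounted,
    probability-weighted value of the successor states."""
    return immediate_reward + gamma * sum(p * v for p, v in transitions)

# State B from Figure 2: reward -2, then a state worth 20 with probability 0.2
# or a state worth 0 with probability 0.8.
b_transitions = [(0.2, 20), (0.8, 0)]

print(state_value(-2, b_transitions, gamma=1.0))  # 2.0  -> future rewards, case (b)
print(state_value(-2, b_transitions, gamma=0.2))  # -1.2 -> immediate rewards, case (c)
```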
Q-Learning
One of the best-known model-free RL algorithms is Q-Learning.
Figure 3 - Q-Learning life cycle
The ‘Q’ stands for quality, which represents how useful a given state-action pair is for obtaining future cumulative reward. In this algorithm, we create the so-called "Q-table", with one entry per state-action pair (states Ă— actions in shape). As an example, if the environment's state is described by position and velocity, [P, V], and there are 3 distinct actions available, [0, 1, 2], then the Q-table has one row for each possible [P, V] combination and one column per action.
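As a sketch, assuming the continuous position and velocity are discretised into a fixed number of bins (the 20-bin counts below are arbitrary, chosen only for illustration), the table could be created like this:

```python
import numpy as np

# Illustrative sizes: assume position and velocity are each discretised into 20 bins.
n_position_bins = 20
n_velocity_bins = 20
n_actions = 3            # the actions [0, 1, 2]

# One Q-value per (position bin, velocity bin, action) combination, initialised to zero.
q_table = np.zeros((n_position_bins, n_velocity_bins, n_actions))
print(q_table.shape)     # (20, 20, 3)
```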
When the agent's learning phase starts, the Q-table is filled with zeros or random values. The agent must then explore and exploit states and actions to update the Q-values at every episode, using the Q-learning update equation to estimate the value function for all states. The agent then acts based on the maximum Q-value it finds in the Q-table. By repeating this estimation of the Q-values over time, we converge towards the states' real values.
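A minimal sketch of that loop is shown below. It assumes a toy environment object whose reset() returns an integer state index and whose step(action) returns (next_state, reward, done); this interface, and the learning rate, discount, and exploration values, are assumptions for illustration, not a specific library's API:

```python
import numpy as np

def train(env, n_states, n_actions, episodes=1000,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning, assuming integer states and the toy env interface above."""
    q_table = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise exploit the current table.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q_table[state]))

            next_state, reward, done = env.step(action)

            # Q-learning update: move the estimate towards
            # reward + gamma * max_a' Q(next_state, a'), with no future term at the end.
            target = reward + gamma * np.max(q_table[next_state]) * (not done)
            q_table[state, action] += alpha * (target - q_table[state, action])

            state = next_state
    return q_table
```

The epsilon-greedy choice is what balances the exploration versus exploitation trade-off mentioned earlier: most of the time the agent exploits the best known Q-value, but occasionally it explores a random action.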
Summing it all up: we do not have a model, but we estimate the value of each state in every episode, based on trial-and-error experience, until the agent figures out the real values of the states, which yield the best cumulative reward for the agent.
Other model-free approaches use different equations to estimate the states' values, and some use deep neural networks for the estimate. Each has advantages and disadvantages; the best approach is the one most suitable for your case.