
Reinforcement Learning (Part II) - Model-free

Today's post will introduce you to the model-free methods of Reinforcement Learning (RL). To have a model of the environment we need to store all the states and actions, so we are bound by the limits of our infrastructure; when the environment is too big and has too many variables, we need a different approach. Model-free methods can handle problems where the world is too large to fit in our infrastructure. For a better understanding, the Q-Learning algorithm will be presented and explained.
Robot infinite environment - Photo by Dominik Scythe on Unsplash

Terminology

Figure 1 - Agent-environment interaction
Agent — The learner and decision-maker, the one that takes the actions. The agent's goal is to maximise the cumulative reward across a sequence of actions and states.
Action — A move the agent can perform. Different environments allow the agent to perform distinct kinds of actions. The set of all valid actions in a given environment is usually called the action space.
State — A complete description of the environment at a given moment; the situation the agent currently finds itself in.
Reward — The feedback the environment returns for each action/state picked by the agent, usually a scalar value.
Environment — Where the agent learns and chooses what actions to perform. The environment is the world where the agent lives and interacts. 
Policy — The rule the agent uses to decide which actions to take. It can be deterministic or stochastic.
Value Function — Maps a state to a value that tells how good it is to be in that state, looking at the "future". The value represents the expected cumulative reward obtained by starting in that state and following a given policy.
Model — The agent's representation of the environment, mapping each state-action pair to a distribution over next states. Not available in every RL method.
Markov decision process (MDP) —  A probabilistic model of a decision problem, where the current state and selected action determine a probability distribution on future states.
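To tie these terms together, here is a tiny, made-up MDP written as plain Python dictionaries; the states, actions and numbers are invented for illustration and are unrelated to the figures below.

```python
# A toy MDP, just to make the terminology concrete (all values are made up).
states = ["S0", "S1"]                        # state space
actions = ["left", "right"]                  # action space
# Model: P(next_state | state, action) -- only available in model-based settings.
transitions = {
    ("S0", "left"):  {"S0": 0.9, "S1": 0.1},
    ("S0", "right"): {"S0": 0.2, "S1": 0.8},
    ("S1", "left"):  {"S0": 1.0},
    ("S1", "right"): {"S1": 1.0},
}
rewards = {"S0": 0.0, "S1": 1.0}             # reward received when reaching each state
policy = {"S0": "right", "S1": "right"}      # a deterministic policy: state -> action
```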


When using a model-free RL method, we do not have a model of the environment to simulate its response/reward and plan the best action to perform. The agent uses the experience of all the actions it performed and all the states it visited before to learn and achieve its goal. Thus, the agent relies on trial-and-error experience to find the optimal policy. On the other hand, in model-based RL we try to model the environment and then choose the optimal policy based on what the agent learned from the model simulation.
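As a rough illustration of this trial-and-error loop, here is a minimal sketch assuming a Gym-style environment that exposes reset() and step(action), and a hypothetical agent object with choose_action and update methods; all names are placeholders rather than any specific library's API.

```python
# Minimal sketch of the model-free interaction loop (names are illustrative).
def run_episode(env, agent):
    state = env.reset()                  # start a new episode
    total_reward = 0.0
    done = False
    while not done:
        action = agent.choose_action(state)              # act using the current policy
        next_state, reward, done = env.step(action)      # environment returns reward and next state
        agent.update(state, action, reward, next_state)  # learn from this single transition
        state = next_state
        total_reward += reward
    return total_reward
```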

In RL, each state yields a reward from the environment. Please note that in model-free RL approaches the environment's rewards are estimated and updated as the agent explores/exploits the world. Although RL is based on the total cumulative reward, sometimes the immediate reward can be low while the states onwards yield the best cumulative reward. We therefore calculate the value of a state, which can be seen as its future utility or how good it is to be in that state. We can use several approaches to do it; one is averaging the values of the future states/actions reachable from the state we are evaluating. This way, we can choose a state with a low immediate reward that nevertheless gives us access to better rewards in the future. That is why the trade-off between exploration and exploitation is so debated. For now, to better understand the impact of immediate and future rewards, let's analyse a Markov Decision Process with the two approaches.
Figure 2 - MDP example
Figure 2 shows how relying on immediate or future rewards changes the value of each state. The first MDP (a) does not show the value of its states; there we only have the reward obtained by going to each state (represented by R) and the probability of the agent choosing a given action (represented by P). Using the Bellman equation to compute the average value of each state, we can choose to rely on immediate or future rewards through the equation's discount parameter. The discount γ belongs to the interval [0, 1]. When γ is set close to 1 we take into account the rewards of future states and not only the current one. On the other hand, choosing a γ close to 0 makes our value function rely on immediate rewards.
To calculate the value of each state we add the immediate reward to the value of each next state, weighted by the probability of the agent choosing the action that leads to it and discounted by γ. As an example, in Figure 2, image (b), where the discount is 1 (relying on future rewards), the value of state B is computed as -2 + 1 * (0.2 * 20 + 0.8 * 0) = 2. In the third image (c) we have -2 + 0.2 * (0.2 * 20 + 0.8 * 0) = -1.2 for state B's value. So, in this case, we can see that relying on immediate rewards (c) is not the best approach, as the state with the highest value is a final state whose reward is 0. Relying on future rewards (b), we end up in a final state with a higher cumulative reward.
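The same calculation in a few lines of Python, just to make the arithmetic explicit (the probabilities, rewards and next-state values are the ones read off Figure 2 in the worked example above):

```python
# Value of state B from Figure 2, for two different discount factors.
def state_value(immediate_reward, gamma, transitions):
    """transitions: list of (probability, next_state_value) pairs."""
    return immediate_reward + gamma * sum(p * v for p, v in transitions)

transitions_b = [(0.2, 20), (0.8, 0)]
print(state_value(-2, gamma=1.0, transitions=transitions_b))  # 2.0  -> relying on future rewards (b)
print(state_value(-2, gamma=0.2, transitions=transitions_b))  # -1.2 -> relying on immediate rewards (c)
```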

Q-Learning

One of the best-known model-free RL algorithms is Q-Learning.
Figure 3 - Q-Learning life cycle
The ‘Q’ stands for quality, which represents how useful a given state-action pair is for obtaining future cumulative reward. In this algorithm, we create the so-called "Q-table", with one entry per state-action pair (number of states * number of actions). As an example, if our environment's state is described by position and velocity, [P, V], discretised into a finite set of states, and we have 3 distinct actions available, [0, 1, 2], then the "Q-table" will be a matrix with one row per discretised [P, V] state and 3 columns.
When we start the agent's learning phase, the "Q-table" is filled with zeros or random values. Then the agent must explore and exploit the states and actions to update the "Q-values" at every episode. This is done using the Q-learning update equation, which estimates the value of every state-action pair. The agent then acts based on the maximum "Q-value" it finds in the "Q-table" for the current state. Thus, by constantly re-estimating the "Q-values" over time, we converge to the states' real values.
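To make the procedure concrete, here is a minimal tabular Q-learning sketch. It assumes a Gym-style environment with discrete states and actions (env.reset() and env.step() returning integer states), and the hyperparameter values are purely illustrative.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table: one row per state, one column per action, initialised with zeros.
    q_table = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(action)
            # Q-learning update: move the estimate towards reward + discounted best next value.
            best_next = np.max(q_table[next_state])
            q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])
            state = next_state
    return q_table
```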
Summing it all up: we do not have a model, but we estimate the value of each state at every episode, based on trial-and-error experience, until the agent figures out the real values of the states, i.e. the ones that yield the best cumulative reward.

Other model-free approaches use different equations to estimate the states' values, and some of them use deep neural networks to do it. Each one has advantages and disadvantages; the best approach is the one most suitable for your case.



