Reinforcement Learning (Part III) - Exploration vs Exploitation

In the Reinforcement Learning field, we face ourselves with the exploration and exploitation words. Moreover, many articles talk about the exploration vs exploitation trade-off. What do they mean? Why is this a thing in RL? Does this relationship have a big impact on the RL algorithms' outcome?

Figure 1 - Should I choose the well-known path or give a try to a new one? Photo by Jens Lelie on Unsplash

Exploration

Exploration is when the agent explores new steps and/or actions to find if other state-action pairs yield a better reward from the environment. You can explore the whole world, or a sample of it to find out the rewards you can get.

Imagine the case where you need to lunch somewhere in your city. You have two options, in the first one you go to the same restaurant you always go with that tasty food you like. The other option is choosing a different restaurant and only after being there you find out if the food is better, equal or worst. The second option leads you to a process of exploration as you will find out the reward of a new state Only by experience it you will find if the new restaurant as good food for your taste. On the other hand, the first option is the exploitation process.

Exploitation

In this case, the agent will always choose the state-action pair with the highest reward without trying to get information about other possibilities. Getting back to the example, we will choose to go to the same restaurant as we like the food they have there (in the agent's case, the reward is greater and well known). This is usually used in a greedy approach where we look for the best immediate rewards.

As you can see, the Exploration vs Exploitation trade-off has huge importance in RL algorithms. If you always explore the environment until you have all the state-action pairs reward's estimations, the algorithm will take too much time and will consume too many resources while running. If you only look for what you already know, you can get a restaurant with good food but you may never taste the best food for you because it is made in a restaurant that you did not explore.

Therefore, we have many algorithms that explore a sample of the environment and only then they begin to exploit. Or other ones where we exploit in the majority of the time and with a random probability we explore a new step-action pair.

Smart Insight

Search This Blog