Q-learning is a model-free reinforcement learning technique that learns the value of an action in a given state. It does not require a model of the environment (hence the term "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations. For any finite Markov decision process (FMDP), Q-learning finds an optimal policy by maximizing the expected value of the total reward over all successive steps, starting from the current state. Given infinite exploration time and a partly random policy, Q-learning can identify an optimal action-selection policy for any given FMDP. "Q" refers to the function computed by the algorithm: the expected reward for an action taken in a particular state. The basic pseudocode of the Q-learning algorithm is shown below:
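The standard tabular form of the algorithm can be outlined as follows (α is the learning rate, γ the discount factor):

```
Initialize Q(s, a) arbitrarily for all states s and actions a
For each episode:
    Initialize state s
    Repeat until s is terminal:
        Choose action a in s using a policy derived from Q (e.g. epsilon-greedy)
        Take action a; observe reward r and next state s'
        Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]
        s ← s'
```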

The core of the Q-learning algorithm is a Bellman equation applied as a simple value-iteration update, which takes a weighted average of the old value and the new information:
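Written out, the update rule is:

Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a))

where α ∈ (0, 1] is the learning rate (the weight given to the new information) and γ ∈ [0, 1] is the discount factor, which determines how strongly future rewards are valued relative to the immediate reward r_t.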

Full Code for Implementing the Q-Learning Algorithm

We can put the Q-learning algorithm into practice with the implementation below:
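Here is a minimal, self-contained sketch of tabular Q-learning in Python. The environment is a hypothetical toy example (a one-dimensional corridor of six states, invented here for illustration): the agent starts in state 0, action 1 moves right, action 0 moves left, and reaching state 5 yields a reward of 1 and ends the episode.

```python
import random

# Hypothetical toy environment: a 1-D corridor of 6 states.
# The agent starts in state 0; action 0 moves left, action 1 moves right.
# Reaching state 5 yields a reward of 1 and ends the episode.
N_STATES = 6
ACTIONS = [0, 1]  # 0 = left, 1 = right

def step(state, action):
    """Return (next_state, reward, done) for the corridor environment."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        done = False
        while not done:
            # Epsilon-greedy action selection: explore with probability epsilon.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the old estimate toward the new target,
            # i.e. a weighted average of the old value and the new information.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# The greedy policy extracted from the learned table should move right
# (action 1) in every non-terminal state.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy read off the Q-table heads toward the rewarding state, illustrating that Q-learning recovers the optimal policy without ever being given a model of the environment's transitions.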