The state–action–reward–state–action (SARSA) method is a reinforcement learning algorithm for learning a Markov decision process policy. Rummery and Niranjan introduced it in a technical note as "Modified Connectionist Q-Learning" (MCQ-L); the alternative name SARSA, proposed by Rich Sutton, appeared only as a footnote.

This name reflects the fact that the main function for updating the Q-value depends on the agent's current state "S1", the action "A1" that the agent chooses, the reward "R" that the agent receives for choosing this action, the state "S2" that the agent enters after taking that action, and finally the next action "A2" that the agent chooses in its new state. SARSA is thus an acronym for the quintuple (st, at, rt, st+1, at+1). Some authors use a slightly different convention and write the quintuple (st, at, rt+1, st+1, at+1), depending on the time step to which the reward is formally assigned. The former convention is used throughout the rest of this article.

Because a SARSA agent interacts with the environment and updates its policy based on the actions it takes, SARSA is called an on-policy learning algorithm. The Q-value for a state–action pair is updated by an error term, scaled by the learning rate alpha. Q-values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state–action observation. Watkins's Q-learning, by contrast, updates an estimate of the optimal state–action value function Q* based on the maximum reward among the available actions. While both follow an exploration/exploitation strategy, Q-learning learns the Q-values associated with the optimal policy, whereas SARSA learns the Q-values associated with the policy it actually follows.
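The update described above can be sketched as a short function. This is an illustrative sketch, not a canonical implementation: the table layout (a dict of dicts keyed by state then action) and the default values of alpha and gamma are assumptions for the example.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA update: move Q(s, a) toward r + gamma * Q(s', a').

    Q is assumed to be a dict of dicts, Q[state][action] -> float.
    alpha is the learning rate; gamma is the discount factor.
    """
    # Temporal-difference error: difference between the bootstrapped
    # target (using the action actually chosen next) and the current estimate.
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q
```

Note that the target uses Q[s_next][a_next], the value of the action the agent actually selects next; replacing it with max over actions would give Watkins's Q-learning update instead.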

Full Code for Implementing the SARSA Algorithm

An implementation of the SARSA algorithm is shown below:
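The original code listing is not present here, so the following is a minimal self-contained sketch rather than the author's implementation. It assumes a toy chain environment (states 0 to 4, actions "left"/"right", reward 1 for reaching the rightmost state) and an epsilon-greedy behavior policy; the function names and hyperparameter defaults are illustrative choices.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon):
    # Explore with probability epsilon, otherwise take the greedy action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    return values.index(max(values))

def step(state, action, n_states):
    # Toy chain environment (an assumption for this example):
    # action 0 moves left, action 1 moves right, clipped to [0, n_states-1].
    # Reaching the rightmost state yields reward 1 and ends the episode.
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    if next_state == n_states - 1:
        return next_state, 1.0, True
    return next_state, 0.0, False

def sarsa(n_states=5, n_actions=2, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-values for every (state, action) pair.
    Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = step(state, action, n_states)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            # On-policy update: the target uses the action actually
            # selected next, which is what makes this SARSA.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

After training, the learned table should prefer moving right from the start state, since only the rightmost state is rewarded.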