Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm for learning continuous actions. It combines ideas from DPG (Deterministic Policy Gradient) and DQN (Deep Q-Network): it uses Experience Replay and slow-learning target networks from DQN, and it is based on DPG, which can operate over continuous action spaces. This tutorial closely follows the paper Continuous control with deep reinforcement learning.
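To make the Experience Replay idea concrete, here is a minimal sketch of a replay buffer using only the Python standard library. The class name ReplayBuffer, its method names, and the default capacity are illustrative assumptions, not taken from the paper or from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, capacity=50_000, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sample without replacement, capped at the number of stored transitions.
        k = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), k)

    def __len__(self):
        return len(self._storage)


# Usage: with capacity 3, only the three most recent transitions survive.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0.1 * step, reward=-1.0, next_state=step + 1)
batch = buffer.sample(batch_size=2)
```

Sampling uniformly from past transitions, rather than learning only from the most recent one, breaks up the correlation between consecutive steps.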

Just like the Actor-Critic method, we have two networks. The Actor proposes an action given a state. The Critic predicts whether an action is good (positive value) or bad (negative value), given a state and an action.

DDPG also relies on two techniques for stable training. The first is a pair of target networks, one for the Actor and one for the Critic. Why? Because they improve training stability. In brief, we are learning from estimated targets, and because the target networks are updated slowly, those estimated targets stay stable. The second is Experience Replay, already mentioned above.
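The slow target-network update is commonly implemented as a "soft" (Polyak) update, in which each target weight moves a small fraction tau toward the corresponding online weight. The sketch below assumes weights are stored as plain nested lists of floats; the function name soft_update and the value of tau are illustrative assumptions.

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Return new target weights: tau * online + (1 - tau) * target.

    Both arguments are lists of layers, each layer a list of floats.
    This layout is an assumption made to keep the sketch dependency-free.
    """
    return [
        [tau * w + (1.0 - tau) * t for t, w in zip(t_layer, w_layer)]
        for t_layer, w_layer in zip(target_weights, online_weights)
    ]


target = [[0.0, 0.0], [1.0]]
online = [[1.0, 2.0], [3.0]]

# One update with tau=0.1 moves each target weight 10% of the way to the online weight.
updated = soft_update(target, online, tau=0.1)
```

A small tau keeps the targets nearly fixed between updates, while tau = 1 would copy the online weights outright.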


From the figure above: Critic loss is the mean squared error of y - Q(s, a), where y is the expected return as seen by the Target network and Q(s, a) is the action value predicted by the Critic network. y is a moving target that the Critic model tries to reach; we keep this target stable by updating the Target model slowly.

Actor loss is computed as the mean of the values that the Critic network assigns to the actions taken by the Actor network. We seek to maximize this quantity. Hence, we update the Actor network so that it produces the actions with the highest predicted value, as seen by the Critic, for a given state.
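The two losses can be sketched numerically on a toy batch. The code below assumes the usual DDPG target y = r + gamma * Q'(s', mu'(s')), where Q' and mu' are the target Critic and target Actor and gamma is a discount factor; the function names and the numbers are made up for illustration, and the network outputs are passed in as plain lists rather than computed by real networks.

```python
def td_targets(rewards, next_q_target, gamma=0.99):
    """y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)), using target-network values."""
    return [r + gamma * q for r, q in zip(rewards, next_q_target)]


def critic_loss(targets, q_values):
    """Mean squared error between targets y and Critic predictions Q(s, a)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def actor_loss(q_of_actor_actions):
    """Negative mean Critic value: minimizing it maximizes Q(s, mu(s))."""
    return -sum(q_of_actor_actions) / len(q_of_actor_actions)


# Toy batch of two transitions (illustrative values only).
rewards = [-1.0, -0.5]
next_q_target = [2.0, 4.0]   # target Critic's value of the target Actor's next actions
q_values = [0.5, 1.0]        # Critic's value of the actions stored in the batch

y = td_targets(rewards, next_q_target, gamma=0.5)
c_loss = critic_loss(y, q_values)
a_loss = actor_loss([1.0, 3.0])  # Critic's value of the Actor's current actions
```

The sign flip in actor_loss exists only because optimizers minimize: pushing the negative mean down is the same as pushing the Critic's value of the Actor's actions up.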

Full Code for Implementing the Deep Deterministic Policy Gradient (DDPG) Algorithm

We are trying to solve the classic Inverted Pendulum control problem. In this setting, we can act in only two directions: swing left or swing right. What makes this problem challenging for Q-Learning algorithms is that the actions are continuous rather than discrete. That is, instead of using two discrete actions such as -1 or +1, we have to select from infinitely many actions ranging from -2 to +2.
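One common way to produce a continuous action inside a fixed range is to squash the Actor's raw output with tanh and scale it by the action bound. The sketch below assumes the bound of 2 from the text; the helper names scale_action and clip_action are hypothetical.

```python
import math

ACTION_BOUND = 2.0  # from the text: actions range from -2 to +2


def scale_action(raw_output, bound=ACTION_BOUND):
    """Map an unbounded Actor output into [-bound, +bound] via tanh."""
    return bound * math.tanh(raw_output)


def clip_action(action, bound=ACTION_BOUND):
    """Hard-limit an action to the legal range, e.g. after adding noise."""
    return max(-bound, min(bound, action))


# Any real-valued raw output yields a legal action; negative means swing one
# way, positive the other, and the magnitude sets how hard.
actions = [scale_action(x) for x in (-10.0, 0.0, 0.3, 10.0)]
```

Because the output is a real number rather than an index into a fixed action list, there is no finite set of Q-values to take a maximum over, which is why plain Q-Learning does not apply directly.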