Hybrid Policy Gradient for Deep Reinforcement Learning
In this paper, we propose an alternative way of updating the actor (policy) in the DDPG algorithm to speed up convergence and stabilize the learning process. In the basic actor-critic architecture with temporal-difference (TD) learning of the critic, the actor parameters are updated by gradient ascent in the direction given by the critic's TD error. In our proposed hybrid method, Hybrid-DDPG (H-DDPG for short), the actor is updated at one time step as in DDPG (using the gradient of the critic output with respect to the policy parameters), and at another time step the policy parameters are moved in the direction of the critic's TD error, as in the basic actor-critic approach. The algorithm is tested on OpenAI Gym's RoboschoolInvertedPendulumSwingup-v1 environment. In one of 5 trial runs, the reward obtained in the early stage of training was higher for H-DDPG than for DDPG. In the hybrid update, the policy gradients are weighted by the TD error. This pushes the policy parameters in a direction that makes actions with higher reward more likely to occur than others, which is why H-DDPG obtained a higher reward than DDPG in that run. This suggests that if the policy encounters good rewards early in exploration, it may converge quickly; otherwise, convergence may be slower. In the remaining trial runs, however, H-DDPG performed the same as DDPG.
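The alternating actor update described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the TD-error-weighted step, so the version below is an assumption: it uses the standard Gaussian score-function form, in which the TD error scales the difference between the explored action and the policy output. The linear actor, the toy quadratic critic, the even/odd alternation schedule, and all function names are likewise hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def q_value(w, s, a):
    """Toy critic (assumption): Q(s, a) = -(a - w @ s)^2, maximal at a = w @ s."""
    return -(a - w @ s) ** 2

def dq_da(w, s, a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - w @ s)

def actor(theta, s):
    """Linear deterministic policy (assumption): mu(s) = theta @ s."""
    return theta @ s

def ddpg_actor_step(theta, w, s, lr):
    """DDPG-style step: ascend dQ/da * dmu/dtheta (dmu/dtheta = s for a linear actor)."""
    a = actor(theta, s)
    return theta + lr * dq_da(w, s, a) * s

def td_error(theta, w, s, a_taken, r, s_next, gamma):
    """One-step TD error of the critic, bootstrapping with the current policy."""
    target = r + gamma * q_value(w, s_next, actor(theta, s_next))
    return target - q_value(w, s, a_taken)

def td_actor_step(theta, w, s, a_taken, r, s_next, gamma, lr):
    """TD-error-weighted step (assumed form): delta * (a_taken - mu(s)) * dmu/dtheta."""
    delta = td_error(theta, w, s, a_taken, r, s_next, gamma)
    return theta + lr * delta * (a_taken - actor(theta, s)) * s

def hybrid_actor_step(t, theta, w, s, a_taken, r, s_next, gamma=0.99, lr=0.01):
    """Alternate updates: even steps use the DDPG rule, odd steps the TD-weighted rule."""
    if t % 2 == 0:
        return ddpg_actor_step(theta, w, s, lr)
    return td_actor_step(theta, w, s, a_taken, r, s_next, gamma, lr)
```

In this sketch, a positive TD error moves the policy output toward the explored action and a negative one moves it away, which matches the intuition that actions followed by higher reward become more likely. A full implementation would also train the critic and use target networks and a replay buffer as in DDPG; those parts are omitted here.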