Proximal policy optimization
Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large.
The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability of an earlier algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO enforces the trust region using the Hessian matrix (a matrix of second derivatives), which is computationally expensive for large-scale problems.
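Concretely, TRPO maximizes a surrogate objective subject to a constraint on the KL divergence between the old policy and the new policy (a sketch of the standard formulation; notation follows common usage rather than any one source):

\[
\max_{\theta} \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,
\]

where \hat{A}_t is an estimate of the advantage of action a_t in state s_t and \delta sets the size of the trust region. Enforcing this constraint is what requires second-order information about the KL term, i.e. the Hessian.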
PPO was published in 2017. It is essentially an approximation of TRPO that does not require computing the Hessian: the hard KL-divergence constraint is replaced by simply clipping the probability ratio between the new and old policies inside the surrogate objective.
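The resulting clipped surrogate objective, with probability ratio r_t(\theta) and a small hyperparameter \epsilon (e.g. 0.2), is

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
\]

Once the ratio leaves the interval [1-\epsilon, 1+\epsilon] in the direction that would improve the objective, the clipped term removes any further incentive (and gradient) to move it, which plays the role of the trust region without second-order computation.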
Since 2018, PPO has been the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games.
TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces.
In condensed form, each iteration of TRPO proceeds as follows:
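At iteration k, trajectories are collected with the current policy \pi_{\theta_k}, advantages \hat{A}_t are estimated, and the policy-gradient estimate \hat{g}_k and the Hessian \hat{H}_k of the average KL divergence are formed; the parameters are then updated with the natural-gradient step (a condensed sketch that omits the conjugate-gradient solve and backtracking line search used in practice):

\[
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{\hat{g}_k^{\top} \hat{H}_k^{-1} \hat{g}_k}} \; \hat{H}_k^{-1} \hat{g}_k .
\]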
For PPO, the corresponding training loop instead performs unconstrained gradient ascent on the clipped objective L^CLIP, repeating several epochs of updates on each freshly collected batch. In outline:
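A minimal sketch of such a loop is given below, assuming a discrete-action Gymnasium environment and small PyTorch policy and value networks; the environment, network sizes, hyperparameters, and the simple return-minus-baseline advantage estimate (the paper uses generalized advantage estimation) are illustrative choices rather than details from the original publication.

import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")                      # illustrative environment
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_fn.parameters()), lr=3e-4)

clip_eps, gamma, epochs_per_batch = 0.2, 0.99, 10  # illustrative hyperparameters

def collect_batch(n_steps=2048):
    # Run the current ("old") policy to gather one on-policy batch.
    obs_buf, act_buf, logp_buf, rew_buf, done_buf = [], [], [], [], []
    obs, _ = env.reset()
    for _ in range(n_steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        obs_buf.append(obs_t)
        act_buf.append(action)
        logp_buf.append(dist.log_prob(action).detach())   # log pi_old(a|s)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rew_buf.append(float(reward))
        done_buf.append(terminated or truncated)
        if terminated or truncated:
            obs, _ = env.reset()
    # Discounted returns-to-go, reset at episode boundaries
    # (no bootstrap for the final partial episode, to keep the sketch short).
    returns, running = [], 0.0
    for r, d in zip(reversed(rew_buf), reversed(done_buf)):
        running = r + gamma * running * (1.0 - float(d))
        returns.append(running)
    returns.reverse()
    return (torch.stack(obs_buf), torch.stack(act_buf),
            torch.stack(logp_buf), torch.as_tensor(returns, dtype=torch.float32))

for iteration in range(50):
    obs_b, act_b, old_logp_b, ret_b = collect_batch()
    for _ in range(epochs_per_batch):                 # several passes over the same batch
        dist = torch.distributions.Categorical(logits=policy(obs_b))
        new_logp = dist.log_prob(act_b)
        values = value_fn(obs_b).squeeze(-1)
        # Advantage estimate: return-to-go minus learned baseline, normalized.
        adv = (ret_b - values).detach()
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        ratio = torch.exp(new_logp - old_logp_b)      # r_t(theta)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * adv, clipped * adv).mean()   # minimize -L^CLIP
        value_loss = ((values - ret_b) ** 2).mean()
        loss = policy_loss + 0.5 * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because the old log-probabilities are stored when the batch is collected, the ratio r_t(\theta) stays well defined across the repeated epochs of updates on the same data, which is what lets PPO reuse each batch several times while remaining an on-policy method in spirit.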
Like all policy gradient methods, PPO trains an RL agent whose actions are determined by a differentiable policy function, and it updates that policy's parameters by gradient ascent.