Proximal policy optimization
Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large.
The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability of an earlier algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO enforces the trust region using the Hessian matrix (a matrix of second derivatives), which is computationally expensive for large-scale problems.
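Concretely, TRPO maximizes a surrogate objective subject to a constraint on the KL divergence between the old policy and the new policy (a sketch of the standard formulation; notation follows common usage rather than any one source):

\[
\max_{\theta} \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,
\]

where \hat{A}_t is an estimate of the advantage of action a_t in state s_t and \delta sets the size of the trust region. Enforcing this constraint is what requires second-order information about the KL term, i.e. the Hessian.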
PPO was published in 2017. It is essentially an approximation of TRPO that does not require computing the Hessian: the hard KL-divergence constraint is replaced by simply clipping the probability ratio between the new and old policies inside the surrogate objective.
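The resulting clipped surrogate objective, with probability ratio r_t(\theta) and a small hyperparameter \epsilon (e.g. 0.2), is

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
\]

Once the ratio leaves the interval [1-\epsilon, 1+\epsilon] in the direction that would improve the objective, the clipped term removes any further incentive (and gradient) to move it, which plays the role of the trust region without second-order computation.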
Since 2018, PPO has been the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games.
TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces.
In condensed form, each iteration of TRPO proceeds as follows:
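At iteration k, trajectories are collected with the current policy \pi_{\theta_k}, advantages \hat{A}_t are estimated, and the policy-gradient estimate \hat{g}_k and the Hessian \hat{H}_k of the average KL divergence are formed; the parameters are then updated with the natural-gradient step (a condensed sketch that omits the conjugate-gradient solve and backtracking line search used in practice):

\[
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{\hat{g}_k^{\top} \hat{H}_k^{-1} \hat{g}_k}} \; \hat{H}_k^{-1} \hat{g}_k .
\]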
For PPO, the corresponding training loop instead performs unconstrained gradient ascent on the clipped objective L^CLIP, repeating several epochs of updates on each freshly collected batch. In outline:
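A minimal sketch of such a loop is given below, assuming a discrete-action Gymnasium environment and small PyTorch policy and value networks; the environment, network sizes, hyperparameters, and the simple return-minus-baseline advantage estimate (the paper uses generalized advantage estimation) are illustrative choices rather than details from the original publication.

import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")                      # illustrative environment
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_fn.parameters()), lr=3e-4)

clip_eps, gamma, epochs_per_batch = 0.2, 0.99, 10  # illustrative hyperparameters

def collect_batch(n_steps=2048):
    # Run the current ("old") policy to gather one on-policy batch.
    obs_buf, act_buf, logp_buf, rew_buf, done_buf = [], [], [], [], []
    obs, _ = env.reset()
    for _ in range(n_steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        obs_buf.append(obs_t)
        act_buf.append(action)
        logp_buf.append(dist.log_prob(action).detach())   # log pi_old(a|s)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rew_buf.append(float(reward))
        done_buf.append(terminated or truncated)
        if terminated or truncated:
            obs, _ = env.reset()
    # Discounted returns-to-go, reset at episode boundaries
    # (no bootstrap for the final partial episode, to keep the sketch short).
    returns, running = [], 0.0
    for r, d in zip(reversed(rew_buf), reversed(done_buf)):
        running = r + gamma * running * (1.0 - float(d))
        returns.append(running)
    returns.reverse()
    return (torch.stack(obs_buf), torch.stack(act_buf),
            torch.stack(logp_buf), torch.as_tensor(returns, dtype=torch.float32))

for iteration in range(50):
    obs_b, act_b, old_logp_b, ret_b = collect_batch()
    for _ in range(epochs_per_batch):                 # several passes over the same batch
        dist = torch.distributions.Categorical(logits=policy(obs_b))
        new_logp = dist.log_prob(act_b)
        values = value_fn(obs_b).squeeze(-1)
        # Advantage estimate: return-to-go minus learned baseline, normalized.
        adv = (ret_b - values).detach()
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        ratio = torch.exp(new_logp - old_logp_b)      # r_t(theta)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * adv, clipped * adv).mean()   # minimize -L^CLIP
        value_loss = ((values - ret_b) ** 2).mean()
        loss = policy_loss + 0.5 * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because the old log-probabilities are stored when the batch is collected, the ratio r_t(\theta) stays well defined across the repeated epochs of updates on the same data, which is what lets PPO reuse each batch several times while remaining an on-policy method in spirit.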
Like all policy gradient methods, PPO trains an RL agent whose actions are determined by a differentiable policy function, and it updates that policy's parameters by gradient ascent.