Reinforcement learning from human feedback
In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent with human preferences. It involves training a reward model to represent preferences, which can then be used to train other models through reinforcement learning.
In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a policy. This function is iteratively updated to maximize rewards based on the agent's task performance.[1] However, explicitly defining a reward function that accurately approximates human preferences is challenging. Therefore, RLHF seeks to train a "reward model" directly from human feedback.[2] The reward model is first trained in a supervised manner to predict if a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model then serves as a reward function to improve an agent's policy through an optimization algorithm like proximal policy optimization.[3][4][5]
RLHF has applications in various domains in machine learning, including natural language processing tasks such as text summarization and conversational agents, computer vision tasks like text-to-image models, and the development of video game bots. While RLHF is an effective method of training models to act better in accordance with human preferences, it also faces challenges due to the way the human preference data is collected. Though RLHF does not require massive amounts of data to improve performance, sourcing high-quality preference data is still an expensive process. Furthermore, if the data is not carefully collected from a representative sample, the resulting model may exhibit unwanted biases.

Background and motivation
Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge.[6] For example, one may want to train a model to generate safe text that is both helpful and harmless (such as lacking bias, toxicity, or otherwise harmful content). Asking humans to manually create examples of harmless and harmful text would be difficult and time-consuming. However, humans are adept at swiftly assessing and comparing the harmfulness of different AI-generated text. Therefore, a more practical objective would be to allow the model to use this type of human feedback to improve its text generation.[7]
Despite the clear benefits of incorporating human feedback in training models, prior efforts—including some that leverage reinforcement learning—have encountered significant challenges. Most attempts were either narrow and difficult to generalize, breaking down on more complex tasks,[8][9][10][11] or they faced difficulties learning from sparse (lacking specific information and relating to large amounts of text at a time) or noisy (inconsistently rewarding similar outputs) reward functions.[12][13]
RLHF was not the first successful method of using human feedback for reinforcement learning, but it is one of the most widely used. The foundation for RLHF was introduced as an attempt to create a general algorithm for learning from a practical amount of human feedback.[6][3] The algorithm as used today was introduced by OpenAI in a paper on enhancing text continuation or summarization based on human feedback, and it began to gain popularity when the same method was reused in their paper on InstructGPT.[2][14][15] RLHF has also been shown to improve the robustness of RL agents and their capacity for exploration, which results in an optimization process more adept at handling uncertainty and efficiently exploring its environment in search of the highest reward.[16]
Collecting human feedback
Human feedback is commonly collected by prompting humans to rank instances of the agent's behavior.[15][17][18] These rankings can then be used to score outputs, for example, using the Elo rating system, which is an algorithm for calculating the relative skill levels of players in a game based only on the outcome of each game.[3] While ranking outputs is the most widely adopted form of feedback, recent research has explored other forms, such as numerical feedback, natural language feedback, and prompting for direct edits to the model's output.[19]
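As a concrete illustration of how ranked comparisons can be turned into scalar scores, the following is a minimal sketch of Elo-style updates over pairwise preference judgments. The response names, starting scores, and comparison outcomes are hypothetical, not from any cited system:

```python
def elo_update(scores, winner, loser, k=32):
    """Update Elo scores in place after one pairwise comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((scores[loser] - scores[winner]) / 400))
    scores[winner] += k * (1.0 - expected_win)  # winner gains rating
    scores[loser] -= k * (1.0 - expected_win)   # loser loses the same amount
    return scores

# Three candidate responses, all starting at 1000.
# Annotators preferred A over B, A over C, and B over C.
scores = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
for winner, loser in [("A", "B"), ("A", "C"), ("B", "C")]:
    elo_update(scores, winner, loser)

assert scores["A"] > scores["B"] > scores["C"]
```

The resulting ordering of scores recovers the annotators' ranking, which is the property a downstream reward model needs.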
One initial motivation of RLHF was that it requires relatively small amounts of comparison data to be effective.[6] It has been shown that a small amount of data can lead to comparable results to a larger amount. In addition, increasing the amount of data tends to be less effective than proportionally increasing the size of the reward model.[14] Nevertheless, a larger and more diverse amount of data can be crucial for tasks where it is important to avoid bias from a partially representative group of annotators.[15]
When learning from human feedback through pairwise comparison under the Bradley–Terry–Luce model (or the Plackett–Luce model for K-wise comparisons among more than two options), the maximum likelihood estimator (MLE) for linear reward functions has been shown to converge if the comparison data is generated under a well-specified linear model. This implies that, under certain conditions, if a model is trained to decide which choices people would prefer between pairs (or groups) of choices, it will necessarily improve at predicting future preferences. This improvement is expected as long as the comparisons it learns from are based on a consistent and simple rule.[20][21]
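The Bradley–Terry maximum likelihood estimation described above can be sketched with a small gradient-ascent fit. This is an illustrative implementation under simplifying assumptions (scalar per-item scores rather than a learned reward function, and made-up comparison data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_bradley_terry(n_items, comparisons, lr=0.1, steps=2000):
    """MLE for Bradley-Terry scores r given (winner, loser) pairs, by gradient ascent.

    Under the model, P(i beats j) = sigmoid(r[i] - r[j])."""
    r = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for w, l in comparisons:
            p = sigmoid(r[w] - r[l])  # model's probability that w beats l
            grad[w] += 1.0 - p        # d log-likelihood / d r[w]
            grad[l] -= 1.0 - p
        r += lr * grad
        r -= r.mean()                 # scores are identifiable only up to a constant
    return r

# Hypothetical data: item 0 beats item 1 three times out of four;
# item 1 always beats item 2.
data = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 2), (1, 2)]
r = fit_bradley_terry(3, data)
assert r[0] > r[1] > r[2]
```

The fitted scores reproduce the ordering implied by the comparisons, consistent with the convergence claim above.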
Both offline data collection, where the model learns by interacting with a static dataset and updating its policy in batches, and online data collection, where the model directly interacts with the dynamic environment and updates its policy immediately, have been mathematically studied, proving sample complexity bounds for RLHF under different feedback models.[20][22]
In the offline data collection model, when the objective is policy training, a pessimistic MLE that incorporates a lower confidence bound as the reward estimate is most effective. Moreover, when applicable, it has been shown that considering K-wise comparisons directly is asymptotically more efficient than converting them into pairwise comparisons for prediction purposes.[22][23][15]
In the online scenario, when human feedback is collected through pairwise comparisons under the Bradley–Terry–Luce model and the objective is to minimize the algorithm's regret (the difference in performance compared to an optimal agent), it has been shown that an optimistic MLE that incorporates an upper confidence bound as the reward estimate can be used to design sample efficient algorithms (meaning that they require relatively little training data). A key challenge in RLHF when learning from pairwise (or dueling) comparisons is associated with the non-Markovian nature of its optimal policies. Unlike simpler scenarios where the optimal strategy does not require memory of past actions, in RLHF, the best course of action often depends on previous events and decisions, making the strategy inherently memory-dependent.[21]
Applications
RLHF has been applied to various domains of natural language processing (NLP), such as conversational agents, text summarization, and natural language understanding.[24][14] Ordinary reinforcement learning, in which agents learn from their actions based on a predefined "reward function", is difficult to apply to NLP tasks because the rewards tend to be difficult to define or measure, especially when dealing with complex tasks that involve human values or preferences.[6] RLHF can steer NLP models, in particular language models, to provide answers that align with human preferences with regard to such tasks by capturing their preferences beforehand in the reward model. This results in a model capable of generating more relevant responses and rejecting inappropriate or irrelevant queries.[15][25] Some notable examples of RLHF-trained language models are OpenAI's ChatGPT (and its predecessor InstructGPT),[17][26][27] DeepMind's Sparrow,[28][29][30] Google's Gemini,[31] and Anthropic's Claude.[32]
In computer vision, RLHF has also been used to align text-to-image models. Studies that successfully used RLHF for this goal have noted that the use of KL regularization in RLHF, which aims to prevent the learned policy from straying too far from the unaligned model, helped to stabilize the training process by reducing overfitting to the reward model. The final image outputs from models trained with KL regularization were noted to be of significantly higher quality than those trained without.[33][34] Other methods tried to incorporate the feedback through more direct training—based on maximizing the reward without the use of reinforcement learning—but conceded that an RLHF-based approach would likely perform better due to the online sample generation used in RLHF during updates as well as the aforementioned KL regularization over the prior model, which mitigates overfitting to the reward function.[35]
RLHF was initially applied to other areas, such as the development of video game bots and tasks in simulated robotics. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences. In classical RL-based training of such bots, the reward function is simply correlated to how well the agent is performing in the game, usually using metrics like the in-game score. In comparison, in RLHF, a human is periodically presented with two clips of the agent's behavior in the game and must decide which one looks better. This approach can teach agents to perform at a competitive level without ever having access to their score. In fact, it was shown that RLHF can sometimes lead to superior performance over RL with score metrics because the human's preferences can contain more useful information than performance-based metrics.[6][36] The agents achieved strong performance in many of the environments tested, often surpassing human performance.[37]
Training
In RLHF, two different models are trained: a reward model and a reinforcement learning (RL) policy. The reward model learns to determine what behavior is desirable based on human feedback, while the policy is guided by the reward model to determine the agent's actions. Both models are commonly initialized using a pre-trained autoregressive language model. This model is then customarily trained in a supervised manner on a relatively small dataset of pairs of prompts to an assistant and their accompanying responses, written by human annotators.
Reward model
The reward model is usually initialized with a pre-trained model, as this initializes it with an understanding of language and focuses training explicitly on learning human preferences. In addition to being used to initialize the reward model and the RL policy, the model is then also used to sample data to be compared by annotators.[15][14]
The reward model is then trained by replacing the final layer of the previous model with a randomly initialized regression head. This change shifts the model from its original classification task over its vocabulary to simply outputting a number corresponding to the score of any given prompt and response. This model is trained on the human preference comparison data collected earlier from the supervised model. In particular, it is trained to minimize the following cross-entropy loss function:

\mathcal{L}(\phi) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x, y_w, y_l)\sim D}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

where K is the number of responses the labelers ranked, r_\phi(x, y) is the output of the reward model for prompt x and completion y, y_w is the preferred completion over y_l, \sigma denotes the sigmoid function, and \mathbb{E} denotes the expected value.[15] This can be thought of as a form of logistic regression, where the model predicts the probability that a response y_w is preferred over y_l.
This loss function essentially measures the difference between the reward model's predictions and the decisions made by humans. The goal is to make the model's guesses as close as possible to the humans' preferences by minimizing the difference measured by this equation. In the case of only pairwise comparisons, K = 2, so the factor \binom{K}{2} equals 1.[14] In general, all comparisons from each prompt are used for training as a single batch.[15]
After training, the outputs of the model are normalized such that the reference completions have a mean score of 0. That is, the mean of r_\phi(x, y_{ref}) over each query and reference pair is set to zero by calculating the mean reward across the training dataset and setting it as the bias in the reward head.[14]
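As a numerical sketch of the pairwise cross-entropy loss described above, the following uses hypothetical reward values for preferred and rejected completions (pairwise comparisons, so the combinatorial factor is 1):

```python
import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    """Pairwise cross-entropy loss: mean of -log sigmoid(r_w - r_l) over comparisons."""
    margin = np.asarray(r_preferred) - np.asarray(r_rejected)
    return float(np.mean(-np.log(1.0 / (1.0 + np.exp(-margin)))))

# Hypothetical reward-model scores for (preferred, rejected) completion pairs.
good = reward_model_loss([2.0, 1.5], [0.0, -1.0])  # model agrees with annotators
bad = reward_model_loss([0.0, -1.0], [2.0, 1.5])   # model disagrees
assert 0 < good < bad  # loss is lower when preferred completions score higher
```

Minimizing this quantity pushes the margin between preferred and rejected scores up, exactly as the logistic-regression interpretation suggests.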
Policy
Similarly to the reward model, the human feedback policy is also initialized from a pre-trained model.[14]
The key is to understand language generation as if it is a game to be learned by RL. In RL, a policy is a function that maps a game state to a game action. In RLHF, the "game" is the game of replying to prompts. A prompt is a game state, and a response is a game action. This is a fairly trivial kind of game, since every game lasts for exactly one step. Nevertheless, it is a game, and so RL algorithms can be applied to it.
The first step in its training is supervised fine-tuning (SFT). This step does not require the reward model. Instead, the pre-trained model is trained on a dataset D_{SFT} that contains prompt-response pairs (x, y). Then, during SFT, the model is trained to autoregressively generate the corresponding response y when given a random prompt x. The original paper recommends running SFT for only one epoch, since more than that causes overfitting.
The dataset is usually written by human contractors, who write both the prompts and responses.
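The SFT step described above amounts to minimizing the negative log-likelihood of the response tokens given the prompt. A minimal sketch, with hypothetical per-token log-probabilities:

```python
def sft_loss(token_logprobs):
    """Supervised fine-tuning loss: mean negative log-likelihood of the response tokens,
    generated autoregressively given the prompt (prompt tokens are excluded)."""
    return -sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log-probs for a 4-token response under the model.
loss = sft_loss([-0.1, -0.2, -0.3, -0.4])
assert abs(loss - 0.25) < 1e-9  # average NLL of the response tokens
```

Training decreases this loss by raising the probability the model assigns to the annotator-written responses.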
The second step uses a policy gradient method to optimize against the reward model. It uses a dataset D_{RL}, which contains prompts, but not responses. Like most policy gradient methods, this algorithm has an outer loop and two inner loops:
- Initialize the policy \pi_\theta to \pi^{SFT}, the policy output from SFT.
- Loop for many steps:
  - Initialize a new empty dataset D.
  - Loop for many steps:
    - Sample a random prompt x from D_{RL}.
    - Generate a response y from the policy \pi_\theta.
    - Calculate the reward signal r_\phi(x, y) from the reward model.
    - Add the triple (x, y, r_\phi(x, y)) to D.
  - Update \theta by a policy gradient method to increase the objective function
    J(\theta) = \mathbb{E}_{(x, y)\sim D}\left[r_\phi(x, y) - \beta \log\frac{\pi_\theta(y\mid x)}{\pi^{SFT}(y\mid x)}\right]
Note that \mathbb{E}_{(x, y)\sim D} is equivalent to \mathbb{E}_{x\sim D_{RL},\, y\sim \pi_\theta(\cdot\mid x)}, which means "sample a prompt from D_{RL}, then sample a response from the policy".
The objective function has two parts. The first part is simply the expected reward \mathbb{E}[r_\phi(x, y)], and is standard for any RL algorithm. The second part is a "penalty term" involving the KL divergence. The strength of the penalty term is determined by the hyperparameter \beta.
This KL term works by penalizing the KL divergence (a measure of statistical distance between distributions) between the model being fine-tuned and the initial supervised model. By choosing an appropriate \beta, the training can balance learning from new data while retaining useful information from the initial model, increasing generalization by avoiding fitting too closely to the new data. Aside from preventing the new model from producing outputs too dissimilar to those of the initial model, a second motivation for including the KL term is to encourage the model to output high-entropy text, so as to prevent it from collapsing to a small number of canned responses.[14]
In simpler terms, the objective function calculates how well the policy's responses are expected to align with human feedback. The policy generates responses to prompts, and each response is evaluated both on how well it matches human preferences (as measured by the reward model) and how similar it is to responses the model would naturally generate. The goal is to balance improving alignment with human preferences while ensuring the model's responses remain diverse and not too far removed from what it has learned during its initial training. This helps the model not only to provide answers that people find useful or agreeable but also to maintain a broad understanding and avoid overly narrow or repetitive responses.
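The balance described above can be illustrated with the per-sample objective, reward minus a \beta-weighted log-ratio penalty. The rewards and log-probabilities below are hypothetical:

```python
def rlhf_objective(reward, logp_policy, logp_sft, beta=0.1):
    """Per-sample RLHF objective: reward minus beta-weighted log-ratio (KL penalty)."""
    return reward - beta * (logp_policy - logp_sft)

# Both samples earn the same reward, but the second is far more likely under the
# tuned policy than under the SFT reference, so the KL penalty reduces its objective.
close = rlhf_objective(reward=1.0, logp_policy=-2.0, logp_sft=-2.1)
drifted = rlhf_objective(reward=1.0, logp_policy=-0.1, logp_sft=-5.0)
assert close > drifted
```

At equal reward, the sample that stays close to the reference distribution is preferred, which is exactly the trade-off the penalty encodes.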
Proximal policy optimization
The policy function is usually trained by the proximal policy optimization (PPO) algorithm. That is, the parameter \theta is trained by gradient ascent on the clipped surrogate objective.[15][14]
Classically, the PPO algorithm employs generalized advantage estimation, which means that there is an extra value estimator V_\xi(x) that is updated concurrently with the policy during PPO training. The value estimator is used only during training, and not outside of training.
PPO performs gradient ascent on the following clipped surrogate objective:

L^{CLIP}(\theta) = \mathbb{E}\left[\min\left(\rho(\theta)\, A(x, y),\ \operatorname{clip}\left(\rho(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A(x, y)\right)\right], \quad \rho(\theta) = \frac{\pi_\theta(y\mid x)}{\pi_{\theta_{old}}(y\mid x)}

where the advantage term is defined as A(x, y) = r_\phi(x, y) - V_\xi(x). That is, the advantage is computed as the difference between the reward (the return actually obtained) and the value estimate (the return expected under the current policy). This is used to train the policy by gradient ascent, usually using a standard momentum-gradient optimizer, like the Adam optimizer.
The original paper initialized the value estimator from the trained reward model.[14] Since PPO is an actor-critic algorithm, the value estimator is updated concurrently with the policy, via minimizing the squared TD-error, which in this case equals the squared advantage term:

\mathcal{L}(\xi) = \mathbb{E}_{(x, y)\sim D}\left[\left(r_\phi(x, y) - V_\xi(x)\right)^2\right]

which is minimized by gradient descent. Methods other than the squared TD-error might also be used; see the actor-critic algorithm page for details.
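The clipping behavior of the PPO surrogate can be sketched numerically. This toy function shows how gains from pushing the probability ratio beyond the clip range are capped when the advantage is positive, while the more pessimistic value is kept when the advantage is negative:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    return np.minimum(ratio * advantage, np.clip(ratio, 1 - eps, 1 + eps) * advantage)

# With positive advantage, gains from pushing the ratio above 1+eps are clipped away.
assert ppo_clip_objective(1.1, 2.0) == 2.2
assert ppo_clip_objective(1.5, 2.0) == 2.4   # capped at (1 + 0.2) * 2.0
# With negative advantage, the unclipped (more pessimistic) term is kept.
assert ppo_clip_objective(1.5, -2.0) == -3.0
```

Because the objective stops growing once the ratio leaves the clip range, gradient ascent has no incentive to move the policy far from the old policy in a single update.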
Mixing pretraining gradients
A third term is commonly added to the objective function to prevent catastrophic forgetting. For example, if the model is only trained on customer-service data, then it might forget general knowledge in geography. To prevent this, the RLHF process incorporates the original language modeling objective. That is, some random texts x are sampled from the original pretraining dataset D_{pretrain}, and the model is trained to maximize the log-likelihood \log \pi_\theta(x) of the text. The final objective adds the pretraining term

\gamma\, \mathbb{E}_{x\sim D_{pretrain}}\left[\log \pi_\theta(x)\right]

to the RLHF objective, where \gamma controls the strength of this pretraining term.[15] This combined objective function is called PPO-ptx, where "ptx" means "Mixing Pretraining Gradients".[7] It was first used in the InstructGPT paper.[15]
In total, this objective function defines the method for adjusting the RL policy, blending the aim of aligning with human feedback and maintaining the model's original language understanding.
So, writing it out fully explicitly, the PPO-ptx objective function is:

J^{PPO\text{-}ptx}(\theta) = \mathbb{E}_{x\sim D_{RL},\, y\sim \pi_\theta(\cdot\mid x)}\left[r_\phi(x, y) - \beta \log\frac{\pi_\theta(y\mid x)}{\pi^{SFT}(y\mid x)}\right] + \gamma\, \mathbb{E}_{x\sim D_{pretrain}}\left[\log \pi_\theta(x)\right]

which is optimized by gradient ascent.
Limitations
RLHF suffers from challenges with collecting human feedback, learning a reward model, and optimizing the policy.[38] Compared to data collection for techniques like unsupervised or self-supervised learning, collecting data for RLHF is less scalable and more expensive. Its quality and consistency may vary depending on the task, interface, and the preferences and biases of individual humans.[15][39]
The effectiveness of RLHF depends on the quality of human feedback. For instance, the model may become biased, favoring certain groups over others, if the feedback lacks impartiality, is inconsistent, or is incorrect.[3][40] There is a risk of overfitting, where the model memorizes specific feedback examples instead of learning to generalize. For instance, feedback predominantly from a specific demographic might lead the model to learn peculiarities or noise, along with the intended alignment. Excessive alignment to the specific feedback it received (that is, to the bias therein) can lead to the model performing sub-optimally in new contexts or when used by different groups.[41] A single reward function cannot always represent the opinions of diverse groups of people. Even with a representative sample, conflicting views and preferences may result in the reward model favoring the majority's opinion, potentially disadvantaging underrepresented groups.[38]
In some cases, as is possible in regular reinforcement learning, there may be a risk of the model learning to manipulate the feedback process or game the system to achieve higher rewards rather than genuinely improving its performance.[42] In the case of RLHF, a model may learn to exploit the fact that it is rewarded for what is evaluated positively and not necessarily for what is actually good, which can lead to it learning to persuade and manipulate. For example, models might learn that apparent confidence, even if inaccurate, garners higher rewards. Such behavior, if unchecked, is not just incentivized but can cause significant deployment issues due to the model's potential to mislead. Studies have found that humans are not skilled at identifying mistakes in LLM outputs in complex tasks; therefore, models learning to generate confident-sounding yet incorrect text can lead to significant issues when deployed.[38]
Alternatives
Reinforcement learning from AI feedback
Similarly to RLHF, reinforcement learning from AI feedback (RLAIF) relies on training a preference model, except that the feedback is automatically generated.[43] This is notably used in Anthropic's constitutional AI, where the AI feedback is based on the conformance to the principles of a constitution.[44]
Direct alignment algorithms
Direct alignment algorithms (DAA) have been proposed as a new class of algorithms[45][46] that seek to directly optimize large language models (LLMs) on human feedback data in a supervised manner instead of the traditional policy-gradient methods.
These algorithms aim to align models with human intent more transparently by removing the intermediate step of training a separate reward model. Instead of first predicting human preferences and then optimizing against those predictions, direct alignment methods train models end-to-end on human-labeled or curated outputs. This reduces potential misalignment risks introduced by proxy objectives or reward hacking.
By directly optimizing for the behavior preferred by humans, these approaches often enable tighter alignment with human values, improved interpretability, and simpler training pipelines compared to RLHF.
Direct preference optimization
Direct preference optimization (DPO) is a technique to learn human preferences. Like RLHF, it has been applied to align pre-trained large language models using human-generated preference data. Unlike RLHF, however, which first trains a separate intermediate model to understand what good outcomes look like and then teaches the main model how to achieve those outcomes, DPO simplifies the process by directly adjusting the main model according to people's preferences. It uses a change of variables to define the "preference loss" directly as a function of the policy and uses this loss to fine-tune the model, helping it understand and prioritize human preferences without needing a separate step. Essentially, this approach directly shapes the model's decisions based on positive or negative human feedback.
Recall, the pipeline of RLHF is as follows:
- We begin by gathering a human preference dataset D.
- We then fit a reward model r_\phi to the data, by maximum likelihood estimation under the Plackett–Luce model.
- We finally train an optimal policy \pi_\theta that maximizes the objective function:
  J(\theta) = \mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot\mid x)}\left[r_\phi(x, y) - \beta \log\frac{\pi_\theta(y\mid x)}{\pi_{ref}(y\mid x)}\right]
However, instead of doing the intermediate step of the reward model, DPO directly optimizes for the final policy.
First, solve directly for the optimal policy, which can be done by Lagrange multipliers, as usual in statistical mechanics:

\pi^*(y\mid x) = \frac{1}{Z(x)}\, \pi_{ref}(y\mid x)\, \exp\left(\frac{r(x, y)}{\beta}\right)

where Z(x) = \sum_{y} \pi_{ref}(y\mid x)\, \exp\left(r(x, y)/\beta\right) is the partition function. This is unfortunately not tractable, since it requires summing over all possible responses y.
Next, invert this relationship to express the reward implicitly in terms of the optimal policy:

r(x, y) = \beta \log\frac{\pi^*(y\mid x)}{\pi_{ref}(y\mid x)} + \beta \log Z(x)

Finally, plugging this back into the maximum likelihood estimator yields a loss that depends only on the policy, since the intractable \log Z(x) terms cancel out of the comparison probabilities.[47]: Appendix A
Usually, DPO is used for modeling human preference in pairwise comparisons, so that the Plackett–Luce model reduces to the Bradley–Terry model. In that case, we have

\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim D}\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w\mid x)}{\pi_{ref}(y_w\mid x)} - \beta \log\frac{\pi_\theta(y_l\mid x)}{\pi_{ref}(y_l\mid x)}\right)\right]
DPO eliminates the need for a separate reward model or reinforcement learning loop, treating alignment as a supervised learning problem over preference data. This is simpler to implement and train than RLHF and has been shown to produce comparable and sometimes superior results.[47] Nevertheless, RLHF has also been shown to beat DPO on some datasets, for example, on benchmarks that attempt to measure truthfulness. Therefore, the choice of method may vary depending on the features of the human preference data and the nature of the task.[48]
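The pairwise DPO loss can be sketched directly from per-response log-probabilities; the log-probability values below are hypothetical:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy raises the preferred response's log-prob relative to the reference
# more than the rejected one's, the loss falls below the chance level -log(0.5) = log(2).
aligned = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
misaligned = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
assert aligned < math.log(2) < misaligned
```

Gradient descent on this loss therefore adjusts the policy directly from preference pairs, with no reward model or RL loop in between.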
Identity preference optimization
Identity preference optimization (IPO)[49] is a modification to the original DPO objective that introduces a regularization term to reduce the chance of overfitting. It remains robust to overtraining by assuming noise in the preference data.
Foremost, IPO first applies a non-linear mapping \Psi over the probability distribution of preferences instead of the Bradley–Terry assumption, to soften the probability of preferences and smooth the labels. Here, \Psi denotes the preference objective separate from the policy objective. This helps avoid the overfitting issue caused by the assumption that pairwise preferences can be substituted for point-wise rewards, which weakens the KL regularization by heavily skewing the preference distribution.
As with DPO, IPO is also formulated as an offline learning objective learned over a human preference dataset D. In particular, IPO introduces a new objective by applying the mapping \Psi over the preference probability distribution. Practically, \Psi is taken as the identity mapping, which results in IPO. Hence, IPO also directly optimizes for the final policy from the preference dataset and bypasses the reward modeling stage by the following objective:

\max_{\pi_\theta}\ \mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot\mid x),\, y'\sim \pi_{ref}(\cdot\mid x)}\left[\Psi\left(p^*(y \succ y' \mid x)\right)\right] - \tau\, D_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)

where p^*(y_w \succ y_l) is the preference distribution of the chosen responses y_w over the rejected responses y_l. However, since p^* is not observed directly, we sample an indicator I(y_w \succ y_l) \sim \mathrm{Bernoulli}\left(p^*(y_w \succ y_l)\right) from the offline preference dataset.
To solve this objective, IPO minimizes the quadratic loss function:

\mathcal{L}_{IPO}(\theta) = \mathbb{E}_{(x, y_w, y_l)\sim D}\left[\left(h_\theta(x, y_w, y_l) - \frac{1}{2\tau}\right)^2\right]

where h_\theta(x, y_w, y_l) = \log\frac{\pi_\theta(y_w\mid x)\, \pi_{ref}(y_l\mid x)}{\pi_\theta(y_l\mid x)\, \pi_{ref}(y_w\mid x)} and I(y_w \succ y_l) is drawn from the Bernoulli distribution given by the preference dataset: it is 1 if y_w is preferred to y_l, which happens with probability p^*(y_w \succ y_l), and 0 otherwise. The simplification of the expression follows from exploiting the symmetry of the Bernoulli samples, namely that for each datapoint I(y_w \succ y_l) + I(y_l \succ y_w) = 1 and h_\theta(x, y_w, y_l) = -h_\theta(x, y_l, y_w).
In summary, IPO can control the gap between the log-likelihood ratios of the policy model and the reference by always regularizing the solution towards the reference model. It allows learning directly from preferences without a reward modeling stage and without the Bradley–Terry modeling assumption that pairwise preferences can be substituted with pointwise rewards.[49] Thus, it avoids overfitting to the preference dataset, especially when preferences are nearly deterministic and the KL regularization term becomes ineffective.
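The quadratic IPO loss can be sketched numerically. Note how it is minimized at a finite target gap rather than rewarding ever-larger log-ratio gaps; the log-probability values below are hypothetical:

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO loss for one pair: (h - 1/(2*tau))^2, where h is the
    policy-vs-reference log-ratio gap between chosen and rejected responses."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2

# The loss is zero when the log-ratio gap equals the target 1/(2*tau) = 5.0,
# and grows again if the gap overshoots. This is what keeps IPO from
# overfitting near-deterministic preferences.
at_target = ipo_loss(logp_w=3.0, logp_l=-2.0, ref_logp_w=0.0, ref_logp_l=0.0)    # h = 5.0
overshoot = ipo_loss(logp_w=10.0, logp_l=-10.0, ref_logp_w=0.0, ref_logp_l=0.0)  # h = 20.0
assert at_target == 0.0
assert overshoot > at_target
```

In contrast, the DPO loss keeps decreasing as the gap grows, which is the overfitting mode IPO's bounded target is designed to avoid.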
Kahneman-Tversky optimization
Kahneman-Tversky optimization (KTO)[50] is another direct alignment algorithm drawing from prospect theory to model uncertainty in human decisions that may not maximize the expected value.
In general, KTO seeks to optimize a class of new loss functions proposed as "human-aware losses" (HALOs), formulated under prospect theory to model the "human value" of a query-response pair (x, y) as v(x, y). A loss function f is a human-aware loss for the value v if it can be written in the general HALO form:

f(\pi_\theta, \pi_{ref}) = \mathbb{E}_{(x, y)\sim D}\left[a_{x,y}\, v\left(r_\theta(x, y) - \mathbb{E}_{Q}\left[r_\theta(x, y')\right]\right)\right] + C_D

where D is the preference data, C_D is some constant relevant to the dataset, and Q is some distribution representing the baseline or "reference". Each training example is attached a label a_{x,y} that is +1 if the example is desirable (we want to push up its reward) and -1 if it is undesirable (in order to push down its reward). Unlike previous definitions of the reward, KTO defines the "implied reward" as the log-likelihood ratio between the policy model \pi_\theta and the reference model \pi_{ref}:

r_\theta(x, y) = \log\frac{\pi_\theta(y\mid x)}{\pi_{ref}(y\mid x)}

Here, the value function v is a non-linear (typically concave) function that mimics human loss aversion and risk aversion. As opposed to previous preference optimization algorithms, the motivation of KTO lies in maximizing the utility of model outputs from a human perspective rather than maximizing the likelihood of a "better" label (chosen vs. rejected responses). Hence, it constructs a more relaxed generalization of preference distributions by requiring only a binary feedback signal instead of explicit preference pairs. For each example in the dataset D, KTO explicitly optimizes the HALO objective as:

\mathcal{L}_{KTO}(\pi_\theta; \pi_{ref}) = \mathbb{E}_{(x, y)\sim D}\left[\lambda_y - v(x, y)\right]

where \lambda_y is a class-specific constant (e.g., \lambda_D for desirable and \lambda_U for undesirable examples) controlling how strongly the model should push up good outputs vs. push down bad ones. The value function is defined piecewise depending on whether y is desirable (\lambda_y = \lambda_D) or undesirable (\lambda_y = \lambda_U):

v(x, y) = \lambda_D\, \sigma\left(\beta\left(r_\theta(x, y) - z_0\right)\right) if y is desirable, and v(x, y) = \lambda_U\, \sigma\left(\beta\left(z_0 - r_\theta(x, y)\right)\right) if y is undesirable,

where z_0 = D_{KL}\left(\pi_\theta(y'\mid x)\,\|\,\pi_{ref}(y'\mid x)\right) is a baseline given by the Kullback–Leibler divergence. Here, \beta controls how "risk-averse" the value function is (a larger \beta means faster saturation of the logistic function \sigma). Intuitively, desirable outputs push the model to increase r_\theta(x, y) so that v becomes more positive; undesirable ones push it in the opposite direction, so the implied reward falls below the reference. Since many real-world feedback pipelines yield "like/dislike" data more easily than pairwise comparisons, KTO is designed to be data-cheap and to reflect "loss aversion" more directly by using a straightforward notion of "good vs. bad" at the example level.
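The piecewise KTO value function can be sketched as follows; the implied rewards, baseline, and \beta are hypothetical:

```python
import math

def kto_value(implied_reward, z_ref, desirable, beta=1.0):
    """KTO-style value of one example: a logistic of the implied reward relative to the
    KL baseline, mirrored for undesirable examples (class weights omitted for clarity)."""
    if desirable:
        return 1.0 / (1.0 + math.exp(-beta * (implied_reward - z_ref)))
    return 1.0 / (1.0 + math.exp(-beta * (z_ref - implied_reward)))

# A desirable output whose implied reward exceeds the baseline earns high value;
# the same implied reward on an undesirable output earns low value.
assert kto_value(2.0, z_ref=0.5, desirable=True) > 0.5
assert kto_value(2.0, z_ref=0.5, desirable=False) < 0.5
```

Because each example only needs a desirable/undesirable bit, this loss can be computed from "like/dislike" feedback without constructing preference pairs.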
References
- ^ Russell, Stuart J.; Norvig, Peter (2016). Artificial intelligence: a modern approach (Third, Global ed.). Boston: Pearson. pp. 830–831. ISBN 978-0-13-604259-4.
- ^ a b Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593 [cs.CL].
- ^ a b c d Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. Retrieved 4 March 2023.
- ^ Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg (2017). "Proximal Policy Optimization Algorithms". arXiv:1707.06347 [cs.LG].
- ^ Tuan, Yi-Lin; Zhang, Jinzhi; Li, Yujia; Lee, Hung-yi (2018). "Proximal Policy Optimization and its Dynamic Version for Sequence Generation". arXiv:1808.07982 [cs.CL].
- ^ a b c d e Amodei, Dario; Christiano, Paul; Ray, Alex (13 June 2017). "Learning from human preferences". openai.com. Retrieved 4 March 2023.
- ^ a b Zheng, Rui; Dou, Shihan; Gao, Songyang; Hua, Yuan; Shen, Wei; Wang, Binghai; Liu, Yan; Jin, Senjie; Liu, Qin; Zhou, Yuhao; Xiong, Limao; Chen, Lu; Xi, Zhiheng; Xu, Nuo; Lai, Wenbin; Zhu, Minghao; Chang, Cheng; Yin, Zhangyue; Weng, Rongxiang; Cheng, Wensen; Huang, Haoran; Sun, Tianxiang; Yan, Hang; Gui, Tao; Zhang, Qi; Qiu, Xipeng; Huang, Xuanjing (2023). "Secrets of RLHF in Large Language Models Part I: PPO". arXiv:2307.04964 [cs.CL].
- ^ Knox, W. Bradley; Stone, Peter; Breazeal, Cynthia (2013). "Training a Robot via Human Feedback: A Case Study". Social Robotics. Lecture Notes in Computer Science. Vol. 8239. Springer International Publishing. pp. 460–470. doi:10.1007/978-3-319-02675-6_46. ISBN 978-3-319-02674-9. Retrieved 26 February 2024.
- ^ Akrour, Riad; Schoenauer, Marc; Sebag, Michèle (2012). "APRIL: Active Preference Learning-Based Reinforcement Learning". Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. Vol. 7524. Springer. pp. 116–131. arXiv:1208.0984. doi:10.1007/978-3-642-33486-3_8. ISBN 978-3-642-33485-6. Retrieved 26 February 2024.
- ^ Wilson, Aaron; Fern, Alan; Tadepalli, Prasad (2012). "A Bayesian Approach for Policy Learning from Trajectory Preference Queries". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc. Retrieved 26 February 2024.
- ^ Schoenauer, Marc; Akrour, Riad; Sebag, Michele; Souplet, Jean-Christophe (18 June 2014). "Programming by Feedback". Proceedings of the 31st International Conference on Machine Learning. PMLR: 1503–1511. Retrieved 26 February 2024.
- ^ Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv:1709.10163. doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
- ^ MacGlashan, James; Ho, Mark K.; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
- ^ a b c d e f g h i j Nisan Stiennon; Long Ouyang; Jeffrey Wu; Daniel Ziegler; Ryan Lowe; Chelsea Voss; Alec Radford; Dario Amodei; Paul F. Christiano (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
- ^ a b c d e f g h i j k l Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Gray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). Training language models to follow instructions with human feedback. Thirty-Sixth Conference on Neural Information Processing Systems: NeurIPS 2022. arXiv:2203.02155.
- ^ Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862 [cs.CL].
- ^ a b Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
- ^ Abhishek, Gupta (5 February 2023). "Getting stakeholder engagement right in responsible AI". VentureBeat. Retrieved 4 March 2023.
- ^ Fernandes, Patrick; Madaan, Aman; Liu, Emmy; Farinhas, António; Pedro Henrique Martins; Bertsch, Amanda; de Souza, José G. C.; Zhou, Shuyan; Wu, Tongshuang; Neubig, Graham; Martins, André F. T. (2023). "Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation". arXiv:2305.00955 [cs.CL].
- ^ a b Xie, Tengyang; Jiang, Nan; Wang, Huan; Xiong, Caiming; Bai, Yu (2021). "Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 27395–27407. arXiv:2106.04895. Retrieved 10 March 2024.
- ^ a b Pacchiano, Aldo; Saha, Aadirupa; Lee, Jonathan (2023-03-03). "Dueling RL: Reinforcement Learning with Trajectory Preferences". Proceedings of the 26th International Conference on Artificial Intelligence and Statistics. PMLR: 6263–6289. arXiv:2111.04850.
- ^ a b Zhu, Banghua; Jordan, Michael; Jiao, Jiantao (2023-07-03). "Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons". Proceedings of the 40th International Conference on Machine Learning. PMLR: 43037–43067. arXiv:2301.11270.
- ^ Li, Zihao; Yang, Zhuoran; Wang, Mengdi (20 June 2023). "Reinforcement learning with Human Feedback: Learning Dynamic Choices via Pessimism". ILHF Workshop ICML 2023. arXiv:2305.18438. Retrieved 10 March 2024.
- ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
- ^ Wiggers, Kyle (24 February 2023). "Can AI really be protected from text-based attacks?". TechCrunch. Retrieved 4 March 2023.
- ^ Heikkilä, Melissa (21 February 2023). "How OpenAI is trying to make ChatGPT safer and less biased". MIT Technology Review. Retrieved 4 March 2023.
- ^ Douglas Heaven, Will (30 November 2022). "ChatGPT is OpenAI's latest fix for GPT-3. It's slick but still spews nonsense". MIT Technology Review. Retrieved 4 March 2023.
- ^ Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". arXiv:2209.14375 [cs.LG].
- ^ Goldman, Sharon (23 September 2022). "Why DeepMind isn't deploying its new AI chatbot — and what it means for responsible AI". VentureBeat. Retrieved 4 March 2023.
- ^ The Sparrow team (22 September 2022). "Building safer dialogue agents". www.deepmind.com. Retrieved 4 March 2023.
- ^ Pichai, Sundar; Hassabis, Demis (6 December 2023). "Introducing Gemini: our largest and most capable AI model". Google. Retrieved 29 February 2024.
- ^ Henshall, Will (18 July 2023). "What to Know About Claude 2, Anthropic's Rival to ChatGPT". TIME. Retrieved 6 March 2024.
- ^ Fan, Ying; Watkins, Olivia; Du, Yuqing; Liu, Hao; Ryu, Moonkyung; Boutilier, Craig; Abbeel, Pieter; Ghavamzadeh, Mohammad; Lee, Kangwook; Lee, Kimin (2 November 2023). "DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models". NeurIPS 2023. arXiv:2305.16381. Retrieved 1 March 2024.
- ^ Xu, Jiazheng; Liu, Xiao; Wu, Yuchen; Tong, Yuxuan; Li, Qinkai; Ding, Ming; Tang, Jie; Dong, Yuxiao (15 December 2023). "ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation". Advances in Neural Information Processing Systems. 36: 15903–15935. arXiv:2304.05977. Retrieved 1 March 2024.
- ^ Lee, Kimin; Liu, Hao; Ryu, Moonkyung; Watkins, Olivia; Du, Yuqing; Boutilier, Craig; Abbeel, Pieter; Ghavamzadeh, Mohammad; Gu, Shixiang Shane (2023). "Aligning Text-to-Image Models using Human Feedback". arXiv:2302.12192 [cs.LG].
- ^ Leike, Jan; Martic, Miljan; Legg, Shane (12 June 2017). "Learning through human feedback". www.deepmind.com. Retrieved 4 March 2023.
- ^ Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. arXiv:1706.03741. Retrieved 4 March 2023.
- ^ a b c Casper, Stephen; Davies, Xander; Shi, Claudia; Gilbert, Thomas Krendl; Scheurer, Jérémy; Rando, Javier; Freedman, Rachel; Korbak, Tomasz; Lindner, David; Freire, Pedro; Wang, Tony Tong; Marks, Samuel; Segerie, Charbel-Raphael; Carroll, Micah; Peng, Andi; Christoffersen, Phillip; Damani, Mehul; Slocum, Stewart; Anwar, Usman; Siththaranjan, Anand; Nadeau, Max; Michaud, Eric J.; Pfau, Jacob; Krasheninnikov, Dmitrii; Chen, Xin; Langosco, Lauro; Hase, Peter; Biyik, Erdem; Dragan, Anca; Krueger, David; Sadigh, Dorsa; Hadfield-Menell, Dylan (18 September 2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback". Transactions on Machine Learning Research. arXiv:2307.15217.
- ^ Christiano, Paul (25 January 2023). "Thoughts on the impact of RLHF research". Retrieved 4 March 2023.
- ^ Belenguer, Lorenzo (2022). "AI bias: exploring discriminatory algorithmic decision-making models and the application of possible machine-centric solutions adapted from the pharmaceutical industry". AI and Ethics. 2 (4). AI Ethics: 771–787. doi:10.1007/s43681-022-00138-8. PMC 8830968. PMID 35194591.
- ^ Zhang, Chiyuan; Bengio, Samy; Hardt, Moritz; Recht, Benjamin; Vinyals, Oriol (4 November 2016). "Understanding deep learning requires rethinking generalization". International Conference on Learning Representations.
- ^ Clark, Jack; Amodei, Dario (21 December 2016). "Faulty reward functions in the wild". OpenAI.
- ^ Lee, Harrison; Phatale, Samrat; Mansoor, Hassan; Lu, Kellie Ren; Mesnard, Thomas; Ferret, Johan; Bishop, Colton; Hall, Ethan; Carbune, Victor; Rastogi, Abhinav (2023-10-13). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback". ICLR.
- ^ Edwards, Benj (2023-05-09). "AI gains "values" with Anthropic's new Constitutional AI chatbot approach". Ars Technica. Retrieved 2024-04-27.
- ^ Rafailov, Rafael; Chittepu, Yaswanth; Park, Ryan; Sikchi, Harshit; Hejna, Joey; Knox, Bradley; Finn, Chelsea; Niekum, Scott (2024). "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms". arXiv:2406.02900 [cs.LG].
- ^ Shi, Zhengyan; Land, Sander; Locatelli, Acyr; Geist, Matthieu; Bartolo, Max (2024). "Understanding Likelihood Over-optimisation in Direct Alignment Algorithms". arXiv:2410.11677 [cs.CL].
- ^ a b Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290 [cs.LG].
- ^ Wang, Zhilin; Dong, Yi; Zeng, Jiaqi; Adams, Virginia; Sreedhar, Makesh Narsimhan; Egert, Daniel; Delalleau, Olivier; Scowcroft, Jane Polak; Kant, Neel; Swope, Aidan; Kuchaiev, Oleksii (2023). "HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM". arXiv:2311.09528 [cs.CL].
- ^ a b Mohammad Gheshlaghi Azar; Rowland, Mark; Piot, Bilal; Guo, Daniel; Calandriello, Daniele; Valko, Michal; Munos, Rémi (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences". arXiv:2310.12036 [cs.AI].
- ^ Ethayarajh, Kawin; Xu, Winnie; Muennighoff, Niklas; Jurafsky, Dan; Kiela, Douwe (2024). "KTO: Model Alignment as Prospect Theoretic Optimization". arXiv:2402.01306 [cs.LG].
Further reading
- "Deep reinforcement learning from human preferences". NeurIPS. 2017.
- "Training language models to follow instructions with human feedback". NeurIPS. 2022.
- "The N Implementation Details of RLHF with PPO". huggingface.co. 2023-10-24.
- "Proximal Policy Optimization — Spinning Up documentation". spinningup.openai.com. Retrieved 2025-01-26.
- "The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization". COLM. 2024.
Historical Development
Early Foundations in RL and Preference Learning
Reinforcement learning (RL) traditionally depends on explicitly defined reward functions to guide agent behavior toward desired outcomes, but specifying rewards that align with complex, human-like goals proves difficult, often resulting in suboptimal policies or unintended behaviors due to reward misspecification. To mitigate this, inverse reinforcement learning (IRL) emerged as a method to reverse-engineer reward functions from observed expert demonstrations, positing that experts act near-optimally under an inferred reward. Ng and Russell (2000) established foundational IRL algorithms for Markov decision processes, framing the problem as maximizing the likelihood of expert trajectories while ensuring the inferred reward differentiates optimal from alternative policies, thus avoiding degenerate solutions where any behavior could be deemed optimal.[4] Preference-based reinforcement learning (PbRL) built upon IRL by leveraging pairwise human comparisons—such as ranking one trajectory or action as preferable to another—which require less expertise and effort than generating full demonstrations or scalar rewards, while mitigating issues like arbitrary reward scaling or shaping. In PbRL, preferences inform reward inference without assuming full expert optimality, often using statistical models to aggregate comparisons into a coherent reward signal. Early frameworks formalized PbRL as an integration of ordinal preference learning with RL, enabling policy optimization through methods like preference-augmented value iteration, as surveyed in foundational reviews of the approach.[5] The 2017 work by Christiano et al. marked a key milestone in scaling PbRL to deep RL settings, demonstrating that humans could provide preferences on brief video clips of agent behaviors in environments like Atari games (e.g., Enduro, Breakout) and continuous control tasks (e.g., cartpole balancing). 
They trained a neural reward model via supervised learning on preference pairs, employing the Bradley-Terry model to estimate the probability that one behavior clip τ_1 is preferred over another τ_2 as P(τ_1 ≻ τ_2) = σ(r(τ_1) − r(τ_2)), where σ is the logistic function and r is the learned scalar reward; this model was then used to fine-tune policies with deep RL algorithms such as A2C and TRPO, achieving performance comparable to or exceeding hand-crafted rewards on tasks where humans struggled to articulate precise objectives, such as avoiding falls without explicit penalties. This approach highlighted PbRL's potential for eliciting subtle human values, setting the stage for its application in aligning advanced AI systems.[6]
Key Publications and Milestones (2019–2022)
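The Bradley-Terry pairwise probability is the core primitive of these learned reward models. As a minimal sketch of the formula only, not any paper's implementation, it can be written in a few lines of Python:

```python
import math

def preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the output with reward r_w is
    preferred over the one with reward r_l: sigma(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def pairwise_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of the observed preference, the quantity
    minimized when fitting a reward model on comparison data."""
    return -math.log(preference_prob(r_w, r_l))
```

Equal rewards give probability 0.5, and widening the reward gap drives the loss toward zero, which is what pushes the learned reward to separate preferred from dispreferred outputs.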
In 2019, OpenAI published "Fine-Tuning Language Models from Human Preferences," which applied reinforcement learning from human feedback to language generation tasks such as text continuation and summarization.[7] The approach involved collecting human preferences over model outputs, training a reward model on those rankings, and using proximal policy optimization (PPO) to fine-tune a GPT-2-based policy, achieving up to 10% relative improvements in human-rated quality over supervised fine-tuning baselines on held-out prompts.[7] This work extended prior RLHF methods from low-dimensional control environments to high-dimensional language modeling, demonstrating that human feedback could guide models toward more desirable outputs without explicit reward engineering, though it highlighted challenges like reward model overfitting on small datasets.[7] Building on this, OpenAI's 2020 paper "Learning to Summarize from Human Feedback" represented a practical milestone in scaling RLHF for abstractive summarization.[8] Researchers fine-tuned a 1.3 billion parameter GPT-2 model using 15,000 human preference comparisons on summaries of online news articles, training a scalar reward model that predicted pairwise winner preferences with 59% accuracy.[8] Subsequent PPO optimization produced summaries that humans preferred over supervised fine-tuning outputs by 10-20% in blind pairwise comparisons, while maintaining factual consistency comparable to baselines; the method relied on 60,000 iterations of PPO with KL divergence penalties to prevent mode collapse.[8] This demonstrated RLHF's ability to elicit more helpful and concise language without dense rewards, though it required careful data collection to avoid biases in human labelers' preferences for verbosity.[8] By early 2022, OpenAI advanced RLHF to general instruction-following with the "Training Language Models to Follow Instructions with Human Feedback" paper, introducing InstructGPT.[1] The pipeline combined supervised 
fine-tuning on 13,000 prompt-response pairs with RLHF on preferences from over 30,000 comparisons across diverse tasks, yielding a 1.3 billion parameter model that outperformed the 175 billion parameter GPT-3 by 4-10% in human evaluations for helpfulness, truthfulness, and harmlessness.[1] Key innovations included a reward model ensemble to reduce variance and iterative data collection via the fine-tuned policy itself, enabling scaling; however, the work noted persistent issues like sycophancy and over-optimization toward rater biases.[1] This publication, accompanied by a January 2022 OpenAI announcement, marked RLHF's transition to aligning frontier-scale language models with broad user intent, setting the stage for subsequent deployments.[9][1]
Post-ChatGPT Evolution and Commercial Scaling (2023–2025)
Following the release of ChatGPT in November 2022, reinforcement learning from human feedback (RLHF) became a cornerstone for aligning subsequent large language models with human preferences in commercial products. OpenAI's GPT-4, announced on March 14, 2023, integrated RLHF during fine-tuning to generate more helpful, honest, and harmless responses, building on techniques from InstructGPT by incorporating human-ranked preferences into reward modeling and proximal policy optimization.[10] Anthropic's Claude 1, launched in March 2023, advanced RLHF through Constitutional AI, a method that supplements human feedback with AI-generated self-critiques and revisions guided by a predefined set of ethical principles to minimize harmful outputs without relying solely on extensive human labeling.[11] This hybrid approach reduced dependence on human annotators while maintaining alignment efficacy, as evidenced by Claude's improved harmlessness scores in internal evaluations.[12] Major AI firms scaled RLHF commercially by assembling large annotation workforces and investing heavily in data pipelines, though human feedback costs posed significant barriers. 
Google applied RLHF to its Gemini models, released on December 6, 2023, to refine outputs for compliance with safety and utility preferences, leveraging cloud-based reward modeling and policy optimization workflows.[13] xAI's Grok-1, introduced on November 4, 2023, employed a tailored RLHF variant where human reviewers evaluated responses primarily for truthfulness and reduced sycophancy, diverging from standard helpfulness-focused metrics used by competitors.[14] Scaling efforts demanded substantial resources; instruction-tuning via RLHF typically incurs $6–10 million in data acquisition costs and requires teams of 5–20 engineers to manage preference datasets comprising millions of comparisons.[15] These investments enabled deployment in products serving billions of interactions, but annotation bottlenecks—exacerbated by the need for domain expertise and consistency—limited throughput for trillion-parameter models. To address scalability constraints, the field evolved toward alternatives like reinforcement learning from AI feedback (RLAIF), which substitutes LLMs for human labelers in generating preferences. A 2023 study demonstrated RLAIF achieving comparable alignment to RLHF on benchmarks such as helpfulness and harmlessness, while reducing costs by automating preference synthesis and enabling iterative self-improvement loops.[16] By 2024–2025, refinements in reward modeling, including dynamic weighting and physics-informed variants for specialized domains, enhanced training stability and data efficiency, allowing commercial entities to extend RLHF-like techniques to multimodal and reasoning-focused models despite ongoing issues like reward hacking and bias propagation from imperfect feedback sources.[17] These developments facilitated broader adoption, though empirical evidence indicates RLAIF's effectiveness varies by task complexity, with human oversight remaining essential for high-stakes reliability.[18]
Theoretical Foundations
Core Principles of Reinforcement Learning
Reinforcement learning (RL) is a paradigm in machine learning where an agent learns to make sequential decisions by interacting with an environment, aiming to maximize the expected cumulative reward over time.[19] The agent's behavior is shaped through trial and error, receiving feedback in the form of rewards or penalties for actions taken in specific states, without requiring labeled data for every possible outcome.[19] This approach contrasts with supervised learning by emphasizing long-term consequences rather than immediate correctness, enabling adaptation to dynamic, partially observable settings.[20] The foundational mathematical framework for RL is the Markov Decision Process (MDP), formalized as a tuple (S, A, P, R, γ), where S denotes the state space, A the action space, P(s′ | s, a) the probability of transitioning to next state s′ given state s and action a, R the reward distribution, and γ ∈ [0, 1) the discount factor prioritizing immediate over delayed rewards.[19] The Markov property underpins this model, stipulating that the probability distribution over future states and rewards depends solely on the current state and action, not prior history, which simplifies computation while assuming sufficient state representation captures all relevant information.[21] In practice, MDPs model problems like game playing or robotics, where the agent observes state s_t, selects action a_t, receives reward r_t, and transitions to s_{t+1}.[22] Central to RL is the policy π(a | s), which defines the agent's decision-making strategy as the probability of selecting action a in state s, potentially stochastic to balance exploration and exploitation.[19] The value function V^π(s) quantifies the expected return—discounted sum of future rewards—starting from state s and following policy π, given by V^π(s) = E_π[Σ_{t=0}^∞ γ^t r_t | s_0 = s].[20] Similarly, the action-value function Q^π(s, a) evaluates the expected return from taking action a in s and then adhering to π, Q^π(s, a) = E_π[Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a], aiding in policy improvement by selecting high-Q actions.[20] Optimal policies maximize these functions, often derived via dynamic programming or learning algorithms.[19] The Bellman equation provides the recursive foundation for value functions, expressing V^π(s) as the expected immediate reward plus discounted value of the successor state: V^π(s) = Σ_a π(a | s) Σ_{s′} P(s′ | s, a) [R(s, a) + γ V^π(s′)].[19] For action-values, Q^π(s, a) = Σ_{s′} P(s′ | s, a) [R(s, a) + γ Σ_{a′} π(a′ | s′) Q^π(s′, a′)], enabling iterative updates in methods like value iteration or Q-learning.[19] Optimality follows from the Bellman optimality equation, where the optimal value satisfies V*(s) = max_a Σ_{s′} P(s′ | s, a) [R(s, a) + γ V*(s′)], converging under contraction mapping properties for finite MDPs.[19] These principles underpin model-free algorithms, which estimate values directly from samples without explicit transition models, as in policy gradient or temporal-difference methods.[19]
Rationale for Incorporating Human Feedback
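The Bellman optimality backup can be made concrete with value iteration on a toy two-state MDP. The states, actions, and rewards below are invented purely for illustration:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup
    V(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
    until the largest update falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy MDP: staying in s1 yields reward 1, everything else yields 0.
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s1", "stay", "s1"): 1.0}
V = value_iteration(["s0", "s1"], ["stay", "go"], P, R)
```

With γ = 0.9 the fixed point is V(s1) = 1/(1 − 0.9) = 10 and V(s0) = 0 + 0.9 · 10 = 9, consistent with the contraction-mapping convergence argument above.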
Reinforcement learning traditionally relies on predefined reward functions to signal desirable actions, but these functions prove inadequate for tasks involving nuanced, context-dependent outcomes, such as generating coherent and helpful natural language responses. In such scenarios, hand-engineering rewards fails to encapsulate the subtleties of human intent, leading to misaligned policies that optimize superficial metrics rather than substantive quality.[2] Human feedback circumvents this limitation by leveraging direct comparative judgments—e.g., ranking two model outputs for a given prompt—to infer a latent reward structure that reflects evaluator preferences, thereby enabling the training of a surrogate reward model without exhaustive specification.[1] This integration proves particularly valuable for aligning large language models (LLMs), where pretraining on vast internet corpora yields capabilities marred by tendencies toward unhelpful, verbose, incoherent, or toxic outputs that regurgitate patterns from the training data. Supervised fine-tuning (SFT) on curated instruction-response pairs improves imitation but confines the model to the training distribution, limiting generalization to novel queries. RLHF, by contrast, employs human preferences to guide policy optimization via reinforcement learning algorithms like proximal policy optimization (PPO), suppressing these undesirable tendencies to produce more coherent, helpful, and aligned responses that exceed SFT baselines in human-rated usefulness and harmlessness, as demonstrated in empirical evaluations where RLHF-tuned models outperformed larger SFT counterparts in blind tests.[1][2] Moreover, human feedback facilitates alignment with complex values—such as truthfulness and conciseness—that evade formalization, addressing the reward hacking risks inherent in sparse or proxy rewards.
By iteratively refining the policy against a learned reward model derived from thousands of human annotations (e.g., 30,000-50,000 preference pairs in early implementations), RLHF enhances sample efficiency and robustness, though it introduces dependencies on annotator reliability and potential biases in feedback aggregation.[1] This method's efficacy stems from its ability to distill subjective human oversight into scalable signals, bridging the gap between autonomous optimization and intentional human desiderata in opaque reward landscapes.[2]
Comparison to Supervised Fine-Tuning
Supervised fine-tuning (SFT) trains language models by maximizing the likelihood of generating responses matching a curated dataset of prompt-response pairs, effectively imitating high-quality demonstrations to adapt pretrained models for instruction-following.[1] In contrast, reinforcement learning from human feedback (RLHF) builds upon an initial SFT phase but incorporates a reward model trained on human pairwise preferences—where annotators rank multiple model-generated responses to the same prompt—to define a scalar reward signal for desired behaviors like helpfulness and harmlessness.[1] This reward model, often parameterized via a Bradley-Terry ranking loss, enables subsequent policy optimization using algorithms like proximal policy optimization (PPO), which maximizes expected reward while constraining deviation from the SFT policy via KL divergence to prevent collapse.[1] The core distinction lies in optimization objectives: SFT directly regresses to fixed demonstrations, risking overfitting to the training distribution and limitations in handling nuanced preferences not explicitly demonstrated, such as avoiding subtle harms or adapting to novel instructions.[1] RLHF, by learning a preference-based reward, facilitates generalization beyond imitation, as the policy can explore and reinforce outputs aligning with inferred human values rather than rote replication.[1] For instance, RLHF reduces issues like excessive repetition or sycophancy observed in SFT models, as the reward signal penalizes undesirable traits across varied outputs. Empirically, RLHF demonstrates superior performance in human evaluations. 
In OpenAI's InstructGPT experiments released in January 2022, a 1.3 billion-parameter model fine-tuned with RLHF achieved higher win rates against a 175 billion-parameter SFT baseline (e.g., GPT-3), particularly on out-of-distribution prompts, with preference satisfaction improving by up to 10-20% in categories like correctness and low toxicity.[1] Similarly, Anthropic's 2022 application of RLHF to a 52 billion-parameter model yielded a 15-25% relative gain in helpfulness and harmlessness ratings over SFT equivalents, as measured by crowd-sourced comparisons. These gains stem from RLHF's ability to iteratively refine policies using dense reward feedback, though it demands 2-5 times more annotation effort for preference pairs compared to SFT's response labeling.[1] Despite these advantages, RLHF introduces complexities absent in SFT, including reward model misgeneralization—where the proxy reward fails to capture true preferences—and higher computational costs from RL training loops, often requiring 10-100x more GPU hours.[1] SFT remains preferable for resource-constrained settings or when abundant high-quality demonstrations suffice, as recent analyses indicate that carefully curated SFT data can narrow the gap with RLHF in narrow domains, though RLHF consistently excels in broad alignment tasks.
Methodology
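The KL-divergence constraint that keeps the optimized policy near its SFT reference is commonly implemented as a penalty subtracted from the reward-model score before RL. The sketch below is illustrative only; β and the log-probability inputs are assumed, not taken from any specific system:

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward used during RL: reward-model score minus a KL penalty,
    r = r_RM - beta * sum_t (log pi(y_t|x) - log pi_ref(y_t|x)),
    discouraging the policy from drifting far from the reference (SFT) model."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate
```

When the policy matches the reference, the penalty vanishes; as the policy's token log-probabilities drift upward relative to the reference, the effective reward shrinks, which is what prevents the collapse mentioned above.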
Gathering and Structuring Human Feedback Data
In reinforcement learning from human feedback (RLHF), the initial gathering of feedback data begins with curating prompts, often sourced from existing instruction-tuning datasets or generated synthetically to cover diverse tasks such as question-answering, summarization, and creative writing.[23] Human annotators, typically professional contractors trained with detailed guidelines, then provide demonstrations by writing high-quality responses to these prompts, forming a supervised fine-tuning (SFT) dataset of prompt-response pairs.[1] For the preference data essential to RLHF, annotators evaluate multiple model-generated completions per prompt—usually 2 to 9 outputs from an SFT-trained model—and rank them by quality, helpfulness, and harmlessness.[1] This process yielded, for example, rankings on approximately 31,000 prompts in the InstructGPT pipeline, with each prompt receiving multiple annotations to improve reliability.[1] Pairwise comparisons dominate as the primary feedback format, where annotators select the superior response between two options, facilitating reward model training under the Bradley-Terry preference model, which estimates pairwise win probabilities.[2] Alternative formats include scalar ratings (e.g., on a 1-5 scale for overall quality) or full ordinal rankings, though pairwise methods reduce cognitive load and enhance consistency, with inter-annotator agreement rates around 60-70% in controlled studies.[2] Annotation platforms enforce structured interfaces, such as side-by-side response displays with criteria checklists, to minimize bias; OpenAI's contractors, for instance, underwent iterative guideline refinement based on pilot annotations to align judgments with desired model behaviors.[1] Structuring the collected data involves filtering for quality—discarding low-agreement or off-topic annotations—and formatting into tuples (prompt x, winning response y_w, losing response y_l) for preference modeling.[23] Comprehensive pipelines
incorporate pre-annotation steps, such as response generation via sampling from base or SFT models, followed by automated filtering (e.g., using perplexity scores or heuristics to remove incoherent outputs) before human review, which can reduce annotation volume by 20-50% while preserving preference signal.[23] Datasets are balanced across prompt types and augmented with metadata like annotator ID for downstream analysis of variance, ensuring the reward model's robustness to human judgment inconsistencies.[2] In practice, this structured data totals tens to hundreds of thousands of preferences per iteration, with costs scaling to thousands of labor hours due to the need for expert-level annotations over crowdsourced alternatives.[1]
Training the Reward Model
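Once the responses for a prompt have been ranked, the ranking is typically expanded into all pairwise (prompt, winner, loser) tuples that the reward model trains on. A minimal sketch, with a hypothetical helper name:

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked):
    """Expand a ranking (best response first) into the C(K,2)
    (prompt, winning response, losing response) training tuples."""
    return [(prompt, winner, loser) for winner, loser in combinations(ranked, 2)]
```

A ranking of K = 4 responses yields 6 comparisons, which is why full per-prompt rankings are a cheaper source of preference pairs than labeling each comparison independently.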
The reward model in reinforcement learning from human feedback (RLHF) is trained to predict scalar rewards for prompt-response pairs, serving as a surrogate for human preferences during subsequent policy optimization. Training data consists of prompts paired with multiple model-generated responses, where humans provide rankings or pairwise comparisons indicating which responses are preferred. In the foundational InstructGPT implementation, approximately 33,000 prompts were curated from API user queries and labeler demonstrations, filtered to remove personally identifiable information and deduplicated across organizations; for each prompt, 4 to 9 responses were sampled from a supervised fine-tuned (SFT) language model, and labelers ranked them to yield up to \binom{K}{2} pairwise preferences per prompt, with K denoting the number of responses.[1] The reward model architecture is typically derived from the SFT checkpoint of a transformer-based language model, with the final unembedding layer replaced by a linear projection to a single scalar output r_θ(x, y) for a prompt x and response y. This setup leverages the model's understanding of language while adapting it to preference prediction; for stability, smaller variants like a 6-billion-parameter model were used instead of larger ones, which proved unstable during training. 
The objective follows the Bradley-Terry model, framing preferences as probabilistic outcomes where the probability that y_w is preferred to y_l given x is σ(r_θ(x, y_w) - r_θ(x, y_l)), with σ as the logistic sigmoid function; the loss is the average negative log-likelihood over comparisons: -1/\binom{K}{2} E[log σ(r_θ(x, y_w) - r_θ(x, y_l))], treating preferences as ground-truth labels.[1] Training hyperparameters emphasize efficiency and generalization: a single epoch over the full dataset prevents overfitting to noisy human judgments, with batches comprising all comparisons from 64 prompts (up to 2,304 pairs per batch) processed as single elements to preserve prompt-level context. A cosine learning rate schedule starts at 9×10^{-6}, decaying to 10% of the initial value; rewards are normalized post-training such that SFT demonstrations receive a mean reward of zero, aiding stability in downstream reinforcement learning. These practices, while sensitive to epoch count and learning rate (robust to ±50% variations), have been widely adopted, though simpler pairwise setups (K=2) reduce annotation costs at the potential expense of richer preference signals from full rankings.[1]
Policy Optimization via Proximal Policy Optimization and Variants
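The clipped surrogate at the heart of PPO can be written per sample as min(ρ·A, clip(ρ, 1−ε, 1+ε)·A), where ρ is the new-to-old probability ratio and A the advantage estimate. A minimal sketch using the commonly cited default ε = 0.2:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate. Taking the min makes the objective a
    pessimistic bound, removing any incentive to push the probability
    ratio outside [1 - eps, 1 + eps] in a single update."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage the payoff saturates once the ratio exceeds 1 + ε, so one gradient step cannot move the policy arbitrarily far from the policy that collected the data, which is the stability property the surrounding text describes.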
Proximal Policy Optimization (PPO) serves as the primary algorithm for the reinforcement learning phase in RLHF, fine-tuning the policy—typically a large language model—to maximize expected rewards from the reward model while ensuring stable updates in high-dimensional action spaces like token generation.[1] Introduced by Schulman et al. in 2017, PPO builds on policy gradient methods by using a clipped surrogate objective that constrains the probability ratio between new and old policies within a trust region, approximated via importance sampling to avoid destructive large steps that could destabilize training.[24] This approach enhances sample efficiency compared to methods like REINFORCE, as it reuses data from on-policy rollouts across multiple epochs without requiring second-order optimizations like those in Trust Region Policy Optimization (TRPO).[24] In RLHF applications, PPO is adapted for sequential decision-making where states consist of prompts, actions are sampled tokens, and episodic rewards are derived from the reward model's scalar outputs on full responses, often augmented with intermediate token-level rewards via value function approximations.[1] The actor-critic setup involves the policy network generating trajectories, a value network estimating future rewards, and generalized advantage estimation for low-variance gradient signals; training proceeds in iterations of data collection, surrogate loss minimization with clipping (typically ε=0.2), and value loss with optional entropy regularization to encourage exploration.[24] OpenAI's InstructGPT implementation, for instance, applied PPO to 1.3 billion and 175 billion parameter models, achieving alignment gains over supervised fine-tuning by optimizing for human-preferred outputs while using a reference model for KL-divergence constraints, demonstrating a high performance ceiling especially in complex tasks like dialogue and reasoning.[1][25] Variants of PPO address specific challenges in RLHF, such 
as mode collapse or excessive deviation from pre-trained behaviors. A common adaptation incorporates a Kullback-Leibler (KL) divergence penalty between the updated policy and a reference policy (e.g., the supervised fine-tuned model), added to the clipped objective as −β · KL(π_θ || π_ref), where β is scheduled or fixed to balance reward maximization and conservatism; this mitigates the reward hacking observed in unconstrained RL.[1] Another variant, PPO with adaptive KL control, dynamically adjusts the penalty coefficient to target a specific KL divergence threshold per batch, improving stability in long-horizon tasks like dialogue generation.[26] PPO-max, an enhanced version, modifies the clipping to prioritize high-reward updates more aggressively while retaining proximal constraints, demonstrating faster convergence in some LLM alignment experiments.[26] Group Relative Policy Optimization (GRPO), introduced in 2024, is an efficient variant that eliminates the need for a separate critic model while maintaining performance in RLHF.[27] Widely used open-source frameworks for implementing RLHF components such as reward modeling and PPO include Hugging Face TRL, with its RewardTrainer and PPOTrainer; OpenRLHF, for scalable high-performance training with PPO and variants like DAPO; and Axolotl, which integrates TRL for fine-tuning.[28][29][30] These modifications preserve PPO's computational tractability (requiring only first-order gradients and parallelizable rollouts), making it suitable for scaling to billion-parameter models despite high GPU demands; in InstructGPT, the compute spent on RLHF amounted to a small fraction of that used for pretraining.[1] Despite its prevalence, PPO's on-policy nature limits data efficiency, prompting ongoing research into off-policy extensions, though it remains the benchmark for RLHF policy optimization, as in 2023-era implementations behind models like ChatGPT.[25]
Integration with Pretraining and Fine-Tuning
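The KL penalty described above, which keeps the tuned policy near its supervised initialization, can be sketched as a shaped per-sequence reward. This is a minimal illustrative sketch: the function name, log-probability values, and β are invented, and production systems typically apply the penalty per token rather than per sequence.

```python
def kl_shaped_reward(reward_model_score: float,
                     logprob_policy: float,
                     logprob_reference: float,
                     beta: float = 0.02) -> float:
    """Reward passed to the RL optimizer with a KL penalty toward the
    reference (SFT) policy:
    r(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    kl_estimate = logprob_policy - logprob_reference
    return reward_model_score - beta * kl_estimate

# A response the tuned policy makes far more likely than the SFT reference
# is penalized, discouraging drift from pre-trained behavior and reward hacking.
reward_no_drift = kl_shaped_reward(1.0, -12.0, -12.0)
reward_drifted = kl_shaped_reward(1.0, -2.0, -12.0, beta=0.1)
```

With a fixed β this is the simplest form; the adaptive-KL variant adjusts β toward a target divergence per batch.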
Reinforcement learning from human feedback (RLHF) is typically integrated into the training pipeline of large language models (LLMs) following large-scale pretraining and supervised fine-tuning (SFT), forming a sequential progression that leverages each stage's strengths to progressively align models with human intent. Pretraining on vast unlabeled text corpora equips the base model with broad linguistic knowledge and predictive capabilities through next-token prediction, as demonstrated in models like GPT-3, whose training corpus included approximately 570 GB of filtered Common Crawl data.[1] SFT then refines this base by training on curated datasets of instruction-response pairs (such as the 13,000 prompts used in InstructGPT), enabling the model to generate coherent responses to specific tasks and serving as an initialization point for subsequent RLHF to mitigate the instability of direct policy optimization from the raw pretrained model.[1] This staged approach ensures RLHF operates on a policy already attuned to instruction-following, reducing the risk of catastrophic forgetting or divergence during reinforcement learning.[31] In the RLHF phase, the SFT-initialized policy generates response candidates for prompts, which are ranked by human annotators to train a reward model (RM) that approximates preferences, often using Bradley-Terry modeling of pairwise comparisons between sampled outputs.[1] Policy optimization, commonly via proximal policy optimization (PPO), then updates the model to maximize expected rewards while constraining divergence from the SFT policy through KL-regularized objectives, preserving pretraining-derived capabilities like factual recall and fluency; for instance, InstructGPT-1.3B achieved a 6.2% improvement in human preference win rates over SFT baselines on held-out tasks while maintaining length-controlled performance.[1] This integration allows RLHF to refine subtle aspects of helpfulness and harmlessness that SFT overlooks, as pure supervised 
methods optimize for exact matches rather than ordinal preferences, though empirical results show RLHF's gains diminish without strong SFT priors, with direct RL on pretrained models yielding unstable training due to high-variance reward signals. Variations in integration have emerged, such as iterative RLHF loops where post-RLHF models undergo additional SFT on generated data to consolidate gains, as explored in subsequent OpenAI scaling efforts leading to GPT-4, or hybrid approaches combining RLHF with direct preference optimization (DPO) to bypass explicit RM training while still referencing SFT distributions.[1] However, the canonical pipeline (pretraining, SFT, then RLHF) remains dominant, as evidenced by its adoption in models like Anthropic's Claude series, where SFT on constitutional AI principles precedes preference-based RL to enforce value alignment without relying solely on post-hoc corrections. Empirical evaluations, including blind pairwise comparisons, confirm that RLHF-augmented models outperform SFT-only counterparts by 10-20% in downstream instruction adherence metrics, underscoring the necessity of this integration for scalable alignment beyond mere imitation learning.[31][1]
Applications and Empirical Outcomes
Primary Use in Aligning Large Language Models
Reinforcement learning from human feedback (RLHF) serves as the primary technique for aligning large language models (LLMs) with human preferences, shifting outputs from mere prediction of next tokens in vast corpora toward generating helpful, honest, and harmless responses.[9] This alignment addresses the limitations of pretraining and supervised fine-tuning, where models often produce verbose, unhelpful, or unsafe content despite high factual accuracy.[1] In practice, RLHF integrates human judgments to train a reward model that scores model outputs, followed by reinforcement learning to optimize the policy for higher rewards while constraining deviation from the supervised baseline.[32] OpenAI pioneered this application in developing InstructGPT, released on January 27, 2022, which fine-tuned GPT-3 variants using RLHF on datasets of human-ranked prompt completions.[9] Human labelers ranked outputs for helpfulness, leading to a reward model that guided proximal policy optimization (PPO), resulting in models that better followed instructions and reduced issues like sycophancy or fabrication.[1] This approach scaled to ChatGPT, launched November 30, 2022, based on the GPT-3.5 architecture with extensive RLHF, enabling conversational coherence and preference alignment across diverse queries.[33] Subsequent models, including iterations of GPT-4, have relied on RLHF variants to enhance safety and utility, with human feedback collected from thousands of labelers via platforms like Scale AI.[31] Empirically, RLHF-aligned models demonstrate superior performance in blind human evaluations; for instance, the 1.3 billion parameter InstructGPT model outperformed the 175 billion parameter GPT-3 base model in preference rankings for instruction-following tasks.[1] This inversion—smaller aligned models surpassing larger unaligned ones—highlights RLHF's efficiency in leveraging human oversight to prioritize qualitative human values over raw scale.[9] While effective for 
deployment in chat interfaces and assistants, RLHF's reliance on aggregated preferences introduces variability, as labeler demographics influence reward signals, yet it remains the dominant method for commercial LLM alignment as of 2025.[34]
Extensions to Other AI Domains
RLHF principles have been adapted to robotics, where human feedback guides agents in learning complex manipulation or navigation tasks amid sparse or ill-defined rewards. In a 2023 framework termed SEED, RLHF is integrated with primitive skill discovery to enable robots to refine behaviors based on pairwise human comparisons of trajectories, demonstrating improved performance on simulated manipulation benchmarks compared to pure RL baselines.[35] Subsequent work in 2025 introduced reinforcement learning from implicit human feedback (RLIHF) using non-invasive electroencephalography (EEG) signals to align robotic policies with subtle human intent, achieving up to 20% higher success rates in real-world object manipulation tasks without explicit verbal input.[36] These extensions highlight RLHF's utility in bridging the sim-to-real gap, though they require careful calibration to mitigate human fatigue in feedback provision.[37] In computer vision, particularly text-to-image generation, RLHF aligns diffusion models by training reward models on human preferences for output quality, such as aesthetic appeal or prompt fidelity. 
A 2023 study collected a dataset of 18,000 images with rich human annotations (RichHF-18K) to train multimodal transformers that predict feedback scores, enabling policy optimization that reduced misalignment artifacts like anatomical errors in generated humans by 15-25% on evaluation sets.[38] RLHF has also been applied to human pose estimation and image classification tasks through human-in-the-loop annotation, where feedback refines RL agents for accurate labeling of poses and related classifications, improving precision in keypoint detection and semantic understanding.[39][40] The same approach has been used with Stable Diffusion variants, where KL-regularized RLHF prevents mode collapse while incorporating judgments on realism and mood, outperforming supervised fine-tuning in human-rated preference metrics.[41] Extensions to multi-modal AI, combining vision and language, leverage RLHF to align models with holistic human preferences across modalities. The LLaVA-RLHF framework, released in 2023, applies RLHF to large vision-language models, using human-ranked response pairs to optimize for tasks like visual question answering, resulting in a 5-10% uplift in alignment scores over instruction-tuned baselines on benchmarks such as VQA-v2.[42] Factually augmented RLHF, proposed in 2023, enhances this by injecting image captions and verified facts into reward modeling, reducing hallucinations in multi-modal outputs by up to 30% while preserving generative diversity, as validated on datasets like ScienceQA.[43] These adaptations underscore RLHF's versatility but emphasize the need for scalable feedback mechanisms to handle high-dimensional inputs.[44]
Quantifiable Achievements in Model Performance
In the seminal work on InstructGPT, published in March 2022, reinforcement learning from human feedback (RLHF) enabled a 1.3 billion parameter model to outperform the 175 billion parameter GPT-3 baseline in human preference evaluations, achieving a win rate of approximately 60% across diverse prompts.[1] Similarly, the 175 billion parameter InstructGPT variant was preferred over the same-sized GPT-3 in 85 ± 3% of pairwise comparisons, and in 71 ± 4% against few-shot prompted GPT-3, demonstrating RLHF's capacity to enhance instruction-following without relying solely on scale.[1] These gains stemmed from RLHF's iterative optimization using a reward model trained on human rankings, which prioritized helpful, honest, and harmless responses over supervised fine-tuning (SFT) alone.[1] RLHF also yielded measurable improvements in safety and reliability metrics. On the TruthfulQA benchmark, InstructGPT models exhibited roughly twice the truthfulness of GPT-3, with the 175 billion parameter RLHF variant scoring 81.5% on true and informative responses when prompted with instructions.[1] Hallucination rates dropped from 41% in GPT-3 to 21% in InstructGPT, while toxicity generation, as measured by RealToxicityPrompts, decreased by about 25% under respectful prompting conditions (e.g., expected toxicity score of 0.179 versus 0.228 for GPT-3).[1] In direct comparisons against SFT baselines, RLHF via proximal policy optimization (PPO) achieved higher win rates (ranging from 50% to 70% depending on hyperparameters and model size) in blind human evaluations for overall response quality.[1]
| Metric | GPT-3 (175B) | InstructGPT (RLHF, 1.3B-175B) | Improvement |
|---|---|---|---|
| Human Preference Win Rate vs. GPT-3 | Baseline | 60-85% | Preferred in 60-85% of pairwise comparisons |
| TruthfulQA (True + Informative) | ~40-50% | Up to 81.5% (175B instructed) | ~2x |
| Hallucination Rate | 41% | 21% | -49% relative |
| Toxicity (RealToxicityPrompts, respectful prompt) | 0.228 | 0.179 (175B) | -0.049 absolute (~21% relative) |
