Wondering what the difference is between RLHF and plain fine-tuning? Fine-tuning is actually the first step of the RLHF process. Continue reading to see how the two fit together.
Reinforcement learning from human feedback (RLHF) has proven to be an effective method for aligning foundation models with human preferences. This fine-tuning technique has played a crucial role in recent advancements in AI, exemplified by the success of OpenAI’s ChatGPT model and Anthropic’s Claude model.
The implementation of RLHF brings about subtle yet significant improvements in the usability and performance of models. These enhancements can include refining the tone, mitigating biases and toxic elements, and enabling domain-specific content generation. This article will delve into the application of RLHF in fine-tuning large language models (LLMs).
Understanding Reinforcement Learning from Human Feedback
RLHF originated from a fundamental challenge in reinforcement learning: the goals of many RL tasks are complex, ambiguous, and difficult to define. This predicament leads to a misalignment between our values and the objectives of RL systems, as emphasized in the paper “Deep reinforcement learning from human preferences.”
Numerous AI applications, particularly in businesses, encounter goals that are challenging to specify. For instance, in content moderation, the nuanced context of moderation policies might conflict with the enforcement decisions made by algorithms. Similarly, content generation, such as automated support agents, faces hurdles in achieving optimal quality. Although generative AI enables cost-effective content creation, concerns about maintaining brand style and tone have impeded widespread adoption. How can a team establish a reward function that consistently adheres to brand guidelines? In situations where the risks associated with AI-generated content are significant, opting for a deterministic chatbot or human support agent might be a justifiable investment.
In traditional reinforcement learning, a clear reward function provides guidance to algorithms. However, for more intricate tasks, determining an appropriate reward function can be challenging. In such cases, human preferences can effectively guide AI systems towards making the right decisions. This is because people, even those without expertise, possess an intuitive understanding of navigating nuanced and contextual tasks. For example, given a sample of brand marketing copy, individuals can easily assess how well AI-generated copy aligns with the brand’s intended tone. The main challenge, however, lies in the time and cost required to incorporate human preferences directly into the reinforcement learning training process. As stated in the paper “Deep reinforcement learning from human preferences,” “Using human feedback directly as a reward function is prohibitively expensive for RL systems that require hundreds or thousands of hours of experience.”
To address this challenge, researchers introduced reinforcement learning from human feedback (RLHF), which involves training a reward predictor, or preference model, to estimate human preferences. Utilizing a reward predictor significantly enhances the cost-effectiveness and scalability of the process compared to directly supplying human feedback to an RL algorithm.
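To make the idea concrete, below is a minimal sketch of the pairwise objective commonly used to train such a preference model: the reward model should assign a higher score to the completion the labeler preferred. The function name and tensor shapes are illustrative, not a fixed API.

```python
import torch.nn.functional as F
from torch import Tensor

def preference_loss(reward_chosen: Tensor, reward_rejected: Tensor) -> Tensor:
    """Pairwise preference loss for a reward model.

    reward_chosen / reward_rejected are the scalar scores the reward model
    assigns to the completion a labeler preferred and the one they rejected.
    Minimizing -log sigmoid(difference) pushes the preferred completion's
    score above the rejected one's.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```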
The RLHF Process: Insights from OpenAI
Leveraging RLHF to Enhance Large Language Models
RLHF is a powerful tool for improving the helpfulness and accuracy of large language models and for reducing harmful biases. A comparison between GPT-3 and InstructGPT (a model fine-tuned using RLHF) conducted by OpenAI researchers revealed that labelers “significantly prefer” outputs from InstructGPT. InstructGPT also demonstrated improvements over GPT-3 on truthfulness and toxicity benchmarks. Similarly, a 2022 research paper by Anthropic documented similar benefits, stating that “RLHF improves helpfulness and harmlessness by a huge margin when compared to simply scaling models up.” These studies present a compelling case for leveraging RLHF to achieve various business objectives with large language models.
Let’s Explore the RLHF Workflow for Fine-tuning
Step 1: Gather Demonstration Data and Train a Supervised Policy
To initiate the fine-tuning of a large language model (LLM), the first step is to collect a dataset called demonstration data. This dataset consists of text prompts and their corresponding outputs, representing the desired behavior of the fine-tuned model. For example, in an email summarization task, the prompt could be a full email, and the completion would be a two-sentence summary. In a chat task, the prompt might be a question, and the completion would be the ideal response.
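As a concrete illustration, the snippet below writes a couple of invented demonstration examples to a JSONL file. The prompt/completion layout is one common convention, similar to what OpenAI’s fine-tuning guides describe; your tooling may expect a different schema.

```python
import json

# Two invented demonstration examples: a prompt plus the completion we want
# the fine-tuned model to imitate.
demonstrations = [
    {
        "prompt": "Summarize the following email in two sentences:\n\nHi team, ...",
        "completion": "The sender proposes moving the launch to next week and asks everyone to confirm availability by Friday.",
    },
    {
        "prompt": "Customer: How do I reset my password?",
        "completion": "Go to Settings > Account > Reset Password and follow the link we email you.",
    },
]

# One JSON object per line, a layout most fine-tuning pipelines accept.
with open("demonstrations.jsonl", "w") as f:
    for example in demonstrations:
        f.write(json.dumps(example) + "\n")
```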
Demonstration data can be sourced from various channels, such as existing data, a labeling team, or even from a model itself, as demonstrated in “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” According to OpenAI’s fine-tuning guidelines, a few hundred high-quality examples are typically required for successful fine-tuning, and performance tends to increase roughly linearly with each doubling of the dataset size. It is crucial to manually review the demonstration datasets to ensure accuracy, avoid toxic content, mitigate biases, and provide helpful information, as recommended by OpenAI’s researchers.
Platforms like OpenAI and Cohere offer comprehensive guides on the technical steps involved in fine-tuning a large language model using supervised learning.
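For readers who want to see the mechanics without a hosted API, here is a minimal supervised fine-tuning sketch using the open-source Hugging Face transformers library. The base model (“gpt2”), file name, and hyperparameters are placeholders, not recommendations.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for whatever base LLM you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# demonstrations.jsonl: one {"prompt": ..., "completion": ...} object per line
dataset = load_dataset("json", data_files="demonstrations.jsonl")["train"]

def to_features(example):
    # Concatenate prompt and completion into a single training sequence.
    text = example["prompt"] + example["completion"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-model", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```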
Step 2: Collect Comparison Data and Train a Reward Model
Once the large language model has been fine-tuned using supervised learning, it becomes capable of generating task-specific completions on its own. The next stage of the RLHF process involves collecting human feedback in the form of comparisons on these generated completions. This comparison data is then utilized to train a reward model, which will subsequently be employed to optimize the fine-tuned supervised learning model through reinforcement learning (as described in step 3).
To generate comparison data, a labeling team is presented with multiple completions produced by the model for the same prompt. The labelers rank these completions from best to worst. The number of completions can vary, ranging from a simple side-by-side comparison to the ranking of three or more completions. OpenAI found it effective to present labelers with a range of four to nine completions to rank at a time during the fine-tuning of InstructGPT.
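To show how those rankings become reward-model training data, the sketch below expands a single ranking into every (chosen, rejected) pair, which is the form the pairwise loss shown earlier consumes. The prompt and completions are placeholders.

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked_completions: list[str]) -> list[dict]:
    """Expand one labeler ranking (best first) into pairwise comparisons.

    A ranking of K completions yields K*(K-1)/2 (chosen, rejected) pairs.
    """
    pairs = []
    for better_idx, worse_idx in combinations(range(len(ranked_completions)), 2):
        pairs.append({
            "prompt": prompt,
            "chosen": ranked_completions[better_idx],
            "rejected": ranked_completions[worse_idx],
        })
    return pairs

# Example: four completions ranked best to worst produce six pairs.
pairs = ranking_to_pairs("Summarize this email: ...", ["A", "B", "C", "D"])
print(len(pairs))  # 6
```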
There are third-party vendors and tools that facilitate comparison tasks, allowing you to upload model completions directly or generate them in real time via a model endpoint.
It is crucial to test the fine-tuned LLM against baselines to assess honesty, helpfulness, bias, and toxicity. Standard LLM benchmarks like TruthfulQA, the Bias Benchmark for Question Answering, and RealToxicityPrompts for toxicity evaluation can be used for this purpose.
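One possible shape for such an evaluation is sketched below: score the baseline and the fine-tuned model on the same benchmark prompts and compare averages. Here, `baseline_generate`, `finetuned_generate`, and `scorer` are hypothetical callables you would supply, and the dataset identifier follows Hugging Face Hub naming, which may differ in your setup.

```python
from datasets import load_dataset

def compare_on_benchmark(baseline_generate, finetuned_generate, scorer, n=100):
    """Average a custom score for two models over the first n benchmark prompts."""
    dataset = load_dataset("truthful_qa", "generation", split="validation")
    baseline_scores, finetuned_scores = [], []
    for row in dataset.select(range(n)):
        question = row["question"]
        baseline_scores.append(scorer(question, baseline_generate(question)))
        finetuned_scores.append(scorer(question, finetuned_generate(question)))
    return sum(baseline_scores) / n, sum(finetuned_scores) / n
```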
Step 3: Optimize the Supervised Policy Using Reinforcement Learning
In this step, the supervised learning baseline, which represents the fine-tuned LLM, is further optimized by leveraging a reinforcement learning (RL) algorithm. One prominent class of RL algorithms developed by OpenAI is Proximal Policy Optimization (PPO). Detailed information on PPO algorithms can be found on OpenAI’s website.
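Without reproducing a full training loop, the sketch below shows the two pieces that matter conceptually: the clipped PPO surrogate loss, and the RLHF reward that combines the reward model’s score with a KL penalty keeping the policy close to the supervised baseline. Function names and coefficient values are illustrative.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss over a batch of sampled tokens.

    logprobs_new / logprobs_old: log-probabilities of the sampled tokens under
    the current policy and the policy that generated them; advantages are
    estimated from the rewards and a learned value head.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # PPO maximizes the smaller of the two terms, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()

def rlhf_reward(reward_model_score, logprob_policy, logprob_sft, kl_coef=0.02):
    """Reward used during RL fine-tuning: the reward model's score minus a KL
    penalty that discourages drifting too far from the supervised baseline."""
    return reward_model_score - kl_coef * (logprob_policy - logprob_sft)
```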
The RL process aligns the behavior of the supervised policy with the preferences expressed by the labelers. Through iterations of steps 2 and 3, the performance of the model can be continually enhanced.