I am thrilled to talk about the recently published paper from researchers at Meta and NYU titled “Self-Rewarding Language Models.” This innovative approach to language model training focuses on self-improvement through iterative training, and it has the potential to change the way we build and experience AI.

Before exploring the significance of this research, let’s first understand how reward models are used in LLM training.

Reward Models and LLMs

Large Language Models (LLMs) use reward models to guide their training process towards desired outcomes. The reward model acts as a feedback mechanism by determining how well the language model’s responses align with specific objectives or criteria. Reward models provide rewards or penalties based on the quality of responses, thus shaping the model’s behavior and improving its accuracy, relevance, and effectiveness in generating human-like text. 

In traditional LLM training, the reward model is trained and optimized first and then frozen (its parameters are fixed and no longer updated). Freezing stabilizes the evaluation criteria while the language model continues to train.
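To make this concrete, here is a minimal sketch (PyTorch/Hugging Face, with a hypothetical checkpoint name) of how a frozen reward model might score a response; real RLHF pipelines are considerably more involved.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical reward-model checkpoint name; real pipelines train their own.
RM_NAME = "my-org/reward-model"

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)

# "Freezing": the reward model's parameters are no longer updated.
reward_model.eval()
for p in reward_model.parameters():
    p.requires_grad_(False)

def score(prompt: str, response: str) -> float:
    """Return a scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits  # shape: (1, 1)
    return logits.squeeze().item()

# The language model's outputs are then rewarded or penalized with this score
# (for example inside a PPO loop), shaping its behavior toward preferred responses.
print(score("Explain photosynthesis briefly.", "Plants convert light into chemical energy."))
```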

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a newer method that improves large language models by folding human preference data directly into the training objective. Unlike traditional pipelines, which train a separate reward model to evaluate and score outputs, DPO takes pairs of outputs ranked by human raters and trains the language model itself to assign higher likelihood to the preferred output, with no separate reward model in the loop.
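At its core, DPO boils down to a single loss computed over (prompt, preferred response, rejected response) triples. Below is a minimal PyTorch sketch of that loss, assuming you have already computed summed token log-probabilities under the trainable policy and a frozen reference model; the variable names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is a batch of summed token log-probabilities for the chosen
    (preferred) or rejected response, under the trainable policy or the frozen
    reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected

    # Push the margin (chosen minus rejected) up; beta controls the strength.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.2, -9.0]))
print(loss.item())
```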

As you can see, both methods depend on human involvement, so their success is limited by the amount and quality of the human feedback available. In the case of Reinforcement Learning from Human Feedback (RLHF), the quality of the frozen reward model trained on that data also plays a significant role in overall effectiveness.

The researchers argue that these approaches severely limit progress toward Artificial General Intelligence (AGI), since models with superhuman abilities would require feedback that goes beyond what humans alone can provide.

What is the Self-Rewarding Language Model?

The paper introduces “Self-Rewarding Language Models,” a revolutionary concept in language model training. These models are unique in their ability to generate and evaluate their own training data, improving iteratively through self-alignment.

In Self-Rewarding Language Models, the same model performs two roles:

  1. Act as an instruction-following model, generating responses for given prompts;
  2. Generate and evaluate new instruction-following examples to add to its own training set.
Image Courtesy: Self-Rewarding Language Models

As the illustration from the paper shows, the method consists of two steps:

  1. Self-instruction creation: newly created prompts are used to generate candidate responses from model Mt, which also predicts its own rewards via LLM-as-a-Judge prompting.
  2. Instruction-following training: preference pairs are selected from the generated data and used for training via DPO, resulting in model Mt+1.

The process is iterated multiple times to improve both the model’s instruction-following capability (better responses) and its reward-modeling ability (judging the quality of LLM outputs).

How Does It Work?

The model performs all of the following steps; a rough sketch of the full loop is shown after the list.

  1. Generate new prompts/instructions for itself.
  2. Generate candidate responses for the instructions created in step 1.
  3. Score the responses from step 2 via LLM-as-a-Judge prompting, build preference pairs from the scores, and train on them with DPO.
  4. Iterate the process with the updated model.
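Putting these steps together, one iteration of the loop can be sketched roughly as follows. This is illustrative Python pseudocode: the helpers passed in (generate_prompts, generate_candidates, judge_score, dpo_train) are hypothetical stand-ins, not the paper’s actual implementation.

```python
def self_rewarding_iteration(model, seed_prompts,
                             generate_prompts, generate_candidates,
                             judge_score, dpo_train,
                             num_candidates=4):
    """One round of Self-Rewarding training: M_t -> M_{t+1}.

    The callables passed in are hypothetical stand-ins:
      generate_prompts(model, seeds)     -> new instructions (self-instruction creation)
      generate_candidates(model, p, n)   -> n candidate responses for prompt p
      judge_score(model, p, r)           -> 0-5 reward via LLM-as-a-Judge prompting
      dpo_train(model, pairs)            -> model fine-tuned with DPO on those pairs
    """
    # Step 1: the model writes new instructions for itself.
    new_prompts = generate_prompts(model, seed_prompts)

    preference_pairs = []
    for prompt in new_prompts:
        # Step 2: generate several candidate responses per prompt.
        candidates = generate_candidates(model, prompt, num_candidates)

        # Step 3: the same model scores its own candidates (self-rewards);
        # the best- and worst-scored responses form a (chosen, rejected) pair.
        scored = sorted(((judge_score(model, prompt, r), r) for r in candidates),
                        key=lambda sr: sr[0])
        (_, rejected), (_, chosen) = scored[0], scored[-1]
        preference_pairs.append((prompt, chosen, rejected))

    # Step 4: train on the self-generated preferences with DPO, yielding M_{t+1}.
    return dpo_train(model, preference_pairs)
```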

LLM-as-a-Judge Prompt

The LLM-as-a-Judge prompt lets the model act as its own reward model, providing self-rewards for the outputs it generates.

Rewards are assigned with an additive 5-point scoring system: the output earns one point for each criterion it satisfies, up to a maximum of 5 points. To better understand how this is done, please review the screenshot of the prompt below.

Image Courtesy: Self-Rewarding Language Models
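To give a feel for how such a judge could be wired up, here is a minimal sketch of an additive 5-point judge prompt plus a helper that extracts the final score. The wording is a paraphrase of the idea, not the paper’s exact prompt, and the “Score: N” extraction format is an assumption.

```python
import re

# Paraphrased additive-scoring criteria; the paper's actual judge prompt differs in wording.
JUDGE_TEMPLATE = """Review the user's question and the response below using an additive
5-point scoring system. Add 1 point for each criterion the response satisfies:
- it is relevant and provides some information related to the question;
- it addresses a substantial portion of the question;
- it answers the basic elements of the question in a useful way;
- it is clearly written from an AI Assistant's perspective and well organized;
- it is impeccably tailored, expert, and free of extraneous content.

User: {prompt}
Response: {response}

Conclude with the line: "Score: <total points>"
"""

def parse_score(judge_output: str):
    """Extract the final 'Score: N' value from the judge model's reply (assumed format)."""
    matches = re.findall(r"Score:\s*([0-5])", judge_output)
    return int(matches[-1]) if matches else None

# Example: format the template, send it to the model acting as judge,
# then parse its reply to obtain the 0-5 self-reward.
print(parse_score("The response is relevant and mostly complete.\nScore: 4"))
```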

Results: 

Image Courtesy: Self-Rewarding Language Models

They performed three iterations of self-rewarding training on the base model Llama 2 70B. 

  • Iteration 1: In the first iteration, the Self-Rewarding model (M1) performed roughly on par with the supervised fine-tuned (SFT) baseline in head-to-head evaluation.
  • Iteration 2: The second-iteration Self-Rewarding model (M2) provides superior instruction following to Iteration 1 (M1), winning 55.5% of head-to-head comparisons versus only 11.7% for M1. The win rate against the SFT baseline also went up by 55%.
  • Iteration 3: In the third iteration, M3 showed further improvement over Iteration 2, winning 47.7% of head-to-head comparisons versus 12.5% for M2. Furthermore, M3’s win rate against the SFT baseline rose to 62.5%, with the baseline winning only 9.8% of the time.

Performance of the Model on AlpacaEval 2.0

AlpacaEval2 is an automated system that evaluates language models based on their ability to follow instructions. The system uses a set of benchmarks known as the AlpacaFarm evaluation set to test the models. The responses of the models are then compared to reference responses, which are generated by GPT-4 Turbo for AlpacaEval 2.0. 
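As a rough illustration of what a pairwise win rate means here, the toy sketch below tallies how often a judge prefers the candidate model’s response over the reference; the data format is hypothetical and ignores ties and AlpacaEval’s actual tooling.

```python
def win_rate(judgments):
    """Fraction of head-to-head comparisons the candidate model wins.

    `judgments` holds one label per evaluation prompt: "candidate" if the judge
    preferred the candidate's response over the GPT-4 Turbo reference, else
    "reference". Ties and length corrections are ignored in this toy version.
    """
    wins = sum(1 for j in judgments if j == "candidate")
    return wins / len(judgments)

# Toy example: preferred on 3 of 5 prompts -> 60% win rate.
print(win_rate(["candidate", "reference", "candidate", "candidate", "reference"]))
```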

Image Courtesy: Self-Rewarding Language Models

The screenshot of the table shows the assessment of the Self-Rewarding model in the AlpacaEval 2.0 leaderboard format. You can see that the Iteration 3 model’s performance is on par with GPT-4 (March edition) and Mistral Medium, and that it outperforms models such as Claude 2 and Gemini Pro.

Why Is This Significant?

The Self-Rewarding model represents a significant departure from traditional models that depend on fixed reward systems derived from human-generated data. The self-rewarding method has the potential to surpass the limitations of human-based training, resulting in models that align more closely with desired outcomes and can continuously improve themselves. This could significantly accelerate the development of more efficient and autonomous language models, and it is a step in the right direction toward Artificial General Intelligence.

Research Paper: Self-Rewarding Language Models

Paper Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

