https://arxiv.org/pdf/2408.02666
Model-based evaluation is at the heart of successful model development – as a reward model for training, and as a replacement for human evaluation.
To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly, and the resulting data becomes stale as models improve.
In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions.
Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench.
This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
3 Method
We consider the setting of pairwise evaluation using the LLM-as-a-Judge approach (Zheng et al., 2023), which takes as input a user instruction x together with two candidate model responses, A and B.
The goal of the LLM-as-a-Judge model is to output a preference of which response y is better: A or B.
In order to do this it is common to output, prior to the final judgment, a chain-of-thought (or “reasoning chain”), which is a set of steps generated in natural language that helps the model decide its final judgment.
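To make this setup concrete, the following minimal Python sketch shows one way such a pairwise judge can be prompted and its output parsed. The prompt wording, the "Verdict: A"/"Verdict: B" output convention, and the query_llm callable are illustrative assumptions, not the exact prompt or interface used in this work.

```python
import re

# Illustrative judge prompt; the actual evaluation prompt differs.
JUDGE_TEMPLATE = """\
You are given a user instruction and two assistant responses, A and B.

[User Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}

Think step by step about which response better answers the instruction,
then finish with a single line of the form "Verdict: A" or "Verdict: B".
"""


def judge_pair(query_llm, instruction, response_a, response_b):
    """Run one LLM-as-a-Judge call and return (reasoning_trace, verdict).

    `query_llm` is any callable mapping a prompt string to the model's text
    output; the verdict is 'A', 'B', or None if it cannot be parsed.
    """
    output = query_llm(JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b))
    match = re.search(r"Verdict:\s*([AB])", output)
    return output, (match.group(1) if match else None)
```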
Such models can be used as pairwise reward models to build training data for preference optimization, e.g., for training methods like DPO (Rafailov et al., 2023), Iterative DPO, and Self-Rewarding methods (Yuan et al., 2024). They can also be used for evaluation; e.g., many popular benchmark leaderboards are built using a fixed LLM-as-a-Judge evaluation model (Li et al., 2023) such as GPT-4 (Achiam et al., 2023).
We propose a novel recipe for training such an evaluator. Our overall method is an iterative training scheme that bootstraps improvements by annotating the current model's judgments on constructed synthetic data, so that the Self-Taught Evaluator is more performant on the next iteration. Our overall pipeline is thus as follows:
• Initialization: We assume access to a large set of human-written user instructions, e.g., of the type that is commonly collected in production systems, and an initial seed LLM.
• Instruction Selection: We next select a challenging, balanced distribution of user instructions from the uncurated set by categorizing them via LLM.
• Response Pair Construction: For each user instruction (example) we create a preference pair of two model responses (chosen & rejected), generating them via prompting such that the rejected response is likely of lower quality than the chosen response (a sketch is given after this list).
• Iterative Training: We then iterate the following two steps:
(i) Judgment Annotation: For each example, we sample up to N LLM-as-a-Judge reasoning traces and judgments from the current model.
If we find a correct judgment, we add that example to our training set; otherwise we discard it.
(ii) Model Fine-tuning: We fine-tune the model on the newly constructed training set, which yields an updated model for the next iteration.
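Returning to the Response Pair Construction step above (detailed in Section 3.3), one possible sketch is to answer the original instruction for the chosen response and to answer a deliberately modified, related instruction for the rejected response. The prompt wording and helper names below are illustrative assumptions.

```python
# Illustrative prompt for producing a "related but different" instruction;
# the prompts actually used in Section 3.3 may differ.
MODIFY_TEMPLATE = (
    "Below is a user instruction. Write a new instruction that is related to it "
    "but asks for something subtly different.\n\n"
    "Original instruction:\n{instruction}\n\nNew instruction:"
)


def build_response_pair(instruction, query_llm):
    """Construct one synthetic (instruction, chosen, rejected) triple.

    The chosen response answers the original instruction, while the rejected
    response answers a slightly modified instruction, so it is likely to be a
    worse answer to the original. `query_llm` is the same assumed helper as above.
    """
    chosen = query_llm(instruction)
    modified_instruction = query_llm(MODIFY_TEMPLATE.format(instruction=instruction))
    rejected = query_llm(modified_instruction)
    return instruction, chosen, rejected
```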
Note that in each iteration of training the size of the training set depends on the quality of the current model.
We expect that as the model improves, the size of the training set will increase as well, as the model will be able to find more correct judgments, giving the model a kind of automatic curriculum.
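A high-level sketch of this iterative loop is given below, reusing the judging and pair-construction sketches above. The judge_fn_for and fine_tune arguments are assumed placeholder helpers meant only to make the control flow concrete, not the training setup actually used.

```python
import random


def self_taught_training(examples, judge_fn_for, fine_tune, model,
                         num_iterations=3, n_samples=8):
    """Sketch of the iterative scheme: annotate judgments, then fine-tune.

    `examples` holds synthetic triples (instruction, chosen, rejected) whose
    ordering is known by construction; `judge_fn_for(model)` returns a callable
    (instruction, a, b) -> (reasoning_trace, verdict); `fine_tune(model, data)`
    returns an updated model. All of these are assumed helpers for illustration.
    """
    for _ in range(num_iterations):
        judge = judge_fn_for(model)
        train_set = []
        for instruction, chosen, rejected in examples:
            # Randomize which side the known-better response appears on.
            if random.random() < 0.5:
                a, b, correct = chosen, rejected, "A"
            else:
                a, b, correct = rejected, chosen, "B"

            # (i) Judgment annotation: sample up to N reasoning traces and keep
            # the first one whose final verdict agrees with the synthetic label.
            for _attempt in range(n_samples):
                trace, verdict = judge(instruction, a, b)
                if verdict == correct:
                    train_set.append({"instruction": instruction,
                                      "response_a": a,
                                      "response_b": b,
                                      "target": trace})
                    break  # examples with no correct judgment are discarded

        # (ii) Model fine-tuning on the newly constructed training set.
        model = fine_tune(model, train_set)
    return model
```

Examples for which none of the N sampled judgments matches the synthetic label are simply dropped for that iteration, which is what produces the automatic curriculum described above.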
We next describe each of these steps in detail.
3.1 Initialization
We assume we have access to a pool of user instructions {xi}.
Each sample xi can either be a single text instruction or a multi-turn dialog history between the user and the assistant, with the last turn being an instruction or question from the user. Instructions typically involve different skills such as general knowledge and reasoning, coding, safety, and mathematical reasoning.
3.2 Instruction Selection
Given a pool of human-written user instructions, there may be a large degree of noise, as well as an imbalance in terms of topic, variety, difficulty, and the ability of the model to answer. We therefore aim to select a subset of instructions for which we generate high-quality synthetic responses and judgments that can be further used for training. We use an LLM to classify each input into a given category, for example coding, reasoning, brainstorming, etc. The precise prompt we use is given in Figure 7. We are then free to select data from within those categories, and to discard certain categories not deemed useful for training.
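A minimal sketch of this selection step is shown below; the category list, the classification prompt, and the kept categories are illustrative assumptions (the actual classification prompt is the one given in Figure 7).

```python
from collections import defaultdict

# Illustrative category set and prompt; the real taxonomy is defined by the
# classification prompt in Figure 7.
CATEGORIES = ("coding", "reasoning", "math", "brainstorming", "safety", "other")

CLASSIFY_TEMPLATE = (
    "Classify the following user instruction into exactly one of these "
    "categories: {categories}.\n\nInstruction:\n{instruction}\n\nCategory:"
)


def select_instructions(instructions, query_llm,
                        keep=("coding", "reasoning", "math")):
    """Bucket instructions by LLM-predicted category, then keep chosen buckets."""
    buckets = defaultdict(list)
    for x in instructions:
        label = query_llm(CLASSIFY_TEMPLATE.format(
            categories=", ".join(CATEGORIES), instruction=x)).strip().lower()
        buckets[label if label in CATEGORIES else "other"].append(x)
    # Discard categories not deemed useful for training.
    return [x for category in keep for x in buckets[category]]
```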
3.3 Response Pair Construction