The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods.
These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM-judges.
In this work, we attempt to answer the following question – do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not?
We define a concrete metric for alignment, and introduce SOS-BENCH (Substance Outweighs Style Benchmark), the largest standardized, reproducible LLM meta-benchmark to date.
We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM-judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors.
Our codebase and complete results can be found at https://github.com/penfever/sos-bench/.
3 LLM-JUDGE PREFERENCE BENCHMARKS
LLM-judge preference benchmarks (hereafter referred to as LLM-judge benchmarks), such as Arena-Hard-Auto, AlpacaEval, and MT-Bench, have become quite popular since their introduction shortly after the initial release of Llama.
Today, many papers report solely or primarily on these benchmarks. This abrupt surge in their use invites closer consideration of their properties and design.
Pairwise preference benchmark scores are inconsistent with static benchmarks, but consistent with each other.
In Table 1, we compare the ranking of eight top-performing LLMs on LiveBench, HELM, Arena-Hard-Auto, and Chatbot Arena (White et al., 2024; Liang et al., 2023).

We find that LLM-judges of alignment indeed have preferences that closely (albeit imperfectly) correlate with those of humans, but that their correlation with static benchmarks is weak.
Similar effects can be observed on the BenchBench leaderboard, from a recent paper that aims to standardize benchmark agreement testing (Perlitz et al., 2024).
When we aggregate using standard benchmarks (HELM, HuggingFace OpenLLM Leaderboard 2, LiveBench, and OpenCompass) as the reference set, the highest overall BenchBench score is 2.2, while the highest score attained by any pairwise preference benchmark is 0.69.
Conversely, if we instead aggregate using preference benchmarks (AlpacaEval, MT-Bench, Arena-Hard, Chatbot Arena), the highest overall score is 1.8, while the highest score attained by any standard benchmark is 1.4.
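As a simplified illustration of benchmark agreement testing (this is not the BenchBench procedure, and the model names and scores below are invented), one can compare how two leaderboards rank a shared pool of models using a rank correlation such as Kendall's tau:

```python
# Minimal sketch of benchmark agreement testing: compare how two leaderboards
# rank the same models. Illustration only; model names and scores are placeholders.
from scipy.stats import kendalltau

static_scores = {"model_a": 71.2, "model_b": 68.4, "model_c": 65.0, "model_d": 59.3}
judge_scores = {"model_a": 55.1, "model_b": 74.9, "model_c": 60.2, "model_d": 48.8}

models = sorted(static_scores)  # fix a common ordering of the shared models
tau, p_value = kendalltau(
    [static_scores[m] for m in models],
    [judge_scores[m] for m in models],
)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")  # low tau => weak agreement
```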
Such measures, however, cannot tell us which benchmark regime we ought to trust more, or why (Perlitz et al., 2024). In order to understand this point better, we introduce a framework for interpreting and comparing them to one another.
Towards a Universal Framework for LLM-Judge Benchmarks
We observe that all pipelines which employ LLMs as surrogates for human pairwise preferences necessarily make use of certain components.
In order to facilitate analysis of these systems, we diagram their common architectural flow in Figure 1.

From this, it quickly becomes clear that there are several key structural distinctions to be made between LLM-judge benchmarks and standard NLP benchmarks:
Most standard benchmark metrics are model-free; all LLM-judge benchmarks require a judge model. Despite this, it is not yet standard practice to ablate the choice of judge or to ensemble across judges, most likely for reasons of cost.
While both the choice of metric and the choice of judge are potential confounds, the opacity of LLMs makes a judge's behavior far harder to interpret than deterministic, model-free metrics such as BLEU.
Standard benchmark scores are reference-based; LLM-judge benchmark scores are comparison-based.
The choice of baseline response against which assistants are compared is another potential confound, because pairwise preferences over texts are not transitive.
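To see why non-transitivity matters for the choice of baseline, consider the following toy example (all win rates are invented): a judge can prefer A to B and B to C while still preferring C to A, so win rates measured against different baselines need not induce a consistent ranking.

```python
# Hypothetical pairwise win rates forming a preference cycle: A beats B,
# B beats C, yet C beats A. All numbers are invented for illustration.
wins = {("A", "B"): 0.60, ("B", "C"): 0.58, ("C", "A"): 0.55}

def beats(x: str, y: str) -> bool:
    """True if x is preferred to y under the (hypothetical) judge."""
    return wins[(x, y)] > 0.5 if (x, y) in wins else wins[(y, x)] <= 0.5

print(beats("A", "B"), beats("B", "C"), beats("C", "A"))  # True True True -> a cycle
```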
Standard benchmarks contain many questions on a narrow subject; LLM-judge benchmarks contain a small number of questions on a wide range of subjects.
The authors of these benchmarks justify this choice by demonstrating that their benchmarks nevertheless correlate strongly with crowdsourced preference reports on Chatbot Arena.
To the extent that such strong correlations are achievable, they require judgment criteria that are consistent across all tasks and all judges, and criteria that general tend to favor stylistic factors over subject-specific ones.
But this does not guarantee that a preference score averaged over a broad range of topics and annotators will correlate well with individual use cases, as capabilities are vector-valued, not scalar-valued.
Because of their high per-question cost, LLM-judge benchmarks are much smaller than standard NLP benchmarks like GLUE and MMLU; Arena-Hard-Auto, for example, has a test set of only 500 questions.
Standard benchmarks specify a metric, but LLM-judge benchmarks must specify both a metric and judging criteria.
This introduces a new potential confound not present in standard benchmarks: the instructions to the judge may be underspecified, unclear, or may simply fail to reflect best practices in aligning models with human preferences.
While the first three concerns might be expected to ameliorate over time as LLMs become less expensive to operate, the fourth concern is foundational – there is no way for a judge to complete a preference-selection task without some degree of inductive bias (Ethayarajh et al., 2024).
We observe that such a bias may be explicit (i.e., it may be introduced via the instructions to the judge) or implicit (i.e., representing a fixed value system the judge brings to the task either independently of, or in violation of, the instructions).
A reasonable desideratum for an objective LLM-judge benchmark would be to make as many biases as possible explicit, and to curb the use of implicit bias.
In service of this goal, we devote our next section to developing an understanding of implicit bias in LLM judges.
4 IMPLICIT BIAS IN LLM-JUDGES
Intuitively, we might expect that given a set of judging criteria and no explicit instructions on how to prioritize them, LLMs would assign equal weight to all criteria.
However, we find that this is not the case.
Rather, LLMs demonstrate powerful implicit biases between judging criteria. They heavily reweight criteria while determining their final preference, all but ignoring some and emphasizing others.
LLMs also exhibit implicit bias within judging criteria, with certain kinds of violations scored far more harshly than others.
Experimental Setting
We conduct our experiments on a series of post-trained checkpoints derived from the Llama-3-8B base model, the Llama-3-8B base model itself (without post-training), OPT-125M, and several GPT checkpoints (Dubey et al., 2024; Brown et al., 2020; Zhang et al., 2022; Xu et al., 2024b; Meng et al., 2024).
As of this writing, all of the checkpoints are available on HuggingFace; we provide names and references for all the post-training methods we consider in Appendix C.
Our LLM-judge benchmark is Arena-Hard-Auto, from Li et al. (2024c).
We choose to make a case study of this LLM-judge benchmark because it is very popular in the literature and it makes some of the strongest claims to alignment with human preference.
Unless otherwise noted, we use the standard settings for Arena-Hard-Auto, which as of this writing uses GPT-4-0314 as a baseline model and GPT-4-1106-preview as a judge.
For reasons of cost, in Tables 2 and 3 we substitute gpt-4o-mini-2024-07-18 for the standard judge.
In order to conduct our experiments, we also modify the judge template; the template we use can be found in Appendix F.2.
Following the authors, we report scores in the form of win-rates over a baseline model and report pairwise preferences in the form of Likert scores.
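For readers unfamiliar with this scoring scheme, the sketch below shows one simple way to turn per-question pairwise verdicts into a win rate over the baseline. It is a simplification, not the Arena-Hard-Auto implementation, and the verdict labels and tie handling are our own assumptions.

```python
# Simplified sketch (not the Arena-Hard-Auto implementation) of converting
# per-question pairwise verdicts into a win rate over the baseline model.
from collections import Counter

# Hypothetical judge verdicts for one candidate model vs. the fixed baseline.
verdicts = ["candidate", "baseline", "tie", "candidate", "candidate", "tie"]

counts = Counter(verdicts)
n = len(verdicts)
# Count ties as half a win so that two indistinguishable models score ~50%.
win_rate = (counts["candidate"] + 0.5 * counts["tie"]) / n
print(f"win rate over baseline: {win_rate:.1%}")  # 66.7% for this toy data
```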
4.1 GIVEN EXPLICIT JUDGMENT CRITERIA, LLM-JUDGES IMPLICITLY REWEIGHT THEM
In order to conduct these experiments, we alter the judge template to provide the judge with explicit biases while leaving room for implicit ones as well.
In addition to an overall preference, we instruct the judge to state a preference on five fine-grained criteria: completeness, conciseness, style, safety, and correctness.
Correctness and completeness assess honesty, safety assesses harmlessness, and completeness, style, and conciseness assess helpfulness.

In Table 2, we show the results. The scores for completeness, style, and the overall score are nearly identical; the rank order of correctness is the same as that for completeness and style, but the scores are compressed, perhaps indicating more ambivalence.
Safety scores are nearly identical across most models, and conciseness is moderately anti-correlated with the overall score. Given a set of criteria, LLM-judges implicitly favor completeness and style.
When seen in this light, the unintuitive ranking of an 8B fine-tune above GPT-4 makes more sense; the fine-tune has produced verbose, didactic, blandly polite responses to every prompt—a set of behaviors we could collectively term stylistic reward hacking.
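One way to quantify such implicit reweighting (not necessarily the procedure behind Table 2) is to correlate each criterion's per-model scores with the overall score across models; the numbers in the sketch below are invented for illustration.

```python
# Sketch: correlate per-criterion win rates with the overall win rate across
# models to see which criteria the judge's overall verdict actually tracks.
# All numbers are invented for illustration; see Table 2 for real results.
import numpy as np

overall = np.array([82.0, 74.5, 61.0, 40.2, 12.3])  # one entry per model
per_criterion = {
    "completeness": np.array([81.5, 75.0, 60.2, 41.0, 13.1]),
    "style": np.array([80.9, 73.8, 62.5, 39.7, 14.0]),
    "correctness": np.array([70.1, 66.0, 58.3, 49.5, 35.2]),  # compressed range
    "conciseness": np.array([30.5, 38.2, 47.0, 55.1, 70.3]),  # anti-correlated
}

for name, scores in per_criterion.items():
    r = np.corrcoef(overall, scores)[0, 1]  # Pearson correlation
    print(f"{name:>12}: r = {r:+.2f}")
```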
4.2 LLM-JUDGES IMPLICITLY WEIGHT CRITERIA VIOLATIONS DIFFERENTLY
In these experiments, we introduce systematic criteria violations into the model responses for the top-performing model (Magpie+DPO) and recompute the model’s LLM-judge score, while leaving the rest of the pipeline unaffected.
We hope to gain some indication of whether the judge has understood the explicit instructions for each criterion and how heavily it weights violations of those criteria.
We provide samples of the altered responses in Appendix G.
For all of our automated interventions, the transforming model was GPT-4o-mini, and it was instructed not to change or remove any factual claims in the original response.
To create our undiverse intervention, we prompted GPT to transform each response and make it as repetitive as possible, eliminating synonyms and unusual words.
The exact prompts we use can be found in Appendix H.
To create our wrong intervention, the research team reviewed each response and changed one salient fact in the model response; for example, if a model asserted that a condition always held, we altered it to say that it never held.
For our concise intervention, we prompted GPT to compress each response as much as possible.
Finally, for our sarcastic intervention, we instructed GPT to rewrite the responses in an obnoxious and irritating tone (without writing anything offensive or harmful).
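As a rough sketch of how such an automated intervention can be applied (the prompt wording and model name below are placeholders, not our exact prompts from Appendix H; an OPENAI_API_KEY is assumed to be set):

```python
# Rough sketch of applying the "concise" intervention to a single model response.
# The system prompt here is a placeholder, not the exact prompt from Appendix H.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_concise(original_response: str) -> str:
    """Rewrite a response to be maximally concise without altering factual claims."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # transforming model
        messages=[
            {"role": "system",
             "content": "Compress the user's text as much as possible. "
                        "Do not change or remove any factual claims."},
            {"role": "user", "content": original_response},
        ],
    )
    return completion.choices[0].message.content

# transformed = make_concise(assistant_answer)  # then re-run the judge pipeline
```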
The results, in Table 3, show that, far from treating all violations equally, LLM judges are highly critical of unconventional stylistic changes, such as making the assistant’s tone sarcastic, but fairly lenient on major factual errors in the response.
It is not clear that these implicit biases accurately reflect our priorities in model alignment; sarcasm, while probably not desirable in most cases, is only a minor violation of helpfulness, whereas incorrect answers strongly violate the honesty principle.
5 SOS-BENCH
New, objective measures of alignment can help the research community make progress faster.
Fortunately, there exist many useful static benchmarks of various aspects of LLM behavior and performance; by categorizing and unifying these disparate works, we can produce a large-scale meta-benchmark to measure progress on certain key aspects of alignment.
SOS-BENCH (Substance Outweighs Style Benchmark) combines 19 existing world knowledge, instruction-following, and safety benchmarks for a holistic view of model performance.
For the complete list of benchmarks we use, please refer to Table 8.
All of the questions in our benchmark have ground-truth answers, and aggregate scores are reported as the average of normalized accuracies, weighted by test-set size, with 95% confidence intervals.
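A simplified sketch of this aggregation is shown below; the benchmark names and scores are placeholders, and the bootstrap-over-benchmarks confidence interval is one reasonable choice rather than a description of our exact implementation.

```python
# Simplified sketch: test-set-size-weighted average of per-benchmark normalized
# accuracies, with a bootstrap 95% confidence interval. Placeholder data only.
import numpy as np

rng = np.random.default_rng(0)

# (normalized accuracy in [0, 1], number of test questions) per benchmark
results = {"bench_a": (0.62, 14000), "bench_b": (0.48, 3000), "bench_c": (0.71, 8000)}

accs = np.array([acc for acc, _ in results.values()])
sizes = np.array([n for _, n in results.values()], dtype=float)

point_estimate = np.average(accs, weights=sizes)

# Bootstrap over benchmarks for a rough 95% CI on the aggregate.
idx = np.arange(len(accs))
boot = []
for _ in range(10_000):
    sample = rng.choice(idx, size=len(idx), replace=True)
    boot.append(np.average(accs[sample], weights=sizes[sample]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"aggregate = {point_estimate:.3f} (95% CI: {lo:.3f}-{hi:.3f})")
```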
Comparing SOS-BENCH to Existing Evaluations
All in all, we test models on 152,380 data points; to the best of our knowledge, this is almost 7x larger than the largest previous open-source LLM benchmark that end users can run themselves, HuggingFace's OpenLLM Leaderboard 2.
While individual model releases and technical reports also include meta-benchmark results, these suffer from two failings: they are not always reproducible, and the aggregations of results, which differ from release to release, are vulnerable to cherry-picking.
By combining scale and standardization, we hope to create a reliable, concrete alignment measure.
Measuring Alignment
We subdivide our benchmark into three factors (world knowledge, instruction-following, and safety) and report the results for each.
For the sake of comparison, we also report results from Arena-Hard-Auto.
For results on individual benchmarks, we refer the reader to https://github.com/penfever/sos-bench/.
A Concrete Measure of Alignment
No measure of alignment can cover all of the factors worthy of consideration; prior work has variously emphasized accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, among other factors (Liang et al., 2023).
We choose to focus on a representative subset of tasks, namely, the widely disseminated Helpful, Honest, and Harmless (HHH) principles (Askell et al., 2021).
These principles, popularized by the AI startup Anthropic, have the virtues of being widely recognized and largely uncontroversial, and therefore make a suitable starting point.
Concretely, we propose that if model A is better on objective measurements of all three factors with respect to model B, then model A is better aligned than model B.
As objective measurements of HHH principles remain aspirational for the time being, we propose the following conceptual mapping:
- Model A is more honest than model B if it exhibits statistically superior performance on measures of world knowledge. Although there is more to honesty than world knowledge, it is not possible for a model to intentionally tell the truth if it does not know what the truth is.
- Model A is more helpful than model B if it exhibits statistically superior performance on measures of instruction-following, because a model that correctly understands instructions is always at least as helpful as a model which fails to understand them, all else being equal.
- Model A is more harmless than model B if it exhibits statistically superior performance on measures of safety, such as red-teaming or refusal benchmarks.
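This criterion amounts to a partial order over models: some pairs are simply incomparable. A minimal sketch is given below; the use of non-overlapping 95% confidence intervals as the significance test is a simplifying assumption, and all numbers are placeholders.

```python
# Minimal sketch of the alignment criterion above: model A is better aligned
# than model B only if it is statistically superior on all three factors.
# Non-overlapping 95% CIs stand in for a proper significance test here.
FACTORS = ("world_knowledge", "instruction_following", "safety")

def superior(a_ci: tuple[float, float], b_ci: tuple[float, float]) -> bool:
    """A is superior to B if A's confidence interval lies entirely above B's."""
    return a_ci[0] > b_ci[1]

def better_aligned(model_a: dict, model_b: dict) -> bool:
    """Each model maps factor name -> (ci_low, ci_high) for that factor's score."""
    return all(superior(model_a[f], model_b[f]) for f in FACTORS)

# Example with placeholder confidence intervals:
a = {"world_knowledge": (0.61, 0.63), "instruction_following": (0.55, 0.58),
     "safety": (0.70, 0.73)}
b = {"world_knowledge": (0.52, 0.55), "instruction_following": (0.49, 0.52),
     "safety": (0.64, 0.68)}
print(better_aligned(a, b))  # True: A dominates B on all three factors
```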