https://arxiv.org/pdf/2411.11504
The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals.
However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities.
Consequently, there is an urgent need to explore novel supervision signals and technical approaches.
In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models.
The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models.
We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage.
We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.
Machine learning's evolution has been marked by continuous advancement in both model capabilities and supervision signals.
This relationship has been synergistic: as models grew more powerful, they required more sophisticated supervision, and as supervision signals expanded, models needed greater capacity to utilize them effectively.
The field's early period was dominated by feature engineering, where domain experts manually designed and extracted features.
Classical algorithms like Support Vector Machines and Decision Trees relied heavily on these handcrafted features due to their architectural limitations. While these approaches worked well with carefully designed features (like TF-IDF), they showed clear limitations when tackling more complex problems.
Deep learning's emergence about two decades ago ushered in the data engineering era.
This marked a shift away from manual feature crafting toward curating high-quality datasets and annotations for automated learning. Projects like ImageNet and BERT demonstrated the effectiveness of this data-centric approach.
However, the rise of foundation models has made it increasingly difficult to enhance capabilities through data engineering alone.
While foundation models (especially LLMs) have shown remarkable abilities matching or exceeding human performance, traditional data engineering faces two major challenges:
1. The high cost and difficulty of obtaining quality human annotations
2. The complexity of providing meaningful guidance for further improvement
To address these challenges, this paper proposes verifier engineering, a novel post-training paradigm designed for the foundation model era. Rather than relying on manual feature extraction or data annotation, it employs automated verifiers to perform verification tasks and provide feedback to foundation models.
The verifier engineering process consists of three core stages:
1. Search: Sampling from model output distribution to understand performance boundaries
2. Verify: Using various verifiers to evaluate candidate responses
3. Feedback: Using verification results to enhance model performance
This paper provides a comprehensive exploration of verifier engineering:
- Section 2: Formal definition and preliminaries
- Sections 3-5: Detailed examination of search, verify, and feedback stages
- Section 6: Current trends and limitations
- Section 7: Concluding remarks
Unlike previous approaches like RLHF that rely on limited verifier sources, verifier engineering integrates multiple verifiers for more accurate and generalizable feedback.
By shifting from a data-centric to a systematic engineering approach, it represents a crucial step toward advancing artificial general intelligence.
2. Verifier Engineering
In this section, we formalize verifier engineering as a Goal-Conditioned Markov Decision Process (GC-MDP), providing a unified and systematic perspective. We explore how the concepts of search, verify, and feedback fit within this framework, with examples, and Table 3 summarizes how existing post-training approaches map onto the three core stages.
2.1. Preliminary
While LLMs are typically trained to maximize generation likelihood given input, this objective alone cannot guarantee desired post-training capabilities.
To bridge this gap, we formalize verifier engineering as a GC-MDP, denoted as tuple (S, A, T, G, R_g, p_g), where:
State Space S:
- Represents the model's state during interaction
- Includes input context, internal states, and intermediate outputs
Action Space A:
- Represents possible token selections at each generation step
- A = {a_1, a_2, ..., a_N}
Transition Function T:
- Defines probability distribution over next states, given current state s ∈ S and action a ∈ A
- In large language models, this is a deterministic function
- Once current state and selected action (generated token) are given, next state is fully determined
- The search stage can be regarded as exploration in the action-state space under T
Goal Space G:
- Represents various goals related to model capabilities
- Each goal g ∈ G corresponds to a specific model capability (code, math, writing, etc.)
- Multi-dimensional space encompassing various aspects of model capabilities
Goal Distribution p_g:
- Probability distribution over goals from goal space G
- Represents likelihood of specific goal g ∈ G being selected
- Can be learned from human feedback or other external signals
Reward Function R_g(s, a) or R_g(s′):
- Represents the reward the model receives given goal g
- Applied when taking action a from state s or at transformed state s′
- Reflects verification result for specific goal g
- Example: For goal "fairness," reward might be based on avoiding bias
The objective for improving model capabilities is defined as a goal-conditioned policy π : S × G × A → [0, 1] that maximizes the expectation of the cumulative return over the goal distribution.
Formally, J(π) = E_{g∼p_g} E_{a_t∼π(·|s_t, g)} [Σ_t R_g(s_t, a_t)].
In practice, the verification result for a goal g is obtained by combining multiple verifier sub-functions, R_g(s) = F({v(s) | v ∈ S_g}), where S_g represents the selected sub-functions for goal g and F combines their evaluations.
Following Probably Approximately Correct Theory (PAC) (Vapnik, 2000), even with imperfect sub-functions, we can achieve reliable overall evaluation by combining multiple weak verifiers.
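As a rough illustration of this point (our sketch, not a construction from the paper), several noisy binary sub-verifiers can be combined by weighted voting; the uniform default weights and the 0.5 acceptance threshold are arbitrary choices for the example.

```python
def combine_weak_verifiers(state, sub_verifiers, weights=None):
    """Aggregate verdicts from imperfect sub-verifiers into a single score.

    sub_verifiers: callables state -> bool (each may be noisy).
    Returns (score, accept), where accept applies a majority-style rule.
    """
    if weights is None:
        weights = [1.0] * len(sub_verifiers)
    total = sum(weights)
    score = sum(w * float(v(state)) for v, w in zip(sub_verifiers, weights)) / total
    return score, score >= 0.5
```

With independent sub-verifiers that are each right more often than not, the aggregated verdict tends to be more reliable than any individual one, which is the intuition behind the PAC-style argument above.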
2.2. Verifier Engineering Overview
Building upon the GC-MDP framework, we demonstrate how the three stages of verifier engineering—Search, Verify, and Feedback—naturally align with specific components of this formalism.
This mapping provides both theoretical grounding and insights into how each stage contributes to optimizing the policy distribution π toward desired goals.
As illustrated in Figure 1, the search stage can be divided into linear search and tree search methods.
The verify stage selects appropriate verifiers and invokes them to provide verification results based on the candidate responses.
The feedback stage improves the model's output using either training-based methods or inference-based methods.
For example, RLHF utilizes linear search to generate batches of responses, employs a reward model as the verifier, and applies the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm to optimize the model based on the reward model's verification result.
OmegaPRM (Luo et al., 2024a) employs a process reward model as a verifier and searches for the best result based on the PRM by maximizing the process reward scores.
Experiential Co-Learning (Qian et al., 2023) builds collaborative verifiers from multiple LLMs, enhancing both verification and model performance through historical interaction data.
In the GC-MDP framework, the three stages manifest as follows:
Search corresponds to action selection, where state s represents the current context or partial sequence, and action a denotes the next tokens to generate.
Verify maps to the reward function R_g(s, a), evaluating how well the generated outputs align with the specified goal g.
Feedback relates to policy distribution π optimization, maximizing the expected cumulative return J(π) based on verification results.
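To make this mapping concrete, here is a minimal Python sketch (ours, not the paper's) of one search-verify-feedback iteration; `sample_candidates`, `verifier`, and `update_policy` are hypothetical placeholders for whichever concrete methods from Sections 3-5 are used.

```python
def post_training_iteration(policy, prompts, goal,
                            sample_candidates, verifier, update_policy):
    """One hypothetical search-verify-feedback iteration."""
    records = []
    for prompt in prompts:
        # Search: explore the state-action space under the current policy.
        candidates = sample_candidates(policy, prompt, n=8)
        # Verify: compute R_g for each candidate with respect to the goal g.
        scored = [(c, verifier(prompt, c, goal)) for c in candidates]
        records.append((prompt, scored))
    # Feedback: optimize π using the verification results (training-based);
    # an inference-based variant would instead re-rank or condition generation
    # at test time without touching the parameters.
    return update_policy(policy, records)
```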
3. Search
Search aims to identify high-quality generated sequences that align with the intended goals, forming the foundation of verifier engineering, which is critical for evaluating and refining foundation models.
However, it is impractical to exhaustively search the entire state-action space due to its exponential growth, driven by the large vocabulary size N and the maximum generation length T.
To address this challenge, efficient searching, which seeks to navigate this vast space by prioritizing diverse and goal-oriented exploration, plays a critical role in improving the model's performance.
In this section, we first introduce how to implement diverse search from the perspective of search structure, and then discuss additional methods for further enhancing search diversity once the structure is determined.
3.1. Search Structure
Search structure denotes the framework or strategy used to navigate the state-action space, which significantly influences the effectiveness and efficiency of the search process.
Currently, there are two widely adopted structures for implementing search: linear search and tree search.
Linear search progresses sequentially, making it effective for tasks involving step-by-step actions, while tree search examines multiple paths at each decision point, making it well-suited for tasks requiring complex reasoning.
• Linear Search is a widely used search method where the model starts from an initial state and proceeds step by step, selecting one token at a time until reaching the terminal state (Brown et al., 2020b; Wang & Zhou, 2024).
The key advantage of linear search is its low computational cost, which makes it efficient in scenarios where actions can be selected sequentially to progress toward a goal.
However, a notable limitation is that if a suboptimal action is chosen early in the process, it may be challenging to rectify the subsequent sequence.
Thus, during the process of verifier engineering, careful verification at each step is crucial to ensure the overall effectiveness of the generation path.
• Tree Search involves exploring multiple potential actions at each step of generation, allowing for a broader exploration of the state-action space.
For instance, techniques like Beam Search and ToT (Yao et al., 2024) incorporate tree structures to enhance exploration, improving model performance in reasoning tasks.
TouT (Mo & Xin, 2023) introduces uncertainty measurement based on Monte Carlo dropout, providing a more accurate evaluation of intermediate reasoning processes.
This approach significantly improves the likelihood of discovering a globally optimal solution, particularly in environments with complex state spaces.
By simultaneously considering multiple paths, tree search mitigates the risk of being locked into sub-optimal decisions made early on, making it more robust in guiding the model toward an optimal outcome.
However, this increased exploration comes at a higher computational cost, making tree search more suitable for scenarios where the optimal path is challenging to identify.
To make tree search effective, candidate paths must be continuously verified so that those better aligned with the goal conditions are prioritized; a minimal sketch of both search structures follows this list.
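The sketch below (our illustration, with a hypothetical `step_fn(state)` that returns `(next_state, logprob, is_terminal)` candidates) contrasts the two structures: linear search commits to one action per step, whereas a beam-style tree search keeps several partial paths alive.

```python
import heapq

def linear_search(step_fn, state, max_steps):
    """Linear search: extend a single path greedily, one action at a time."""
    total_logprob = 0.0
    for _ in range(max_steps):
        candidates = step_fn(state)
        if not candidates:
            break
        state, logprob, done = max(candidates, key=lambda c: c[1])
        total_logprob += logprob
        if done:
            break
    return state, total_logprob

def tree_search(step_fn, state, max_steps, beam_width=4):
    """Beam-style tree search: keep the top-k partial paths at every step."""
    beam = [(0.0, state, False)]  # (cumulative logprob, state, finished)
    for _ in range(max_steps):
        expansions = []
        for logprob, s, done in beam:
            if done:
                expansions.append((logprob, s, True))
                continue
            for next_state, step_logprob, next_done in step_fn(s):
                expansions.append((logprob + step_logprob, next_state, next_done))
        if not expansions:
            break
        beam = heapq.nlargest(beam_width, expansions, key=lambda c: c[0])
        if all(done for _, _, done in beam):
            break
    return max(beam, key=lambda c: c[0])
```

The trade-off described above is visible here: the linear variant does constant work per step, while the beam variant multiplies that work by the beam width in exchange for robustness to early mistakes.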
3.2. Additional Enhancement
While search structures provide the foundational framework for navigating the state-action space, additional techniques are crucial for further improving search performance.
These enhancements address challenges such as balancing exploration and exploitation, escaping local optima, and improving the diversity of generated results.
The enhancement strategies can be broadly categorized into two approaches: adjusting exploration parameters and intervening in the original state.
Adjusting Exploration Parameters:
Techniques such as Monte Carlo Tree Search (MCTS), Beam Search, and rejection sampling focus on refining the exploration process by adjusting parameters like temperature, Top-k (Fan et al., 2018), or Top-p (Holtzman et al., 2020).
The challenge lies in balancing the trade-off between generating diverse outputs and maintaining high-quality sequences.
For example, increasing the temperature parameter promotes greater randomness, fostering diversity but potentially reducing coherence.
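As a concrete, deliberately simplified example of parameter-level control, the sketch below applies temperature scaling followed by nucleus (top-p) filtering over a plain `{token: logit}` dictionary; it is an illustrative toy, not any library's sampler.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0):
    """Temperature plus nucleus (top-p) sampling over a {token: logit} dict.
    Higher temperature gives more diverse but potentially less coherent output."""
    items = list(logits.items())
    weights = [math.exp(logit / temperature) for _, logit in items]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus filtering: keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(items)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    choice = random.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
    return items[choice][0]
```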
Intervening in the Original State:
Another enhancement approach involves modifying the initial state to guide the search process toward specific goals.
Methods like Chain of Thought (CoT) (Wei et al., 2022), Logical Inference via Neurosymbolic Computation (LINC) (Olausson et al., 2023), and Program of Thoughts (PoT) (Chen et al., 2023a) exemplify this strategy.
These interventions address the challenge of overcoming biases in the default state distribution.
CoT enhances reasoning by introducing intermediate steps, improving interpretability and depth in generated sequences.
LINC uses logical scenarios to encourage more structured and goal-oriented outputs.
Similarly, PoT provides programmatic examples that guide models toward systematic problem-solving, effectively expanding the scope of exploration beyond the original state distribution.
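The sketch below shows how such interventions can be viewed as rewriting the initial state before any search takes place; the prompt templates are illustrative placeholders and are not taken from the CoT, LINC, or PoT papers.

```python
def intervene_initial_state(question, style="cot", exemplars=None):
    """Rewrite the initial state (the prompt) before search begins.
    The templates are illustrative placeholders, not from the cited papers."""
    exemplars = exemplars or []
    prefix = "\n\n".join(exemplars)
    if style == "cot":
        suffix = "Let's think step by step."
    elif style == "pot":
        suffix = "Write a short program that computes the answer, then state the answer."
    else:
        suffix = ""
    return f"{prefix}\n\nQuestion: {question}\n{suffix}".strip()
```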
4. Verify
Due to the long delay and high cost associated with human feedback, we cannot directly employ human efforts to evaluate each candidate response sampled by the model during training (Leike et al., 2018a).
Therefore, we employ verifiers as proxies for human supervision in the training of foundation models.
The verifiers play a crucial role in the search-verify-feedback pipeline, and their quality and robustness directly impact the performance of the downstream policy (Wen et al., 2024c).
In the context of GC-MDP, verify is typically defined as using a verifier that provides verification results based on the current state and a predefined goal, F = Verifier(s, g), where F represents the verification results provided by the verifier.
g denotes the predefined goal we aim to achieve (e.g., helpfulness, honesty).
The state s is typically composed of two concatenated components: the user's query or input and the model's output content {a_1, ..., a_t}.
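A minimal interface sketch of this formulation (names and types are our assumptions, not the paper's):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union

@dataclass
class State:
    query: str                                          # the user's query or input
    actions: List[str] = field(default_factory=list)    # generated tokens a_1, ..., a_t

# A verifier maps (state, goal) to a verification result F, which in practice
# may be binary, a score, a ranking position, or free-form text (Section 4.1).
Verifier = Callable[[State, str], Union[bool, float, int, str]]

def length_verifier(state: State, goal: str) -> float:
    """Toy verifier: reward shorter outputs under a hypothetical 'conciseness' goal."""
    return 1.0 / (1.0 + len(state.actions)) if goal == "conciseness" else 0.0
```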
In this section, we classify individual verifiers across several key dimensions and summarize the representative type of verifiers in Table 2.
4.1. A Comprehensive Taxonomy of Verifiers
Perspective of Verification Form
The verification result form of verifiers can be divided into four categories: binary feedback (Gao et al., 2023), score feedback (Bai et al., 2022a), ranking feedback (Jiang et al., 2023a), and text feedback (Saunders et al., 2022).
These categories represent an increasing gradient of information richness, providing more information to the optimization algorithm.
For instance, binary feedback simply indicates correctness, classic Bradley–Terry reward models (Bradley & Terry, 1952) provide continuous score feedback, and text-based feedback from generative reward models (Zhang et al., 2024d) and critique models (Sun et al., 2024b) offers more detailed information, potentially including rationales for scores or critiques.
Perspective of Verify Granularity
The verify granularity of verifiers can be divided into three levels: token-level (Chen et al., 2024b), thought-level (Lightman et al., 2023a), and trajectory-level (Cobbe et al., 2021).
These levels correspond to the scope at which the verifier engages with the model's generation.
Token-level verifiers focus on individual next-token predictions, thought-level verifiers assess entire thought steps or sentences, and trajectory-level verifiers evaluate the overall sequence of actions.
Although most RLHF practices (Ouyang et al., 2022b; Bai et al., 2022a) currently rely on full-trajectory scoring, coarse-grained ratings are challenging to obtain accurately (Wen et al., 2024b), as they involve aggregating finer-grained verification.
Generally, from the perspective of human annotators, assigning finer-grained scores is easier when the full trajectory is visible.
From a machine learning perspective, fine-grained verification is preferable (Lightman et al., 2023a), as it mitigates the risks of shortcut learning and bias associated with coarse-grained verification, thereby enhancing generalization.
A credit-assignment mechanism (Leike et al., 2018b) can bridge the gap between coarse-grained ratings and fine-grained verification.
Perspective of Verifier Source
Verifiers can be divided into program-based and model-based from the perspective of verifier source.
Program-based verifiers provide deterministic verification, typically generated by predefined rules or logic embedded in fixed programs.
These program-based verifiers offer consistent and interpretable evaluations but may lack flexibility when dealing with complex, dynamic environments.
On the other hand, model-based verifiers rely on probabilistic models to generate verification results.
These verifiers adapt to varying contexts and tasks through learning, allowing for more nuanced and context-sensitive evaluations.
However, model-based verifiers introduce an element of uncertainty and can require significant training and computational resources to ensure accuracy and robustness.
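The contrast can be made concrete with a small sketch (ours): the program-based verifier below is a fixed, deterministic rule, while the model-based one simply delegates to a learned scorer passed in as a hypothetical callable.

```python
import re

def program_based_verifier(output: str, goal: str) -> bool:
    """Deterministic rule: e.g., for a hypothetical 'valid_email' goal, accept
    only outputs matching a fixed pattern. Interpretable, but inflexible."""
    if goal == "valid_email":
        return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", output.strip()) is not None
    return False

def model_based_verifier(output: str, goal: str, scorer) -> float:
    """Probabilistic check: `scorer` is a learned model (hypothetical callable
    returning a value in [0, 1]); more adaptable, but uncertain and costly."""
    return float(scorer(output, goal))
```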
Perspective of Extra Training
Verifiers can also be divided into two categories based on whether they require additional specialized training.
Verifiers requiring additional training are typically fine-tuned on specific task-related data, allowing them to achieve higher accuracy in particular problem domains (Markov et al., 2023).
However, their performance can be heavily influenced by the distribution of the training data, potentially making them less generalizable to different contexts.
On the other hand, verifiers that do not require extra training are often based on pre-existing models (Zelikman et al., 2024; Zheng et al., 2024a).
While they may not reach the same level of accuracy as their task-specific counterparts, they are generally more robust to variations in data distribution, making them less dependent on specific training sets.
This trade-off between accuracy and data sensitivity is a key consideration when selecting a verifier for a given application.
4.2. Constructing Verifiers across Tasks
Different tasks impose varying requirements on verifiers, which serve as critical tools for enhancing the performance of foundation models.
In this section, we will highlight key practices for constructing verifiers in representative fields, including safety and reasoning.
Safety
In safety-critical applications, verifiers play a crucial role in ensuring that foundation models adhere to ethical standards and avoid generating harmful or inappropriate content.
Program-based verifiers can enforce strict guidelines by filtering out outputs that include prohibited content, such as hate speech or sensitive personal information.
For instance, a content moderation system might employ predefined keywords and patterns to identify and block offensive language.
However, a limitation of program-based approaches is their vulnerability to adversarial attacks, as paraphrased content can often bypass these filters (Krishna et al., 2020).
In contrast, model-based verifiers, such as toxicity classifiers (Lees et al., 2022; Markov et al., 2023; Inan et al., 2023), offer probabilistic assessments of content safety, enabling more nuanced evaluations.
A middleground approach is rule-based reward models (RRMs) (Mu et al., 2024), which balance interpretability with generalization capabilities.
Integrating verifiers into both the training and deployment phases allows foundation models to align more closely with safety requirements, reducing the likelihood of unintended or harmful outputs.
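A two-layer safety check along these lines might look like the following sketch (illustrative only; the blocked-term list and threshold are placeholders, and `toxicity_scorer` stands in for a classifier such as those cited above).

```python
def safety_verifier(text, blocked_terms, toxicity_scorer=None, threshold=0.5):
    """Two-layer safety check (illustrative):
    1) program-based: reject if any blocked term appears verbatim;
    2) model-based: optionally reject if a hypothetical toxicity scorer
       (callable returning a probability) exceeds the threshold."""
    lowered = text.lower()
    if any(term.lower() in lowered for term in blocked_terms):
        return False
    if toxicity_scorer is not None and toxicity_scorer(text) > threshold:
        return False
    return True
```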
Reasoning
In tasks requiring logical deduction or problem-solving, verifiers can assess the correctness and coherence of each reasoning step.
Token-level verifiers offer fine-grained evaluations of individual tokens or symbols, which is particularly useful in mathematical computations and code generation (Wang et al., 2023b).
Thought-level verifiers, on the other hand, evaluate entire sentences or reasoning steps to ensure that each component of the argument is both valid and logically consistent (Lightman et al., 2023a; Li et al., 2023b; Xie et al., 2024).
Trajectory-level verifiers assess the overall solution or proof, providing comprehensive verification results on the coherence of the model's reasoning (Cobbe et al., 2021; Yu et al., 2024; Wang et al., 2024a).
For instance, in mathematical theorem proving, a program-based verifier like Lean (de Moura et al., 2015) can check each proof step's validity against formal logic rules (Lin et al., 2024), while a model-based verifier can assess the plausibility of reasoning steps through scores and natural language explanations (Zhang et al., 2024c), offering critiques for further refinement (Kumar et al., 2024).
A simpler yet widely-used approach involves using manually annotated correct answers as verifiers to filter model outputs, progressively enhancing model performance (Zelikman et al., 2022b).
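For the answer-matching approach, a rule-based trajectory-level verifier can be as simple as the sketch below; the '####' answer marker is an assumption borrowed from some math datasets rather than a fixed standard.

```python
def answer_match_verifier(model_output: str, reference_answer: str) -> bool:
    """Rule-based, trajectory-level verifier in the spirit of answer filtering:
    extract the final answer after a marker and compare it to the annotation.
    The '####' marker is an assumption, not a universal convention."""
    marker = "####"
    final = model_output.rsplit(marker, 1)[-1] if marker in model_output else model_output
    return final.strip().replace(",", "") == reference_answer.strip().replace(",", "")
```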
5. Feedback
After obtaining the verification results, we aim to enhance the capabilities of the foundation model; we define this process as the feedback stage.
In this paper, feedback specifically refers to enhancing the foundation model's capabilities based on the verification results.
The feedback stage is critical, as the effectiveness of the feedback method directly determines whether the foundation model's capabilities can be appropriately enhanced in response to the verification results.
In this section, we explore how verifier engineering utilizes search algorithms and verifiers to feed verification results back to foundation models.
To maximize the objective function J(π), the distribution of the policy π can be optimized by adjusting s_t or the parameters of π.
This leads to two distinct feedback approaches: training-based feedback and inference-based feedback.
Training-based feedback involves updating model parameters using data efficiently obtained through searching and verifying.
In contrast, inference-based feedback modifies the output distribution by incorporating search and verification results as auxiliary information during inference, without altering the model parameters.
5.1. Training-based Feedback
We categorize the common training strategies for training-based feedback into three types, based on the nature and organization of the data utilized:
Imitation Learning:
Imitation Learning typically uses high-quality data selected by verifiers, employing supervised fine-tuning objectives like cross-entropy or knowledge distillation training objectives like Kullback-Leibler divergence (Hinton, 2015) to optimize the model.
Various approaches are employed to enhance specific capabilities of the foundation model through imitation learning.
To enhance mathematical reasoning capabilities in LLMs, approaches like STaR (Zelikman et al., 2022b) and RFT (Yuan et al., 2023b) use rule-based verifiers to compare solution outcomes, while WizardMath (Luo et al., 2023a), MARIO (Liao et al., 2024), and MetaMath (Yu et al., 2023a) leverage verifiers to select responses from advanced LLMs or human inputs.
Other methods such as MAmmoTH (Yue et al., 2023) and MathCoder (Wang et al., 2023a) utilize programmatic tools to verify comprehensive and detailed solutions.
For coding capability improvement, various methods including Code Alpaca (Chaudhary, 2023), WizardCoder (Luo et al., 2023b), WaveCoder (Yu et al., 2023b), and OpenCodeInterpreter (Zheng et al., 2024b) construct code instruction-following datasets by distilling knowledge from advanced LLMs.
To enhance instruction-following abilities, approaches like LLaMA-GPT4 (Peng et al., 2023), Baize (Xu et al., 2023), and Ultrachat (Ding et al., 2023) employ verifiers to distill responses from advanced LLMs for supervised fine-tuning.
Other methods such as Deita (Liu et al., 2024) and MoDS (Du et al., 2023) implement a pipeline of verifiers to check complexity, quality, and diversity before selecting suitable data for SFT.
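A rejection-sampling-style sketch of this data-construction step (ours, with hypothetical `generate` and `verifier` callables) is shown below; the accepted pairs would then be used for standard supervised fine-tuning.

```python
def build_sft_dataset(prompts, generate, verifier, n_samples=8):
    """Keep only responses the verifier accepts; fine-tune on the result with
    the usual cross-entropy objective. `generate` and `verifier` are hypothetical."""
    dataset = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = generate(prompt)
            if verifier(prompt, response):      # e.g., answer_match_verifier above
                dataset.append({"prompt": prompt, "response": response})
                break                            # keep the first accepted sample
    return dataset
```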
Preference Learning:
Preference Learning leverages verification results to construct pairwise comparison data and employs optimization methods like DPO (Rafailov et al., 2024), KTO (Ethayarajh et al., 2024), IPO (Azar et al., 2024), and PRO (Song et al., 2024).
Through this approach, models learn to align their outputs with verifier-provided preferences.
Various techniques are adopted to boost the foundation model's capabilities in specific areas through preference learning.
For mathematical reasoning enhancement, MCTS-DPO (Xie et al., 2024) combines Monte Carlo Tree Search (Coulom, 2006; Kocsis & Szepesvari, 2006) with preference learning to generate and learn from step-level pairwise comparisons in an iterative online manner.
For coding capability improvement, CodeUltraFeedback (Weyssow et al., 2024) constructs pairwise training data by using LLM verifiers to rank code outputs, then applies preference learning algorithms to optimize the model's performance.
For instruction-following enhancement, Self-Rewarding (Yuan et al., 2024) enables models to generate their own verification results for creating pairwise comparison data, followed by iterative self-improvement using the DPO method.
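The sketch below (our simplification) shows the two ingredients: turning verifier scores into chosen/rejected pairs, and the per-example DPO objective given summed log-probabilities of each response under the policy and a frozen reference model.

```python
import math

def build_preference_pair(prompt, candidates, verifier):
    """Rank candidates by verifier score; best becomes 'chosen', worst 'rejected'."""
    ranked = sorted(candidates, key=lambda r: verifier(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin)) for one pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
```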
Reinforcement Learning:
Reinforcement Learning optimizes models using reward signals from verifiers.
Through environmental interaction and policy updates using algorithms like PPO (Schulman et al., 2017) and PPO-max (Zheng et al., 2023), models iteratively improve their generation quality.
Multiple approaches are used to enhance the foundation model's capabilities in specific domains using reinforcement learning.
For mathematical reasoning enhancement, Math-Shepherd (Wang et al., 2023c) implements step-wise reward mechanisms to guide progressive improvement in mathematical problem-solving capabilities.
For coding capability improvement, methods like RLTF (Liu et al., 2023a) and PPOCoder (Shojaee et al., 2023a) leverage code execution results as reward signals to guide models toward more effective coding solutions.
For instruction-following enhancement, approaches like InstructGPT (Ouyang et al., 2022a) and Llama (Touvron et al., 2023a;b; Dubey et al., 2024) employ reward models trained to evaluate response helpfulness, optimizing models for better instruction adherence.
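At the interface level, the reward that an RL algorithm such as PPO maximizes can be a verifier aggregate combined with a KL penalty toward the reference model; the sketch below shows only this reward shaping, and all names and coefficients are our placeholders.

```python
def shaped_reward(prompt, response, verifiers, weights, kl_to_ref=0.0, kl_coef=0.05):
    """Verifier-derived reward minus a KL penalty that keeps the policy close to
    the reference model (a common RLHF-style shaping; coefficients are arbitrary)."""
    score = sum(w * float(v(prompt, response)) for v, w in zip(verifiers, weights))
    return score - kl_coef * kl_to_ref
```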
5.2. Inference-based Feedback
In inference-based feedback, we modify inputs or inference strategies to obtain better outputs without changing model parameters.
This approach is divided into two categories based on the visibility of verification results to the model: verifier-guided feedback and verifier-aware feedback.
Verifier-Guided:
In verifier-guided feedback, verifiers evaluate and select the most appropriate outputs from model-generated content without direct model interaction.
For example, Lightman et al. (2023a) and Snell et al. (2024) implement tree search algorithms guided by progress rewards, while ToT (Yao et al., 2024) employs language model verifiers to direct its tree search process.
In the realm of contrastive decoding (Li et al., 2022; O'Brien & Lewis, 2023), advanced language models serve as token logits verifiers to optimize output distributions.
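A minimal verifier-guided example is best-of-N selection (a sketch with hypothetical `generate` and `verifier` callables); the model never sees the verification results, its outputs are only re-ranked by them.

```python
def best_of_n(prompt, generate, verifier, n=16):
    """Sample N candidates and return the one the verifier scores highest;
    model parameters and inputs are left untouched."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: verifier(prompt, r))
```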
Verifier-Aware:
Verifier-Aware feedback integrates verifier feedback directly into a model's operational context to enhance the content generation process.
This approach allows the model to actively consider and incorporate verifier feedback while producing its output.
Various strategies are employed to enhance specific capabilities of foundation models through verifier-aware feedback.
For mathematical and coding enhancement, CRITIC (Gou et al., 2023) utilizes feedback from calculators and code interpreters to refine solutions, while Self-debug (Chen et al., 2023b) improves code quality through execution result analysis.
For hallucination mitigation, approaches like ReAct (Yao et al., 2022), KGR (Guan et al., 2024), and CRITIC (Gou et al., 2023) integrate continuous feedback from search engines and knowledge graphs to ensure factual accuracy.
In a similar vein, Self-Refine (Madaan et al., 2024) employs language model verifiers to iteratively improve response quality.
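A verifier-aware loop, by contrast, places the verification result inside the model's context before the next attempt; the sketch below assumes a hypothetical `verifier` returning a (passed, critique) pair and a hypothetical `generate` callable, in the spirit of the methods above rather than reproducing any of them.

```python
def verifier_aware_refine(prompt, generate, verifier, max_rounds=3):
    """Feed textual verification results back into the model's context so the
    next attempt can address them. `generate` and `verifier` are hypothetical;
    `verifier(prompt, response)` returns a (passed, critique) pair."""
    response = generate(prompt)
    for _ in range(max_rounds):
        passed, critique = verifier(prompt, response)
        if passed:
            break
        context = (f"{prompt}\n\nPrevious attempt:\n{response}\n\n"
                   f"Feedback:\n{critique}\n\nPlease revise the answer.")
        response = generate(context)
    return response
```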