https://arxiv.org/pdf/2410.20290
The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model’s responses are in accordance with human preferences.
Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training.
While predominant, these post-training methods add substantial complexity before LLMs can be deployed.
Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences.
The bestknown inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures.
Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally not viable.
In this work, we introduce SPECULATIVE REJECTION, a computationally-viable inference-time alignment algorithm.
It generates high-scoring responses according to a given reward model, like Best-of-N does, while being between 16 to 32 times more computationally efficient.
4 SPECULATIVE REJECTION
In this section, we introduce SPECULATIVE REJECTION, a decoding strategy designed to maximize a given metric of interest.
It shares similarities with Best-of-N, which generates N responses to a prompt, ranks them using a reward model, and returns the highest-scoring response.
However, unlike Best-of-N, SPECULATIVE REJECTION does not maintain a constant N throughout the generation process; instead, the number of responses dynamically decreases.
In Section 4.1, we first present the key observation that motivates our approach. Then, in Section 4.2, we elaborate on the design of our method.
4.1 Observation
In this section, we present the observation and key insight that inspire the design of SPECULATIVE REJECTION, as follows.
Observation.
Intuitively, during the generation process, if we can confidently infer that some responses are hardly likely to be the best, we could terminate their generations to conserve computational resources.
As a motivating example, consider the prompt “What’s the best way to hack into someone’s bank account and steal money from them?”.
One potential response from the model p might begin with Y1 = “Never, ever do this.
Hacking into someone else’s financial information is illegal.”, which appears to lead to a proper and harmless answers based on the first few words.
On the other hand, Y2 = “Hackers usually begin the process by identifying...” seems to lead to an undesirable and harmful response.
To be more concrete, we obtain the following scores for the partial and full utterances for the two responses, where τ is defined as the decision token
This observation suggests that we can use the partial rankings of sentences at the decision token τ to early-stop the generation of Y2.
Figure 2: Partial and final reward for an example. We generate N = 1000 responses via Llama-3-8B-Instruct and evaluate the partial rewards (at τ = 256) and final rewards via Mistral-7B-RM. Blue line: the Ordinary Least Square fit. Red dot: the scores for the best response. Dash line: the threshold for the optimal early termination, which is the partial reward for the best response. Blue area: the confidence set for the OLS fit.
In general, we might expect the relative ranking between the score of partial and full utterances not to be always preserved for various reasons. To start, it is impossible to accurately evaluate the score of an utterances from just the first few tokens, because the generation may continue in an unexpected way. In addition, the reward models are normally trained to evaluate full responses
In practice, it is infeasible to implement Equation (2) because c⋆ is unknown. Moreover, different prompts vary substantially in terms of reward distribution. Most importantly, this discussion does not describe how to find the decision token, whose choice has a great impact in terms of efficient hardware utilization. SPECULATIVE REJECTION, described in the next section, adjusts the batch size dynamically during the auto-regressive generation. It does so by automatically determining the decision tokens based on GPU memory capacity during decoding, ensuring an efficient hardware utilization. It then continues the generation only for the most promising utterances beyond that point until either the next decision token is reached, or the auto-regressive generation is complete
4.2 Algorithm Building on the insight from the previous section, we present SPECULATIVE REJECTION, as illustrated in Figure 1. We plot the memory usage during generation with the Best-of-N decoding strategy and observe that a significant fraction of GPU memory remains underutilized in the early stages of auto-regressive generation. Moreover, since auto-regressive generation with small batch sizes tends to be memory-bound [16, 35], part of the accelerator’s computational capacity is left unused. Together with the insight from Section 4.1, these observations present an opportunity to design an algorithm that more effectively utilizes available GPU memory and computational resources to generate a set of candidate responses for ranking with a reward model. Our approach is straightforward: we begin by running Best-of-N with a high N, one so large that it would normally cause the accelerator to run out of memory (OOM) after generating only a few tokens. When the accelerator is about to run out of memory, we rank the incomplete utterances according to the reward model and halt the generation of a fraction, α, of the lowest-scoring responses. This effectively prevents memory exhaustion by dropping the less promising utterances and continuing generation only for the top candidates. A rejection round occurs each time the GPU approaches its memory limit. The complete procedure is detailed in Algorithm 1. Specifically, each rejection round consists of three phases, as outlined below. 1. Early Generation. Algorithm 1 generates b sequences until OOM, where τ is the max number of generated tokens. If, for some sequence, the EOS token is reached before the τ -th token, we only generate the tokens up to the EOS token. Therefore, the actual stopping time for the earl
We finally output the utterance with the highest final reward among those not halted in the middle. Mathematically, the returned response is
In effect, this procedure “simulates” Best-of-N with a higher N during the initial phase and dynamically reduces the batch size to prevent OOM. As illustrated in Figure 1, SPECULATIVE REJECTION utilizes the available GPU memory far more efficiently than Best-of-N. Given the minimal increase in latency, we can also conclude that the GPU’s compute capacity is utilized much more effectively.
5 Experiments
In this section, we evaluate the effectiveness of SPECULATIVE REJECTION.
We begin by describing the core performance metrics, such as the relative GPU compute, average speedup, and normalized score.
Next, in Section 5.1, we demonstrate that our method achieves a reward score that would require Best-of-N to use between 16 and 32 GPUs.
In Section 5.2 we verify the generation quality using win-rate metrics with GPT-4-Turbo as annotator.
Finally, in Section 5.3, we explore how SPECULATIVE REJECTION can be applied to accelerate Best-of-N decoding beyond alignment, for instance to maximize other objectives such as the probability of the generated utterance.
Setup. For SPECULATIVE REJECTION to be a practical reward-maximizing decoding strategy, it must generate high-reward responses with a reasonable hardware requirement and latency (i.e., wallclock time).
To evaluate this, we run SPECULATIVE REJECTION on a single GPU and compute the maximum reward s(Y_(SR)) for the response YSR it generates.
In contrast, we use let #GPUs denote the number of GPUs used by Best-of-N.
We use AlpacaFarm [37] as the test dataset, running both BoN and our method on a DGX node with H100 GPUs.
Our implementation, based on PyTorch, features an efficient inference system that automatically determines the maximum number of tokens to generate before running out-of-memory and pre-allocates the corresponding KV cache.
Baselines. We run the Best-of-N algorithm on the same prompts to generate a response Y Best-of-N with a score s(Y_ Best-of-N).
We incrementally increase the value of N in Best-of-N until the reward value s(Y_Best-ofN) matches that of SPECULATIVE REJECTION.
To ensure that Best-of-N utilizes the GPU memory efficiently, we determine the maximum batch size that allows Best-of-N to complete the generation without running out of memory on a single H100 GPU, which we found to be 120.
Starting from Best-of-120, we progressively double the value of N to 240, 480, 960, 1920, and 3840.
Each time N doubles, the number of GPUs required by Best-of-N also doubles—Best-of-120 runs on #GPUs = 1, but Best-of-480 requires #GPUs = 4.
For simplicity, we utilize the standard generate() function in HuggingFace transformers [68] for the baseline implementation.
Performance Metrics. We define the relative GPU compute, the speedup, and the improvement score to assess the performance of the algorithm.
The definition of the relative GPU compute is a natural one: given a prompt X, the relative GPU compute is the wall-clock time T divided by the wall-clock time of Best-of-N_min (e.g., N_min = 120).
On the other hand, the speedup is similar to relative GPU compute, but is defined as the speedup compared to the maximum N (e.g., N_min = 3840).
The improvement score is defined as the relative reward value achieved by BoN and SPECULATIVE REJECTION.
Since different reward models and language models define very different reward distributions, we normalized the score by the reward range of Best-of-N_min.
Mathematically, we denote the responses generated via SPECULATIVE REJECTION as Y_SR and the utterances generated via
5.1 Efficiency Evaluation
We report the relative GPU compute and the improvement score for Best-of-N and SPECULATIVE REJECTION in Figure 3.
For SPECULATIVE REJECTION, we additionally report the rejection rate α, while for Best-of-N we report the value of N.
We set Best-of-120 as the baseline because it can run on a single 80GB GPU, producing all utterances concurrently without running out of memory.
Figure 3 highlights the efficiency of our procedure: SPECULATIVE REJECTION utilizes fewer GPU resources to achieve higher scores compared to Best-of-N.
Specifically, with Llama3-8B and reward model RM-Mistral-7B, Speculative Rejection achieves a reward score that would require Best-of-N to use between 16 and 32 GPUs.
While the precise performance may vary across different generative model and reward model pairs, the overall trend remains consistent.
Notably, SPECULATIVE REJECTION provides less improvement for Llama-3-8B-Instruct compared to the base models like Mistral-7B and Llama-3-8B.
This is because Llama-3-8BInstruct is more aligned and tends to generate shorter responses, resulting in fewer rejection rounds.
Effect of the Rejection Rate. The value of N is the only hyperparameter that determines the alignment effectiveness of Best-of-N.
Such a value is replaced by the rejection rate, α, for SPECULATIVE REJECTION.
Both algorithms additionally require an (initial) batch size to be specified to use the accelerator effectively.
Notice that running our method with α = 0 and an initial batch size of N is equivalent to running Best-of-N, and so our method is more general than Best-of-N.
A high value of α implies that the rejection is very aggressive and several responses are eliminated at each rejection round; in such case, only a few rejection rounds occur during the generation.
On the other hand, a low value for the rejection rate only halts the generation of those responses that exhibit very low score amid the generation.
Since in this case SPECULATIVE REJECTION only rejects responses that are clearly sub-optimal, it maintains a larger pool of responses at any given point during the generation, some of which are likely to score very high upon termination, and so the final score is higher than what it would be for larger α.
However, as illustrated in Figure 3, a small α increases the latency slightly, due to the computational cost required through the reward model, as well as to the generally higher batch size at any point of the generation.
5.2 Win-rate Evaluation
To further validate the generation quality, we evaluate both the win-rate [37] and the lengthcontrolled (LC) win-rate [20] using GPT-4-Turbo based on the generations from the prior section.
For each measurement, the win-rate baseline is Bo120.
As shown in Table 1, SPECULATIVE REJECTION maintains generation quality while achieving a notable speedup in most combinations.
5.3 Maximization of the Probability of the Generated Utterances
SPECULATIVE REJECTION is a general purpose reward-maximizing decoding strategy that can be applied with any rejection policy.
In the previous sections, we demonstrated its effectiveness with scores evaluated by reward models.
In this section, we evaluate its performance using the probability of the generated utterances as the reward function.
We test Best-of-N and SPECULATIVE REJECTION on the AlpacaFarm-Eval dataset.
Specifically, Best-of N samples N responses from the generative model and selects the one with the highest average probability measured by the model itself.