Fast Best-of-N Decoding via Speculative Rejection
https://arxiv.org/pdf/2410.20290
The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model’s responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predomina..