Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment (paper review)
jinuklee, 2024. 8. 29. 23:12
https://openreview.net/pdf?id=ewRlZPAReR