Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment: Paper Review

jinuklee 2024. 8. 29. 23:12

https://openreview.net/pdf?id=ewRlZPAReR