Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (Paper Review)
https://arxiv.org/pdf/2312.06585

- Generate (E-step): sample solutions from the model, then filter them with binary feedback.
- Improve (M-step): fine-tune the model on the filtered samples.
- Repeat this Generate/Improve cycle for several iterations (a sketch of the loop follows below).

From the paper: "We make some modifications to ReST (detailed in Section 3), and call our approach ReST^EM. We show that ReST^EM can be viewed as applying expectation-maximization for reinforcement learning. ... models fine-tuned on model-generated synthetic data exhibit remarkably ..."
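The Generate/Improve loop is essentially iterated rejection-sampling fine-tuning framed as an EM procedure. Below is a minimal Python sketch of that loop under my own assumptions: the names (`Problem`, `generate_samples`, `binary_reward`, `finetune`) are hypothetical stand-ins rather than the paper's code, and the model call and fine-tuning step are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str  # ground truth, used only to compute the binary reward

def generate_samples(model, problem, num_samples=4):
    # Generate (E-step): draw multiple candidate solutions per problem
    # from the current model. `model` is a stand-in callable here;
    # a real run would sample from an LLM.
    return [model(problem.question) for _ in range(num_samples)]

def binary_reward(solution, problem):
    # Binary feedback: 1 if the final answer matches the reference, else 0.
    return solution.strip().endswith(problem.answer)

def finetune(base_model, dataset):
    # Improve (M-step): supervised fine-tuning on the filtered samples.
    # Stubbed out; a real implementation would train the model here.
    return base_model

def rest_em(base_model, problems, num_iterations=3):
    model = base_model
    for _ in range(num_iterations):
        # E-step: sample solutions and keep only those with reward 1.
        dataset = [
            (p.question, s)
            for p in problems
            for s in generate_samples(model, p)
            if binary_reward(s, p)
        ]
        # M-step: fine-tune from the base model each iteration
        # (one of the paper's modifications to ReST, per Section 3).
        model = finetune(base_model, dataset)
    return model
```

If I read Section 3 correctly, a notable design choice is that each Improve step fine-tunes the base pretrained model rather than the previous iteration's checkpoint, which limits drift from repeated self-training.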