Semi-Supervised Reward Modeling via Iterative Self-Training

jinuklee 2024. 9. 21. 03:48

x = question, a1 = first answer, a2 = second answer

y = which of a1 and a2 is better; for unlabeled data, y is predicted by the model itself (pseudo-labeling)

Only the pseudo-labeled data on which the model exhibits high confidence are selected.
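
To make the selection step concrete, here is a minimal sketch of pseudo-labeling with a confidence filter. It is not the paper's implementation. Several things are assumptions for illustration: the preference probability is taken as the sigmoid of the reward gap (a Bradley-Terry style model), the threshold of 0.9 is arbitrary, and `toy_reward` is a hypothetical stand-in for a trained reward model.

```python
import math


def preference_probability(reward_a1, reward_a2):
    """P(a1 is better than a2), assumed here to be sigmoid(reward gap)."""
    return 1.0 / (1.0 + math.exp(-(reward_a1 - reward_a2)))


def pseudo_label(unlabeled, reward_fn, threshold=0.9):
    """Label (x, a1, a2) triples and keep only the confident ones.

    y = 0 means a1 is preferred, y = 1 means a2 is preferred.
    """
    selected = []
    for x, a1, a2 in unlabeled:
        p = preference_probability(reward_fn(x, a1), reward_fn(x, a2))
        confidence = max(p, 1.0 - p)  # confidence in the predicted label
        if confidence >= threshold:
            y = 0 if p >= 0.5 else 1
            selected.append((x, a1, a2, y))
    return selected


def toy_reward(x, a):
    """Hypothetical reward: longer answers score higher (illustration only)."""
    return 0.5 * len(a)


unlabeled = [
    ("q1", "short", "a much longer answer"),  # large reward gap: kept
    ("q2", "abcd", "abcde"),                  # small reward gap: dropped
]
selected = pseudo_label(unlabeled, toy_reward)
print(selected)
```

In an iterative self-training loop, the selected examples would presumably be added to the labeled set and the reward model retrained before the next round of pseudo-labeling.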