All Posts (252)

Self-Rewarding Language Models Paper Review

https://arxiv.org/pdf/2401.10020 Summary: self-alignment, AI feedback, iteratively generating DPO pair datasets. Training an RM on preference data is bottlenecked by the human performance level (bear in mind this is an early-2024 paper), and the RM is frozen, so it cannot improve while the policy (the LLM) is being trained. DPO does not require an RM, but to prepare the pair dataset the LLM rewards itself via LLM-as-a-Judge prompting: the LLM generates responses to prompts, the same LLM evaluates them to assign rewards, and the selected pairs are then used for DPO…
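
A minimal Python sketch of that pair-construction loop, assuming hypothetical helpers generate(model, prompt) and judge_score(model, prompt, response) for sampling and LLM-as-a-Judge scoring; this is an illustration of the structure, not the paper's actual code:

def build_dpo_pairs(model, prompts, generate, judge_score, n_samples=4):
    # One self-rewarding iteration: the model both writes and grades its own candidates.
    pairs = []
    for prompt in prompts:
        # The model generates several candidate responses to the prompt.
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        # The same model scores each candidate via an LLM-as-a-Judge prompt.
        ranked = sorted(candidates, key=lambda c: judge_score(model, prompt, c))
        # Highest-scored response becomes "chosen", lowest becomes "rejected".
        pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    return pairs  # used as the preference dataset for the next DPO round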

Uncategorized 2024.08.18

Tree Search for Language Model Agents

https://arxiv.org/abs/2407.01476 Tree Search for Language Model Agents: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and genera… A best-first tree search inference algorithm for decision-making tasks such as the web-automation process…
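
A minimal Python sketch of best-first search over agent states, assuming hypothetical callables propose_actions(state), step(state, action), value(state), and is_goal(state); it only illustrates the search loop, not the paper's implementation:

import heapq, itertools

def best_first_search(root, propose_actions, step, value, is_goal,
                      max_expansions=50, branching=5):
    tie = itertools.count()  # tie-breaker so the heap never compares states directly
    frontier = [(-value(root), next(tie), root)]  # negate scores so higher value pops first
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)  # expand the most promising state
        if is_goal(state):
            return state
        for action in propose_actions(state)[:branching]:
            child = step(state, action)
            heapq.heappush(frontier, (-value(child), next(tie), child))
    return None  # search budget exhausted without reaching a goal state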

Uncategorized 2024.08.18

Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems Paper Review

https://arxiv.org/abs/2403.02419 Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when ask…
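
A minimal Python sketch of a vote-style compound system, assuming a hypothetical call_lm(prompt) callable; how accuracy moves as k grows is the scaling behavior the paper studies:

from collections import Counter

def vote_answer(call_lm, prompt, k=5):
    # Issue k independent LM calls for the same prompt.
    answers = [call_lm(prompt) for _ in range(k)]
    # Aggregate by majority vote over the returned answers.
    answer, _ = Counter(answers).most_common(1)[0]
    return answer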

Uncategorized 2024.08.18

Let’s Verify Step by Step Paper Review

https://arxiv.org/pdf/2305.20050 0. Abstract: outcome supervision provides feedback on the final result, while process supervision provides feedback on each intermediate reasoning step. Process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset (this is the context in which PRMs beat ORMs). 2. Methods 2.1 Scope: at each model scale, a single fixed model is used to generate all solutions; this model is called the generator. Reinforcement…
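
A minimal Python sketch of ranking generator solutions with a process reward model, assuming a hypothetical prm_step_score(problem, steps_so_far) that returns the probability a step is correct; the solution-level score here is the product of per-step scores, computed in log space:

import math

def solution_score(prm_step_score, problem, steps):
    # Sum of log step probabilities = log of the product of per-step scores.
    return sum(math.log(prm_step_score(problem, steps[:i + 1]))
               for i in range(len(steps)))

def best_of_n(prm_step_score, problem, candidates):
    # Each candidate is a list of reasoning steps sampled from the fixed generator.
    return max(candidates, key=lambda s: solution_score(prm_step_score, problem, s))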

Uncategorized 2024.08.18