https://arxiv.org/pdf/2411.08147
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning.
Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements.
To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose SEALONG, an approach specifically designed for this purpose.
This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs.
Extensive experiments on several leading LLMs demonstrate the effectiveness of SEALONG, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct.
Furthermore, SEALONG achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models.
We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
Section 2, "Understanding the Potential of LLMs in Long-context Reasoning," outlines several key points:
The study explores LLM capabilities through three reasoning tasks from LongBench: HotpotQA, MuSiQue, and 2WikiMQA.
These tasks require handling multiple documents and answering multi-hop questions across paragraphs.
Section 2.1 discusses prompting strategies:
- Default: Basic context and question
- Direct Answer: Explicit instruction to answer
- Think Step-by-step: Guided reasoning process
- Fact-and-reflection: Two-stage approach of information gathering and reasoning
- Plan-and-solve: Strategic planning followed by step-by-step execution
The research shows that the choice of prompting strategy significantly impacts long-context reasoning performance; illustrative templates for these strategies are sketched below.
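The paper's exact prompt wording is not reproduced here; the following minimal sketch only shows how such strategy templates might be organized, and every template string is an illustrative assumption rather than the paper's prompt.

```python
# Illustrative prompt templates for the strategies listed above.
# The wording is an assumption, not the paper's exact prompts.
PROMPTS = {
    "default": "{context}\n\nQuestion: {question}",
    "direct_answer": (
        "{context}\n\nQuestion: {question}\n"
        "Answer the question directly and concisely."
    ),
    "step_by_step": (
        "{context}\n\nQuestion: {question}\n"
        "Let's think step by step, then give the final answer."
    ),
    "fact_and_reflection": (
        "{context}\n\nQuestion: {question}\n"
        "First, list the facts from the context relevant to the question. "
        "Then, reason over those facts to reach the final answer."
    ),
    "plan_and_solve": (
        "{context}\n\nQuestion: {question}\n"
        "First, devise a plan for answering the question. Then, carry out "
        "the plan step by step and give the final answer."
    ),
}

def build_prompt(strategy: str, context: str, question: str) -> str:
    """Fill the chosen strategy's template with the documents and question."""
    return PROMPTS[strategy].format(context=context, question=question)
```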
Section 2.2 examines LLMs' potential through temperature sampling:
- Multiple outputs were generated per question with temperature sampling
- Sampling 8 outputs per question already outperformed greedy search
- Scaling to 128 samples per question yielded a correct answer for over 90% of questions (see the sampling sketch below)
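As a rough illustration of this sampling setup, the snippet below draws several outputs per question and checks whether any of them hits the gold answer; `generate_fn` is a hypothetical stand-in for the underlying LLM call, not an API from the paper.

```python
from typing import Callable, List

def sample_outputs(
    generate_fn: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> text
    prompt: str,
    n_samples: int = 8,
    temperature: float = 0.7,
) -> List[str]:
    """Draw several reasoning paths for one question via temperature sampling."""
    return [generate_fn(prompt, temperature) for _ in range(n_samples)]

def any_correct(samples: List[str], gold_answer: str) -> bool:
    """Oracle check: does at least one sampled output contain the gold answer?"""
    return any(gold_answer.lower() in s.lower() for s in samples)
```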
Section 3 introduces SEALONG:
- A self-improving method for long-context reasoning
- Uses two stages: self-supervision and model fine-tuning
- Based on the principle that correct reasoning shows higher semantic consistency
- Employs Minimum Bayes Risk (MBR) to score each sampled output
- Prioritizes outputs that show higher semantic consistency with the other samples (see the sketch below)
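A minimal sketch of MBR-style scoring, assuming sentence-embedding cosine similarity as the utility function and `all-MiniLM-L6-v2` as the embedding model (both are assumptions, not necessarily the paper's choices):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def mbr_scores(samples: list[str], embedder: SentenceTransformer) -> np.ndarray:
    """Score each sampled output by its average semantic similarity to the others.

    Outputs that agree with many other samples get high scores; outliers get low ones.
    """
    emb = embedder.encode(samples, normalize_embeddings=True)  # unit-norm embeddings
    sim = emb @ emb.T                                          # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)                                 # ignore self-similarity
    return sim.sum(axis=1) / (len(samples) - 1)                # mean similarity to others

# Example: pick the most self-consistent output as the pseudo-label.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model, not from the paper
samples = ["...sampled reasoning path 1...", "...path 2...", "...path 3..."]
scores = mbr_scores(samples, embedder)
best = samples[int(np.argmax(scores))]
```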
The approach demonstrates that proper prompting and sampling strategies can significantly enhance LLMs' long-context reasoning capabilities.
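To make the fine-tuning stage concrete, the sketch below turns MBR-scored samples into training examples, assuming the common prompt/chosen/rejected format used by preference-optimization libraries and a simple highest-vs-lowest pairing rule; the paper's exact construction may differ.

```python
import numpy as np

def build_training_examples(prompt: str, samples: list[str], scores: np.ndarray):
    """Turn MBR-scored samples into training data for the fine-tuning stage.

    - SFT: keep only the highest-scoring output as the target.
    - Preference optimization (e.g., DPO/ORPO): pair the highest-scoring output
      (chosen) with the lowest-scoring one (rejected).
    Pairing rule here is an assumption for illustration.
    """
    order = np.argsort(scores)
    chosen, rejected = samples[int(order[-1])], samples[int(order[0])]
    sft_example = {"prompt": prompt, "completion": chosen}
    preference_example = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    return sft_example, preference_example
```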