self-correct
https://arxiv.org/abs/2211.00053
https://arxiv.org/abs/2406.01297
response stage
----------------------------------------
self-refine
https://arxiv.org/abs/2303.17651
a different model is used to provide the feedback
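A minimal sketch of this generate → feedback → refine loop, assuming hypothetical `generator` and `critic` completion functions (the critic may be a different model, as noted above):

```python
# Minimal sketch of a Self-Refine-style loop (generate -> feedback -> refine).
# `generator` and `critic` are hypothetical completion functions; the critic
# may be a different model from the generator.

def self_refine(task, generator, critic, max_iters=3):
    response = generator(f"Solve the task:\n{task}")
    for _ in range(max_iters):
        feedback = critic(
            f"Task:\n{task}\n\nCandidate answer:\n{response}\n\n"
            "Point out concrete errors, or reply 'LGTM' if it is correct."
        )
        if "LGTM" in feedback:
            break  # stop criterion: the critic is satisfied
        response = generator(
            f"Task:\n{task}\n\nPrevious answer:\n{response}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the answer, fixing the issues."
        )
    return response
```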
Personalized distillation: Empowering open-sourced LLMs with adaptive learning for code generation
https://arxiv.org/pdf/2310.18628
self-contradiction - aimed at reducing hallucination
https://arxiv.org/abs/2305.15852
https://arxiv.org/abs/2401.06855 - fine grained hallucination detection
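A minimal sketch of the self-contradiction idea, assuming a hypothetical `llm` completion function: sample alternative statements under the same context and ask the model whether they contradict the original; a detected contradiction flags a likely hallucination.

```python
# Minimal sketch of self-contradiction-based hallucination detection.
# `llm` is a hypothetical completion function.

def is_hallucination(context, statement, llm, n_alternatives=3):
    for _ in range(n_alternatives):
        alt = llm(f"{context}\nContinue with one factual sentence:")
        verdict = llm(
            f"Sentence A: {statement}\nSentence B: {alt}\n"
            "Do these two sentences contradict each other? Answer Yes or No."
        )
        if verdict.strip().lower().startswith("yes"):
            return True  # contradicts the model's own samples -> likely hallucinated
    return False
```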
self-consistency - simple majority voting strategy + CoT
https://arxiv.org/abs/2203.11171
This goes beyond response-level strategies such as CoT: it also covers decoding strategies (e.g., Self-Evaluation Decoding https://arxiv.org/abs/2405.16552) and latent-state strategies (e.g., Inference-Time Intervention https://arxiv.org/abs/2306.03341)
However, SC-CoT is confined to tasks like QA that have a fixed answer set.
It also lacks exploratory capability.
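A minimal sketch of the plain SC-CoT voting described above (before the ToT extension that follows), assuming hypothetical `sample_cot` and `extract_answer` helpers:

```python
from collections import Counter

# Minimal sketch of Self-Consistency over CoT samples.
# `sample_cot` samples one chain-of-thought at temperature > 0;
# `extract_answer` parses the final answer string out of it.

def self_consistency(question, sample_cot, extract_answer, n_samples=10):
    answers = []
    for _ in range(n_samples):
        chain = sample_cot(question)           # one sampled reasoning path
        answers.append(extract_answer(chain))  # keep only the final answer
    # majority vote over the fixed answer set
    return Counter(answers).most_common(1)[0][0]
```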
->
to address this limitation
Tree-of-Thought (ToT), which views reasoning as a path connecting individual thoughts
->
Graph-of-Thought (GoT) further extended this line of work by providing aggregation among different thought nodes, enhancing the utilization of reasoning chains.
++
Most X-of-Thought methods require sampling and aggregating thoughts, and the aggregation step is often limited to queries with a fixed label set
->
Going further, Multi-Perspective Self-Consistency (MPSC)
Universal Self-Consistency (Universal SC)
Soft Self-Consistency (Soft SC)
further advance the scoring function used in majority voting
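A rough sketch contrasting hard majority voting with a Soft-SC-style aggregation: each candidate answer is weighted by a model-provided confidence score (standing in for something like mean token log-probability) instead of a +1 vote. Names here are illustrative, not the papers' APIs.

```python
from collections import defaultdict

# Rough sketch of a Soft-SC-style aggregation: instead of counting
# exact-match answers, sum a per-sample confidence score for each candidate.
# `candidates` is a list of (answer, score) pairs.

def soft_self_consistency(candidates):
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score          # soft vote instead of +1
    return max(totals, key=totals.get)   # highest aggregate confidence wins

# Example: hard voting would pick "B" (2 votes vs 1); the soft scores pick "A".
print(soft_self_consistency([("A", 0.95), ("B", 0.40), ("B", 0.35)]))
```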
Finally,
Quiet Self-Taught Reasoner (QuietSTaR) [82] addresses the issue mentioned in Section II-B, where “although complex reasoning in responses is beneficial for solving intricate problems, they may disrupt model’s latent reasoning due to redundant reasoning text, thereby increasing response-level inconsistency.”
input stage
-----------------------------
prompt optimizers
DIVERSE
Making language models better reasoners with step-aware verifier
https://aclanthology.org/2023.acl-long.291/
Promptbreeder
https://arxiv.org/abs/2309.16797
DSPy: Compiling declarative language model calls into state-of-the-art pipelines
https://arxiv.org/abs/2310.03714
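A toy sketch of the evolutionary idea behind optimizers like Promptbreeder: let the LLM mutate a population of task prompts and keep the fittest. `llm` and `score_on_devset` are hypothetical stand-ins, not the papers' actual interfaces.

```python
# Toy sketch of a Promptbreeder-style evolutionary prompt search.
# `llm` is a hypothetical completion function used as the mutation operator;
# `score_on_devset` evaluates a prompt on a small labelled dev set.

def evolve_prompt(seed_prompt, llm, score_on_devset, generations=5, pop_size=4):
    population = [seed_prompt]
    for _ in range(generations):
        # mutate: ask the model to rewrite each surviving prompt
        children = [
            llm(f"Improve this task instruction, keeping its intent:\n{p}")
            for p in population
        ]
        population = population + children
        # select: keep only the top-scoring prompts for the next generation
        population.sort(key=score_on_devset, reverse=True)
        population = population[:pop_size]
    return population[0]
```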
decoding stage
-----------------------------------------
Chain-of-thought reasoning without prompting
https://arxiv.org/abs/2402.10200
ToT Decoding [74] integrates ToT concepts into decoding, replacing beam search criteria with Self-Evaluation [42], where each token’s selection depends on confidence scores C(·), achieving better reasoning.
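A rough sketch of the confidence signal such decoding methods rely on, in the spirit of CoT-decoding's answer-confidence score: rank candidate branches by the average margin between the top-1 and top-2 token probabilities over the answer span. `step_probs` is a hypothetical helper.

```python
# Rough sketch of the branch-confidence idea behind CoT-decoding / ToT decoding:
# rank each decoding branch by the average margin between the top-1 and top-2
# token probabilities over its answer span. `step_probs` is a hypothetical
# helper that returns, for each answer token of a branch, the next-token
# probabilities sorted in descending order.

def branch_confidence(branch, step_probs):
    margins = [probs[0] - probs[1] for probs in step_probs(branch)]
    return sum(margins) / max(len(margins), 1)

def pick_branch(branches, step_probs):
    # greedy decoding within each branch is assumed; only the ranking differs
    return max(branches, key=lambda b: branch_confidence(b, step_probs))
```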
DoLa: Decoding by contrasting layers improves factuality in large language models
https://arxiv.org/pdf/2309.03883
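A hedged sketch of the DoLa-style layer contrast, assuming the layer-wise next-token logits are already available: tokens whose probability grows from an early layer to the final layer get boosted. The real method also selects the premature layer dynamically; this shows only the core contrast step.

```python
import numpy as np

# Sketch of a DoLa-style contrast between an early ("premature") layer's
# next-token logits and the final ("mature") layer's logits. A plausibility
# mask keeps only tokens that the final layer already rates as reasonable.

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def dola_scores(final_logits, premature_logits, mask_ratio=0.1):
    p_final = softmax(final_logits)
    p_early = softmax(premature_logits)
    # plausibility constraint: keep tokens close to the final layer's top choice
    keep = p_final >= mask_ratio * p_final.max()
    contrast = np.where(keep, np.log(p_final) - np.log(p_early), -np.inf)
    return softmax(contrast)  # renormalized distribution used for decoding
```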
Diver: Large language model decoding with span-level mutual information verification
https://arxiv.org/abs/2406.02120
An LLM's weak reasoning ability and its hallucinations are essentially the same problem.
Indeed, LLMs don’t comprehend reasoning or hallucinations; they only predict the next token based on probabilistic principles.
A Survey on Self-Evolution of Large Language Models - multi-agent
https://arxiv.org/pdf/2404.14387
multi-agent
------------------------
Improving factuality and reasoning in language models through multiagent debate
https://arxiv.org/abs/2305.14325
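A minimal sketch of one debate protocol in this spirit: each agent answers independently, reads the other agents' answers, revises over a few rounds, and the final answer is a majority vote. `agents` is a list of hypothetical completion functions (possibly the same model with different seeds or personas).

```python
from collections import Counter

# Minimal sketch of a multi-agent debate loop: independent answers,
# a few rounds of revision after reading peers, then majority vote.

def debate(question, agents, rounds=2):
    answers = [agent(f"Question: {question}\nAnswer concisely.") for agent in agents]
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            revised.append(agent(
                f"Question: {question}\n"
                f"Other agents answered:\n{others}\n"
                f"Your previous answer:\n{answers[i]}\n"
                "Update your answer if the others convince you."
            ))
        answers = revised
    return Counter(answers).most_common(1)[0][0]
```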
Examining inter-consistency of large language models collaboration: An in-depth analysis via debate (FORD)
https://arxiv.org/abs/2305.11595
The consensus game: Language model generation via equilibrium search
https://arxiv.org/abs/2310.09139
Scaling large-language-model-based multi-agent collaboration (MacNet)
https://arxiv.org/abs/2406.07155
Knowledge-Enhanced Reasoning in Multi-Agent Debate System
https://arxiv.org/pdf/2312.04854v2
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
https://arxiv.org/pdf/2305.19118
REFINER: Reasoning feedback on intermediate representations
strengths
1. Internal Consistency theoretical framework.
2. Self-Feedback theoretical framework.
Self-Evaluation, Consistency Signal Acquisition, and Self-Update.
3. Better response to “Does Self-Feedback Really Work?”
Many surveys discuss this question but often provide biased answers.
causes of low consistency
D. Sources of Low Internal Consistency: Why do models exhibit low consistency?
Simply put, some prompt designs result in low consistency between different latent layers, with significant differences in the probability distribution of the next-token prediction.
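One way to make "significant differences in the probability distribution" concrete is to measure the Jensen-Shannon divergence between next-token distributions read off different layers (e.g., via a logit lens); the numbers below are illustrative only.

```python
import numpy as np

# Illustrative: quantify the gap between two next-token distributions
# (e.g., from different latent layers) with Jensen-Shannon divergence;
# larger values = lower internal consistency.

def js_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: an early layer spreads probability mass, the final layer is peaked.
early = [0.30, 0.25, 0.25, 0.20]
final = [0.85, 0.05, 0.05, 0.05]
print(js_divergence(early, final))  # noticeably > 0 -> the layers disagree
```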
causes of hallucination
Hallucinations emerge when LLMs deal with long contexts. The authors attribute this to the soft attention mechanism: attention weights become overly dispersed as sequence length increases, leading to poor consistency in reasoning paths - https://openreview.net/forum?id=VzmpXQAn6E
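A purely illustrative numeric check of the dispersion claim: with softmax attention over near-equally scored positions, the largest attention weight shrinks as the sequence grows, so no single position dominates.

```python
import numpy as np

# Illustrative only: as the sequence grows, softmax attention over
# near-equally relevant positions gets thinner, so the maximum weight
# any single position receives keeps shrinking.

rng = np.random.default_rng(0)
for length in (64, 512, 4096):
    scores = rng.normal(size=length)             # stand-in attention logits
    weights = np.exp(scores) / np.exp(scores).sum()
    print(length, round(float(weights.max()), 4))  # max weight shrinks with length
```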
efforts to address this
Fine-grained hallucination detection and editing for language models
on multi-step reasoning
“Do large language models latently perform multi-hop reasoning?”
most recent papers
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
https://arxiv.org/pdf/2407.18219v1
Very Large-Scale Multi-Agent Simulation in AgentScope
https://arxiv.org/pdf/2407.17789
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents