
Internal Consistency and Self-Feedback in Large Language Models: A Survey (Paper Review)

jinuklee 2024. 7. 27. 01:20

On consistency: an example of GPT-4o giving inconsistent answers

self-correct

https://arxiv.org/abs/2211.00053

https://arxiv.org/abs/2406.01297

 

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference.

 

response stage

----------------------------------------

self-refine

https://arxiv.org/abs/2303.17651
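
As a rough sketch of the idea (not the paper's exact prompts), the loop below has one model generate, critique, and revise its own answer; `llm` is a hypothetical prompt-in, text-out callable you would back with any chat model.

```python
from typing import Callable

# Minimal Self-Refine-style loop: the same model generates an answer,
# critiques it, and rewrites it using its own feedback. `llm` is a
# hypothetical stand-in for any prompt-in, text-out model call.

def self_refine(task: str, llm: Callable[[str], str], max_iters: int = 3) -> str:
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        feedback = llm(
            f"Task: {task}\nAnswer: {answer}\n"
            "Critique this answer; reply 'STOP' if it needs no changes."
        )
        if "STOP" in feedback:
            break  # the model judged its own answer good enough
        answer = llm(
            f"Task: {task}\nPrevious answer: {answer}\nFeedback: {feedback}\n"
            "Rewrite the answer, applying the feedback:"
        )
    return answer
```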

Using a different model to provide the feedback:

Personalized distillation: Empowering open-sourced LLMs with adaptive learning for code generation

https://arxiv.org/pdf/2310.18628

self-contradiction detection - aimed at reducing hallucinations

https://arxiv.org/abs/2305.15852 

https://arxiv.org/abs/2401.06855 - fine-grained hallucination detection

 

self-consistency: a simple majority-voting strategy over sampled CoT reasoning paths

https://arxiv.org/abs/2203.11171
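
A minimal sketch of the core algorithm, assuming hypothetical `sample` (a stochastic, temperature > 0 model call) and `extract_answer` (parses the final answer out of a CoT trace) callables:

```python
from collections import Counter
from typing import Callable

# Self-consistency: sample several CoT reasoning paths, reduce each to its
# final answer, and return the answer that most paths agree on.

def self_consistency(
    question: str,
    sample: Callable[[str], str],          # stochastic model call (temperature > 0)
    extract_answer: Callable[[str], str],  # e.g. parse "The answer is 42" -> "42"
    n: int = 10,
) -> str:
    prompt = f"{question}\nLet's think step by step."
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```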

 

 

This covers not only response-level strategies such as CoT, but also decoding strategies (e.g., Self-Evaluation Decoding, https://arxiv.org/abs/2405.16552) and latent-state strategies (e.g., Inference-Time Intervention, https://arxiv.org/abs/2306.03341).

 

However, such SC over CoT is confined to tasks with a fixed answer set, such as QA.

It also lacks exploratory capability.

 

->

To address this limitation, Tree-of-Thoughts (ToT) treats reasoning as a path that connects individual thoughts (a breadth-first sketch follows).
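
A minimal breadth-first ToT sketch under simplifying assumptions: `propose` (the model suggests candidate next thoughts) and `score` (the model self-evaluates a partial path) are hypothetical callables, and the real method adds lookahead and backtracking.

```python
from typing import Callable

# Breadth-first Tree-of-Thoughts: keep the `beam` best partial reasoning
# paths at each depth, expanding each with model-proposed next thoughts.

def tot_bfs(
    question: str,
    propose: Callable[[str, list[str]], list[str]],  # (question, path) -> next thoughts
    score: Callable[[str, list[str]], float],        # self-evaluation of a partial path
    depth: int = 3,
    beam: int = 5,
) -> list[str]:
    frontier: list[list[str]] = [[]]  # each element is a path of thoughts
    for _ in range(depth):
        candidates = [path + [t] for path in frontier for t in propose(question, path)]
        candidates.sort(key=lambda p: score(question, p), reverse=True)
        frontier = candidates[:beam]
    return frontier[0]  # highest-scoring full path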

 

->

Graph-of-Thought (GoT) further extended this line of work by providing aggregation among different thought nodes, enhancing the utilization of reasoning chains.

 

++

Most X-of-Thought methods require sampling and aggregation of thoughts, and are often limited to queries with fixed label sets during aggregation.

 

->

Going further,

Multi-Perspective Self-Consistency (MPSC),

Universal Self-Consistency (Universal SC), and

Soft Self-Consistency (Soft SC)

refine the scoring function used in place of plain majority voting (a Soft SC-style sketch follows).
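
For instance, the Soft SC idea can be sketched as follows, assuming the model also returns per-token log-probs for each sampled response: instead of counting discrete votes, each sample is scored by its length-normalized likelihood and the best one wins.

```python
import math

# Soft self-consistency-style scoring: rank sampled responses by mean token
# probability rather than by exact-match majority vote, which also works
# when answers are free-form and rarely match verbatim.

def soft_sc(samples: list[tuple[str, list[float]]]) -> str:
    """`samples` pairs each response text with its per-token log-probs."""
    def mean_token_prob(logprobs: list[float]) -> float:
        return math.exp(sum(logprobs) / len(logprobs))
    return max(samples, key=lambda s: mean_token_prob(s[1]))[0]
```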

 

 

Finally,

Quiet Self-Taught Reasoner (Quiet-STaR) [82] addresses the issue mentioned in Section II-B, where “although complex reasoning in responses is beneficial for solving intricate problems, they may disrupt the model’s latent reasoning due to redundant reasoning text, thereby increasing response-level inconsistency.”

 

input stage

-----------------------------

Prompt optimizers (a toy evolutionary sketch follows this list):

DIVERSE

Making language models better reasoners with step-aware verifier

https://aclanthology.org/2023.acl-long.291/

Promptbreeder

https://arxiv.org/abs/2309.16797

DSPy: Compiling declarative language model calls into state-of-the-art pipelines

https://arxiv.org/abs/2310.03714
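
A toy sketch of the evolutionary idea behind optimizers like Promptbreeder (the real system also evolves the mutation prompts themselves); `llm` and `evaluate` (prompt to dev-set score) are hypothetical callables.

```python
import random
from typing import Callable

# Toy evolutionary prompt optimizer: mutate task prompts with the LLM
# itself and keep the fittest, Promptbreeder-style.

def evolve_prompt(
    seed_prompts: list[str],
    llm: Callable[[str], str],
    evaluate: Callable[[str], float],  # prompt -> score on a dev set
    generations: int = 10,
) -> str:
    population = list(seed_prompts)
    for _ in range(generations):
        parent = random.choice(population)
        child = llm(f"Rewrite this instruction to make it clearer:\n{parent}")
        population.append(child)
        # survival of the fittest; keep the population size fixed
        population.sort(key=evaluate, reverse=True)
        population = population[: len(seed_prompts)]
    return population[0]
```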

 

 

decoding stage 

-----------------------------------------

Chain-of-thought reasoning without prompting

https://arxiv.org/abs/2402.10200
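
The paper's key observation is that CoT paths already exist among the top-k first-token branches, and that correct answers come with a larger probability margin. A hedged sketch, with hypothetical model hooks `first_token_candidates` and `greedy_decode` (the latter returning (top-1, top-2) probabilities for each answer token):

```python
from typing import Callable

# CoT-decoding sketch: branch on the top-k candidate first tokens, decode
# each branch greedily, then pick the branch whose answer tokens have the
# largest average top-1 vs. top-2 probability margin (highest confidence).

def cot_decode(
    prompt: str,
    first_token_candidates: Callable[[str, int], list[str]],
    greedy_decode: Callable[[str], tuple[str, list[tuple[float, float]]]],
    k: int = 10,
) -> str:
    best_text, best_margin = "", float("-inf")
    for tok in first_token_candidates(prompt, k):
        text, answer_probs = greedy_decode(prompt + tok)
        margin = sum(p1 - p2 for p1, p2 in answer_probs) / max(len(answer_probs), 1)
        if margin > best_margin:
            best_text, best_margin = text, margin
    return best_text
```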

ToT Decoding [74] integrates ToT concepts into decoding, replacing beam-search criteria with Self-Evaluation [42], where each token’s selection depends on confidence scores C(·), achieving better reasoning.

 

DoLa: Decoding by contrasting layers improves factuality in large language models

https://arxiv.org/pdf/2309.03883
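
One decoding step of the DoLa idea can be sketched with plain numpy, assuming you can read out logits from both the final layer and an earlier layer; the real method also selects the premature layer dynamically.

```python
import numpy as np

# DoLa-style layer contrast for one decoding step: boost tokens whose
# probability grows between an early ("premature") layer and the final
# ("mature") layer, restricted to tokens the mature layer finds plausible.

def dola_step(mature_logits: np.ndarray, premature_logits: np.ndarray,
              alpha: float = 0.1) -> int:
    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max()
        return x - np.log(np.exp(x).sum())
    lp_mature = log_softmax(mature_logits)
    lp_premature = log_softmax(premature_logits)
    # plausibility constraint: keep only tokens with p >= alpha * max p
    mask = lp_mature >= np.log(alpha) + lp_mature.max()
    contrast = np.where(mask, lp_mature - lp_premature, -np.inf)
    return int(contrast.argmax())  # chosen next-token id
```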

DIVER: Large language model decoding with span-level mutual information verification

https://arxiv.org/abs/2406.02120

 

 

An LLM's weak reasoning ability and its hallucinations are fundamentally the same problem.

Indeed, LLMs don’t comprehend reasoning or hallucinations; they only predict the next token based on probabilistic principles.

 

A Survey on Self-Evolution of Large Language Models - multi-agent

https://arxiv.org/pdf/2404.14387

 

 

multi-agent

------------------------

Improving factuality and reasoning in language models through multiagent debate (a minimal debate sketch follows this list)

https://arxiv.org/abs/2305.14325

Examining inter-consistency of large language models collaboration: An in-depth analysis via debate (FORD)

https://arxiv.org/abs/2305.11595

The consensus game: Language model generation via equilibrium search

https://arxiv.org/abs/2310.09139

Scaling large-language-model-based multi-agent collaboration (MacNet)

https://arxiv.org/abs/2406.07155

Knowledge-Enhanced Reasoning in Multi-Agent Debate System

https://arxiv.org/pdf/2312.04854v2

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

https://arxiv.org/pdf/2305.19118

REFINER: Reasoning feedback on intermediate representations
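
The debate loop itself is simple; below is a minimal sketch in the spirit of Du et al.'s setup, with a hypothetical `llm` callable: each agent answers, reads the other agents' latest answers, revises for a few rounds, and a majority vote decides.

```python
from collections import Counter
from typing import Callable

# Minimal multi-agent debate: independent answers, a few rounds of revision
# with visibility into the other agents' answers, then a majority vote.

def debate(question: str, llm: Callable[[str], str],
           n_agents: int = 3, rounds: int = 2) -> str:
    answers = [llm(f"Q: {question}\nA:") for _ in range(n_agents)]
    for _ in range(rounds):
        answers = [
            llm(
                f"Q: {question}\nOther agents answered:\n"
                + "\n".join(a for j, a in enumerate(answers) if j != i)
                + f"\nYour previous answer: {answers[i]}\nGive an updated answer:"
            )
            for i in range(n_agents)
        ]
    return Counter(answers).most_common(1)[0][0]
```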

 

 

Strengths of this survey

1) An Internal Consistency theoretical framework.

 

2) A Self-Feedback theoretical framework:

Self-Evaluation, Consistency Signal Acquisition, and Self-Update.

 

3) A better response to “Does Self-Feedback Really Work?”

Many surveys discuss this question but often give biased answers.

 

Causes of low consistency

D. Sources of Low Internal Consistency

Why do models exhibit low consistency?

Simply put, some prompt designs result in low consistency between different latent layers, with significant differences in the probability distribution of the next-token prediction.

 

Causes of hallucination

Hallucinations emerge when LLMs deal with long contexts. The authors attribute this to the soft attention mechanism: attention weights become overly dispersed as sequence length increases, leading to poor consistency in reasoning paths (a toy numerical illustration follows) - https://openreview.net/forum?id=VzmpXQAn6E
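
The dispersion effect is easy to reproduce numerically (this toy demo is my own illustration, not the cited paper's analysis): with query-key scores of fixed scale, the entropy of the softmax attention distribution grows roughly like log n, so each individual weight gets diluted as context length n grows.

```python
import numpy as np

# Toy illustration of soft-attention dispersion over long contexts:
# for random scores of fixed scale, softmax attention entropy grows with
# sequence length n, so no single position retains much weight.

rng = np.random.default_rng(0)
for n in [16, 256, 4096]:
    scores = rng.normal(scale=1.0, size=n)  # one query against n keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                            # softmax attention weights
    entropy = -(w * np.log(w)).sum()
    print(f"n={n:5d}  max weight={w.max():.4f}  entropy={entropy:.2f}  log n={np.log(n):.2f}")
```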

Efforts to address this:

Fine-grained hallucination detection and editing for language models

 

 

On multi-step reasoning:

“Do large language models latently perform multi-hop reasoning?” 

 


Most recent papers (at the time of writing)

 

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

https://arxiv.org/pdf/2407.18219v1

Very Large-Scale Multi-Agent Simulation in AgentScope

https://arxiv.org/pdf/2407.17789

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

https://arxiv.org/pdf/2407.17490