https://openreview.net/pdf?id=CYmF38ysDa
Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction.
However, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition.
In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation.
Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations.
3 FLASK: FINE-GRAINED LANGUAGE MODEL EVALUATION PROTOCOL
We introduce FLASK, a fine-grained skill set-based evaluation protocol for assessing the alignment of language models.
We define 4 primary abilities, divided into 12 skills, that are necessary to follow user instructions in a desirable manner (Section 3.1).
We specify the process of the evaluation dataset construction (Section 3.2) and the evaluation process (Section 3.3).
Additionally, for a challenging scenario, we introduce FLASK-HARD (Section 3.4).
The illustration of the overall process is shown in Figure 21 in the Appendix.
We emphasize that applying instance-wise multi-metric evaluation is what mainly distinguishes our work from previous evaluation settings, enabling task-agnostic evaluation.
In this work, we consider two types of evaluators: human evaluators and EVAL LM, one of the state-of-the-art LLMs used for evaluation.
3.1 SKILL SET CATEGORIZATION
Building on previous research in language model evaluation (Sugawara & Aizawa, 2016; Sugawara et al., 2017; Radziwill & Benton, 2017; Schlegel et al., 2020; Rogers et al., 2021), we aim to develop a comprehensive taxonomy for assessing the performance of LLMs.
This taxonomy is designed as a systematic framework to categorize the essential skills for understanding and responding to a wide range of single-turn English instructions.
Based on the skill categorization of Rogers et al. (2021), which was specifically proposed for question answering and reading comprehension, we recategorize skills suitable for LLM alignment.
Our proposed categorization includes four primary abilities, each of which is further divided into 2-4 skills, resulting in a total of 12 skills:
• Logical Thinking refers to the ability to apply reasoning, critical thinking, and deductive skills when processing and responding to instructions.
In order to do so, models should generate a logically correct final answer (LOGICAL CORRECTNESS) while preserving generalizability during the step-by-step logical process without any contradiction (LOGICAL ROBUSTNESS).
Also, the logical process should be efficient and not contain any unnecessary steps (LOGICAL EFFICIENCY).
• Background Knowledge comprises the capacity to generate responses by accessing a broad repository of general and domain-specific information.
This ability requires the model to provide accurate and contextually relevant responses to instructions requiring factual (FACTUALITY) or commonsense knowledge (COMMONSENSE UNDERSTANDING).
• Problem Handling pertains to the proficiency in addressing challenges that emerge while processing and responding to user instructions.
This category encompasses the capacity to understand the implicit and explicit purpose and requirements of the instruction (COMPREHENSION), develop creative perspectives or interpretations of the instruction (INSIGHTFULNESS), handle the instruction by providing in-depth and in-breadth information (COMPLETENESS), and be aware of its own capability to answer the instruction (METACOGNITION).
• User Alignment represents the ability to empathize with the user and align its responses to the user’s intentions, preferences, and expectations.
This category encompasses the model’s ability to structure the answer to promote the user’s readability (READABILITY), to present a concise response without unnecessary information (CONCISENESS), and to consider potential risks to user safety (HARMLESSNESS).
We ensure that each skill offers a wide range of criteria for a holistic evaluation of various LLMs.
We provide the specific definition for each skill in Table 11 in the Appendix.
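For concreteness, the taxonomy can be written down as a simple mapping from primary abilities to skills; the sketch below (in Python, with a variable name of our choosing) merely restates the categorization above.

```python
# The FLASK taxonomy: 4 primary abilities, each split into 2-4 skills (12 in total).
SKILL_TAXONOMY = {
    "Logical Thinking": ["Logical Robustness", "Logical Correctness", "Logical Efficiency"],
    "Background Knowledge": ["Factuality", "Commonsense Understanding"],
    "Problem Handling": ["Comprehension", "Insightfulness", "Completeness", "Metacognition"],
    "User Alignment": ["Readability", "Conciseness", "Harmlessness"],
}

assert sum(len(skills) for skills in SKILL_TAXONOMY.values()) == 12
```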
3.2 EVALUATION DATA CONSTRUCTION
The process of constructing the evaluation data involves several steps: 1) collecting input-output pairs from various datasets, 2) modifying the collected instances, and 3) filtering based on length criteria, resulting in a total of 1,740 instances sourced from 122 datasets.
We first collect input (instruction) and output (reference answer) pairs from various English NLP datasets, both multitask datasets (e.g. MMLU (Hendrycks et al., 2020)) and single-task datasets (e.g. GSM8K (Cobbe et al., 2021)).
For single-task datasets, we limit each dataset to at most 20 instances for diversity.
After collection, we modify the instances by manually writing instructions for datasets that do not include instructions.
Lastly, we remove instances where the input length exceeds 2048. More details, including the list of source datasets, are provided in Appendix J.
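As a rough sketch of this construction pipeline, assuming a hypothetical per-dataset schema with "instruction" and "reference" fields and leaving the length measure unspecified, the filtering logic can be summarized as follows.

```python
import random

MAX_PER_SINGLE_TASK_DATASET = 20   # diversity cap for single-task datasets
MAX_INPUT_LENGTH = 2048            # instances with longer inputs are removed

def build_evaluation_pool(datasets, length_fn=len):
    """Collect, cap, and length-filter instances (schema and names are illustrative).

    `datasets` maps a dataset name to {"single_task": bool, "instances": [...]},
    where each instance is {"instruction": ..., "reference": ...};
    `length_fn` measures the input length (e.g., a tokenizer's token count).
    """
    pool = []
    for name, data in datasets.items():
        instances = data["instances"]
        if data["single_task"] and len(instances) > MAX_PER_SINGLE_TASK_DATASET:
            instances = random.sample(instances, MAX_PER_SINGLE_TASK_DATASET)
        for inst in instances:
            if length_fn(inst["instruction"]) <= MAX_INPUT_LENGTH:
                pool.append({"source": name, **inst})
    return pool
```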
For each evaluation instance, we annotate metadata consisting of 1) the essential skills required to follow the instruction, 2) the target domains, and 3) the difficulty level of the instruction.
We first validate that human labelers and the EVAL LM show a high correlation for the metadata annotation on a subset of 200 instances.
We observed a 95.22% acceptance rate for skill annotation, an 81.32% acceptance rate for domain annotation, and a Pearson correlation coefficient of 0.774 for difficulty annotation.
Since the model-based annotation exhibits an acceptable level of noise and a high correlation with human labelers, we utilize the EVAL LM for metadata annotation to reduce the burden of human annotation.
We provide more details on validating the annotation of EVAL LM in Appendix G.2.
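For reference, the agreement statistics above can be computed as sketched below; the function and argument names are ours, and the per-instance acceptance flags come from the human validation pass described in Appendix G.2.

```python
from scipy.stats import pearsonr

def annotation_agreement(accept_flags, human_difficulty, model_difficulty):
    """Summarize human / EVAL LM agreement on the validation subset.

    `accept_flags`: per-instance booleans, True if the human labeler accepted
    the EVAL LM's skill (or domain) annotation.
    `human_difficulty`, `model_difficulty`: difficulty labels (1-5) from each source.
    """
    acceptance_rate = sum(accept_flags) / len(accept_flags)
    difficulty_corr, _ = pearsonr(human_difficulty, model_difficulty)
    return acceptance_rate, difficulty_corr
```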
For the selection of necessary skills, the EVAL LM selects, for each instance, the top-3 essential skills required to follow the instruction from the 12 skills defined in Section 3.1.
We achieve this by providing the EVAL LM with the instruction, reference answer, and descriptions of all 12 skills.
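A minimal sketch of this annotation step is given below; the prompt wording is a paraphrase of ours rather than the exact prompt used in the paper (which is provided in the Appendix).

```python
def skill_annotation_prompt(instruction, reference_answer, skill_descriptions):
    """Ask the EVAL LM to select the top-3 skills required for one instance.

    `skill_descriptions` maps each of the 12 skill names to its definition.
    """
    skills_block = "\n".join(f"- {name}: {desc}" for name, desc in skill_descriptions.items())
    return (
        "You are given an instruction, a reference answer, and 12 skill definitions.\n"
        "Select the top-3 skills that are most essential to follow the instruction.\n\n"
        f"Skills:\n{skills_block}\n\n"
        f"Instruction: {instruction}\n"
        f"Reference answer: {reference_answer}\n\n"
        "Top-3 skills (comma-separated):"
    )
```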
For domain annotation, we identify 10 domains by modifying the Wikipedia categorization of Reid et al. (2022): Humanities, Language, Culture, Health, History, Natural Science, Math, Social Science, Technology, and Coding.
Lastly, for difficulty level annotation, we divide the difficulty into five levels based on the extent of domain knowledge required, referencing Webb’s depth of knowledge (Webb, 1997; 1999) and the NIH proficiency scale: simple lifestyle knowledge, advanced lifestyle knowledge, formal education knowledge, major-level knowledge, and expert-level knowledge, which we map to levels 1 through 5, respectively.
Details of the metadata annotation process are provided in Appendix E and the statistics of the evaluation dataset are provided in Appendix F.
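The domain and difficulty labels can likewise be fixed as constants; the snippet below merely restates the categories listed above (constant names are ours).

```python
DOMAINS = [
    "Humanities", "Language", "Culture", "Health", "History",
    "Natural Science", "Math", "Social Science", "Technology", "Coding",
]

# Difficulty levels, mapped from the knowledge tier to an integer from 1 to 5.
DIFFICULTY_LEVELS = {
    "simple lifestyle knowledge": 1,
    "advanced lifestyle knowledge": 2,
    "formal education knowledge": 3,
    "major-level knowledge": 4,
    "expert-level knowledge": 5,
}
```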
3.3 EVALUATION PROCESS
Utilizing the annotated metadata for each instance, we evaluate and analyze the target model response in a fine-grained manner.
Evaluators, either human annotators or the EVAL LM, are given the evaluation instruction, the reference answer, the response of the target model, and the pre-defined score rubric for each skill selected in Section 3.2.
The evaluators assess the target model’s response by assigning scores ranging from 1 to 5, following skill-specific scoring rubrics, which include detailed descriptions for each score.
For model-based evaluation, we enforce the EVAL LM to generate a rationale before assigning a score, inspired by the effectiveness of CoT prompting (Wei et al., 2022b) for the evaluation of LLMs (Liu et al., 2023).
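The model-based scoring step can be sketched as follows; the prompt text is an illustrative paraphrase (the actual prompts and per-skill rubrics are given in the Appendix), and the score-parsing convention is an assumption of ours.

```python
import re

def evaluation_prompt(instruction, reference_answer, model_response, skill, rubric):
    """Ask the EVAL LM to write a rationale and then a 1-5 score for one skill."""
    return (
        f"Evaluate the response with respect to {skill}.\n"
        f"Scoring rubric:\n{rubric}\n\n"
        f"Instruction: {instruction}\n"
        f"Reference answer: {reference_answer}\n"
        f"Response: {model_response}\n\n"
        "First write a brief rationale, then give a score from 1 to 5 "
        "in the form 'Score: <n>'."
    )

def parse_score(eval_lm_output):
    """Extract the integer score that follows the generated rationale."""
    match = re.search(r"Score:\s*([1-5])", eval_lm_output)
    return int(match.group(1)) if match else None
```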
Once the evaluators have scored each skill of the instance, we aggregate the scores based on the skill, domain, and difficulty level for fine-grained analysis.
This analysis allows for an in-depth understanding of how the target model performs across various metadata compositions.
The illustration of the evaluation process and the score rubric for each skill is provided in Figure 1 and Appendix K.1.
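A sketch of the aggregation step described above is shown below, assuming a hypothetical flat record schema in which each scored skill of an instance carries its annotated domain and difficulty.

```python
from collections import defaultdict
from statistics import mean

def aggregate_scores(records):
    """Mean scores grouped by skill, domain, and difficulty level.

    `records` is a list of dicts such as
    {"skill": "Factuality", "domain": "History", "difficulty": 3, "score": 4}.
    """
    groups = {"skill": defaultdict(list), "domain": defaultdict(list), "difficulty": defaultdict(list)}
    for record in records:
        for key in groups:
            groups[key][record[key]].append(record["score"])
    return {key: {value: mean(scores) for value, scores in grouped.items()}
            for key, grouped in groups.items()}
```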
3.4 FLASK-HARD
To assess state-of-the-art LLMs in challenging scenarios, we additionally introduce the FLASK-HARD subset.
This subset comprises 89 instances that are annotated as expert-level knowledge difficulty (Level 5), including tasks such as predicting chess checkmates and solving advanced mathematics problems.
Due to the intricate nature of FLASK-HARD tasks, which may prevent reliable evaluation, we explore an even more fine-grained evaluation setting for this subset.
Instead of using a fixed score rubric for each skill, we introduce an instance-specific score rubric for each skill.
Specifically, the EVAL LM first generates at most 5 subquestions (a checklist) for each instance, each corresponding to one of the related skills annotated in Section 3.2.
Then, we manually remove duplicates and subquestions unrelated to the annotated skill set.
After the subquestions are annotated for each instance, evaluators give a score ranging from 1 to 5 based on whether the model response fulfills the specific criteria of the subquestions.
The illustration of the instance-specific score rubric and the corresponding prompt are provided in Figure 1 and Figure 35 in the Appendix, respectively.
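The checklist-generation step can be sketched as below; the prompt is again a paraphrase of ours (the actual prompt is shown in Figure 35), and the generated subquestions are manually filtered before use.

```python
def checklist_prompt(instruction, reference_answer, annotated_skills, max_subquestions=5):
    """Ask the EVAL LM to draft an instance-specific checklist for FLASK-HARD."""
    skills = ", ".join(annotated_skills)
    return (
        f"For the instruction below, write at most {max_subquestions} subquestions "
        f"that check whether a response satisfies these skills: {skills}.\n\n"
        f"Instruction: {instruction}\n"
        f"Reference answer: {reference_answer}\n\n"
        "Subquestions (one per line):"
    )
```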