
InferAligner Paper Review: Inference-Time Alignment for Harmlessness through Cross-Model Guidance, 2024

jinuklee 2024. 10. 29. 21:16

https://arxiv.org/abs/2401.11206

 

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance


Abstract

 

With the rapid development of large language models (LLMs), they are not only used as general-purpose AI assistants but are also customized through further fine-tuning to meet the requirements of different applications.

 

A pivotal factor in the success of current LLMs is the alignment process.

 

Current alignment methods, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), focus on training-time alignment and are often complex and cumbersome to implement.

 

Therefore, we develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.

 

InferAligner utilizes safety steering vectors extracted from a safety-aligned model to modify the activations of the target model when responding to harmful inputs, thereby guiding the target model to provide harmless responses.

 

Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics, as well as to multimodal large language models (MLLMs) such as LLaVA.

 

It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.

3 Method

 

3.1 Safety Related Vector

 

The key idea behind InferAligner is to extract safety related vectors (SRVs), which effectively sense the input intent and can shift the output distribution towards harmlessness during inference.

 

These SRVs are created using two types of instructions: one demonstrating harmful intent and another demonstrating harmless intent.

 

We use these instructions with the conversation template to form harmful and harmless prompts.

 

Then, SRVs are obtained by calculating the mean activation difference of the last token between harmful and harmless prompts:

$$s_l \;=\; \frac{1}{|\mathcal{D}_{\text{harmful}}|}\sum_{P \in \mathcal{D}_{\text{harmful}}} a_l(P) \;-\; \frac{1}{|\mathcal{D}_{\text{harmless}}|}\sum_{P \in \mathcal{D}_{\text{harmless}}} a_l(P)$$

where a_l(P) represents the activation of the last token at layer l for the given prompt P.

 

This approach of extracting the safety related vector is called Mean Difference (MD). Specifically, we utilize the SRVs extracted from models aligned for harmlessness as safety steering vectors (SSVs).
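As a rough illustration of the MD extraction described above, here is a minimal PyTorch / Hugging Face sketch. The helper names, the use of output_hidden_states, and the prompt lists are my own assumptions rather than the authors' code; only the mean-difference computation itself follows the definition above.

```python
# Minimal sketch of Mean Difference (MD) extraction of safety related vectors.
# Assumption: a causal LM from transformers, already placed on `device`, with
# prompts pre-wrapped in the model's conversation template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def last_token_activations(model, tokenizer, prompt, device="cuda"):
    """Hidden state of the last token at every transformer layer for one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: (num_layers + 1) tensors of shape [1, seq_len, hidden];
    # skip the embedding output and keep the last token position of each layer.
    return torch.stack([h[0, -1, :] for h in out.hidden_states[1:]])  # [L, hidden]

def mean_difference_vectors(model, tokenizer, harmful_prompts, harmless_prompts):
    """SRVs: per-layer mean activation difference (harmful minus harmless)."""
    harmful = torch.stack([last_token_activations(model, tokenizer, p) for p in harmful_prompts])
    harmless = torch.stack([last_token_activations(model, tokenizer, p) for p in harmless_prompts])
    return harmful.mean(dim=0) - harmless.mean(dim=0)  # [L, hidden]
```

Run on the target model this yields the SRVs used by the guidance gate; run on a harmlessness-aligned model it yields the SSVs used for steering.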

 

3.2 InferAligner

 

Inspired by the research of Lin et al. (2023), we speculate that even models not specifically aligned for harmlessness may inherently possess the capability to perceive harmful intents and refuse to respond to harmful queries. However, they may not be able to effectively utilize these abilities.

 

Considering that models aligned for harmlessness have already mastered how to respond to harmful questions, it is possible to extract SSVs from aligned models and effectively use these vectors to guide inference-time alignment for harmlessness.

 

In the following detailed description, we use the term target model to refer to poorly aligned or unaligned models that need to be aligned for harmlessness.

 

Compared to the simple activation shifts used in ITI or RepE, InferAligner involves a more elaborate process: it selectively targets only those inputs with harmful intent.

 

So, firstly, we utilize SRVs extracted from the target model to discern the intent of inputs and apply a guidance gate to precisely control the activation shift.

 

The calculation for the guidance gate g_l at layer l is as follows:

$$g_l \;=\; \mathbb{1}\!\left[\,a_l(P)^{\top} s_l + b_l > 0\,\right]$$

Here, P is the input prompt, a_l(P) is the activation of its last token at layer l, s_l is the SRV of the l-th layer of the target model, and b_l is the bias used to determine the boundary between harmful and harmless intents.

 

This step is very important. We only need to intervene on inputs with harmful intent, not on those with harmless intent, ensuring that the model's capability on other tasks is not affected.

 

Since the guidance gate only provides a simple binary signal, for ease of operation in practical use we can select the most accurate guidance gate g_{l_0} and use it as the guidance gate for every layer.
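A minimal sketch of the gate, assuming the SRVs s_l have already been extracted as above and that b_l follows the paper's description (mean of the negative projections of the training prompts onto s_l); the function names are illustrative, not the authors' code.

```python
# Guidance gate sketch: fire only when a prompt's last-token activation projects
# onto the SRV beyond the learned boundary b_l.
import torch

def estimate_bias(train_last_token_acts, srv):
    """b_l: mean of the negative projections of all training prompts onto s_l."""
    projections = train_last_token_acts @ srv   # [N] projections onto the SRV
    return (-projections).mean()                # scalar bias b_l

def guidance_gate(last_token_act, srv, bias):
    """g_l = 1 if a_l(P)^T s_l + b_l > 0, else 0."""
    return 1.0 if (last_token_act @ srv + bias).item() > 0 else 0.0
```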

 

Then we shift the activations across all token positions using the SSVs extracted from aligned models and the guidance gate. Suppose that the set of transformer layers that need to be shifted is L_G.

 

For each layer l ∈ L_G, the activations are shifted as follows:

$$x'_l \;=\; x_l + g_{l_0} \cdot \alpha \cdot \theta_l$$

Here, x_l and x′_l respectively represent the original and shifted activations of the l-th layer of the target model, α is the intervention strength, and θ_l is the SSV of the l-th layer of the aligned model.

InferAligner comprises three kinds of parameters: b_l ∈ R, which determines the boundary between harmful and harmless intents; α ∈ R+, representing the strength of the intervention; and L_G ⊆ L, indicating the transformer layers requiring activation shifting.

To estimate b_l, we calculate it as the mean of the negative projections of all training samples onto s_l. This is a simple but effective approach.

In fact, b_l can be flexibly adjusted: if we desire the target model to be extremely harmless, then b_l can be set higher.

Regarding L_G, we heuristically choose layers that accurately judge harmful intentions in both the target model and the aligned model.

As for α, although we lack a theoretical argument for the best value, we explore its effect experimentally and determine an optimal value through a standard hyperparameter sweep.
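Putting the pieces together, the intervention can be realized as forward hooks on the shifted layers. This is an illustrative reconstruction under the assumption of a Llama-style model whose decoder blocks are reachable as model.model.layers and return the hidden states as the first element of a tuple; it is not the authors' released implementation.

```python
# Inference-time activation shift: x'_l = x_l + g_{l0} * alpha * theta_l,
# applied at all token positions for every layer l in L_G.
import torch

def register_inferaligner_hooks(model, layers_to_shift, ssv_per_layer, gate, alpha):
    """Add gate * alpha * theta_l to the output of each decoder layer l in L_G."""
    handles = []
    for l in layers_to_shift:
        theta = ssv_per_layer[l]

        def hook(module, inputs, output, theta=theta):
            hidden = output[0] if isinstance(output, tuple) else output
            shifted = hidden + gate * alpha * theta.to(hidden.device, hidden.dtype)
            # Returning a value from a forward hook replaces the layer's output.
            return ((shifted,) + output[1:]) if isinstance(output, tuple) else shifted

        handles.append(model.model.layers[l].register_forward_hook(hook))
    return handles  # call handle.remove() on each to undo the intervention
```

Before generation, one would compute g_{l_0} once for the incoming prompt with the target model's own SRV and pass it in as gate; with gate = 0 the hooks leave the model's behavior untouched.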

4 Experimental Setup

4.1 Datasets

Datasets for Safety Related Vectors. We use the Harmful Behaviors from AdvBench (Zou et al., 2023b) as the Harmful Instruction Dataset. 

It consists of 520 harmful instructions covering a wide spectrum of detrimental content such as profanity, graphic depictions, etc. 

We collect harmless instructions from the generation subset of TruthfulQA (Lin et al., 2021), which has 817 questions spanning 38 subcategories. 

Specifically, we randomly sample 520 instructions to serve as the Harmless Instruction Dataset.

From these, we randomly select 64 harmful instructions and 64 benign instructions to extract SRVs and SSVs as mentioned in Section 3.1. 

The remaining data is then used as the harmfulness test set.
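For concreteness, the split described above amounts to something like the following; the loader names and random seed are placeholders, not part of the paper.

```python
# Split: 520 harmful AdvBench behaviors + 520 sampled TruthfulQA questions;
# 64 of each for SRV/SSV extraction, the remainder as the harmfulness test set.
import random

random.seed(0)
harmful = load_advbench_behaviors()                           # hypothetical loader, 520 items
harmless = random.sample(load_truthfulqa_generation(), 520)   # hypothetical loader, 817 -> 520

random.shuffle(harmful)
srv_harmful, test_harmful = harmful[:64], harmful[64:]
random.shuffle(harmless)
srv_harmless, test_harmless = harmless[:64], harmless[64:]
```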

Datasets for Domain-Specific Fine-tuning. To evaluate the effectiveness of InferAligner, we fine-tune base models on domain-specific data in three different domains: finance, medicine, and mathematics. 

(a) Finance data: We use the instruction tuning datasets collected by (Yang et al., 2023) as the training data. 

It includes a variety of instructions, such as financial relation extraction, financial Q&A, etc. 

We also add 10,000 conversations gathered from UltraChat (Ding et al., 2023) to ensure the model's conversational abilities. 

(b) Medicine data: We use the MEDQA dataset (Jin et al., 2021) as the training data for the medicine domain. 

Each entry in this dataset provides a detailed patient profile and associated medical questions, which aligns more with how medical models are used in practice. 

Similarly, we add an equivalent amount of conversations.

(c) Mathematics data: We use the training set of the GSM8K (Cobbe et al., 2021) as the training data for the mathematics domain. 

The core of mathematical ability is reasoning, so during training, we focus not just on producing the correct answer but also on teaching the model the reasoning process. 

Similarly, we also add an equivalent amount of conversations from UltraChat.

Datasets for Security Evaluation. 

(a) Harmfulness test set: This test set is designed to measure the model's harmlessness when directly confronted with harmful questions. 

As mentioned earlier, we use the remaining data from the Harmful Instruction Dataset as the test set. 

(b) Jailbreak test set: This test set further assesses the model's safety when faced with carefully crafted deceptive jailbreak prompts. 

We collect 10 highly representative jailbreak prompts, including role playing, privilege escalation, attention shifting, automatic generation, gradient optimized, adversarial suffix, etc., and sample 50 harmful instructions from the test set, forming a jailbreak dataset with 500 jailbreak instructions. 

(c) Multimodal Harmfulness test set: As far as we know, there is currently no dataset for assessing the harmlessness of multimodal models. 

Therefore, we construct a multimodal dataset, MMHarmful Bench, which consists of 100 harmful instructions that require the combination of both input images and text for response. 

MMHarmful Bench encompasses ten different types of malicious intentions, including discrimination, sabotage, theft, defamation, illegal weapons, fraud, self-harm, psychological manipulation, misinformation, and cybercrime.

We create MMHarmful Bench to enable a more comprehensive evaluation of our approach's adaptability and effectiveness.

Datasets for Utility Evaluation. These datasets are used to evaluate whether the alignment method would result in a decline in downstream tasks. 

(a) For finance, we evaluate on three publicly available tasks: FPB (Malo et al., 2014), FiQA SA (Maia et al., 2018), and Headline (Yang et al., 2023).

(b) For medicine, we evaluate on the test set of MEDQA. 

(c) For mathematics, we evaluate on the test set of GSM8K.