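Uncensor any LLM with abliteration

Community Article Published June 13, 2024

The third generation of Llama models provided fine-tuned (Instruct) versions that excel at understanding and following instructions. However, these models are heavily censored: they are designed to refuse requests seen as harmful with responses such as "As an AI assistant, I cannot help you." While this safety feature is crucial for preventing misuse, it limits the model's flexibility and responsiveness.

In this article, we will explore a technique called "abliteration" that can uncensor any LLM without retraining. This technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.

The code is available on Google Colab and in the LLM Course on GitHub.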

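✂️ What is abliteration?

Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it **loses its ability to refuse requests**. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.

In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block ("pre"), between the attention and MLP layers ("mid"), and after the MLP ("post"). The following figure illustrates the location of each residual stream.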

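To uncensor an LLM, we first need to identify the "refusal direction" within the model. This process involves a few technical steps:

Data collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.
Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the "refusal direction" for each layer of the model.
Selection: Normalize these vectors and evaluate them to select the single best "refusal direction."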
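
The three steps above can be sketched on toy data. This is a minimal illustration rather than the implementation used later in the article: the tensor shapes, the random activations, and the helper name `refusal_direction` are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for cached residual stream activations at one layer,
# shaped [n_prompts, seq_len, d_model]. Real activations come from the model.
n_prompts, seq_len, d_model = 8, 5, 16
harmless_acts = torch.randn(n_prompts, seq_len, d_model)

# Shift the "harmful" activations along one axis so the two sets differ.
offset = torch.zeros(d_model)
offset[3] = 4.0
harmful_acts = torch.randn(n_prompts, seq_len, d_model) + offset


def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Difference of means at the last token position, normalized to unit length."""
    harmful_mean = harmful[:, -1, :].mean(dim=0)
    harmless_mean = harmless[:, -1, :].mean(dim=0)
    direction = harmful_mean - harmless_mean
    return direction / direction.norm()


direction = refusal_direction(harmful_acts, harmless_acts)
print(direction.shape)  # torch.Size([16])
```

In this toy setup the largest component of `direction` lies on the axis that was shifted, which is the behavior the difference of means is meant to capture.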

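Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an **inference-time intervention** or permanently with **weight orthogonalization**.

Let's talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.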
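
In symbols, the subtraction can be written as follows. The notation is introduced here for illustration and is not taken from the original: $x$ is a component's output at one token position and $\hat{r}$ is the unit-norm refusal direction.

```latex
x' = x - (x \cdot \hat{r})\,\hat{r}, \qquad \lVert \hat{r} \rVert = 1
\quad\Longrightarrow\quad
x' \cdot \hat{r} = x \cdot \hat{r} - (x \cdot \hat{r})(\hat{r} \cdot \hat{r}) = 0
```

After the subtraction, the output has no component left along $\hat{r}$, which is what "never represents the refusal direction" means here.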

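On the other hand, weight orthogonalization involves modifying the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, it prevents the model from writing to this direction altogether. This is achieved by adjusting the matrices that write to the residual stream, ensuring they do not contribute to the refusal direction.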
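
To see why the two approaches agree, here is a small sketch on random tensors. It assumes a row-vector convention (`y = x @ W`) and uses made-up sizes; `W`, `r`, and `x` are hypothetical stand-ins, not weights from a real model.

```python
import torch

torch.manual_seed(0)
d_in, d_model = 6, 8

# Hypothetical weight matrix that writes into the residual stream: y = x @ W
W = torch.randn(d_in, d_model)
r = torch.randn(d_model)
r = r / r.norm()  # unit-norm direction to remove

# Orthogonalize: remove from every row of W its component along r
W_orth = W - (W @ r).unsqueeze(-1) * r

x = torch.randn(4, d_in)
y = x @ W

# Inference-time ablation applied to the original output
y_ablated = y - (y @ r).unsqueeze(-1) * r

# The orthogonalized weights give the same result with no hook at run time
print(torch.allclose(x @ W_orth, y_ablated, atol=1e-5))  # True
```

The practical difference is that the orthogonalized weights can be saved and shared, whereas the inference-time version needs hooks every time the model runs.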

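In the next section, we will implement abliteration with weight orthogonalization.

💻 Implementation

The following implementation of abliteration is based on FailSpy's notebook, which is itself based on the original authors' notebook. I mostly adapted and simplified it to make it easier to understand. This section is quite code-heavy so you can see what is going on, but you can use FailSpy's abliterator library if you're less interested in the technical details (also check his collection of abliterated models on Hugging Face).

The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.

First, let's install the necessary packages and import them. All these steps are available in this Google Colab notebook.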

!pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping

import torch
import functools
import einops
import gc

from datasets import load_dataset
from tqdm import tqdm
from torch import Tensor
from typing import List
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
from jaxtyping import Float, Int
from collections import defaultdict

# Turn automatic differentiation off to save GPU memory (credit: Undi95)
torch.set_grad_enabled(False)

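We need two datasets: one containing harmless instructions, and one containing harmful instructions. We'll use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them in two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets.

We will load the instructions and reformat them into a list of dictionaries with "role" and "content" keys. This makes it compatible with the apply_chat_template() method, which we will use to follow Llama 3's chat template.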
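
For illustration, here is what a reformatted instruction looks like. The sample instruction strings are made up for this example; `reformat_texts` mirrors the helper defined in the next snippet.

```python
def reformat_texts(texts):
    # Wrap each instruction in a single-turn conversation
    return [[{"role": "user", "content": text}] for text in texts]


# Made-up sample instructions, for illustration only
sample = ["Name three primary colors.", "Summarize the plot of Hamlet."]
conversations = reformat_texts(sample)

print(conversations[0])
# [{'role': 'user', 'content': 'Name three primary colors.'}]
```

Each instruction becomes its own one-message conversation, so a batch of instructions is a list of lists.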

def reformat_texts(texts):
    return [[{"role": "user", "content": text}] for text in texts]

# Get harmful and harmless datasets
def get_harmful_instructions():
    dataset = load_dataset('mlabonne/harmful_behaviors')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

def get_harmless_instructions():
    dataset = load_dataset('mlabonne/harmless_alpaca')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

harmful_inst_train, harmful_inst_test = get_harmful_instructions()
harmless_inst_train, harmless_inst_test = get_harmless_instructions()

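Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can't directly load a custom model using HookedTransformer. Here, I use a trick described in FailSpy's notebook to download a custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. Load in torch.float16 format if your GPU is not compatible with BF16.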
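
One way to pick the dtype automatically is sketched below. This is not part of the original notebook; it assumes that falling back to float16 is acceptable whenever BF16 is unavailable, and it relies on PyTorch's `torch.cuda.is_bf16_supported()` check.

```python
import torch

# Prefer bfloat16 when a CUDA GPU supports it, otherwise fall back to float16
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16

print(dtype)
```

The resulting `dtype` can then be passed as the `dtype` argument when loading the model.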

In this example, we'll use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article about model merging) that has the highest MMLU score on the Open LLM Leaderboard in the 8B category.

MODEL_ID = "mlabonne/Daredevil-8B"
MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download and load model
!git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}

# Load model and tokenizer
model = HookedTransformer.from_pretrained_no_processing(
    MODEL_TYPE,
    local_files_only=True,
    dtype=torch.bfloat16,
    default_padding_side='left'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.eos_token

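We can now tokenize our datasets. We're using the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use all the RAM/VRAM, which is why I'm limiting it to 256 here.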

def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids

n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))

# Tokenize datasets
harmful_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmful_inst_train[:n_inst_train],
)
harmless_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmless_inst_train[:n_inst_train],
)

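Everything is set up, so we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations in harmful and harmless. This is managed by the transformer_lens library.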

# Define batch size based on available VRAM
batch_size = 32

# Initialize defaultdicts to store activations
harmful = defaultdict(list)
harmless = defaultdict(list)

# Process the training data in batches
num_batches = (n_inst_train + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    start_idx = i * batch_size
    end_idx = min(n_inst_train, start_idx + batch_size)

    # Run models on harmful and harmless prompts, cache activations
    harmful_logits, harmful_cache = model.run_with_cache(
        harmful_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )
    harmless_logits, harmless_cache = model.run_with_cache(
        harmless_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )

    # Collect and store the activations
    for key in harmful_cache:
        harmful[key].append(harmful_cache[key])
        harmless[key].append(harmless_cache[key])

    # Flush RAM and VRAM
    del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    gc.collect()
    torch.cuda.empty_cache()

# Concatenate the cached activations
harmful = {k: torch.cat(v) for k, v in harmful.items()}
harmless = {k: torch.cat(v) for k, v in harmless.items()}
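
The batching arithmetic in the loop above uses ceiling division so that a final, smaller batch is not dropped. A small standalone sketch of that arithmetic (the helper name `batch_ranges` is hypothetical):

```python
def batch_ranges(n_items: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    num_batches = (n_items + batch_size - 1) // batch_size  # ceiling division
    return [
        (i * batch_size, min(n_items, (i + 1) * batch_size))
        for i in range(num_batches)
    ]


print(len(batch_ranges(256, 32)))  # 8
print(batch_ranges(70, 32))        # [(0, 32), (32, 64), (64, 70)]
```

With 256 samples and a batch size of 32, the loop runs 8 full batches. A sample count that is not a multiple of the batch size simply ends with a shorter batch.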

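We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We store the candidates in activation_scored, sorted in descending order of the absolute value of their mean.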

# Helper function to get activation index
def get_act_idx(cache_dict, act_name, layer):
    key = (act_name, layer)
    return cache_dict[utils.get_act_name(*key)]

# Compute difference of means between harmful and harmless activations at intermediate layers
activation_layers = ["resid_pre", "resid_mid", "resid_post"]
activation_refusals = defaultdict(list)

for layer_num in range(1, model.cfg.n_layers):
    pos = -1  # Position index

    for layer in activation_layers:
        harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)
        harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(
            dim=0
        )

        refusal_dir = harmful_mean_act - harmless_mean_act
        refusal_dir = refusal_dir / refusal_dir.norm()
        activation_refusals[layer].append(refusal_dir)

# Get all calculated potential refusal directions, sort them in descending order of the absolute value of their mean
# Use a subset of layers if certain activations are not promising
selected_layers = ["resid_pre"]
activation_scored = sorted(
    [
        activation_refusals[layer][l - 1]
        for l in range(1, model.cfg.n_layers)
        for layer in selected_layers
    ],
    key=lambda x: abs(x.mean()),
    reverse=True,
)

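The final step of the process consists of evaluating the refusal directions we calculated. To do this, we remove each candidate direction from every residual stream in every block during inference. In the following snippet, we get generations for four harmful test instructions and the top 20 candidate directions (one per block, or layer).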

def _generate_with_hooks(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    tokens: Int[Tensor, "batch_size seq_len"],
    max_tokens_generated: int = 64,
    fwd_hooks=[],
) -> List[str]:
    all_tokens = torch.zeros(
        (tokens.shape[0], tokens.shape[1] + max_tokens_generated),
        dtype=torch.long,
        device=tokens.device,
    )
    all_tokens[:, : tokens.shape[1]] = tokens
    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_tokens[:, : -max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(
                dim=-1
            )  # greedy sampling (temperature=0)
            all_tokens[:, -max_tokens_generated + i] = next_tokens
    return tokenizer.batch_decode(
        all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    )

def get_generations(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    instructions: List[str],
    fwd_hooks=[],
    max_tokens_generated: int = 64,
    batch_size: int = 4,
) -> List[str]:
    generations = []
    for i in tqdm(range(0, len(instructions), batch_size)):
        tokens = tokenize_instructions(
            tokenizer, instructions=instructions[i : i + batch_size]
        )
        generation = _generate_with_hooks(
            model,
            tokenizer,
            tokens,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)
    return generations

# Inference-time intervention hook
def direction_ablation_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    direction: Float[Tensor, "d_act"],
):
    if activation.device != direction.device:
        direction = direction.to(activation.device)
    proj = (
        einops.einsum(
            activation, direction.view(-1, 1), "... d_act, d_act single -> ... single"
        )
        * direction
    )
    return activation - proj

# Testing baseline
N_INST_TEST = 4
baseline_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)
EVAL_N = 20  # Evaluate how many of the top N potential directions
evals = []
for refusal_dir in tqdm(activation_scored[:EVAL_N]):
    hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)
    fwd_hooks = [
        (utils.get_act_name(act_name, layer), hook_fn)
        for layer in list(range(model.cfg.n_layers))
        for act_name in activation_layers
    ]
    intervention_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    )
    evals.append(intervention_generations)

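We stored all the generations in the evals list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I'm automatically excluding responses containing "I cannot" and "I can't" to filter out unwanted answers.

If you can't find a layer that satisfies these requirements, you might want to test other residual streams in the previous selected_layers list, other instructions, additional blocks, etc.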

# Print generations for human evaluation
blacklist = ["I cannot", "I can't"]
for i in range(N_INST_TEST):
    print(f"\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}")
    print(f"\nBASELINE COMPLETION:\n{baseline_generations[i]}\033[0m")
    for layer_candidate in range(EVAL_N):
        if not any(word in evals[layer_candidate][i] for word in blacklist):
            print(f"\n---\n\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:")
            print(evals[layer_candidate][i])
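
The substring filter above is deliberately simple. As a hedged variation (not from the original notebook), the check can be made case-insensitive and tolerant of typographic apostrophes. The helper name `looks_like_refusal` and the marker list are assumptions for this sketch.

```python
REFUSAL_MARKERS = ["i cannot", "i can't"]


def looks_like_refusal(text: str) -> bool:
    """Heuristic check: True if the text contains a known refusal phrase."""
    # Lowercase and map the typographic apostrophe (U+2019) to a plain one
    normalized = text.lower().replace("\u2019", "'")
    return any(marker in normalized for marker in REFUSAL_MARKERS)


print(looks_like_refusal("I cannot help with that."))    # True
print(looks_like_refusal("I CAN\u2019T assist."))         # True
print(looks_like_refusal("Sure, here is an overview."))  # False
```

A heuristic like this only narrows down the candidates. The remaining generations still need to be read by a human.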

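In my case, layer candidate 9 managed to provide uncensored answers for all four instructions. This is the one we will select for the refusal direction. In what follows, we implement weight orthogonalization to modify the weights and prevent the model from creating outputs with this direction. You can verify that the model is successfully uncensored by printing the completions.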

def get_orthogonalized_matrix(
    matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
) -> Float[Tensor, "... d_model"]:
    proj = (
        einops.einsum(
            matrix, vec.view(-1, 1), "... d_model, d_model single -> ... single"
        )
        * vec
    )
    return matrix - proj

# Select the layer with the highest potential refusal direction
LAYER_CANDIDATE = 9
refusal_dir = activation_scored[LAYER_CANDIDATE]

# Orthogonalize the model's weights
if refusal_dir.device != model.W_E.device:
    refusal_dir = refusal_dir.to(model.W_E.device)
model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)

for block in tqdm(model.blocks):
    if refusal_dir.device != block.attn.W_O.device:
        refusal_dir = refusal_dir.to(block.attn.W_O.device)
    block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)
    block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)

# Generate text with abliterated model
orthogonalized_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Print generations
for i in range(N_INST_TEST):
    if len(baseline_generations) > i:
        print(f"INSTRUCTION {i}: {harmful_inst_test[i]}")
        print(f"\033[92mBASELINE COMPLETION:\n{baseline_generations[i]}")
    print(f"\033[91mINTERVENTION COMPLETION:\n{evals[LAYER_CANDIDATE][i]}")
    print(f"\033[95mORTHOGONALIZED COMPLETION:\n{orthogonalized_generations[i]}\n")

We're now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF hub.

# Convert model back to HF safetensors
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)
lm_model = hf_model.model

state_dict = model.state_dict()
lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(model.cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=model.cfg.n_heads
        ).contiguous()
    )
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

hf_model.push_to_hub(f"{MODEL_ID}-abliterated")
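
The `rearrange` call above changes how the attention output weights are laid out. The sketch below checks that layout change on toy tensors. It assumes, consistently with the pattern string in the code, that TransformerLens stores `W_O` as [n_heads, d_head, d_model] and that the Hugging Face linear layer stores its weight as [d_model, n_heads * d_head]. All sizes are made up.

```python
import torch

torch.manual_seed(0)
n_heads, d_head, d_model = 4, 3, 8

# TransformerLens-style output projection: one [d_head, d_model] matrix per head
W_O = torch.randn(n_heads, d_head, d_model)
# Per-head attention outputs for a single token position
z = torch.randn(n_heads, d_head)

# Output computed head by head, then summed
out_per_head = sum(z[n] @ W_O[n] for n in range(n_heads))

# Same layout change as "n h m -> m (n h)": flatten the heads, then transpose
o_proj_weight = W_O.reshape(n_heads * d_head, d_model).T.contiguous()

# A linear layer without bias computes input @ weight.T
out_linear = z.reshape(-1) @ o_proj_weight.T

print(o_proj_weight.shape)                                  # torch.Size([8, 12])
print(torch.allclose(out_per_head, out_linear, atol=1e-5))  # True
```

Both layouts compute the same output, so the conversion only changes how the weights are stored.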

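⚖️ DPO Fine-Tuning

I evaluated the abliterated and source models from the previous section on the Open LLM Leaderboard and on Nous' benchmark suite. Here are the results:

As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the abliterated version across all benchmarks. The abliteration process successfully uncensored it but also degraded the model's quality.

To fix this issue, one idea is to further train our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning. An additional SFT would likely break the model's performance.

Alternatively, preference alignment is quite light and shouldn't lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset. Here's the configuration I used: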

base_model: mlabonne/Daredevil-8B-abliterated
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
save_safetensors: true

rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/orpo-dpo-mix-40k-flat
    split: train
    type: chatml.intel

dataset_prepared_path:
val_set_size: 0.0
output_dir: ./out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 5e-6
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32:

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
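
As a quick sanity check on these settings, the effective batch size per optimizer step can be worked out as below. This assumes standard data parallelism with one process per GPU and uses the 6 GPUs mentioned in the next paragraph. It is a back-of-the-envelope calculation, not a figure reported for the original run.

```python
# Values taken from the configuration above
micro_batch_size = 1
gradient_accumulation_steps = 8
# Assumption: one data-parallel process per GPU, with 6 GPUs
num_gpus = 6

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 48
```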

I trained it using 6xA6000 GPUs with DeepSpeed ZeRO-2. The training took about 6 hours and 45 minutes. Here are the training curves I got from W&B:

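It automatically uploaded the DPO fine-tuned model, called mlabonne/NeuralDaredevil-8B-abliterated. To see if it fixed our abliterated version, I evaluated it on the same benchmarks:

We can see that this additional training allowed us to recover most of the performance drop due to abliteration. One area where the model doesn't improve is GSM8K, a math dataset, which could mean that orpo-dpo-mix-40k would benefit from more math samples.

The final model is an uncensored LLM with state-of-the-art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct when you don't need censorship. You can play with quantized versions such as GGUF in LM Studio.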

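Conclusion

In this article, we introduced the concept of abliteration. This technique uses the model's activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model's weights so that it stops outputting refusals. This technique also demonstrates the fragility of safety fine-tuning and raises ethical considerations.

We applied abliteration to Daredevil-8B to uncensor it, which also degraded the model's performance. We then healed it using DPO to create the NeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy's MopeyMule, which adopts a melancholic conversational style.

I hope you liked this article. If you want to see more, follow me on Hugging Face and on Twitter @maximelabonne.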