開源 AI 食譜文件

使用 Cleanlab 透過主動學習標註文字資料

開源 AI 食譜

加入 Hugging Face 社群

並獲得增強的文件體驗

在模型、資料集和 Spaces 上進行協作

透過加速推理獲得更快的示例

切換文件主題

開始使用

使用 Cleanlab 透過主動學習標註文字資料

作者: Aravind Putrevu

在本 notebook 中，我將重點介紹如何使用主動學習來改進一個經過微調的 Hugging Face Transformer 文字分類模型，同時將從人工標註員那裡收集的標籤總數保持在較低水平。當資源有限，無法為全部資料獲取標籤時，主動學習旨在透過選擇資料標註員應投入精力標註的樣本來節省時間和金錢。

什麼是主動學習？

主動學習幫助確定哪些資料應優先標註，以便最大化在標註資料上訓練的監督機器學習模型的效能。這個過程通常是迭代的——在每一輪中，主動學習會告訴我們應該為哪些樣本收集額外的標註，以便在有限的標註預算下最大限度地改進我們當前的模型。ActiveLab 是一種主動學習演算法，當來自人工標註員的標籤存在噪聲時，以及當我們應該為一個之前已標註的樣本（其標籤看起來可疑）而不是一個尚未標註的樣本收集更多標註時，它特別有用。在為一批資料收集了這些新的標註以增加我們的訓練資料集後，我們會重新訓練模型並評估其測試準確率。

ActiveLab thumb.webp

在本 notebook 中，我將考慮一個二元文字分類任務：預測特定短語是有禮貌還是不禮貌。

在為 Transformer 模型收集額外標註時，使用 ActiveLab 的主動學習比隨機選擇要好得多。無論總標註預算如何，它始終能產生更好的模型，錯誤率大約降低 50%。

本 notebook 的其餘部分將詳細介紹您可以用來實現這些結果的開原始碼。

設定環境

!pip install datasets==2.20.0 transformers==4.25.1 scikit-learn==1.1.2 matplotlib==3.5.3 cleanlab

import pandas as pd

pd.set_option("max_colwidth", None)
import numpy as np
import random
import transformers
import datasets
import matplotlib.pyplot as plt

from cleanlab.multiannotator import (
    get_majority_vote_label,
    get_active_learning_scores,
    get_label_quality_multiannotator,
)
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from scipy.special import softmax
from datetime import datetime

收集和組織資料

這裡我們下載本 notebook 所需的資料。

labeled_data_file = {"labeled": "X_labeled_full.csv"}
unlabeled_data_file = {"unlabeled": "X_labeled_full.csv"}
test_data_file = {"test": "test.csv"}

X_labeled_full = load_dataset("Cleanlab/stanford-politeness", split="labeled", data_files=labeled_data_file)
X_unlabeled = load_dataset("Cleanlab/stanford-politeness", split="unlabeled", data_files=unlabeled_data_file)
test = load_dataset("Cleanlab/stanford-politeness", split="test", data_files=test_data_file)

!wget -nc -O 'extra_annotations.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/extra_annotations.npy?download=true'

extra_annotations = np.load("extra_annotations.npy",allow_pickle=True).item()

X_labeled_full = X_labeled_full.to_pandas()
X_labeled_full.set_index("id", inplace=True)
X_unlabeled = X_unlabeled.to_pandas()
X_unlabeled.set_index("id", inplace=True)
test = test.to_pandas()

文字禮貌性分類

我們使用的是斯坦福禮貌語料庫作為資料集。

它被構建為一個二元文字分類任務，用於判斷每個短語是禮貌還是不禮貌。人工標註員會收到一個選定的文字短語，並提供一個關於其禮貌性的 (不完美的) 標註：0 表示不禮貌，1 表示禮貌。

我們在標註資料上訓練一個 Transformer 分類器，並在一組留出的測試樣本上測量模型準確率。我對這些測試樣本的真實標籤非常有信心，因為它們來自於 5 位標註員對每個樣本進行標註後達成的共識。

至於訓練資料，我們有

X_labeled_full：我們的初始訓練集，只包含 100 個文字樣本，每個樣本有 2 個標註。
X_unlabeled：一個包含 1900 個未標註文字樣本的大資料集，我們可以考慮讓標註員對其進行標註。
extra_annotations：當請求對某個樣本進行標註時，我們從中提取額外標註的池子

視覺化資料

# Multi-annotated Data
X_labeled_full.head()

# Unlabeled Data
X_unlabeled.head()

# extra_annotations contains the annotations that we will use when an additional annotation is requested.
extra_annotations

# Random sample of extra_annotations to see format.
{k: extra_annotations[k] for k in random.sample(extra_annotations.keys(), 5)}

檢視測試集中的一些示例

>>> num_to_label = {0: "Impolite", 1: "Polite"}
>>> for i in range(2):
...     print(f"{num_to_label[i]} examples:")
...     subset = test[test.label == i][["text"]].sample(n=3, random_state=2)
...     print(subset)

Impolite examples:

不禮貌的例子

	文字
120	而且也在浪費我們的時間。我只能重複一遍：你為什麼不透過新增關於你心愛的馬其頓的內容來做些建設性的工作呢？
150	與其告訴我關閉某些 afd 是多麼錯誤，也許你的時間更好地花在處理當前的 afd 積壓工作上。如果我的決定如此錯誤，你為什麼沒有重新開啟它們？
326	根據 CFD，這本應該被移動到。為什麼沒有移動？

禮貌的例子

	文字
498	你好，我已經提出了取消對 tamazepam 頁面保護的可能性。你有什麼想法？
132	由於某些編輯，頁面排版發生了變化。您能幫忙嗎？
131	很高興你對整體外觀感到滿意。在我標註所有街道之前，文字大小、字型樣式等是否可以？

輔助方法

以下部分包含了本 notebook 所需的所有輔助方法。

get_idx_to_label 專為主動學習場景設計，尤其是在處理混合了已標註和未標註資料的情況下。其主要目標是根據它們的主動學習分數，確定應選擇哪些樣本（來自已標註和未標註資料集）進行額外標註。

# Helper method to get indices of examples with the lowest active learning score to collect more labels for.
def get_idx_to_label(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label,
    active_learning_scores,
    active_learning_scores_unlabeled=None,
):
    if active_learning_scores_unlabeled is None:
        active_learning_scores_unlabeled = np.array([])

    to_label_idx = []
    to_label_idx_unlabeled = []

    num_labeled = len(active_learning_scores)
    active_learning_scores_combined = np.concatenate((active_learning_scores, active_learning_scores_unlabeled))
    to_label_idx_combined = np.argsort(active_learning_scores_combined)

    # We want to collect the n=batch_size best examples to collect another annotation for.
    i = 0
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        idx = to_label_idx_combined[i]
        # We know this is an already annotated example.
        if idx < num_labeled:
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
        # We know this is an example that is currently not annotated.
        else:
            # Subtract off offset to get back original index.
            idx -= num_labeled
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
        i += 1

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

get_idx_to_label_random 是為主動學習場景設計的，其中選擇資料點進行額外標註是隨機的，而不是基於模型的不確定性或學習分數。這種方法可以用作基線，與更復雜的主動學習策略進行比較，或者在不清楚如何對樣本進行評分的情況下使用。

# Helper method to get indices of random examples to collect more labels for.
def get_idx_to_label_random(X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label):
    to_label_idx = []
    to_label_idx_unlabeled = []

    # Generate list of indices for both sets of examples.
    labeled_idx = [(x, "labeled") for x in range(len(X_labeled_full))]
    unlabeled_idx = []
    if X_unlabeled is not None:
        unlabeled_idx = [(x, "unlabeled") for x in range(len(X_unlabeled))]
    combined_idx = labeled_idx + unlabeled_idx

    # We want to collect the n=batch_size random examples to collect another annotation for.
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        # Random choice from indices.
        # We time-seed to ensure randomness.
        random.seed(datetime.now().timestamp())
        choice = random.choice(combined_idx)
        idx, which_subset = choice
        # We know this is an already annotated example.
        if which_subset == "labeled":
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
            combined_idx.remove(choice)
        # We know this is an example that is currently not annotated.
        else:
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
            combined_idx.remove(choice)

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

下面是一些輔助方法，幫助我們計算標準差，選擇一個先前已標註過該樣本的特定標註員，以及一些用於文字樣本分詞的 Token 函式。

# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev


# Helper method to select which annotator we should collect another annotation from.
def choose_existing(annotators, existing_annotators):
    for annotator in annotators:
        # If we find one that has already given an annotation, we return it.
        if annotator in existing_annotators:
            return annotator
    # If we don't find an existing, just return a random one.
    choice = random.choice(list(annotators.keys()))
    return choice


# Helper method for Trainer.
def compute_metrics(p):
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"logits": logits, "pred_probs": pred_probs, "accuracy": accuracy}


# Helper method to tokenize text.
def tokenize_function(examples):
    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Helper method to tokenize given dataset.
def tokenize_data(data):
    dataset = Dataset.from_dict({"label": data["label"], "text": data["text"].values})
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.cast_column("label", ClassLabel(names=["0", "1"]))
    return tokenized_dataset

這裡的 get_trainer 函式旨在為文字分類任務設定一個訓練環境，使用的是 DistilBERT，它是 BERT 模型的一個蒸餾版本，更輕更快。

# Helper method to initiate a new Trainer with given train and test sets.
def get_trainer(train_set, test_set):

    # Model params.
    model_name = "distilbert-base-uncased"
    model_folder = "model_training"
    max_training_steps = 300
    num_classes = 2

    # Set training args.
    # We time-seed to ensure randomness between different benchmarking runs.
    training_args = TrainingArguments(
        max_steps=max_training_steps, output_dir=model_folder, seed=int(datetime.now().timestamp())
    )

    # Tokenize train/test set.
    train_tokenized_dataset = tokenize_data(train_set)
    test_tokenized_dataset = tokenize_data(test_set)

    # Initiate a pre-trained model.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_tokenized_dataset,
        eval_dataset=test_tokenized_dataset,
    )
    return trainer

get_pred_probs 函式使用交叉驗證為給定資料集計算樣本外預測機率，並額外處理未標註資料。

# Helper method to manually compute cross-validated predicted probabilities needed for ActiveLab.
def get_pred_probs(X, X_unlabeled):
    """Uses cross-validation to obtain out-of-sample predicted probabilities
    for given dataset"""

    # Generate cross-val splits.
    n_splits = 3
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    skf_splits = [[train_index, test_index] for train_index, test_index in skf.split(X=X["text"], y=X["label"])]

    # Initiate empty array to store pred_probs.
    num_examples, num_classes = len(X), len(X.label.value_counts())
    pred_probs = np.full((num_examples, num_classes), np.NaN)
    pred_probs_unlabeled = None

    # If we use up all examples from the initial unlabeled pool, X_unlabeled will be None.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.full((n_splits, len(X_unlabeled), num_classes), np.NaN)

    # Iterate through cross-validation folds.
    for split_num, split in enumerate(skf_splits):
        train_index, test_index = split

        train_set = X.iloc[train_index]
        test_set = X.iloc[test_index]

        # Get trainer with train/test subsets.
        trainer = get_trainer(train_set, test_set)
        trainer.train()
        eval_metrics = trainer.evaluate()

        # Get pred_probs and insert into dataframe.
        pred_probs_fold = eval_metrics["eval_pred_probs"]
        pred_probs[test_index] = pred_probs_fold

        # Since we don't have labels for the unlabeled pool, we compute pred_probs at each round of CV
        # and then average the results at the end.
        if X_unlabeled is not None:
            dataset_unlabeled = Dataset.from_dict({"text": X_unlabeled["text"].values})
            unlabeled_tokenized_dataset = dataset_unlabeled.map(tokenize_function, batched=True)
            logits = trainer.predict(unlabeled_tokenized_dataset).predictions
            curr_pred_probs_unlabeled = softmax(logits, axis=1)
            pred_probs_unlabeled[split_num] = curr_pred_probs_unlabeled

    # Here we average the pred_probs from each round of CV to get pred_probs for the unlabeled pool.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.mean(np.array(pred_probs_unlabeled), axis=0)

    return pred_probs, pred_probs_unlabeled

get_annotator 函式根據一系列標準確定最合適的標註員來為一個特定樣本收集新的標註，而 get_annotation 則專注於從選定的標註員那裡為給定樣本收集實際的標註，它還會從池中刪除已收集的標註，以防止其被再次選擇。

# Helper method to determine which annotator to collect annotation from for given example.
def get_annotator(example_id):
    # Update who has already annotated atleast one example.
    existing_annotators = set(X_labeled_full.drop("text", axis=1).columns)
    # Returns the annotator we want to collect annotation from.
    # Chooses existing annotators first.
    annotators = extra_annotations[example_id]
    chosen_annotator = choose_existing(annotators, existing_annotators)
    return chosen_annotator


# Helper method to collect an annotation for given text example.
def get_annotation(example_id, chosen_annotator):

    # Collect new annotation.
    new_annotation = extra_annotations[example_id][chosen_annotator]

    # Remove annotation.
    del extra_annotations[example_id][chosen_annotator]

    return new_annotation

執行以下單元格以隱藏下一個模型訓練塊的 HTML 輸出。

%%html
<style>
    div.output_stderr {
    display: none;
    }
</style>

使用的方法論

在每個主動學習輪次中，我們

為每個訓練樣本計算 ActiveLab 一致性標籤，該標籤源自迄今為止收集的所有標註。
在我們當前的訓練集上使用這些一致性標籤訓練我們的 Transformer 分類模型。
在測試集（其具有高質量的真實標籤）上評估測試準確率。
執行交叉驗證，為整個訓練集和未標註集獲取我們模型的樣本外預測類別機率。
為訓練集和未標註集中的每個樣本獲取 ActiveLab 主動學習分數。這些分數估計了為每個樣本收集另一個標註的資訊量有多大。
選擇一個主動學習分數最低的子集（n = batch_size）作為樣本。
為 n 個選定樣本中的每一個收集一個額外的標註。
將新的標註（以及如果被選中的新的先前未標註的樣本）新增到我們的訓練集中，用於下一次迭代。

隨後，我比較了透過主動學習標註的資料和透過隨機選擇標註的資料訓練的模型。對於每個隨機選擇輪次，我使用多數投票一致性代替 ActiveLab 一致性（在步驟 1 中），然後只隨機選擇 n 個樣本來收集額外的標籤，而不是使用 ActiveLab 分數（在步驟 6 中）。

關於 ActiveLab 一致性標籤和主動學習分數的更多直覺將在 notebook 的後續部分分享。

模型訓練與評估

我首先對我的測試集和訓練集進行分詞，然後初始化一個預訓練的 DistilBert Transformer 模型。使用 300 個訓練步驟微調 DistilBert 在我的資料上取得了準確率和訓練時間之間的良好平衡。這個分類器輸出預測的類別機率，在評估其準確率之前，我將這些機率轉換為類別預測。

使用主動學習分數決定下一步標註什麼

在每一輪主動學習中，我們透過 3 折交叉驗證在當前訓練集上擬合我們的 Transformer 模型。這使我們能夠為訓練集中的每個樣本獲得樣本外預測的類別機率，並且我們還可以使用訓練好的 Transformer 為未標註池中的每個樣本獲得樣本外預測的類別機率。所有這些都在 get_pred_probs 輔助方法中內部實現。使用樣本外預測有助於我們避免因潛在的過擬合而產生的偏差。

一旦我有了這些機率預測，我將它們傳遞給開源 cleanlab 包中的 get_active_learning_scores 方法，該方法實現了 ActiveLab 演算法。這個方法為我們所有的已標註和未標註資料提供了分數。較低的分數表示那些為其收集一個額外標籤對我們當前模型最有資訊量的資料點（分數在已標註和未標註資料之間是直接可比的）。

我將分數最低的樣本組成一個批次，作為要收集標註的樣本（透過 get_idx_to_label 方法）。在這裡，我總是在每一輪中收集完全相同數量的標註（在主動學習和隨機選擇兩種方法下都是如此）。對於這個應用，我將每個樣本的最大標註數量限制為 5 個（不想花費精力一遍又一遍地標註同一個樣本）。

新增新標註

combined_example_ids 是我們想要收集標註的文字樣本的 ID。對於其中的每一個，我們使用 get_annotation 輔助方法從一個標註員那裡收集一個新的標註。在這裡，我們優先選擇已經標註過另一個樣本的標註員。如果給定樣本的標註員都不存在於訓練集中，我們隨機選擇一個。在這種情況下，我們在訓練集中新增一個新列，代表新的標註員。最後，我們將新收集的標註新增到訓練集中。如果相應的樣本之前是未標註的，我們也會將其新增到訓練集中，並從未標註的集合中移除。

我們現在已經完成了一輪收集新標註的過程，並在更新後的訓練集上重新訓練 Transformer 模型。我們將在多輪中重複這個過程，以不斷擴大訓練資料集並改進我們的模型。

# For this Active Learning demo, we add 25 additional annotations to the training set
# each iteration, for 25 rounds.
num_rounds = 25
batch_size_to_label = 25
model_accuracy_arr = np.full(num_rounds, np.nan)

# The 'selection_method' varible determines if we use ActiveLab or random selection
# to choose the new annotations each round.
selection_method = "random"
# selection_method = 'active_learning'

# Each round we:
# - train our model
# - evaluate on unchanging test set
# - collect and add new annotations to training set
for i in range(num_rounds):

    # X_labeled_full is updated each iteration. We drop the text column which leaves us with just the annotations.
    multiannotator_labels = X_labeled_full.drop(["text"], axis=1)

    # Use majority vote when using random selection to select the consensus label for each example.
    if i == 0 or selection_method == "random":
        consensus_labels = get_majority_vote_label(multiannotator_labels)

    # When using ActiveLab, use cleanlab's CrowdLab to select the consensus label for each example.
    else:
        results = get_label_quality_multiannotator(
            multiannotator_labels,
            pred_probs_labeled,
            calibrate_probs=True,
        )
        consensus_labels = results["label_quality"]["consensus_label"].values

    # We only need the text and label columns.
    train_set = X_labeled_full[["text"]]
    train_set["label"] = consensus_labels
    test_set = test[["text", "label"]]

    # Train our Transformer model on the full set of labeled data to evaluate model accuracy for the current round.
    # This is an optional step for demonstration purposes, in practical applications
    # you may not have ground truth labels.
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    eval_metrics = trainer.evaluate()
    # set statistics
    model_accuracy_arr[i] = eval_metrics["eval_accuracy"]

    # For ActiveLab, we need to run cross-validation to get out-of-sample predicted probabilites.
    if selection_method == "active_learning":
        pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)

        # Compute active learning scores.
        active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
            multiannotator_labels, pred_probs, pred_probs_unlabeled
        )

        # Get the indices of examples to collect more labels for.
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label,
            active_learning_scores,
            active_learning_scores_unlabeled,
        )

    # We don't need to run cross-validation, just get random examples to collect annotations for.
    if selection_method == "random":
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label_random(
            X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label
        )

    unlabeled_example_ids = np.array([])
    # Check to see if we still have unlabeled examples left.
    if X_unlabeled is not None:
        # Get unlabeled text examples we want to collect annotations for.
        new_text = X_unlabeled.iloc[chosen_examples_unlabeled]
        unlabeled_example_ids = new_text.index.values
        num_ex, num_annot = len(new_text), multiannotator_labels.shape[1]
        empty_annot = pd.DataFrame(
            data=np.full((num_ex, num_annot), np.NaN),
            columns=multiannotator_labels.columns,
            index=unlabeled_example_ids,
        )
        new_unlabeled_df = pd.concat([new_text, empty_annot], axis=1)

        # Combine unlabeled text examples with existing, labeled examples.
        X_labeled_full = pd.concat([X_labeled_full, new_unlabeled_df], axis=0)

        # Remove examples from X_unlabeled and check if empty.
        # Once it is empty we set it to None to handle appropriately elsewhere.
        X_unlabeled = X_unlabeled.drop(new_text.index)
        if X_unlabeled.empty:
            X_unlabeled = None

    if selection_method == "active_learning":
        # Update pred_prob arrays with newly added examples if necessary.
        if pred_probs_unlabeled is not None and len(chosen_examples_unlabeled) != 0:
            pred_probs_new = pred_probs_unlabeled[chosen_examples_unlabeled, :]
            pred_probs_labeled = np.concatenate((pred_probs, pred_probs_new))
            pred_probs_unlabeled = np.delete(pred_probs_unlabeled, chosen_examples_unlabeled, axis=0)
        # Otherwise we have nothing to modify.
        else:
            pred_probs_labeled = pred_probs

    # Get combined list of text ID's to relabel.
    labeled_example_ids = X_labeled_full.iloc[chosen_examples_labeled].index.values
    combined_example_ids = np.concatenate([labeled_example_ids, unlabeled_example_ids])

    # Now we collect annotations for the selected examples.
    for example_id in combined_example_ids:
        # Choose which annotator to collect annotation from.
        chosen_annotator = get_annotator(example_id)
        # Collect new annotation.
        new_annotation = get_annotation(example_id, chosen_annotator)
        # New annotator has been selected.
        if chosen_annotator not in X_labeled_full.columns.values:
            empty_col = np.full((len(X_labeled_full),), np.nan)
            X_labeled_full[chosen_annotator] = empty_col

        # Add selected annotation to the training set.
        X_labeled_full.at[example_id, chosen_annotator] = new_annotation

結果

在執行 25 輪主動學習（每輪標註一批資料並重新訓練 Transformer 模型），每輪收集 25 個標註之後。我重複了所有這些步驟，下一次使用隨機選擇來決定每輪要標註哪些樣本——作為基線比較。在標註額外資料之前，兩種方法都從相同的初始訓練集（100 個樣本）開始（因此在第一輪中 Transformer 的準確率大致相同）。由於訓練 Transformer 存在固有的隨機性，我將整個過程重複了五次（對於每種資料標註策略），並報告了五次重複執行中測試準確率的標準差（陰影區域）和平均值（實線）。

# Get numpy array of results.
!wget -nc -O 'random_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/activelearn_acc.npy'
!wget -nc -O 'activelearn_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/random_acc.npy'

# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev

>>> al_acc = np.load("activelearn_acc.npy")
>>> rand_acc = np.load("random_acc.npy")

>>> rand_acc_std = compute_std_dev(rand_acc)
>>> al_acc_std = compute_std_dev(al_acc)

>>> plt.plot(range(1, al_acc.shape[1] + 1), np.mean(al_acc, axis=0), label="active learning", color="green")
>>> plt.fill_between(range(1, al_acc.shape[1] + 1), al_acc_std[0], al_acc_std[1], alpha=0.3, color="green")

>>> plt.plot(range(1, rand_acc.shape[1] + 1), np.mean(rand_acc, axis=0), label="random", color="red")
>>> plt.fill_between(range(1, rand_acc.shape[1] + 1), rand_acc_std[0], rand_acc_std[1], alpha=0.1, color="red")

>>> plt.hlines(y=0.9, xmin=1.0, xmax=25.0, color="black", linestyle="dotted")
>>> plt.legend()
>>> plt.xlabel("Round Number")
>>> plt.ylabel("Test Accuracy")
>>> plt.title("ActiveLab vs Random Annotation Selection --- 5 Runs")
>>> plt.savefig("al-results.png")
>>> plt.show()

我們看到，選擇下一步要標註哪些資料對模型效能有巨大影響。使用 ActiveLab 的主動學習在每一輪都持續顯著優於隨機選擇。例如，在第 4 輪，訓練集中總共有 275 個標註時，我們透過主動學習獲得了 91% 的準確率，而沒有巧妙選擇標註策略的情況下只有 76% 的準確率。總體而言，無論總標註預算如何，基於主動學習構建的資料集上訓練出的 Transformer 模型，其錯誤率大約降低了 50%！

在為文字分類標註資料時，您應考慮使用帶重新標註選項的主動學習，以更好地處理不完美的標註員。

< > 在 GitHub 上更新

←使用 Cleanlab 檢測文字資料集中的問題使用 Gemma、Elasticsearch 和開源模型構建 RAG 系統→