Text classification

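Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label such as 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.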

This guide will show you how to:

Finetune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.
Use your finetuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate accelerate
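
As a quick sanity check after installation, the sketch below reports which of the four packages cannot be found on the import path. It relies only on the standard library; the helper name `missing_packages` is hypothetical and not part of any of these libraries.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The four packages installed by the pip command above.
required = ["transformers", "datasets", "evaluate", "accelerate"]
print("Missing packages:", missing_packages(required) or "none")
```

An empty result only means the packages are importable; it says nothing about whether their versions are compatible with each other.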

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load IMDb dataset

Start by loading the IMDb dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> imdb = load_dataset("imdb")

Then take a look at an example:

>>> imdb["test"][0]
{
    "label": 0,
    "text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

text: the movie review text.
label: a value that is either 0 for a negative review or 1 for a positive review.

Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the text field:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Create a preprocessing function that tokenizes text and truncates sequences so they are no longer than DistilBERT's maximum input length:

>>> def preprocess_function(examples):
...     return tokenizer(examples["text"], truncation=True)

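To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once: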

>>> tokenized_imdb = imdb.map(preprocess_function, batched=True)
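
To make the effect of batched=True concrete, here is a minimal standard-library sketch of the idea: the function receives a dict of column lists rather than one row at a time. The names `toy_tokenize` and `batched_map` are hypothetical stand-ins, the whitespace tokenizer is not DistilBERT's, and this is not how 🤗 Datasets implements map internally.

```python
def toy_tokenize(texts, max_length=8):
    """Hypothetical whitespace tokenizer: split each text, truncate to max_length."""
    return {"tokens": [text.split()[:max_length] for text in texts]}


def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the new columns back into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out


rows = [
    {"text": "a fine film", "label": 1},
    {"text": "dull and far too long to sit through in one go tonight", "label": 0},
    {"text": "loved it", "label": 1},
]
result = batched_map(rows, lambda batch: toy_tokenize(batch["text"]))
print([len(r["tokens"]) for r in result])  # the 12-word review is truncated to 8
```

Processing a whole batch per call amortizes per-call overhead, which is why a batched map is typically faster with tokenizers that accept lists of strings.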

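Now create a batch of examples using DataCollatorWithPadding. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.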

PyTorch

>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

TensorFlow

>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
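
Independent of the framework, the saving from dynamic padding described above can be illustrated with a small standard-library sketch. The helpers `pad_batch` and `count_padding` are hypothetical, the token ids are made up, and the pad id of 0 and the fixed length of 512 are assumptions chosen to resemble DistilBERT.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Right-pad each sequence with pad_id to `length` (default: longest in the batch)."""
    target = length if length is not None else max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def count_padding(padded):
    """Count padded positions, i.e. zeros in the attention mask."""
    return sum(mask.count(0) for mask in padded["attention_mask"])


batch = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]  # made-up token ids

dynamic = pad_batch(batch)            # pad to the longest sequence in this batch (5)
fixed = pad_batch(batch, length=512)  # pad every sequence to an assumed fixed maximum

print(count_padding(dynamic), count_padding(fixed))
```

In this toy batch, dynamic padding adds 2 pad positions while fixed-length padding adds 1016, which is the wasted computation the collator avoids.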

Evaluate

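Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):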

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
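
To see what this function computes without loading the metric, the following standard-library sketch takes the argmax of each row of logits and compares it with the label. The helper names and the logits are made up for illustration; the returned dict mirrors the `{"accuracy": ...}` shape of the metric's result.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews; column 0 = negative, column 1 = positive.
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 0.4], [1.7, 1.8]]
labels = [0, 1, 0, 1]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```

Three of the four predictions match their labels here; the third row predicts class 1 for a review labeled 0.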

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:

>>> id2label = {0: "NEGATIVE", 1: "POSITIVE"}
>>> label2id = {"NEGATIVE": 0, "POSITIVE": 1}
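
The two dictionaries must be exact inverses of each other. One way to guarantee that, shown here as an optional sketch rather than a required step, is to derive label2id from id2label and check the round trip:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Build the reverse mapping from the forward one so they cannot drift apart.
label2id = {label: idx for idx, label in id2label.items()}

# Round-trip check: every id maps to a label that maps back to the same id.
assert all(label2id[id2label[idx]] == idx for idx in id2label)
print(label2id)
```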

PyTorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial first!

You're ready to start training your model now! Load DistilBERT with AutoModelForSequenceClassification along with the number of expected labels and the label mappings:

>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

>>> model = AutoModelForSequenceClassification.from_pretrained(
...     "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
... )

At this point, only three steps remain:

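Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You can push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to finetune your model.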

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_model",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=2,
...     weight_decay=0.01,
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_imdb["train"],
...     eval_dataset=tokenized_imdb["test"],
...     processing_class=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
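
For a rough sense of how long this run is, the sketch below estimates the number of optimizer steps implied by the arguments above. It assumes the IMDb training split has 25,000 examples, a single device, no gradient accumulation, and that the last partial batch is kept; the helper `total_optimizer_steps` is hypothetical and not part of Transformers.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # keep the last partial batch
    return steps_per_epoch * num_epochs


# Assumed split size of 25,000; batch size and epochs match the arguments above.
steps = total_optimizer_steps(num_examples=25_000, batch_size=16, num_epochs=2)
print(steps)  # 1563 steps per epoch * 2 epochs = 3126
```

With several devices or gradient accumulation the number of steps per epoch shrinks accordingly, so treat the figure as an estimate only.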

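Trainer applies dynamic padding by default when you pass the tokenizer to it (through processing_class in the code above). In that case, you don't need to specify a data collator explicitly.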

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial first!

To finetune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer
>>> import tensorflow as tf

>>> batch_size = 16
>>> num_epochs = 5
>>> batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
>>> total_train_steps = int(batches_per_epoch * num_epochs)
>>> optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
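
The shape of the resulting schedule can be sketched in plain Python. This assumes that, with num_warmup_steps=0, the learning rate decays linearly from init_lr to zero over total_train_steps, which is our understanding of the default behavior and worth verifying against the create_optimizer documentation. The helper `linear_decay_lr` is hypothetical, and the 25,000-example training split size is also an assumption.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate under an assumed linear decay from init_lr to zero."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


batch_size = 16
num_epochs = 5
batches_per_epoch = 25_000 // batch_size            # assumed split size -> 1562
total_train_steps = batches_per_epoch * num_epochs  # 7810

for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step, init_lr=2e-5, total_steps=total_train_steps))
```

Under this assumption, stopping training before total_train_steps means the learning rate never reaches zero, so the number of epochs used to build the schedule and the number passed to fit are best kept in agreement.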

Then you can load DistilBERT with TFAutoModelForSequenceClassification along with the number of expected labels and the label mappings:

>>> from transformers import TFAutoModelForSequenceClassification

>>> model = TFAutoModelForSequenceClassification.from_pretrained(
...     "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
... )

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_imdb["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_imdb["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

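Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: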

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

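The last two things to set up before you start training are computing the accuracy from the predictions and providing a way to push your model to the Hub. Both are done with Keras callbacks.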

Pass your compute_metrics function to KerasMetricCallback:

>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )

Then bundle your callbacks together:

>>> callbacks = [metric_callback, push_to_hub_callback]

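Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model: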

>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

>>> text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

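The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass your text to it: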

>>> from transformers import pipeline

>>> classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
>>> classifier(text)
[{'label': 'POSITIVE', 'score': 0.9994940757751465}]

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Tokenize the text and return PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
>>> inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> import torch
>>> from transformers import AutoModelForSequenceClassification

>>> model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'POSITIVE'
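
Taking the argmax of the logits picks the same class as taking the argmax of the probabilities, because softmax preserves ordering. The sketch below shows the conversion from logits to a probability score in plain Python, assuming a two-class model whose score comes from a softmax over the logits; the logits are made up and are not the output of the model above.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-3.2, 4.1]  # made-up values for illustration

probs = softmax(example_logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_class_id], probs[predicted_class_id])
```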

TensorFlow

Tokenize the text and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
>>> inputs = tokenizer(text, return_tensors="tf")

Pass your inputs to the model and return the logits:

>>> import tensorflow as tf
>>> from transformers import TFAutoModelForSequenceClassification

>>> model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
>>> logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'POSITIVE'