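Quick tour

🤗 Evaluate provides access to a wide range of evaluation tools. It covers several modalities, such as text, computer vision, and audio, and includes tools for evaluating both models and datasets. These tools fall into three categories.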

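Types of evaluations

Different aspects of a typical machine learning pipeline can be evaluated, and 🤗 Evaluate provides a tool for each of them:

Metric: A metric is used to evaluate a model's performance and usually involves the model's predictions as well as some ground-truth labels. You can find all integrated metrics at evaluate-metric.
Comparison: A comparison is used to compare two models. This can be done, for example, by comparing their predictions to ground-truth labels and computing their agreement. You can find all integrated comparisons at evaluate-comparison.
Measurement: The dataset is as important as the model trained on it. With measurements you can investigate a dataset's properties. You can find all integrated measurements at evaluate-measurement.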
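
To make the difference between the three types concrete, here is a minimal pure-Python sketch. The helper names `toy_accuracy`, `toy_exact_match`, and `toy_word_length` are hypothetical stand-ins chosen for illustration; they are not the library's implementations.

```python
def toy_accuracy(predictions, references):
    """Metric: score one model's predictions against ground-truth labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def toy_exact_match(predictions1, predictions2):
    """Comparison: rate at which two models produce the same prediction."""
    same = sum(a == b for a, b in zip(predictions1, predictions2))
    return same / len(predictions1)


def toy_word_length(texts):
    """Measurement: a property of the dataset itself (mean words per example)."""
    return sum(len(t.split()) for t in texts) / len(texts)


references = [0, 1, 0, 1]
model_a = [1, 0, 0, 1]
model_b = [1, 1, 0, 1]

acc_a = toy_accuracy(model_a, references)       # 0.5
agreement = toy_exact_match(model_a, model_b)   # 0.75
avg_len = toy_word_length(["a short text", "another slightly longer text"])  # 3.5
```

A metric needs references, a comparison needs two sets of predictions, and a measurement needs only the data.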

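Each of these evaluation modules lives on the Hugging Face Hub as a Space. They come with an interactive widget and a documentation card that describes their use and limitations. For an example, see accuracy.

Each metric, comparison, and measurement is a separate Python module, but there is a single entry point for using any of them: evaluate.load().

Load

Any metric, comparison, or measurement is loaded with the `evaluate.load` function: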

>>> import evaluate
>>> accuracy = evaluate.load("accuracy")

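If you want to make sure you are loading the right type of evaluation (especially when there are name clashes), you can pass the type explicitly: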

>>> word_length = evaluate.load("word_length", module_type="measurement")

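Community modules

Besides the modules implemented in 🤗 Evaluate, you can also load any community module by specifying the repository ID of the metric implementation: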

>>> element_count = evaluate.load("lvwerra/element_count", module_type="measurement")

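See the Creating and Sharing Guide for information about uploading custom metrics.

List available modules

With list_evaluation_modules() you can check which modules are available on the Hub. You can filter by module type and skip community metrics if you want. You can also request additional details, such as the number of likes: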

>>> evaluate.list_evaluation_modules(
...   module_type="comparison",
...   include_community=False,
...   with_details=True)

[{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
 {'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]

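Module attributes

All evaluation modules come with a range of useful attributes that help you use the module. They are stored in an EvaluationModuleInfo object.

Attribute	Description
`description`	A short description of the evaluation module.
`citation`	A BibTex string for citation, when available.
`features`	A `Features` object defining the input format.
`inputs_description`	This is equivalent to the module's docstring.
`homepage`	The homepage of the module.
`license`	The license of the module.
`codebase_urls`	Links to the code behind the module.
`reference_urls`	Additional reference URLs.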

Let's look at a few examples. First, the `description` attribute of the accuracy metric:

>>> accuracy = evaluate.load("accuracy")
>>> accuracy.description
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative

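You can see that it describes how the metric works in theory. If you use this metric in your work, especially in an academic publication, you should reference it properly. For that you can look at the `citation` attribute: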

>>> accuracy.citation
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}

Before we can apply a metric or other evaluation module to a use case, we need to know what input format it expects:

>>> accuracy.features
{
    'predictions': Value(dtype='int32', id=None),
    'references': Value(dtype='int32', id=None)
}

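Note that `features` always describes the type of a single input element. In general you will add lists of elements, so you can always think of a list wrapped around the types in `features`. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to a format suitable for storage and computation.

Compute

Now that we know how the evaluation module works and what should go in, we want to actually use it. When it comes to computing the actual score, there are two main ways to do it:

All-in-one
Incremental

In the incremental approach, the necessary inputs are added to the module with EvaluationModule.add() or EvaluationModule.add_batch(), and the score is calculated at the end with EvaluationModule.compute(). Alternatively, you can pass all the inputs at once to `compute()`. Let's have a look at both approaches.

How to compute

The simplest way to calculate the score of an evaluation module is to call `compute()` directly with the necessary inputs. Simply pass the inputs listed in `features` as arguments to the `compute()` method: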

>>> accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
{'accuracy': 0.5}

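Evaluation modules return their results as a dictionary. In some cases, however, you build up the predictions iteratively or in a distributed fashion, and that is where `add()` and `add_batch()` are useful.

Calculate a single metric or a batch of metrics

In many evaluation pipelines you build the predictions iteratively, for example in a for-loop. In that case you could store the predictions in a list and pass them to `compute()` at the end. With `add()` and `add_batch()` you can skip the step of storing the predictions separately. If you create only a single prediction at a time, you can use `add()`: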

>>> for ref, pred in zip([0,1,0,1], [1,0,0,1]):
...     accuracy.add(references=ref, predictions=pred)
>>> accuracy.compute()
{'accuracy': 0.5}

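Once you have gathered all predictions, you can call `compute()` to calculate the score from all the stored values. When predictions and references arrive in batches, you can use `add_batch()`, which adds a list of elements for later processing. The rest works as with `add()`: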

>>> for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
...     accuracy.add_batch(references=refs, predictions=preds)
>>> accuracy.compute()
{'accuracy': 0.5}
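
The accumulate-then-compute pattern can be sketched in a few lines of plain Python. The class `ToyAccuracy` below is a hypothetical stand-in that only mimics the `add()` / `add_batch()` / `compute()` call pattern shown above; it keeps values in in-memory lists and makes no claim about how the library stores them. Clearing the buffers after `compute()` is a choice made for this sketch.

```python
class ToyAccuracy:
    """Hypothetical stand-in mimicking the add / add_batch / compute pattern."""

    def __init__(self):
        self._predictions = []
        self._references = []

    def add(self, *, predictions, references):
        # One prediction/reference pair at a time.
        self._predictions.append(predictions)
        self._references.append(references)

    def add_batch(self, *, predictions, references):
        # A list of predictions and a list of references at a time.
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self):
        correct = sum(p == r for p, r in zip(self._predictions, self._references))
        score = correct / len(self._references)
        # Clear the buffers so the object can be reused for another run.
        self._predictions, self._references = [], []
        return {"accuracy": score}


toy = ToyAccuracy()
toy.add(predictions=1, references=0)
toy.add(predictions=0, references=1)
toy.add_batch(predictions=[0, 1], references=[0, 1])
toy_result = toy.compute()
```

Mixing single additions and batch additions gives the same score as passing everything to one call, because only the accumulated pairs matter.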

This is especially useful when you need to get the predictions from your model in batches:

>>> for model_inputs, gold_standards in evaluation_dataset:
...     predictions = model(model_inputs)
...     metric.add_batch(references=gold_standards, predictions=predictions)
>>> metric.compute()

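Distributed evaluation

Computing metrics in a distributed environment can be tricky. Metric evaluation runs in separate Python processes, or nodes, on different subsets of a dataset. Typically, when a metric score is additive (`f(A∪B) = f(A) + f(B)`), you can use distributed reduce operations to gather the scores for each subset of the dataset. When a metric is non-additive (`f(A∪B) ≠ f(A) + f(B)`), it is not that simple. For example, you cannot take the sum of the F1 scores of each data subset as your **final metric**.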
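
A small worked example shows why F1 cannot be combined across subsets. The sketch below is plain Python; the helper `f1` and the two toy shards are made up for illustration. It uses the form F1 = 2·TP / (2·TP + FP + FN), which is equivalent to the harmonic mean of precision and recall.

```python
def f1(predictions, references, positive=1):
    """Binary F1 computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Two toy shards, as they might be processed on two different nodes.
preds_a, refs_a = [1, 1, 0, 0], [1, 0, 0, 0]
preds_b, refs_b = [0, 0, 0, 1], [1, 1, 1, 1]

f1_a = f1(preds_a, refs_a)                           # 2/3
f1_b = f1(preds_b, refs_b)                           # 0.4
f1_all = f1(preds_a + preds_b, refs_a + refs_b)      # 0.5

f1_sum = f1_a + f1_b
f1_mean = (f1_a + f1_b) / 2
```

Neither the sum nor the mean of the per-shard scores matches the score on the full data, so the raw predictions and references have to be brought together before the metric is computed.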

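A common way to work around this issue is to fall back on single-process evaluation. The metrics are then evaluated on a single GPU, which becomes inefficient.

🤗 Evaluate solves this issue by computing the final metric only on the first node. The predictions and references are computed and provided to the metric separately on each node. They are temporarily stored in an Apache Arrow table, which avoids cluttering GPU or CPU memory. When you are ready to `compute()` the final metric, the first node can access the predictions and references stored on all the other nodes. Once it has gathered all of them, `compute()` performs the final metric evaluation.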
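
The store-per-node, compute-on-the-first-node idea can be sketched with the standard library alone. This is only an illustration under loud assumptions: it uses JSON files in a temporary directory in place of Apache Arrow tables, simulates the nodes sequentially in one process, and the helpers `write_shard` and `gather_and_compute` are hypothetical names, not library functions.

```python
import json
import tempfile
from pathlib import Path


def write_shard(cache_dir, rank, predictions, references):
    """Each node stores its own predictions and references on disk."""
    path = Path(cache_dir) / f"shard-{rank}.json"
    path.write_text(json.dumps({"predictions": predictions, "references": references}))


def gather_and_compute(cache_dir):
    """The first node reads every shard and computes the metric once."""
    predictions, references = [], []
    for path in sorted(Path(cache_dir).glob("shard-*.json")):
        shard = json.loads(path.read_text())
        predictions.extend(shard["predictions"])
        references.extend(shard["references"])
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


with tempfile.TemporaryDirectory() as cache_dir:
    write_shard(cache_dir, 0, predictions=[1, 0], references=[0, 1])  # "node 0"
    write_shard(cache_dir, 1, predictions=[0, 1], references=[0, 1])  # "node 1"
    gathered_result = gather_and_compute(cache_dir)
```

Because the metric is computed once over the gathered data, the same approach works for non-additive metrics.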

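This solution allows 🤗 Evaluate to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can use complex non-additive metrics without wasting valuable GPU or CPU memory.

Combining several evaluations

Often you want to evaluate not just a single metric but a range of metrics that capture different aspects of a model. For a classification task, for example, it is usually a good idea to compute F1 score, recall, and precision in addition to accuracy to get a fuller picture of model performance. You could of course load several metrics and call them one after another. A more convenient way is to bundle them together with the combine() function: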

>>> clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

The `combine` function accepts both a list of metric names and instantiated modules. A single `compute` call then computes every metric:

>>> clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{
  'accuracy': 0.667,
  'f1': 0.667,
  'precision': 1.0,
  'recall': 0.5
}
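
The four values above can be checked by hand from the confusion counts. The following plain-Python sketch uses the standard definitions for the binary case with label 1 as the positive class; it is a verification aid, not the library's implementation.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]
pairs = list(zip(predictions, references))

# Confusion counts with label 1 as the positive class.
tp = sum(p == 1 and r == 1 for p, r in pairs)  # 1
tn = sum(p == 0 and r == 0 for p, r in pairs)  # 1
fp = sum(p == 1 and r == 0 for p, r in pairs)  # 0
fn = sum(p == 0 and r == 1 for p, r in pairs)  # 1

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

scores = {
    "accuracy": round(accuracy, 3),
    "f1": round(f1, 3),
    "precision": round(precision, 3),
    "recall": round(recall, 3),
}
```

Rounded to three decimals, `scores` matches the dictionary shown above.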

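Save and push to the Hub

Saving and sharing evaluation results is an important step. We provide the evaluate.save() function to easily save metric results. You can pass either a specific filename or a directory. In the latter case, the results are saved in a file with an automatically created filename. Besides the directory or filename, the function accepts any key-value pairs as inputs and stores them in a JSON file.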

>>> result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])

>>> hyperparams = {"model": "bert-base-uncased"}
>>> evaluate.save("./results/", experiment="run 42", **result, **hyperparams)
PosixPath('results/result-2022_05_30-22_09_11.json')

The content of the JSON file looks like this:

{
    "experiment": "run 42",
    "accuracy": 0.5,
    "model": "bert-base-uncased",
    "_timestamp": "2022-05-30T22:09:11.959469",
    "_git_commit_hash": "123456789abcdefghijkl",
    "_evaluate_version": "0.1.0",
    "_python_version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]",
    "_interpreter_path": "/Users/leandro/git/evaluate/env/bin/python"
}

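In addition to the specified fields, it also contains useful system information for reproducing the results.

Besides storing the results locally, you should report them on the model's repository on the Hub. With the evaluate.push_to_hub() function, you can easily report evaluation results to the model's repository: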
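
The idea of storing results together with reproducibility metadata can be sketched with the standard library. The function `toy_save` below is a hypothetical helper written for illustration; it records only a subset of the underscore-prefixed fields shown above and does not reproduce the behavior of `evaluate.save()` in detail.

```python
import json
import sys
import tempfile
from datetime import datetime
from pathlib import Path


def toy_save(directory, **result):
    """Write key-value pairs plus basic system information to a timestamped JSON file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    record = dict(result)
    record["_timestamp"] = now.isoformat()
    record["_python_version"] = sys.version
    record["_interpreter_path"] = sys.executable
    path = directory / f"result-{now.strftime('%Y_%m_%d-%H_%M_%S')}.json"
    path.write_text(json.dumps(record, indent=4))
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved_path = toy_save(
        Path(tmp) / "results",
        experiment="run 42",
        accuracy=0.5,
        model="bert-base-uncased",
    )
    saved = json.loads(saved_path.read_text())
```

Keeping the metadata in the same file as the score means a result can later be traced back to the environment that produced it.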

evaluate.push_to_hub(
  model_id="huggingface/gpt2-wikitext2",  # model repository on hub
  metric_value=0.5,                       # metric value
  metric_type="bleu",                     # metric name, e.g. accuracy.name
  metric_name="BLEU",                     # pretty name which is displayed
  dataset_type="wikitext",                # dataset name on the hub
  dataset_name="WikiText",                # pretty name
  dataset_split="test",                   # dataset split used
  task_type="text-generation",            # task id, see https://github.com/huggingface/evaluate/blob/main/src/evaluate/config.py#L154-L192
  task_name="Text Generation"             # pretty name for task
)

Evaluator

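evaluate.evaluator() provides automated evaluation and only requires a model, a dataset, and a metric. This is in contrast to the metrics in `EvaluationModule`, which require the model's predictions. It therefore makes it easier to evaluate a model on a dataset with a given metric, because inference is handled internally. To make that possible, it uses the pipeline abstraction from `transformers`. You can also use your own framework, as long as it follows the `pipeline` interface.

To evaluate with the `evaluator`, let's first load a `transformers` pipeline (you can also pass a custom inference class from any framework, as long as it follows the pipeline call API) with a model trained on IMDb, along with the IMDb test split and the accuracy metric.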

from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0)
data = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")

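Then you can create an evaluator for text classification and pass the three objects to the `compute()` method. With the label mapping, `evaluate` provides a way to align the pipeline outputs with the label column in the dataset: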

>>> task_evaluator = evaluator("text-classification")

>>> results = task_evaluator.compute(model_or_pipeline=pipe, data=data, metric=metric,
...                        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},)

>>> print(results)
{'accuracy': 0.934}
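
What the label mapping does can be illustrated with plain Python. The outputs below are made-up examples in the list-of-dicts shape that a text-classification pipeline returns (a `label` string and a `score`); the alignment code is a sketch of the idea, not the evaluator's internal logic.

```python
label_mapping = {"NEGATIVE": 0, "POSITIVE": 1}

# Made-up pipeline outputs for three texts.
pipeline_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.65},
]
references = [1, 0, 0]  # integer labels as stored in the dataset's label column

# Map the string labels onto the dataset's integer labels before scoring.
mapped_predictions = [label_mapping[output["label"]] for output in pipeline_outputs]
mapped_accuracy = sum(
    p == r for p, r in zip(mapped_predictions, references)
) / len(references)
```

Without the mapping, the string labels from the pipeline could not be compared with the integer labels in the dataset.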

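Calculating the value of the metric alone is often not enough to know whether one model performs significantly better than another. With *bootstrapping*, `evaluate` computes confidence intervals and the standard error, which helps estimate how stable a score is: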

>>> results = task_evaluator.compute(model_or_pipeline=pipe, data=data, metric=metric,
...                        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
...                        strategy="bootstrap", n_resamples=200)

>>> print(results)
{'accuracy':
    {
      'confidence_interval': (0.906, 0.9406749892841922),
      'standard_error': 0.00865213251082787,
      'score': 0.923
    }
}
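
To show where a confidence interval and standard error can come from, here is a minimal percentile-bootstrap sketch in plain Python. It is an assumption-laden illustration: the function `bootstrap_accuracy` is hypothetical, the toy data is made up, and the simple percentile method used here may differ from the method the library applies internally.

```python
import random
import statistics


def bootstrap_accuracy(predictions, references, n_resamples=200,
                       confidence_level=0.95, seed=0):
    """Percentile bootstrap for accuracy: resample pairs with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(predictions, references))

    def accuracy(sample):
        return sum(p == r for p, r in sample) / len(sample)

    # Score each resampled dataset, then sort the scores to read off percentiles.
    resampled = sorted(
        accuracy(rng.choices(pairs, k=len(pairs))) for _ in range(n_resamples)
    )
    alpha = (1 - confidence_level) / 2
    lower = resampled[int(alpha * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return {
        "score": accuracy(pairs),
        "confidence_interval": (lower, upper),
        "standard_error": statistics.stdev(resampled),
    }


toy_references = [0, 1] * 20
# Flip the first 10 labels so that 30 of the 40 predictions are correct.
toy_predictions = [1 - r for r in toy_references[:10]] + toy_references[10:]
bootstrap_result = bootstrap_accuracy(toy_predictions, toy_references)
```

A wide interval signals that the score depends heavily on which examples happen to be in the evaluation set.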

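The evaluator expects `"text"` and `"label"` columns in the data input. If your dataset differs, you can provide the column names with the keywords `input_column="text"` and `label_column="label"`. Currently only `"text-classification"` is supported, with more tasks to be added in the future.

Visualization

When comparing several models, it can be hard to spot the differences in their performance just by looking at their scores. Often there is also no single best model; instead there are trade-offs, for example between latency and accuracy, because larger models may perform better but are also slower. We are gradually adding visualization approaches, such as plots, to make it easier to choose the best model for a specific use case.

For instance, if you have a list of results from multiple models (as dictionaries), you can feed them into the `radar_plot()` function: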

>>> import evaluate
>>> from evaluate.visualization import radar_plot

>>> data = [
...    {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
...    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
...    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
...    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6}
...    ]
>>> model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]
>>> plot = radar_plot(data=data, model_names=model_names)
>>> plot.show()

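This lets you compare the 4 models visually and choose the one that suits you best, based on one or several metrics.

Running evaluation on a suite of tasks

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The EvaluationSuite enables evaluation of models on a collection of tasks. Tasks can be constructed as (evaluator, dataset, metric) tuples and passed to an EvaluationSuite stored on the Hugging Face Hub as a Space, or locally as a Python script. See the evaluator documentation for a list of currently supported tasks.

An `EvaluationSuite` script can be defined as follows, and supports Python code for data preprocessing.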

import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)

        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]

Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline:

>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
>>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")

Accuracy	Total time (seconds)	Samples per second	Latency (seconds)	Task name
0.3	4.62804	2.16074	0.462804	imdb
0	0.686388	14.569	0.0686388	sst2
