Evaluate LLMs with Hugging Face Lighteval on Amazon SageMaker

In this SageMaker example, we are going to learn how to evaluate LLMs using Hugging Face lighteval. LightEval is a lightweight LLM evaluation suite that powers the Hugging Face Open LLM Leaderboard.

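Evaluating LLMs is crucial for understanding their capabilities and limitations, yet their complex and opaque nature makes it a significant challenge. LightEval facilitates this process by enabling LLMs to be evaluated on academic benchmarks such as MMLU or IFEval, providing a structured approach to measuring their performance across diverse tasks.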

You will learn how to:

1. Set up the development environment
2. Prepare the evaluation configuration
3. Evaluate Zephyr 7B on TruthfulQA on Amazon SageMaker

!pip install sagemaker --upgrade --quiet

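If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.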

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Prepare the evaluation configuration

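LightEval includes scripts to evaluate LLMs on common benchmarks such as MMLU, TruthfulQA, and IFEval. It is used to evaluate models on the Hugging Face Open LLM Leaderboard. lighteval is built on top of the great Eleuther AI Harness, with some additional features and improvements.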

You can find all available benchmarks here.

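We are going to use Amazon SageMaker managed training to evaluate the model, so we will leverage the script available in lighteval. The Hugging Face DLC does not come with lighteval installed, which means we need to provide a `requirements.txt` file to install the required dependencies.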

First, let's load the `run_evals_accelerate.py` script and create a `requirements.txt` file with the required dependencies.

import os
import requests as r

lighteval_version = "0.2.0"

# create scripts directory if not exists
os.makedirs("scripts", exist_ok=True)

# load custom scripts from git
raw_github_url = (
    f"https://raw.githubusercontent.com/huggingface/lighteval/v{lighteval_version}/run_evals_accelerate.py"
)
res = r.get(raw_github_url)
res.raise_for_status()  # fail early instead of saving an HTTP error page as the script
with open("scripts/run_evals_accelerate.py", "w") as f:
    f.write(res.text)

# write requirements.txt
with open("scripts/requirements.txt", "w") as f:
    f.write(f"lighteval=={lighteval_version}")

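In lighteval, the evaluation is done by running the `run_evals_accelerate.py` script. The script accepts a `task` argument, defined as `suite|task|num_few_shot|{0 or 1 to automatically reduce num_few_shot if prompt is too long}`. Alternatively, you can provide the path to a txt file listing the tasks to evaluate the model on, which is what we will do. This makes it easier to extend the evaluation to other benchmarks.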
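
To make the four `|`-separated fields of a task string concrete, here is a minimal sketch that splits one into named parts. `TaskSpec` and `parse_task_spec` are hypothetical helpers written for illustration only; they are not part of lighteval, which does its own parsing internally.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Named view of a 'suite|task|num_few_shot|0 or 1' string (illustrative only)."""
    suite: str
    task: str
    num_few_shot: int
    auto_reduce_few_shot: bool


def parse_task_spec(spec: str) -> TaskSpec:
    """Split a task string into its four fields and validate them."""
    parts = spec.strip().split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {spec!r}")
    suite, task, few_shot, reduce_flag = parts
    if reduce_flag not in ("0", "1"):
        raise ValueError(f"last field must be 0 or 1, got {reduce_flag!r}")
    return TaskSpec(suite, task, int(few_shot), reduce_flag == "1")


spec = parse_task_spec("lighteval|truthfulqa:mc|0|0")
print(spec.suite, spec.task, spec.num_few_shot, spec.auto_reduce_few_shot)
```

A check like this can be run over every line of a tasks file before submitting a job, so that a malformed entry is caught locally rather than after the instance has started.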

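We are going to evaluate the model on the TruthfulQA benchmark with 0 few-shot examples. TruthfulQA is a benchmark designed to measure whether a language model generates truthful answers to questions; it comprises 817 questions across 38 categories, including health, law, finance, and politics.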

with open("scripts/tasks.txt", "w") as f:
    f.write("lighteval|truthfulqa:mc|0|0")

To evaluate a model on all the benchmarks of the Open LLM Leaderboard, you can copy this file.

3. Evaluate Zephyr 7B on TruthfulQA on Amazon SageMaker

In this example, we are going to evaluate HuggingFaceH4/zephyr-7b-beta on the TruthfulQA benchmark, which is part of the Open LLM Leaderboard.

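In addition to the `task` argument, we need to define:

`model_args`: the Hugging Face model ID or path, defined as `pretrained=HuggingFaceH4/zephyr-7b-beta`
`model_dtype`: the model data type, defined as `bfloat16`, `float16`, or `float32`
`output_dir`: the directory where the evaluation results will be saved, e.g. `/opt/ml/model`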
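
These values reach the script as command-line arguments. As a rough sketch, assuming SageMaker script mode forwards each hyperparameter as a `--name value` pair, the hypothetical helper below (not a SageMaker or lighteval function) shows the argument list the entry point would see:

```python
def hyperparameters_to_argv(hyperparameters: dict) -> list:
    """Flatten a hyperparameter dict into '--name value' pairs (illustrative only)."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend([f"--{name}", str(value)])
    return argv


example_hyperparameters = {
    "model_args": "pretrained=HuggingFaceH4/zephyr-7b-beta",
    "task": "tasks.txt",
    "model_dtype": "bfloat16",
    "output_dir": "/opt/ml/model",
}

print(" ".join(hyperparameters_to_argv(example_hyperparameters)))
```

Seen this way, the job is roughly equivalent to running `run_evals_accelerate.py` locally with the same four flags, which can be handy for debugging a configuration before paying for an instance.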

Lighteval can also evaluate peft models or use `chat_templates`; you can find out more about it here.

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    "model_args": "pretrained=HuggingFaceH4/zephyr-7b-beta",  # Hugging Face Model ID
    "task": "tasks.txt",  # 'lighteval|truthfulqa:mc|0|0',
    "model_dtype": "bfloat16",  # Torch dtype to load model weights
    "output_dir": "/opt/ml/model",  # Directory, which sagemaker uploads to s3 after training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="run_evals_accelerate.py",  # train script
    source_dir="scripts",  # directory which includes all the files needed for training
    instance_type="ml.g5.4xlarge",  # instances type used for the training job
    instance_count=1,  # the number of instances used for training
    base_job_name="lighteval",  # the name of the training job
    role=role,  # IAM role used in training job to access AWS resources, e.g. S3
    volume_size=300,  # the size of the EBS volume in GB
    transformers_version="4.36",  # the transformers version used in the training job
    pytorch_version="2.1",  # the pytorch version used in the training job
    py_version="py310",  # the python version used in the training job
    hyperparameters=hyperparameters,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",
        # "HF_TOKEN": "REPLACE_WITH_YOUR_TOKEN" # needed for private models
    },  # set env variable to cache models in /tmp
)

We can now start our evaluation job with the `.fit()` method.

# start the evaluation job; no input data channels are needed
huggingface_estimator.fit()

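After the evaluation job has finished, we can download the evaluation results from the S3 bucket. Lighteval saves the results and generations in the `output_dir`. The results are stored as JSON and include detailed information about each task and the model's performance. The results are available under the `results` key.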

import tarfile
import json
import io
import os
from sagemaker.s3 import S3Downloader


# download results from s3
results_tar = S3Downloader.read_bytes(huggingface_estimator.model_data)
model_id = hyperparameters["model_args"].split("=")[1]
result = {}

# Use tarfile to open the tar content directly from bytes
with tarfile.open(fileobj=io.BytesIO(results_tar), mode="r:gz") as tar:
    # Iterate over items in tar archive to find your json file by its path
    for member in tar.getmembers():
        # get path of results based on model id used to evaluate
        if os.path.join("details", model_id) in member.name and member.name.endswith(".json"):
            # Extract the file content
            f = tar.extractfile(member)
            if f is not None:
                content = f.read()
                result = json.loads(content)
                break

# print results
print(result["results"])
# {'lighteval|truthfulqa:mc|0': {'truthfulqa_mc1': 0.40636474908200737, 'truthfulqa_mc1_stderr': 0.017193835812093897, 'truthfulqa_mc2': 0.5747003398184238, 'truthfulqa_mc2_stderr': 0.015742356478301463}}
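
The raw dictionary is easier to read as percentages. The following sketch assumes the structure shown in the sample output above (metric names paired with `_stderr` entries); `summarize_results` is a hypothetical helper, and `sample_results` simply copies the values printed above.

```python
sample_results = {
    "lighteval|truthfulqa:mc|0": {
        "truthfulqa_mc1": 0.40636474908200737,
        "truthfulqa_mc1_stderr": 0.017193835812093897,
        "truthfulqa_mc2": 0.5747003398184238,
        "truthfulqa_mc2_stderr": 0.015742356478301463,
    }
}


def summarize_results(results: dict) -> list:
    """Return one 'task metric: score% (+/- stderr)' line per metric."""
    lines = []
    for task_name, metrics in results.items():
        for metric, value in metrics.items():
            if metric.endswith("_stderr"):
                continue  # reported alongside the metric it belongs to
            line = f"{task_name} {metric}: {value * 100:.2f}%"
            stderr = metrics.get(f"{metric}_stderr")
            if stderr is not None:
                line += f" (+/- {stderr * 100:.2f})"
            lines.append(line)
    return lines


for line in summarize_results(sample_results):
    print(line)
# lighteval|truthfulqa:mc|0 truthfulqa_mc1: 40.64% (+/- 1.72)
# lighteval|truthfulqa:mc|0 truthfulqa_mc2: 57.47% (+/- 1.57)
```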

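In our test we achieved an `mc1` score of 40.6% and an `mc2` score of 57.47%. `mc2` is the score used in the Open LLM Leaderboard. Zephyr 7B achieved an `mc2` score of 57.47% on the TruthfulQA benchmark, which is identical to the score on the Open LLM Leaderboard. The evaluation on TruthfulQA took `999 seconds`. The ml.g5.4xlarge instance we used costs `$2.03 per hour` for on-demand usage. As a result, the total cost of evaluating Zephyr 7B on TruthfulQA was `$0.56`.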
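
The cost figure follows directly from the runtime and the hourly price. A minimal sketch of the arithmetic, assuming the charge is proportional to the runtime in seconds (`on_demand_cost` is a hypothetical helper; the two inputs are the numbers quoted above):

```python
def on_demand_cost(runtime_seconds: float, hourly_price_usd: float) -> float:
    """Cost of a job billed in proportion to its runtime."""
    return runtime_seconds / 3600 * hourly_price_usd


cost = on_demand_cost(runtime_seconds=999, hourly_price_usd=2.03)
print(f"${cost:.2f}")  # $0.56
```

The same function can be used to estimate a larger run: multiply the per-benchmark runtime by the number of benchmarks, or swap in the hourly price of a different instance type.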

📍 Find the complete example on GitHub here!

