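Using Hugging Face libraries on AMD GPUs

Hugging Face libraries natively support AMD Instinct MI210, MI250, and MI300 GPUs. Support for other ROCm-powered GPUs has not been validated yet, but most features are expected to work smoothly.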
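Before trying any of the integrations, it can help to confirm that the installed PyTorch build actually sees a ROCm device. The following is a minimal sketch; the helper name `describe_rocm_environment` is made up for illustration, and it assumes a ROCm build of PyTorch, where `torch.version.hip` is set and AMD GPUs are exposed through the `cuda` device API.

```python
import importlib.util


def describe_rocm_environment():
    """Summarize whether the installed PyTorch build can see a ROCm GPU."""
    info = {
        "torch_installed": False,
        "hip_version": None,
        "gpu_available": False,
        "device_name": None,
    }
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    info["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm builds reuse the "cuda" device API, so this call also covers AMD GPUs.
    info["gpu_available"] = bool(torch.cuda.is_available())
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_rocm_environment())
```

If `hip_version` is `None` while a GPU is reported, the PyTorch build is most likely a CUDA build rather than a ROCm one.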

The integrations are summarized below.

Flash Attention 2

Flash Attention 2 is available on ROCm through the ROCm/flash-attention library (validated on MI210, MI250, and MI300) and can be used in Transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        torch_dtype=torch.float16,
        # Request the Flash Attention 2 kernels.
        attn_implementation="flash_attention_2",
    )

We recommend using this example Dockerfile to run Flash Attention on ROCm, or following the official installation instructions.
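When the same script has to run on machines with and without the Flash Attention package, one option is to pick the attention backend at runtime. This is a sketch rather than an official recipe: the helper name `pick_attn_implementation` is hypothetical, it assumes the ROCm/flash-attention build installs under the module name `flash_attn`, and it assumes the model supports the `"sdpa"` fallback.

```python
import importlib.util


def pick_attn_implementation():
    """Return a value for the `attn_implementation` argument of from_pretrained."""
    # Only checks that the package is importable, not that its kernels
    # were built for the GPU in this machine.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Fall back to PyTorch's scaled-dot-product attention.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
```

The returned string can be passed as `attn_implementation=pick_attn_implementation()` when loading the model.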

GPTQ quantization

GPTQ-quantized models can be loaded in Transformers, which uses the AutoGPTQ library as the backend:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

# The GPTQ quantization settings are read from the checkpoint's config.
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GPTQ",
        torch_dtype=torch.float16,
    )

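Hosted wheels are available for ROCm; see the installation instructions.

Text Generation Inference library

Hugging Face's Text Generation Inference (TGI) library is designed for low-latency LLM serving and natively supports AMD Instinct MI210, MI250, and MI300 GPUs. See the Quick Tour section for more details.

Using TGI on ROCm with AMD Instinct MI210, MI250, or MI300 GPUs is as simple as using the Docker image ghcr.io/huggingface/text-generation-inference:latest-rocm.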
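To make this concrete, the sketch below builds a `docker run` command for the ROCm image and a request for the server's `/generate` endpoint. The helper names, the host port, the data directory, and the shared-memory size are illustrative assumptions; the `/dev/kfd` and `/dev/dri` device flags are the usual way to expose AMD GPUs to a container, but the exact flags for a given setup should be checked against the TGI documentation.

```python
import json
import os
import urllib.request

IMAGE = "ghcr.io/huggingface/text-generation-inference:latest-rocm"


def build_docker_command(model_id, volume=None, port=8080):
    """Build an argument list that launches TGI on a ROCm machine."""
    # Assumed location for cached weights; any absolute path works.
    volume = volume or os.path.abspath("data")
    return [
        "docker", "run", "--rm",
        # Expose the AMD GPU devices to the container.
        "--device=/dev/kfd", "--device=/dev/dri",
        "--group-add", "video",
        "--ipc=host", "--shm-size", "1g",  # shared-memory size is an assumption
        "-p", f"{port}:80",
        "-v", f"{volume}:/data",
        IMAGE,
        "--model-id", model_id,
    ]


def build_generate_request(prompt, base_url="http://localhost:8080", max_new_tokens=32):
    """Build a POST request for TGI's /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt to a running TGI server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]


if __name__ == "__main__":
    print(" ".join(build_docker_command("tiiuae/falcon-7b")))
```

Once the container is up, `generate("What is ROCm?")` sends a single request to it. The command is returned as a list so it can be handed to `subprocess.run` without shell quoting issues.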

Detailed benchmarks of TGI on MI300 GPUs will be published soon.

ONNX Runtime integration

🤗 Optimum supports running Transformers and Diffusers models through ONNX Runtime on ROCm-powered AMD GPUs. It is as simple as:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

ort_model = ORTModelForSequenceClassification.from_pretrained(
  "distilbert-base-uncased-finetuned-sst-2-english",
  export=True,
  provider="ROCMExecutionProvider",
)

inp = tokenizer("Both the music and visual were astounding, not to mention the actors performance.", return_tensors="np")
result = ort_model(**inp)
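The example above only runs on the GPU if the installed ONNX Runtime build ships the ROCm execution provider. A small check like the one below can catch a CPU-only install early. It is a sketch: the helper name `rocm_provider_available` is hypothetical, and only `onnxruntime.get_available_providers()` is an actual ONNX Runtime call.

```python
import importlib.util

ROCM_PROVIDER = "ROCMExecutionProvider"


def rocm_provider_available():
    """Return True if the installed ONNX Runtime build offers the ROCm provider."""
    if importlib.util.find_spec("onnxruntime") is None:
        return False

    import onnxruntime

    # Lists the execution providers compiled into this ONNX Runtime build.
    return ROCM_PROVIDER in onnxruntime.get_available_providers()


if __name__ == "__main__":
    print(rocm_provider_available())
```

If this returns `False`, the installed ONNX Runtime package was probably built without ROCm support.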

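Check out this guide for more details on the support.

Bitsandbytes quantization

Bitsandbytes (integrated in HF's Transformers and Text Generation Inference) does not currently support ROCm officially. We are working to validate its compatibility with ROCm and the Hugging Face libraries.

In the meantime, advanced users may want to use the ROCm/bitsandbytes fork for now. See #issuecomment for more details.

AWQ quantization

AWQ quantization, which is supported in both Transformers and Text Generation Inference, is now supported on AMD GPUs through Exllama kernels. With recent optimizations, AWQ models are converted to Exllama/GPTQ-format models at load time. This lets AMD ROCm devices benefit from both the high quality of AWQ checkpoints and the speed of the ExllamaV2 kernels.

See AutoAWQ for more details.

Note: Make sure the PyTorch version you use matches the one the kernels were built with.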
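As an illustration of the AWQ path, the sketch below loads an AWQ checkpoint in Transformers with the Exllama kernels selected. It assumes a Transformers release in which `AwqConfig` accepts `version="exllama"` and an `exllama_config` dictionary, and that AutoAWQ is installed with kernels built for ROCm. The helper name `load_awq_model` and the default limits are illustrative, not part of any library.

```python
def load_awq_model(model_id, exllama_version=2):
    """Load an AWQ checkpoint with Exllama kernels (assumes AutoAWQ is installed)."""
    # Imported lazily so that defining this helper does not require the GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

    quantization_config = AwqConfig(
        version="exllama",
        # Illustrative limits; adjust to the sequence lengths and batch sizes you serve.
        exllama_config={
            "version": exllama_version,
            "max_input_len": 2048,
            "max_batch_size": 8,
        },
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
    )
    return tokenizer, model
```

Calling `load_awq_model(...)` with the ID of an AWQ checkpoint from the Hub returns the tokenizer and the model placed on the available GPU.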
