Chat basics
Chat models are conversational models you can send and receive messages with. There are many chat models available, but in general, larger models tend to be better, though not always. Model size is often included in the name, like "8B" or "70B", and it describes the number of parameters. Mixture-of-expert (MoE) models have names like "8x7B" or "141B-A35B", which means they're roughly 56B and 141B parameter models respectively. You can try quantizing larger models to reduce memory requirements; otherwise you'll need ~2 bytes of memory per parameter.
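As a back-of-the-envelope check of the ~2 bytes per parameter rule, the sketch below estimates load-time memory in half precision. It is an illustration, not a measurement; real usage adds activations, the KV cache, and framework overhead.

def approx_memory_gb(num_params_billions, bytes_per_param=2):
    # billions of parameters * bytes per parameter = gigabytes
    # ~2 bytes per parameter in half precision (bfloat16/float16)
    return num_params_billions * bytes_per_param

print(approx_memory_gb(8))   # 8B model  -> ~16 GB
print(approx_memory_gb(70))  # 70B model -> ~140 GB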
Check out model leaderboards like OpenLLM and LMSys Chatbot Arena to further help you identify the best chat model for your use case. Models specialized in certain domains (medical, legal text, non-English languages, etc.) can sometimes outperform larger general-purpose models.
Chat with many open-source models for free on HuggingChat!
This guide shows you how to quickly start chatting with Transformers from the command line, how to build and format a conversation, and how to chat using TextGenerationPipeline.
Transformers CLI
After you've installed Transformers, you can chat with a model directly from the command line as shown below. It launches an interactive session with the model, and a few base commands are listed at the start of the session.
transformers chat Qwen/Qwen2.5-0.5B-Instruct

You can launch the CLI with arbitrary generate flags in the format arg_1=value_1 arg_2=value_2 ...
transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
For a full list of options, run the command below.
transformers chat -h
The chat is implemented on top of the AutoClass, using tooling from text generation and chat.
TextGenerationPipeline
TextGenerationPipeline is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.
To start, build a chat history with the following two roles.

- system: describes how the model should behave and respond when you're chatting with it. This role isn't supported by all chat models.
- user: is where you enter your first message to the model.
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
Create the TextGenerationPipeline and pass chat to it. For large models, setting device_map="auto" helps load the model quicker and automatically places it on the fastest device available. Changing the data type to torch.bfloat16 also helps save memory.
import torch
from transformers import pipeline
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright, alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide! So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got some wild stuff, like that Warhol guy's soup cans and all that jazz. And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind. Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks) And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing pizzerias around the city. Just don't try to order a "robot-sized" slice, trust me, it won't end well. (laughs) So, there you have it, pal! That's my expert advice on what to do in New York. Now, if you'll excuse me, I've got some oil changes to attend to. (winks)
Use the append method on chat to respond to the model's message.
chat = response[0]["generated_text"]
chat.append(
{"role": "user", "content": "Wait, what's so wild about soup cans?"}
)
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man! It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!" (sarcastically) Oh, yeah, real original, Andy. But, you know, back in the '60s, it was like, a big deal. People were all about challenging the status quo, and Warhol was like, the king of that. He took the ordinary and made it extraordinary. And, let me tell you, it was like, a real game-changer. I mean, who would've thought that a can of soup could be art? (laughs) But, hey, you're not alone, pal. I mean, I'm a robot, and even I don't get it. (winks) But, hey, that's what makes art, art, right? (laughs)
Performance
Transformers loads models in full precision by default, and for an 8B model, this requires ~32GB of memory! Reduce memory usage by loading a model in half-precision or bfloat16 (which only uses ~2 bytes per parameter). You can even quantize a model to a lower precision, like 8-bit or 4-bit, with bitsandbytes.
Refer to the Quantization docs for more information about the different quantization backends available.
Create a BitsAndBytesConfig with your desired quantization settings and pass it to the pipeline's model_kwargs parameter. The example below quantizes a model to 8-bits.
from transformers import pipeline, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})
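Quantizing to 4-bits works the same way. The sketch below is one possible configuration; setting bnb_4bit_compute_dtype to bfloat16 is an assumption made here for faster compute, not a requirement.

import torch
from transformers import pipeline, BitsAndBytesConfig

# 4-bit weights; matrix multiplications still run in bfloat16
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})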
In general, larger models are slower in addition to requiring more memory because text generation is bottlenecked by **memory bandwidth** instead of compute power. Each active parameter must be read from memory for every generated token. For a 16GB model, 16GB must be read from memory for every generated token.
The number of generated tokens/sec is proportional to the total memory bandwidth of the system divided by the size of the model. Total memory bandwidth varies depending on your hardware; refer to the table below for approximate generation speeds on different hardware types.
| Hardware | Memory bandwidth |
|---|---|
| Consumer CPU | 20-100GB/sec |
| Specialized CPU (Intel Xeon, AMD Threadripper/Epyc, Apple silicon) | 200-900GB/sec |
| Data center GPU (NVIDIA A100/H100) | 2-3TB/sec |
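As a rough illustration of the proportionality above, the sketch below computes the theoretical ceiling only; real throughput is lower once compute and framework overhead are included.

def max_tokens_per_sec(bandwidth_gb_per_sec, model_size_gb):
    # Upper bound: the whole model is read from memory once per generated token
    return bandwidth_gb_per_sec / model_size_gb

# 16GB model (e.g. an 8B model in bfloat16) on different hardware
print(max_tokens_per_sec(50, 16))    # consumer CPU    -> ~3 tokens/sec
print(max_tokens_per_sec(2000, 16))  # data center GPU -> ~125 tokens/sec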
The easiest solutions for improving generation speed are to either quantize a model or use hardware with higher memory bandwidth.
You can also try techniques like speculative decoding, where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token per forward pass. This significantly alleviates the bandwidth bottleneck and improves generation speed, as shown in the sketch below.
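In Transformers, speculative decoding is exposed through the assistant_model argument of generate(). The following is a minimal sketch; the choice of meta-llama/Llama-3.2-1B-Instruct as the draft model is an assumption, and any smaller model sharing the target model's tokenizer works.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
# Smaller draft model that proposes candidate tokens (assumed pairing)
assistant_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Hey, can you tell me any fun things to do in New York?", return_tensors="pt").to(model.device)
# The larger model verifies the drafted tokens in a single forward pass
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))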
Not all parameters need to be active for every generated token in MoE models such as Mixtral, Qwen2MoE, and DBRX. As a result, MoE models generally have lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because parameters become activated with each new speculated token.