Faster Text Generation with TensorFlow and XLA

Published July 27, 2022

Joao Gante

joaogante

TL;DR: Text generation with TensorFlow in 🤗 transformers can now be compiled with XLA. It is up to 100x faster than before, and even faster than PyTorch. Check the colab below!

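Text Generation

As the quality of large language models has increased, so have our expectations of what those models can do. Especially since OpenAI released GPT-2, models with text generation capabilities have been in the spotlight. And for good reason: these models can be used to summarize, translate, and have even demonstrated zero-shot learning abilities on some language tasks. This blog post shows how to make the most of this technology with TensorFlow.

The 🤗 transformers library started out with NLP models, so text generation is of utmost importance to us. It is part of Hugging Face's democratization efforts to ensure that it is accessible, easily controllable, and efficient. A previous blog post covers the different types of text generation. Nevertheless, below is a quick recap of the core functionality. Feel free to skip it if you are familiar with our generate function and want to jump straight into TensorFlow's specifics.

Let's start with the basics. Text generation can be deterministic or stochastic, depending on the do_sample flag. By default it is set to False, which makes the output deterministic; this is also known as Greedy Decoding. When it is set to True, also known as Sampling, the output is stochastic, but you can still obtain reproducible results through the seed argument (which has the same format as in stateless TensorFlow random number generation). As a rule of thumb, you want deterministic generation if you wish to obtain factual information from your model, and stochastic generation if you are aiming for more creative outputs.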

# Requires transformers >= 4.21.0;
# Sampling outputs may differ, depending on your hardware.
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id
inputs = tokenizer(["TensorFlow is"], return_tensors="tf")

generated = model.generate(**inputs, do_sample=True, seed=(42, 0))
print("Sampling output: ", tokenizer.decode(generated[0]))
# > Sampling output: TensorFlow is a great learning platform for learning about
# data structure and structure in data science..

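Depending on the target application, longer outputs might be desirable. You can control the length of the generated output with max_new_tokens, keeping in mind that longer generations require more resources.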

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), max_new_tokens=5
)
print("Limiting to 5 new tokens:", tokenizer.decode(generated[0]))
# > Limiting to 5 new tokens: TensorFlow is a great learning platform for
generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), max_new_tokens=30
)
print("Limiting to 30 new tokens:", tokenizer.decode(generated[0]))
# > Limiting to 30 new tokens: TensorFlow is a great learning platform for
# learning about data structure and structure in data science................

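Sampling has a few knobs you can use to control randomness. The most important is temperature, which sets the overall entropy of your output: values below 1.0 prioritize sampling tokens with a higher likelihood, whereas values above 1.0 do the opposite. As the value approaches 0.0 the behavior reduces to Greedy Decoding, whereas very large values approximate uniform sampling.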

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), temperature=0.7
)
print("Temperature 0.7: ", tokenizer.decode(generated[0]))
# > Temperature 0.7: TensorFlow is a great way to do things like this........
generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), temperature=1.5
)
print("Temperature 1.5: ", tokenizer.decode(generated[0]))
# > Temperature 1.5: TensorFlow is being developed for both Cython and Bamboo.
# On Bamboo...
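
To make the effect of temperature concrete, here is a minimal standalone sketch (not the library's implementation) that divides a set of hypothetical logits by the temperature before applying a softmax. The function name softmax_with_temperature and the toy logits are assumptions made for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return probabilities for `logits` after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.1, 0.7, 1.0, 1.5, 100.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t:>5}: " + ", ".join(f"{p:.3f}" for p in probs))
```

With a small temperature almost all of the probability mass lands on the highest-scoring token, which is why the behavior approaches Greedy Decoding; with a very large temperature the three probabilities become nearly equal.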

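Contrary to Sampling, Greedy Decoding always picks the most likely token at each iteration of generation. However, it often results in sub-optimal outputs. You can improve the quality of the results with the num_beams argument. When it is larger than 1, it triggers Beam Search, which keeps exploring several high-probability sequences at once. This exploration comes at the cost of additional resources and computational time.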

generated = model.generate(**inputs, num_beams=2)
print("Beam Search output:", tokenizer.decode(generated[0]))
# > Beam Search output: TensorFlow is an open-source, open-source,
# distributed-source application framework for the
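
Why can picking the most likely token at every step be sub-optimal? The following toy sketch, which assumes a hypothetical three-entry probability table (TOY_MODEL) rather than a real model, shows a case where the greedy choice leads to a less likely sequence than the one found with two beams. It illustrates the idea only and is not how generate is implemented.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams, steps=2):
    """Keep the `num_beams` highest-scoring prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in TOY_MODEL[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

for n in (1, 2):  # num_beams=1 is equivalent to greedy decoding
    best_prefix, best_score = beam_search(num_beams=n)[0]
    print(f"num_beams={n}: {' '.join(best_prefix)} (p={math.exp(best_score):.2f})")
```

With one beam the search commits to "a" and ends with a sequence probability of 0.33, while two beams keep "b" alive long enough to find "b x" with probability 0.36.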

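Finally, when running Sampling or Beam Search, you can use num_return_sequences to return several sequences. For Sampling, it is equivalent to running generate multiple times from the same input prompt, while for Beam Search it returns the highest-scoring generated beams in descending order.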

generated = model.generate(**inputs, num_beams=2, num_return_sequences=2)
print(
    "All generated hypotheses:",
    "\n".join(tokenizer.decode(out) for out in generated)
)
# > All generated hypotheses: TensorFlow is an open-source, open-source,
# distributed-source application framework for the
# > TensorFlow is an open-source, open-source, distributed-source application
# framework that allows

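As you can see, the basics of text generation are straightforward to control. However, there are many options not covered in the examples above, and reading the documentation is encouraged for advanced use cases. Sadly, when you run generate with TensorFlow, you might notice that it takes a while to execute. If your target application expects low latency or a large number of input prompts, running text generation with TensorFlow looks like an expensive endeavour. 😬

Fear not, for the remainder of this blog post aims to demonstrate that one line of code can make a drastic improvement. If you would rather jump straight into action, the colab has an interactive example you can fiddle with!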

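TensorFlow and XLA

XLA, or Accelerated Linear Algebra, is a compiler originally developed to accelerate TensorFlow models. Nowadays, it is also the compiler behind JAX, and it can even be used with PyTorch. Although the word "compiler" might sound daunting to some, XLA is simple to use with TensorFlow: it comes packaged inside the tensorflow library, and it can be triggered with the jit_compile argument in any graph-creating function.

For those of you familiar with TensorFlow 1 🧓, the concept of a TensorFlow graph comes naturally, as it was the only mode of operation. First, you defined the operations in a declarative fashion to create a graph. Afterwards, you could pipe inputs through the graph and observe the outputs. Fast, efficient, but painful to debug. With TensorFlow 2 came Eager Execution and the ability to code models imperatively; the TensorFlow team explains the difference in more detail in its blog post.

Hugging Face writes its TensorFlow models with Eager Execution in mind. Transparency is a core value, and being able to inspect the model internals at any point is very beneficial to that end. However, that does mean that some uses of the models do not benefit from the graph mode performance advantages out of the box (e.g. when calling model(args)).

Fortunately, the TensorFlow team has users like us covered 🥳! When you wrap a function containing TensorFlow code with tf.function, calling the wrapped function will attempt to convert it into a graph. If you are training a model, calling model.compile() (without run_eagerly=True) does precisely that wrapping, so that you benefit from graph mode when you call model.fit(). Since tf.function can be used on any function containing TensorFlow code, you can also use it on functions that go beyond model inference, creating a single optimized graph.

Now that you know how to create TensorFlow graphs, compiling them with XLA is straightforward: simply add jit_compile=True as an argument to the functions mentioned above (tf.function and tf.keras.Model.compile). Assuming everything went well (more on that below) and that you are using a GPU or a TPU, you will notice that the first call takes a while, but that subsequent calls are much faster. Here is a simple example of a function that performs model inference and some post-processing of its outputs: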

# Note: execution times are deeply dependent on hardware -- a 3090 was used here.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(["TensorFlow is"], return_tensors="tf")

def most_likely_next_token(inputs):
    model_output = model(inputs)
    return tf.argmax(model_output.logits[:, -1, :], axis=-1)

print("Calling regular function with TensorFlow code...")
most_likely_next_token(inputs)
# > Execution time -- 48.8 ms

In one line, you can create an XLA-accelerated function from the function above.

xla_most_likely_next_token = tf.function(most_likely_next_token, jit_compile=True)

print("Calling XLA function... (for the first time -- will be slow)")
xla_most_likely_next_token(inputs)
# > Execution time -- 3951.0 ms
print("Calling XLA function... (for the second time -- will be fast)")
xla_most_likely_next_token(inputs)
# > Execution time -- 1.6 ms
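
The execution times quoted in the comments above can be collected with a small helper such as the one sketched here. The name timed is hypothetical, and the workload is a stand-in so that the snippet runs without TensorFlow, a GPU, or a model download.

```python
import time

def timed(fn, *args, **kwargs):
    """Call `fn` once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return result, elapsed_ms

# Stand-in workload; replace it with the function you want to measure.
result, elapsed_ms = timed(sum, range(1_000_000))
print(f"Execution time -- {elapsed_ms:.1f} ms")
```

Under these assumptions, calling timed(xla_most_likely_next_token, inputs) twice in a row would report the slow first call and the fast second call separately.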

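Text Generation using TensorFlow and XLA

As with any optimization procedure, there is no free lunch, and XLA is no exception. From the perspective of a text generation user, there is only one technical aspect that you need to keep in mind. Without digging too much into the details, XLA used in this fashion does just-in-time (JIT) compilation of a tf.function when you call it, which relies on polymorphism.

When you compile a function this way, XLA keeps track of the shape and type of every tensor, as well as the data of every non-tensor function input. The function is compiled to a binary, and every time it is called with the same tensor shape and type (with ANY tensor data) and the same non-tensor arguments, the compiled function can be reused. Conversely, if you call the function with a different shape or type in an input tensor, or if you use a different non-tensor argument, then a new costly compilation step will take place. Summarized in a simple example: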

# Note: execution times are deeply dependent on hardware -- a 3090 was used here.
import tensorflow as tf

@tf.function(jit_compile=True)
def max_plus_constant(tensor, scalar):
    return tf.math.reduce_max(tensor) + scalar

# Slow: XLA compilation will kick in, as it is the first call
max_plus_constant(tf.constant([0, 0, 0]), 1)
# > Execution time -- 520.4 ms

# Fast: Not the first call with this tensor shape, tensor type, and exact same
# non-tensor argument
max_plus_constant(tf.constant([1000, 0, -10]), 1)
# > Execution time -- 0.6 ms

# Slow: Different tensor type
max_plus_constant(tf.constant([0, 0, 0], dtype=tf.int64), 1)
# > Execution time -- 27.1 ms

# Slow: Different tensor shape
max_plus_constant(tf.constant([0, 0, 0, 0]), 1)
# > Execution time -- 25.5 ms

# Slow: Different non-tensor argument
max_plus_constant(tf.constant([0, 0, 0]), 2)
# > Execution time -- 24.9 ms
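
The reuse rule can be pictured as a cache lookup. The sketch below is a toy model in plain Python, not TensorFlow's actual tracing machinery: the names signature, call_with_cache, and compiled_cache are hypothetical, and lists stand in for tensors. It keys a cache on shape, element type, and the value of the non-tensor argument, mirroring the five calls above.

```python
compiled_cache = {}

def signature(tensor, scalar):
    """Toy stand-in for the cache key: shape, dtype, and the non-tensor value."""
    shape = (len(tensor),)
    dtype = type(tensor[0]).__name__
    return (shape, dtype, scalar)

def call_with_cache(tensor, scalar):
    """Report whether this call would reuse a compiled function, then compute."""
    key = signature(tensor, scalar)
    if key in compiled_cache:
        status = "cache hit (fast)"
    else:
        compiled_cache[key] = True  # a real system would store the compiled binary
        status = "compile (slow)"
    return status, max(tensor) + scalar

print(call_with_cache([0, 0, 0], 1))        # first call: compile
print(call_with_cache([1000, 0, -10], 1))   # same shape, type, scalar: cache hit
print(call_with_cache([0.0, 0.0, 0.0], 1))  # different type: compile
print(call_with_cache([0, 0, 0, 0], 1))     # different shape: compile
print(call_with_cache([0, 0, 0], 2))        # different non-tensor argument: compile
```

Only the second call hits the cache, because the tensor data is the only thing that changed.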

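In practice, for text generation, this simply means that the input should be padded to a multiple of a certain length (so that it has a limited number of possible shapes), and that using different options will be slow the first time you use them. Let's see what happens when you naively call generation with XLA.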

# Note: execution times are deeply dependent on hardware -- a 3090 was used here.
import time
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Notice the new argument, `padding_side="left"` -- decoder-only models, which can
# be instantiated with TFAutoModelForCausalLM, should be left-padded, as they
# continue generating from the input prompt.
tokenizer = AutoTokenizer.from_pretrained(
    "gpt2", padding_side="left", pad_token="</s>"
)
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id
input_1 = ["TensorFlow is"]
input_2 = ["TensorFlow is a"]

# One line to create a XLA generation function
xla_generate = tf.function(model.generate, jit_compile=True)

# Calls XLA generation without padding
tokenized_input_1 = tokenizer(input_1, return_tensors="tf")  # length = 4
tokenized_input_2 = tokenizer(input_2, return_tensors="tf")  # length = 5
print(f"`tokenized_input_1` shape = {tokenized_input_1.input_ids.shape}")
print(f"`tokenized_input_2` shape = {tokenized_input_2.input_ids.shape}")

print("Calling XLA generation with tokenized_input_1...")
print("(will be slow as it is the first call)")
start = time.time_ns()
xla_generate(**tokenized_input_1)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 9565.1 ms

print("Calling XLA generation with tokenized_input_2...")
print("(has a different length = will trigger tracing again)")
start = time.time_ns()
xla_generate(**tokenized_input_2)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 6815.0 ms

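Oh no, that's terribly slow! As mentioned above, one solution to keep the different combinations of shapes in check is padding. The tokenizer classes have a pad_to_multiple_of argument that can be used to strike a balance between accepting any input length and limiting tracing.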

padding_kwargs = {"pad_to_multiple_of": 8, "padding": True}
tokenized_input_1_with_padding = tokenizer(
    input_1, return_tensors="tf", **padding_kwargs
)  # length = 8
tokenized_input_2_with_padding = tokenizer(
    input_2, return_tensors="tf", **padding_kwargs
)  # length = 8
print(
    "`tokenized_input_1_with_padding` shape = ",
    f"{tokenized_input_1_with_padding.input_ids.shape}"
)
print(
    "`tokenized_input_2_with_padding` shape = ",
    f"{tokenized_input_2_with_padding.input_ids.shape}"
)

print("Calling XLA generation with tokenized_input_1_with_padding...")
print("(slow, first time running with this length)")
start = time.time_ns()
xla_generate(**tokenized_input_1_with_padding)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 6815.4 ms

print("Calling XLA generation with tokenized_input_2_with_padding...")
print("(will be fast!)")
start = time.time_ns()
xla_generate(**tokenized_input_2_with_padding)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 19.3 ms

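That's much better: successive generation calls performed this way will be orders of magnitude faster than before. Keep in mind that trying new generation options, at any point, will trigger tracing.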

print("Calling XLA generation with the same input, but with new options...")
print("(slow again)")
start = time.time_ns()
xla_generate(**tokenized_input_1_with_padding, num_beams=2)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 9644.2 ms
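
As a side note on the padding step used earlier, the arithmetic behind padding to a multiple of some length can be sketched in a few lines of plain Python. The helper names padded_length and left_pad, the token ids, and the pad id of 0 are hypothetical; the tokenizer performs the real work in the examples above.

```python
def padded_length(length, multiple):
    """Round `length` up to the nearest multiple of `multiple`."""
    return -(-length // multiple) * multiple

def left_pad(token_ids, multiple, pad_id):
    """Left-pad `token_ids` and build the matching attention mask."""
    n_pad = padded_length(len(token_ids), multiple) - len(token_ids)
    ids = [pad_id] * n_pad + list(token_ids)
    mask = [0] * n_pad + [1] * len(token_ids)
    return ids, mask

ids, mask = left_pad([101, 102, 103, 104, 105], multiple=8, pad_id=0)
print(ids)
print(mask)

# Prompts of 1 to 64 tokens collapse into only a handful of distinct shapes.
distinct_shapes = {padded_length(n, 8) for n in range(1, 65)}
print(sorted(distinct_shapes))
```

The trade-off is between wasted computation on padding tokens (larger multiples) and the number of shapes that each need their own compilation (smaller multiples).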

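From a developer's perspective, relying on XLA implies being aware of a few additional nuances. XLA shines when the size of the data structures is known in advance, such as in model training. On the other hand, when their dimensions are impossible to determine or certain dynamic slices are used, XLA fails to compile. Modern implementations of text generation are auto-regressive, and their natural behavior is to expand tensors and to abruptly interrupt some operations as they go. In other words, they are not XLA-friendly by default. We have rewritten our entire TensorFlow text generation codebase to vectorize operations and use fixed-size structures with padding. Our NLP models were also modified to correctly use their positional embeddings in the presence of padded structures. The result should be invisible to TensorFlow text generation users, except for the availability of XLA compilation.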
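
To give a feel for what "fixed-size structures with padding" means, here is a deliberately simplified sketch in plain Python. It is not the library's code: generate_fixed_size and count_up are hypothetical names, and the "model" is a toy rule that returns the previous token plus one. The point is that the buffer never changes size and the loop never exits early; once the end-of-sequence token appears, the remaining slots are filled with padding.

```python
def generate_fixed_size(prompt, next_token, max_length, eos_id, pad_id):
    """Fill a preallocated buffer instead of growing a list token by token."""
    if len(prompt) > max_length:
        raise ValueError("prompt is longer than max_length")
    buffer = [pad_id] * max_length  # allocated once, its size never changes
    buffer[:len(prompt)] = prompt
    finished = False
    for position in range(len(prompt), max_length):  # no early break
        candidate = next_token(buffer, position)
        # After EOS has been produced, keep writing padding instead of stopping.
        token = pad_id if finished else candidate
        buffer[position] = token
        finished = finished or token == eos_id
    return buffer

def count_up(buffer, position):
    """Toy stand-in for a model: the next token is the previous token plus one."""
    return buffer[position - 1] + 1

output = generate_fixed_size([1, 2], count_up, max_length=8, eos_id=5, pad_id=0)
print(output)
```

A growing list with a break statement would be more natural in eager Python, but a loop with a fixed trip count and a fixed-size buffer is the kind of structure a compiler can handle without knowing the data in advance.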

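Benchmarks and Conclusions

Above you saw that you can convert TensorFlow functions into a graph and accelerate them with XLA compilation. Current forms of text generation are simply an auto-regressive function that alternates between a model forward pass and some post-processing, producing one token per iteration. Through XLA compilation, the entire process gets optimized, resulting in faster execution. But how much faster? The Gradio demo below contains some benchmarks comparing Hugging Face's text generation on multiple GPU models for the two main ML frameworks, TensorFlow and PyTorch.

If you explore the results, two conclusions quickly become apparent:

As this blog post has been building up to, TensorFlow text generation is much faster when XLA is used. We are talking about speedups larger than 100x in some cases, which truly demonstrates the power of a compiled graph 🚀
TensorFlow text generation with XLA is the fastest option in the vast majority of cases, in some of them by as much as 9x, debunking the myth that PyTorch is the go-to framework for serious NLP tasks 💪

Give the colab a go, and enjoy the power of text generation supercharged with XLA!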
