XLA

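Accelerated Linear Algebra (XLA) is a linear algebra compiler that optimizes model runtime across different hardware and frameworks.

This guide focuses on how to use XLA to accelerate TensorFlow models.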

TensorFlow

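XLA can accelerate TensorFlow models without any source code changes. It ships with the TensorFlow library and is triggered by passing jit_compile to any graph-creating function such as tf.function.

If you're using Keras methods such as fit and predict, enable XLA by passing jit_compile=True to compile.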

model.compile(jit_compile=True)
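
For context, here is a minimal end-to-end sketch of the Keras path. The toy model, the random data, and the optimizer and loss choices are assumptions made for illustration only; the one setting taken from this guide is jit_compile=True.

```python
import tensorflow as tf

# Toy classifier; the layer sizes are arbitrary and only for illustration.
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),
    ]
)

# jit_compile=True asks Keras to compile the training and prediction steps with XLA.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", jit_compile=True)

# Random data: 64 samples split evenly into batches of 16, so every batch has the same shape.
features = tf.random.normal((64, 10))
labels = tf.random.uniform((64,), maxval=5, dtype=tf.int32)

history = model.fit(features, labels, batch_size=16, epochs=1, verbose=0)
predictions = model.predict(features, batch_size=16, verbose=0)
print(predictions.shape)
```

The sample count is a multiple of the batch size on purpose, so the last batch does not introduce a second input shape.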

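XLA can be used to accelerate any arbitrary tf.function.

Models with a TensorFlow implementation, such as GPT2, T5, OPT, and Whisper, are XLA compatible. The speedup depends on the model, but in general, TensorFlow models in Transformers get a speedup of roughly 100x.

Functions

A typical forward pass in a TensorFlow model is shown below. To run the forward pass with XLA, wrap the model with tf.function and set jit_compile=True.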

import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")]
)
# Generate random inputs for the model.
batch_size = 16
input_vector_dim = 10
random_inputs = tf.random.normal((batch_size, input_vector_dim))

# Run a forward pass.
- _ = model(random_inputs)
+ xla_fn = tf.function(model, jit_compile=True)
+ _ = xla_fn(random_inputs)

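The model's default call function is used to compile the XLA graph. To compile any other model function with XLA, wrap it with tf.function as well. In the line below, my_xla_fn stands in for whichever method you want to compile.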

my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True)

Text generation

You can also compile other model functions with XLA. For example, enable XLA for text generation by wrapping generate() with tf.function.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM
# Will error if the minimal version of Transformers is not installed.
from transformers.utils import check_min_version

check_min_version("4.21.0")

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]

xla_generate = tf.function(model.generate, jit_compile=True)

tokenized_input = tokenizer(input_string, return_tensors="tf")
generated_tokens = xla_generate(**tokenized_input, num_beams=2)

decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
"Generated -- TensorFlow is an open-source, open-source, distributed-source application framework for the"

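Tracing

The first time an XLA-enabled function runs, it tries to infer the computation graph in a process called *tracing*. This step is time-consuming, but any subsequent calls to the function are much faster because the graph doesn't have to be traced again.

To ensure a function is only traced once, the inputs must have the same shape as when the graph was built. This usually isn't an issue for fixed input shapes like images, but it can be an issue for inputs with variable shapes like text.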
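
The effect of input shape on tracing can be observed with a toy function. This is only a sketch: double_and_sum and trace_log are hypothetical names, and the example relies on the fact that Python side effects inside a tf.function run only while the function is being traced.

```python
import tensorflow as tf

trace_log = []  # appended to only during tracing, so its length counts traces


@tf.function(jit_compile=True)
def double_and_sum(x):
    # This Python-level append executes at trace time, not on every call.
    trace_log.append(tuple(x.shape))
    return tf.reduce_sum(x * 2.0)


double_and_sum(tf.ones((2, 8)))   # first call: traced and compiled
double_and_sum(tf.zeros((2, 8)))  # same shape and dtype: reuses the existing graph
double_and_sum(tf.ones((2, 16)))  # new shape: traced again

print(trace_log)
```

Only two entries are recorded for three calls, because the second call matches the shape of the first.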

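One way to handle this is to pad your text so that inputs share the same shape. Configure padding options in the tokenizer, such as pad_to_multiple_of, which rounds the padded length up to a multiple of the given value and so limits the number of distinct input shapes.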

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]

xla_generate = tf.function(model.generate, jit_compile=True)

# Call tokenizer with padding options.
tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf")

generated_tokens = xla_generate(**tokenized_input, num_beams=2)
decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
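
To see why padding to a multiple helps, the sketch below rounds a handful of sequence lengths up to a multiple of 8 in plain Python. The padded_length helper and the sample lengths are hypothetical and only mimic the rounding that pad_to_multiple_of applies to the padded length; it does not call the tokenizer.

```python
def padded_length(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple (hypothetical helper)."""
    return -(-length // multiple) * multiple


# Hypothetical token counts for seven different prompts.
raw_lengths = [3, 5, 8, 9, 13, 16, 17]
padded = [padded_length(n, 8) for n in raw_lengths]

print(padded)  # [8, 8, 8, 16, 16, 16, 24]
# Seven distinct raw lengths collapse into three padded lengths,
# so at most three sequence-length shapes need to be traced.
print(len(set(raw_lengths)), "->", len(set(padded)))  # 7 -> 3
```

A larger multiple means fewer distinct shapes and fewer traces, at the cost of more padding tokens per input.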

In addition to the input shape, any change to the generation options at any point also triggers tracing.
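
The same behavior can be illustrated without loading a model. In the sketch below, scale is a hypothetical toy function standing in for generate(), and factor plays the role of a Python-valued option such as num_beams. tf.function treats Python scalars as part of the function signature, so each new value produces a new trace.

```python
import tensorflow as tf

trace_log = []  # appended to only during tracing


@tf.function(jit_compile=True)
def scale(x, factor):
    # factor is a plain Python float, so it is baked into the traced graph.
    trace_log.append(factor)
    return x * factor


x = tf.ones((2, 8))
scale(x, factor=2.0)  # first call: traced and compiled
scale(x, factor=2.0)  # same option value: reuses the existing graph
scale(x, factor=3.0)  # different option value: traced again

print(trace_log)
```

Keeping generation options fixed between calls avoids paying the tracing cost again.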

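Resources

Learn more about XLA with the following resources.

A notebook demonstrating XLA-compatible encoder-decoder and decoder-only text generation models.
The Faster Text Generation with TensorFlow and XLA blog post compares benchmarks for XLA-compatible models and gives a friendly introduction to XLA in TensorFlow.
The How Hugging Face improved Text Generation performance with XLA blog post discusses the design philosophy behind adding XLA to TensorFlow models in Transformers.
The Introduction to graphs and tf.function guide.
The Better performance with tf.function guide.
The XLA documentation.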
