Faster TensorFlow models in Hugging Face Transformers
Over the last few months, the Hugging Face team has been working hard on improving the TensorFlow models of Transformers to make them more robust and faster. The recent improvements focus mainly on two aspects:
- Computational performance: BERT, RoBERTa, ELECTRA and MPNet have been improved to substantially reduce their computation time. This gain in computational performance is noticeable across all computational aspects: graph/eager mode, TF Serving and CPU/GPU/TPU devices.
- TensorFlow Serving: every TensorFlow model can now be deployed with TensorFlow Serving to benefit from this computational performance gain at inference time.
Computational performance
To demonstrate the gain in computational performance, we ran a thorough benchmark comparing the performance of BERT in v4.2.0 on TensorFlow Serving with the official implementation from Google. The benchmark was run on a V100 GPU with a sequence length of 128 (times in milliseconds):
Batch size | Google implementation | v4.2.0 implementation | Relative difference Google/v4.2.0 |
---|---|---|---|
1 | 6.7 | 6.26 | 6.79% |
2 | 9.4 | 8.68 | 7.96% |
4 | 14.4 | 13.1 | 9.45% |
8 | 24 | 21.5 | 10.99% |
16 | 46.6 | 42.3 | 9.67% |
32 | 83.9 | 80.4 | 4.26% |
64 | 171.5 | 156 | 9.47% |
128 | 338.5 | 309 | 9.11% |
The current BERT implementation in v4.2.0 is roughly 10% faster than the Google implementation. On top of that, it is also twice as fast as the implementation shipped in release 4.1.1.
TensorFlow Serving
The previous section showed that the brand-new BERT model got a dramatic boost in computational performance in the latest version of Transformers. In this section, we will show you step by step how to deploy a BERT model with TensorFlow Serving to benefit from this boost in a production environment.
What is TensorFlow Serving?
TensorFlow Serving belongs to the set of tools provided by TensorFlow Extended (TFX) that makes deploying a model to a server easier than ever. TensorFlow Serving provides two APIs, one that can be called via HTTP requests and another one using gRPC to run inference on the server.
What is a SavedModel?
A SavedModel contains a standalone TensorFlow model, including its weights and its architecture. It does not need the original source code of the model to be run, which makes it useful for sharing or deploying with any backend that supports reading a SavedModel, such as Java, Go, C++ or JavaScript, among others. The internal structure of a SavedModel is represented as follows:
savedmodel
  /assets
    -> here the assets needed by the model (if any)
  /variables
    -> here the model checkpoints that contain the weights
  saved_model.pb -> protobuf file representing the model graph
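If you want to check what a given SavedModel exposes, you can load it back in Python and print its serving signature. Below is a minimal sketch (the path is a placeholder for wherever your SavedModel lives):

import tensorflow as tf

# Load the SavedModel back into Python (the path is a placeholder)
loaded = tf.saved_model.load("path/to/savedmodel")

# The "serving_default" signature describes the inputs/outputs exposed to TensorFlow Serving
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # expected input specs
print(serving_fn.structured_outputs)          # expected output specs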
How to install TensorFlow Serving?
There are three ways to install and use TensorFlow Serving:
- through a Docker container,
- through an apt package,
- or using pip.
To make things easier and compliant with all the existing operating systems, we will use Docker in this tutorial.
How to create a SavedModel?
A SavedModel is the format expected by TensorFlow Serving. Since Transformers v4.2.0, creating a SavedModel comes with three additional features:
- The sequence length can be modified freely between runs.
- All model inputs are available for inference.
- The hidden states or the attentions are now grouped into a single output when `output_hidden_states=True` or `output_attentions=True` is set (see the sketch after this list).
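As an illustration of the last point, here is a minimal sketch of how a SavedModel that exposes the grouped attentions could be created; it assumes that the output_attentions flag passed to from_pretrained is forwarded to the model config, and the output folder name is arbitrary:

from transformers import TFBertForSequenceClassification

# Load the model with output_attentions=True so that the (grouped) attentions become
# part of the SavedModel outputs (the flag is forwarded to the model config)
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", output_attentions=True)

# saved_model=True writes a SavedModel next to the h5 weights
model.save_pretrained("bert_with_attentions", saved_model=True)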
Below, you can find the input and output representation of a `TFBertForSequenceClassification` saved as a TensorFlow SavedModel:
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_input_ids:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
To pass `inputs_embeds` (the token embeddings) instead of `input_ids` (the token IDs) as input, we need to subclass the model to get a new serving signature. The following snippet shows how to do so:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

# Creation of a subclass in order to define a new serving signature
class MyOwnModel(TFBertForSequenceClassification):
    # Decorate the serving method with the new input_signature
    # an input_signature represents the name, the data type and the shape of an expected input
    @tf.function(input_signature=[{
        "inputs_embeds": tf.TensorSpec((None, None, 768), tf.float32, name="inputs_embeds"),
        "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        # call the model to process the inputs
        output = self.call(inputs)
        # return the formatted output
        return self.serving_output(output)

# Instantiate the model with the new serving method
model = MyOwnModel.from_pretrained("bert-base-cased")
# save it with saved_model=True in order to have a SavedModel version along with the h5 weights.
model.save_pretrained("my_model", saved_model=True)
The serving method has to be overridden with the new `input_signature` argument of the `tf.function` decorator. See the official documentation to learn more about the `input_signature` argument. The `serving` method is used to define how the SavedModel will behave when deployed with TensorFlow Serving. Now the SavedModel looks as expected, see the new `inputs_embeds` input:
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['inputs_embeds'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 768)
      name: serving_default_inputs_embeds:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
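Before deploying it, this new signature can be sanity-checked locally by loading the SavedModel and calling it with an inputs_embeds tensor. The snippet below is a minimal sketch: the random embeddings only illustrate the expected shapes (real token embeddings would be used in practice), and the path assumes the saved_model/1 layout produced by save_pretrained:

import tensorflow as tf

# Load the SavedModel written by save_pretrained (usually under my_model/saved_model/1)
loaded = tf.saved_model.load("my_model/saved_model/1")
serving_fn = loaded.signatures["serving_default"]

# Dummy batch: 1 sequence of length 8; real token embeddings would be used in practice
seq_len = 8
outputs = serving_fn(
    inputs_embeds=tf.random.uniform((1, seq_len, 768), dtype=tf.float32),
    attention_mask=tf.ones((1, seq_len), dtype=tf.int32),
    token_type_ids=tf.zeros((1, seq_len), dtype=tf.int32),
)
print(outputs["logits"].shape)  # (1, 2)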
How to deploy and use a SavedModel?
Let's see step by step how to deploy and use a BERT model for sentiment classification.
Step 1
Create a SavedModel. To create a SavedModel, the Transformers library lets you load a PyTorch model called `nateraw/bert-base-uncased-imdb`, trained on the IMDB dataset, and convert it to a TensorFlow Keras model for you:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)
# the saved_model parameter is a flag to create a SavedModel version of the model at the same time as the h5 weights
model.save_pretrained("my_model", saved_model=True)
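As a quick, optional sanity check, you can list what save_pretrained just wrote; the SavedModel is expected under a versioned saved_model subfolder, following the structure described earlier:

import os

# Walk the output folder: my_model/ holds the h5 weights and my_model/saved_model/1/
# is expected to hold the SavedModel (assets/, variables/, saved_model.pb)
for root, dirs, files in os.walk("my_model"):
    print(root, files)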
Step 2
Create and run a Docker container with the SavedModel. First, pull the TensorFlow Serving Docker image for CPU (for GPU, replace serving with serving:latest-gpu):
docker pull tensorflow/serving
Next, run a serving image as a daemon named serving_base:
docker run -d --name serving_base tensorflow/serving
Copy the newly created SavedModel into the serving_base container's models folder:
docker cp my_model/saved_model serving_base:/models/bert
Commit the container that serves the model, changing MODEL_NAME to match the model's name (here `bert`); this name (`bert`) corresponds to the name we want to give to our SavedModel:
docker commit --change "ENV MODEL_NAME bert" serving_base my_bert_model
Then kill the serving_base image running as a daemon, because we don't need it anymore:
docker kill serving_base
Finally, run the image of our SavedModel as a daemon, mapping the container's ports 8501 (REST API) and 8500 (gRPC API) to the host, and name the container `bert`:
docker run -d -p 8501:8501 -p 8500:8500 --name bert my_bert_model
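Once the container is up, TensorFlow Serving's model status endpoint can be used to check that the model was loaded correctly. A minimal sketch:

import requests

# The model status API reports the state of every loaded version of the "bert" model
status = requests.get("http://localhost:8501/v1/models/bert")
print(status.json())  # a version with state "AVAILABLE" means the model is ready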
Step 3
Query the model through the REST API:
from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np
sentence = "I love the new TensorFlow update in transformers."
# Load the corresponding tokenizer of our SavedModel
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
# Load the model config of our SavedModel
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Tokenize the sentence
batch = tokenizer(sentence)
# Convert the batch into a proper dict
batch = dict(batch)
# Put the example into a list of size 1, that corresponds to the batch size
batch = [batch]
# The REST API needs a JSON that contains the key instances to declare the examples to process
input_data = {"instances": batch}
# Query the REST API, the path corresponds to http://host:port/model_version/models_root_folder/model_name:method
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))
# Parse the JSON result. The results are contained in a list with a root key called "predictions",
# and as there is only one example, we take the first element of the list
result = json.loads(r.text)["predictions"][0]
# The returned values are the raw logits; the predicted label is the index of the highest logit
label_id = np.argmax(result)
# Print the proper LABEL with its index
print(config.id2label[label_id])
This should return POSITIVE. It is also possible to get the same result through the gRPC (Google Remote Procedure Call) API:
from transformers import BertTokenizerFast, BertConfig
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import grpc
sentence = "I love the new TensorFlow update in transformers."
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Tokenize the sentence but this time with TensorFlow tensors as output already batch sized to 1. Ex:
# {
# 'input_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[ 101, 19082, 102]])>,
# 'token_type_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[0, 0, 0]])>,
# 'attention_mask': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[1, 1, 1]])>
# }
batch = tokenizer(sentence, return_tensors="tf")
# Create a channel that will be connected to the gRPC port of the container
channel = grpc.insecure_channel("localhost:8500")
# Create a stub made for prediction. This stub will be used to send the gRPC request to the TF Server.
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# Create a gRPC request made for prediction
request = predict_pb2.PredictRequest()
# Set the name of the model, for this use case it is bert
request.model_spec.name = "bert"
# Set which signature is used to format the gRPC query, here the default one
request.model_spec.signature_name = "serving_default"
# Set the input_ids input from the input_ids given by the tokenizer
# tf.make_tensor_proto turns a TensorFlow tensor into a Protobuf tensor
request.inputs["input_ids"].CopyFrom(tf.make_tensor_proto(batch["input_ids"]))
# Same with attention mask
request.inputs["attention_mask"].CopyFrom(tf.make_tensor_proto(batch["attention_mask"]))
# Same with token type ids
request.inputs["token_type_ids"].CopyFrom(tf.make_tensor_proto(batch["token_type_ids"]))
# Send the gRPC request to the TF Server
result = stub.Predict(request)
# The output is a protobuf in which the "logits" output is exposed as a flat
# list of floats through the .float_val field
output = result.outputs["logits"].float_val
# Print the proper LABEL, again taking the index of the highest logit
print(config.id2label[np.argmax(output)])
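Both APIs also accept batches with more than one example. As a sketch of a batched REST call (the second sentence is made up for the example), padding lets sequences of different lengths be sent together, which works because the SavedModel no longer fixes the sequence length:

from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np

sentences = [
    "I love the new TensorFlow update in transformers.",
    "This movie was a complete waste of time.",
]
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
# Pad every sequence to the length of the longest one so all examples share the same shape
batch = tokenizer(sentences, padding=True)
# The REST API expects one dict per example under the "instances" key
instances = [{key: batch[key][i] for key in batch} for i in range(len(sentences))]
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps({"instances": instances}))
# One list of logits per example; the predicted label is the index of the highest logit
for sentence, logits in zip(sentences, json.loads(r.text)["predictions"]):
    print(sentence, "->", config.id2label[int(np.argmax(logits))])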
Conclusion
Thanks to the latest updates applied to the TensorFlow models in transformers, one can now easily deploy models in production with TensorFlow Serving. One of the next steps we are thinking about is to integrate the preprocessing part directly into the SavedModel to make things even simpler.