使用資料集構建嵌入管道

本教程將引導您部署嵌入端點並構建一個 Python 指令碼，以高效處理帶有嵌入的資料集。我們將使用強大的 Qwen/Qwen3-Embedding-4B 模型為您的資料建立高質量的嵌入。

本教程側重於建立可用於生產的指令碼，該指令碼可以使用 **文字嵌入推理 (TEI)** 引擎處理任何資料集並新增嵌入，以最佳化效能。

建立您的嵌入端點

首先，我們需要建立一個針對嵌入最佳化的推理端點。

首先導航到 Inference Endpoints UI，登入後您應該會看到一個用於建立新推理端點的按鈕。單擊“New”按鈕。

從那裡，您將被定向到目錄。模型目錄包含流行的模型，這些模型具有經過調優的配置，可以一鍵部署。您可以搜尋嵌入模型或建立自定義端點。

catalog

在本教程中，我們將使用推理端點模型目錄中提供的 Qwen3-Embedding-4B 模型。請注意，如果它不在目錄中，您可以透過輸入模型倉庫 ID `Qwen/Qwen3-Embedding-4B`，將其作為自定義端點從 Hugging Face Hub 部署。

對於嵌入模型，我們建議使用：

GPU：NVIDIA、T4、L4 或 A10G，以獲得良好效能。
例項大小：x1（足以滿足大多數嵌入工作負載）
自動擴縮：啟用縮放到零，以便在不使用端點時將其切換到暫停狀態，從而節省成本。
超時：設定 10 分鐘的超時時間以避免長時間執行的請求。您應該根據預期如何使用端點來定義超時。

如果您正在尋找計算要求較低的模型，可以使用 sentence-transformers/all-MiniLM-L6-v2 模型。

Qwen3-Embedding-4B 模型將自動使用 **文字嵌入推理 (TEI)** 引擎，該引擎提供最佳化的推理和自動批處理。

點選“建立端點”以部署您的嵌入服務。

config

這可能需要大約 5 分鐘來初始化。

測試您的端點

推理端點執行後，您可以在遊樂場中直接測試它。它接受文字輸入並返回高維向量。

playground

嘗試輸入一些示例文字，例如“機器學習正在改變我們處理資料的方式”，並檢視嵌入輸出。

獲取您的端點詳細資訊

要以程式設計方式使用您的端點，您需要從端點的概述中獲取這些詳細資訊：

基本 URL：`https://<endpoint-name>.endpoints.huggingface.cloud/v1/`
模型名稱：您的端點名稱
令牌：您從設定中獲取的 HF 令牌

endpoint-details

構建嵌入指令碼

現在，讓我們逐步構建一個處理帶有嵌入的資料集的指令碼。我們將它分解成邏輯塊。

步驟 1：設定依賴項和匯入

我們將使用 OpenAI 客戶端連線到端點，並使用 `datasets` 庫載入和處理資料集。因此，我們首先安裝所需的包：

pip install datasets openai

然後，在一個新的 Python 檔案中設定您的匯入：

import os
from datasets import load_dataset
from openai import OpenAI

步驟 2：配置連線

根據您上一步收集的詳細資訊，設定連線到推理端點的配置。

# Configuration
ENDPOINT_URL = "https://your-endpoint-name.endpoints.huggingface.cloud/v1/" # Endpoint URL + version
HF_TOKEN = os.getenv("HF_TOKEN") # Your Hugging Face Hub token from hf.co/settings/tokens

# Initialize OpenAI client for your endpoint
client = OpenAI(
    base_url=ENDPOINT_URL,
    api_key=HF_TOKEN,
)

您的 OpenAI 客戶端現在已配置為連線到您的推理端點。如需進一步閱讀，您可以查閱文字嵌入客戶端文件此處。

步驟 3：建立嵌入函式

接下來，我們將建立一個函式來處理批次的文字並返回嵌入。

def get_embeddings(examples):
    """Get embeddings for a batch of texts."""
    response = client.embeddings.create(
        model="your-endpoint-name",  # Replace with your actual endpoint name
        input=examples["context"], # In the squad dataset, the text is in the "context" column
    )
    
    # Extract embeddings from response objects
    embeddings = [sample.embedding for sample in response.data]
    
    return {"embeddings": embeddings} # datasets expects a dictionary with a key "embeddings" and a value of a list of embeddings

`datasets` 庫將向我們的函式傳遞一批來自資料集的示例，作為批次值的字典。鍵將是我們要嵌入的列的名稱，值將是該列的值列表。

步驟 4：載入和處理您的資料集

載入您的資料集並應用嵌入函式

# Load a sample dataset (you can replace this with your own)
dataset = load_dataset("squad", split="train[:100]")  # Using first 100 examples for demo

# Process the dataset with embeddings
dataset_with_embeddings = dataset.map(
    get_embeddings,
    batched=True,
    batch_size=10,  # Process in small batches to avoid timeouts
    desc="Adding embeddings",
)

資料集庫的 `map` 函式經過最佳化，可自動為我們批處理行。推理端點還可以根據批次大小的需求進行擴充套件，因此為了獲得最佳效能，您應該根據推理端點的配置校準批次大小。

例如，為您的模型選擇儘可能高的批處理大小，並將批處理大小與推理端點在 `max_concurrent_requests` 中的配置同步。

步驟 5：儲存並分享您的結果

最後，讓我們將嵌入式資料集本地儲存或推送到 Hugging Face Hub

# Save the processed dataset locally
dataset_with_embeddings.save_to_disk("./embedded_dataset")

# Or push directly to Hugging Face Hub
dataset_with_embeddings.push_to_hub("your-username/squad-embeddings")

後續步驟

幹得好！您現在已經構建了一個可以處理任何資料集的嵌入管道。以下是完整的指令碼：

點選檢視完整指令碼

import os
from datasets import load_dataset
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configuration
ENDPOINT_URL = "https://your-endpoint-name.endpoints.huggingface.cloud/v1/"
HF_TOKEN = os.getenv("HF_TOKEN")

# Initialize OpenAI client for your endpoint
client = OpenAI(
    base_url=ENDPOINT_URL,
    api_key=HF_TOKEN,
)

def get_embeddings(examples):
    """Get embeddings for a batch of texts."""
    response = client.embeddings.create(
        model="your-endpoint-name",  # Replace with your actual endpoint name
        input=examples["context"],
    )
    
    # Extract embeddings from response
    embeddings = [sample.embedding for sample in response.data]
    
    return {"embeddings": embeddings}

# Load a sample dataset (you can replace this with your own)
print("Loading dataset...")
dataset = load_dataset("squad", split="train[:1000]")  # Using first 1000 examples for demo

# Process the dataset with embeddings
print("Processing dataset with embeddings...")
dataset_with_embeddings = dataset.map(
    get_embeddings,
    batched=True,
    batch_size=10,  # Process in small batches to avoid timeouts
    desc="Adding embeddings",
)

# Save the processed dataset locally
print("Saving processed dataset...")
dataset_with_embeddings.save_to_disk("./embedded_dataset")

# Or push directly to Hugging Face Hub
print("Pushing to Hugging Face Hub...")
dataset_with_embeddings.push_to_hub("your-username/squad-embeddings")

print("Dataset processing complete!")

以下是一些擴充套件指令碼的方法：

處理多個數據集：修改指令碼以處理不同的資料集源
新增錯誤處理：為失敗的 API 呼叫實現重試邏輯
最佳化批次大小：試驗不同的批次大小以獲得更好的效能
新增驗證：檢查嵌入質量和維度
自定義預處理：新增文字清洗或規範化步驟
構建語義搜尋應用程式：使用嵌入構建語義搜尋應用程式。

您的嵌入資料集現在已準備好用於語義搜尋、推薦系統或 RAG 應用程式等下游任務！

< > 在 GitHub 上更新

推理端點（專用）