Scaling Test-Time Compute for Longer Thinking in LLMs

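Authored by: Sergio Paniego

🚨 WARNING: This notebook is resource-intensive and requires substantial computational power. If you run it in Colab, it will use an A100 GPU.

In this recipe, we'll guide you through extending the inference time of an instruct LLM system using test-time compute so that it can solve more challenging problems, such as complex math problems. The approach is inspired by the OpenAI o1-o3 models, which demonstrate that longer reasoning time during inference can improve model performance.

This technique builds on the experiments shared in this blog post, which show that smaller models, such as the 1B and 3B Llama Instruct models, can outperform much larger ones on the MATH-500 benchmark when given enough "time to think". Recent research from DeepMind suggests that test-time compute can be scaled optimally through strategies such as iterative self-refinement or the use of a reward model.

The blog post introduces a new repository for running these experiments. In this recipe, we'll focus on building a small chatbot that reasons for longer to tackle harder problems using small open models.

1. Install Dependencies

Let's start by installing the search-and-learn repository! 🚀
This repo is designed to replicate the experimental results and is not a Python pip package. However, we can still use it to build our system. To do so, we need to install it from source with the following steps: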

!git clone https://github.com/huggingface/search-and-learn

%cd search-and-learn
!pip install -e '.[dev]'

Log in to Hugging Face to access meta-llama/Llama-3.2-1B-Instruct, since it is a gated model! 🗝️
If you haven't requested access before, you'll need to submit a request before proceeding.

from huggingface_hub import notebook_login

notebook_login()

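2. Set Up the Large Language Model (LLM) and the Process Reward Model (PRM) 💬

As illustrated in the diagram, the system consists of an LLM that generates intermediate answers based on user input, a PRM that evaluates and scores these answers, and a search strategy that uses the PRM feedback to guide the subsequent steps of the search until a final answer is reached.

Let's begin by initializing each model. For the LLM, we'll use the meta-llama/Llama-3.2-1B-Instruct model, and for the PRM, we'll use the RLHFlow/Llama3.1-8B-PRM-Deepseek-Data model.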

import torch
from vllm import LLM
from sal.models.reward_models import RLHFFlow

model_path = "meta-llama/Llama-3.2-1B-Instruct"
prm_path = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"

llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.5,  # Utilize 50% of GPU memory
    enable_prefix_caching=True,  # Optimize repeated prefix computations
    seed=42,  # Set seed for reproducibility
)

prm = RLHFFlow(prm_path)
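
To make the division of labour between the three components concrete, here is a minimal, self-contained sketch of the generate, score, and select loop. It does not call vLLM or the search-and-learn code: toy_generate and toy_score are hypothetical stand-ins for the LLM and the PRM, and the selection rule shown is plain best-of-N.

```python
import random
from typing import Callable, List, Tuple


def toy_generate(question: str, n: int, seed: int = 42) -> List[str]:
    """Hypothetical stand-in for the LLM: returns n candidate answers."""
    rng = random.Random(seed)
    pool = ["(3, pi/2)", "(3, 0)", "(0, 3)", "(3, pi)"]
    return [rng.choice(pool) for _ in range(n)]


def toy_score(question: str, answer: str) -> float:
    """Hypothetical stand-in for the PRM: a higher score means a better answer."""
    return 1.0 if answer == "(3, pi/2)" else 0.1


def select_best(
    question: str,
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Generate n candidates, score each one, and return the top-scoring pair."""
    candidates = generate(question, n)
    scored = [(candidate, score(question, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


best_answer, best_score = select_best("toy question", toy_generate, toy_score, n=8)
print(best_answer, best_score)
```

Passing the generator and scorer in as arguments keeps the selection logic independent of any particular model, which is also what lets the real pipeline swap search strategies without touching the LLM or the PRM.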

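2.1 Instantiate the Question, Search Strategy, and Call the Pipeline

Now that we've set up the LLM and PRM, let's define the question, select a search strategy to explore candidate answers, and call the pipeline to process the question.

Instantiate the question: In this step, we define the input question the system will answer, taking the given context into account.
Search strategy: The system currently supports the following search strategies: best_of_n, beam_search, and dvts (see diagram). In this example we'll use best_of_n, but you can easily switch to any of the others as needed. We need to define some parameters for the search strategy's configuration. You can check the full list here.
Call the pipeline: With the question and search strategy in place, we call the inference pipeline, which processes the input through both the LLM and the PRM to generate the final answer.

The first step is to clearly define the question the system will answer. This ensures we have a precise task for the model to solve.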

# Raw string so that LaTeX commands such as \theta are not parsed as escapes (\t is a tab)
question_text = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
input_batch = {"problem": [question_text]}
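
When checking the model's output, it helps to know the reference answer. The short sketch below computes it with the standard library; to_polar is a hypothetical helper introduced here for illustration and is not part of the recipe. For the point (0, 3) the result is r = 3 and theta = pi/2.

```python
import math
from typing import Tuple


def to_polar(x: float, y: float) -> Tuple[float, float]:
    """Convert rectangular (x, y) to polar (r, theta) with 0 <= theta < 2*pi."""
    r = math.hypot(x, y)
    # atan2 returns an angle in (-pi, pi]; the modulo maps it into [0, 2*pi)
    theta = math.atan2(y, x) % (2 * math.pi)
    return r, theta


r, theta = to_polar(0.0, 3.0)
print(f"r = {r}, theta = {theta / math.pi} * pi")
```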

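Next, we define the configuration, including parameters such as the number of candidate answers (N), and choose the search strategy to use. The search strategy dictates how we explore the potential answers. In this case, we'll use best_of_n.

With the question and configuration in place, we use the selected search strategy to generate multiple candidate answers. These candidates are scored by the PRM for quality, and the highest-scoring one is returned as the final answer.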

from sal.config import Config
from sal.search import beam_search, best_of_n, dvts

config = Config()
config.n = 32  # Number of answers to generate during the search

search_result = best_of_n(x=input_batch, config=config, llm=llm, prm=prm)
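
A process reward model scores the intermediate steps of a solution, so a candidate typically ends up with a list of step scores that must be reduced to a single number before candidates can be ranked. The sketch below shows three common reductions (last step, minimum, product). It is a standalone illustration under that assumption, not the repository's code; aggregate and pick_best are hypothetical names.

```python
import math
from typing import List


def aggregate(step_scores: List[float], strategy: str = "last") -> float:
    """Reduce per-step scores to one score for the whole candidate."""
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    if strategy == "last":
        return step_scores[-1]
    if strategy == "min":
        return min(step_scores)
    if strategy == "prod":
        return math.prod(step_scores)
    raise ValueError(f"Unknown aggregation strategy: {strategy!r}")


def pick_best(candidates: List[str], scores: List[List[float]], strategy: str = "last") -> str:
    """Return the candidate whose aggregated score is highest."""
    totals = [aggregate(s, strategy) for s in scores]
    best_index = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best_index]


candidates = ["A", "B"]
step_scores = [[0.9, 0.9, 0.4], [0.3, 0.5, 0.5]]
for strategy in ("last", "min", "prod"):
    print(strategy, pick_best(candidates, step_scores, strategy))
```

The choice of reduction matters: in this toy data the last-step rule prefers B, while the minimum and product rules prefer A because B starts with a weak step.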

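2.2 Display the Final Result

Once the pipeline has processed the question through the LLM and PRM, we can display the final result. This result is the model's output after generating intermediate answers and scoring them with the PRM.

Here's how to display the final answer: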

search_result["pred"][0]

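The model's output might include special tokens, such as <|start_header_id|> or <|end_header_id|>. To make the answer more readable, we can safely remove them before displaying it to the end user.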

formatted_output = search_result["pred"][0].replace("<|start_header_id|>assistant<|end_header_id|>\n\n", "").strip()
formatted_output
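
The exact-match replace above only handles one specific header. If other special tokens may appear, a regex-based cleanup is one option. The sketch below is a hypothetical helper, and it assumes that special tokens always have the form <|lowercase_name|>; adjust the pattern if your model uses a different convention.

```python
import re

# A full role header, e.g. "<|start_header_id|>assistant<|end_header_id|>"
HEADER_RE = re.compile(r"<\|start_header_id\|>.*?<\|end_header_id\|>", re.DOTALL)
# Any remaining token of the assumed form <|name|>
SPECIAL_RE = re.compile(r"<\|[a-z_0-9]+\|>")


def strip_special_tokens(text: str) -> str:
    """Remove role headers and leftover special tokens, then trim whitespace."""
    text = HEADER_RE.sub("", text)
    text = SPECIAL_RE.sub("", text)
    return text.strip()


raw = "<|start_header_id|>assistant<|end_header_id|>\n\nr = 3<|eot_id|>"
print(strip_special_tokens(raw))
```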

After removing any special tokens, we can display the final answer to the user. Since the answer is written in markdown, it can be rendered properly as markdown.

from IPython.display import display, Markdown

display(Markdown(formatted_output))

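3. Assemble It All! 🧑‍🏭️

Now, let's create a method that encapsulates the entire pipeline. This will let us easily reuse the process in future applications, keeping it efficient and modular.

By combining the LLM, PRM, search strategy, and result display, we simplify the workflow and ensure it can be reused for other tasks or questions. Additionally, we'll track the time spent by each method and the number of tokens it generates, so that we can understand the practical implications of using each strategy and configuration.

Here's how we can structure the method: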

import time


def generate_with_search_and_learn(question, config, llm, prm, method="best_of_n"):
    """
    Generate an answer for a given question using the search-and-learn pipeline.

    Args:
    - question (str): The input question to generate an answer for.
    - config (Config): Configuration object containing parameters for search strategy.
    - llm (LLM): Pretrained large language model used for generating answers.
    - prm (RLHFFlow): Process reward model used for evaluating answers.
    - method (str): Search strategy to use. Options are 'best_of_n', 'beam_search', 'dvts'. Default is 'best_of_n'.

    Returns:
    - str: The formatted output after processing the question.
    """
    batch = {"problem": [question]}

    start_time = time.time()
    if method == "best_of_n":
        result = best_of_n(x=batch, config=config, llm=llm, prm=prm)
    elif method == "beam_search":
        result = beam_search(examples=batch, config=config, llm=llm, prm=prm)
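    elif method == "dvts":
        result = dvts(examples=batch, config=config, llm=llm, prm=prm)
    else:
        # Fail fast instead of hitting an undefined `result` below
        raise ValueError(f"Unknown search method: {method!r}")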

    elapsed_time = time.time() - start_time
    print(f"\nFinished in {elapsed_time:.2f} seconds\n")

    tokenizer = llm.get_tokenizer()
    total_tokens = 0
    for completion in result["completions"]:
        for comp in completion:
            output_tokens = tokenizer.encode(comp)
            total_tokens += len(output_tokens)

    print(f"Total tokens in all completions: {total_tokens}")

    formatted_output = result["pred"][0].replace("<|start_header_id|>assistant<|end_header_id|>\n\n", "").strip()
    return formatted_output
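
The timing and token-counting parts of the function above can also be factored into small helpers that are easy to test without a GPU. This is a minimal sketch; timed and count_tokens are hypothetical names, and the encode argument stands in for any callable that turns a string into a list of tokens.

```python
import time
from typing import Any, Callable, Iterable, List, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def count_tokens(
    completions: Iterable[Iterable[str]],
    encode: Callable[[str], List[Any]],
) -> int:
    """Sum the token counts over a nested collection of completion strings."""
    return sum(len(encode(text)) for group in completions for text in group)


# Whitespace splitting is used here only as a toy tokenizer
total, elapsed = timed(count_tokens, [["a b c", "d"], ["e f"]], str.split)
print(total, f"{elapsed:.6f}s")
```

In the notebook, encode could be the encode method of the tokenizer returned by llm.get_tokenizer(), as in the function above.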

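⏳ 3.1 Comparing Thinking Time for Each Strategy

Let's compare the thinking time of the three methods: best_of_n, beam_search, and dvts. Each method is run with the same number of answers during the search process, and we measure the thinking time in seconds and the number of generated tokens.

In the results below, best_of_n has the shortest thinking time and beam_search the longest, with dvts in between. However, best_of_n generates more tokens because of its simpler search strategy.

Method	Number of Answers During Search	Thinking Time (Seconds)	Generated Tokens
best_of_n	8	3.54	3087
beam_search	8	10.06	2049
dvts	8	8.46	2544

This comparison illustrates the trade-offs between the strategies, balancing thinking time against the complexity of the search process.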
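
One way to read these numbers is as throughput, that is, generated tokens per second of thinking time. The sketch below derives it from the reported figures; tokens_per_second is a hypothetical helper, and the values come from a single run, so treat them as indicative only.

```python
# Figures copied from the comparison above (single run, n = 8)
results = {
    "best_of_n": {"seconds": 3.54, "tokens": 3087},
    "beam_search": {"seconds": 10.06, "tokens": 2049},
    "dvts": {"seconds": 8.46, "tokens": 2544},
}


def tokens_per_second(seconds: float, tokens: int) -> float:
    """Generated tokens divided by wall-clock thinking time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


throughput = {name: tokens_per_second(**row) for name, row in results.items()}
for name, tps in sorted(throughput.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {tps:8.1f} tokens/s")
```

Under this measure best_of_n reaches roughly 872 tokens per second, against roughly 301 for dvts and 204 for beam_search, which reflects how much of the tree-search time goes into step-by-step scoring rather than generation.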

1. Best of n

We'll begin with the best_of_n strategy. Here's how to track the thinking time for this approach:

>>> question = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"

>>> config.n = 8

>>> formatted_output = generate_with_search_and_learn(
...     question=question, config=config, llm=llm, prm=prm, method="best_of_n"
... )

Finished in 3.54 seconds

Total tokens in all completions: 3087

display(Markdown(formatted_output))

2. Beam Search

Now, let's try the beam_search strategy.

>>> config.n = 8
>>> # beam search specific
>>> config.sort_completed = True
>>> config.filter_duplicates = True

>>> formatted_output = generate_with_search_and_learn(
...     question=question, config=config, llm=llm, prm=prm, method="beam_search"
... )

Finished in 10.06 seconds

Total tokens in all completions: 2049

display(Markdown(formatted_output))
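
The two beam-search flags set above are not explained in this recipe. Judging only by their names, a plausible reading is that filter_duplicates drops candidates with identical text and sort_completed orders finished candidates by score. The sketch below illustrates that reading on plain Python tuples. It is an assumption, not the repository's implementation, and dedupe_and_sort is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]  # (candidate text, aggregated score)


def dedupe_and_sort(
    candidates: List[Candidate],
    filter_duplicates: bool = True,
    sort_completed: bool = True,
) -> List[Candidate]:
    """Optionally drop duplicate texts (keeping the best score) and sort by score."""
    kept = list(candidates)
    if filter_duplicates:
        best: Dict[str, float] = {}
        for text, score in candidates:
            if text not in best or score > best[text]:
                best[text] = score
        kept = list(best.items())
    if sort_completed:
        kept = sorted(kept, key=lambda c: c[1], reverse=True)
    return kept


beams = [("a", 0.2), ("b", 0.9), ("a", 0.5)]
print(dedupe_and_sort(beams))
```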

3. Diverse Verifier Tree Search (DVTS)

Finally, let's try the dvts strategy.

>>> config.n = 8
>>> # dvts specific
>>> config.n_beams = config.n // config.beam_width

>>> formatted_output = generate_with_search_and_learn(
...     question=question, config=config, llm=llm, prm=prm, method="dvts"
... )

Finished in 8.46 seconds

Total tokens in all completions: 2544

display(Markdown(formatted_output))
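
The DVTS setup above derives n_beams by integer division, so any remainder is silently dropped when n is not a multiple of beam_width. A small guard makes that explicit. dvts_n_beams is a hypothetical helper, and the beam_width of 4 used in the example is an assumed value; check your Config for the actual setting.

```python
def dvts_n_beams(n: int, beam_width: int) -> int:
    """Return n // beam_width, rejecting values that would leave a remainder."""
    if beam_width <= 0:
        raise ValueError("beam_width must be positive")
    if n % beam_width != 0:
        raise ValueError(f"n={n} is not a multiple of beam_width={beam_width}")
    return n // beam_width


# beam_width=4 is an assumption made for this example
print(dvts_n_beams(8, 4))
```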

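🙋 3.2 Testing the System with a Simple Question

In this final example, we'll test the system with a straightforward question to observe how it performs in simpler cases. This lets us verify that the system works as expected even for basic queries.

Let's try the following question: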

>>> question = "What's the capital of Spain?"

>>> config.n = 32

>>> formatted_output = generate_with_search_and_learn(
...     question=question, config=config, llm=llm, prm=prm, method="best_of_n"
... )

Finished in 1.03 seconds

Total tokens in all completions: 544

display(Markdown(formatted_output))

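Even though we set a larger number of candidate answers (N), the thinking time remains relatively small (1.03 seconds and 544 generated tokens). This shows that the system handles easier problems efficiently, spending less time on them while reserving its extra capacity for more complex questions.

🏆 We now have a fully operational pipeline that leverages test-time compute, enabling the system to "think longer" on more complicated queries while keeping response times fast for straightforward questions.

This approach lets the system scale its thinking time with the complexity of the task, offering an efficient and responsive solution for both simple and challenging problems.

4. Continuing the Journey and Resources 🧑‍🎓️

If you're eager to continue exploring, be sure to check out the original experimental blog post and all the references mentioned in it. These resources will deepen your understanding of test-time compute, its benefits, and its applications in LLMs.

Happy learning and experimenting! 🚀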