開源 AI 食譜文件

在 Hugging Face 文件上使用 LangChain 進行高階 RAG

開源 AI 食譜

加入 Hugging Face 社群

並獲得增強的文件體驗

在模型、資料集和 Spaces 上進行協作

透過加速推理獲得更快的示例

切換文件主題

開始使用

在 Hugging Face 文件上使用 LangChain 進行高階 RAG

作者：Aymeric Roucher

本筆記本演示瞭如何使用 LangChain 構建一個高階 RAG (Retrieval Augmented Generation，檢索增強生成)，以回答使用者關於特定知識庫（此處為 Hugging Face 文件）的問題。

關於 RAG 的入門介紹，您可以檢視這篇其他的指南！

RAG 系統很複雜，包含許多活動部件：這是一個 RAG 的圖示，我們用藍色標註了所有可以增強系統的可能性。

💡 如您所見，這個架構中有許多步驟可以調整：正確地調整系統將帶來顯著的效能提升。

在本筆記本中，我們將深入探討這些藍色標註中的許多內容，看看如何調整您的 RAG 系統並獲得最佳效能。

讓我們深入模型構建吧！ 首先，我們安裝所需的模型依賴項。

!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-cpu openpyxl pacmap datasets langchain-community ragatouille

from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from datasets import Dataset
import matplotlib.pyplot as plt

pd.set_option("display.max_colwidth", None)  # This will be helpful when visualizing retriever outputs

載入您的知識庫

import datasets

ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")

from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)
]

1. 檢索器 - 嵌入 🗂️

檢索器就像一個內部搜尋引擎：給定使用者查詢，它會從您的知識庫中返回一些相關的片段。

然後，這些片段將被輸入到讀取器模型中，以幫助它生成答案。

因此，我們的目標是，給定一個使用者問題，從我們的知識庫中找到最相關的片段來回答該問題。

這是一個寬泛的目標，它留下了一些問題。我們應該檢索多少個片段？這個引數將被命名為 `top_k`。

這些片段應該有多長？這被稱為 `chunk size`（塊大小）。沒有一刀切的答案，但這裡有幾個要點：

🔀 您的 `chunk size` 可以因片段而異。
由於檢索中總會有一些噪音，增加 `top_k` 會增加在檢索到的片段中獲得相關元素的機會。🎯 射出更多的箭會增加你擊中目標的機率。
同時，檢索到的文件的總長度不應太高：例如，對於大多數當前模型，16k 的 token 可能會因“迷失在中間（Lost-in-the-middle）”現象而使您的讀取器模型不堪重負。🎯 只給你的讀取器模型最相關的見解，而不是一大堆書！

在本筆記本中，我們使用 Langchain 庫，因為它為向量資料庫提供了多種多樣的選擇，並允許我們在整個處理過程中保留文件元資料。

1.1 將文件分割成塊

在這一部分，我們將知識庫中的文件分割成更小的塊，這些塊將成為讀取器 LLM 生成答案所依據的片段。
我們的目標是準備一個由語義相關片段組成的集合。因此，它們的大小應該適應於精確的思想：太小會截斷思想，太大則會稀釋思想。

💡 文字分割有很多選擇：按詞分割、按句子邊界分割、遞迴分塊（以樹狀方式處理文件以保留結構資訊）……要了解更多關於分塊的知識，我推薦您閱讀 Greg Kamradt 的這本很棒的筆記本。

遞迴分塊使用一個給定的分隔符列表（從最重要到最不重要的分隔符排序）逐步將文字分解成更小的部分。如果第一次分割沒有得到正確大小或形狀的塊，該方法會使用不同的分隔符在新的塊上重複進行。例如，使用分隔符列表 ["\n\n", "\n", ".", ""]：
- 該方法將首先在任何有雙換行符 "\n\n" 的地方分解文件。
- 生成的文件將再次在單換行符 "\n" 處分割，然後在句末 "." 處分割。
- 最後，如果某些塊仍然太大，它們將在超出最大大小時被分割。
透過這種方法，全域性結構得到了很好的保留，代價是塊大小會有輕微的變化。

這個 Space 可以讓你視覺化不同的分割選項如何影響你得到的塊。

🔬 讓我們對塊大小進行一些實驗，從一個任意的大小開始，看看分割是如何工作的。我們使用 Langchain 的 `RecursiveCharacterTextSplitter` 實現遞迴分塊。

引數 `chunk_size` 控制單個塊的長度：預設情況下，此長度按塊中的字元數計算。
引數 `chunk_overlap` 讓相鄰的塊之間有一些重疊。這減少了某個想法被兩個相鄰塊之間的分割切成兩半的可能性。我們大致將其設定為塊大小的 1/10，您可以嘗試不同的值！

from langchain.text_splitter import RecursiveCharacterTextSplitter

# We use a hierarchical list of separators specifically tailored for splitting Markdown documents
# This list is taken from LangChain's MarkdownTextSplitter class
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # The maximum number of characters in a chunk: we selected this value arbitrarily
    chunk_overlap=100,  # The number of characters to overlap between chunks
    add_start_index=True,  # If `True`, includes chunk's start index in metadata
    strip_whitespace=True,  # If `True`, strips whitespace from the start and end of every document
    separators=MARKDOWN_SEPARATORS,
)

docs_processed = []
for doc in RAW_KNOWLEDGE_BASE:
    docs_processed += text_splitter.split_documents([doc])

我們還必須記住，在嵌入文件時，我們將使用一個接受特定最大序列長度 `max_seq_length` 的嵌入模型。

因此，我們應該確保我們的塊大小低於這個限制，因為任何更長的塊在處理前都會被截斷，從而失去相關性。

>>> from sentence_transformers import SentenceTransformer

>>> # To get the value of the max sequence_length, we will query the underlying `SentenceTransformer` object used in the RecursiveCharacterTextSplitter
>>> print(f"Model's maximum sequence length: {SentenceTransformer('thenlper/gte-small').max_seq_length}")

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
>>> lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]

>>> # Plot the distribution of document lengths, counted as the number of tokens
>>> fig = pd.Series(lengths).hist()
>>> plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
>>> plt.show()

Model's maximum sequence length: 512

👀 正如您所見，塊的長度並未與我們 512 個 token 的限制對齊，並且一些文件超過了限制，因此它們的一部分將在截斷中丟失！

所以我們應該修改 `RecursiveCharacterTextSplitter` 類，以 token 的數量而不是字元的數量來計算長度。
然後我們可以選擇一個特定的塊大小，這裡我們會選擇一個低於 512 的閾值。
- 較小的文件可以讓分割更專注於具體的思想。
- 但太小的塊會把句子分成兩半，從而再次失去意義：適當的調整是一個平衡問題。

>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> from transformers import AutoTokenizer

>>> EMBEDDING_MODEL_NAME = "thenlper/gte-small"


>>> def split_documents(
...     chunk_size: int,
...     knowledge_base: List[LangchainDocument],
...     tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
... ) -> List[LangchainDocument]:
...     """
...     Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
...     """
...     text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
...         AutoTokenizer.from_pretrained(tokenizer_name),
...         chunk_size=chunk_size,
...         chunk_overlap=int(chunk_size / 10),
...         add_start_index=True,
...         strip_whitespace=True,
...         separators=MARKDOWN_SEPARATORS,
...     )

...     docs_processed = []
...     for doc in knowledge_base:
...         docs_processed += text_splitter.split_documents([doc])

...     # Remove duplicates
...     unique_texts = {}
...     docs_processed_unique = []
...     for doc in docs_processed:
...         if doc.page_content not in unique_texts:
...             unique_texts[doc.page_content] = True
...             docs_processed_unique.append(doc)

...     return docs_processed_unique


>>> docs_processed = split_documents(
...     512,  # We choose a chunk size adapted to our model
...     RAW_KNOWLEDGE_BASE,
...     tokenizer_name=EMBEDDING_MODEL_NAME,
... )

>>> # Let's visualize the chunk sizes we would have in tokens from a common model
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)
>>> lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]
>>> fig = pd.Series(lengths).hist()
>>> plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
>>> plt.show()

➡️ 現在塊長度的分佈看起來好多了！

1.2 構建向量資料庫

我們想為我們知識庫中的所有塊計算嵌入：要了解更多關於句子嵌入的資訊，我們推薦閱讀這篇指南。

檢索如何工作？

一旦所有塊都被嵌入，我們將它們儲存在向量資料庫中。當用戶輸入一個查詢時，它會被之前使用的同一個模型嵌入，然後透過相似性搜尋從向量資料庫中返回最接近的文件。

因此，技術挑戰在於，給定一個查詢向量，如何在包含數千條記錄的資料庫中快速找到該向量的最近鄰。為此，我們需要選擇兩樣東西：一個距離度量，以及一個用於在資料庫中快速找到最近鄰的搜尋演算法。

距離度量

關於距離度量，你可以在這裡找到一份很好的指南。簡而言之：

餘弦相似度計算兩個向量之間相似度的方式是它們相對角度的餘弦值：它允許我們比較向量的方向而不考慮它們的大小。使用它需要對所有向量進行歸一化，將它們重新縮放為單位範數。
點積會考慮向量的模長，這有時會產生不希望的效果，即增加一個向量的長度會使其與其他所有向量更相似。
歐幾里得距離是向量端點之間的距離。

你可以嘗試這個小練習來檢驗你對這些概念的理解。但一旦向量被歸一化，特定距離的選擇就不那麼重要了。

我們特定的模型在餘弦相似度下表現良好，所以我們選擇這個距離，並在嵌入模型和 FAISS 索引的 `distance_strategy` 引數中都進行了設定。使用餘弦相似度時，我們必須對嵌入進行歸一化。

🚨👇 下面的單元格在 A10G 上執行需要幾分鐘！

from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)

KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

👀 為了視覺化最接近文件的搜尋過程，讓我們使用 PaCMAP 將我們的嵌入從 384 維降到 2 維。

💡 *我們選擇 PaCMAP 而不是其他技術，如 t-SNE 或 UMAP，因為它效率高（保留區域性和全域性結構），對初始化引數魯棒且速度快。*

# Embed a user query in the same space
user_query = "How to create a pipeline object?"
query_vector = embedding_model.embed_query(user_query)

import pacmap
import numpy as np
import plotly.express as px

embedding_projector = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=1)

embeddings_2d = [
    list(KNOWLEDGE_VECTOR_DATABASE.index.reconstruct_n(idx, 1)[0]) for idx in range(len(docs_processed))
] + [query_vector]

# Fit the data (the index of transformed data corresponds to the index of the original data)
documents_projected = embedding_projector.fit_transform(np.array(embeddings_2d), init="pca")

df = pd.DataFrame.from_dict(
    [
        {
            "x": documents_projected[i, 0],
            "y": documents_projected[i, 1],
            "source": docs_processed[i].metadata["source"].split("/")[1],
            "extract": docs_processed[i].page_content[:100] + "...",
            "symbol": "circle",
            "size_col": 4,
        }
        for i in range(len(docs_processed))
    ]
    + [
        {
            "x": documents_projected[-1, 0],
            "y": documents_projected[-1, 1],
            "source": "User query",
            "extract": user_query,
            "size_col": 100,
            "symbol": "star",
        }
    ]
)

# Visualize the embedding
fig = px.scatter(
    df,
    x="x",
    y="y",
    color="source",
    hover_data="extract",
    size="size_col",
    symbol="symbol",
    color_discrete_map={"User query": "black"},
    width=1000,
    height=700,
)
fig.update_traces(
    marker=dict(opacity=1, line=dict(width=0, color="DarkSlateGrey")),
    selector=dict(mode="markers"),
)
fig.update_layout(
    legend_title_text="<b>Chunk source</b>",
    title="<b>2D Projection of Chunk Embeddings via PaCMAP</b>",
)
fig.show()

➡️ 在上圖中，您可以看到知識庫文件的空間表示。由於向量嵌入代表了文件的意義，它們在意義上的接近程度應該反映在它們嵌入的接近程度上。

圖中也顯示了使用者查詢的嵌入：我們想找到 `k` 個意義最接近的文件，因此我們選擇 `k` 個最接近的向量。

在 LangChain 向量資料庫實現中，這個搜尋操作由 `vector_database.similarity_search(query)` 方法執行。

這是結果

>>> print(f"\nStarting retrieval for {user_query=}...")
>>> retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)
>>> print("\n==================================Top document==================================")
>>> print(retrieved_docs[0].page_content)
>>> print("==================================Metadata==================================")
>>> print(retrieved_docs[0].metadata)

Starting retrieval for user_query='How to create a pipeline object?'...

==================================Top document==================================
```

## Available Pipelines:
==================================Metadata==================================
&#123;'source': 'huggingface/diffusers/blob/main/docs/source/en/api/pipelines/deepfloyd_if.md', 'start_index': 16887}

2. 讀取器 - LLM 💬

在這一部分，LLM 讀取器會讀取檢索到的上下文來形成其答案。

這裡有一些子步驟，都可以進行調整：

檢索到的文件內容被聚合成“上下文”，有許多處理選項，如 *提示詞壓縮*。
上下文和使用者查詢被聚合成一個提示詞，然後交給 LLM 生成其答案。

2.1. 讀取器模型

讀取器模型的選擇在幾個方面很重要：

讀取器模型的 `max_seq_length` 必須容納我們的提示詞，其中包括檢索器呼叫輸出的上下文：上下文包含 5 個各 512 個 token 的文件，所以我們的目標是上下文長度至少為 4k 個 token。
讀取器模型

在這個例子中，我們選擇了HuggingFaceH4/zephyr-7b-beta，這是一個小巧但功能強大的模型。

由於每週都有許多新模型釋出，您可能希望將此模型替換為最新最好的模型。跟蹤開源 LLM 的最佳方式是檢視開源 LLM 排行榜。

為了加快推理速度，我們將載入模型的量化版本。

from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

READER_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)

READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

READER_LLM("What is 4+4? Answer:")

2.2. 提示詞

下面的 RAG 提示詞模板是我們提供給讀取器 LLM 的內容：將其格式化為讀取器 LLM 的聊天模板非常重要。

我們給它我們的上下文和使用者的問題。

>>> prompt_in_chat_format = [
...     {
...         "role": "system",
...         "content": """Using the information contained in the context,
... give a comprehensive answer to the question.
... Respond only to the question asked, response should be concise and relevant to the question.
... Provide the number of the source document when relevant.
... If the answer cannot be deduced from the context, do not give an answer.""",
...     },
...     {
...         "role": "user",
...         "content": """Context:
... {context}
... ---
... Now here is the question you need to answer.

... Question: {question}""",
...     },
... ]
>>> RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
...     prompt_in_chat_format, tokenize=False, add_generation_prompt=True
... )
>>> print(RAG_PROMPT_TEMPLATE)

<|system|>
Using the information contained in the context, 
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.
<|user|>
Context:
&#123;context}
---
Now here is the question you need to answer.

Question: &#123;question}
<|assistant|>

讓我們在我們之前檢索到的文件上測試一下我們的讀取器！

>>> retrieved_docs_text = [doc.page_content for doc in retrieved_docs]  # We only need the text of the documents
>>> context = "\nExtracted documents:\n"
>>> context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(retrieved_docs_text)])

>>> final_prompt = RAG_PROMPT_TEMPLATE.format(question="How to create a pipeline object?", context=context)

>>> # Redact an answer
>>> answer = READER_LLM(final_prompt)[0]["generated_text"]
>>> print(answer)

To create a pipeline object, follow these steps:

1. Define the inputs and outputs of your pipeline. These could be strings, dictionaries, or any other format that best suits your use case.

2. Inherit the `Pipeline` class from the `transformers` module and implement the following methods:

   - `preprocess`: This method takes the raw inputs and returns a preprocessed dictionary that can be passed to the model.

   - `_forward`: This method performs the actual inference using the model and returns the output tensor.

   - `postprocess`: This method takes the output tensor and returns the final output in the desired format.

   - `_sanitize_parameters`: This method is used to sanitize the input parameters before passing them to the model.

3. Load the necessary components, such as the model and scheduler, into the pipeline object.

4. Instantiate the pipeline object and return it.

Here's an example implementation based on the given context:

```python
from transformers import Pipeline
import torch
from diffusers import StableDiffusionPipeline

class MyPipeline(Pipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pipe = StableDiffusionPipeline.from_pretrained("my_model")

    def preprocess(self, inputs):
        # Preprocess the inputs as needed
        return &#123;"input_ids":...}

    def _forward(self, inputs):
        # Run the forward pass of the model
        return self.pipe(**inputs).images[0]

    def postprocess(self, outputs):
        # Postprocess the outputs as needed
        return outputs["sample"]

    def _sanitize_parameters(self, params):
        # Sanitize the input parameters
        return params

my_pipeline = MyPipeline()
result = my_pipeline("My input string")
print(result)
```

Note that this implementation assumes that the model and scheduler are already loaded into memory. If they need to be loaded dynamically, you can modify the `__init__` method accordingly.

2.3. 重排 (Reranking)

RAG 的一個好選擇是檢索比最終需要的更多的文件，然後用一個更強大的檢索模型對結果進行重排，最後只保留 `top_k` 個結果。

為此，Colbertv2 是一個很好的選擇：與我們傳統的嵌入模型（雙編碼器）不同，它是一個交叉編碼器，可以計算查詢 token 和每個文件 token 之間更細粒度的互動。

得益於 RAGatouille 庫，它可以輕鬆使用。

from ragatouille import RAGPretrainedModel

RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

3. 整合所有部分！

from transformers import Pipeline


def answer_with_rag(
    question: str,
    llm: Pipeline,
    knowledge_index: FAISS,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 5,
) -> Tuple[str, List[LangchainDocument]]:
    # Gather documents with retriever
    print("=> Retrieving documents...")
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # Keep only the text

    # Optionally rerank results
    if reranker:
        print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Redact an answer
    print("=> Generating answer...")
    answer = llm(final_prompt)[0]["generated_text"]

    return answer, relevant_docs

讓我們看看我們的 RAG 流水線如何回答使用者查詢。

>>> question = "how to create a pipeline object?"

>>> answer, relevant_docs = answer_with_rag(question, READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER)

=> Retrieving documents...

>>> print("==================================Answer==================================")
>>> print(f"{answer}")
>>> print("==================================Source docs==================================")
>>> for i, doc in enumerate(relevant_docs):
...     print(f"Document {i}------------------------------------------------------------")
...     print(doc)

==================================Answer==================================
To create a pipeline object, follow these steps:

1. Import the `pipeline` function from the `transformers` module:

   ```python
   from transformers import pipeline
   ```

2. Choose the task you want to perform, such as object detection, sentiment analysis, or image generation, and pass it as an argument to the `pipeline` function:

   - For object detection:

     ```python
     >>> object_detector = pipeline('object-detection')
     >>> object_detector(image)
     [&#123;'score': 0.9982201457023621,
       'label':'remote',
       'box': &#123;'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
     ...]
     ```

   - For sentiment analysis:

     ```python
     >>> classifier = pipeline("sentiment-analysis")
     >>> classifier("This is a great product!")
     &#123;'labels': ['POSITIVE'],'scores': tensor([0.9999], device='cpu', dtype=torch.float32)}
     ```

   - For image generation:

     ```python
     >>> image = pipeline(
    ... "stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
    ... ).images[0]
     >>> image
     PILImage mode RGB size 7680x4320 at 0 DPI
     ```

Note that the exact syntax may vary depending on the specific pipeline being used. Refer to the documentation for more details on how to use each pipeline.

In general, the process involves importing the necessary modules, selecting the desired pipeline task, and passing it to the `pipeline` function along with any required arguments. The resulting pipeline object can then be used to perform the selected task on input data.
==================================Source docs==================================
Document 0------------------------------------------------------------
# Allocate a pipeline for object detection
>>> object_detector = pipeline('object-detection')
>>> object_detector(image)
[&#123;'score': 0.9982201457023621,
  'label': 'remote',
  'box': &#123;'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
 &#123;'score': 0.9960021376609802,
  'label': 'remote',
  'box': &#123;'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},
 &#123;'score': 0.9954745173454285,
  'label': 'couch',
  'box': &#123;'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},
 &#123;'score': 0.9988006353378296,
  'label': 'cat',
  'box': &#123;'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},
 &#123;'score': 0.9986783862113953,
  'label': 'cat',
  'box': &#123;'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]
Document 1------------------------------------------------------------
# Allocate a pipeline for object detection
>>> object_detector = pipeline('object_detection')
>>> object_detector(image)
[&#123;'score': 0.9982201457023621,
  'label': 'remote',
  'box': &#123;'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
 &#123;'score': 0.9960021376609802,
  'label': 'remote',
  'box': &#123;'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},
 &#123;'score': 0.9954745173454285,
  'label': 'couch',
  'box': &#123;'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},
 &#123;'score': 0.9988006353378296,
  'label': 'cat',
  'box': &#123;'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},
 &#123;'score': 0.9986783862113953,
  'label': 'cat',
  'box': &#123;'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]
Document 2------------------------------------------------------------
Start by creating an instance of [`pipeline`] and specifying a task you want to use it for. In this guide, you'll use the [`pipeline`] for sentiment analysis as an example:

```py
>>> from transformers import pipeline

>>> classifier = pipeline("sentiment-analysis")
Document 3------------------------------------------------------------
```

## Add the pipeline to 🤗 Transformers

If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the `pipelines` submodule
with the code of your pipeline, then add it to the list of tasks defined in `pipelines/__init__.py`.

Then you will need to add tests. Create a new file `tests/test_pipelines_MY_PIPELINE.py` with examples of the other tests.

The `run_pipeline_test` function will be very generic and run on small random models on every possible
architecture as defined by `model_mapping` and `tf_model_mapping`.

This is very important to test future compatibility, meaning if someone adds a new model for
`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's
impossible to check for actual values, that's why there is a helper `ANY` that will simply attempt to match the
output of the pipeline TYPE.

You also *need* to implement 2 (ideally 4) tests.

- `test_small_model_pt` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)
  and test the pipeline outputs. The results should be the same as `test_small_model_tf`.
- `test_small_model_tf` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)
  and test the pipeline outputs. The results should be the same as `test_small_model_pt`.
- `test_large_model_pt` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to
  make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
  sure there is no drift in future releases.
- `test_large_model_tf` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to
  make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make
  sure there is no drift in future releases.
Document 4------------------------------------------------------------
```

2. Pass a prompt to the pipeline to generate an image:

```py
image = pipeline(
	"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
).images[0]
image

✅ 我們現在有了一個功能齊全、效能優越的 RAG 系統。今天就到這裡！恭喜你堅持到了最後 🥳

更進一步 🗺️

這並非旅程的終點！您可以嘗試許多步驟來改進您的 RAG 系統。我們建議以迭代的方式進行：對系統進行小的改動，然後觀察效能是否有所提升。

建立評估流水線

💬 “你無法改進你沒有衡量的模型效能”，甘地曾說… 或者至少 Llama2 告訴我他這麼說過。無論如何，你絕對應該從衡量效能開始：這意味著建立一個小的評估資料集，然後監控你的 RAG 系統在這個評估資料集上的效能。

改進檢索器

🛠️ 您可以使用這些選項來調整結果：

調整分塊方法
- 塊的大小
- 方法：使用不同的分隔符進行分割，使用語義分塊等
更換嵌入模型

👷‍♀️ 可以考慮更多：

嘗試另一種分塊方法，如語義分塊
更改使用的索引（此處為 FAISS）
查詢擴充套件：以略微不同的方式重新表述使用者查詢，以檢索更多文件。

改進讀取器

🛠️ 在這裡，您可以嘗試以下選項來改進結果：

調整提示詞
開啟/關閉重排
選擇一個更強大的讀取器模型

💡 為了進一步改進結果，這裡可以考慮許多選項：

壓縮檢索到的上下文，只保留與回答查詢最相關的部分。
擴充套件 RAG 系統，使其更加使用者友好
- 引用來源
- 使其具有對話性

< > 在 GitHub 上更新

←使用 TGI 的 Messages API 從 OpenAI 遷移到開放 LLM 在零樣本（Zero-shot）文字分類中使用 SetFit 進行資料標註的建議→

開源 AI 食譜

在 Hugging Face 文件上使用 LangChain 進行高階 RAG

載入您的知識庫

1. 檢索器 - 嵌入 🗂️

1.1 將文件分割成塊

1.2 構建向量資料庫

檢索如何工作？

最近鄰搜尋演算法

距離度量

2. 讀取器 - LLM 💬

2.1. 讀取器模型

2.2. 提示詞

2.3. 重排 (Reranking)

3. 整合所有部分！

更進一步 🗺️

建立評估流水線

改進檢索器

改進讀取器