Lighteval

( metric_name: str higher_is_better: bool category: MetricCategory use_case: MetricUseCase sample_level_fn: <built-in function callable> corpus_level_fn: <built-in function callable> )

Metric computed over the whole corpus; the computation happens during the aggregation phase.

class lighteval.metrics.utils.metric_utils.SampleLevelMetric

( metric_name: str higher_is_better: bool category: MetricCategory use_case: MetricUseCase sample_level_fn: <built-in function callable> corpus_level_fn: <built-in function callable> )

Metric computed per sample, then aggregated over the corpus.
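The split between `sample_level_fn` and `corpus_level_fn` can be illustrated with a minimal sketch. The two functions and the sample data below are hypothetical stand-ins written for this illustration; they are not lighteval code and do not use the `SampleLevelMetric` class itself.

```python
from statistics import mean


def sample_level_fn(gold: str, prediction: str) -> float:
    """Hypothetical per-sample score: 1.0 on an exact string match, else 0.0."""
    return 1.0 if gold == prediction else 0.0


def corpus_level_fn(sample_scores: list[float]) -> float:
    """Hypothetical aggregation step: the mean of the per-sample scores."""
    return mean(sample_scores)


# Made-up (gold, prediction) pairs, one per sample.
samples = [("Paris", "Paris"), ("4", "5"), ("blue", "blue"), ("cat", "cat")]

# Step 1: score each sample independently.
per_sample = [sample_level_fn(gold, pred) for gold, pred in samples]

# Step 2: aggregate the per-sample scores over the corpus.
corpus_score = corpus_level_fn(per_sample)
```

A corpus-level metric would skip step 1 and do all of its work in the aggregation function.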

class lighteval.metrics.utils.metric_utils.MetricGrouping

( metric_name: list higher_is_better: dict category: MetricCategory use_case: MetricUseCase sample_level_fn: <built-in function callable> corpus_level_fn: dict )

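Some metrics are more efficient to compute together. For example, if several metrics share the same costly preprocessing step, it makes more sense to run that step only once.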
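As a rough sketch of why grouping helps, the hypothetical function below runs one shared preprocessing step and returns several scores keyed by metric name, mirroring the `metric_name: list` and `corpus_level_fn: dict` shapes in the signature above. All names here (`expensive_preprocess`, `grouped_sample_level_fn`, the two metric names) are assumptions made for illustration, not part of lighteval.

```python
from statistics import mean


def expensive_preprocess(text: str) -> set[str]:
    """Stand-in for a costly step (here just lowercasing and tokenizing)."""
    return set(text.lower().split())


def grouped_sample_level_fn(gold: str, prediction: str) -> dict[str, float]:
    """Preprocess once, then derive every metric of the group from the result."""
    gold_tokens = expensive_preprocess(gold)
    pred_tokens = expensive_preprocess(prediction)
    overlap = len(gold_tokens & pred_tokens)
    return {
        "token_precision": overlap / len(pred_tokens) if pred_tokens else 0.0,
        "token_recall": overlap / len(gold_tokens) if gold_tokens else 0.0,
    }


# One entry per metric name, matching the dict-shaped arguments of a grouping.
metric_name = ["token_precision", "token_recall"]
higher_is_better = {name: True for name in metric_name}
corpus_level_fn = {name: mean for name in metric_name}

scores = grouped_sample_level_fn("the cat sat", "The cat")
```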

class lighteval.metrics.utils.metric_utils.CorpusLevelMetricGrouping

( metric_name: list higher_is_better: dict category: MetricCategory use_case: MetricUseCase sample_level_fn: <built-in function callable> corpus_level_fn: dict )

MetricGrouping computed over the whole corpus; the computation happens during the aggregation phase.

class lighteval.metrics.utils.metric_utils.SampleLevelMetricGrouping

( metric_name: list higher_is_better: dict category: MetricCategory use_case: MetricUseCase sample_level_fn: <built-in function callable> corpus_level_fn: dict )

MetricGrouping computed per sample, then aggregated over the corpus.

class lighteval.metrics.metrics_corpus.CorpusLevelF1Score

( average: str num_classes: int = 2 )

compute

( items: list )

Computes the metric score over all the generated items of the corpus, using the scikit-learn implementation.

class lighteval.metrics.metrics_corpus.CorpusLevelPerplexityMetric

( metric_type: str )

compute

( items: list )

Computes the metric score over all the generated items of the corpus.

class lighteval.metrics.metrics_corpus.CorpusLevelTranslationMetric

( metric_type: str lang: typing.Literal['zh', 'ja', 'ko', ''] = '' )

compute

( items: list )

Computes the metric score over all the generated items of the corpus, using the sacrebleu implementation.

lighteval.metrics.metrics_corpus.matthews_corrcoef

( items: list ) → float

Parameters

items (list[dict]) — List of GenerativeCorpusMetricInput

Returns

float

Score

Computes the Matthews correlation coefficient using scikit-learn (see the scikit-learn documentation).
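For reference, the binary form of the Matthews correlation coefficient can be written in a few lines of plain Python. This is only an illustration of the formula: the library function delegates to scikit-learn, and the helper below (`binary_matthews_corrcoef`, a name chosen for this sketch) takes plain 0/1 label lists rather than `GenerativeCorpusMetricInput` items, whose layout is not described here.

```python
import math


def binary_matthews_corrcoef(golds: list[int], preds: list[int]) -> float:
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(golds, preds))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used in this sketch: return 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


mcc = binary_matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).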

class lighteval.metrics.metrics_sample.ExactMatches

( aggregation_function: typing.Callable[[list[float]], float] = <built-in function max> normalize_gold: typing.Optional[typing.Callable[[str], str]] = None normalize_pred: typing.Optional[typing.Callable[[str], str]] = None strip_strings: bool = False type_exact_match: str = 'full' )

compute

( golds: list predictions: list **kwargs ) → float

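Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — Predicted strings

Returns

float

Aggregated score over the current sample's items.

Computes the metric over a list of golds and a list of predictions for a single sample.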

compute_one_item

( gold: str pred: str ) → float

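Parameters

gold (str) — One of the possible references
pred (str) — One of the possible predictions

Returns

float

The exact match score: 1 if the strings match, 0 otherwise.

Compares two strings only.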
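The relationship between `compute_one_item` and `compute` can be sketched as follows. This is a simplified re-implementation under stated assumptions: only full-string matching is covered, every gold is compared with every prediction, and the pairwise scores are reduced with `aggregation_function` (defaulting to `max`, as in the constructor signature). The function names are hypothetical.

```python
from typing import Callable


def exact_match_one_item(gold: str, pred: str, strip_strings: bool = False) -> float:
    """Compare one gold with one prediction: 1.0 if identical, else 0.0."""
    if strip_strings:
        gold, pred = gold.strip(), pred.strip()
    return 1.0 if gold == pred else 0.0


def exact_match(
    golds: list[str],
    predictions: list[str],
    aggregation_function: Callable[[list[float]], float] = max,
    strip_strings: bool = False,
) -> float:
    """Score every (gold, prediction) pair, then aggregate into one number."""
    pair_scores = [
        exact_match_one_item(gold, pred, strip_strings)
        for gold in golds
        for pred in predictions
    ]
    return aggregation_function(pair_scores)
```

With `max` as the aggregation, a sample counts as correct as soon as any prediction matches any gold.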

class lighteval.metrics.metrics_sample.F1_score

( aggregation_function: typing.Callable[[list[float]], float] = <built-in function max> normalize_gold: typing.Optional[typing.Callable[[str], str]] = None normalize_pred: typing.Optional[typing.Callable[[str], str]] = None strip_strings: bool = False )

compute

( golds: list predictions: list **kwargs ) → float

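Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — Predicted strings

Returns

float

Aggregated score over the current sample's items.

Computes the metric over a list of golds and a list of predictions for a single sample.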

compute_one_item

( gold: str pred: str ) → float

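Parameters

gold (str) — One of the possible references
pred (str) — One of the possible predictions

Returns

float

The F1 score over the bag of words, computed using nltk.

Compares two strings only.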
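A bag-of-words F1 can be sketched in plain Python as below. This is one common definition (token counts, whitespace tokenization) and is only an approximation of what the class does: the library relies on nltk, whose tokenization and handling of repeated tokens may differ. The helper name `bag_of_words_f1` is made up for this example.

```python
from collections import Counter


def bag_of_words_f1(gold: str, pred: str) -> float:
    """Harmonic mean of token-level precision and recall between two strings."""
    gold_tokens, pred_tokens = gold.split(), pred.split()
    # Multiset intersection: each token counts at most as often as in both strings.
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


f1 = bag_of_words_f1("the cat sat on the mat", "the cat sat")
```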

class lighteval.metrics.metrics_sample.LoglikelihoodAcc

( logprob_normalization: lighteval.metrics.normalizations.LogProbCharNorm | lighteval.metrics.normalizations.LogProbTokenNorm | lighteval.metrics.normalizations.LogProbPMINorm | None = None )

compute

( gold_ixs: list choices_logprob: list unconditioned_logprob: list[float] | None choices_tokens: list[list[int]] | None formatted_doc: Doc **kwargs ) → int

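Parameters

gold_ixs (list[int]) — Indices of all the gold choices
choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, in the same order as the choices.
unconditioned_logprob (list[float] | None) — Unconditioned log-probabilities used for PMI normalization, in the same order as the choices.
choices_tokens (list[list[int]] | None) — Tokenized choices used for token normalization, in the same order as the choices.
formatted_doc (Doc) — Original document for the sample. Used to get the length of the original choices for possible normalization.

Returns

int

The evaluation score: 1 if the choice with the best log-probability is among the golds, 0 otherwise.

Computes the log-likelihood accuracy: is the choice with the highest log-probability in `choices_logprob` present in `gold_ixs`?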
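The decision rule described above amounts to an argmax followed by a membership test. The sketch below is an independent illustration, not the library's code; the optional `choice_lengths` argument is a hypothetical stand-in for character-length normalization, used only to show how normalization can change the winning choice.

```python
from typing import Optional


def loglikelihood_acc(
    gold_ixs: list[int],
    choices_logprob: list[float],
    choice_lengths: Optional[list[int]] = None,
) -> int:
    """Return 1 if the highest-scoring choice is one of the golds, else 0."""
    scores = choices_logprob
    if choice_lengths is not None:
        # Assumed normalization: divide each summed log-prob by the choice length.
        scores = [lp / n for lp, n in zip(choices_logprob, choice_lengths)]
    best_choice = max(range(len(scores)), key=lambda i: scores[i])
    return int(best_choice in gold_ixs)
```

For instance, with `choices_logprob=[-4.0, -6.0]` the first choice wins on raw scores, but with `choice_lengths=[2, 6]` the normalized scores become `-2.0` and `-1.0`, so the second choice wins.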

class lighteval.metrics.metrics_sample.NormalizedMultiChoiceProbability

( log_prob_normalization: lighteval.metrics.normalizations.LogProbCharNorm | lighteval.metrics.normalizations.LogProbTokenNorm | lighteval.metrics.normalizations.LogProbPMINorm | None = None aggregation_function: typing.Callable[[numpy.ndarray], float] = <function max at 0x7f25f039e170> )

compute

( gold_ixs: list choices_logprob: list unconditioned_logprob: list[float] | None choices_tokens: list[list[int]] | None formatted_doc: Doc **kwargs ) → float

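Parameters

gold_ixs (list[int]) — Indices of all the gold choices
choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, in the same order as the choices.
unconditioned_logprob (list[float] | None) — Unconditioned log-probabilities used for PMI normalization, in the same order as the choices.
choices_tokens (list[list[int]] | None) — Tokenized choices used for token normalization, in the same order as the choices.
formatted_doc (Doc) — Original document for the sample. Used to get the length of the original choices for possible normalization.

Returns

float

The probability of the best log-prob choice being a gold choice.

Computes the log-likelihood probability: the chance of choosing the best choice.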
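One plausible reading of this metric, sketched below, is to turn the choice log-probabilities into a distribution that sums to 1 over the choices, then aggregate the probabilities assigned to the gold choices. This is an assumption about the behavior rather than a copy of the implementation; it uses plain Python lists and the built-in `max` instead of NumPy, and skips the optional log-prob normalization.

```python
import math
from typing import Callable


def normalized_multi_choice_probability(
    gold_ixs: list[int],
    choices_logprob: list[float],
    aggregation_function: Callable[[list[float]], float] = max,
) -> float:
    """Renormalize choice probabilities and aggregate those of the gold choices."""
    probs = [math.exp(lp) for lp in choices_logprob]
    total = sum(probs)
    normalized = [p / total for p in probs]
    return aggregation_function([normalized[i] for i in gold_ixs])


# Raw probabilities 0.2, 0.1, 0.1 renormalize to 0.5, 0.25, 0.25.
logprobs = [math.log(0.2), math.log(0.1), math.log(0.1)]
score = normalized_multi_choice_probability([0], logprobs)
```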

class lighteval.metrics.metrics_sample.Probability

( normalization: lighteval.metrics.normalizations.LogProbTokenNorm | None = None aggregation_function: typing.Callable[[numpy.ndarray], float] = <function max at 0x7f25f039e170> )

compute

( logprobs: list target_tokens: list **kwargs ) → float

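Parameters

gold_ixs (list[int]) — Indices of all the gold choices
choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, in the same order as the choices.
unconditioned_logprob (list[float] | None) — Unconditioned log-probabilities used for PMI normalization, in the same order as the choices.
choices_tokens (list[list[int]] | None) — Tokenized choices used for token normalization, in the same order as the choices.
formatted_doc (Doc) — Original document for the sample. Used to get the length of the original choices for possible normalization.

Returns

float

The probability of the best log-prob choice being a gold choice.

Computes the log-likelihood probability: the chance of choosing the best choice.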

class lighteval.metrics.metrics_sample.Recall

( at: int )

compute

( choices_logprob: list gold_ixs: list **kwargs ) → int

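Parameters

gold_ixs (list[int]) — Indices of all the gold choices
choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, in the same order as the choices.

Returns

int

Score: 1 if one of the top-ranked predicted choices is correct, 0 otherwise.

Computes the recall at the requested depth level: looks at the `n` best predicted choices (those with the highest log-probabilities, where `n` is the depth given by `at`) and checks whether an actual gold is among them.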
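The recall-at-depth check can be sketched as a sort followed by a membership test over the top slice. The function below is a standalone illustration written for this page, with `at` playing the role of the depth; it is not taken from the library.

```python
def recall_at(gold_ixs: list[int], choices_logprob: list[float], at: int) -> int:
    """Return 1 if any gold index is among the `at` highest-scoring choices."""
    # Choice indices ordered from highest to lowest log-probability.
    ranked = sorted(
        range(len(choices_logprob)),
        key=lambda i: choices_logprob[i],
        reverse=True,
    )
    return int(any(ix in gold_ixs for ix in ranked[:at]))


logprobs = [-3.0, -0.5, -1.0, -4.0]  # ranking of choice indices: 1, 2, 0, 3
```

With these log-probabilities, a gold at index 2 is missed at depth 1 but found at depth 2.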

class lighteval.metrics.metrics_sample.MRR

( length_normalization: bool = False )

compute

( choices_logprob: list gold_ixs: list formatted_doc: Doc **kwargs ) → float

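Parameters

gold_ixs (list[int]) — Indices of all the gold choices
choices_logprob (list[float]) — Summed log-probabilities of all the possible choices for the model, in the same order as the choices.
formatted_doc (Doc) — Original document for the sample. Used to get the length of the original choices for possible normalization.

Returns

float

MRR score.

Mean reciprocal rank. Measures the quality of a ranking of choices (ordered by correctness).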
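For a single sample, the reciprocal rank is 1 divided by the rank of the best-placed gold choice; averaging that value over the corpus gives the mean reciprocal rank. The sketch below assumes this standard definition, ignores the optional length normalization, and uses a helper name (`reciprocal_rank`) invented for the example.

```python
def reciprocal_rank(gold_ixs: list[int], choices_logprob: list[float]) -> float:
    """1 / rank of the best-ranked gold choice (rank 1 = highest log-prob)."""
    ranked = sorted(
        range(len(choices_logprob)),
        key=lambda i: choices_logprob[i],
        reverse=True,
    )
    best_position = min(ranked.index(gold) for gold in gold_ixs)  # 0-based
    return 1.0 / (best_position + 1)


logprobs = [-3.0, -0.5, -1.0, -4.0]  # ranking of choice indices: 1, 2, 0, 3
```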

class lighteval.metrics.metrics_sample.ROUGE

( methods: str | list[str] multiple_golds: bool = False bootstrap: bool = False normalize_gold: <built-in function callable> = None normalize_pred: <built-in function callable> = None aggregation_function: <built-in function callable> = None tokenizer: object = None )

compute

( golds: list predictions: list **kwargs ) → float or dict

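Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — Predicted strings

Returns

float or dict

Aggregated score over the current sample's items. If several ROUGE functions have been selected, returns a dict mapping each name to its score.

Computes the metric(s) over a list of golds and predictions for a single sample.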

class lighteval.metrics.metrics_sample.BertScore

( normalize_gold: <built-in function callable> = None normalize_pred: <built-in function callable> = None )

compute

( golds: list predictions: list **kwargs ) → dict

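Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — Predicted strings

Returns

dict

Scores over the current sample's items.

Computes the precision, recall, and F1 score using the BERT scorer.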

class lighteval.metrics.metrics_sample.Extractiveness

( normalize_input: <built-in function callable> = <function remove_braces at 0x7f24fcef9870> normalize_pred: <built-in function callable> = <function remove_braces_and_strip at 0x7f24fcef9900> input_column: str = 'text' )

compute

( predictions: list formatted_doc: Doc **kwargs ) → dict[str, float]

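Parameters

predictions (list[str]) — Predicted strings, a list of length 1.
formatted_doc (Doc) — The formatted document.

Returns

dict[str, float]

The extractiveness scores.

Computes the extractiveness of the predictions.

This method calculates coverage, density, and compression scores for a single prediction against the input text.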

class lighteval.metrics.metrics_sample.Faithfulness

( normalize_input: <built-in function callable> = <function remove_braces at 0x7f24fcef9870> normalize_pred: <built-in function callable> = <function remove_braces_and_strip at 0x7f24fcef9900> input_column: str = 'text' )

compute

( predictions: list formatted_doc: Doc **kwargs ) → dict[str, float]

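Parameters

predictions (list[str]) — Predicted strings, a list of length 1.
formatted_doc (Doc) — The formatted document.

Returns

dict[str, float]

The faithfulness scores.

Computes the faithfulness of the predictions.

The SummaCZS (Summary Content Zero-Shot) model is used with configurable granularity and model variation.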

class lighteval.metrics.metrics_sample.BLEURT

( )

compute

( golds: list predictions: list **kwargs ) → float

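Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — Predicted strings

Returns

float

Score over the current sample's items.

Uses the stored BLEURT scorer to compute the score on the current sample.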

class lighteval.metrics.metrics_sample.BLEU

( n_gram: int )

compute

( golds: list predictions: list **kwargs ) → float

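Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — Predicted strings

Returns

float

Score over the current sample's items.

Computes the sentence-level BLEU between the golds and each prediction, then takes the average.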

class lighteval.metrics.metrics_sample.StringDistance

( metric_types: list[str] | str strip_prediction: bool = True )

compute

( golds: list predictions: list **kwargs ) → dict

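Parameters

golds (list[str]) — List of possible golds. If it contains more than one item, only the first one is kept.
predictions (list[str]) — Predicted strings.

Returns

dict

The different scores computed.

Computes all the requested metrics on the golds and predictions.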

edit_similarity

( s1 s2 )

Computes the edit similarity between two lists of strings.

Edit similarity is also used in the paper Lee, Katherine, et al. "Deduplicating training data makes language models better." arXiv preprint arXiv:2107.06499 (2021).
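A common way to define edit similarity is one minus the Levenshtein distance divided by the length of the longer sequence. The sketch below assumes that definition and implements the distance with a small dynamic program so that it has no dependencies; the library may compute the distance differently. Returning 0.0 for two empty sequences is an arbitrary convention of this sketch.

```python
from typing import Sequence


def edit_distance(s1: Sequence, s2: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (a != b),  # substitution (free if equal)
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(s1: Sequence, s2: Sequence) -> float:
    """1 - distance / max length, so identical sequences score 1.0."""
    longest = max(len(s1), len(s2))
    return 1.0 - edit_distance(s1, s2) / longest if longest else 0.0
```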

longest_common_prefix_length

( s1: ndarray s2: ndarray )

Computes the length of the longest common prefix.
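The idea can be sketched with plain Python sequences: walk both sequences in parallel and count positions until the first mismatch. The signature above takes NumPy arrays; this standalone version accepts any sequences and is only meant to illustrate the computation.

```python
from itertools import takewhile
from typing import Sequence


def longest_common_prefix_length(s1: Sequence, s2: Sequence) -> int:
    """Number of leading positions at which the two sequences agree."""
    matching_pairs = takewhile(lambda pair: pair[0] == pair[1], zip(s1, s2))
    return sum(1 for _ in matching_pairs)


prefix_length = longest_common_prefix_length(
    "the cat sat".split(), "the cat ran".split()
)
```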

class lighteval.metrics.metrics_sample.JudgeLLM

( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: BaseModel = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None )

class lighteval.metrics.metrics_sample.JudgeLLMMTBench

( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: BaseModel = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None )

compute

( predictions: list formatted_doc: Doc **kwargs )

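Computes the score of a generative task using an LLM as a judge. The generative task can be multi-turn, with at most two turns; in that case, the scores for turn 1 and turn 2 are returned. Also returns user_prompt and judgement, which are later ignored by the aggregator.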

class lighteval.metrics.metrics_sample.JudgeLLMMixEval

( judge_model_name: str template: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'vllm', 'tgi', 'inference-providers'] short_judge_name: str | None = None response_format: BaseModel = None url: str | None = None hf_provider: str | None = None max_tokens: int | None = None )

compute

( sample_ids: list responses: list formatted_docs: list **kwargs )

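Computes the score of a generative task using an LLM as a judge. The generative task can be multi-turn, with at most two turns; in that case, the scores for turn 1 and turn 2 are returned. Also returns user_prompt and judgement, which are later ignored by the aggregator.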

class lighteval.metrics.metrics_sample.MajAtK

( k: int normalize_gold: <built-in function callable> = None normalize_pred: <built-in function callable> = None strip_strings: bool = False type_exact_match: str = 'full' )

compute

( golds: list predictions: list **kwargs ) → float

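Parameters

golds (list[str]) — Reference targets
predictions (list[str]) — k predicted strings

Returns

float

Aggregated score over the current sample's items.

Computes the metric over a list of golds and predictions for a single sample. It normalizes the model predictions and the gold (if needed), takes the most frequent answer among all the available ones, and then compares it to the gold.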
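The majority-vote step can be sketched with a `Counter`. This is a simplified illustration under several assumptions: a single gold string, full-string comparison, `str.strip` as the default normalizer, and at least one prediction (`k >= 1`). The function name `maj_at_k` is hypothetical.

```python
from collections import Counter
from typing import Callable


def maj_at_k(
    gold: str,
    predictions: list[str],
    k: int,
    normalize: Callable[[str], str] = str.strip,
) -> float:
    """Return 1.0 if the most frequent of the first k predictions equals the gold."""
    votes = Counter(normalize(pred) for pred in predictions[:k])
    # On ties, Counter.most_common keeps the answer that was seen first.
    majority_answer, _ = votes.most_common(1)[0]
    return 1.0 if majority_answer == normalize(gold) else 0.0
```

For example, with predictions `["42", " 42", "41", "7"]` and `k=4`, the normalized answer `"42"` gets two votes and wins.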

class lighteval.metrics.llm_as_judge.JudgeLM

( model: str templates: typing.Callable process_judge_response: typing.Callable judge_backend: typing.Literal['litellm', 'openai', 'transformers', 'tgi', 'vllm', 'inference-providers'] url: str | None = None api_key: str | None = None max_tokens: int = 512 response_format: BaseModel = None hf_provider: typing.Optional[typing.Literal['black-forest-labs', 'cerebras', 'cohere', 'fal-ai', 'fireworks-ai', 'inference-providers', 'hyperbolic', 'nebius', 'novita', 'openai', 'replicate', 'sambanova', 'together']] = None )

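Parameters

model (str) — The name of the model.
templates (Callable) — A function that takes the question, options, answer, and gold and returns the judge prompt.
process_judge_response (Callable) — A function for processing the judge's response.
judge_backend (Literal["openai", "transformers", "tgi", "vllm"]) — The backend for the judge.
url (str | None) — The URL for the OpenAI API.
api_key (str | None) — The API key for the OpenAI API (either an OpenAI or an HF key).

Attributes

model (str) — The name of the model.
template (Callable) — A function that takes the question, options, answer, and gold and returns the judge prompt.
API_MAX_RETRY (int) — The maximum number of retries for the API.
API_RETRY_SLEEP (int) — The time to sleep between retries.
client (OpenAI | None) — The OpenAI client.
pipe (LLM | AutoModel | None) — The Transformers or vLLM pipeline.
process_judge_response (Callable) — A function for processing the judge's response.
url (str | None) — The URL for the OpenAI API.
api_key (str | None) — The API key for the OpenAI API (either an OpenAI or an HF key).
backend (Literal["openai", "transformers", "tgi", "vllm"]) — The backend for the judge.

A class representing a judge that evaluates answers using either the OpenAI API or the Transformers library.

Methods

evaluate_answer — Evaluates an answer using the OpenAI API or the Transformers library.
lazy_load_client — Lazily loads the OpenAI client or the Transformers pipeline.
call_api — Calls the API to get the judge's response.
call_transformers — Calls the Transformers pipeline to get the judge's response.
call_vllm — Calls the vLLM pipeline to get the judge's response.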

dict_of_lists_to_list_of_dicts

( dict_of_lists )

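Parameters

dict_of_lists — A dictionary in which each value is a list. All lists are expected to have the same length.

Transforms a dictionary of lists into a list of dictionaries.

Each dictionary in the output list contains one element from each list in the input dictionary, under the same keys as the input dictionary.

Example

dict_of_lists_to_list_of_dicts({'k': [1, 2, 3], 'k2': ['a', 'b', 'c']})
[{'k': 1, 'k2': 'a'}, {'k': 2, 'k2': 'b'}, {'k': 3, 'k2': 'c'}]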
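The transformation is a transpose of the dictionary's values. A minimal sketch of one way to write it, reusing the function name from above but not taken from the library's source, is:

```python
def dict_of_lists_to_list_of_dicts(dict_of_lists: dict) -> list[dict]:
    """Turn {'k': [1, 2], 'k2': ['a', 'b']} into [{'k': 1, 'k2': 'a'}, ...]."""
    keys = list(dict_of_lists)
    # zip(*values) yields one tuple per position across all the lists.
    return [dict(zip(keys, row)) for row in zip(*dict_of_lists.values())]


rows = dict_of_lists_to_list_of_dicts({"k": [1, 2, 3], "k2": ["a", "b", "c"]})
```

Note that `zip` stops at the shortest list, so lists of unequal length are silently truncated in this sketch.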

evaluate_answer