Introduction to ggml

Published August 13, 2024

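ggml is a machine learning (ML) library written in C and C++ with a focus on Transformer inference. The project is open source and is actively developed by a growing community. ggml is similar to ML libraries such as PyTorch and TensorFlow, but it is still at an early stage of development and some of its fundamentals are still changing rapidly.

Over time, ggml has gained popularity alongside projects such as llama.cpp and whisper.cpp. Many other projects also use ggml under the hood to run LLMs on-device, including ollama, jan, LM Studio, and GPT4All.

The main reasons people choose ggml over other libraries are:

Minimalism: the core library is self-contained in fewer than 5 files. You may want to include additional files for GPU support, but that is optional.
Easy compilation: you don't need complex build tools. Without GPU support, all you need is GCC or Clang!
Lightweight: the compiled binary is smaller than 1MB, which is tiny compared to PyTorch (which usually takes hundreds of MB).
Good compatibility: it supports many types of hardware, including x86_64, ARM, Apple Silicon, and CUDA.
Support for quantized tensors: tensors can be quantized to save memory (similar to JPEG compression) and, in some cases, to improve performance.
Extreme memory efficiency: the overhead of storing tensors and performing computations is minimal.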
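
The idea behind quantized tensors can be sketched in plain C. The snippet below is a minimal illustration, not ggml's actual quantization code: the `block_q8` layout, the block length of 32, and the function names are assumptions chosen for this example. Each block stores one float scale plus 8-bit integers, so 32 values take 36 bytes instead of 128 on typical platforms, at the cost of a small rounding error.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32 /* assumed block length for this sketch */

/* One quantized block: a shared float scale plus BLOCK_SIZE signed bytes. */
typedef struct {
    float  scale;
    int8_t q[BLOCK_SIZE];
} block_q8;

/* Quantize BLOCK_SIZE floats: scale = max|x| / 127, q = round(x / scale). */
void quantize_block(const float *x, block_q8 *out) {
    float amax = 0.0f;
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    /* An all-zero block has scale 0; avoid dividing by zero. */
    float inv = (out->scale != 0.0f) ? 1.0f / out->scale : 0.0f;
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        float v = x[i] * inv;
        /* Round to nearest by adding +-0.5 before truncating. */
        out->q[i] = (int8_t) (v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}

/* Recover approximate floats: y = scale * q. */
void dequantize_block(const block_q8 *in, float *y) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        y[i] = in->scale * (float) in->q[i];
    }
}
```

Dequantizing gives back values within roughly half a scale step of the originals, which is the lossy trade-off the JPEG comparison refers to.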

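However, ggml also has some disadvantages that you should keep in mind when using it (this list may change in future versions of ggml):

Not all tensor operations are supported on all backends. For example, an operation may work on CPU but not on CUDA.
Development with ggml may not be straightforward and may require in-depth knowledge of low-level programming.
The project is under active development, so breaking changes are to be expected.

In this article, we focus on the fundamentals of ggml for developers who want to get started with the library. We do not cover higher-level tasks such as LLM inference with llama.cpp, which builds on ggml. Instead, we explore the core concepts and basic usage of ggml to provide a solid foundation for further learning and development.

Getting started

Great, so how do you start?

For simplicity, this guide shows how to compile ggml on Ubuntu. In practice, you can compile ggml on almost any platform, including Windows, macOS, and BSD.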

# Start by installing build dependencies
# "gdb" is optional, but is recommended
sudo apt install build-essential cmake git gdb

# Then, clone the repository
git clone https://github.com/ggerganov/ggml.git
cd ggml

# Try compiling one of the examples
cmake -B build
cmake --build build --config Release --target simple-ctx

# Run the example
./build/bin/simple-ctx

Expected output:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

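If you see the expected result, we're good to go!

Terminology and concepts

Before diving into ggml, we should understand some key concepts. If you come from a high-level library such as PyTorch or TensorFlow, these may seem hard to grasp. However, keep in mind that ggml is a low-level library. Understanding these terms gives you much more control over performance.

ggml_context: a "container" that holds objects such as tensors, graphs, and optionally data.
ggml_cgraph: represents a computational graph. Think of it as the "order of computation" that will be handed to the backend.
ggml_backend: represents an interface for executing computational graphs. There are many types of backends: CPU (default), CUDA, Metal (Apple Silicon), Vulkan, RPC, and others.
ggml_backend_buffer_type: represents a buffer type. Think of it as a "memory allocator" attached to each ggml_backend. For example, to perform calculations on a GPU, you need to allocate memory on the GPU via its buffer_type (usually abbreviated as buft).
ggml_backend_buffer: represents a buffer allocated by a buffer_type. Remember: one buffer can hold the data of multiple tensors.
ggml_gallocr: represents a graph memory allocator, used to efficiently allocate the tensors used in a computational graph.
ggml_backend_sched: a scheduler that allows multiple backends to be used concurrently. It can distribute computation across different hardware (for example, GPU and CPU) when dealing with large models or multiple GPUs. The scheduler can also automatically assign operations that the GPU does not support to the CPU, ensuring good resource utilization and compatibility.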
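
To make the "container" idea behind ggml_context more concrete, here is a minimal sketch of a fixed-size bump allocator in plain C. It is only an analogy and not ggml's implementation: the `arena` type, its functions, and the 16-byte alignment are all hypothetical names and choices made for this illustration. The point it shows is that every object is carved out of one block whose size is decided up front, which is why the examples in this article compute a context size before calling ggml_init.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size arena: one malloc'd block, handed out piece by piece. */
typedef struct {
    uint8_t *base; /* start of the block */
    size_t   size; /* total capacity in bytes */
    size_t   used; /* bytes handed out so far */
} arena;

/* Reserve the whole block once; returns 1 on success, 0 on failure. */
int arena_init(arena *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Hand out n bytes at a 16-byte aligned offset, or NULL if the arena is full. */
void *arena_alloc(arena *a, size_t n) {
    const size_t align = 16; /* assumed alignment for this sketch */
    size_t start = (a->used + align - 1) / align * align;
    if (start > a->size || n > a->size - start) {
        return NULL; /* the size chosen at init time was too small */
    }
    a->used = start + n;
    return a->base + start;
}

/* Objects are never freed individually; the whole block goes away at once. */
void arena_free(arena *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

With a 64-byte arena, two 10-byte requests land at offsets 0 and 16, and a following 40-byte request fails because only 32 bytes remain after alignment.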

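Simple example

In this example, we go step by step through reproducing the code we ran in the Getting started section. We need to create two matrices, multiply them, and get the result. With PyTorch, the code looks like this: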

import torch

# Create two matrices
matrix1 = torch.tensor([
  [2, 8],
  [5, 1],
  [4, 2],
  [8, 6],
])
matrix2 = torch.tensor([
  [10, 5],
  [9, 9],
  [5, 4],
])

# Perform matrix multiplication
result = torch.matmul(matrix1, matrix2.T)
print(result.T)

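With ggml, the following steps are needed to achieve the same result:

1. Allocate a ggml_context to store tensor data
2. Create tensors and set data
3. Create a ggml_cgraph for the mul_mat operation
4. Run the computation
5. Retrieve results (output tensors)
6. Free memory and exit

NOTE: In this example, we allocate the tensor data inside the ggml_context for simplicity. In practice, memory should be allocated as a device buffer, as we'll see in the next section.

To get started, let's create a new directory, examples/demo: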

cd ggml # make sure you're in the project root

# create the directory, then the C source and CMakeLists files
mkdir -p examples/demo
touch examples/demo/demo.c
touch examples/demo/CMakeLists.txt

The code for this example is based on simple-ctx.cpp.

Edit examples/demo/demo.c with the content below:

#include "ggml.h"
#include "ggml-cpu.h"
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
    enum { rows_A = 4, cols_A = 2 }; // compile-time constants, so the arrays below are not VLAs
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    enum { rows_B = 3, cols_B = 2 };
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Allocate `ggml_context` to store tensor data
    // Calculate the size needed to allocate
    size_t ctx_size = 0;
    ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // tensor a
    ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // tensor b
    ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result
    ctx_size += 3 * ggml_tensor_overhead(); // metadata for 3 tensors
    ctx_size += ggml_graph_overhead(); // compute graph
    ctx_size += 1024; // some overhead (exact calculation omitted for simplicity)

    // Allocate `ggml_context` to store tensor data
    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 2. Create tensors and set data
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);
    memcpy(tensor_a->data, matrix_A, ggml_nbytes(tensor_a));
    memcpy(tensor_b->data, matrix_B, ggml_nbytes(tensor_b));


    // 3. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    // result = a*b^T
    // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
    // the result is transposed
    struct ggml_tensor * result = ggml_mul_mat(ctx, tensor_a, tensor_b);

    // Mark the "result" tensor to be computed
    ggml_build_forward_expand(gf, result);

    // 4. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    // 5. Retrieve results (output tensors)
    float * result_data = (float *) result->data;
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1] /* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0] /* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");

    // 6. Free memory and exit
    ggml_free(ctx);
    return 0;
}

Write these lines in the examples/demo/CMakeLists.txt file you created:

set(TEST_TARGET demo)
add_executable(${TEST_TARGET} demo.c)
target_link_libraries(${TEST_TARGET} PRIVATE ggml)

Edit examples/CMakeLists.txt and add this line at the end:

add_subdirectory(demo)

Compile and run it:

cmake -B build
cmake --build build --config Release --target demo

# Run it
./build/bin/demo

Expected result:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]
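
To see where these numbers come from, the same computation can be written as three plain loops. This is a reference sketch only, with a hypothetical function name (`mul_mat_ref`), and it does not use ggml at all. It assumes both inputs are stored row by row with the same row length `k`, and it writes the output in the transposed layout the demo prints: element `out[j * rows_a + i]` is the dot product of row `i` of A with row `j` of B.

```c
#include <stddef.h>

/*
 * Reference for the demo's result layout (not ggml code).
 * a:   rows_a x k, row-major
 * b:   rows_b x k, row-major
 * out: rows_b x rows_a, row-major, i.e. the transpose of a * b^T
 */
void mul_mat_ref(const float *a, int rows_a,
                 const float *b, int rows_b,
                 int k, float *out) {
    for (int j = 0; j < rows_b; j++) {
        for (int i = 0; i < rows_a; i++) {
            float sum = 0.0f;
            for (int c = 0; c < k; c++) {
                /* dot product of row i of a with row j of b */
                sum += a[(size_t) i * k + c] * b[(size_t) j * k + c];
            }
            out[(size_t) j * rows_a + i] = sum;
        }
    }
}
```

For the matrices used in the demo, the first output row is the dot product of each row of A with the first row of B, (10, 5): 2*10 + 8*5 = 60, 5*10 + 1*5 = 55, 4*10 + 2*5 = 50, and 8*10 + 6*5 = 110.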

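Example with a backend

A "backend" in ggml refers to an interface that can handle tensor operations. A backend can be CPU, CUDA, Vulkan, and so on.

The backend abstracts the execution of computational graphs. Once a graph is defined, it can be computed on the available hardware using the corresponding backend implementation. Note that ggml automatically reserves memory for any intermediate tensors needed by the computation and optimizes memory usage based on the lifetime of these intermediate results.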
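
Why lifetimes matter for memory can be shown with a small calculation. The sketch below is not how ggml's graph allocator works internally; the `lifetime` struct and both function names are hypothetical. It assumes each intermediate tensor is produced at one step and last read at a later step, and it compares the naive total (every tensor gets its own memory) with the peak number of bytes alive at any single step, which is a lower bound on what a lifetime-aware allocator needs.

```c
#include <stddef.h>

/* Hypothetical description of one intermediate tensor. */
typedef struct {
    size_t bytes; /* size of the tensor data */
    int    first; /* step that produces the tensor */
    int    last;  /* last step that reads it */
} lifetime;

/* Naive requirement: every tensor keeps its own memory for the whole run. */
size_t total_bytes(const lifetime *t, int n) {
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        total += t[i].bytes;
    }
    return total;
}

/* Peak bytes alive at any step; memory of dead tensors could be reused. */
size_t peak_live_bytes(const lifetime *t, int n, int n_steps) {
    size_t peak = 0;
    for (int s = 0; s < n_steps; s++) {
        size_t live = 0;
        for (int i = 0; i < n; i++) {
            if (t[i].first <= s && s <= t[i].last) {
                live += t[i].bytes;
            }
        }
        if (live > peak) {
            peak = live;
        }
    }
    return peak;
}
```

For a chain of three 100-byte intermediates where each one is only read by the next step, the naive total is 300 bytes, while at most two of them are alive at the same time, so the peak is 200 bytes.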

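When doing computation or inference with a backend, the common steps are:

1. Initialize a ggml_backend
2. Allocate a ggml_context to store tensor metadata (we don't need to allocate tensor data right away)
3. Create tensor metadata (only their shapes and data types)
4. Allocate a ggml_backend_buffer to store all tensors
5. Copy tensor data from main memory (RAM) to the backend buffer
6. Create a ggml_cgraph for the mul_mat operation
7. Create a ggml_gallocr for cgraph allocation
8. Optionally: schedule the cgraph using ggml_backend_sched
9. Run the computation
10. Retrieve results (output tensors)
11. Free memory and exit

The code for this example is based on simple-backend.cpp. Replace the contents of examples/demo/demo.c with the following: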

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h" // declares the CPU backend functions used below
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    // initialize data of matrices to perform matrix multiplication
    enum { rows_A = 4, cols_A = 2 }; // compile-time constants, so the arrays below are not VLAs
    float matrix_A[rows_A * cols_A] = {
        2, 8,
        5, 1,
        4, 2,
        8, 6
    };
    enum { rows_B = 3, cols_B = 2 };
    float matrix_B[rows_B * cols_B] = {
        10, 5,
        9, 9,
        5, 4
    };

    // 1. Initialize backend
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
    fprintf(stderr, "%s: using CUDA backend\n", __func__);
    backend = ggml_backend_cuda_init(0); // init device 0
    if (!backend) {
        fprintf(stderr, "%s: ggml_backend_cuda_init() failed\n", __func__);
    }
#endif
    // if no GPU backend is available, fall back to the CPU backend
    if (!backend) {
        backend = ggml_backend_cpu_init();
    }

    // Calculate the size needed to allocate
    size_t ctx_size = 0;
    ctx_size += 2 * ggml_tensor_overhead(); // tensors
    // no need to allocate anything else!

    // 2. Allocate `ggml_context` to store tensor metadata
    struct ggml_init_params params = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_backend_alloc_ctx_tensors()
    };
    struct ggml_context * ctx = ggml_init(params);

    // 3. Create tensor metadata (only their shapes and data types)
    struct ggml_tensor * tensor_a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_A, rows_A);
    struct ggml_tensor * tensor_b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols_B, rows_B);

    // 4. Allocate a `ggml_backend_buffer` to store all tensors
    ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // 5. Copy tensor data from main memory (RAM) to backend buffer
    ggml_backend_tensor_set(tensor_a, matrix_A, 0, ggml_nbytes(tensor_a));
    ggml_backend_tensor_set(tensor_b, matrix_B, 0, ggml_nbytes(tensor_b));

    // 6. Create a `ggml_cgraph` for mul_mat operation
    struct ggml_cgraph * gf = NULL;
    struct ggml_context * ctx_cgraph = NULL;
    {
        // create a temporary context to build the graph
        struct ggml_init_params params0 = {
            /*.mem_size   =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_gallocr_alloc_graph()
        };
        ctx_cgraph = ggml_init(params0);
        gf = ggml_new_graph(ctx_cgraph);

        // result = a*b^T
        // Pay attention: ggml_mul_mat(A, B) ==> B will be transposed internally
        // the result is transposed
        struct ggml_tensor * result0 = ggml_mul_mat(ctx_cgraph, tensor_a, tensor_b);

        // Add "result" tensor and all of its dependencies to the cgraph
        ggml_build_forward_expand(gf, result0);
    }

    // 7. Create a `ggml_gallocr` for cgraph computation
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);

    // (we skip step 8. Optionally: schedule the cgraph using `ggml_backend_sched`)

    // 9. Run the computation
    int n_threads = 1; // Optional: number of threads to perform some operations with multi-threading
    if (ggml_backend_is_cpu(backend)) {
        ggml_backend_cpu_set_n_threads(backend, n_threads);
    }
    ggml_backend_graph_compute(backend, gf);

    // 10. Retrieve results (output tensors)
    // in this example, output tensor is always the last tensor in the graph
    struct ggml_tensor * result = ggml_graph_node(gf, -1);
    float * result_data = malloc(ggml_nbytes(result));
    // because the tensor data is stored in device buffer, we need to copy it back to RAM
    ggml_backend_tensor_get(result, result_data, 0, ggml_nbytes(result));
    printf("mul mat (%d x %d) (transposed result):\n[", (int) result->ne[0], (int) result->ne[1]);
    for (int j = 0; j < result->ne[1] /* rows */; j++) {
        if (j > 0) {
            printf("\n");
        }

        for (int i = 0; i < result->ne[0] /* cols */; i++) {
            printf(" %.2f", result_data[j * result->ne[0] + i]);
        }
    }
    printf(" ]\n");
    free(result_data);

    // 11. Free memory and exit
    ggml_free(ctx_cgraph);
    ggml_gallocr_free(allocr);
    ggml_free(ctx);
    ggml_backend_buffer_free(buffer);
    ggml_backend_free(backend);
    return 0;
}

Compile and run it; you should get the same result as in the previous example.

cmake -B build
cmake --build build --config Release --target demo

# Run it
./build/bin/demo

Expected result:

mul mat (4 x 3) (transposed result):
[ 60.00 55.00 50.00 110.00
 90.00 54.00 54.00 126.00
 42.00 29.00 28.00 64.00 ]

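Printing the computational graph

The ggml_cgraph represents the computational graph, which defines the order of operations that the backend will execute. Printing the graph can be a helpful debugging tool, especially when working with more complex models and computations.

You can add a call to ggml_graph_print to print the cgraph: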

...

// Mark the "result" tensor to be computed
ggml_build_forward_expand(gf, result0);

// Print the cgraph
ggml_graph_print(gf);

Running it prints:

=== GRAPH ===
n_nodes = 1
 -   0: [     4,     3,     1]          MUL_MAT  
n_leafs = 2
 -   0: [     2,     4]     NONE           leaf_0
 -   1: [     2,     3]     NONE           leaf_1
========================================

Additionally, you can dump the cgraph in graphviz dot format:

ggml_graph_dump_dot(gf, NULL, "debug.dot");

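You can render debug.dot into a final image with the dot command (for example, dot -Tpng debug.dot -o debug.png) or with an online Graphviz viewer.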

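Conclusion

This article gave an introductory overview of ggml, covering the key concepts, a simple usage example, and an example that uses a backend. While we've covered the basics, there is much more to explore in ggml.

In upcoming articles, we'll dive deeper into other ggml-related topics, such as the GGUF format, quantization, and how the different backends are organized and used. You can also visit the ggml examples directory to see more advanced use cases and sample code. Stay tuned for more ggml content in the future!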

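Community

daynauth

Feb 24

The first line of step 10 in simple-backend.cpp should be

struct ggml_tensor * result = ggml_graph_node(gf, ggml_graph_n_nodes(gf) - 1);

instead of

struct ggml_tensor * result = gf->nodes[gf->n_nodes - 1];

because gf is used as an opaque pointer.

ngxson

Article author, Feb 25

This blog post was written before ggml_graph_n_nodes was introduced, so the content is not up to date. Feel free to open a PR to fix it if you'd like. Thanks.

nsparks

Mar 4

Between the ggml project, llama.cpp, and external projects like ollama... what is the update process?

Should changes to ggml be submitted to ggml-org/ggml and then pulled into llama.cpp and the other projects? To avoid fragmentation, it seems ggml should be maintained separately and used as a submodule of llama.cpp (but I'm far from a git expert).

dasoran

May 6

I think I need to add the line #include "ggml-cpu.h" to the second demo.c.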
