Mini-Infer Architecture Deep Dive (5): `Engine` - the "Conductor" That Ties Everything Together

1. The design philosophy of `Engine`: separating compilation from execution

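The most important point in designing an inference engine's API is to separate the one-time preparation work from the high-frequency execution work.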

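build() (compile): load the model, optimize the graph, run the topological sort, allocate memory. These operations are expensive, but we only need to do them once.
forward() (execute): run the model. This operation must be extremely lightweight, because it will be called thousands of times.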

The interface of the Engine class (engine.h) reflects this separation directly.

// mini_infer/runtime/engine.h
#pragma once

#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

#include "mini_infer/graph/graph.h"
#include "mini_infer/backends/backend.h"
// ...

namespace mini_infer {
namespace runtime {

/**
 * @brief Configuration for the inference engine
 */
struct EngineConfig {
    core::DeviceType device_type{core::DeviceType::CPU};
    int32_t device_id{0};
    // ...
};

/**
 * @brief Inference Engine
 * Execute the graph and manage the inference process
 */
class Engine {
public:
    explicit Engine(const EngineConfig& config);
    ~Engine() = default;
    
    /**
     * @brief Build the engine (compile)
     * 1. Validate the graph
     * 2. Optimize the graph (A-Track task)
     * 3. Topological sort
     * 4. Preallocate memory
     */
    core::Status build(std::shared_ptr<graph::Graph> graph);
    
    /**
     * @brief Execute the graph (execute)
     * 1. Inject the inputs
     * 2. Walk the execution plan
     * 3. Collect the outputs
     */
    core::Status forward(
        const std::unordered_map<std::string, std::shared_ptr<core::Tensor>>& inputs,
        std::unordered_map<std::string, std::shared_ptr<core::Tensor>>& outputs
    );
    
    // ... other helper functions ...
    
private:
    // Engine is a "state machine" that owns all of the core assets:
    EngineConfig config_;
    std::shared_ptr<graph::Graph> graph_; // [Blueprint]
    std::shared_ptr<backends::Backend> backend_; // [Venue]

    // [Compilation result]: the sorted execution plan
    std::vector<std::shared_ptr<graph::Node>> sorted_nodes_;

    // Private helper methods
    core::Status allocate_tensors();
    core::Status execute_node(std::shared_ptr<graph::Node> node);
};

} // namespace runtime
} // namespace mini_infer

Engine is a state machine. Its private members graph_, backend_, and sorted_nodes_ make up its core state. This state is established during build() and is only read during forward().
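To make the "established in build(), read-only in forward()" idea concrete, here is a minimal sketch. ToyEngine, its Status enum, and the built_ flag are hypothetical names invented for this illustration, not part of Mini-Infer. In particular, the Engine listing above does not show any guard against forward() being called before build(), so treat the guard as one possible design rather than the actual implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical status codes for this sketch only (not Mini-Infer's core::Status).
enum class Status { SUCCESS, ERROR_INVALID_ARGUMENT, ERROR_NOT_BUILT };

// A toy engine whose "execution plan" is just a list of step names.
class ToyEngine {
public:
    // Expensive, one-time phase: validate the input and store the plan.
    Status build(const std::vector<std::string>& steps) {
        if (steps.empty()) return Status::ERROR_INVALID_ARGUMENT;
        plan_ = steps;
        built_ = true;
        return Status::SUCCESS;
    }

    // Cheap, repeated phase. It is const because it only reads the state
    // that build() created.
    Status forward(std::vector<std::string>& trace) const {
        if (!built_) return Status::ERROR_NOT_BUILT;
        trace.clear();
        for (const auto& step : plan_) trace.push_back(step);
        return Status::SUCCESS;
    }

private:
    bool built_{false};
    std::vector<std::string> plan_;
};
```

Marking forward() as const is a cheap way to let the compiler enforce the read-only contract. Whether that is practical for the real Engine depends on how node output tensors are stored.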

2. The "compile" phase (`Engine::build`): doing all the heavy work once

The build function is the "preparation" phase of our Engine. It is a carefully arranged pipeline that calls all of the Graph algorithms we built earlier:

// mini_infer/runtime/engine.cpp
core::Status Engine::build(std::shared_ptr<graph::Graph> graph) {
    if (!graph) return core::Status::ERROR_INVALID_ARGUMENT;
    graph_ = graph;
    
    // 1. Validate the graph (for example: do the inputs and outputs exist?)
    auto status = graph_->validate();
    if (status != core::Status::SUCCESS) {
        MI_LOG_ERROR("Graph validation failed");
        return status;
    }
    
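    // 2. [A-Track placeholder] Graph optimization
    //    This is where we will implement Conv+ReLU fusion in the future!
    status = graph_->optimize();
    if (status != core::Status::SUCCESS) {
        MI_LOG_ERROR("Graph optimization failed");
        return status;
    }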
    
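    // 3. [Core] Obtain the execution plan
    //    Call the topological sort from the previous chapter to get a valid execution order
    //    (a graph may admit several valid orders; any one of them works)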
    status = graph_->topological_sort(sorted_nodes_);
    if (status != core::Status::SUCCESS) {
        MI_LOG_ERROR("Topological sort failed (cycle detected?)");
        return status;
    }
    
    // 4. [Performance critical] Preallocate all intermediate Tensors
    //    Based on infer_shape() (which we will implement in the next chapter)
    status = allocate_tensors();
    if (status != core::Status::SUCCESS) {
        MI_LOG_ERROR("Tensor allocation failed");
        return status;
    }
    
    MI_LOG_INFO("Engine built successfully");
    return core::Status::SUCCESS;
}

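The point of build is that it takes on all of the algorithmic complexity. topological_sort is O(V+E), which is affordable once but wasteful to repeat on every inference, and optimize costs even more. build absorbs all of this overhead so that forward can stay light.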
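For readers who want a reminder of where the O(V+E) bound comes from, the following is a minimal sketch of Kahn's algorithm. It is an assumption that this resembles Graph::topological_sort; the real method may use a different approach (such as DFS) and works on graph::Node pointers rather than the integer ids used here. The name toy_topological_sort is hypothetical.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Kahn's algorithm over an adjacency list.
// edges[u] lists the nodes that consume the output of node u.
// Returns true and fills `order` on success; returns false if a cycle exists.
bool toy_topological_sort(const std::vector<std::vector<int>>& edges,
                          std::vector<int>& order) {
    const std::size_t n = edges.size();

    // Count incoming edges for every node: one pass over all edges, O(E).
    std::vector<int> in_degree(n, 0);
    for (const auto& consumers : edges)
        for (int v : consumers) ++in_degree[v];

    // Nodes with no pending inputs are ready to run: O(V).
    std::queue<int> ready;
    for (std::size_t u = 0; u < n; ++u)
        if (in_degree[u] == 0) ready.push(static_cast<int>(u));

    // Each node is popped once and each edge is relaxed once: O(V+E).
    order.clear();
    while (!ready.empty()) {
        int u = ready.front();
        ready.pop();
        order.push_back(u);
        for (int v : edges[u])
            if (--in_degree[v] == 0) ready.push(v);
    }

    // If some node never became ready, the graph contains a cycle.
    return order.size() == n;
}
```

The cycle check at the end corresponds to the "cycle detected?" error path in build.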

3. The "execute" phase (`Engine::forward`): the lightweight "hot" path

The forward function is the Engine's hot path. Its design goal is to be as simple and efficient as possible.

It can achieve this only because of sorted_nodes_ (the execution plan) prepared during the build phase.

// mini_infer/runtime/engine.cpp
core::Status Engine::forward(
    const std::unordered_map<std::string, std::shared_ptr<core::Tensor>>& inputs,
    std::unordered_map<std::string, std::shared_ptr<core::Tensor>>& outputs) {
    
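    // --- 1. Inject the inputs ---
    // Walk the graph's "entry" nodes
    for (const auto& input_name : graph_->inputs()) {
        auto it = inputs.find(input_name);
        if (it == inputs.end()) { /* ... error: the caller did not provide this input ... */ }

        // "Inject" the user's Tensor into the graph as the "output" of the Input node
        auto node = graph_->get_node(input_name);
        if (!node) return core::Status::ERROR_INVALID_ARGUMENT;
        node->set_output_tensors({it->second});
    }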
    
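    // --- 2. Sequential execution [Core] ---
    // Walk the already sorted execution plan (sorted_nodes_) without any further decisions
    // This is not recursion; it is a simple, fast for loop
    for (auto& node : sorted_nodes_) {
        // execute_node processes this node and prepares the data for the next one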
        auto status = execute_node(node);
        if (status != core::Status::SUCCESS) {
            MI_LOG_ERROR("Node execution failed: " + node->name());
            return status;
        }
    }
    
    // --- 3. Collect the outputs ---
    // Walk the graph's "exit" nodes
    outputs.clear();
    for (const auto& output_name : graph_->outputs()) {
        auto node = graph_->get_node(output_name);
        // Take the final Tensor from the "exit" node and hand it to the user
        if (node && !node->output_tensors().empty()) {
            outputs[output_name] = node->output_tensors()[0];
        }
    }
    
    return core::Status::SUCCESS;
}
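The input-injection step above leaves its error branch elided. As a small, standalone illustration of one way to validate a name-to-tensor map before touching it->second, consider the sketch below. ToyTensor, TensorMap, and bind_inputs are hypothetical names for this example, and returning a bool instead of core::Status is a simplification.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-ins for core::Tensor and the input/output map type (assumptions).
using ToyTensor = std::vector<float>;
using TensorMap = std::unordered_map<std::string, std::shared_ptr<ToyTensor>>;

// Copies the tensors named in `required` from `provided` into `bound`.
// Returns false and leaves `bound` empty if any required input is missing
// or null, so a failed call never leaves a half-populated state behind.
bool bind_inputs(const std::vector<std::string>& required,
                 const TensorMap& provided,
                 TensorMap& bound) {
    bound.clear();
    for (const auto& name : required) {
        auto it = provided.find(name);
        if (it == provided.end() || !it->second) {
            bound.clear();
            return false;
        }
        bound[name] = it->second;  // shares ownership, no data copy
    }
    return true;
}
```

Checking every required name before executing any node keeps a bad call from failing halfway through the graph.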

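This **"push-based"** sequential execution stands in sharp contrast to ncnn's "pull-based" recursive execution. It is a common approach in modern frameworks (such as PyTorch and ONNX Runtime) because it has no recursion overhead, the logic is easy to follow, and it maps naturally onto asynchronous execution (such as CUDA Streams).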
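The difference between the two styles is easier to see side by side. The sketch below runs the same toy graph both ways. CalcNode, run_push, run_pull, and make_toy_graph are hypothetical names, nodes produce a single float instead of a Tensor, and the pull variant is only a simplified picture of recursive extraction, not ncnn's actual code.

```cpp
#include <functional>
#include <vector>

// A toy node that produces one float from the outputs of its upstream nodes.
struct CalcNode {
    std::vector<int> inputs;  // indices of upstream nodes
    std::function<float(const std::vector<float>&)> op;
    float output{0.0f};
    bool done{false};         // used only by the pull-based variant
};

// Push-based: walk a precomputed topological order front to back.
// By construction, every input is ready before its consumer runs.
void run_push(std::vector<CalcNode>& g, const std::vector<int>& sorted) {
    for (int id : sorted) {
        std::vector<float> args;
        for (int in : g[id].inputs) args.push_back(g[in].output);
        g[id].output = g[id].op(args);
    }
}

// Pull-based: start at the requested node and recurse into inputs on demand.
float run_pull(std::vector<CalcNode>& g, int id) {
    CalcNode& n = g[id];
    if (n.done) return n.output;  // already computed during this run
    std::vector<float> args;
    for (int in : n.inputs) args.push_back(run_pull(g, in));
    n.output = n.op(args);
    n.done = true;
    return n.output;
}

// Builds the graph (2, 3) -> add -> times two.
std::vector<CalcNode> make_toy_graph() {
    std::vector<CalcNode> g(4);
    g[0].op = [](const std::vector<float>&) { return 2.0f; };
    g[1].op = [](const std::vector<float>&) { return 3.0f; };
    g[2].inputs = {0, 1};
    g[2].op = [](const std::vector<float>& x) { return x[0] + x[1]; };
    g[3].inputs = {2};
    g[3].op = [](const std::vector<float>& x) { return 2.0f * x[0]; };
    return g;
}
```

Both variants compute the same value for node 3. The push version needs the sorted order up front, which is exactly what build provides, while the pull version discovers the order at run time through its call stack.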

4. `execute_node`: the bridge between "graph" and "operator"

The forward loop relies on a helper function, execute_node. This function is the "translator" between the Graph world and the Operator world.

// mini_infer/runtime/engine.cpp
core::Status Engine::execute_node(std::shared_ptr<graph::Node> node) {
    if (!node || !node->get_operator()) { /* ... error ... */ }

    // 1. [Translate the inputs]
    //    Collect the "outputs" (output_tensors) of the "upstream" nodes (input_node) from the graph
    std::vector<std::shared_ptr<core::Tensor>> input_tensors;
    for (const auto& input_node : node->inputs()) {
        const auto& outputs = input_node->output_tensors();
        if (!outputs.empty()) {
            input_tensors.push_back(outputs[0]); // assumes a single output per node
        }
    }
    
    // 2. [Get the output buffers]
    //    Fetch the current node's "outputs" (output_tensors), which are ready to be filled
    //    These Tensors were preallocated in build() -> allocate_tensors()
    auto& output_tensors = node->output_tensors();
    
    // 3. [Run the computation]
    //    Call the operator's forward, passing it the translated inputs and the output buffers
    return node->get_operator()->forward(input_tensors, output_tensors);
}

The logic of execute_node is the heart of Mini-Infer's data flow: a node's inputs are simply the outputs of its upstream input_nodes.

This function converts the graph's topology (node->inputs()) into the vector of Tensor pointers that Operator::forward expects.
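The same data flow can be exercised in isolation with a toy operator. Everything in the sketch below (ToyTensor, ToyOperator, ToyAdd, FlowNode, toy_execute_node) is a hypothetical stand-in for the Mini-Infer types, and bool replaces core::Status. One deliberate difference from the listing above: a missing upstream output is treated as an error here instead of being skipped, on the assumption that an operator should never receive fewer inputs than it expects.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

using ToyTensor = std::vector<float>;
using TensorPtr = std::shared_ptr<ToyTensor>;

// Minimal operator interface: read inputs, fill preallocated outputs.
struct ToyOperator {
    virtual ~ToyOperator() = default;
    virtual bool forward(const std::vector<TensorPtr>& inputs,
                         std::vector<TensorPtr>& outputs) = 0;
};

// Element-wise addition of two equally sized tensors.
struct ToyAdd : ToyOperator {
    bool forward(const std::vector<TensorPtr>& inputs,
                 std::vector<TensorPtr>& outputs) override {
        if (inputs.size() != 2 || !inputs[0] || !inputs[1]) return false;
        if (outputs.empty() || !outputs[0]) return false;
        if (inputs[0]->size() != inputs[1]->size()) return false;
        ToyTensor& out = *outputs[0];
        out.resize(inputs[0]->size());
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = (*inputs[0])[i] + (*inputs[1])[i];
        return true;
    }
};

// A graph node: upstream links, output buffers, and the operator to run.
struct FlowNode {
    std::vector<std::shared_ptr<FlowNode>> inputs;
    std::vector<TensorPtr> output_tensors;
    std::shared_ptr<ToyOperator> op;  // null for pure input nodes
};

// A node's inputs are the first outputs of its upstream nodes.
bool toy_execute_node(FlowNode& node) {
    if (!node.op) return false;
    std::vector<TensorPtr> input_tensors;
    for (const auto& upstream : node.inputs) {
        if (!upstream || upstream->output_tensors.empty()) return false;
        input_tensors.push_back(upstream->output_tensors[0]);
    }
    return node.op->forward(input_tensors, node.output_tensors);
}
```

Because only shared_ptr handles are gathered, the translation step copies no tensor data. The operator writes straight into the buffer that the node already owns.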

Summary and outlook

With Engine taking the throne, Mini-Infer's core architecture is complete.

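We now have a complete, end-to-end "skeleton" of an inference engine. It can load a Graph, manage hardware through a Backend, use topological_sort to produce an execution plan, and call Operators from the forward loop.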

All of our architecture design work (Blog 1-5) has now come full circle.

But for now this engine is still "idling". allocate_tensors is still a //TODO, and we have not implemented a single Operator yet.