Mini-Infer (26): Runtime Architecture Refactoring (Part 1): InferencePlan and Build-Time Optimization

1. Why Separate Build-Time from Run-Time?

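In the previous architecture, the Engine class carried too many responsibilities: graph construction, optimization, memory planning, execution-context management, and inference execution. This all-in-one design works well in simple scenarios, but as features accumulated, its problems became apparent:

Thread safety: when multiple inference requests share one Engine instance, state management becomes complicated.
Wasted resources: every inference has to re-prepare data structures that never actually change.
Hard to extend: supporting concurrent inference across multiple Contexts does not fit the existing architecture.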

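TensorRT's answer is to split inference into two phases:

Build-Time: parse the model, optimize the graph, plan memory, and preload weights. The product is an immutable ICudaEngine.
Run-Time: create an IExecutionContext from the Engine; each Context holds its own intermediate tensors and state.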

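This separation brings several benefits:

One Engine can be shared by multiple Contexts, which saves memory.
Contexts are independent of one another, so concurrent inference comes naturally.
The build cost is paid only once; subsequent inference runs do not repeat it.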

2. The Design Philosophy Behind TensorRT's ICudaEngine and IExecutionContext

In TensorRT:

// Build-Time: build the Engine (slow, but done only once)
IBuilder* builder = createInferBuilder(logger);
INetworkDefinition* network = builder->createNetworkV2(...);
// ... add layers ...
ICudaEngine* engine = builder->buildEngineWithConfig(network, config);

// Run-Time: create Contexts (lightweight; several can coexist)
IExecutionContext* context1 = engine->createExecutionContext();
IExecutionContext* context2 = engine->createExecutionContext();

// Concurrent inference
std::thread t1([&]{ context1->enqueueV3(stream1); });
std::thread t2([&]{ context2->enqueueV3(stream2); });
t1.join();  // Join before the std::thread objects are destroyed
t2.join();

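The core idea:

ICudaEngine is immutable and contains the optimized network structure, the weights, and the memory layout.
IExecutionContext is mutable and contains the intermediate activations and the bound input/output buffers.

Mini-Infer borrows this design and splits the original Engine into InferencePlan and ExecutionContext.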
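To make the split concrete, here is a minimal, self-contained sketch of the pattern. ToyPlan, ToyContext, and run_two_contexts are hypothetical stand-ins invented for illustration, not Mini-Infer's actual classes: the plan is read-only after construction, and each context owns its own scratch buffer, so two threads can run at the same time without locking.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for InferencePlan: read-only after construction.
struct ToyPlan {
    explicit ToyPlan(std::vector<float> w) : weights(std::move(w)) {}
    const std::vector<float> weights;  // Shared by every context, never mutated
};

// Hypothetical stand-in for ExecutionContext: owns its own scratch state.
class ToyContext {
public:
    explicit ToyContext(std::shared_ptr<const ToyPlan> plan)
        : plan_(std::move(plan)), scratch_(plan_->weights.size()) {}

    // Computes sum(weights[i] * x), writing only to per-context scratch memory.
    float run(float x) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < scratch_.size(); ++i) {
            scratch_[i] = plan_->weights[i] * x;
            sum += scratch_[i];
        }
        return sum;
    }

private:
    std::shared_ptr<const ToyPlan> plan_;  // Declared first, so initialized first
    std::vector<float> scratch_;
};

// Runs two contexts concurrently against one shared plan.
inline std::vector<float> run_two_contexts() {
    auto plan = std::make_shared<const ToyPlan>(std::vector<float>{1.0f, 2.0f, 3.0f});
    ToyContext a(plan), b(plan);
    std::vector<float> results(2);
    std::thread t1([&] { results[0] = a.run(1.0f); });  // Each thread writes a distinct slot
    std::thread t2([&] { results[1] = b.run(2.0f); });
    t1.join();
    t2.join();
    return results;
}
```

The only shared object is reached through a pointer-to-const, which is what makes the lock-free sharing safe in this sketch.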

3. Core Design of InferencePlan

A. The EngineConfig Structure

EngineConfig is the configuration blueprint used to build an InferencePlan:

// mini_infer/runtime/inference_plan.h

struct EngineConfig {
    core::DeviceType device_type{core::DeviceType::CPU};
    int32_t device_id{0};
    bool enable_profiling{false};
    bool enable_graph_optimization{true};           // Enable graph optimization
    bool enable_memory_planning{true};              // Enable memory planning
    size_t memory_alignment{256};                   // Memory alignment (bytes)
    size_t max_workspace_size{1024 * 1024 * 1024};  // 1GB

    // Dynamic-shape support
    bool enable_dynamic_shapes{false};
    std::shared_ptr<OptimizationProfile> optimization_profile;
};

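Key configuration options:

Option	Purpose	Default
`device_type`	Target device (CPU/CUDA)	CPU
`enable_graph_optimization`	Whether to enable optimizations such as operator fusion	true
`enable_memory_planning`	Whether to enable static memory planning	true
`memory_alignment`	Memory alignment in bytes	256
`enable_dynamic_shapes`	Whether dynamic shapes are supported	false
`optimization_profile`	min/opt/max ranges for dynamic shapes	nullptr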
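The post does not show how memory_alignment is consumed. A common convention, assumed here, is to round every buffer size or offset up to the next multiple of the alignment. The helpers below (is_power_of_two and align_up, both hypothetical names) sketch that arithmetic for a power-of-two alignment such as the default 256.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when n is a non-zero power of two.
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Rounds size up to the next multiple of alignment.
// Precondition (assumed, not checked here): alignment is a power of two.
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Worked example with the default alignment of 256 bytes.
static_assert(is_power_of_two(256), "256 is a valid alignment");
static_assert(align_up(1000, 256) == 1024, "1000 bytes round up to 1024");
static_assert(align_up(1024, 256) == 1024, "aligned sizes are unchanged");
```

With this convention, a 1000-byte tensor occupies a 1024-byte slot, so a larger alignment trades a little padding for well-aligned accesses.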

B. The InferencePlan Class Definition

// mini_infer/runtime/inference_plan.h

class InferencePlan : public std::enable_shared_from_this<InferencePlan> {
public:
    explicit InferencePlan(const EngineConfig& config);

    // Build the Plan (the core of Build-Time)
    core::Status build(std::shared_ptr<graph::Graph> graph);

    // Create an execution context (the entry point to Run-Time)
    std::shared_ptr<ExecutionContext> create_execution_context() const;

    // Run inference
    core::Status execute(ExecutionContext* ctx) const;

    // Query interface
    std::vector<std::string> get_input_names() const;
    std::vector<std::string> get_output_names() const;
    const MemoryPlan& get_memory_plan() const;
    const EngineConfig& config() const;

private:
    EngineConfig config_;
    std::shared_ptr<graph::Graph> graph_;
    std::vector<std::shared_ptr<graph::Node>> sorted_nodes_;
    MemoryPlan memory_plan_;
    std::vector<InputBinding> input_bindings_;

    // TensorRT-style: GPU weights are preloaded at build time
    std::unordered_map<std::shared_ptr<const core::Tensor>,
                       std::shared_ptr<core::Tensor>,
                       TensorPtrHash, TensorPtrEqual> gpu_weight_cache_;
};

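Design notes:

Inherits from enable_shared_from_this: ExecutionContext needs to hold a weak reference to its InferencePlan. Inheriting from enable_shared_from_this lets the Plan hand out a reference to itself from inside a member function, so the Plan's lifetime is managed correctly.
gpu_weight_cache_: a TensorRT-style preloaded weight cache. Weights are copied from CPU to GPU at Build-Time and used directly at Run-Time, which avoids a copy on every inference.
sorted_nodes_: the topologically sorted node list, which guarantees a correct execution order.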
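The weak back-reference can be sketched as follows. This is a minimal illustration with hypothetical ToyPlan and ToyContext types; the real ExecutionContext constructor is not shown in this post, so its signature here is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class ToyPlan;  // Forward declaration; std::weak_ptr accepts an incomplete type

// Hypothetical context that observes its plan without owning it.
class ToyContext {
public:
    explicit ToyContext(std::weak_ptr<const ToyPlan> plan) : plan_(std::move(plan)) {}
    bool plan_alive() const { return !plan_.expired(); }

private:
    std::weak_ptr<const ToyPlan> plan_;
};

class ToyPlan : public std::enable_shared_from_this<ToyPlan> {
public:
    std::shared_ptr<ToyContext> create_execution_context() const {
        // In a const member function, shared_from_this() yields
        // std::shared_ptr<const ToyPlan>, which converts to the weak_ptr.
        return std::make_shared<ToyContext>(shared_from_this());
    }
};
```

One caveat of this pattern: shared_from_this() is only valid when the object is already owned by a std::shared_ptr, so the plan should be created with std::make_shared rather than on the stack.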

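C. Immutability Guarantees and Thread Safety

None of InferencePlan's members are modified once build() has completed. This means:

Multiple ExecutionContexts can safely share the same InferencePlan.
Reads of the Plan need no locking.
There is a single copy of the weight data, shared by all Contexts.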

4. The Build Flow in Detail

The build() method is the heart of Build-Time. It runs five steps:

// mini_infer/runtime/inference_plan.cpp

core::Status InferencePlan::build(std::shared_ptr<graph::Graph> graph) {
    graph_ = graph;

    // Step 1: graph optimization (operator fusion, constant folding, etc.)
    if (config_.enable_graph_optimization) {
        auto status = optimize_graph();
        // ...
    }

    // Step 2: topological sort
    auto status = graph_->checked_topological_sort(sorted_nodes_);
    // ...

    // Step 3: shape inference
    if (config_.enable_dynamic_shapes && config_.optimization_profile) {
        status = infer_shapes_with_profile();
    } else {
        status = infer_shapes();
    }

    // Step 3.5: update tensor metadata
    status = update_tensor_properties();

    // Step 4: memory planning
    if (config_.enable_memory_planning) {
        status = plan_memory();
    }

    // Step 5: preload weights to the GPU
    if (config_.device_type == core::DeviceType::CUDA) {
        status = preload_weights_to_gpu();
    }

    return core::Status::SUCCESS;
}

Let's go through the steps one by one.

Step 1: Graph Optimization (optimize_graph)

core::Status InferencePlan::optimize_graph() {
    auto optimizer = graph::GraphOptimizer::create_default();
    optimizer.set_verbose(config_.enable_profiling);

    auto status = optimizer.optimize(graph_.get());
    optimization_stats_ = optimizer.get_statistics();

    MI_LOG_INFO("[InferencePlan] Graph optimization completed: " +
                std::to_string(optimization_stats_.total_modifications) + " modification(s)");

    return status;
}

This calls the GraphOptimizer we implemented in Blog 20-21, which performs operator fusion (for example Conv+ReLU), constant folding, and other optimizations.
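As a reminder of what a fusion pass does, here is a deliberately simplified sketch. It works on a flat list of op names rather than a real graph, and ToyFusionResult and toy_fuse_conv_relu are hypothetical names; the actual GraphOptimizer from Blog 20-21 operates on graph::Graph and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of the toy pass: the rewritten op sequence plus a modification counter.
struct ToyFusionResult {
    std::vector<std::string> ops;
    std::size_t total_modifications = 0;
};

// Replaces every adjacent ("Conv", "ReLU") pair with a single "ConvReLU" op.
inline ToyFusionResult toy_fuse_conv_relu(const std::vector<std::string>& ops) {
    ToyFusionResult result;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (ops[i] == "Conv" && i + 1 < ops.size() && ops[i + 1] == "ReLU") {
            result.ops.push_back("ConvReLU");
            ++result.total_modifications;
            ++i;  // Skip the ReLU that was folded into the fused op
        } else {
            result.ops.push_back(ops[i]);
        }
    }
    return result;
}
```

The counter plays the same role as total_modifications in the log line above: it reports how many rewrites the pass applied.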

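Step 2: Topological Sort (checked_topological_sort)

auto status = graph_->checked_topological_sort(sorted_nodes_);

Topological sorting guarantees that nodes execute in dependency order. checked_topological_sort also validates the graph (no cycles, no isolated nodes, and so on).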
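The body of checked_topological_sort is not shown in this post. One standard way to get both an ordering and cycle detection is Kahn's algorithm, sketched below on plain node indices. toy_topological_sort is a hypothetical helper, and it checks only for cycles, not for isolated nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// An edge (u, v) means node u must run before node v.
// Returns true and fills `order` on success; returns false if the graph has a cycle.
inline bool toy_topological_sort(
        std::size_t num_nodes,
        const std::vector<std::pair<std::size_t, std::size_t>>& edges,
        std::vector<std::size_t>& order) {
    std::vector<std::vector<std::size_t>> successors(num_nodes);
    std::vector<std::size_t> in_degree(num_nodes, 0);
    for (const auto& e : edges) {
        successors[e.first].push_back(e.second);
        ++in_degree[e.second];
    }

    // Start from every node that has no unmet dependency.
    std::queue<std::size_t> ready;
    for (std::size_t n = 0; n < num_nodes; ++n) {
        if (in_degree[n] == 0) ready.push(n);
    }

    order.clear();
    while (!ready.empty()) {
        const std::size_t n = ready.front();
        ready.pop();
        order.push_back(n);
        for (std::size_t s : successors[n]) {
            if (--in_degree[s] == 0) ready.push(s);
        }
    }

    // Nodes on a cycle never reach in-degree zero, so they are missing from `order`.
    return order.size() == num_nodes;
}
```

Any node left out of the result sits on a cycle, which is why comparing the output size with the node count is enough to validate acyclicity.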

Step 3: Shape Inference (infer_shapes)

Shape inference is a key Build-Time step. It walks every node and computes the output shapes from the input shapes:

core::Status InferencePlan::infer_shapes() {
    for (const auto& node : sorted_nodes_) {
        // 1. Collect the input shapes
        std::vector<core::Shape> input_shapes;
        // ... gathered from upstream nodes and weight tensors ...

        // 2. Create a Plugin to perform shape inference
        auto plugin = operators::PluginRegistry::instance().create_plugin(
            node->type(), config_.device_type);

        // 3. Call the Plugin's shape inference
        std::vector<core::Shape> output_shapes;
        auto status = plugin->infer_output_metadata(input_shapes, input_dtypes,
                                                    output_shapes, output_dtypes);

        // 4. Update the node's output tensors
        for (size_t i = 0; i < output_shapes.size(); ++i) {
            output_tensors[i]->set_shape_metadata(output_shapes[i]);
        }

        // 5. Cache the Plugin for use at execution time
        node->get_operator()->set_cached_plugin(std::move(plugin));
    }
    return core::Status::SUCCESS;
}

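Note: this uses the new Plugin architecture (covered in detail in Blog 32-35). A Plugin handles both shape inference and execution, which avoids the complexity that came from keeping Operator and Kernel separate in the old architecture.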
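To show the kind of computation a Plugin performs in infer_output_metadata, here is a sketch of shape inference for a 2-D convolution in NCHW layout. ToyConvParams and both functions are hypothetical and use plain std::vector instead of core::Shape; the formula itself is the standard convolution output-size formula.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical parameter bundle for a 2-D convolution (same value for H and W).
struct ToyConvParams {
    std::int64_t out_channels;
    std::int64_t kernel;
    std::int64_t stride;
    std::int64_t padding;
    std::int64_t dilation;
};

// Output extent along one spatial axis:
//   out = floor((in + 2*padding - dilation*(kernel-1) - 1) / stride) + 1
// Integer division matches floor here as long as the numerator is non-negative.
inline std::int64_t toy_conv_out_dim(std::int64_t in, const ToyConvParams& p) {
    return (in + 2 * p.padding - p.dilation * (p.kernel - 1) - 1) / p.stride + 1;
}

// Maps an NCHW input shape to the NCHW output shape; returns {} on a rank mismatch.
inline std::vector<std::int64_t> toy_infer_conv2d_shape(
        const std::vector<std::int64_t>& input, const ToyConvParams& p) {
    if (input.size() != 4) return {};
    return {input[0], p.out_channels,
            toy_conv_out_dim(input[2], p), toy_conv_out_dim(input[3], p)};
}
```

For a 1x3x224x224 input with a 7x7 kernel, stride 2, and padding 3, each spatial extent becomes (224 + 6 - 6 - 1) / 2 + 1 = 112.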

Step 3.5: Updating Tensor Metadata (update_tensor_properties)

This step verifies that every tensor's shape, data type, and size have been set correctly:

core::Status InferencePlan::update_tensor_properties() {
    for (const auto& node : sorted_nodes_) {
        for (auto& tensor : node->output_tensors()) {
            const auto& shape = tensor->shape();

            // Check that the shape is valid
            if (shape.ndim() == 0) {
                return core::Status::ERROR_RUNTIME;
            }

            // Check whether any dynamic dimension remains
            bool has_dynamic_dim = false;
            for (size_t d = 0; d < shape.ndim(); ++d) {
                if (shape[d] < 0) has_dynamic_dim = true;
            }

            // A dynamic dimension with dynamic shapes disabled (no Profile) is an error
            if (has_dynamic_dim && !config_.enable_dynamic_shapes) {
                return core::Status::ERROR_INVALID_ARGUMENT;
            }
        }
    }
    return core::Status::SUCCESS;
}

Step 4: Memory Planning (plan_memory)

core::Status InferencePlan::plan_memory() {
    MemoryPlanner planner;
    planner.set_enabled(true);
    planner.set_alignment(config_.memory_alignment);

    memory_plan_ = planner.plan(graph_.get());

    MI_LOG_INFO("[InferencePlan] Memory planning completed:");
    MI_LOG_INFO("[InferencePlan]   Original memory:  " +
                std::to_string(memory_plan_.original_memory / 1024.0) + " KB");
    MI_LOG_INFO("[InferencePlan]   Optimized memory: " +
                std::to_string(memory_plan_.total_memory / 1024.0) + " KB");
    MI_LOG_INFO("[InferencePlan]   Memory saving:    " +
                std::to_string(memory_plan_.memory_saving_ratio * 100.0f) + "%");

    return core::Status::SUCCESS;
}

This calls the MemoryPlanner implemented in Blog 23, which uses a Linear Scan algorithm to analyze tensor lifetimes and reuse memory.
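The idea behind lifetime-based reuse can be sketched in a few lines. This is a simplified greedy variant written for illustration, not the MemoryPlanner from Blog 23: TensorLifetime, ToyPlanResult, and toy_plan_memory are hypothetical, and lifetimes are expressed as inclusive node indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// A tensor is live from the node that produces it to its last consumer (inclusive).
struct TensorLifetime {
    std::size_t first_use;
    std::size_t last_use;
    std::size_t bytes;
};

struct ToyPlanResult {
    std::vector<std::size_t> buffer_of_tensor;  // Buffer index assigned to each tensor
    std::vector<std::size_t> buffer_bytes;      // Final size of each buffer
    std::size_t original_bytes = 0;             // Sum of all tensor sizes (no reuse)
    std::size_t total_bytes = 0;                // Sum of buffer sizes (with reuse)
};

inline ToyPlanResult toy_plan_memory(const std::vector<TensorLifetime>& tensors) {
    // Visit tensors in order of first use, as a linear scan would.
    std::vector<std::size_t> idx(tensors.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::stable_sort(idx.begin(), idx.end(), [&tensors](std::size_t a, std::size_t b) {
        return tensors[a].first_use < tensors[b].first_use;
    });

    ToyPlanResult result;
    result.buffer_of_tensor.assign(tensors.size(), 0);
    std::vector<std::size_t> busy_until;  // last_use of each buffer's current occupant

    for (std::size_t i : idx) {
        const TensorLifetime& t = tensors[i];
        result.original_bytes += t.bytes;

        // Find the first buffer whose occupant died strictly before this tensor is born.
        std::size_t b = 0;
        while (b < busy_until.size() && busy_until[b] >= t.first_use) ++b;

        if (b == busy_until.size()) {  // No free buffer: allocate a new one
            busy_until.push_back(t.last_use);
            result.buffer_bytes.push_back(t.bytes);
        } else {                       // Reuse, growing the buffer if needed
            busy_until[b] = t.last_use;
            result.buffer_bytes[b] = std::max(result.buffer_bytes[b], t.bytes);
        }
        result.buffer_of_tensor[i] = b;
    }

    result.total_bytes = std::accumulate(result.buffer_bytes.begin(),
                                         result.buffer_bytes.end(), std::size_t{0});
    return result;
}
```

For a simple chain of four layers, the input and output of one node can never share a buffer, but tensors two steps apart can, so the chain ends up ping-ponging between two buffers.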

Step 5: Preloading Weights to the GPU (preload_weights_to_gpu)

This is a TensorRT-style optimization. All weights are copied from CPU to GPU at Build-Time, so Run-Time pays no copy cost:

core::Status InferencePlan::preload_weights_to_gpu() {
#ifdef MINI_INFER_USE_CUDA
    for (const auto& node : sorted_nodes_) {
        for (const auto& tensor : node->input_tensors()) {
            if (!tensor || tensor->device() == core::DeviceType::CUDA) {
                continue;  // Skip null tensors and tensors already on the GPU
            }

            // Create the GPU tensor and copy the data over
            auto gpu_tensor = std::make_shared<core::Tensor>(
                tensor->shape(), tensor->dtype(), core::DeviceType::CUDA);

            const cudaError_t err =
                cudaMemcpy(gpu_tensor->data(), tensor->data(),
                           tensor->size_in_bytes(), cudaMemcpyHostToDevice);
            if (err != cudaSuccess) {
                return core::Status::ERROR_RUNTIME;  // Do not cache a failed copy
            }

            // Cache the GPU tensor
            gpu_weight_cache_[tensor] = gpu_tensor;
        }
    }
#endif
    return core::Status::SUCCESS;
}

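Key design points:

A std::unordered_map caches the mapping from each CPU tensor to its GPU tensor.
TensorPtrHash and TensorPtrEqual hash and compare by pointer, which avoids the cost of comparing tensor contents.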
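The definitions of TensorPtrHash and TensorPtrEqual are not shown in this post. A minimal sketch of pointer-identity functors, using hypothetical ToyTensor, ToyPtrHash, and ToyPtrEqual types, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for core::Tensor.
struct ToyTensor {
    std::vector<float> data;
};

// Hashes the address held by the shared_ptr, never the tensor contents.
struct ToyPtrHash {
    std::size_t operator()(const std::shared_ptr<const ToyTensor>& p) const noexcept {
        return std::hash<const ToyTensor*>{}(p.get());
    }
};

// Two keys are equal only when they point at the same object.
struct ToyPtrEqual {
    bool operator()(const std::shared_ptr<const ToyTensor>& a,
                    const std::shared_ptr<const ToyTensor>& b) const noexcept {
        return a.get() == b.get();
    }
};

// Same shape as gpu_weight_cache_: CPU tensor -> device copy.
using ToyWeightCache = std::unordered_map<std::shared_ptr<const ToyTensor>,
                                          std::shared_ptr<ToyTensor>,
                                          ToyPtrHash, ToyPtrEqual>;
```

Note that std::hash and operator== for std::shared_ptr already work on the stored pointer, so custom functors like these mainly make the identity semantics explicit. Either way, two tensors with identical contents but different addresses get separate cache entries.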

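5. Compared with Engine: Redrawing the Responsibility Boundaries

Responsibility	Old Engine	New architecture
Owning the graph	Engine	InferencePlan
Graph optimization	Engine	InferencePlan::build()
Memory planning	Engine	InferencePlan::build()
Weight management	Engine	InferencePlan
Intermediate tensors	Engine	ExecutionContext
Inference execution	Engine	InferencePlan::execute() + ExecutionContext
Input/output binding	Engine	ExecutionContext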

In the new architecture, the Engine class becomes a facade (Facade) that only coordinates InferencePlan and ExecutionContext:

class Engine {
public:
    void build(std::shared_ptr<graph::Graph> graph) {
        plan_ = std::make_shared<InferencePlan>(config_);
        plan_->build(graph);
    }

    std::shared_ptr<ExecutionContext> create_context() {
        return plan_->create_execution_context();
    }

    void execute(ExecutionContext* ctx) {
        plan_->execute(ctx);
    }

private:
    EngineConfig config_;  // Configuration forwarded to the InferencePlan in build()
    std::shared_ptr<InferencePlan> plan_;
};

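6. Summary

In this post we completed the first part of the runtime architecture refactoring:

Separating Build-Time from Run-Time: borrowing TensorRT's design philosophy.
InferencePlan as the immutable build product: it holds the graph, the weights, and the memory plan.
A five-step build flow: graph optimization → topological sort → shape inference → memory planning → weight preloading.

In the next post, we will dig into ExecutionContext and see how it manages run-time state, achieves zero-copy execution, and supports dynamic-shape inference.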