Mini-Infer (28): Core Data Structure Optimization: Separating Storage from Tensor

1. Background: Why Separate Tensor from Storage?

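In the early Mini-Infer implementation, the Tensor class directly held the data pointer and the allocator. That design worked well in simple scenarios, but problems surfaced as features were added: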

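A. The need for memory pool reuse

Static memory planning (Blog 23) requires multiple Tensors to share one pre-allocated block of memory. If each Tensor owns its data pointer directly, there is no clean way to express that sharing.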

┌─────────────────────────────────────────────────────┐
│                  Shared Memory Pool                 │
├─────────┬─────────┬─────────┬─────────┬─────────────┤
│ Tensor A│ Tensor B│ Tensor C│ Tensor D│   ...       │
│ offset=0│offset=1K│offset=3K│offset=4K│             │
└─────────┴─────────┴─────────┴─────────┴─────────────┘

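B. The need for multi-device support

The same Tensor may need to migrate between CPU and GPU. If device information is coupled to the data pointer, the migration logic becomes complicated.

C. The need for views that share storage

view() creates a new Tensor that shares the underlying data. That requires reference counting and shared ownership.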

auto tensor = Tensor::create({2, 3, 4}, DataType::FLOAT32);
auto view = tensor->view({6, 4});  // shares the data, different shape

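The solution is to split Tensor into two layers:

Storage: manages the raw memory block (data + capacity + device).
Tensor: manages metadata (shape + data type + strides + offset) and holds a reference to a Storage.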

2. Storage Class Design

A. Class definition

// mini_infer/core/tensor.h

class Storage {
public:
    Storage() = default;
    Storage(size_t capacity_bytes, DeviceType device,
            size_t alignment = kDefaultAlignment);
    Storage(const std::shared_ptr<void>& external, size_t capacity_bytes,
            DeviceType device = DeviceType::CPU);

    void reset(size_t capacity_bytes, DeviceType device,
               size_t alignment = kDefaultAlignment);
    void set_external(const std::shared_ptr<void>& external,
                      size_t capacity_bytes, DeviceType device);

    void* data() const { return buffer_.get(); }
    size_t capacity() const { return capacity_; }
    DeviceType device() const { return device_; }
    bool empty() const { return buffer_ == nullptr; }

private:
    std::shared_ptr<void> buffer_;   // type-erased buffer
    size_t capacity_{0};              // capacity in bytes
    DeviceType device_{DeviceType::CPU};
};

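B. shared_ptr as type-erased storage

Using std::shared_ptr<void> rather than std::shared_ptr<float> or another concrete type has several benefits:

Type-agnostic: it can hold data of any type (float, int, half, and so on).
Custom deleters: each buffer can specify how it is released (CPU free, CUDA free, and so on).
Reference counting: lifetime is managed automatically, so multiple Tensors can share one buffer.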

// CPU memory: the deleter returns it to the CPU allocator
buffer_.reset(ptr, [allocator](void* p) {
    allocator->deallocate(p);
});

// CUDA memory: the deleter returns it to the CUDA allocator
buffer_.reset(cuda_ptr, [cuda_allocator](void* p) {
    cuda_allocator->deallocate(p);
});
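
The three properties above can be observed in a small, self-contained sketch. It is only an illustration: std::malloc and std::free stand in for Mini-Infer's allocator, and make_buffer and g_release_count are hypothetical names that do not appear in the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical counter, present only so the release can be observed.
static int g_release_count = 0;

// Hypothetical helper: wraps a raw allocation in a type-erased shared_ptr.
// std::malloc/std::free are stand-ins for the allocator used in Mini-Infer.
std::shared_ptr<void> make_buffer(std::size_t bytes) {
    assert(bytes > 0);
    void* raw = std::malloc(bytes);
    if (raw == nullptr) {
        return nullptr;
    }
    // The deleter lives in the control block, so every copy of the
    // shared_ptr releases the memory the same way, exactly once.
    return std::shared_ptr<void>(raw, [](void* p) {
        std::free(p);
        ++g_release_count;
    });
}
```

Copying the returned pointer only bumps the reference count; the deleter runs when the last copy goes away, regardless of which element type the bytes are later cast to.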

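C. capacity vs. size_in_bytes

capacity: the number of bytes the Storage actually allocated (possibly more than needed).
size_in_bytes: the number of bytes the Tensor's current shape requires.

capacity >= size_in_bytes

This lets a Tensor change shape without reallocating, as long as the new shape fits within the capacity.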
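
As a rough illustration of that rule, the check below decides whether a new shape can reuse existing storage. It is a sketch under assumptions: the shape is modeled as a plain std::vector<int64_t> instead of Mini-Infer's Shape class, and numel and fits_in_capacity are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of elements described by a shape.
// An empty shape is treated as a scalar with one element.
std::size_t numel(const std::vector<std::int64_t>& shape) {
    std::size_t n = 1;
    for (std::int64_t d : shape) {
        assert(d >= 0);
        n *= static_cast<std::size_t>(d);
    }
    return n;
}

// Hypothetical helper: true when the shape fits in the existing storage,
// i.e. size_in_bytes <= capacity, so no reallocation is needed.
bool fits_in_capacity(const std::vector<std::int64_t>& shape,
                      std::size_t element_size,
                      std::size_t capacity_bytes) {
    return numel(shape) * element_size <= capacity_bytes;
}
```

With a 256-byte capacity and 4-byte floats, a {2, 3, 4} tensor (96 bytes) and an {8, 8} tensor (256 bytes) both fit, while {8, 9} (288 bytes) would need a larger buffer.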

D. Implementation of reset

// mini_infer/core/tensor.cpp

void Storage::reset(size_t capacity_bytes, DeviceType device, size_t alignment) {
    if (capacity_bytes == 0) {
        buffer_.reset();
        capacity_ = 0;
        device_ = device;
        return;
    }

    device_ = device;
    // Round the requested size up to a multiple of the alignment
    size_t aligned_bytes = ((capacity_bytes + alignment - 1) / alignment) * alignment;

    // Pick the allocator that matches the device type
    auto allocator_type = device == DeviceType::CPU
                              ? AllocatorFactory::AllocatorType::CPU
                              : AllocatorFactory::AllocatorType::CUDA;
    auto allocator = AllocatorFactory::get_allocator(allocator_type);

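    void* ptr = allocator->allocate(aligned_bytes, alignment);
    if (ptr == nullptr) {
        // Allocation failed: leave the Storage empty instead of
        // passing a null pointer to memset below.
        buffer_.reset();
        capacity_ = 0;
        return;
    }

    // Install a custom deleter that returns the memory to the same allocator
    buffer_.reset(ptr, [allocator](void* p) {
        allocator->deallocate(p);
    });

    // Zero-initialize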
#ifdef MINI_INFER_USE_CUDA
    if (device == DeviceType::CUDA) {
        cudaMemset(ptr, 0, aligned_bytes);
    } else
#endif
    {
        std::memset(ptr, 0, aligned_bytes);
    }

    capacity_ = aligned_bytes;
}
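
For readers who want to try the CPU path without the rest of the project, here is a minimal stand-alone sketch of the same steps (round up, allocate aligned, zero-fill, attach a deleter). It does not use AllocatorFactory; it relies on the C++17 aligned operator new instead, and make_aligned_buffer and is_aligned are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

// Hypothetical helper: allocates a zero-filled, aligned CPU buffer.
// The alignment must be a power of two (required by aligned operator new).
std::shared_ptr<void> make_aligned_buffer(std::size_t bytes, std::size_t alignment) {
    assert(bytes > 0);
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::size_t aligned_bytes =
        ((bytes + alignment - 1) / alignment) * alignment;

    // C++17 aligned allocation; throws std::bad_alloc on failure.
    void* raw = ::operator new(aligned_bytes, std::align_val_t(alignment));
    std::memset(raw, 0, aligned_bytes);

    // The deleter must hand the same alignment back to operator delete.
    return std::shared_ptr<void>(raw, [alignment](void* p) {
        ::operator delete(p, std::align_val_t(alignment));
    });
}

// Hypothetical helper: checks that an address is a multiple of alignment.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Capturing the alignment in the deleter plays the same role as capturing the allocator in Storage::reset: the information needed to free the block travels with the buffer.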

3. The New Tensor Interface

A. Class definition

// mini_infer/core/tensor.h

class Tensor {
public:
    Tensor() = default;
    Tensor(const Shape& shape, DataType dtype,
           DeviceType device = DeviceType::CPU,
           size_t alignment = kDefaultAlignment);

    // Copying is disabled; moving is allowed
    Tensor(const Tensor&) = delete;
    Tensor& operator=(const Tensor&) = delete;
    Tensor(Tensor&&) noexcept = default;
    Tensor& operator=(Tensor&&) noexcept = default;

    // Metadata accessors
    const Shape& shape() const { return shape_; }
    DataType dtype() const { return dtype_; }
    DeviceType device() const { return device_; }
    const std::vector<int64_t>& strides() const { return strides_; }
    size_t storage_offset() const { return storage_offset_; }

    // Data access
    void* data();
    const void* data() const;
    size_t size_in_bytes() const;
    size_t capacity() const;
    bool empty() const;

    // Shape operations
    void reshape(const Shape& new_shape);
    void resize(const Shape& new_shape);
    std::shared_ptr<Tensor> view(const Shape& new_shape) const;

    // Binding to external data
    void bind_external_data(const std::shared_ptr<void>& data,
                            size_t capacity_bytes, DeviceType device);
    bool bind_external_data_with_offset(const std::shared_ptr<void>& data,
                                        size_t capacity_bytes,
                                        size_t offset_bytes, DeviceType device);

    // Metadata updates (no memory is allocated)
    void set_shape_metadata(const Shape& shape);
    void set_dtype(DataType dtype);

private:
    Shape shape_;
    DataType dtype_{DataType::FLOAT32};
    std::shared_ptr<Storage> storage_;
    size_t storage_offset_{0};
    std::vector<int64_t> strides_;
    DeviceType device_{DeviceType::CPU};
    size_t alignment_{kDefaultAlignment};
};

B. bind_external_data / bind_external_data_with_offset

These two methods are the key to memory pool reuse:

// mini_infer/core/tensor.cpp

void Tensor::bind_external_data(const std::shared_ptr<void>& data,
                                size_t capacity_bytes, DeviceType device) {
    if (!storage_) {
        storage_ = std::make_shared<Storage>(data, capacity_bytes, device);
    } else {
        storage_->set_external(data, capacity_bytes, device);
    }
    storage_offset_ = 0;
    device_ = device;
    compute_contiguous_strides();
}

bool Tensor::bind_external_data_with_offset(const std::shared_ptr<void>& data,
                                            size_t capacity_bytes,
                                            size_t offset_bytes,
                                            DeviceType device) {
    const size_t required = size_in_bytes();
    // Bounds check: the tensor must fit between the offset and the end of the buffer
    if (offset_bytes + required > capacity_bytes) {
        return false;
    }

    if (!storage_) {
        storage_ = std::make_shared<Storage>(data, capacity_bytes, device);
    } else {
        storage_->set_external(data, capacity_bytes, device);
    }
    storage_offset_ = offset_bytes;
    device_ = device;
    compute_contiguous_strides();
    return true;
}

Usage:

// The memory planner allocates one large block
auto shared_buffer = allocate_shared_buffer(total_size);

// Several Tensors bind to different offsets in it
tensor_a->bind_external_data_with_offset(shared_buffer, total_size, 0, device);
tensor_b->bind_external_data_with_offset(shared_buffer, total_size, 1024, device);
tensor_c->bind_external_data_with_offset(shared_buffer, total_size, 3072, device);
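
The binding pattern can be reproduced in isolation with a reduced model. The sketch below is not the Mini-Infer Tensor: PoolSlice and make_pool are hypothetical, std::calloc and std::free stand in for the real allocator, and the bounds check is written as two comparisons so that the sum of offset and size cannot wrap around, which is a small departure from the version shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical stand-in for Tensor: only the fields needed for binding.
struct PoolSlice {
    std::shared_ptr<void> buffer;  // shared ownership of the whole pool
    std::size_t offset = 0;        // start of this slice, in bytes
    std::size_t bytes = 0;         // bytes this slice needs

    // Same intent as the bounds check in bind_external_data_with_offset,
    // arranged so that offset_bytes + bytes is never computed.
    bool bind(const std::shared_ptr<void>& pool, std::size_t pool_bytes,
              std::size_t offset_bytes) {
        if (bytes > pool_bytes || offset_bytes > pool_bytes - bytes) {
            return false;
        }
        buffer = pool;
        offset = offset_bytes;
        return true;
    }

    std::uint8_t* data() const {
        return buffer ? static_cast<std::uint8_t*>(buffer.get()) + offset
                      : nullptr;
    }
};

// Hypothetical helper: a zero-filled pool owned by a shared_ptr.
std::shared_ptr<void> make_pool(std::size_t bytes) {
    assert(bytes > 0);
    void* raw = std::calloc(1, bytes);
    return raw ? std::shared_ptr<void>(raw, [](void* p) { std::free(p); })
               : std::shared_ptr<void>();
}
```

Binding three slices of 1024, 2048, and 1024 bytes at offsets 0, 1024, and 3072 in a 4096-byte pool succeeds and leaves the pool with four owners; a 1024-byte slice at offset 3073 is rejected because it would run past the end.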

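C. set_shape_metadata vs. resize

set_shape_metadata: updates only the shape metadata and allocates nothing. Used for dynamic-shape scenarios.
resize: updates the shape and ensures there is enough memory. May trigger a reallocation.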

void Tensor::set_shape_metadata(const Shape& shape) {
    shape_ = shape;
    compute_contiguous_strides();
    // No memory is allocated here.
}

void Tensor::resize(const Shape& new_shape) {
    size_t new_size = static_cast<size_t>(new_shape.numel()) * element_size();
    ensure_contiguous_storage(new_size);  // may reallocate
    shape_ = new_shape;
    compute_contiguous_strides();
}
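
Both functions end by calling compute_contiguous_strides(), whose body is not shown in this post. A plausible version, written as a free function, is sketched below. It assumes row-major layout with strides measured in elements, not bytes; the actual Mini-Infer implementation may differ, and contiguous_strides is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical free-function version of compute_contiguous_strides().
// Assumption: row-major layout, strides in elements, so the last
// dimension has stride 1 and each earlier stride is the product of
// all later dimension sizes.
std::vector<std::int64_t> contiguous_strides(const std::vector<std::int64_t>& shape) {
    std::vector<std::int64_t> strides(shape.size());
    std::int64_t running = 1;
    // Walk from the innermost dimension outward.
    for (std::size_t i = shape.size(); i-- > 0;) {
        assert(shape[i] >= 0);
        strides[i] = running;
        running *= shape[i];
    }
    return strides;
}
```

For the shape {2, 3, 4} this yields {12, 4, 1}: moving one step along the first axis skips 3 * 4 = 12 elements.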

D. The role of storage_offset_

storage_offset_ is the byte offset within the Storage at which the Tensor's data begins.

void* Tensor::data() {
    if (!storage_ || storage_->empty()) {
        return nullptr;
    }
    auto base = static_cast<uint8_t*>(storage_->data());
    return base ? base + storage_offset_ : nullptr;
}

This lets multiple Tensors share one Storage while pointing at different positions:

Storage: [████████████████████████████████████████]
          ^           ^           ^
          |           |           |
       Tensor A    Tensor B    Tensor C
       offset=0    offset=1K   offset=3K

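4. Memory Alignment (kDefaultAlignment)

A. Why 256 bytes?

// mini_infer/core/types.h
constexpr size_t kDefaultAlignment = 256;

Reasons for 256-byte alignment:

SIMD: aligned AVX-512 loads and stores require 64-byte alignment, and 256 is a multiple of 64.
GPU memory access: CUDA recommends 256-byte alignment for best performance.
Cache-line friendly: most CPU cache lines are 64 bytes, and 256 is a multiple of 64.
TensorRT compatibility: TensorRT uses 256-byte alignment by default.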

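B. Computing the aligned size

size_t aligned_bytes = ((capacity_bytes + alignment - 1) / alignment) * alignment;

For example:

1000 bytes needed, aligned to 256 → 1024 bytes allocated
256 bytes needed, aligned to 256 → 256 bytes allocated
257 bytes needed, aligned to 256 → 512 bytes allocated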
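
The three cases can be checked at compile time. The sketch below wraps the formula in a constexpr function and adds the bit-mask variant that is valid only for power-of-two alignments; align_up, align_up_pow2, and kAlign are hypothetical names used for illustration, with kAlign mirroring kDefaultAlignment.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: same value as kDefaultAlignment in the post.
constexpr std::size_t kAlign = 256;

// Rounds bytes up to the next multiple of alignment (alignment must be > 0).
constexpr std::size_t align_up(std::size_t bytes, std::size_t alignment) {
    return ((bytes + alignment - 1) / alignment) * alignment;
}

// Equivalent form for power-of-two alignments only: clear the low bits.
constexpr std::size_t align_up_pow2(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

static_assert(align_up(1000, kAlign) == 1024, "1000 rounds up to 1024");
static_assert(align_up(256, kAlign) == 256, "exact multiples are unchanged");
static_assert(align_up(257, kAlign) == 512, "257 rounds up to 512");
static_assert(align_up_pow2(1000, kAlign) == align_up(1000, kAlign),
              "both forms agree for a power-of-two alignment");
```

The division form works for any positive alignment, which is why it is a reasonable default when the alignment arrives as a runtime parameter.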

5. Working with MemoryPlanner

A. Memory planning flow

1. MemoryPlanner analyzes the graph and computes an offset for each Tensor
2. ExecutionContext allocates one large shared_buffer
3. Each Tensor calls bind_external_data_with_offset to bind to its offset
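
To make step 1 concrete, here is a deliberately naive sketch of offset assignment. It simply places tensors back to back at aligned offsets and performs no lifetime analysis or reuse, so it should not be read as the MemoryPlanner algorithm from Blog 23; PoolPlan and plan_sequential are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: one offset per tensor plus the pool size.
struct PoolPlan {
    std::vector<std::size_t> offsets;  // byte offset of each tensor
    std::size_t total_bytes = 0;       // size the shared buffer must have
};

// Hypothetical bump planner: each tensor starts where the previous one
// ended, with every size rounded up to the alignment. Tensors never
// share bytes here, unlike a planner that reuses memory across lifetimes.
PoolPlan plan_sequential(const std::vector<std::size_t>& sizes,
                         std::size_t alignment) {
    assert(alignment > 0);
    PoolPlan plan;
    std::size_t cursor = 0;
    for (std::size_t bytes : sizes) {
        plan.offsets.push_back(cursor);
        cursor += ((bytes + alignment - 1) / alignment) * alignment;
    }
    plan.total_bytes = cursor;
    return plan;
}
```

For sizes of 1024, 2048, and 1024 bytes with 256-byte alignment, the plan gives offsets 0, 1024, and 3072 and a 4096-byte pool, which matches the layout in the diagram at the top of this post.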

B. Zero-copy memory reuse

// ExecutionContext::try_bind_tensor_to_pool

const size_t offset = plan.tensor_offsets[node_id];
tensor->bind_external_data_with_offset(
    shared_buffer_, shared_buffer_size_, offset, device_type);

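Key points:

No new allocation: each Tensor points directly at a location inside shared_buffer.
Shared ownership: the Tensors share the shared_ptr that owns shared_buffer.
Lifetime management: once the last owner releases it, shared_buffer is freed automatically.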

6. view() and Zero-Copy Views

std::shared_ptr<Tensor> Tensor::view(const Shape& new_shape) const {
    // The element count must stay the same
    if (new_shape.numel() != shape_.numel()) {
        return nullptr;
    }

    auto view_tensor = std::make_shared<Tensor>();
    view_tensor->shape_ = new_shape;
    view_tensor->dtype_ = dtype_;
    view_tensor->storage_ = storage_;           // shares the Storage
    view_tensor->storage_offset_ = storage_offset_;
    view_tensor->device_ = device_;
    view_tensor->alignment_ = alignment_;
    view_tensor->compute_contiguous_strides();
    return view_tensor;
}

Usage example:

auto tensor = Tensor::create({2, 3, 4}, DataType::FLOAT32);  // 24 elements
auto view1 = tensor->view({6, 4});   // 24 elements, different shape
auto view2 = tensor->view({24});     // 24 elements, one-dimensional

// All three Tensors share the same memory.
// Writing through tensor is visible through view1 and view2.
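
The sharing behavior can be verified with a reduced model that compiles on its own. MiniTensor below is a hypothetical stand-in, not the Mini-Infer class: it uses a std::vector<float> as storage, supports only float, and drops offsets and devices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for Tensor with std::vector<float> as the storage.
struct MiniTensor {
    std::vector<std::int64_t> shape;
    std::shared_ptr<std::vector<float>> storage;

    static std::int64_t numel(const std::vector<std::int64_t>& s) {
        std::int64_t n = 1;
        for (std::int64_t d : s) n *= d;
        return n;
    }

    static std::shared_ptr<MiniTensor> create(const std::vector<std::int64_t>& s) {
        assert(numel(s) >= 0);
        auto t = std::make_shared<MiniTensor>();
        t->shape = s;
        t->storage = std::make_shared<std::vector<float>>(
            static_cast<std::size_t>(numel(s)), 0.0f);
        return t;
    }

    // Same rule as Tensor::view: the element counts must match, and the
    // new object shares the storage instead of copying it.
    std::shared_ptr<MiniTensor> view(const std::vector<std::int64_t>& new_shape) const {
        if (numel(new_shape) != numel(shape)) {
            return nullptr;
        }
        auto v = std::make_shared<MiniTensor>();
        v->shape = new_shape;
        v->storage = storage;
        return v;
    }

    float* data() const { return storage->data(); }
};
```

After creating a {2, 3, 4} tensor and two views, the storage has three owners, a write through any of them is visible through the others, and a request for a {5, 5} view returns nullptr because 25 elements do not match 24.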

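7. Summary

In this post we completed the data structure optimization of the Core layer:

Storage and Tensor separation: Storage manages memory, Tensor manages metadata.
shared_ptr as type-erased storage: supports any data type and custom deleters.
External data binding: bind_external_data_with_offset enables memory pool reuse.
256-byte alignment: balances SIMD, GPU, and cache-line considerations.
Zero-copy views: view() shares the underlying Storage.

These designs lay a solid foundation for zero-copy execution at runtime and for static memory planning. In the next post we move to the CUDA backend and look at how CUDAAllocator manages GPU memory.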