Mini-SGLang 源码解析（一）：核心数据结构与设计模式

本文深入分析 Mini-SGLang 的核心数据结构设计，探讨 LLM 推理系统的关键优化技术。

环境：Mini-SGLang v0.1.0 | Python 3.10+ | PyTorch 2.0+

源码位置：python/minisgl/core.py

1. 背景：LLM 推理的性能瓶颈

大语言模型推理面临两个核心挑战：

计算密集：Transformer 的自注意力机制复杂度为 O(n²)
内存密集：KV Cache 占用大量显存，成为推理瓶颈

Mini-SGLang 通过精心设计的数据结构和调度策略，在保持代码简洁（~5000 行）的同时，实现了接近 vLLM 的性能。

2. KV Cache：从 O(n²) 到 O(n) 的优化

2.1 问题定义

在自回归生成中，每生成一个新 token，都需要计算注意力：

# 标准 Transformer 注意力
Q = x @ W_q  # Query: [batch, seq_len, d_model]
K = x @ W_k  # Key: [batch, seq_len, d_model]
V = x @ W_v  # Value: [batch, seq_len, d_model]

Attention(Q, K, V) = softmax(Q @ K^T / √d_k) @ V

问题：每次生成都要重新计算所有历史 token 的 K 和 V。

2.2 KV Cache 机制

核心观察：历史 token 的 K 和 V 在后续生成中不会改变。

# Prefill 阶段：处理完整 prompt
input_ids = [t1, t2, ..., tn]  # n 个输入 token
K_cache = [K1, K2, ..., Kn]    # 计算并缓存所有 K
V_cache = [V1, V2, ..., Vn]    # 计算并缓存所有 V

# Decode 阶段：逐个生成
for i in range(max_new_tokens):
    new_token = sample(model(last_token))

    # 只计算新 token 的 KV
    K_new = new_token @ W_k
    V_new = new_token @ W_v

    # 拼接历史缓存
    K_all = concat(K_cache, K_new)  # O(1) 操作
    V_all = concat(V_cache, V_new)

    # 注意力计算
    Q_new = new_token @ W_q
    output = Attention(Q_new, K_all, V_all)

    # 更新缓存
    K_cache.append(K_new)
    V_cache.append(V_new)

2.3 性能分析

时间复杂度：

方案	Prefill	Decode (每步)	生成 m 个 token 总计
无缓存	O(n²)	O((n+i)²)	O(n² + (n+1)² + … + (n+m)²) ≈ O(nm²)
有缓存	O(n²)	O(n+i)	O(n² + nm)

空间复杂度：

KV Cache 大小 = 2 × num_layers × num_heads × head_dim × seq_len × batch_size × sizeof(dtype)

示例（Llama-3-8B）：
= 2 × 32 × 32 × 128 × 2048 × 8 × 2 bytes
≈ 64 GB（FP16）

2.4 代码实现

Mini-SGLang 在 Req 类中追踪缓存状态：

@dataclass
class Req:
    cached_len: int    # 已缓存的 token 数量
    device_len: int    # 当前总长度（输入 + 已生成）

    @property
    def extend_len(self) -> int:
        # 需要计算 KV 的 token 数
        return self.device_len - self.cached_len

    def complete_one(self) -> None:
        # 完成一次前向传播，更新缓存状态
        self.cached_len = self.device_len
        self.device_len += 1

关键不变式：

Prefill 阶段：extend_len > 1（批量计算多个 token 的 KV）
Decode 阶段：extend_len == 1（每次只计算 1 个新 token 的 KV）

3. SamplingParams：采样策略

3.1 采样算法

LLM 生成的核心是从概率分布中采样下一个 token：

1
2
3

logits = model(input_ids)  # [vocab_size]
probs = softmax(logits / temperature)  # 温度缩放
next_token = sample(probs, top_k, top_p)

3.2 参数设计

@dataclass
class SamplingParams:
    temperature: float = 0.0  # 温度参数
    top_k: int = -1           # Top-K 采样
    top_p: float = 1.0        # Nucleus 采样
    ignore_eos: bool = False  # 是否忽略结束符
    max_tokens: int = 1024    # 最大生成长度

Temperature Scaling

# 温度对概率分布的影响
probs = softmax(logits / temperature)

# temperature → 0: 概率分布趋向 one-hot（贪心）
# temperature = 1: 保持原始分布
# temperature > 1: 概率分布变平坦（更随机）

数学原理：

$P(x_i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$

当 $T \to 0$ 时， $P(x_{\max}) \to 1$ ，其他概率趋向 0。

Top-K Sampling

def top_k_sampling(logits, k):
    # 只保留概率最高的 k 个 token
    top_k_logits, top_k_indices = torch.topk(logits, k)
    probs = softmax(top_k_logits)
    return top_k_indices[sample(probs)]

特点：固定候选集大小，简单高效。

Top-P (Nucleus) Sampling

def top_p_sampling(logits, p):
    # 动态选择候选集
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    probs = softmax(sorted_logits)
    cumsum_probs = torch.cumsum(probs, dim=-1)

    # 找到累积概率超过 p 的位置
    mask = cumsum_probs <= p
    nucleus_probs = probs[mask]
    nucleus_indices = sorted_indices[mask]

    return nucleus_indices[sample(nucleus_probs)]

优势：候选集大小自适应，概率分布陡峭时选少数词，平坦时选更多词。

Greedy Decoding 判断

1
2
3

@property
def is_greedy(self) -> bool:
    return (self.temperature <= 0.0 or self.top_k == 1) and self.top_p == 1.0

逻辑：

temperature <= 0.0：温度为 0，总是选最高概率
top_k == 1：只考虑 1 个候选
top_p == 1.0：不使用 nucleus 采样限制

4. Req：请求状态机设计

4.1 状态定义

@dataclass(eq=False)
class Req:
    input_ids: torch.Tensor      # 所有 token（输入 + 已生成）
    table_idx: int               # 页表索引（KV Cache 管理）
    cached_len: int              # 已缓存长度
    output_len: int              # 期望生成长度
    uid: int                     # 唯一标识符
    sampling_params: SamplingParams
    cache_handle: BaseCacheHandle

    # 派生状态
    device_len: int              # 当前长度
    max_device_len: int          # 最大长度

4.2 状态转换

┌─────────────┐
│   Created   │
└──────┬──────┘
       │
       ↓ extend_len = n (n > 1)
┌─────────────┐
│   Prefill   │ ← 并行计算 n 个 token 的 KV
└──────┬──────┘
       │ complete_one()
       ↓ extend_len = 1
┌─────────────┐
│   Decode    │ ← 每次计算 1 个新 token 的 KV
└──────┬──────┘
       │ remain_len == 0 or EOS
       ↓
┌─────────────┐
│  Finished   │
└─────────────┘

4.3 关键不变式

# 长度关系
assert 0 <= cached_len <= device_len <= max_device_len

# 阶段判断
if extend_len > 1:
    phase = "prefill"
elif extend_len == 1:
    phase = "decode"
else:
    assert False, "Invalid state"

# 终止条件
can_continue = (remain_len > 0) and (not reached_eos)

4.4 内存管理

def complete_one(self) -> None:
    """完成一次前向传播，更新状态"""
    self.cached_len = self.device_len  # 标记为已缓存
    self.device_len += 1               # 准备下一个 token

def append_host(self, next_token: torch.Tensor) -> None:
    """追加新生成的 token（CPU 端）"""
    self.input_ids = torch.cat([self.input_ids, next_token])

设计要点：

input_ids 在 CPU 上维护完整序列
GPU 上只保留当前 batch 的 token
通过 table_idx 索引 KV Cache

5. Batch：同构批处理设计

5.1 设计约束

核心约束：一个 Batch 中的所有请求必须处于同一阶段（Prefill 或 Decode）。

原因：

维度	Prefill	Decode
输入形状	[batch, seq_len] (seq_len > 1)	[batch, 1]
计算特征	Compute-bound	Memory-bound
优化策略	FlashAttention	FlashInfer (Paged Attention)
并行度	高（token 级并行）	低（batch 级并行）

结论：混合不同阶段会导致 GPU kernel 无法高效执行。

5.2 数据结构

@dataclass
class Batch:
    reqs: List[Req]                    # 实际请求列表
    phase: Literal["prefill", "decode"] # 阶段标识

    # Scheduler 设置
    input_ids: torch.Tensor            # 拼接后的输入 [total_tokens]
    out_loc: torch.Tensor              # 输出位置索引
    padded_reqs: List[Req]             # 填充后的请求列表

    # Attention backend 设置
    attn_metadata: BaseAttnMetadata    # 注意力元数据

5.3 Padding 策略

# 场景：3 个请求，GPU 优化 batch size = 4
reqs = [Req1, Req2, Req3]

# 添加 dummy request 填充
dummy = create_dummy_req()
padded_reqs = [Req1, Req2, Req3, dummy]

# 在计算时跳过 dummy
for i, req in enumerate(padded_reqs):
    if i < len(reqs):
        process(req)
    else:
        skip(req)  # dummy request 不参与实际计算

目的：

对齐 GPU 优化的 batch size（通常是 2 的幂次）
满足 CUDA Graph 的固定输入形状要求
提高 GPU kernel 的执行效率

5.4 Continuous Batching

传统批处理的问题：

# 传统方式：静态批处理
batch = [Req1, Req2, Req3]
while not all_finished(batch):
    forward(batch)
    # 即使 Req1 完成，也要等 Req2, Req3

问题：高。

Continuous Batching 的解决方案：

# 动态批处理
batch = [Req1, Req2, Req3]
while True:
    forward(batch)

    # 移除已完成的请求
    finished = [req for req in batch if req.is_finished()]
    batch = [req for req in batch if not req.is_finished()]

    # 立即加入新请求
    new_reqs = get_pending_requests(max_batch_size - len(batch))
    batch.extend(new_reqs)

    if not batch:
        break

优势：

GPU 利用率提升 20-30%
平均延迟降低 40-50%
吞吐量提升 2-3x

6. Context：全局状态管理

6.1 设计模式：单例 + 上下文管理器

@dataclass
class Context:
    page_size: int                    # KV Cache 页大小
    attn_backend: BaseAttnBackend     # 注意力后端
    _batch: Batch | None = None       # 当前活跃 Batch

    @contextmanager
    def forward_batch(self, batch: Batch):
        assert self._batch is None, "Nested forward_batch not allowed"
        try:
            self._batch = batch
            yield
        finally:
            self._batch = None

# 全局单例
_GLOBAL_CTX: Context | None = None

def get_global_ctx() -> Context:
    assert _GLOBAL_CTX is not None
    return _GLOBAL_CTX

6.2 使用场景

# Engine 中的使用
def forward(self, batch:tch) -> torch.Tensor:
    with self.ctx.forward_batch(batch):
        # 在此作用域内，所有层都可以访问 batch
        x = self.embed(batch.input_ids)

        for layer in self.layers:
            # layer 内部通过 get_global_ctx().batch 访问
            x = layer(x)

        return self.lm_head(x)
    # 退出作用域，自动清理 _batch

6.3 设计优势

1. 避免参数传递

# 不使用 Context（繁琐）
def layer_forward(x, batch, page_size, attn_backend):
    ...

# 使用 Context（简洁）
def layer_forward(x):
    ctx = get_global_ctx()
    batch = ctx.batch
    page_size = ctx.page_size
    ...

2. 保证串行执行

# 防止嵌套（会抛出异常）
with ctx.forward_batch(batch1):
    with ctx.forward_batch(batch2):  # AssertionError
        ...

原因：GPU 一次只能处理一个 Batch，嵌套会导致状态混乱。

3. 自动资源管理

# 即使发生异常，也会清理 _batch
with ctx.forward_batch(batch):
    raise RuntimeError("Something went wrong")
# _batch 自动设置为 None

7. 性能分析

7.1 KV Cache 效果

测试场景：Llama-3-8B，输入 512 tokens，生成 128 tokens

方案	Prefill 延迟	Decode 延迟/token	总延迟	显存占用
无缓存	50ms	120ms	15.4s	16GB
有缓存	50ms	8ms	1.07s	20GB

加速比4.4x（以 4GB 显存换取 14x 加速）

7.2 Continuous Batching 效果

测试场景：100 个并发请求，平均生成 50 tokens

方案	吞吐量 (tokens/s)	P50 延迟	P99 延迟	GPU 利用率
静态批处理	1200	2.5s	8.0s	65%
连续批处理	3500	1.2s	3.5s	92%

提升：

吞吐量：2.9x
P50 延迟：2.1x
GPU 利用率：+27%

7.3 内存占用分析

# Llama-3-8B 推理内存占用（batch_size=8, seq_len=2048）

模型权重:     16 GB (FP16)
KV Cache:     4 GB  (per request)
激活值:       2 GB  (临时)
其他:         1 GB

总计:         23 GB (单请求)
             48 GB (batch_size=8)

优化方向：

量化（INT8/INT4）：减少权重和 KV Cache 占用
Paged Attention：减少 KV Cache 碎片
FlashAttention：减少激活值占用

8. 设计模式总结

8.1 数据结构设计原则

分离关注点：
- SamplingParams：采样策略
- Req：请求状态
- Batch：批处理
- Context：全局配置
不变式保证：
- cached_len <= device_len <= max_device_len
- extend_len == 1 ⟺ Decode 阶段
- Batch 内所有请求同阶段
状态机设计：
- 明确的状态转换
- 清晰的终止条件
- 可验证的不变式

8.2 性能优化策略

计算优化：
- KV Cache：O(nm²) → O(n² + nm)
- FlashAttention：减少 HBM 访问
- CUDA Graph：减少 kernel launch 开销
内存优化：
- Paged Attention：减少碎片
- Continuous Batching：提高利用率
- Padding：对齐 GPU 优化
调度优化：
- 同阶段批处理
- 动态请求管理
- Overlap Scheduling（CPU/GPU 重叠）

9. LLM 推理 vs 训练：架构差异分析

9.1 计算图对比

训练（Training）：

# 完整的训练循环
for batch in dataloader:
    # 前向传播
    logits = model(input_ids)
    loss = criterion(logits, labels)

    # 反向传播
    loss.backward()  # 计算梯度

    # 参数更新
    optimizer.step()
    optimizer.zero_grad()

推理（Inference）：

# 自回归生成
with torch.no_grad():  # 不需要梯度
    for _ in range(max_new_tokens):
        logits = model(input_ids)
        next_token = sample(logits)
        input_ids = torch.cat([input_ids, next_token])

9.2 显存占用分析

训练显存组成：

总显存 = 模型权重 + 梯度 + 优化器状态 + 中间激活值

模型权重:     M (FP32)
梯度:         M (FP32)
优化器状态:   2M (Adam: momentum + variance)
中间激活值:   batch_size × seq_len × hidden_dim × num_layers

总计 ≈ 4M + 激活值

推理显存组成：

总显存 = 模型权重 + KV Cache

模型权重:     M (FP16/INT8)
KV Cache:     2 × num_layers × num_heads × head_dim × seq_len × batch_size

总计 ≈ M + KV Cache

示例（Llama-3-8B）：

组件	训练（FP32）	推理（FP16）
模型权重	32 GB	16 GB
梯度	32 GB	0 GB
优化器状态	64 GB	0 GB
激活值	~20 GB	0 GB
KV Cache	0 GB	~4 GB
总计	~148 GB	~20 GB

显存比例：训练 ≈ 7.4x 推理

9.3 计算特征对比

维度	训练	推理
前向传播	✅	✅
反向传播	✅	❌
梯度计算	✅	❌
参数更新	✅	❌
中间激活	必须保存	可丢弃（除 KV Cache）
批处理大小	大（256-2048）	小（1-32）
计算模式	批量并行	自回归串行
优化目标	吞吐量	延迟
GPU 数量	数百到数千	1-8
耗时	数周到数月	毫秒到秒

9.4 为什么训练需要保存激活值？

反向传播的链式法则：

# 前向传播
x1 = layer1(x0)  # 需要保存 x1
x2 = layer2(x1)  # 需要保存 x2
x3 = layer3(x2)  # 需要保存 x3
loss = criterion(x3, labels)

# 反向传播（需要前向的中间结果）
grad_x3 = grad_loss * ∂loss/∂x3
grad_x2 = grad_x3 * ∂x3/∂x2  # 需要 x2
grad_x1 = grad_x2 * ∂x2/∂x1  # 需要 x1

激活值重计算（Activation Checkpointing）：

# 优化：只保存部分激活值，其他重新计算
# 前向传播
x1 = layer1(x0)
x2 = layer2(x1)  # 保存 checkpoint
del x1  # 释放内存
x3 = layer3(x2)
x4 = layer4(x3)  # 保存 checkpoint
del x3
...

# 反向传播时重新计算
x3 = layer3(x2)  # 重新计算
grad_x3 = ...

权衡：

显存占用：减少 ~50%
计算量：增加 ~33%（重新计算）

9.5 推理的特殊优化

1. KV Cache：

训练不需要（每个样本独立）
推理必需（自回归生成）

2. 量化（Quantization）：

# 训练：FP32/BF16
weights = torch.randn(size, dtype=torch.float32)

# 推理：INT8/INT4
weights_int8 = quantize(weights, dtype=torch.int8)
# 显存减少 4x，速度提升 2-3x

3. 投机采样（Speculative Decoding）：

# 用小模型预测多个 token
draft_tokens = small_model.generate(n=5)

# 用大模型并行验证
verified = large_model.verify(draft_tokens)
# 加速 2-3x

10. Tensor Parallelism：分布式推理架构

10.1 问题背景

挑战：Llama-3-70B 模型有 140GB 参数（FP16），单个 A100（80GB）无法容纳。

解决方案：

模型并行：切分模型到多个 GPU
数据并行：切分数据到多个 GPU
流水线并行：切分层到多个 GPU

本节重点讨论 Tensor Parallelism（张量并行）。

10.2 矩阵切分数学原理

列切分（Column Parallel）

# 原始计算
Y = X @ W  # [B, M] @ [M, N] = [B, N]

# 列切分到 P 个 GPU
W = [W_0 | W_1 | ... | W_{P-1}]  # 按列切分

# 每个 GPU 独立计算
GPU_i: Y_i = X @ W_i  # [B, M] @ [M, N/P] = [B, N/P]

# 拼接结果（无需通信）
Y = [Y_0 | Y_1 | ... | Y_{P-1}]  # [B, N]

特点：

输入 X 需要广播到所有 GPU
输出 Y 在各 GPU 上是独立的
无需 All-Reduce 通信

行切分（Row Parallel）

# 原始计算
Y = X @ W  # [B, M] @ [M, N] = [B, N]

# 行切分到 P 个 GPU
W = [W_0; W_1; ...; W_{P-1}]  # 按行切分

# 输入也需要切分
GPU_i: X_i = X[:, i*M/P : (i+1)*M/P]  # [B, M/P]
GPU_i: Y_i = X_i @ W_i  # [B, M/P] @ [M/P, N] = [B, N]

# All-Reduce 求和（需要通信）
Y = sum(Y_0, Y_1, ..., Y_{P-1})  # [B, N]

特点：

输入 X 需要切分
输出 Y 需要 All-Reduce
通信量：B × N × sizeof(dtype)

10.3 Transformer 层的切分策略

class TransformerLayer:
    def __init__(self, hidden_dim, num_heads, tp_size):
        # Attention 投影：列切分
        self.W_q = ColumnParallelLinear(hidden_dim, hidden_dim, tp_size)
        self.W_k = ColumnParallelLinear(hidden_dim, hidden_dim, tp_size)
        self.W_v = ColumnParallelLinear(hidden_dim, hidden_dim, tp_size)

        # Attention 输出：行切分
        self.W_o = RowParallelLinear(hidden_dim, hidden_dim, tp_size)

        # FFN：列切分 → 行切分
        self.W_up = ColumnParallelLinear(hidden_dim, 4*hidden_dim, tp_size)
        self.W_down = RowParallelLinear(4*hidden_dim, hidden_dim, tp_size)

    def forward(self, x):
        # Attention
        q = self.W_q(x)  # 列切分，无通信
        k = self.W_k(x)
        v = self.W_v(x)

        attn_out = self_attention(q, k, v)  # 每个 GPU 独立计算部分 head
        x = x + self.W_o(attn_out)  # 行切分，All-Reduce

        # FFN
        x = x + self.W_down(F.gelu(self.W_up(x)))  # All-Reduce

        return x

通信模式：

每个 Transformer 层：
  W_q, W_k, W_v (列切分) → 无通信
  Self-Attention → 无通信（每个 GPU 处理部分 head）
  W_o (行切分) → All-Reduce (1)
  W_up (列切分) → 无通信
  W_down (行切分) → All-Reduce (2)

总计：2 次 All-Reduce / 层

10.4 通信开销分析

All-Reduce 算法：

# Ring All-Reduce（最优算法）
通信量 = 2 × (P-1) / P × data_size
通信时间 = 2 × (P-1) / P × data_size / bandwidth

# 示例（P=8, data_size=256MB, bandwidth=300GB/s）
通信量 = 2 × 7/8 × 256MB = 448MB
通信时间 = 448MB / 300GB/s ≈ 1.5ms

每层通信开销：

# Llama-3-70B（hidden_dim=8192, batch=8, seq_len=2048）
data_size = batch × seq_len × hidden_dim × sizeof(FP16)
          = 8 × 2048 × 8192 × 2
          = 256 MB

# 每层 2 次 All-Reduce
通信时间 = 2 × 1.5ms = 3ms

# 80 层总通信时间
总通信 = 80 × 3ms = 240ms

计算 vs 通信比例：

# 前向传播计算时间（A100）
计算时间 ≈ 500ms

# 通信占比
通信占比 = 240ms / (500ms + 240ms) ≈ 32%

10.5 TP vs DP vs PP

并行策略	切分对象	通信频率	通信量	适用场景
Tensor Parallel (TP)	模型权重	每层 2 次	激活值	模型大，单 GPU 放不下
Data Parallel (DP)	数据 batch	每个 step 1 次	梯度	模型小，数据大
Pipeline Parallel (PP)	模型层	每个 microbatch	激活值	模型超大，层数多

10.6 混合并行策略

3D 并行（TP + DP + PP）：

# 示例：64 个 GPU，Llama-3-175B
TP = 8   # 张量并行度
DP = 4   # 数据并行度
PP = 2   # 流水线并行度

# GPU 分组
for pp_stage in range(PP):  # 2 个流水线阶段
    for dp_group in range(DP):  # 4 个数据并行组
        for tp_rank in range(TP):  # 8 个张量并行 rank
            gpu_id = pp_stage * (DP * TP) + dp_group * TP + tp_rank
            # 每个 GPU 存储：1/8 模型（TP） × 1/2 层（PP）

通信拓扑：

TP Group (8 GPUs): 高频通信（NVLink/NVSwitch）
  ↓
DP Group (4 replicas): 低频通信（InfiniBand）
  ↓
PP Stage (2 stages): 中频通信（激活值传递）

10.7 Mini-SGLang 中的 TP 实现

# distributed/impl.py
def all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """All-Reduce 通信原语"""
    torch.distributed.all_reduce(
        tensor,
        op=torch.distributed.ReduceOp.SUM,
        group=get_tp_group()
    )
    return tensor

# layers/linear.py
class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        self.tp_size = tp_size
        self.tp_rank = get_tp_rank()

        # 每个 GPU 只存储部分列
        self.out_features_per_partition = out_features // tp_size
        self.weight = nn.Parameter(
            torch.empty(in_features, self.out_features_per_partition)
        )

    def forward(self, x):
        # 每个 GPU 独立计算
        return x @ self.weight  # 无需通信

class RowParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        self.tp_size = tp_size
        self.tp_rank = get_tp_rank()

        # 每个 GPU 只存储部分行
        self.in_features_per_partition = in_features // tp_size
        self.weight = nn.Parameter(
            torch.empty(self.in_features_per_partition, out_features)
        )

    def forward(self, x):
        # 输入已经被切分
        output = x @ self.weight
        # All-Reduce 求和
        return all_reduce(output)

10.8 性能优化技巧

1. 通信与计算重叠：

# 不重叠（慢）
output = compute()
output = all_reduce(output)

# 重叠（快）
with torch.cuda.stream(compute_stream):
    output = compute()
with torch.cuda.stream(comm_stream):
    output = all_reduce(output)

2. 通信融合：

# 不融合（多次通信）
x = all_reduce(x)
y = all_reduce(y)

# 融合（一次通信）
xy = torch.cat([x, y])
xy = all_reduce(xy)
x, y = torch.split(xy, [x.size(0), y.size(0)])

3. 梯度累积：

# 减少 DP 通信频率
for i in range(gradient_accumulation_steps):
    loss = forward(batch[i])
    loss.backward()  # 累积梯度

# 只在最后一次通信
all_reduce(gradients)
optimizer.step()

11. 与其他框架对比

特性	Mini-SGLang	vLLM	TGI	TensorRT-LLM
代码量	~5K	~50K	~30K	~100K
KV Cache	✅	✅	✅	✅
Paged Attention	✅	✅	❌	✅
Continuous Batching	✅	✅	✅	✅
CUDA Graph	✅	✅	❌	✅
易读性	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐

Mini-SGLang 的优势：

代码简洁，易于理解和修改
保留核心优化，性能接近 vLLM
适合学习和研究

11. 与其他框架对比

特性	Mini-SGLang	vLLM	TGI	TensorRT-LLM
代码量	~5K	~50K	~30K	~100K
KV Cache	✅	✅	✅	✅
Paged Attention	✅	✅	❌	✅
Continuous Batching	✅	✅	✅	✅
CUDA Graph	✅	✅	❌	✅
Tensor Parallelism	✅	✅	✅	✅
Pipeline Parallelism	❌	✅	❌	✅
易读性	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐