Appendix B · Parameter Quick Reference


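Jump straight here whenever you need to look something up. Symbols are grouped into five categories (architecture / attention / MoE / residual / training) and written in math notation.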

Symbol	Meaning	V4-Pro	V4-Flash
$L$	Number of Transformer layers	61	43
$d$	Hidden size (backbone dimension)	7168	4096
$|V|$	Vocabulary size (BBPE, carried over from V3)	$\approx 128\text{K}$	$\approx 128\text{K}$
$N_{\text{total}}$	Total parameter count	1.6 T	284 B
$N_{\text{act}}$	Activated parameters per token	49 B	13 B
$T_{\text{train}}$	Total training tokens	33 T	32 T
$\eta_{\max}$	Peak learning rate	$2.0 \times 10^{-4}$	$2.7 \times 10^{-4}$
seq schedule	Sequence-length schedule (4 stages)	4K → 16K → 64K → 1M	4K → 16K → 64K → 1M

Symbol	Meaning	V4-Pro	V4-Flash
$m$	CSA compression ratio (every $m$ tokens are compressed into 1)	4	4
$k$	CSA top-k selection count	1024	512
$m'$	HCA heavy-compression ratio	128	128
$n_h$	Number of core attention heads	128	64
$d_h$	Value dim per head	512	512
$d_c$	Query compression latent dim (MQA, shared KV)	1536	1536
$n^I_h$	Number of Lightning Indexer heads	64	64
$c^I$	Indexer head dim	128	128
$g$	Number of Grouped Output Projection groups	16	16
$d_g$	Intermediate dim per group	1024	1024
$n_{\text{win}}$	SWA window length	128	128
$r_{\text{rope}}$	Partial RoPE dims (only the last $r_{\text{rope}}$ dims carry positional encoding)	64	64

Symbol	Meaning	V4-Pro	V4-Flash
$E_{\text{routed}}$	Number of routed experts	384	256
$E_{\text{shared}}$	Number of shared experts	1	1
$K$	Experts activated per token (top-K)	6	6
$d_{\text{ff}}$	Expert FFN intermediate dim	3072	2048
$W$	Number of MegaMoE waves	4–6	4–6
$C / B$	Hardware co-design bound, $\le 2d$ FLOPs/Byte	$\le 14336$	$\le 8192$

Symbol	Meaning	V4-Pro	V4-Flash
$n_{\text{hc}}$	mHC expansion factor (number of residual streams)	4	4
$t_{\max}$	Number of Sinkhorn iterations	20	20
$\|B\|_2$	Spectral norm of the residual matrix (constrained to $\le 1$)	$\le 1.0$	$\le 1.0$

Symbol / item	Meaning	V4 value
Three RL context tiers	Training context for Non-think / Think High / Think Max	8K / 128K / 384K
$G$	GRPO samples per group	8
$N_{\text{teacher}}$	Number of domain teachers used in OPD distillation	10+
$D_{\text{KL}}(\pi_\theta \| \pi_E)$	OPD loss (reverse KL, on-policy)	n/a
WAL granularity	Rollout write-ahead log	token-level
FP4 scope	Rollout / teacher / reference forward passes	FP4 (E2M1)
Backward precision	Gradient backward pass	FP8 (E4M3)

mHC residual update (Ch03 §2.2):

X_{l+1} \;=\; B_l X_l \;+\; C_l\, F_l(A_l X_l), \qquad B_l \in \mathrm{Birkhoff}(n_{\text{hc}})

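Sinkhorn projection (pushes any non-negative matrix onto the set of doubly stochastic matrices by alternating the two normalizations):

\widehat{B} \leftarrow \mathrm{diag}(\widehat{B}\,\mathbf{1})^{-1}\,\widehat{B} \quad\text{(row normalization)},\qquad \widehat{B} \leftarrow \widehat{B}\,\mathrm{diag}(\mathbf{1}^\top \widehat{B})^{-1} \quad\text{(column normalization)}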
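The alternating normalization can be sketched in a few lines of NumPy. This is a minimal illustration, not the report's implementation: it assumes the raw parameters are made positive with `exp` (the notes only say "non-negative matrix"), and the function name `sinkhorn_project` is hypothetical. It uses $n_{\text{hc}} = 4$ and $t_{\max} = 20$ from the table.

```python
import numpy as np

def sinkhorn_project(logits, t_max=20):
    """Alternate row and column normalization for t_max rounds."""
    # Assumption: exponentiate to obtain strictly positive entries.
    B = np.exp(logits - logits.max())
    for _ in range(t_max):
        B = B / B.sum(axis=1, keepdims=True)  # row normalization: rows sum to 1
        B = B / B.sum(axis=0, keepdims=True)  # column normalization: columns sum to 1
    return B

rng = np.random.default_rng(0)
B = sinkhorn_project(rng.normal(scale=0.5, size=(4, 4)))  # n_hc = 4

print("row sums:   ", B.sum(axis=1))
print("column sums:", B.sum(axis=0))
print("spectral norm:", np.linalg.norm(B, 2))
```

A doubly stochastic matrix has spectral norm exactly 1, which is where the $\|B\|_2 \le 1$ row in the residual table comes from; after a finite number of rounds the result is only approximately doubly stochastic.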

CSA per-query complexity (Ch04 §3):

\mathcal{O}_{\text{CSA}} \;=\; \underbrace{\frac{n}{m} \cdot c^I \cdot n^I_h}_{\text{indexer score}} \;+\; \underbrace{k \cdot d_h \cdot n_h}_{\text{core attention}}
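To get a feel for the two terms, the sketch below plugs in the V4-Pro column of the attention table. The function and argument names are hypothetical. The per-head dimension of the indexer term is left as an explicit argument (`d_idx`); the call uses the Indexer head dim $c^I = 128$, which is an assumption about which dimension the indexer dot products run over. "1M" is taken as $2^{20}$ tokens, also an assumption.

```python
def csa_cost_per_query(n, m, k, n_idx_heads, d_idx, n_heads, d_head):
    """Return (indexer_term, core_attention_term) in multiply-accumulates."""
    indexer = (n // m) * d_idx * n_idx_heads  # score every compressed block
    core = k * d_head * n_heads               # attend to the top-k selected blocks
    return indexer, core

# V4-Pro values from the table; n = 2**20 is an assumed reading of "1M".
indexer, core = csa_cost_per_query(
    n=2**20, m=4, k=1024, n_idx_heads=64, d_idx=128, n_heads=128, d_head=512
)
print(f"indexer: {indexer:,}  core: {core:,}")
```

Under these assumptions the indexer term grows linearly with $n$ while the core term is fixed by $k$, so at the longest context the indexer dominates the count.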

Polar decomposition / Muon target (Ch06 §2):

M = U \Sigma V^\top \;\Longrightarrow\; \mathrm{Polar}(M) = U V^\top \;\;\text{(all }\sigma_i = 1\text{)}

One Newton–Schulz step (Ch06 §3, matrix form):

M \;\leftarrow\; a\, M \;+\; b\, (MM^\top) M \;+\; c\, (MM^\top)^2 M

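Expanded in SVD coordinates, this applies the scalar quintic polynomial $p(\sigma) = a\sigma + b\sigma^3 + c\sigma^5$ to each singular value independently; $U$ and $V$ stay unchanged throughout.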
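The claim that one step only reshapes the singular values can be checked numerically. The sketch below is illustrative: the notes do not state values for $a, b, c$, so it borrows the coefficients from the public Muon reference implementation as an assumption, and the function name is hypothetical.

```python
import numpy as np

def newton_schulz_step(M, a, b, c):
    """One step: M <- a*M + b*(M M^T) M + c*(M M^T)^2 M."""
    G = M @ M.T
    return a * M + b * (G @ M) + c * (G @ G @ M)

# Assumption: coefficients from the public Muon reference code, not from these notes.
a, b, c = 3.4445, -4.7750, 2.0315

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 6))
M = M / np.linalg.norm(M)  # Frobenius normalization keeps every sigma <= 1

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M1 = newton_schulz_step(M, a, b, c)

# Apply the scalar polynomial to the singular values and rebuild with the same U, V.
p = a * s + b * s**3 + c * s**5
reconstructed = U @ np.diag(p) @ Vt
print("max abs difference:", np.abs(M1 - reconstructed).max())
```

The difference is at floating-point noise level, which confirms that the matrix iteration and the per-singular-value polynomial are the same map.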

MegaMoE pipeline speedup (Ch07):

T_{\text{wave}} \;=\; T \cdot \!\left(1 + \frac{4}{W}\right), \qquad \frac{T_{\text{naive}}}{T_{\text{wave}}} \;=\; \frac{5W}{W + 4} \xrightarrow{W \to \infty} 5\times
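Multiplying the two expressions gives $T_{\text{naive}} = 5T$, so the $5\times$ limit is only reached asymptotically. A small sketch (the function name is hypothetical) evaluates the ratio over the $W$ range listed in the MoE table:

```python
def wave_speedup(W):
    """T_naive / T_wave = 5W / (W + 4), from the formula above."""
    return 5 * W / (W + 4)

for W in (4, 5, 6):
    print(f"W={W}: speedup = {wave_speedup(W):.2f}x")
```

With $W$ between 4 and 6 the ratio is 2.5 to 3.0, about half of the asymptotic bound.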

Hardware co-design bound (Ch07):

\frac{C}{B} \;\le\; 2d \;\;\;(\text{FLOPs/Byte})

Stochastic Rounding (Ch11 §1; FP32→BF16 halves the communication volume):

\mathrm{SR}(x) = \begin{cases} \lceil x \rceil & \text{w.p. } p \\ \lfloor x \rfloor & \text{w.p. } 1 - p \end{cases},\qquad p = \frac{x - \lfloor x \rfloor}{\lceil x \rceil - \lfloor x \rfloor}, \quad \mathbb{E}[\mathrm{SR}(x)] = x
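For FP32→BF16, $\lfloor x \rfloor$ and $\lceil x \rceil$ are the two neighboring BF16 values. One common way to realize the rule, shown here as a sketch and not necessarily what the report does, is to add uniform 16-bit noise to the low 16 bits of the FP32 bit pattern and then truncate those bits. The function name is hypothetical, the result is kept in a float32 container, and only finite inputs are assumed.

```python
import numpy as np

def stochastic_round_to_bf16(x, rng):
    """Round float32 values to BF16 precision (returned as float32)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Uniform noise over the 16 bits that BF16 drops.
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    # A carry into bit 16 happens with probability (low bits) / 2**16,
    # which is exactly p in the formula above.
    rounded = (bits + noise) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

rng = np.random.default_rng(0)
# A quarter of the way between the BF16 neighbors 1.0 and 1.0078125.
x = np.full(200_000, 1.0 + 0.25 * 2.0**-7, dtype=np.float32)
out = stochastic_round_to_bf16(x, rng)
print("distinct outputs:", np.unique(out))
print("mean:", out.astype(np.float64).mean(), "target:", float(x[0]))
```

The sample mean lands close to the input, which is the $\mathbb{E}[\mathrm{SR}(x)] = x$ property; round-to-nearest would always return 1.0 here and bias the average.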

Two-stage CP output length (Ch11 §3):

\text{Stage 1 output length (per rank)} \;=\; \frac{s}{m} + 1 \;\;\text{(the }+1\text{ is the neighbor bridging token)}

Heterogeneous KV cache alignment (Ch12):

\text{Block size} \;=\; \mathrm{lcm}(m, m') \;=\; \mathrm{lcm}(4, 128) \;=\; 128
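The arithmetic can be checked directly. The per-block entry counts below are a derived illustration that assumes each branch stores one compressed entry per $m$ (respectively $m'$) raw tokens, as the table's definition of $m$ suggests; the variable names are hypothetical.

```python
import math

m, m_prime = 4, 128                    # CSA and HCA compression ratios
block_tokens = math.lcm(m, m_prime)    # raw tokens covered by one cache block

csa_entries_per_block = block_tokens // m        # compressed CSA entries
hca_entries_per_block = block_tokens // m_prime  # compressed HCA entries

print(block_tokens, csa_entries_per_block, hca_entries_per_block)
```

Choosing the least common multiple means every block holds a whole number of entries for both branches, so no compressed entry straddles a block boundary.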

Sample-level Attention Mask (Ch13 §2):

M_{ij} \;=\; \begin{cases} 0 & \text{if } j \le i \;\;\land\;\; \mathrm{sid}(i) = \mathrm{sid}(j) \\ -\infty & \text{otherwise} \end{cases}
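A minimal NumPy sketch of this mask for a packed sequence follows. The function name and the example sample ids are hypothetical; `sample_ids[i]` plays the role of $\mathrm{sid}(i)$.

```python
import numpy as np

def sample_level_mask(sample_ids):
    """Additive mask: 0 where j <= i and both positions share a sample, else -inf."""
    sid = np.asarray(sample_ids)
    pos = np.arange(sid.size)
    causal = pos[None, :] <= pos[:, None]      # j <= i
    same_sample = sid[None, :] == sid[:, None]  # sid(i) == sid(j)
    return np.where(causal & same_sample, 0.0, -np.inf)

# Two samples packed into one sequence: positions 0-1 and positions 2-4.
mask = sample_level_mask([0, 0, 1, 1, 1])
print(mask)
```

The mask is added to the attention logits before the softmax, so position 2 (the first token of the second sample) cannot see positions 0 and 1 even though they precede it.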

Anticipatory Routing (Ch14 §2; routing uses the old weights, expert computation uses the new weights):

r_t \;=\; \mathrm{topk}\!\left(W_r^{\,(t-\Delta t)} \cdot x_t\right), \qquad y_t \;=\; \sum_{i \in r_t} \mathrm{Expert}_i^{\,(t)}(x_t)

SwiGLU Clamping (Ch14 §3):

g \leftarrow \mathrm{clamp}(g,\, -10,\, 10), \qquad u \leftarrow \min(u,\, 10)
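The sketch below applies the two clamps exactly as written above and then combines the branches. The notes do not define $g$ and $u$ here, so reading $g$ as the gate pre-activation, $u$ as the linear branch, and the output as $\mathrm{SiLU}(g) \cdot u$ is an assumption; the function name is hypothetical.

```python
import numpy as np

def clamped_swiglu(g, u, limit=10.0):
    """Clamp g to [-limit, limit], cap u at limit, then return SiLU(g) * u."""
    g = np.clip(g, -limit, limit)   # two-sided clamp
    u = np.minimum(u, limit)        # upper bound only
    silu_g = g / (1.0 + np.exp(-g))  # SiLU(g) = g * sigmoid(g)
    return silu_g * u

g = np.array([-50.0, 0.0, 50.0])
u = np.array([1.0, 100.0, 100.0])
out = clamped_swiglu(g, u)
print(out)
```

With both inputs bounded, the product cannot exceed `limit**2` in magnitude on the positive side, which keeps outlier activations from blowing up the expert output.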

GRPO Advantage (Ch16 §2):

A_i \;=\; \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^G)}{\mathrm{std}(\{r_j\}_{j=1}^G)}, \qquad \mathcal{L}_{\text{GRPO}} \;=\; -\frac{1}{G}\sum_{i=1}^G A_i \, \log \pi_\theta(y_i \mid x)
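As a worked example with $G = 8$, the sketch below normalizes a group of rewards and evaluates the loss. Several details are assumptions not fixed by the formula: the population standard deviation (`ddof=0`), the small `eps` guarding against a zero-variance group, and `seq_logprobs` standing for the summed log-probability $\log \pi_\theta(y_i \mid x)$ of each sample. The function names and the numbers are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / std within one prompt's group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # eps is an added safeguard

def grpo_loss(advantages, seq_logprobs):
    """-(1/G) * sum_i A_i * log pi_theta(y_i | x)."""
    return -np.mean(advantages * np.asarray(seq_logprobs, dtype=np.float64))

rewards = [1, 0, 0, 1, 1, 0, 1, 1]  # G = 8 rollouts for one prompt
A = grpo_advantages(rewards)
seq_logprobs = [-12.0, -15.0, -9.0, -11.0, -14.0, -10.0, -13.0, -12.5]
print("advantages:", np.round(A, 3))
print("loss:", grpo_loss(A, seq_logprobs))
```

Because the advantages are centered within the group, a group whose rollouts all receive the same reward contributes zero gradient.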

OPD loss (Ch18 §2; weighted sum of reverse KL terms over multiple teachers):

\mathcal{L}_{\text{OPD}}(\theta) \;=\; \sum_{i=1}^N w_i \,\cdot\, \mathbb{E}_{x \sim \mathcal{D}_i,\; y \sim \pi_\theta(\cdot \mid x)}\!\left[\,D_{\text{KL}}\!\big(\pi_\theta \,\big\|\, \pi_{E_i}\big)\,\right]

KL divergence (reverse form):

D_{\text{KL}}(\pi_\theta \| \pi_E) \;=\; \sum_{v=1}^{|V|} \pi_\theta(v) \,\log\, \frac{\pi_\theta(v)}{\pi_E(v)} \;\;\text{(computed once per token position)}
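The per-position reverse KL and the weighted multi-teacher sum can be sketched as follows. This is an illustration on toy logits, not the training code: the function names are hypothetical, the vocabulary is shrunk to 16 entries, and averaging over token positions stands in for the expectation in the OPD loss.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def reverse_kl(student_logits, teacher_logits):
    """D_KL(pi_theta || pi_E) per token position; the student is the first argument."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def opd_loss(per_teacher_kl, weights):
    """Weighted sum over teachers of the mean per-position reverse KL."""
    return sum(w * kl.mean() for w, kl in zip(weights, per_teacher_kl))

rng = np.random.default_rng(0)
student = rng.normal(size=(3, 16))    # 3 token positions, toy vocabulary of 16
teacher_a = rng.normal(size=(3, 16))
teacher_b = rng.normal(size=(3, 16))

kl_a = reverse_kl(student, teacher_a)
kl_b = reverse_kl(student, teacher_b)
print("per-position KL vs teacher A:", kl_a)
print("OPD loss:", opd_loss([kl_a, kl_b], weights=[0.5, 0.5]))
```

The expectation is weighted by the student's own probabilities, so the student is penalized for putting mass where the teacher has little, which is the mode-seeking behavior usually associated with reverse KL.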

End.
These notes were written in 2026-04 from the DeepSeek-V4 Preview technical report; they are not official DeepSeek documentation.