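NVIDIA AI/GPU Networking Interview: Two-Day Sprint Playbook

Role: System Software Architect, AI and GPU Networking
Interview time: 2026-06-15 20:00-22:00 GMT+8
Current goal: use the two-day weekend to bring the JD's technical requirements back to an interview-ready state: able to go deep, draw the diagrams, debug, and ideally show hands-on evidence.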

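0. How to use this document

What you need is not a "CV packaging document" but the following:

What skills the JD requires.
What matching evidence your CV has.
Which items are a real match and which are gaps.
Which technical questions the interview will ask.
How the interviewer will keep drilling down on each skill.
What experiment, code, benchmark, or architecture diagram each drill-down needs as evidence.
What level counts as "mastery", beyond just knowing the concept.

Do not read this document by "number of tasks". The right way to read it:

Start with the JD skill.
Then look at the evidence in your CV that connects to it.
Then prepare the follow-up question chain for that skill.
Only then do the corresponding extension task or experiment.

In other words, tasks are not the main thread. Tasks only exist to produce evidence that supports interview answers.

Here, "full mastery in 2 days" is defined by these interview acceptance criteria:

You can draw the core data path / control path.
You can explain the key mechanisms, not just recite definitions.
You can answer second- and third-level follow-ups.
You can give a debug checklist.
For CUDA/NCCL/LLM serving, you have at least a runnable experiment or a complete experiment design.
For RDMA/UCX/GPUNetIO, which need a hardware environment, you can describe how to build a real setup, which metrics are observable, and how to troubleshoot failures.

Realistic limits:

You cannot truly become a UCX/NIXL/GPUNetIO internals expert in 2 days.
But you can reach the "system-architecture-level mastery" that a first-round technical interview for this role needs: what problem each layer solves, how data flows, where it gets slow, and how to verify it.

Current preparation environment:

You have a Windows GPU machine; it is the main environment for this preparation.
Windows native handles: NVIDIA driver, CUDA Toolkit, Nsight Systems / Nsight Compute GUI, CUDA Samples.
WSL2 Ubuntu handles: Linux CUDA CLI, Python/vLLM, UCX/NCCL tools, and a Linux workflow close to NVIDIA AI infra.
Without an RDMA/RoCE/NVIDIA NIC/DOCA environment, do not force a GPUNetIO install; cover that part with architecture diagrams, experiment designs, and debug checklists.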

Recommended directory layout:

learning-cuda/
  NVIDIA_Interview_Master_Plan_Deep_CN.md
  labs/
    llm_serving/
    cuda_profiling/
    nccl_tests/
    ucx_rdma_notes/
    diagrams/

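0.1 Most important: do not read from start to finish by chapter number

This document is now long; if you read it in order from 0 to 17, you will easily lose focus.
The right structure is four layers: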

flowchart TB
    L1[Layer 1: Interview map] --> L2[Layer 2: Technical threads]
    L2 --> L3[Layer 3: Hands-on evidence]
    L3 --> L4[Layer 4: Spoken answers]

    L1 --> A[JD requirements / CV mapping / follow-up tree]
    L2 --> B[Serving / NIXL / NCCL / UCX / GPUNetIO / CUDA / Architecture]
    L3 --> C[CUDA code / vLLM benchmark / NCCL tests / UCX checklist / data path diagrams]
    L4 --> D[English answers / deep-dive answers / questions to ask]

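Each day when you open this document, first ask yourself:

Which JD skill am I working on right now?
How does my CV connect to this skill?
Where will the interviewer start asking?
How deep will the follow-ups go?
Which experiment do I need to run, or which diagram do I need to draw, to back this up?
Can I explain it clearly in English in 60-90 seconds at the end?

So the recommended reading path is not chapter order but the order below.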

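0.2 Recommended reading path: follow the interview-preparation order

Round 0: build the map only, no details

Goal: know what this role actually tests, without drowning in terminology.

Read these:

Order	Read	What you should get from it
1	`1. JD skill requirements table`	The full picture of the JD skills.
2	`1.0 Master index`	The CV mapping, follow-ups, and evidence for each skill.
3	`1.0.2 Interview follow-up tree`	How an interview goes from the first question to the third.
4	`12. Final acceptance checklist`	Which diagrams, playbooks, and English answers you must deliver at the end.

Do not run code in this round, and do not dig into UCX/NIXL/GPUNetIO.
You only need to be able to say:

The core of this role is not "reciting many library names", but:
AI inference runtime + GPU communication/data movement + CUDA profiling + system architecture.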

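Round 1: prepare the threads most likely to be asked first

Goal: get fluent on the parts the interview is most likely to cover.

In this order:

Order	Skill thread	Why first	Chapters
1	AI inference / model serving	The JD explicitly lists inference/model serving, and it is the strength that connects most easily to your CV.	Chapter 4 + Section 2.6
2	CUDA / Nsight / GPU performance	The JD explicitly lists C++/Python/CUDA/GPU programming/profiling.	Chapter 9 + Section 2.5
3	NCCL / collectives / parallelism	AI/GPU networking will certainly ask about communication libraries and parallelism modes.	Chapter 6 + Section 2.7
4	RDMA / UCX / GPUDirect	The low-level data path of GPU networking.	Chapter 7 + Section 2.8

In this round, do only three things per skill:

1. Read what the JD requires.
2. Read how your CV connects.
3. Read what the interview will ask and the reference answer.

Do not try to finish every experiment yet.
First make sure you can talk through each technical thread out loud.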

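Round 2: fill in the NVIDIA-specific stack

Goal: bring the NVIDIA-related areas where your production experience is thin up to "can explain it, can design a validation".

In this order:

Order	Technology	Learning goal	Chapters
1	NIXL	Know that it addresses inference state/KV movement; do not confuse it with NCCL.	Section 3.4 + Chapter 5
2	Dynamo	Know that it is a distributed inference framework/orchestration layer.	Chapter 5
3	GPUNetIO	Know the scenarios, costs, and limits of GPU-centric networking.	Section 3.5 + Chapter 8
4	GR00T	A bonus that shows awareness of NVIDIA Physical AI; it must not crowd out the main thread.	Chapter 16

The most important thing in this round is not to say anything wrong.
You must be able to distinguish clearly:

NCCL: collective communication across ranks.
NIXL: inference state/data movement, for example KV transfer.
UCX: transport abstraction that can run over TCP/RDMA/shared memory/CUDA-aware paths.
GPUDirect RDMA: the data path where the RNIC accesses GPU memory directly.
GPUNetIO: a more GPU-centric programming model for packet/network processing.
GR00T: a robotics VLA workload, not a core requirement of this JD.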

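Round 3: run evidence experiments, not student exercises

Goal: every experiment must serve an interview follow-up question.

Do only these P0 experiments:

Experiment	Question it supports	Read
CUDA memory coalescing / pinned copy / Nsight	"How do you prove you understand CUDA profiling?"	Section 2.5
vLLM benchmark matrix	"How do you measure TTFT/TPOT/P99?"	Section 2.6
NCCL all_reduce_perf	"How do you locate a slow all-reduce?"	Section 2.7
UCX/RDMA/GPUDirect checks and data path diagram	"What can you verify locally? How do you check a real cluster?"	Section 2.8

After an experiment, do not just save the output. Turn the result into interview language:

What did I run?
What were the controlled variables?
Which metrics did I look at?
What does the result show?
If the result is abnormal, how would I investigate?
Which JD skill does this relate to?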

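Round 4: practice spoken answers only

Goal: compress the technical content into English answers you can actually say.

Read these:

Order	Read	How to use it
1	Chapter 14, high-frequency Q&A	Answer each question out loud; do not read silently.
2	Chapter 10, Architecture / Prototype	Prepare for system-design follow-ups.
3	Chapter 15, questions for the interviewer	Pick 2-3 questions.
4	Section 16.13, how to mention GR00T naturally	Use it as a bonus only when the context fits.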

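0.3 Standard learning structure for each skill

Every skill later in this document should be studied with the same structure.
If a chapter is too long, hold on to just this structure:

1. What it is
   What is the definition of this technology?

2. Why it is needed
   Which bottleneck in the system does it address?

3. What level the JD requires
   Using it, debugging it, designing with it, or writing the low-level implementation?

4. How your CV connects
   Can your existing experience connect naturally?
   Where must you not overstate?

5. First interview question
   What is most likely to be asked first?

6. Deep-dive follow-ups
   If the interviewer knows this technology, what will they ask next?

7. Debug playbook
   When it is slow, wrong, or P99 is high, how do you investigate?

8. Hands-on evidence
   Which experiment do you run, which diagram do you draw, which metrics do you look at?

9. English answer
   Can you explain it clearly in 60-90 seconds?

You can organize each technology into this table:

Field	What to write
Technology name	For example NCCL / UCX / NIXL / GPUNetIO / CUDA stream.
One-sentence definition	Explainable in 30 seconds or less.
Problem solved	latency, throughput, memory pressure, CPU overhead, network bottleneck.
Data path	Where data starts and ends, and which CPU/GPU/NIC/memory layers it crosses.
Key metrics	TTFT, TPOT, P99, algbw, busbw, bandwidth, occupancy, stall, copy time.
Common failures	slow rank, TCP fallback, memory-bound, queueing, KV pressure, copy/sync gap.
Debug order	What to check first, what next, and how to pin down the cause.
CV connection	Which part of your experience supports this ability.
Limits not to overstate	Be honest where you have no production ownership.
Experiment evidence	Code, benchmark, diagram, checklist.
English answer	The version you say directly in the interview.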

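0.4 Re-partitioning this document

To avoid getting lost, treat the whole document as 6 zones:

Zone	Chapters	Purpose	When to read
A. Interview map	0-1	JD/CV/follow-ups/priorities	At the start of every study session
B. Evidence experiment library	2	Environment, code, benchmarks, reading results	When you need to prove an answer
C. Data movement thread	3	Overall UCX/NIXL/GPUNetIO learning path	When studying the NVIDIA-specific stack
D. Skill deep dives	4-10	Detailed explanation of each JD skill	Skill by skill
E. Sprint and acceptance	11-15	Schedule, acceptance, Q&A, questions to ask	Concentrated practice before the interview
F. Bonus material	16-17	GR00T and references	After the main thread is done

The most recommended study order in practice:

0 -> 1 -> 4 -> 9 -> 6 -> 7 -> 5 -> 8 -> 10 -> 14 -> 12 -> 15 -> 16

Not:

0 -> 1 -> 2 -> 3 -> 4 -> ... -> 17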

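1. JD skill requirements table: what the JD wants, what you have, how to close the gap

JD skill	What the JD actually means	Current state of your CV	Match	Main gap	Weekend plan
AI inference / model serving	Understand prefill/decode, KV cache, TTFT/TPOT/P99, throughput/memory trade-offs	Mentions LLM inference integration, KV cache, speculative decoding, batching	High	Needs lower-level runtime/data movement vocabulary	Draw the serving path, run or design a vLLM benchmark, prepare to debug high TTFT/TPOT
Dynamo / NIXL	NVIDIA distributed inference orchestration and inference data movement	CV says you are deepening this area	Medium-low	No production experience	Learn the layering, the NIXL transfer lifecycle, and KV cache movement scenarios
NCCL / collectives	Multi-GPU collective communication, the basis of training/parallel communication	Has NCCL concepts	Medium	No measured evidence	Run `nccl-tests`, or at least be able to explain all-reduce debugging
Parallelism	Communication patterns of DP/TP/PP/FSDP	CV mentions study of distributed AI systems	Medium	No project experience	Build a communication-pattern table, practice DP/TP/PP/FSDP Q&A
RDMA / RoCE / GPUDirect	AI cluster network data path, GPU-NIC data movement	Has networking/backpressure, conceptual RDMA study	Medium	No hands-on RDMA cluster experience	Draw TCP/RDMA/GPUDirect paths, learn QP/CQ/MR, prepare UCX/RDMA debugging
UCX	High-performance transport abstraction	CV mentions UCX concepts	Medium-low	Missing UCP/UCT/transport details	Learn UCX layering, be able to explain how to investigate slow UCX
GPUNetIO	GPU-centric networking, where the GPU participates in the network data path	Concepts only	Low	No hands-on experience	Learn CPU-centric vs GPU-centric paths, prepare use cases and costs
CUDA / Nsight	GPU execution model, memory hierarchy, profiling	Has Vulkan/WebGPU/Nsight/RenderDoc	Medium	Not enough CUDA kernel / Nsight Compute evidence	Write 3 small CUDA experiments or experiment designs, master Systems vs Compute
C++ / Python / CUDA / Linux systems	The JD explicitly requires C++ and Python, ideally with CUDA/GPU programming; the systems thread is C++/CUDA/Linux, and Python is for AI frameworks/benchmarks/automation	Strong C++/Python/Linux, with GPU profiling and CUDA study	High	Not enough hands-on CUDA/C++ GPU networking evidence	C++/CUDA for small profiling experiments; Python for the vLLM benchmark; Bash to organize `nsys`/`ncu`/`nccl-tests`/`ucx_info`
System architecture / prototype	Design optimizations, validate them, and judge whether they enter the roadmap	Strong	High	Needs GPU data movement context	Prepare a prototype template: hypothesis -> microbenchmark -> E2E validation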

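1.0 Master index: organized by JD skill, not by task

When reviewing, treat each skill as one interview thread:

flowchart LR
    JD[JD skill requirement] --> CV[Matching evidence in CV]
    CV --> Match[Matches and gaps]
    Match --> Q[What the interviewer will ask]
    Q --> Deep[Second/third-level follow-ups]
    Deep --> Evidence[What evidence is needed]
    Evidence --> Task[Extension task/experiment/diagram]

The table below is the main entry point of the whole document.
The CUDA/vLLM/NCCL/UCX experiments later on are only evidence sources for its last column.

JD skill	How your CV connects	What the interviewer will likely ask first	What the deeper follow-ups will be	The core answer you must give	Extension task/evidence
C++ / Python / CUDA / Linux systems	C++/Python/Linux, performance-sensitive code, GPU profiling, Vulkan/WebGPU experience	Is this a C++ role or a Python role? How far does your CUDA go?	What CUDA have you written? How do you prove you can debug GPU performance? How is C++ used in the systems path?	C++/CUDA/Linux is the main systems battleground; Python/Bash is for AI frameworks, benchmarks, and automation; my GPU experience transfers, but I validate NVIDIA-specific behavior with CUDA/Nsight.	CUDA memory access, pinned copy, Nsight Systems/Compute workflow; Section 2.5, Chapter 9
AI inference / model serving	CV has LLM inference integration, KV cache, batching, speculative decoding, observability	What is the path of an LLM request from arrival to output tokens?	How do you investigate high TTFT? High TPOT? Why does P99 get worse? Why does the KV cache matter?	Split serving into queueing, tokenization, scheduling, prefill, KV allocation, decode, streaming; explain metrics per stage.	vLLM benchmark matrix, TTFT/TPOT/P99 analysis; Section 2.6, Chapter 4
Dynamo / NIXL	CV says you are deepening distributed inference and NIXL/UCX concepts	What is NIXL? What is Dynamo?	How does NIXL differ from NCCL? Why does inference need a data movement library? What does slow KV transfer affect?	NIXL targets inference state movement, typically KV/cache/state transfer; NCCL targets collective tensor communication between ranks.	KV movement data path diagram, NIXL transfer lifecycle, slow transfer checklist; Chapters 3 and 5, Section 14.7
NCCL / collectives / parallelism	CV has distributed systems, communication path diagnosis, tail latency/backpressure	What is all-reduce? What does NCCL do?	Why do small and large messages have different bottlenecks? How do you locate a slow rank? Which collectives do DP/TP/FSDP use?	NCCL is a GPU collective communication library; first reproduce the problem with message size, rank mapping, topology, transport, and nccl-tests, then talk about optimization.	nccl-tests all_reduce_perf, reading algbw/busbw, parallelism communication table; Section 2.7, Chapter 6
RDMA / RoCE / UCX / GPUDirect	CV has networking, backpressure, Linux systems, cross-layer debugging	Why is RDMA faster than TCP? What is UCX?	Why is RoCE troublesome? What is registered memory? What are the conditions for GPUDirect RDMA? How do you investigate slow UCX?	RDMA reduces CPU/kernel involvement; UCX is a transport abstraction; GPUDirect RDMA needs GPU/NIC/driver/topology/fabric support.	TCP vs RDMA vs GPUDirect mermaid diagram, ucx_info/checklist; Section 2.8, Chapters 3 and 7
GPUNetIO	CV only has conceptual study; do not pose as a production owner	What is GPUNetIO? How does it differ from GPUDirect RDMA?	When is it worth using? What does it cost? How do you validate without DOCA/NIC?	GPUNetIO is GPU-centric networking that lets the GPU participate more directly in the packet/data path; it is not right for every scenario and is only worth considering when CPU-mediated networking becomes the bottleneck.	CPU-centric vs GPU-centric data path, suitable/unsuitable scenarios, DOCA environment limits; Section 3.5, Chapter 8, Section 14.11
CUDA / Nsight / GPU performance	CV has Vulkan/WebGPU/Nsight/RenderDoc/GPU workload profiling	What is the difference between Nsight Systems and Nsight Compute?	How do you tell memory-bound from compute-bound? What is coalescing? Is higher occupancy always better?	Systems shows the end-to-end timeline; Compute shows kernel internals; the optimization direction must be decided by memory throughput, SM utilization, warp stalls, and copy/sync gaps.	Three experiments: memory coalescing, shared-memory tiling, pinned memory copy; Section 2.5, Chapter 9
System architecture / prototype / roadmap	CV strengths: system design, prototype, observability, roadmap, debugging	How would you design a GPU networking optimization prototype?	How do you trade off microbenchmarks against E2E metrics? When should something not enter the roadmap? How do you handle an unfamiliar technology gap?	Define the hypothesis and baseline first, then run a microbenchmark, and finally decide whether it is worth it using E2E TTFT/TPOT/P99/throughput/resources/complexity.	Prototype template, P99 debug playbook, 30-day ramp-up answer; Chapter 10, Sections 14.15 and 14.17
GR00T / Physical AI	A recently studied NVIDIA direction, usable as a platform-awareness bonus	What is GR00T? How does it differ from an LLM?	How does it relate to this role? Would it use NCCL/NIXL/GPUNetIO?	GR00T is a robotics VLA workload; the safe connections to this JD are GPU inference, profiling, low-latency deployment, and data movement; do not present it as the main thread.	GR00T as a closing supplement that does not crowd out the main thread; Chapter 16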

If time is short, prepare strictly in this priority order:

P0:
  AI inference/model serving
  CUDA/Nsight/GPU performance
  NCCL/collectives
  UCX/RDMA/GPUDirect

P1:
  Dynamo/NIXL
  GPUNetIO
  System architecture/prototype

P2:
  GR00T/Physical AI
  ROCm/Metal/MPS/Vulkan/WebGPU comparison

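1.0.1 How to review each skill

Prepare every skill with the same template, and do not get stuck in experiment details first:

1. What is the exact JD wording?
2. What problem does this skill solve in the system?
3. Which part of my CV connects to it?
4. What is the limit I must not overstate?
5. What is the interviewer's first question?
6. Where will the interviewer's second level go?
7. Given a failure scenario, how do I debug it?
8. Which experiment/diagram/benchmark backs me up?
9. How do I answer in English in 60-90 seconds?

Where to find things in this document:

What you are looking for	Where to look
Overall skill map	Chapter 1
Environment and evidence experiments	Chapter 2
UCX/NIXL/GPUNetIO learning path	Chapter 3
Individual skill deep dives	Chapters 4-10
Two-day schedule	Chapter 11
Final acceptance	Chapter 12
High-frequency Q&A	Chapter 14
GR00T supplement	Chapter 16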

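1.0.2 Interview follow-up tree: questions are the main thread, tasks are extensions

These are the "question chains" you actually need to prepare.
Each chain maps to an experiment, code, or a diagram only at its end.

A. AI inference / model serving

flowchart TB
    A[What is the LLM request path] --> B[Difference between prefill and decode]
    B --> C[What TTFT / TPOT / P99 each represent]
    C --> D[How to debug high TTFT]
    C --> E[How to debug high TPOT]
    C --> F[How to debug high P99]
    D --> G[queueing / scheduler / prefill / KV allocation]
    E --> H[decode loop / KV cache / memory bandwidth / collectives]
    F --> I[batching / admission control / tail latency]
    G --> J[Extension: vLLM benchmark]
    H --> J
    I --> J

The answering ability you need to build:

I do not just know the name vLLM.
I can split a serving request into stages and map each metric to a concrete bottleneck.
If TTFT/TPOT/P99 is abnormal, I know what to check first and what to check next.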
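To make the metric definitions above concrete, here is a minimal Python sketch of how TTFT, TPOT, and a P99 could be derived from per-token arrival timestamps. The function names, the nearest-rank percentile method, and the sample timestamps are illustrative assumptions, not the output format of vLLM or any other benchmark tool.

```python
import math

def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds for one request.

    request_start: time the request was sent.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: time from request start to the first output token (queueing + prefill).
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT: mean gap between consecutive output tokens after the first (decode loop).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

if __name__ == "__main__":
    # Hypothetical timestamps (seconds) for three requests.
    requests = [
        (0.0, [0.20, 0.25, 0.30, 0.35]),
        (0.0, [0.50, 0.54, 0.58]),
        (0.0, [1.20, 1.30, 1.40, 1.50, 1.60]),
    ]
    ttfts = [ttft_and_tpot(start, times)[0] for start, times in requests]
    print("P50 TTFT:", percentile(ttfts, 50), "P99 TTFT:", percentile(ttfts, 99))
```

Keeping TTFT and TPOT separate matters for the debug order: a high TTFT points toward queueing, scheduling, or prefill, while a high TPOT points toward the decode loop.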

B. Dynamo / NIXL

flowchart TB
    A[What is Dynamo] --> B[Why distributed inference needs orchestration]
    B --> C[What is NIXL]
    C --> D[Why inference state movement differs from collectives]
    D --> E[NIXL vs NCCL]
    D --> F[KV cache movement]
    F --> G[How to debug a slow transfer]
    G --> H[Extension: KV movement data path / NIXL lifecycle]

The answering ability you need to build:

NIXL is not a replacement for NCCL.
NIXL is closer to an inference data/state movement layer.
Typical scenarios are prefill/decode disaggregation or KV cache transfer.
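A quick back-of-the-envelope estimate helps explain why KV cache transfer can matter in prefill/decode disaggregation. The sketch below uses the usual per-sequence KV size estimate (K and V tensors per layer); the model shape and link bandwidth are hypothetical numbers chosen for illustration, and the code does not use any NIXL API.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Estimate KV cache size for one sequence.

    The factor 2 covers one K tensor and one V tensor per layer.
    dtype_bytes=2 assumes FP16/BF16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

def transfer_seconds(num_bytes, link_gbytes_per_s):
    """Ideal transfer time at a given sustained bandwidth (ignores latency and setup)."""
    return num_bytes / (link_gbytes_per_s * 1e9)

if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 4096-token prompt.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
    print(f"KV cache: {size / 2**20:.0f} MiB")
    # Hypothetical sustained link bandwidths in GB/s.
    for gbps in (3.0, 25.0, 50.0):
        print(f"{gbps:5.1f} GB/s -> {transfer_seconds(size, gbps) * 1e3:.1f} ms")
```

If the transfer sits between prefill and the first decode step, this time adds to TTFT, which is one way to argue that the transport path is worth checking when a KV transfer is slow.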

C. NCCL / collectives / parallelism

flowchart TB
    A[What is NCCL] --> B[What is a collective]
    B --> C[all-reduce / all-gather / reduce-scatter]
    C --> D[Which communication DP / TP / PP / FSDP use]
    C --> E[small message vs large message]
    E --> F[latency-bound vs bandwidth-bound]
    D --> G[How to locate a slow all-reduce]
    F --> G
    G --> H[rank mapping / topology / transport / NCCL_DEBUG]
    H --> I[Extension: nccl-tests all_reduce_perf]

The answering ability you need to build:

I will not call every communication slowdown an NCCL bug.
I first look at message size, rank, topology, transport, and whether I can reproduce it with nccl-tests.
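The small-message versus large-message distinction can be sketched with a simple alpha-beta cost model for a ring all-reduce, together with the algbw-to-busbw conversion that nccl-tests documents for all-reduce (busbw = algbw * 2(n-1)/n). The latency and bandwidth values below are hypothetical, and real NCCL may choose tree or other algorithms, so treat this as a reasoning aid rather than a prediction.

```python
def allreduce_bandwidths(size_bytes, seconds, num_ranks):
    """Return (algbw, busbw) in bytes/s for an all-reduce measurement."""
    algbw = size_bytes / seconds
    busbw = algbw * 2 * (num_ranks - 1) / num_ranks
    return algbw, busbw

def ring_allreduce_time(size_bytes, num_ranks, alpha_s, link_bytes_per_s):
    """Alpha-beta model of a ring all-reduce.

    Returns (latency_term, bandwidth_term) in seconds:
    2(n-1) steps of latency alpha, plus 2(n-1)/n of the buffer over the link.
    """
    steps = 2 * (num_ranks - 1)
    latency_term = steps * alpha_s
    bandwidth_term = (steps / num_ranks) * size_bytes / link_bytes_per_s
    return latency_term, bandwidth_term

if __name__ == "__main__":
    # Hypothetical: 8 ranks, 5 us per-step latency, 100 GB/s per-link bandwidth.
    for size in (1024, 1 << 20, 1 << 30):
        lat, bw = ring_allreduce_time(size, 8, 5e-6, 100e9)
        share = lat / (lat + bw)
        print(f"{size:>11d} B: latency share {share:.1%}")
```

In this model the 1 KiB message is almost entirely latency, while the 1 GiB message is almost entirely bandwidth, which is why the two cases call for different debugging questions.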

D. RDMA / RoCE / UCX / GPUDirect

flowchart TB
    A[Why the traditional TCP path is slow] --> B[What is RDMA]
    B --> C[Why registered memory is needed]
    C --> D[What QP / CQ / MR are]
    B --> E[Why RoCE is hard]
    B --> F[What is UCX]
    F --> G[How a UCX transport is selected]
    B --> H[What is GPUDirect RDMA]
    H --> I[What the GPU-NIC direct path requires]
    G --> J[How to debug slow UCX/RDMA]
    I --> J
    J --> K[Extension: ucx_info / data path diagram / slow-path checklist]

The answering ability you need to build:

I can clearly draw the CPU-staged TCP path, the RDMA path, and the GPUDirect RDMA path.
I can also state honestly what I cannot verify locally without RDMA/NIC/DOCA.

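E. GPUNetIO

flowchart TB
    A[What is GPUNetIO] --> B[How it differs from GPUDirect RDMA]
    B --> C[Why GPU-centric networking]
    C --> D[When it is worth using]
    D --> E[When it is not worth using]
    C --> F[Costs and risks]
    F --> G[What makes it hard to debug]
    G --> H[Extension: CPU-centric vs GPU-centric path diagram]

The answering ability you need to build:

GPUNetIO is not the default answer to every networking problem.
It may be worth considering only when CPU-mediated networking is on the critical path and the GPU consumes the network data directly.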

F. CUDA / Nsight / GPU performance

flowchart TB
    A[CUDA execution model] --> B[thread / block / grid / warp]
    B --> C[memory hierarchy]
    C --> D[global / shared / pinned memory]
    D --> E[coalescing / tiling / copy overlap]
    E --> F[memory-bound vs compute-bound]
    F --> G[Nsight Systems vs Nsight Compute]
    G --> H[How to locate low GPU utilization]
    H --> I[Extension: CUDA memory / matmul / pinned copy experiments]

The answering ability you need to build:

I first use Nsight Systems to find the critical path.
Then I use Nsight Compute to look at the memory, occupancy, stall, and instruction metrics of a specific kernel.

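G. System architecture / prototype

flowchart TB
    A[Given an optimization direction] --> B[Define the hypothesis first]
    B --> C[Find a baseline]
    C --> D[Run a microbenchmark]
    D --> E[Build a minimal prototype]
    E --> F[Integrate into an end-to-end workload]
    F --> G[Check TTFT / TPOT / P99 / throughput / cost]
    G --> H[Decide whether it enters the roadmap]

The answering ability you need to build:

An architect does more than talk about concepts.
I weigh the optimization hypothesis, validation metrics, risks, fallback, and maintenance cost together in the decision.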

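1.1 Should you prepare this role as C++ or as Python?

Judging by the JD text, it is not either/or. The JD says:

Strong programming background in C++, Python, and ideally CUDA or other GPU programming models.

So the conclusion is:

Language requirement: both C++ and Python.
Main interview battleground: C++ / CUDA / Linux systems.
Experiment and automation tooling: Python / Bash.
Bonus: CUDA or other GPU programming models.

Why not "prepare Python only":

This is System Software Architect, AI and GPU Networking, not a general Python AI application engineer.
The JD focuses on runtime systems, communication libraries, GPU networking, data movement, prototypes, and performance optimization.
Low-level stacks such as UCX, NCCL, GPUNetIO, CUDA, and RDMA mostly live in a C/C++/CUDA/Linux context.
The JD also requires AI frameworks, so Python matters too, but it leans toward PyTorch/vLLM/JAX/TensorFlow integration, benchmarks, experiment orchestration, data analysis, and automation.

Why you also cannot say "C++ only":

The JD explicitly lists Python.
The AI framework ecosystem exposes its interfaces mainly through Python.
vLLM, PyTorch, benchmark scripts, and experiment data analysis use Python heavily.
An architect interview checks whether you can prototype quickly, and Python is the fastest validation tool.

The language layers you should show in the interview:

Layer	What to use	Why
CUDA kernel / GPU memory experiments	C++ / CUDA	Proves you understand GPU execution, memory hierarchy, and profiling
NCCL / UCX / system communication understanding	C/C++/Linux context	These libraries and runtimes are mainly system-level interfaces
vLLM / serving benchmark	Python	Serving frameworks and benchmark scripts validate TTFT/TPOT/P99 faster
Experiment automation / log analysis	Python / Bash	Quickly organize benchmarks, build tables, process output
Architecture design / prototype explanation	Either C++ or Python	C++ proves low-level ability, Python proves fast validation ability

Ordered by interview-preparation priority:

C++/CUDA/Linux systems: highest priority, because this is the role's low-level technical credibility.
Python: required, for running AI frameworks, the vLLM benchmark, and automated experiments.
Bash/Linux tooling: required, for chaining nvidia-smi, nsys, ncu, nccl-tests, and ucx_info.

Preparation priority: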

flowchart TB
    Role[JD Role: System Software Architect AI/GPU Networking]
    Role --> Core[C++ / CUDA / Linux Systems]
    Role --> Support[Python / Bash Tooling]
    Core --> CUDAExp[CUDA Memory / Kernel Profiling]
    Core --> Comm[NCCL / UCX / RDMA Concepts]
    Core --> Runtime[Runtime / Data Movement Design]
    Support --> VLLM[vLLM Serving Benchmark]
    Support --> Automation[Benchmark Automation]
    Support --> Analysis[Result Parsing / Plotting]

This is how you should answer the language question:

The JD asks for both C++ and Python, ideally with CUDA or another GPU programming model. I would position myself as a systems engineer who can work in C++/CUDA/Linux for performance-critical runtime and communication paths, while using Python for AI framework integration, serving benchmarks, experiment orchestration, and analysis.

If the interviewer asks whether you are stronger in C++ or Python, a suggested answer:

My production background includes both C++ and Python, but for this NVIDIA role I would emphasize systems-level C++ experience, performance-sensitive code paths, and my ability to use Python/Bash to build fast benchmark and diagnostic workflows. I would not present myself as only a Python ML application engineer.

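Split your code preparation over these two days the same way:

C++/CUDA: prepare or run at least one CUDA memory access or copy benchmark.
Python: run or design the vLLM benchmark, and tabulate TTFT/TPOT/P99.
Bash: record the environment, run the commands, and organize the labs/ output.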

1.2 How to think about the AMD / macOS counterparts of CUDA

The JD's main thread is NVIDIA, so interview preparation still centers on CUDA / NCCL / UCX / GPUDirect / GPUNetIO.
But you have Vulkan/WebGPU and cross-platform GPU/graphics experience, so you can use AMD/macOS technologies as analogies to show that you understand what GPU computing has in common, not just CUDA vocabulary.

Core principles:

Main interview language: CUDA / NVIDIA stack.
Transfer analogies: AMD ROCm/HIP, Apple Metal/MPS, Vulkan/WebGPU.
Do not present AMD/macOS as the main technology stack of this role.

Technology mapping table:

NVIDIA / CUDA context	AMD / ROCm context	macOS / Apple context	The common ground you must master
CUDA	HIP / ROCm	Metal Compute / MPS / MPS Graph	GPU kernel programming, memory hierarchy, parallel execution
CUDA kernel	HIP kernel	Metal compute kernel	The parallel model of thread/block/grid or threadgroup/thread execution
CUDA thread block	HIP block	Metal threadgroup	A group of threads cooperating on the same compute unit
CUDA warp	wavefront / wave	SIMD-group / simdgroup	lockstep/SIMD execution, cost of divergence
shared memory	LDS / shared memory	threadgroup memory	on-chip memory, tiling, reuse, bank conflict
global memory	global memory	device memory	high-latency, high-capacity memory; coalescing/locality
CUDA stream	HIP stream	command queue / command buffer	async execution, overlap, dependency
CUDA event	HIP event	command buffer completion / GPU counters	timing, sync, dependency measurement
Nsight Systems	rocprof / rocprofiler / omnitrace	Instruments / Xcode GPU tools	timeline, kernel, copy, sync, CPU/GPU overlap
Nsight Compute	rocprof compute profiling	Xcode GPU counters / Metal debugger	kernel-level bottleneck, occupancy, memory throughput
NCCL	RCCL	No fully equivalent Apple cluster collective stack	distributed GPU collective communication
GPUDirect RDMA	ROCm RDMA / peer-direct style capability	A comparable AI cluster data path is uncommon on macOS	GPU-NIC direct data movement
CUDA-aware MPI/UCX	ROCm-aware MPI/UCX	Uncommon	Whether the transport understands GPU memory

How to use this experience in the interview:

My hands-on GPU experience includes Vulkan/WebGPU-style profiling and GPU workload analysis. I understand that CUDA is the primary stack for this NVIDIA role, so I use those experiences as transferable mental models: GPU execution hierarchy, memory locality, async command execution, profiling timelines, and bottleneck isolation. I would still validate NVIDIA-specific behavior with CUDA, Nsight Systems, Nsight Compute, NCCL, and UCX.

If the interviewer asks "You have no production CUDA experience; isn't that a big problem?", you can answer like this:

I would be explicit that my production GPU-adjacent work is not CUDA-kernel production ownership. The transferable part is my understanding of GPU workload behavior, profiling, memory/resource trade-offs, and systems debugging. For this role, I am closing the NVIDIA-specific gap by running CUDA/Nsight experiments and mapping my Vulkan/WebGPU mental model to CUDA execution, memory hierarchy, and profiling tools.

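Study priorities within the two days:

Run experiments with CUDA/Nsight first, because that is the main thread of an NVIDIA interview.
Use AMD/macOS only as a comparison for understanding; do not spend time setting up a ROCm/Metal environment.
In the interview, bring up the AMD/macOS/Vulkan/WebGPU analogies only when asked how your past GPU experience is relevant.
After the analogy, return to NVIDIA immediately: CUDA, Nsight, NCCL, UCX, GPUDirect.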

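2. Evidence experiment library: verifiable material for JD/CV follow-ups

This chapter is not a new main-thread table of contents, and it is not a pile of student homework to complete mechanically.
It has a single purpose: when the interviewer follows up on a JD/CV skill, you can produce evidence that "I ran it / I know how to run it / I know how to read the result".

So use this chapter like this:

Go back to Chapter 1 and identify the current skill.
Then judge what the interviewer might ask as a follow-up.
Finally, run only the experiment that supports that follow-up.

The reading order should be:

flowchart LR
    A[JD skill] --> B[Matching CV evidence]
    B --> C[Interview follow-up]
    C --> D[Choose experiment]
    D --> E[Run code / read metrics]
    E --> F[Form answer evidence]
    F --> G[Back to Q&A]

Experiment priorities:

Priority	Experiment	Which JD/CV follow-up it supports
P0	CUDA memory/coalescing, pinned copy, Nsight workflow	Do you actually understand CUDA/GPU profiling?
P0	vLLM serving benchmark	Can you explain TTFT/TPOT/P99 and the LLM serving runtime?
P0	NCCL all-reduce benchmark or experiment design	Can you explain collective communication and a slow all-reduce?
P0	UCX/RDMA/GPUDirect data path + local capability check	Do you know the real limits of GPU networking?
P1	NIXL/KV movement architecture diagram	Do you understand inference data movement?
P1	GPUNetIO CPU-centric vs GPU-centric path	Do you know the use cases and costs of GPU-centric networking?
P2	ROCm/Metal/MPS/Vulkan/WebGPU comparison	Only to explain how your cross-platform GPU experience transfers to CUDA.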

2.1 Your environment split: Windows native + WSL2 Ubuntu

flowchart TB
    Win[Windows GPU Machine]
    Native[Windows Native]
    WSL[WSL2 Ubuntu]

    Native --> Driver[NVIDIA Driver / nvidia-smi]
    Native --> CUDAWin[CUDA Toolkit for Windows]
    Native --> NsightGUI[Nsight Systems / Nsight Compute GUI]
    Native --> Samples[CUDA Samples / Visual Studio]

    WSL --> LinuxCUDA[Linux CUDA CLI]
    WSL --> Python[vLLM / Python Benchmarks]
    WSL --> NCCL[NCCL Tests if Available]
    WSL --> UCX[UCX Tools if Available]

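Recommended priority:

First confirm that nvidia-smi, nvcc, and Nsight work on native Windows.
Then confirm that nvidia-smi works inside WSL2 Ubuntu.
If WSL2 can see the GPU, put the later vLLM/NCCL/UCX work in WSL2 where possible.
If WSL2 is not working for now, do not stay stuck on it; run CUDA/Nsight on native Windows first and write the Linux ecosystem parts as experiment designs.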

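2.2 Windows native environment check

Run in PowerShell (use `where.exe`, because a bare `where` in PowerShell is an alias for `Where-Object`):

nvidia-smi
nvcc --version
python --version
where.exe nsys
where.exe ncu

Record: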

Windows environment:
  GPU model:
  Driver version:
  CUDA version:
  Python version:
  Nsight Systems available: yes/no
  Nsight Compute available: yes/no

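How to read the result:

nvidia-smi shows the GPU: driver OK.
nvcc --version shows CUDA: CUDA Toolkit OK.
where.exe nsys / where.exe ncu finds a path: the Nsight CLI is available.
If the Nsight CLI is not found but the GUI is installed, you can start with the GUI.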

2.3 WSL2 Ubuntu environment check

Run in WSL2 Ubuntu:

nvidia-smi
python3 --version
uname -a
which nvcc || true
which nsys || true
which ncu || true

Record:

WSL2 environment:
  nvidia-smi works: yes/no
  Python version:
  nvcc available: yes/no
  nsys available: yes/no
  ncu available: yes/no

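How to read the result:

nvidia-smi in WSL2 sees the GPU: you can run the Linux CUDA/Python GPU workflow.
No nvcc in WSL2 but the GPU is visible: some Python GPU workloads still run, but the CUDA compile experiments need the toolkit installed.
WSL2 cannot see the GPU: hands-on vLLM/NCCL/UCX work is limited; do Windows CUDA/Nsight plus Linux experiment designs first.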

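2.4 How to judge the environment check results

You do not need to create any document first. For now, just look at the command output and judge what this machine can run.

The four conclusions that matter most:

Conclusion	How to judge	Next action
Windows CUDA available	`nvidia-smi` and `nvcc --version` both work in PowerShell	You can run the main CUDA C++/Nsight experiments
Windows Nsight available	`where.exe nsys` or `where.exe ncu` finds a path, or the GUI opens	You can do timeline/kernel profiling
WSL2 GPU available	`nvidia-smi` works in WSL2	You can try the vLLM, NCCL, and UCX Linux toolchain
Real RDMA/GPUNetIO not verifiable	No NVIDIA NIC/IB/RoCE/DOCA environment	Do only the data path diagram, benchmark design, and debug checklist

Ideal result: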

Windows:
  nvidia-smi works
  nvcc works
  Nsight Systems / Compute available

WSL2:
  nvidia-smi works
  python3 works
  optional: nvcc / nsys / ncu available

If only Windows CUDA is available, that is still enough for the most critical CUDA/Nsight thread.
If WSL2 can also see the GPU, continue with vLLM/NCCL/UCX.
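The checks in 2.2 and 2.3 can also be collected by a small script so that the environment record is reproducible. The sketch below is one possible way to do it with the Python standard library; the tool list and the output layout are assumptions that mirror the checklist above, and `nvidia-smi` is only queried when it is found on PATH.

```python
import shutil
import subprocess

# Tools named in the environment checklist.
TOOLS = ["nvidia-smi", "nvcc", "nsys", "ncu"]

def probe_tools(tools=TOOLS):
    """Return {tool: True/False} depending on whether each tool is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

def gpu_summary():
    """Return 'name, driver_version' lines from nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None

if __name__ == "__main__":
    for tool, found in probe_tools().items():
        print(f"{tool} available: {'yes' if found else 'no'}")
    print("GPU:", gpu_summary() or "not visible")
```

Running the same script on native Windows and inside WSL2 gives two comparable records, which makes it easier to decide which experiments go where.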

2.5 Task chain 1: CUDA / Nsight, proving you can locate GPU performance problems

This is not an isolated CUDA beginner exercise. It maps to these JD items:

C++ / Python, ideally CUDA or other GPU programming models
performance profiling and optimization
defining and using hardware features
design and prototype features and optimizations

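It also maps to these points in your CV:

Vulkan/WebGPU/GPU workload profiling.
C++/Python/Linux.
performance-sensitive path, bottleneck isolation, diagnostic tooling.

Use this task group to prove three things:

1. I know that a slow GPU program has to be examined layer by layer.
2. I can use Nsight Systems to read the global timeline.
3. I can use Nsight Compute to find bottlenecks inside a kernel.

Task structure: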

flowchart TB
    JD[JD: CUDA or GPU programming + profiling + hardware features]
    CV[CV: Vulkan/WebGPU profiling + C++/Python/Linux + bottleneck isolation]
    Goal[Goal: prove GPU performance debugging ability]

    JD --> Goal
    CV --> Goal

    Goal --> T1[Task 1: Memory Access Pattern]
    Goal --> T2[Task 2: Shared Memory Tiling]
    Goal --> T3[Task 3: Host-Device Transfer]
    Goal --> T4[Task 4: Nsight Workflow]

    T1 --> M1[Metrics: kernel time, bandwidth, load efficiency, stalls]
    T2 --> M2[Metrics: occupancy, shared memory, global load/store]
    T3 --> M3[Metrics: H2D/D2H time, overlap, sync gaps]
    T4 --> M4[Metrics: timeline, CUDA API, kernel detail]

    M1 --> Q1[Follow-up: coalescing / memory-bound / layout]
    M2 --> Q2[Follow-up: shared memory / occupancy / bank conflict]
    M3 --> Q3[Follow-up: pinned memory / DMA / stream overlap]
    M4 --> Q4[Follow-up: Systems vs Compute / critical path]

    Q1 --> Evidence[Interview evidence: I can locate GPU bottlenecks by measurement]
    Q2 --> Evidence
    Q3 --> Evidence
    Q4 --> Evidence

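2.5.0 The technical vocabulary in this task group

Term	What it is	Why it matters in the interview
CUDA kernel	A function that runs on the GPU and is launched from the CPU host side.	Shows you know GPU work is not an ordinary CPU function call.
thread / block / grid	The CUDA parallel hierarchy: a thread is the smallest execution unit, a block is a group of threads, and a grid is all blocks of one kernel launch.	Interviews ask about the execution model.
warp	On NVIDIA GPUs, a group of 32 threads that execute together.	Coalescing, divergence, and stalls are all defined per warp.
global memory	High-capacity but high-latency device memory on the GPU.	Most tensors, KV data, and network buffers end up in global memory.
shared memory	On-chip memory shared within a block; low latency but small capacity.	Tiled matmul, data reuse, and bank conflicts all depend on it.
coalescing	Adjacent threads in the same warp access contiguous addresses, so the hardware merges memory transactions.	Used to judge whether a memory layout is reasonable.
pinned memory	Page-locked host memory, suited to DMA and async transfers.	Connects CUDA copy, GPUDirect, RDMA, and NIXL data movement.
CUDA stream	An ordered asynchronous queue of CUDA operations.	Used for kernel/copy overlap and to reduce synchronization waits.
CUDA event	A timing/synchronization marker on the GPU timeline.	Used to measure kernel/copy time.
occupancy	The ratio of active warps on an SM to the maximum number of warps the SM supports.	Higher is not always better; weigh it against memory/register/shared memory trade-offs.
memory-bound	Performance is limited mainly by memory bandwidth/latency.	Common for LLM/KV/cache/network buffer work.
compute-bound	Performance is limited mainly by compute/instruction throughput.	Common for GEMM/tensor core workloads.
Nsight Systems	A system-wide timeline profiler.	Answers "where does the time go".
Nsight Compute	A per-kernel profiler.	Answers "why is this kernel slow".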

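2.5.1 Task 1: coalesced vs uncoalesced memory access

Task description:

Question	Description
What it is	Compare the performance of GPU threads accessing global memory contiguously with non-contiguous/strided access to global memory.
What it tests	CUDA memory hierarchy, memory coalescing, global memory bandwidth, reading Nsight Compute metrics.
Why it is tested	AI inference, KV cache, network buffers, and tensor layout all depend on the GPU memory access pattern; a system architect cannot look only at GPU utilization and must know that memory access can become the bottleneck.
Why it matters	Proves you can trace "the code is slow" down to the memory transaction / bandwidth / layout level instead of stopping at the application layer.
Prerequisites/inputs	A C++/CUDA kernel; the same data size; two access patterns: contiguous and stride/random.
Environment	Preferably native Windows PowerShell + CUDA Toolkit + Nsight Compute; WSL2 Ubuntu + `nvcc` + `ncu` also works.
Software used	`nvcc` to compile the CUDA C++; `ncu` or the Nsight Compute GUI for kernel memory metrics; optionally `nsys` for the timeline.
Outputs/metrics	kernel time, effective bandwidth, global load/store efficiency, warp stall reason.
Success criterion	You can explain why contiguous access is faster and which Nsight Compute metrics support that conclusion.
How to explain a failure	If the difference is small, check whether the data size is too small, the compiler optimization level, cache effects, and whether the stride really causes uncoalesced access.
Questions it extends to	KV cache layout, tensor layout, network buffer layout, locating memory-bound kernels.

What you need to know:

GPU global memory access is not "equally fast no matter how each thread reads".
When adjacent threads access contiguous addresses, the hardware can merge the memory transactions, so bandwidth is higher.
Stride/random/uncoalesced access causes more memory transactions, so throughput drops.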
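The effect of stride on memory transactions can be estimated on paper before running anything on a GPU. The following Python sketch counts how many distinct memory sectors one warp touches for a given element stride. It assumes 32 threads per warp, 4-byte floats, and a 32-byte sector granularity; real hardware behavior also depends on caches and the GPU generation, so this is a simplified model, not a profiler replacement.

```python
SECTOR_BYTES = 32  # assumption: memory is fetched in 32-byte sectors
WARP_SIZE = 32     # threads per warp on NVIDIA GPUs

def sectors_per_warp(stride_elems, elem_bytes=4):
    """Count distinct sectors touched when thread t reads element t * stride_elems."""
    sectors = {
        (t * stride_elems * elem_bytes) // SECTOR_BYTES
        for t in range(WARP_SIZE)
    }
    return len(sectors)

def useful_fraction(stride_elems, elem_bytes=4):
    """Fraction of fetched bytes that the warp actually uses."""
    fetched = sectors_per_warp(stride_elems, elem_bytes) * SECTOR_BYTES
    return WARP_SIZE * elem_bytes / fetched

if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {sectors_per_warp(stride):2d} sectors, "
              f"{useful_fraction(stride):.1%} of fetched bytes used")
```

Under these assumptions a stride of 1 needs 4 sectors per warp, while a stride of 8 or more needs 32, which is the kind of gap the coalesced and strided kernels are meant to expose.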

How to actually run it:

Code file: memory_access.cu

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(call) do {                                      \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess) {                                      \
        std::fprintf(stderr, "CUDA error %s:%d: %s\n",             \
                     __FILE__, __LINE__, cudaGetErrorString(err)); \
        std::exit(1);                                              \
    }                                                             \
} while (0)

__global__ void coalesced_copy(const float* __restrict__ in,
                               float* __restrict__ out,
                               int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

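__global__ void strided_copy(const float* __restrict__ in,
                             float* __restrict__ out,
                             int n,
                             int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Use a 64-bit product: i * stride overflows a 32-bit int for large n,
        // which would produce a negative index and an out-of-bounds read.
        size_t idx = (size_t(i) * size_t(stride)) % size_t(n);
        out[i] = in[idx];
    }
}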

float run_coalesced(const float* d_in,
                    float* d_out,
                    int n,
                    int blocks,
                    int threads,
                    int repeat) {
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    CHECK_CUDA(cudaEventRecord(start));
    for (int r = 0; r < repeat; ++r) {
        coalesced_copy<<<blocks, threads>>>(d_in, d_out, n);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    CHECK_CUDA(cudaGetLastError());
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    return ms / repeat;
}

float run_strided(const float* d_in,
                  float* d_out,
                  int n,
                  int blocks,
                  int threads,
                  int repeat,
                  int stride) {
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    CHECK_CUDA(cudaEventRecord(start));
    for (int r = 0; r < repeat; ++r) {
        strided_copy<<<blocks, threads>>>(d_in, d_out, n, stride);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    CHECK_CUDA(cudaGetLastError());
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    return ms / repeat;
}

int main() {
    const int n = 1 << 26;          // 67M floats, about 256 MB
    const int repeat = 50;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const int stride = 32;
    const size_t bytes = size_t(n) * sizeof(float);

    float* h = static_cast<float*>(std::malloc(bytes));
    if (h == nullptr) {
        std::fprintf(stderr, "host malloc failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) h[i] = float(i % 1024);

    float *d_in = nullptr, *d_out = nullptr;
    CHECK_CUDA(cudaMalloc(&d_in, bytes));
    CHECK_CUDA(cudaMalloc(&d_out, bytes));
    CHECK_CUDA(cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice));

    float coalesced_ms = run_coalesced(d_in, d_out, n, blocks, threads, repeat);
    float strided_ms = run_strided(d_in, d_out, n, blocks, threads, repeat, stride);

    // Each copy kernel reads n floats and writes n floats, so count traffic in both directions.
    double gb = 2.0 * double(bytes) / 1e9;
    std::printf("coalesced: %.3f ms, %.2f GB/s\n", coalesced_ms, gb / (coalesced_ms / 1000.0));
    std::printf("strided  : %.3f ms, %.2f GB/s, stride=%d\n", strided_ms, gb / (strided_ms / 1000.0), stride);

    CHECK_CUDA(cudaFree(d_in));
    CHECK_CUDA(cudaFree(d_out));
    std::free(h);
    return 0;
}

Windows PowerShell:

nvcc -O3 memory_access.cu -o memory_access.exe
.\memory_access.exe
ncu .\memory_access.exe

WSL2 Ubuntu:

nvcc -O3 memory_access.cu -o memory_access
./memory_access
ncu ./memory_access

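The program contains two kernels:

kernel A:
  out[i] = in[i]
  Contiguous access, expected to be coalesced.

kernel B:
  out[i] = in[(i * stride) % n]
  Or use an index table for scattered access; expected to be uncoalesced.

Controlled variables:

Same data size.
Same block size.
Same number of repetitions.
Only the memory access pattern changes.

How to read the result:

The coalesced version has a lower kernel time.
Its effective bandwidth is higher.
The uncoalesced version may show more memory dependency / memory throttle stalls.

How to read Nsight Compute:

Check whether memory throughput is close to the hardware capability.
Check global load/store efficiency or the memory-transaction-related metrics.
Check whether the warp stall reasons point to memory dependency.
If SM compute utilization is low but memory stalls are high, the kernel is memory-bound.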
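The memory-bound versus compute-bound judgment can also be cross-checked with a simple roofline-style estimate: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the machine balance (peak FLOP/s divided by peak bytes/s). The Python sketch below is a hedged illustration; the peak numbers are hypothetical placeholders, not the specification of any particular GPU, and profiler measurements should take precedence over this estimate.

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bytes_per_s):
    """Classify a kernel with a simple roofline model.

    Returns (label, arithmetic_intensity, attainable_flops_per_s).
    """
    intensity = flops / bytes_moved              # FLOPs per byte
    balance = peak_flops / peak_bytes_per_s      # machine balance, FLOPs per byte
    attainable = min(peak_flops, intensity * peak_bytes_per_s)
    label = "memory-bound" if intensity < balance else "compute-bound"
    return label, intensity, attainable

if __name__ == "__main__":
    # Hypothetical GPU: 30 TFLOP/s FP32 peak, 900 GB/s memory bandwidth.
    PEAK_FLOPS, PEAK_BW = 30e12, 900e9

    n = 1 << 26
    # Vector add c[i] = a[i] + b[i]: 1 FLOP per element, 12 bytes moved per element.
    print("vector add:", roofline_bound(n, 12 * n, PEAK_FLOPS, PEAK_BW)[0])

    m = 4096
    # Ideal m x m matmul: 2*m^3 FLOPs, each of the three matrices moved once.
    print("matmul    :", roofline_bound(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BW)[0])
```

A pure copy or vector-add kernel lands far below the machine balance, so improving its memory access pattern is the only lever, which matches what the stall reasons should show.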

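Possible follow-up questions:

What is memory coalescing?
Why do LLM serving and GPU networking also care about memory layout?
If GPU utilization is low but the kernel is slow, how do you tell whether it is memory-bound?

You can answer like this: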

Memory coalescing means adjacent threads access adjacent memory so the GPU can combine memory transactions. If access is strided or scattered, effective bandwidth drops. I would validate this with Nsight Compute by checking memory throughput, global load efficiency, and warp stall reasons. In inference or networking data paths, layout still matters because KV cache blocks, tensors, and network buffers all eventually become memory access patterns.

2.5.2 任务 2：naive matmul vs tiled shared-memory matmul

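Task description:

Question	Description
What it is	Compare a matmul that reads directly from global memory with one that uses shared-memory tiling to reuse data.
What it tests	shared memory, data reuse, occupancy, register/shared-memory trade-offs, memory-bound vs compute-bound judgment.
Why it is tested	LLM/GPU workloads spend much of their time in matrix operations and memory movement; even though the role is not a pure kernel engineer role, you are expected to understand the basic mechanisms of GPU optimization.
Why it matters	It shows you can explain why an optimization works, rather than just saying "it uses shared memory, so it is fast".
Prerequisites/inputs	Two matmul kernels, one naive and one using tiles + shared memory, on the same matrix size.
Environment	Preferably native Windows PowerShell + CUDA Toolkit + Nsight Compute; WSL2 Ubuntu + `nvcc` + `ncu` also works.
Tools	Compile with `nvcc`; time inside the program with CUDA events; use `ncu`/Nsight Compute to inspect occupancy, memory throughput, and shared memory.
Outputs/metrics	kernel time, global memory throughput, shared memory usage, achieved occupancy, SM utilization.
Success criteria	You can explain why the tiled version reduces global memory traffic, and explain the trade-off between occupancy and resource usage.
How to explain a failure	If the tiled version is not faster, check tile size, bank conflicts, register pressure, reduced occupancy, and whether the matrix is too small.
Questions it leads to	LLM GEMM, attention kernels, memory reuse, judging kernel-level optimizations.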

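What you need to know:

The naive matmul reads the same elements of A and B from global memory many times.
The tiled matmul stages a tile of data in shared memory so the threads in a block can reuse it.
This reflects the core idea of GPU optimization: raise data reuse and cut global memory traffic.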
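
The reduction in global memory traffic can be counted directly. The sketch below is a back-of-the-envelope model that ignores L1/L2 caching and assumes n is a multiple of the tile size; the function names are hypothetical helpers, not part of the benchmark code.

```cpp
#include <cassert>
#include <cstdint>

// Naive kernel: each of the n*n output elements reads n elements of A
// and n elements of B from global memory.
std::uint64_t naive_global_loads(std::uint64_t n) {
    return 2 * n * n * n;
}

// Tiled kernel: each thread loads one element of A and one of B per tile
// step, and there are n / tile steps per output element.
std::uint64_t tiled_global_loads(std::uint64_t n, std::uint64_t tile) {
    assert(tile > 0 && n % tile == 0);
    return 2 * n * n * (n / tile);
}
```

In this model the tiled kernel issues 1/tile as many global loads as the naive one, so a 16x16 tile cuts global loads by a factor of 16. Real speedups are usually smaller because caches already absorb part of the naive kernel's repeated reads.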

How to actually run it:

Code file: matmul_tiling.cu

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

#define CHECK_CUDA(call) do {                                      \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess) {                                      \
        std::fprintf(stderr, "CUDA error %s:%d: %s\n",             \
                     __FILE__, __LINE__, cudaGetErrorString(err)); \
        std::exit(1);                                              \
    }                                                             \
} while (0)

constexpr int TILE = 16;

__global__ void matmul_naive(const float* A,
                             const float* B,
                             float* C,
                             int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        sum += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = sum;
}

__global__ void matmul_tiled(const float* A,
                             const float* B,
                             float* C,
                             int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        int a_col = t + threadIdx.x;
        int b_row = t + threadIdx.y;

        As[threadIdx.y][threadIdx.x] =
            (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < n && col < n) {
        C[row * n + col] = sum;
    }
}

float time_naive(const float* d_A,
                 const float* d_B,
                 float* d_C,
                 int n,
                 dim3 grid,
                 dim3 block,
                 int repeat) {
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
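    // Warm-up launch so one-time startup cost is not included in the timing.
    matmul_naive<<<grid, block>>>(d_A, d_B, d_C, n);
    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_CUDA(cudaEventRecord(start));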
    for (int r = 0; r < repeat; ++r) {
        matmul_naive<<<grid, block>>>(d_A, d_B, d_C, n);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    CHECK_CUDA(cudaGetLastError());
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    return ms / repeat;
}

float time_tiled(const float* d_A,
                 const float* d_B,
                 float* d_C,
                 int n,
                 dim3 grid,
                 dim3 block,
                 int repeat) {
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
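    // Warm-up launch so one-time startup cost is not included in the timing.
    matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, n);
    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_CUDA(cudaEventRecord(start));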
    for (int r = 0; r < repeat; ++r) {
        matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, n);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    CHECK_CUDA(cudaGetLastError());
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    return ms / repeat;
}

int main() {
    const int n = 1024;
    const int repeat = 5;
    const size_t bytes = size_t(n) * n * sizeof(float);

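    float* h_A = static_cast<float*>(std::malloc(bytes));
    float* h_B = static_cast<float*>(std::malloc(bytes));
    if (h_A == nullptr || h_B == nullptr) {
        std::fprintf(stderr, "host allocation failed\n");
        std::exit(1);
    }
    // Fill with small deterministic values so every run uses the same inputs.
    for (int i = 0; i < n * n; ++i) {
        h_A[i] = float((i % 13) - 6) / 13.0f;
        h_B[i] = float((i % 17) - 8) / 17.0f;
    }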

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    CHECK_CUDA(cudaMalloc(&d_A, bytes));
    CHECK_CUDA(cudaMalloc(&d_B, bytes));
    CHECK_CUDA(cudaMalloc(&d_C, bytes));
    CHECK_CUDA(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);

    float naive_ms = time_naive(d_A, d_B, d_C, n, grid, block, repeat);
    float tiled_ms = time_tiled(d_A, d_B, d_C, n, grid, block, repeat);

    double flops = 2.0 * double(n) * double(n) * double(n);
    std::printf("N=%d\n", n);
    std::printf("naive: %.3f ms, %.2f GFLOP/s\n",
                naive_ms, flops / (naive_ms / 1000.0) / 1e9);
    std::printf("tiled: %.3f ms, %.2f GFLOP/s\n",
                tiled_ms, flops / (tiled_ms / 1000.0) / 1e9);

    CHECK_CUDA(cudaFree(d_A));
    CHECK_CUDA(cudaFree(d_B));
    CHECK_CUDA(cudaFree(d_C));
    std::free(h_A);
    std::free(h_B);
    return 0;
}

Windows PowerShell:

nvcc -O3 matmul_tiling.cu -o matmul_tiling.exe
.\matmul_tiling.exe
ncu .\matmul_tiling.exe

WSL2 Ubuntu:

nvcc -O3 matmul_tiling.cu -o matmul_tiling
./matmul_tiling
ncu ./matmul_tiling

Write two versions:

A:
  naive C = A x B
  Reads directly from global memory.

B:
  tiled C = A x B
  Each block stages tiles in shared memory and then reuses them.

Matrix sizes:

512x512 / 1024x1024 / 2048x2048
Choose according to your GPU's capability: verify correctness at a small size first, then scale up.
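
One way to verify correctness at a small size is to compare the GPU output against a CPU reference. The sketch below is host-only C++ and assumes row-major n x n matrices; `matmul_cpu` and `max_abs_diff` are hypothetical helper names that do not appear in matmul_tiling.cu.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major CPU reference: C = A x B for n x n matrices.
std::vector<float> matmul_cpu(const std::vector<float>& A,
                              const std::vector<float>& B, int n) {
    assert(n > 0);
    assert(A.size() == std::size_t(n) * n && B.size() == std::size_t(n) * n);
    std::vector<float> C(std::size_t(n) * n, 0.0f);
    for (int row = 0; row < n; ++row) {
        for (int k = 0; k < n; ++k) {
            float a = A[std::size_t(row) * n + k];
            for (int col = 0; col < n; ++col) {
                C[std::size_t(row) * n + col] += a * B[std::size_t(k) * n + col];
            }
        }
    }
    return C;
}

// Largest element-wise absolute difference between two equally sized buffers.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

In practice the device result would be copied back into a host buffer and accepted when `max_abs_diff` stays under a small tolerance. The tolerance has to be chosen by hand, because the GPU and the CPU accumulate the float sums in different orders.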

What to measure:

kernel time
achieved occupancy
shared memory usage
global memory load/store throughput
SM utilization

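How to read the results:

The tiled version is usually faster, but a larger tile size is not always better.
Using too much shared memory can lower occupancy.
High occupancy does not guarantee speed; read it together with memory throughput and stall reasons.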
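
The shared-memory versus occupancy trade-off can be estimated with simple arithmetic. The sketch below is a deliberately simplified model: it ignores register limits and allocation granularity, and the `SmLimits` values used with it are hypothetical, not the limits of any specific GPU.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM resource limits; real values depend on the GPU.
struct SmLimits {
    int max_warps;         // maximum resident warps per SM
    int max_blocks;        // maximum resident blocks per SM
    int shared_mem_bytes;  // shared memory available per SM
};

// Estimated occupancy = resident warps / max warps, limited by the warp
// count, the block count, and shared memory. Registers are ignored.
double estimate_occupancy(int threads_per_block, int smem_per_block_bytes,
                          const SmLimits& sm) {
    assert(threads_per_block > 0 && smem_per_block_bytes >= 0);
    int warps_per_block = (threads_per_block + 31) / 32;
    int blocks_by_warps = sm.max_warps / warps_per_block;
    int blocks_by_smem = smem_per_block_bytes > 0
                             ? sm.shared_mem_bytes / smem_per_block_bytes
                             : sm.max_blocks;
    int blocks = std::min({blocks_by_warps, blocks_by_smem, sm.max_blocks});
    return double(blocks * warps_per_block) / sm.max_warps;
}
```

With assumed limits of 64 warps, 32 blocks and 64 KiB of shared memory per SM, a 256-thread block using 2 KiB of shared memory reaches full estimated occupancy, while the same block using 32 KiB drops to 25%. Nsight Compute reports the achieved value, which is the number to trust.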

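How to read Nsight Compute:

If global memory loads drop noticeably, tiling is working.
If shared memory usage goes up but occupancy drops sharply, be ready to explain the resource trade-off.
If the stall reasons point to memory dependency, the kernel may still be memory-bound.
Only if compute pipeline utilization is high is the kernel closer to compute-bound.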
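
A roofline-style estimate gives a quick sanity check on the memory-bound versus compute-bound reading. The sketch below is a minimal model with hypothetical function names; the peak FLOP/s and bandwidth passed to it are assumptions to be replaced with the numbers for the GPU under test.

```cpp
#include <cassert>

// Arithmetic intensity: floating-point operations per byte of global
// memory traffic.
double arithmetic_intensity(double flops, double bytes_moved) {
    assert(bytes_moved > 0.0);
    return flops / bytes_moved;
}

// Ridge point of the roofline: the intensity (FLOP/byte) at which the
// bandwidth roof meets the compute roof.
double ridge_point(double peak_gflops, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return peak_gflops / peak_gbps;
}

// Below the ridge point the model predicts a memory-bound kernel.
bool is_memory_bound(double intensity, double peak_gflops, double peak_gbps) {
    return intensity < ridge_point(peak_gflops, peak_gbps);
}
```

If caching is ignored, the naive float matmul performs 2 FLOPs per 8 bytes loaded, an intensity of 0.25 FLOP/byte, and a 16x16 tile raises that to about 4 FLOP/byte. On a hypothetical GPU with 10000 GFLOP/s and 500 GB/s the ridge point is 20 FLOP/byte, so both versions would still sit on the memory-bound side of this model.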

Likely follow-up questions:

Why is shared memory fast?
What is occupancy, and why is higher not always better?
What is a bank conflict?
How do you tell whether a kernel is memory-bound or compute-bound?
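
For the bank conflict question, a small host-side model makes the mechanism concrete. It assumes 32 banks that are each 4 bytes wide, and a warp in which thread t reads the first word of row t of a shared array, which is a column-wise access. The function name and the access pattern are illustrative; this is not a claim about the access pattern of the matmul kernel above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed model: 32 shared-memory banks, each 4 bytes wide.
constexpr int kBanks = 32;
constexpr int kThreadsPerWarp = 32;

// Thread t reads the 4-byte word at [t][0] of a shared array whose rows
// are row_width_words wide. Returns the largest number of threads whose
// (distinct) addresses fall into the same bank.
int max_bank_conflict_degree(int row_width_words) {
    assert(row_width_words >= 1);
    std::array<int, kBanks> hits{};
    for (int t = 0; t < kThreadsPerWarp; ++t) {
        int word = t * row_width_words;
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

A row width of 32 words puts all 32 threads in the same bank, while padding each row to 33 words spreads them across all banks. That is the reasoning behind the common trick of declaring a shared tile with one extra column.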

A possible answer:

Shared memory is on-chip and much lower latency than global memory, so tiling can improve data reuse. But shared memory and registers are limited resources; using too much can reduce occupancy. I would not optimize for occupancy alone. I would check whether the kernel is memory-bound