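Sequence 07 - Hands-on Labs and Verification Evidence
The point of these labs is not to grind practice problems but to have evidence ready for interview follow-up questions.
1. Lab overview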
flowchart TB
Labs[Practical Labs] --> CUDA[CUDA/Nsight Labs]
Labs --> VLLM[vLLM Serving Labs]
Labs --> NCCL[NCCL Labs]
Labs --> UCX[UCX/RDMA Labs]
Labs --> Design[Design-only Labs]
CUDA --> C1[Memory coalescing]
CUDA --> C2[Pinned memory copy]
CUDA --> C3[Nsight workflow]
VLLM --> V1[Input length vs TTFT]
VLLM --> V2[Output length vs TPOT]
VLLM --> V3[Request rate vs P99]
NCCL --> N1[all_reduce_perf]
NCCL --> N2[algbw/busbw interpretation]
UCX --> U1[ucx_info]
UCX --> U2[transport/device boundary]
Design --> D1[NIXL KV transfer design]
Design --> D2[GPUNetIO path design]
2. CUDA memory coalescing
Goal:
Show that you understand how GPU memory access patterns affect performance.
Experiment:
- kernel A: out[i] = in[i]
- kernel B: out[i] = in[(i * stride) % n]
What to look at:
| Metric | What it tells you |
|---|---|
| kernel time | which version is slower |
| effective bandwidth | whether memory access is efficient |
| Nsight Compute memory metrics | memory transactions / load efficiency |
| warp stalls | whether stalls are caused by memory dependencies |
Interview answer:
Coalesced access lets adjacent threads access adjacent memory, which allows the GPU to combine memory transactions. Strided or scattered access increases transactions and lowers effective bandwidth.
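Before profiling, the difference between kernel A and kernel B can be estimated on paper. The sketch below counts how many memory sectors one warp touches for a given stride and converts a measured kernel time into effective bandwidth. It is a simplified model, not a substitute for Nsight Compute: the 32-byte sector size, the 4-byte element size, and the function names are assumptions made for illustration, and caches are ignored.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # assumed memory sector granularity
ELEM_BYTES = 4      # assumed float32 elements


def sectors_per_warp(stride: int, n: int, warp_start: int = 0) -> int:
    """Count distinct sectors one warp reads for in[(i * stride) % n]."""
    sectors = set()
    for lane in range(WARP_SIZE):
        i = warp_start + lane
        byte_addr = ((i * stride) % n) * ELEM_BYTES
        sectors.add(byte_addr // SECTOR_BYTES)
    return len(sectors)


def effective_bandwidth_gb_per_s(n: int, kernel_time_s: float) -> float:
    """Useful bytes moved per second (one read and one write per element)."""
    return 2 * n * ELEM_BYTES / kernel_time_s / 1e9


if __name__ == "__main__":
    n = 1 << 24
    for stride in (1, 2, 8, 32):
        print(f"stride={stride}: {sectors_per_warp(stride, n)} sectors per warp")
```

With stride 1 (equivalent to kernel A) the warp reads 128 contiguous bytes; larger strides spread the same 32 loads over more sectors, which is the extra traffic that the memory metrics in the table should reveal.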
3. Pinned memory copy
Goal:
Show that you understand CPU-GPU data movement and can connect it to GPUDirect/RDMA/NIXL.
Experiment:
- pageable host memory + H2D/D2H
- pinned host memory + H2D/D2H
- Optional: pinned + async copy + stream overlap
What to look at:
| Metric | What it tells you |
|---|---|
| H2D bandwidth | host to device copy efficiency |
| D2H bandwidth | device to host copy efficiency |
| timeline gap | whether there is synchronization or CPU waiting |
| overlap | whether copy and kernel execution overlap |
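Once the copies have been timed and a timeline exported, the bandwidth and overlap rows of this table reduce to simple arithmetic. A minimal sketch follows, assuming the intervals are (start, end) pairs in seconds and that intervals within each list do not overlap one another (for example, one copy stream and one compute stream). The function names are hypothetical and not part of any profiler API.

```python
def copy_bandwidth_gb_per_s(num_bytes: int, elapsed_s: float) -> float:
    """Bandwidth of a single H2D or D2H copy in GB/s."""
    return num_bytes / elapsed_s / 1e9


def overlap_seconds(copies, kernels) -> float:
    """Total time during which a copy and a kernel run concurrently.

    Assumes intervals inside each list are disjoint, so the pairwise
    intersections never double-count the same span of time.
    """
    total = 0.0
    for c_start, c_end in copies:
        for k_start, k_end in kernels:
            total += max(0.0, min(c_end, k_end) - max(c_start, k_start))
    return total


if __name__ == "__main__":
    copies = [(0.0, 4.0), (6.0, 10.0)]
    kernels = [(2.0, 7.0)]
    print("overlap (s):", overlap_seconds(copies, kernels))
```

Comparing the pageable and pinned runs with the same helper keeps the bandwidth numbers consistent; an overlap of zero in the async variant usually points at an implicit synchronization worth finding in the timeline.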
4. vLLM benchmark
Goal:
Map serving metrics to runtime stages.
Experiment matrix:
| Variable | Values |
|---|---|
| input length (tokens) | 128 / 512 / 2048 |
| output length (tokens) | 64 / 128 / 512 |
| request rate (req/s) | 1 / 4 / 16 |
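The matrix above expands to 27 runs if every combination is measured. A small sketch for enumerating them is shown here; the dictionary keys are hypothetical labels chosen for illustration, not vLLM benchmark flags.

```python
from itertools import product

INPUT_LENS = (128, 512, 2048)
OUTPUT_LENS = (64, 128, 512)
REQUEST_RATES = (1, 4, 16)


def build_sweep():
    """Return one config dict per (input length, output length, request rate)."""
    return [
        {"input_len": i, "output_len": o, "request_rate": r}
        for i, o, r in product(INPUT_LENS, OUTPUT_LENS, REQUEST_RATES)
    ]


if __name__ == "__main__":
    sweep = build_sweep()
    print(len(sweep), "runs; first:", sweep[0])
```

If 27 runs is too many, varying one variable at a time while holding the other two at their middle value still isolates each effect.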
What to look at:
| Metric | Corresponding stage |
|---|---|
| TTFT | queueing / scheduling / prefill |
| TPOT | decode |
| P99 | saturation / batching / tail |
| GPU memory | KV cache pressure |
| GPU utilization | batching/compute efficiency |
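To make the TTFT, TPOT, and P99 rows concrete, the sketch below derives them from per-request timestamps. It assumes each request record carries an arrival time, first-token time, finish time, and output token count; those field names are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from the interpolation a benchmark tool applies.

```python
import math


def ttft(req: dict) -> float:
    """Time to first token: queueing + scheduling + prefill."""
    return req["first_token_time"] - req["arrival_time"]


def tpot(req: dict):
    """Time per output token after the first one (decode phase)."""
    n = req["output_tokens"]
    if n < 2:
        return None  # undefined when only one token was generated
    return (req["finish_time"] - req["first_token_time"]) / (n - 1)


def percentile(values, p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    req = {"arrival_time": 0.0, "first_token_time": 0.5,
           "finish_time": 2.5, "output_tokens": 11}
    print("TTFT:", ttft(req), "TPOT:", tpot(req))
```

Reporting P99 of TTFT and TPOT separately, per cell of the experiment matrix, shows whether tail latency comes from prefill queueing or from decode.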
5. NCCL all-reduce
Goal:
Show that you can use a microbenchmark to isolate communication problems.
Commands:
git clone https://github.com/NVIDIA/nccl-tests.git labs/nccl_tests/nccl-tests
cd labs/nccl_tests/nccl-tests
make MPI=0
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
What to look at:
| Output | Meaning |
|---|---|
| size | message size |
| time | collective time |
| algbw | algorithm bandwidth |
| busbw | bus bandwidth |
| NCCL_DEBUG | topology / transport / rank info |
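The relationship between algbw and busbw is worth being able to reproduce by hand. For all-reduce, nccl-tests derives busbw from algbw with the factor 2(n-1)/n, where n is the number of ranks. A minimal sketch, with hypothetical function names and time given in microseconds as in the tool's output:

```python
def algbw_gb_per_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: message size divided by collective time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gb_per_s(algbw: float, nranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (nranks - 1) / nranks


if __name__ == "__main__":
    bw = algbw_gb_per_s(128 * 1024 * 1024, 1000.0)
    for n in (1, 2, 8):
        print(f"ranks={n}: busbw={allreduce_busbw_gb_per_s(bw, n):.2f} GB/s")
```

Note that with a single rank the factor is zero, so a run with -g 1 on one GPU mostly validates the build and the NCCL_DEBUG output; interpreting busbw needs at least two GPUs.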
6. UCX/RDMA capability check
Goal:
Judge honestly what this machine can and cannot verify.
Commands:
ucx_info -v
ucx_info -d
nvidia-smi topo -m
ibv_devinfo || true
Interpretation:
| Result | Meaning |
|---|---|
| Only TCP/shared memory | Can learn the UCX tooling; cannot verify RDMA |
| RDMA device present | Can go further and run RDMA perftest |
| WSL2 sees the GPU | Does not imply GPUDirect RDMA support |
| No NVIDIA NIC/DOCA | Cannot verify GPUNetIO |
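The first two rows of this table can be checked mechanically from captured ucx_info -d output. The sketch below is a rough heuristic under two assumptions that should be verified on the actual machine: that the output contains lines of the form "Transport: <name>", and that RDMA-capable transports use names starting with rc_, dc_, or ud_. The function names are hypothetical.

```python
import re

# Assumption: InfiniBand/RoCE transports are named like rc_verbs, rc_mlx5,
# dc_mlx5, ud_verbs. Adjust after inspecting real output.
RDMA_PREFIXES = ("rc_", "dc_", "ud_")


def list_transports(ucx_info_output: str) -> set:
    """Collect transport names from text captured from `ucx_info -d`."""
    return set(re.findall(r"Transport:\s*(\S+)", ucx_info_output))


def can_test_rdma(transports) -> bool:
    """True if any listed transport looks RDMA-capable by name."""
    return any(t.startswith(RDMA_PREFIXES) for t in transports)


if __name__ == "__main__":
    # Illustrative text only, not real output from any machine.
    sample = "#  Transport: tcp\n#  Device: eth0\n#  Transport: posix\n"
    found = list_transports(sample)
    print(sorted(found), "RDMA testable:", can_test_rdma(found))
```

A True result only means an RDMA perftest is worth attempting; it says nothing about GPUDirect RDMA or GPUNetIO, which the last two rows cover.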
7. NIXL KV transfer design lab
Even without a local NIXL environment, you should be able to design the experiment:
flowchart TB
Baseline[Baseline serving] --> Measure[Measure TTFT/TPOT/P99]
Measure --> Disagg[Prefill/Decode Disaggregation]
Disagg --> KV[KV Blocks Produced]
KV --> Transfer[Transfer via NIXL-like path]
Transfer --> Decode[Decode Worker]
Decode --> E2E[End-to-end Metrics]
E2E --> Compare[Compare with baseline]
What to look at:
- transfer latency
- decode start delay
- GPU idle time
- TTFT/P99 impact
- memory usage
- failure/fallback behavior
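A back-of-envelope estimate of transfer latency helps decide whether the design is worth measuring at all. The sketch below sizes the KV cache for a prompt and divides by link bandwidth. The model shape in the example (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) and the 25 GB/s link are illustrative assumptions, and the function names are hypothetical rather than part of NIXL.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_latency_s(num_bytes: int, bandwidth_gb_per_s: float,
                       fixed_overhead_s: float = 0.0) -> float:
    """Idealized transfer time: fixed setup cost plus bytes over bandwidth."""
    return fixed_overhead_s + num_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    size = kv_cache_bytes(num_tokens=2048, num_layers=32,
                          num_kv_heads=8, head_dim=128)
    print("KV bytes:", size)
    print("transfer (s) at 25 GB/s:", transfer_latency_s(size, 25.0))
```

If the estimated transfer time is small relative to the measured baseline TTFT, the decode start delay is more likely to come from scheduling or handoff than from the link itself, which is what the end-to-end comparison should confirm or refute.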
8. GPUNetIO design lab
Even without a local DOCA/NVIDIA NIC setup, you should be able to draw:
flowchart TB
Packet[Network Packet] --> CPU[CPU Networking Stack]
CPU --> HostBuf[Host Buffer]
HostBuf --> GPUCopy[Copy to GPU]
GPUCopy --> GPU[GPU Processing]
Packet --> RNIC[RNIC/DOCA]
RNIC --> GPUNetIO[GPUNetIO]
GPUNetIO --> GPU2[GPU Processing]
CPU --> Cost1[CPU overhead]
GPUCopy --> Cost2[Copy latency]
GPUNetIO --> Risk[Complexity/debug risk]
Interview conclusion:
I would only consider GPUNetIO if CPU-mediated networking and CPU-GPU copy are on the critical path and the GPU directly consumes the network data. Otherwise the added complexity may not be justified.
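The "critical path" condition in this conclusion can be quantified with an Amdahl-style bound. The sketch below assumes the three stages run serially with no overlap, which is a simplification, and uses hypothetical function names and example timings.

```python
def cpu_path_fraction(cpu_stack_s: float, copy_s: float, gpu_s: float) -> float:
    """Share of per-packet latency spent in CPU networking plus CPU-GPU copy."""
    total = cpu_stack_s + copy_s + gpu_s
    return (cpu_stack_s + copy_s) / total


def max_speedup(fraction_removed: float) -> float:
    """Upper bound on speedup if that share were eliminated entirely."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


if __name__ == "__main__":
    # Illustrative timings in seconds, not measurements.
    f = cpu_path_fraction(cpu_stack_s=20e-6, copy_s=30e-6, gpu_s=50e-6)
    print("fraction:", f, "best-case speedup:", max_speedup(f))
```

When the bound is close to 1x, the CPU path is not the bottleneck and the complexity and debugging risk shown in the diagram outweigh the gain.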