Sequence 07 - Hands-on Labs and Verification Evidence

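The labs are not for grinding practice problems; they exist so you can answer interview follow-up questions with evidence.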

1. Lab overview

flowchart TB
    Labs[Practical Labs] --> CUDA[CUDA/Nsight Labs]
    Labs --> VLLM[vLLM Serving Labs]
    Labs --> NCCL[NCCL Labs]
    Labs --> UCX[UCX/RDMA Labs]
    Labs --> Design[Design-only Labs]

    CUDA --> C1[Memory coalescing]
    CUDA --> C2[Pinned memory copy]
    CUDA --> C3[Nsight workflow]

    VLLM --> V1[Input length vs TTFT]
    VLLM --> V2[Output length vs TPOT]
    VLLM --> V3[Request rate vs P99]

    NCCL --> N1[all_reduce_perf]
    NCCL --> N2[algbw/busbw interpretation]

    UCX --> U1[ucx_info]
    UCX --> U2[transport/device boundary]

    Design --> D1[NIXL KV transfer design]
    Design --> D2[GPUNetIO path design]

2. CUDA memory coalescing

Goal:

Show that you understand how the GPU memory access pattern affects performance.

Experiment:

kernel A: out[i] = in[i]
kernel B: out[i] = in[(i * stride) % n]

What to look at:

Metric	Meaning
kernel time	which version is slower
effective bandwidth	whether memory access is efficient
Nsight Compute memory metrics	memory transactions / load efficiency
warp stalls	whether stalls come from memory dependencies
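The effective bandwidth row can be derived from the kernel time by hand. The sketch below assumes a copy-style kernel that reads each element once and writes it once; the function name and the example numbers are illustrative placeholders, not measurements.

```python
def effective_bandwidth_gbps(num_elements: int, bytes_per_element: int,
                             kernel_time_s: float) -> float:
    """Effective bandwidth in GB/s for a kernel that reads and writes each element once."""
    if kernel_time_s <= 0:
        raise ValueError("kernel_time_s must be positive")
    # One read of in[...] plus one write of out[i] per element.
    bytes_moved = 2 * num_elements * bytes_per_element
    return bytes_moved / kernel_time_s / 1e9


# Placeholder input: 2**26 float32 elements, hypothetical 1 ms kernel time.
example_gbps = effective_bandwidth_gbps(2**26, 4, 1e-3)
print(f"{example_gbps:.1f} GB/s")
```

Comparing this number for kernel A and kernel B against the device's peak memory bandwidth gives a quick sanity check before opening Nsight Compute.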

Interview answer:

Coalesced access lets adjacent threads access adjacent memory, which allows the GPU to combine memory transactions. Strided or scattered access increases transactions and lowers effective bandwidth.
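To make the transaction argument concrete, the following sketch counts how many distinct memory sectors one warp touches under the index patterns of kernel A and kernel B. It is a host-side model, not a GPU measurement; the 32-thread warp and the 32-byte sector size are assumptions that match common NVIDIA hardware, and `sectors_touched` is a hypothetical helper.

```python
WARP_SIZE = 32      # threads per warp (assumed)
SECTOR_BYTES = 32   # memory sector granularity in bytes (assumed)


def sectors_touched(stride: int, n: int, elem_bytes: int = 4,
                    warp_start: int = 0) -> int:
    """Count distinct sectors read by one warp for in[(i * stride) % n]."""
    sectors = set()
    for lane in range(WARP_SIZE):
        i = warp_start + lane
        idx = (i * stride) % n          # stride=1 reproduces kernel A
        sectors.add(idx * elem_bytes // SECTOR_BYTES)
    return len(sectors)


n = 1 << 20
for stride in (1, 2, 8, 32):
    print(stride, sectors_touched(stride, n))
```

With 4-byte elements, stride 1 needs 4 sectors per warp, while stride 8 and above needs 32, one per thread. In this model that is an 8x increase in sectors fetched for the same amount of useful data.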

3. Pinned memory copy

Goal:

Show that you understand CPU-GPU data movement and can connect it to GPUDirect/RDMA/NIXL.

Experiment:

pageable host memory + H2D/D2H
pinned host memory + H2D/D2H
optional: pinned + async copy + stream overlap

What to look at:

Metric	Meaning
H2D bandwidth	host-to-device copy efficiency
D2H bandwidth	device-to-host copy efficiency
timeline gap	whether there is synchronization or CPU-side waiting
overlap	whether copy and kernel execution overlap
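One possible way to run the pageable versus pinned comparison is sketched below. It assumes PyTorch with CUDA support is installed, which the notes do not state; `measure_h2d_gbps` and `bandwidth_gbps` are hypothetical helper names. The `torch` import is kept inside the function so the pure helper stays usable on a machine without a GPU.

```python
def bandwidth_gbps(num_bytes: int, elapsed_ms: float) -> float:
    """Convert bytes moved and elapsed milliseconds into GB/s."""
    return num_bytes / (elapsed_ms / 1e3) / 1e9


def measure_h2d_gbps(num_bytes: int = 256 * 1024 * 1024,
                     pinned: bool = True, repeats: int = 10) -> float:
    """Time repeated host-to-device copies with CUDA events (assumes PyTorch + CUDA)."""
    import torch  # assumption: not part of the original lab description

    host = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=pinned)
    device = torch.empty(num_bytes, dtype=torch.uint8, device="cuda")

    device.copy_(host)            # warm-up copy, excluded from timing
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeats):
        # non_blocking only yields a truly asynchronous copy for pinned memory
        device.copy_(host, non_blocking=True)
    end.record()
    torch.cuda.synchronize()      # wait until the end event has completed

    return bandwidth_gbps(num_bytes * repeats, start.elapsed_time(end))
```

On a GPU machine, calling `measure_h2d_gbps(pinned=False)` and `measure_h2d_gbps(pinned=True)` gives the two numbers to compare. Swapping the source and destination tensors gives the D2H variant.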

4. vLLM benchmark

Goal:

Map serving metrics to runtime stages.

Experiment matrix:

Variable	Values
input length (tokens)	128 / 512 / 2048
output length (tokens)	64 / 128 / 512
request rate (requests/s)	1 / 4 / 16
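A small sketch of expanding this matrix into individual benchmark runs follows. The dictionary keys are hypothetical names chosen for illustration and are not flags of any particular benchmark tool; the request rate is assumed to be in requests per second.

```python
from itertools import product

INPUT_LENGTHS = [128, 512, 2048]   # tokens
OUTPUT_LENGTHS = [64, 128, 512]    # tokens
REQUEST_RATES = [1, 4, 16]         # assumed unit: requests per second


def build_matrix() -> list:
    """Return one config dict per (input length, output length, request rate) combination."""
    return [
        {"input_len": i, "output_len": o, "request_rate": r}
        for i, o, r in product(INPUT_LENGTHS, OUTPUT_LENGTHS, REQUEST_RATES)
    ]


configs = build_matrix()
print(len(configs), "runs")
```

The full grid is 27 runs. If time is short, a reasonable reduction is to vary one variable at a time around a fixed midpoint, which isolates each metric's dependence on a single factor.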

What to look at:

Metric	Corresponding stage
TTFT	queueing / scheduling / prefill
TPOT	decode
P99	saturation / batching / tail
GPU memory	KV cache pressure
GPU utilization	batching / compute efficiency
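If only raw per-token timestamps are available, TTFT, TPOT, and P99 can be computed as in the sketch below. It uses the common definitions: TTFT is the time from sending the request to the first token, TPOT is the mean gap between subsequent tokens, and the percentile uses the nearest-rank method. The function names and sample timestamps are made up for illustration.

```python
import math


def ttft(send_time: float, token_times: list) -> float:
    """Time to first token: first token arrival minus request send time."""
    return token_times[0] - send_time


def tpot(token_times: list) -> float:
    """Time per output token: mean gap between tokens after the first one."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, pct: int):
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative timestamps in seconds: request sent at 0.0, four tokens received.
tokens = [0.5, 0.6, 0.7, 0.8]
print(ttft(0.0, tokens), tpot(tokens))
```

Keeping TTFT and TPOT separate per request is what lets the table's mapping be tested: longer inputs should mainly move TTFT, and longer outputs should mainly move total decode time.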

5. NCCL all-reduce

Goal:

Show that you can use a microbenchmark to isolate communication problems.

Commands:

git clone https://github.com/NVIDIA/nccl-tests.git labs/nccl_tests/nccl-tests
cd labs/nccl_tests/nccl-tests
make MPI=0
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

What to look at:

Output	Meaning
size	message size
time	collective time
algbw	algorithm bandwidth
busbw	bus bandwidth
NCCL_DEBUG	topology / transport / rank info
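The relationship between algbw and busbw for all-reduce can be written down directly. The sketch below follows the convention described in the nccl-tests performance notes: algbw is size divided by time, and busbw scales it by 2(n-1)/n for n ranks. The input numbers are placeholders, not benchmark output.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: message size divided by collective time, in GB/s."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# Placeholder: a 128 MiB message with a hypothetical 1000 us collective time.
alg = algbw_gbps(128 * 1024 * 1024, 1000.0)
print(alg, allreduce_busbw_gbps(alg, 8), allreduce_busbw_gbps(alg, 1))
```

The factor is zero for a single rank, so a run with one GPU is a smoke test of the build and of the NCCL_DEBUG output rather than a bandwidth measurement.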

6. UCX/RDMA capability check

Goal:

Judge honestly what this machine can and cannot verify.

Commands:

ucx_info -v
ucx_info -d
nvidia-smi topo -m
ibv_devinfo || true

Interpretation:

Result	Meaning
only TCP/shared memory	can learn the UCX tooling, cannot verify RDMA
RDMA device present	can go on to run RDMA perftest
GPU visible in WSL2	does not imply GPUDirect RDMA support
no NVIDIA NIC/DOCA	cannot verify GPUNetIO
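The first two rows of the table can be automated with a rough classifier over the transport names that `ucx_info -d` reports. This is a sketch under assumptions: the `Transport:` line pattern and the sample text are written by hand to resemble typical output rather than captured from a real run, and the transport name sets cover common UCX transports without claiming to be exhaustive.

```python
import re

# Common UCX transport names (assumed, not exhaustive).
RDMA_TRANSPORTS = {"rc_verbs", "rc_mlx5", "dc_mlx5", "ud_verbs", "ud_mlx5"}
SHM_TRANSPORTS = {"posix", "sysv", "cma", "knem", "xpmem"}


def parse_transports(ucx_info_output: str) -> set:
    """Collect transport names from lines that look like 'Transport: <name>'."""
    return set(re.findall(r"Transport:\s*(\S+)", ucx_info_output))


def can_verify_rdma(transports: set) -> bool:
    """True if at least one RDMA-capable transport is listed."""
    return bool(transports & RDMA_TRANSPORTS)


# Hand-written sample resembling a machine with only TCP and shared memory.
SAMPLE = """
#      Transport: tcp
#         Device: eth0
#      Transport: posix
#         Device: memory
"""
found = parse_transports(SAMPLE)
print(sorted(found), can_verify_rdma(found))
```

A `False` result here corresponds to the first row: the machine is still useful for learning the tooling, but any RDMA claim has to be labeled as unverified.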

7. NIXL KV transfer design lab

Even without a local NIXL environment, you should be able to design the experiment:

flowchart TB
    Baseline[Baseline serving] --> Measure[Measure TTFT/TPOT/P99]
    Measure --> Disagg[Prefill/Decode Disaggregation]
    Disagg --> KV[KV Blocks Produced]
    KV --> Transfer[Transfer via NIXL-like path]
    Transfer --> Decode[Decode Worker]
    Decode --> E2E[End-to-end Metrics]
    E2E --> Compare[Compare with baseline]

What to measure:

transfer latency
decode start delay
GPU idle time
TTFT/P99 impact
memory usage
failure/fallback behavior
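Before any NIXL environment exists, the transfer latency item can be bounded with a back-of-envelope estimate. The sketch below uses the standard KV cache size formula (keys and values, per layer, per KV head, per token) and a simple size-over-bandwidth model. The model shape and link bandwidth are hypothetical example values, and no NIXL API is used.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * element size."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_time_ms(num_bytes: int, bandwidth_gb_per_s: float,
                     fixed_latency_ms: float = 0.0) -> float:
    """Lower-bound transfer time: fixed latency plus size divided by bandwidth."""
    return fixed_latency_ms + num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 2048-token prompt.
kv_bytes = kv_cache_bytes(32, 8, 128, 2048)
# Hypothetical link: 25 GB/s, roughly a 200 Gb/s NIC at line rate.
print(kv_bytes / 2**20, "MiB", transfer_time_ms(kv_bytes, 25.0), "ms")
```

If the estimated transfer time is small relative to the measured prefill time, disaggregation is plausible on paper. If it is comparable, the decode start delay and TTFT impact become the first things to measure.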

8. GPUNetIO design lab

Even without a local DOCA/NVIDIA NIC setup, you should be able to draw:

flowchart TB
    Packet[Network Packet] --> CPU[CPU Networking Stack]
    CPU --> HostBuf[Host Buffer]
    HostBuf --> GPUCopy[Copy to GPU]
    GPUCopy --> GPU[GPU Processing]

    Packet --> RNIC[RNIC/DOCA]
    RNIC --> GPUNetIO[GPUNetIO]
    GPUNetIO --> GPU2[GPU Processing]

    CPU --> Cost1[CPU overhead]
    GPUCopy --> Cost2[Copy latency]
    GPUNetIO --> Risk[Complexity/debug risk]

Interview conclusion:

I would only consider GPUNetIO if CPU-mediated networking and CPU-GPU copy are on the critical path and the GPU directly consumes the network data. Otherwise the added complexity may not be justified.