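Sequence 04 - JD Technical Deep-Dive Study
This file covers only the technologies the JD is likely to probe in depth. Each one is studied in the same order: what it is, why it matters, how to debug it, and how it connects to the CV.
1. Overall technology stack diagram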
flowchart TB
Workload[AI Workloads] --> Serving[Inference / Model Serving]
Workload --> Training[Training / Parallelism]
Serving --> Prefill[Prefill]
Serving --> Decode[Decode]
Serving --> KV[KV Cache]
Serving --> Runtime[Runtime Systems]
Runtime --> Dynamo[Dynamo]
Runtime --> NIXL[NIXL]
Training --> NCCL[NCCL]
Training --> Parallel[DP / TP / PP / FSDP]
Runtime --> Comm[Communication Libraries]
Comm --> UCX[UCX]
Comm --> MPI[MPI]
Comm --> GPUNetIO[GPUNetIO]
UCX --> RDMA[RDMA / RoCE]
RDMA --> GDR[GPUDirect RDMA]
Serving --> CUDA[CUDA / GPU Programming]
CUDA --> Nsight[Nsight Systems / Compute]
2. AI inference / model serving
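What it is:
Turning a model into an online service. It is not a single Python function call but a runtime system made up of queueing, scheduling, prefill, decode, the KV cache, streaming, and metrics.
Key mechanisms:
| Mechanism | Explanation | Common bottleneck |
|---|---|---|
| Queueing | Time a request waits after entering the system | Worse P99 and TTFT |
| Scheduling | Decides which requests execute together | fairness, batching, tail latency |
| Prefill | Processes the input prompt and builds the KV cache | Long prompts, TTFT |
| Decode | Generates output one token at a time | TPOT, KV cache reads, collectives |
| KV cache | Attention state (keys and values) kept for earlier tokens | GPU memory pressure, state movement |
| Continuous batching | Requests are batched dynamically while others are still running | Higher throughput, at the risk of worse P99 |
Interview deep-dive diagram: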
flowchart LR
Req[Request] --> Q[Queue]
Q --> S[Scheduler]
S --> P[Prefill]
P --> K[KV Cache]
K --> D[Decode Loop]
D --> Out[Streaming Output]
Q --> TTFT[TTFT]
P --> TTFT
D --> TPOT[TPOT]
Q --> P99[P99]
S --> P99
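The diagram ties TTFT to the queue and prefill stages and TPOT to the decode loop. A minimal sketch of how these metrics could be computed from per-request timestamps is shown here. The `RequestTrace` schema and the nearest-rank percentile are assumptions made for illustration, not the format of any particular serving framework.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps for one request, in seconds (hypothetical schema)."""
    arrival: float            # request entered the queue
    token_times: List[float]  # emission time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: covers queueing, scheduling, and prefill."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token across the decode loop (after the first token)."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With a list of traces, `percentile([ttft(t) for t in traces], 99)` gives P99 TTFT. Reporting the tail separately from the mean matters because batching decisions can improve throughput while making the tail worse.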
3. NIXL
What it is:
NIXL is a library for inference data and state movement. Its focus is not collectives but moving state and data efficiently inside an inference runtime, for example the KV cache or the state transfer in prefill/decode disaggregation.
How it differs from NCCL:
| Dimension | NIXL | NCCL |
|---|---|---|
| Main scenario | inference state/data movement | distributed collective tensor communication |
| Typical objects | KV cache, request state, buffers | gradients, activation shards, tensor partitions |
| Operation type | point-to-point / transfer abstraction | all-reduce/all-gather/reduce-scatter |
| Metrics | transfer latency, overlap, impact on TTFT/TPOT | algbw, busbw, collective time |
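The NCCL column lists algbw and busbw as metrics. The sketch below shows how the two relate, using the correction factors described in the nccl-tests performance notes: 2(n-1)/n for all-reduce and (n-1)/n for all-gather and reduce-scatter. The function names are illustrative, and `message_bytes` is assumed to follow the nccl-tests size convention.

```python
def bus_factor(collective: str, n_ranks: int) -> float:
    """Factor that converts algorithm bandwidth to bus bandwidth."""
    if n_ranks < 2:
        raise ValueError("a collective needs at least 2 ranks")
    if collective == "all_reduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("all_gather", "reduce_scatter"):
        return (n_ranks - 1) / n_ranks
    raise ValueError(f"no factor defined here for {collective!r}")


def algbw_gb_per_s(message_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: data size divided by elapsed time, in GB/s."""
    return message_bytes / seconds / 1e9


def busbw_gb_per_s(message_bytes: float, seconds: float,
                   collective: str, n_ranks: int) -> float:
    """Bus bandwidth: algbw scaled so it can be compared with link speed."""
    return algbw_gb_per_s(message_bytes, seconds) * bus_factor(collective, n_ranks)
```

busbw is the number to compare against hardware link bandwidth, because it removes the dependence on rank count that algbw carries.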
flowchart TB
PrefillWorker[Prefill Worker] --> KVBlock[KV Blocks]
KVBlock --> NIXL[NIXL Transfer]
NIXL --> DecodeWorker[Decode Worker]
DecodeWorker --> Decode[Decode Tokens]
NIXL --> Metric1[Transfer latency]
NIXL --> Metric2[Overlap with compute]
NIXL --> Metric3[TTFT impact]
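To reason about the three metrics in the diagram, it helps to estimate how much KV cache has to move and how much of that transfer is hidden behind compute. This is a back-of-the-envelope sketch only: the model shape, bandwidth, and overlap window are hypothetical inputs, and nothing here calls the NIXL API.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence; the factor 2 covers K and V per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def transfer_seconds(n_bytes: float, bandwidth_bytes_per_s: float,
                     base_latency_s: float = 0.0) -> float:
    """Simple transfer estimate: fixed latency plus size over bandwidth."""
    return base_latency_s + n_bytes / bandwidth_bytes_per_s


def exposed_seconds(transfer_s: float, overlappable_compute_s: float) -> float:
    """Transfer time that is NOT hidden behind compute and so adds to TTFT."""
    return max(0.0, transfer_s - overlappable_compute_s)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 4096-token prompt.
    size = kv_cache_bytes(32, 8, 128, 4096)
    t = transfer_seconds(size, bandwidth_bytes_per_s=50e9)
    print(f"{size / 2**20:.0f} MiB, {t * 1e3:.2f} ms on an assumed 50 GB/s path")
```

Only the exposed portion affects TTFT, which is why overlap with compute is tracked alongside raw transfer latency.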
4. UCX / RDMA / GPUDirect
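What UCX is:
UCX is a communication transport abstraction. It is not one particular network; underneath, it can select among TCP, shared memory, RDMA, CUDA-aware paths, and others.
What RDMA is:
RDMA lets the RNIC access registered memory on the remote side directly, reducing CPU and kernel involvement.
What GPUDirect RDMA is:
With hardware and driver support, the RNIC can access GPU memory directly, reducing staging through the CPU.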
flowchart TB
subgraph TCP[CPU-staged TCP path]
G1[GPU Memory] --> H1[Host Staging]
H1 --> CPU[CPU / Kernel Network Stack]
CPU --> NIC[NIC]
NIC --> NET[Network]
end
subgraph RDMA[GPUDirect RDMA path]
G2[GPU Memory] --> RNIC[RNIC]
RNIC --> RNET[RDMA Network]
RNET --> Remote[Remote GPU/Host Memory]
end
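What to check when it is slow:
| Layer | What to check |
|---|---|
| transport | Whether traffic actually uses RDMA or has fallen back to TCP |
| memory | Whether buffers are pageable host memory, pinned host memory, or GPU memory |
| registration | Whether MRs are registered repeatedly, and whether the registration cache hits |
| topology | Whether the GPU and NIC are close to each other in PCIe/NUMA terms |
| message size | Latency for small messages, bandwidth for large messages |
| progress | polling, completion, progress thread |
| fabric | RoCE PFC/ECN/QoS, IB counters, congestion |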
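The message size row can be made concrete with the common latency-bandwidth (alpha-beta) cost model, T(n) = alpha + n / beta. This is a rough sketch. The model is a standard approximation, not something UCX reports, and the alpha and beta values in the usage note are made-up placeholders to be replaced with measured numbers.

```python
def transfer_time_s(n_bytes: float, alpha_s: float, beta_bytes_per_s: float) -> float:
    """Alpha-beta model: fixed per-message latency plus size over peak bandwidth."""
    return alpha_s + n_bytes / beta_bytes_per_s


def effective_bandwidth(n_bytes: float, alpha_s: float, beta_bytes_per_s: float) -> float:
    """Achieved bytes per second for a message of n_bytes under the model."""
    return n_bytes / transfer_time_s(n_bytes, alpha_s, beta_bytes_per_s)


def half_bandwidth_size(alpha_s: float, beta_bytes_per_s: float) -> float:
    """Message size at which effective bandwidth reaches half of peak."""
    return alpha_s * beta_bytes_per_s
```

With an assumed alpha of 2 microseconds and beta of 25 GB/s, the half-bandwidth point is about 50 KB. Messages well below that are latency-dominated, so a bandwidth benchmark says little about them, and the reverse holds for large messages.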
5. GPUNetIO
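What it is:
GPUNetIO is the GPU-centric networking capability in NVIDIA DOCA; it lets the GPU take part more directly in the network packet/data path.
GPUNetIO versus GPUDirect RDMA:
| Technology | Focus |
|---|---|
| GPUDirect RDMA | A data-path capability: the NIC accesses GPU memory directly. |
| GPUNetIO | A programming model and runtime in which the GPU takes part in network data processing and the packet path. |
When it is worth using: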
flowchart TB
Data[Network Data] --> CPUPath[CPU handles packet/data path]
CPUPath --> Copy[Copy/Sync to GPU]
Copy --> GPU[GPU consumes data]
Data --> GPUNetIO[GPUNetIO GPU-centric path]
GPUNetIO --> GPU2[GPU consumes/processes data]
CPUPath --> Bottleneck[CPU overhead / copy / latency bottleneck]
Bottleneck --> Need[Consider GPUNetIO]
Do not overclaim:
Not every networking scenario should use GPUNetIO.
If the CPU path is not on the critical path, adopting GPUNetIO may only add complexity.
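One way to put a number on the caution above is Amdahl's law: if the CPU packet/data path is only a small fraction of end-to-end latency, even removing it entirely yields little. The sketch below is a generic estimate under that assumption. The fraction and speedup are inputs you would measure, and nothing here uses the DOCA API.

```python
def end_to_end_speedup(cpu_path_fraction: float, cpu_path_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the CPU path gets faster."""
    if not 0.0 <= cpu_path_fraction <= 1.0:
        raise ValueError("cpu_path_fraction must be within [0, 1]")
    if cpu_path_speedup <= 0.0:
        raise ValueError("cpu_path_speedup must be positive")
    return 1.0 / ((1.0 - cpu_path_fraction) + cpu_path_fraction / cpu_path_speedup)


def speedup_upper_bound(cpu_path_fraction: float) -> float:
    """Best case: the CPU path cost drops to zero."""
    if cpu_path_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - cpu_path_fraction)
```

For example, if the CPU path accounts for 5% of critical-path time, the upper bound is roughly 1.05x, which is unlikely to justify a new programming model on its own.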
6. CUDA / Nsight / performance
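Must know:
| Concept | Interview explanation |
|---|---|
| warp | The group of threads that executes together on an NVIDIA GPU, typically 32 threads. |
| coalescing | Adjacent threads access contiguous addresses, reducing the number of memory transactions. |
| shared memory | Low-latency on-chip memory shared within a block, used for tiling and data reuse. |
| pinned memory | Page-locked host memory, suited to DMA and async copies. |
| occupancy | How fully an SM is populated with active warps/blocks; higher is not always better. |
| memory-bound | Limited by memory bandwidth or latency. |
| compute-bound | Limited by the arithmetic pipelines or tensor cores. |
| Nsight Systems | Shows the overall timeline. |
| Nsight Compute | Shows per-kernel detail. |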
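The memory-bound and compute-bound rows can be illustrated with the roofline model: a kernel whose arithmetic intensity (FLOPs per byte moved) falls below the hardware ridge point is limited by memory bandwidth. This is a simplified sketch, and the peak numbers used in the usage note are hypothetical, not the specification of any real GPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from memory."""
    return flops / bytes_moved


def ridge_point(peak_flops: float, peak_bytes_per_s: float) -> float:
    """Intensity at which the memory roof meets the compute roof."""
    return peak_flops / peak_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float, peak_bytes_per_s: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bytes_per_s)


def classify(intensity: float, peak_flops: float, peak_bytes_per_s: float) -> str:
    """Label a kernel by which roof limits it under the model."""
    if intensity < ridge_point(peak_flops, peak_bytes_per_s):
        return "memory-bound"
    return "compute-bound"
```

An fp32 vector add does 1 FLOP per 12 bytes (two 4-byte loads and one 4-byte store). With an assumed 100 TFLOP/s peak and 2 TB/s of bandwidth, the ridge is 50 FLOP/byte, so `classify(1 / 12, 100e12, 2e12)` returns `"memory-bound"`. Raising occupancy would not help such a kernel much; reducing bytes moved would.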
flowchart LR
Slow[Program Slow] --> Systems[Nsight Systems]
Systems --> Timeline[CPU/GPU timeline]
Timeline --> Kernel{Kernel dominates?}
Kernel -->|Yes| Compute[Nsight Compute]
Kernel -->|No| API[Check CPU/API/copy/sync/queueing]
Compute --> Metrics[occupancy/memory/stalls/instructions]
Metrics --> Fix[coalescing/tiling/copy overlap/kernel redesign]
7. System architecture / prototype
An architect interview will not stop at definitions; it will ask how you decide whether an optimization is worth doing.
Standard method:
flowchart TB
Problem[Problem] --> Hypothesis[Hypothesis]
Hypothesis --> Baseline[Baseline Metrics]
Baseline --> Micro[Microbenchmark]
Micro --> Prototype[Small Prototype]
Prototype --> E2E[End-to-end Validation]
E2E --> Decision[Roadmap Decision]
Decision --> Rollout[Rollout / Fallback / Observability]
Answer template:
I would start with a hypothesis and baseline metrics. Then I would isolate the communication or runtime pattern with a microbenchmark, build the smallest prototype, and finally validate it against end-to-end metrics such as TTFT, TPOT, P99 latency, throughput, memory usage, and operational complexity.
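The validation step in that answer can be sketched as a comparison of baseline and prototype metrics. Everything here is an illustrative assumption: the `Metrics` fields, the 10% minimum gain, and the 2% regression tolerance are placeholders, and operational complexity is deliberately left out because it does not reduce to a number.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """End-to-end measurements for one configuration (hypothetical schema)."""
    ttft_p99_ms: float
    tpot_p99_ms: float
    throughput_tok_s: float


def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change from baseline to candidate."""
    return (candidate - baseline) / baseline


def gains(baseline: Metrics, prototype: Metrics) -> dict:
    """Per-metric improvement, where positive always means better."""
    return {
        # Latency: lower is better, so flip the sign.
        "ttft_p99": -relative_change(baseline.ttft_p99_ms, prototype.ttft_p99_ms),
        "tpot_p99": -relative_change(baseline.tpot_p99_ms, prototype.tpot_p99_ms),
        # Throughput: higher is better.
        "throughput": relative_change(baseline.throughput_tok_s, prototype.throughput_tok_s),
    }


def worth_pursuing(baseline: Metrics, prototype: Metrics,
                   min_gain: float = 0.10, max_regression: float = 0.02) -> bool:
    """True if some metric improves enough and none regresses past tolerance."""
    g = gains(baseline, prototype)
    return max(g.values()) >= min_gain and min(g.values()) >= -max_regression
```

Requiring that no metric regresses beyond a tolerance keeps a throughput win from hiding a tail-latency loss, which is the trade-off the answer template asks you to check.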