Sequence 04 - JD Technical Deep-Dive Study

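This file covers only the technologies the JD is likely to probe in depth. Each one is studied in the same order: what it is, why it matters, how to troubleshoot it, and how it connects to the CV.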

1. Overall technology stack map

flowchart TB
    Workload[AI Workloads] --> Serving[Inference / Model Serving]
    Workload --> Training[Training / Parallelism]

    Serving --> Prefill[Prefill]
    Serving --> Decode[Decode]
    Serving --> KV[KV Cache]
    Serving --> Runtime[Runtime Systems]

    Runtime --> Dynamo[Dynamo]
    Runtime --> NIXL[NIXL]

    Training --> NCCL[NCCL]
    Training --> Parallel[DP / TP / PP / FSDP]

    Runtime --> Comm[Communication Libraries]
    Comm --> UCX[UCX]
    Comm --> MPI[MPI]
    Comm --> GPUNetIO[GPUNetIO]

    UCX --> RDMA[RDMA / RoCE]
    RDMA --> GDR[GPUDirect RDMA]

    Serving --> CUDA[CUDA / GPU Programming]
    CUDA --> Nsight[Nsight Systems / Compute]

2. AI inference / model serving

What it is:

Turning a model into an online service. It is not a single Python function call but a runtime system made up of queueing, scheduling, prefill, decode, KV cache, streaming, and metrics.

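Key mechanisms:

Mechanism	Explanation	Common bottleneck
Queueing	Time a request waits after entering the system	worse P99/TTFT
Scheduling	Decides which requests execute together	fairness, batching, tail latency
Prefill	Processes the input prompt and builds the KV cache	long prompts, TTFT
Decode	Generates output one token at a time	TPOT, KV cache reads, collectives
KV cache	Stored attention state (keys/values) for earlier tokens	GPU memory pressure, state movement
Continuous batching	Dynamic batching: requests join and leave the running batch between iterations	throughput improves, P99 is at risk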

Interview deep-dive diagram:

flowchart LR
    Req[Request] --> Q[Queue]
    Q --> S[Scheduler]
    S --> P[Prefill]
    P --> K[KV Cache]
    K --> D[Decode Loop]
    D --> Out[Streaming Output]

    Q --> TTFT[TTFT]
    P --> TTFT
    D --> TPOT[TPOT]
    Q --> P99[P99]
    S --> P99
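
The metrics in the diagram can be made concrete with a small sketch. It assumes each request is logged with an arrival time and one timestamp per emitted token; `RequestTrace`, `ttft`, `tpot`, and `percentile` are illustrative names, not part of any serving framework.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival: float            # time the request entered the queue (seconds)
    token_times: list[float]  # emission time of each output token (seconds)


def ttft(trace: RequestTrace) -> float:
    """Time to first token: includes queueing, scheduling, and prefill."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token after the first one (decode phase only)."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def percentile(values: list[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up traces, only to show how the functions are used.
traces = [
    RequestTrace(arrival=0.0, token_times=[0.20, 0.25, 0.30, 0.35]),
    RequestTrace(arrival=0.1, token_times=[0.90, 0.96, 1.02]),
]
ttft_values = [ttft(t) for t in traces]
tpot_values = [tpot(t) for t in traces]
p99_ttft = percentile(ttft_values, 99)
```

Keeping TTFT and TPOT separate matters because they point at different stages: a bad TTFT suggests queueing or prefill, while a bad TPOT suggests the decode loop.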

3. NIXL

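What it is:

NIXL is a library for inference data/state movement. Its focus is not collectives but moving state and data efficiently inside an inference runtime, for example KV cache blocks or the state transfer in prefill/decode disaggregation.

How it differs from NCCL: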

Dimension	NIXL	NCCL
Main scenario	inference state/data movement	distributed collective tensor communication
Typical objects	KV cache, request state, buffers	gradients, activation shards, tensor partitions
Operation type	point-to-point / transfer abstraction	all-reduce/all-gather/reduce-scatter
Metrics	transfer latency, overlap, impact on TTFT/TPOT	algbw, busbw, collective time
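
The algbw and busbw metrics in the NCCL column can be sketched as follows. The correction factors follow the convention used by the nccl-tests benchmarks as I understand it (for all-gather and reduce-scatter, the size is the total buffer across ranks); the function names are illustrative, not an NCCL API.

```python
def algbw_gb_per_s(size_bytes: int, seconds: float) -> float:
    """Algorithm bandwidth: data size divided by wall-clock time."""
    return size_bytes / seconds / 1e9


# Factor that converts algbw into busbw, as a function of the rank count n.
# Assumption: same convention as the nccl-tests performance notes.
BUS_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def busbw_gb_per_s(op: str, size_bytes: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth: algbw scaled so it can be compared with link speed."""
    return algbw_gb_per_s(size_bytes, seconds) * BUS_FACTOR[op](n_ranks)


# Made-up measurement: 1 GB all-reduce across 8 ranks in 100 ms.
example = busbw_gb_per_s("all_reduce", 1_000_000_000, 0.1, 8)
```

busbw is the number to compare against hardware link bandwidth, because algbw alone shrinks as the rank count grows even when the links are fully used.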

flowchart TB
    PrefillWorker[Prefill Worker] --> KVBlock[KV Blocks]
    KVBlock --> NIXL[NIXL Transfer]
    NIXL --> DecodeWorker[Decode Worker]
    DecodeWorker --> Decode[Decode Tokens]

    NIXL --> Metric1[Transfer latency]
    NIXL --> Metric2[Overlap with compute]
    NIXL --> Metric3[TTFT impact]
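
A back-of-the-envelope sketch of why this transfer shows up in TTFT. It uses the standard KV cache size formula plus a simplified transfer model; the model shape, link numbers, and function names are hypothetical, and nothing here is a NIXL API call.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int) -> int:
    """Size of the K and V tensors for one sequence (the factor 2 is K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: float,
                     latency_s: float = 0.0) -> float:
    """Simplified model: fixed latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def exposed_seconds(transfer_s: float, overlappable_compute_s: float) -> float:
    """Transfer time not hidden behind compute (assumes ideal overlap)."""
    return max(0.0, transfer_s - overlappable_compute_s)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 4096-token prompt, fp16.
size = kv_cache_bytes(32, 8, 128, 4096, 2)   # 512 MiB
# Hypothetical link: 50 GB/s effective bandwidth.
t_transfer = transfer_seconds(size, 50e9)
t_exposed = exposed_seconds(t_transfer, overlappable_compute_s=0.004)
```

Only the exposed part lands on the critical path, which is why overlap with compute is tracked as its own metric next to raw transfer latency.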

4. UCX / RDMA / GPUDirect

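What UCX is:

UCX is a communication framework that abstracts transports. It is not one specific network; underneath, it can select TCP, shared memory, RDMA, CUDA-aware paths, and others.

What RDMA is:

RDMA lets an RNIC access registered memory on a remote node directly, reducing CPU and kernel involvement.

What GPUDirect RDMA is:

With hardware and driver support, the RNIC can access GPU memory directly, reducing CPU staging through host memory.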

flowchart TB
    subgraph TCP[CPU-staged TCP path]
        G1[GPU Memory] --> H1[Host Staging]
        H1 --> CPU[CPU / Kernel Network Stack]
        CPU --> NIC[NIC]
        NIC --> NET[Network]
    end

    subgraph RDMA[GPUDirect RDMA path]
        G2[GPU Memory] --> RNIC[RNIC]
        RNIC --> RNET[RDMA Network]
        RNET --> Remote[Remote GPU/Host Memory]
    end

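How to troubleshoot when it is slow:

Layer	What to check
transport	Whether traffic actually uses RDMA or has fallen back to TCP
memory	Buffer type: pageable host memory, pinned host memory, or GPU memory
registration	Whether MRs are registered repeatedly; whether the registration cache hits
topology	Whether the GPU and NIC are close to each other on PCIe/NUMA
message size	Small messages are latency-bound; large messages are bandwidth-bound
progress	polling, completion, progress thread
fabric	RoCE PFC/ECN/QoS, IB counters, congestion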
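
The message size row can be illustrated with the common latency-bandwidth (alpha-beta) cost model. This is a simplified sketch; `alpha` and `beta` below are hypothetical values, not measurements of any real fabric.

```python
def transfer_time(n_bytes: float, alpha_s: float, beta_bytes_per_s: float) -> float:
    """Alpha-beta model: fixed per-message latency plus size over peak bandwidth."""
    return alpha_s + n_bytes / beta_bytes_per_s


def effective_bandwidth(n_bytes: float, alpha_s: float, beta_bytes_per_s: float) -> float:
    """Bandwidth actually achieved for a message of n_bytes."""
    return n_bytes / transfer_time(n_bytes, alpha_s, beta_bytes_per_s)


def half_bandwidth_size(alpha_s: float, beta_bytes_per_s: float) -> float:
    """Message size at which effective bandwidth reaches half of peak."""
    return alpha_s * beta_bytes_per_s


alpha = 2e-6   # hypothetical 2 microsecond per-message latency
beta = 25e9    # hypothetical 25 GB/s peak bandwidth

n_half = half_bandwidth_size(alpha, beta)        # about 50 KB
small = effective_bandwidth(4096, alpha, beta)   # latency-dominated
large = effective_bandwidth(64 * 2**20, alpha, beta)  # close to peak
```

Messages well below the half-bandwidth size are dominated by `alpha`, so batching or reducing per-message overhead helps; messages well above it are dominated by `beta`, so the transport path and fabric matter more.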

5. GPUNetIO

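What it is:

GPUNetIO is the GPU-centric networking capability in NVIDIA DOCA. It lets the GPU take a more direct part in the network packet/data path.

How it relates to GPUDirect RDMA:

Technology	Focus
GPUDirect RDMA	A data-path capability: the NIC accesses GPU memory directly.
GPUNetIO	A programming model and runtime in which the GPU takes part in network data processing and the packet path.

When it is worth using: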

flowchart TB
    Data[Network Data] --> CPUPath[CPU handles packet/data path]
    CPUPath --> Copy[Copy/Sync to GPU]
    Copy --> GPU[GPU consumes data]

    Data --> GPUNetIO[GPUNetIO GPU-centric path]
    GPUNetIO --> GPU2[GPU consumes/processes data]

    CPUPath --> Bottleneck[CPU overhead / copy / latency bottleneck]
    Bottleneck --> Need[Consider GPUNetIO]

Don't overclaim:

Not every networking scenario should use GPUNetIO.
If the CPU path is not on the critical path, adopting GPUNetIO may only add complexity.

6. CUDA / Nsight / performance

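Must know:

Concept	Interview explanation
warp	A group of 32 threads on NVIDIA GPUs that is scheduled and executed together.
coalescing	Adjacent threads access contiguous addresses, which reduces the number of memory transactions.
shared memory	Low-latency on-chip memory shared within a block; used for tiling and data reuse.
pinned memory	Page-locked host memory; suited to DMA and async copies.
occupancy	How many warps/blocks are active on an SM relative to its limit; higher is not always better.
memory-bound	Limited by memory bandwidth or latency.
compute-bound	Limited by the arithmetic pipelines or tensor cores.
Nsight Systems	Shows the whole-application timeline.
Nsight Compute	Shows per-kernel detail.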
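
The memory-bound versus compute-bound distinction can be sketched with the roofline model. The peak numbers below are hypothetical and not tied to any specific GPU; the function names are illustrative.

```python
def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth times intensity.

    intensity is arithmetic intensity in FLOP per byte moved to/from memory.
    """
    return min(peak_flops, intensity * mem_bw)


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity at which the two limits meet."""
    return peak_flops / mem_bw


def classify(intensity: float, peak_flops: float, mem_bw: float) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_flops, mem_bw):
        return "memory-bound"
    return "compute-bound"


PEAK = 100e12   # hypothetical 100 TFLOP/s
BW = 2e12       # hypothetical 2 TB/s memory bandwidth

# fp32 vector add c[i] = a[i] + b[i]: 1 FLOP per 12 bytes (two loads, one store).
vector_add = classify(1 / 12, PEAK, BW)
# A well-tiled matrix multiply can reach hundreds of FLOP per byte.
tiled_gemm = classify(200.0, PEAK, BW)
```

For a memory-bound kernel the useful fixes are about moving fewer bytes (coalescing, tiling, fusion); raising occupancy or adding math throughput does not move the bandwidth ceiling.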

flowchart LR
    Slow[Program Slow] --> Systems[Nsight Systems]
    Systems --> Timeline[CPU/GPU timeline]
    Timeline --> Kernel{Kernel dominates?}
    Kernel -->|Yes| Compute[Nsight Compute]
    Kernel -->|No| API[Check CPU/API/copy/sync/queueing]
    Compute --> Metrics[occupancy/memory/stalls/instructions]
    Metrics --> Fix[coalescing/tiling/copy overlap/kernel redesign]

7. System architecture / prototype

Architect interviews do not stop at definitions; they ask how you decide whether an optimization is worth doing.

Standard method:

flowchart TB
    Problem[Problem] --> Hypothesis[Hypothesis]
    Hypothesis --> Baseline[Baseline Metrics]
    Baseline --> Micro[Microbenchmark]
    Micro --> Prototype[Small Prototype]
    Prototype --> E2E[End-to-end Validation]
    E2E --> Decision[Roadmap Decision]
    Decision --> Rollout[Rollout / Fallback / Observability]

Answer template:

I would start with a hypothesis and baseline metrics. Then I would isolate the communication or runtime pattern with a microbenchmark, build the smallest prototype, and finally validate it against end-to-end metrics such as TTFT, TPOT, P99 latency, throughput, memory usage, and operational complexity.
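
The baseline and microbenchmark steps can be sketched as a minimal timing harness. It assumes the pattern under test can be wrapped in a plain callable; `bench`, `baseline`, and `candidate` are hypothetical names, and the workloads are stand-ins.

```python
import math
import time
from typing import Callable


def bench(fn: Callable[[], object], warmup: int = 10, iters: int = 200) -> dict[str, float]:
    """Time fn repeatedly and report p50, p99, and max latency in seconds."""
    for _ in range(warmup):          # warm caches and lazy initialization first
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()

    def pct(q: int) -> float:
        # Nearest-rank percentile over the sorted samples.
        return samples[max(1, math.ceil(q * len(samples) / 100)) - 1]

    return {"p50": pct(50), "p99": pct(99), "max": samples[-1]}


def baseline() -> int:
    return sum(range(2000))          # stand-in for the current code path


def candidate() -> int:
    return sum(range(1000))          # stand-in for the proposed optimization


before = bench(baseline)
after = bench(candidate)
speedup_p50 = before["p50"] / after["p50"]
```

Reporting p99 next to p50 keeps tail latency visible in the decision. For GPU work, the callable would also need to wait for the device to finish before the timer stops, otherwise only the launch cost is measured.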