Sequence 06 - In-Depth Interview Q&A

Each answer includes: approach notes, the spoken answer, follow-up questions, and directions for extension.

1. Self-introduction

Q: Tell me about yourself.

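Approach:

Do not recite a chronological résumé.
Tie the answer directly to the JD: systems/performance/AI inference/networking/GPU profiling.
Be honest that depth in the NVIDIA-specific stack is still being built.

Answer: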

I am a systems and performance engineer with experience in performance-sensitive production systems, AI/LLM engineering pipelines, distributed service infrastructure, and diagnostic tooling. My strongest areas are system architecture, bottleneck isolation, profiling-driven optimization, observability, and prototype-driven validation. For this NVIDIA role, the most relevant parts are my AI inference integration work, LLM serving path awareness, communication-path diagnosis in distributed systems, GPU workload profiling experience with Vulkan/WebGPU/Nsight, and C++/Python/Linux systems background. I am also actively deepening NVIDIA-specific areas such as CUDA, NCCL, UCX, NIXL, GPUNetIO, and inference data movement.

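Follow-up questions:

Follow-up	How to answer
What is your strongest relevant skill?	Performance, debugging, architecture, and prototyping.
What is your biggest gap?	Limited direct production ownership of NIXL/GPUNetIO, but able to ramp up quickly on the data path and benchmarking.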

2. AI serving

Q: What is the end-to-end path of an LLM inference request?

Answer:

I break an LLM request into queueing, tokenization, scheduling, prefill, KV cache allocation, decode, and streaming output. TTFT is mainly affected by queueing, scheduling, tokenization, and prefill. TPOT is mainly affected by the decode loop, KV cache access, memory bandwidth, and sometimes tensor-parallel communication. P99 latency usually reflects saturation, batching behavior, queueing, and interference.
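
The three metrics above can be made concrete with a minimal Python sketch. It assumes each request is logged with an arrival time and one timestamp per output token; the `RequestTrace` type and its field names are hypothetical, not part of any serving framework.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request log record (times in seconds)."""
    arrival_s: float            # when the request reached the server
    token_times_s: List[float]  # emission time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: arrival until the first token is emitted."""
    return trace.token_times_s[0] - trace.arrival_s


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token, measured after the first token."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0  # a single token has no decode interval to average
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Keeping TTFT and TPOT as separate functions mirrors the split in the answer: TTFT captures everything before decode starts, while TPOT only sees the decode loop.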

Follow-up map:

flowchart TB
    Q[LLM request path] --> A[TTFT high]
    Q --> B[TPOT high]
    Q --> C[P99 high]
    A --> A1[queueing/scheduler/prefill/KV allocation]
    B --> B1[decode loop/KV read/memory/collectives]
    C --> C1[batching/saturation/admission control/interference]

Q: How would you debug high TTFT?

Answer:

I would separate TTFT into queueing time, tokenization, scheduling delay, prefill execution, KV allocation, and any data movement before decode starts. I would correlate request traces with GPU timeline and server metrics. If prompt length drives TTFT, prefill is likely dominant. If TTFT increases only under load, queueing, batching, or admission control may be the cause.
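
A rough sketch of that decomposition in Python, assuming the request trace records a timestamp at each stage boundary. The boundary names are hypothetical, and KV allocation is folded into the prefill stage here because many traces will not expose it separately.

```python
# Hypothetical stage boundaries recorded per request, in order.
BOUNDARIES = ["arrival", "dequeued", "tokenized", "scheduled", "first_token"]
STAGES = ["queueing", "tokenization", "scheduling", "prefill"]


def ttft_breakdown(stamps):
    """Split TTFT into per-stage durations from boundary timestamps (seconds)."""
    durations = {}
    for stage, start, end in zip(STAGES, BOUNDARIES, BOUNDARIES[1:]):
        durations[stage] = stamps[end] - stamps[start]
    return durations


def dominant_stage(durations):
    """Return the stage that contributes the most time to TTFT."""
    return max(durations, key=durations.get)


if __name__ == "__main__":
    stamps = {"arrival": 0.0, "dequeued": 0.30, "tokenized": 0.31,
              "scheduled": 0.33, "first_token": 0.45}
    parts = ttft_breakdown(stamps)
    print(parts, "dominant:", dominant_stage(parts))
```

Running the same breakdown at low and high load separates the two cases in the answer: a growing queueing share points at batching or admission control, while a prefill share that tracks prompt length points at prefill itself.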

Q: How would you debug bad TPOT?

Answer:

TPOT reflects per-token decode efficiency. I would inspect decode kernel time, KV cache access pattern, memory bandwidth, synchronization, and tensor-parallel collectives if used. If GPU utilization is low, I would check scheduling and small-batch inefficiency. If memory bandwidth or KV access dominates, I would investigate cache layout, paging, and data movement.
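
To judge whether memory bandwidth dominates, a back-of-the-envelope floor for one decode step helps. The sketch below assumes a simplified model: every step reads all weights once plus each sequence's KV cache, with no overlap or caching effects, and compute is ignored. All numbers in the example are hypothetical.

```python
def decode_step_floor_s(weight_bytes, kv_bytes_per_seq, batch_size,
                        mem_bw_bytes_per_s):
    """Lower bound on one decode step if memory traffic is the only cost.

    Assumes weights are read once per step and shared by the batch,
    while each sequence reads its own KV cache.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    return bytes_per_step / mem_bw_bytes_per_s


if __name__ == "__main__":
    # Hypothetical: 14 GB of weights, 1 GB of KV cache per sequence,
    # 2 TB/s of memory bandwidth.
    for batch in (1, 8):
        floor = decode_step_floor_s(14e9, 1e9, batch, 2e12)
        print(f"batch={batch}: step floor ~ {floor * 1e3:.2f} ms")
```

If measured TPOT sits close to this floor, the decode loop is likely bandwidth-limited and cache layout or reduced traffic matters most; if it is far above the floor, scheduling, synchronization, or collectives deserve a closer look first.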

3. NIXL / NCCL / UCX

Q: What is the difference between NIXL and NCCL?

Answer:

NCCL is primarily a collective communication library across ranks, with operations such as all-reduce, all-gather, and reduce-scatter. NIXL (NVIDIA Inference Xfer Library) targets point-to-point movement of inference data or state, for example moving KV cache or request state between workers in a disaggregated inference system. I would not treat NIXL as a replacement for NCCL; they solve different communication patterns.

Follow-up questions:

Follow-up	Answer
What does slow KV transfer affect?	TTFT, decode start delay, P99, GPU idle time.
How would you verify that NIXL adds value?	Baseline -> KV transfer benchmark -> E2E serving metrics.
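
Before running a KV transfer benchmark, it helps to estimate how much data is involved. The sketch below uses the standard KV cache size formula for a transformer (keys plus values, per layer, per KV head); the model shape and link bandwidth are hypothetical, and the transfer estimate ignores latency, protocol overhead, and overlap with compute.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_time_s(num_bytes, link_bw_bytes_per_s):
    """Idealized transfer time: payload size divided by link bandwidth."""
    return num_bytes / link_bw_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128,
    # 4096-token prompt, FP16 elements.
    size = kv_cache_bytes(32, 8, 128, 4096)
    # Hypothetical 50 GB/s effective link bandwidth.
    print(f"KV cache: {size / 2**20:.0f} MiB, "
          f"transfer ~ {transfer_time_s(size, 50e9) * 1e3:.1f} ms")
```

Comparing this estimate with the measured decode start delay gives a first signal of whether the transfer path is on the critical path at all.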

Q: What is UCX?

Answer:

UCX is a high-performance communication framework and transport abstraction. It can use different underlying transports such as shared memory, TCP, RDMA, and CUDA-aware paths. In a GPU networking context, I would use UCX to reason about transport selection, GPU memory awareness, RDMA capability, and fallback behavior.

4. NCCL / collectives

Q: Why is all-reduce important?

Answer:

All-reduce aggregates data across ranks and returns the result to every rank. It is central to data-parallel training for gradient synchronization and can also appear in tensor-parallel workloads depending on model partitioning. Its performance depends on message size, topology, transport, rank mapping, and overlap with computation.
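
To make the semantics concrete, here is a pure-Python simulation of a ring all-reduce (reduce-scatter followed by all-gather). It is an illustrative sketch only: ranks are modeled as lists in one process, and the ring is just one of several algorithms a library such as NCCL may choose.

```python
def ring_all_reduce(rank_data):
    """Simulate a sum all-reduce over a ring of len(rank_data) ranks.

    Each rank's vector is split into n equal chunks; the vector length
    must be divisible by the number of ranks.
    """
    n = len(rank_data)
    length = len(rank_data[0])
    assert length % n == 0, "vector length must be divisible by rank count"
    chunk = length // n
    bufs = [list(v) for v in rank_data]  # per-rank working buffers

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps, rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends "simultaneously".
        sends = [(r, (r - step) % n, bufs[r][span((r - step) % n)])
                 for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], payload)]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, bufs[r][span((r + 1 - step) % n)])
                 for r in range(n)]
        for r, c, payload in sends:
            bufs[(r + 1) % n][span(c)] = payload

    return bufs


if __name__ == "__main__":
    print(ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Each rank sends and receives 2 * (n - 1) chunks in total, which is why ring all-reduce is bandwidth-efficient for large messages but pays a step count that grows with the number of ranks.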

Q: Small message and large message bottlenecks?

Answer:

Small messages are often latency- and overhead-dominated: launch overhead, synchronization, and per-message progress matter more. Large messages are usually bandwidth- and topology-dominated, so PCIe, NVLink, NIC, and fabric bandwidth become more important. I would inspect algbw, busbw, message size, and NCCL_DEBUG logs.
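
The small-versus-large split follows from a simple latency-plus-bandwidth cost model, sketched below with hypothetical link parameters. The `allreduce_busbw` helper uses the all-reduce correction factor 2 * (n - 1) / n documented for nccl-tests, where algbw is message size divided by elapsed time.

```python
def transfer_time_s(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta model: fixed per-message latency plus size / bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s


def latency_fraction(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of total time spent on fixed latency rather than moving bytes."""
    return latency_s / transfer_time_s(msg_bytes, latency_s, bandwidth_bytes_per_s)


def allreduce_busbw(msg_bytes, elapsed_s, n_ranks):
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    algbw = msg_bytes / elapsed_s
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical link: 10 microseconds latency, 100 GB/s bandwidth.
    for size in (1024, 2**30):
        frac = latency_fraction(size, 10e-6, 100e9)
        print(f"{size:>12} bytes: latency share {frac:.1%}")
```

Under these assumed numbers a 1 KiB message is almost entirely latency, while a 1 GiB message is almost entirely bandwidth, which is why the two regimes call for different tuning.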

5. RDMA / GPUDirect / GPUNetIO

Q: Why is RDMA useful for AI/GPU networking?

Answer:

RDMA reduces CPU and kernel involvement by allowing the RNIC to directly access registered memory. For AI/GPU networking, this can reduce data movement overhead and improve throughput and latency, especially when communication is on the critical path.

Q: GPUDirect RDMA vs GPUNetIO?

Answer:

GPUDirect RDMA is a data path capability where an RNIC can directly access GPU memory under the right hardware, driver, topology, and fabric conditions. GPUNetIO is more of a GPU-centric networking programming model in the DOCA ecosystem, where the GPU can participate more directly in packet or network data processing. I would consider GPUNetIO only when CPU-mediated networking is on the critical path and the GPU directly consumes or processes network data.

6. CUDA / Nsight

Q: Nsight Systems vs Nsight Compute?

Answer:

Nsight Systems answers where time is spent across the whole application: CPU work, CUDA API calls, kernel launches, memory copies, streams, synchronization, and idle gaps. Nsight Compute answers why a specific kernel is slow: memory throughput, occupancy, warp stalls, instruction mix, shared memory usage, and other kernel-level metrics.

Q: Memory-bound vs compute-bound?

Answer:

If memory bandwidth is close to the hardware limit while compute utilization is low or stalls point to memory dependency, I treat it as memory-bound. If compute pipelines are saturated and memory bandwidth is not limiting, it is more compute-bound. The optimization direction differs: memory-bound workloads need better locality, coalescing, tiling, or reduced memory traffic; compute-bound workloads need better arithmetic efficiency or tensor-core utilization.
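
A first-order way to anticipate which side a kernel falls on is the roofline model: compare the kernel's arithmetic intensity (FLOPs per byte moved) with the hardware ridge point (peak FLOP/s divided by peak bandwidth). The sketch below uses hypothetical peak numbers and idealized traffic counts; measured Nsight Compute metrics should take precedence over this estimate.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def roofline_bound(flops, bytes_moved, peak_flops_per_s, peak_bw_bytes_per_s):
    """Classify a kernel as memory-bound or compute-bound under the roofline model."""
    ridge = peak_flops_per_s / peak_bw_bytes_per_s  # FLOPs per byte at the ridge
    if arithmetic_intensity(flops, bytes_moved) < ridge:
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    PEAK_FLOPS = 100e12  # hypothetical 100 TFLOP/s
    PEAK_BW = 2e12       # hypothetical 2 TB/s

    n = 1 << 20
    # FP32 vector add: 1 FLOP per element, 12 bytes (two reads, one write).
    print("vector add:", roofline_bound(n, 12 * n, PEAK_FLOPS, PEAK_BW))

    m = 4096
    # FP16 square matmul: 2*m^3 FLOPs, three m*m matrices at 2 bytes each
    # (idealized: every matrix touched exactly once).
    print("matmul:", roofline_bound(2 * m**3, 3 * m * m * 2, PEAK_FLOPS, PEAK_BW))
```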

7. CV project deep dive

Q: Tell me about your AI risk-control and LLM platform.

Answer:

The project connected feature generation, online inference, retrieval-assisted LLM analysis, anomaly detection, strategy execution, and feedback loops. My focus was production integration: observability, replayability, correctness validation, inference-path efficiency, and iterative optimization. From an NVIDIA perspective, the relevant parts are serving path decomposition, latency/throughput tradeoffs, model output validation, and diagnosing bottlenecks in production workflows.

Q: Your GPU experience is Vulkan/WebGPU, not CUDA. Is that enough?

Answer:

I would not claim CUDA production kernel ownership. My GPU-adjacent experience is in Vulkan/WebGPU workload analysis and profiling using tools like RenderDoc and Nsight. The transferable part is understanding GPU execution behavior, memory/resource tradeoffs, profiling timelines, and bottleneck isolation. For this role, I am closing the CUDA-specific gap with focused CUDA/Nsight experiments.

8. Architecture / prototype

Q: How would you design a communication optimization prototype?

Answer:

I would start with a clear hypothesis, such as KV transfer blocking decode or collective communication limiting TPOT. Then I would measure a baseline, build a microbenchmark to isolate the communication pattern, implement the smallest prototype, and validate it in an end-to-end workload. I would judge success by TTFT, TPOT, P99 latency, throughput, resource usage, failure modes, fallback behavior, and maintainability.

flowchart LR
    Hypothesis --> Baseline
    Baseline --> Microbenchmark
    Microbenchmark --> Prototype
    Prototype --> E2E
    E2E --> Roadmap
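
The E2E validation step can be sketched as a per-metric comparison between baseline and prototype runs. The metric names, the example values, and the 2% tolerance below are all hypothetical; the point is that every metric gets its own verdict, so a TTFT win cannot hide a P99 regression.

```python
# Metrics where a smaller value is better (hypothetical names).
LOWER_IS_BETTER = {"ttft_ms", "tpot_ms", "p99_ms"}


def verdict(base, new, lower_is_better, tolerance=0.02):
    """Classify one metric as improved, regressed, or unchanged.

    Relative changes within the tolerance are treated as noise.
    """
    change = (new - base) / base
    if abs(change) <= tolerance:
        return "unchanged"
    better = change < 0 if lower_is_better else change > 0
    return "improved" if better else "regressed"


def compare_runs(baseline, prototype):
    """Return a per-metric verdict for every metric in the baseline."""
    return {
        name: verdict(base, prototype[name], name in LOWER_IS_BETTER)
        for name, base in baseline.items()
    }


if __name__ == "__main__":
    baseline = {"ttft_ms": 200.0, "tpot_ms": 40.0, "p99_ms": 900.0,
                "throughput_tok_s": 1000.0}
    prototype = {"ttft_ms": 150.0, "tpot_ms": 40.0, "p99_ms": 950.0,
                 "throughput_tok_s": 1100.0}
    for name, result in compare_runs(baseline, prototype).items():
        print(f"{name}: {result}")
```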