Inference Engines for LLMs & Local AI Hardware (2026 Edition)

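This article is Part 3 of the Self-hosted LLMs Local AI Hardware series and covers LLM inference engines. Its core argument: settle the hardware strategy, workload shape, and serving model first, then choose an inference engine.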

Position in the series

Defines LLM Inference Engines as a combination of traffic cop, memory manager, kernel dispatcher, scheduler, cache accountant, parallelism planner, API surface, and deployment framework.
Groups engine families by purpose: portable local runtimes, Apple / unified-memory runtimes, consumer CUDA quant engines, production serving engines, and orchestration layers such as Dynamo.
Establishes an engine-selection mapping: llama.cpp for portability, MLX / MLX-LM for Mac-first workflows, ExLlamaV2/V3 for quantized local inference on consumer CUDA GPUs, vLLM as the default starting point for open-source production serving, SGLang for long context / MoE / routing, and TensorRT-LLM for maximum performance on NVIDIA hardware.
Places LLM Inference Phases, KV-Cache, Memory Bandwidth for LLM Inference, interconnect, scheduler quality, and runtime overhead in a single performance model.
Summarizes the dimensions that LLM Inference Benchmarking needs: model, weights, engine, hardware, workload, and metrics, where the metrics include TTFT, TPOT, p95/p99, KV cache hit rate, and prefill/decode throughput.
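
The latency metrics named above (TTFT, TPOT, and p95/p99) can be derived from per-request timestamps. The following is a minimal sketch, not the article's own tooling: RequestTrace, its fields, and the helper names are hypothetical, and it assumes the benchmark client records when each request was sent and when each generated token arrived.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestTrace:
    """Timestamps in seconds from a hypothetical benchmark client."""
    sent_at: float             # when the request was sent
    token_times: list[float]   # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: request sent until the first token arrives."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def percentile(values: list[float], p: int) -> float:
    """p-th percentile (1..99) with inclusive linear interpolation."""
    if len(values) == 1:
        return values[0]
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th one.
    return quantiles(values, n=100, method="inclusive")[p - 1]


if __name__ == "__main__":
    # Made-up traces, for illustration only.
    traces = [
        RequestTrace(sent_at=0.0, token_times=[0.50, 0.60, 0.70]),
        RequestTrace(sent_at=0.0, token_times=[0.80, 0.95, 1.10]),
    ]
    ttfts = [ttft(t) for t in traces]
    tpots = [tpot(t) for t in traces]
    print("p95 TTFT (s):", percentile(ttfts, 95))
    print("p95 TPOT (s):", percentile(tpots, 95))
```

Reporting the tail (p95/p99) alongside the mean matters because scheduler and cache behavior tend to show up in the slowest requests rather than in the average.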

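Per the requirements of this ingest, this is the only one of the three series articles to which a Chinese translation was added. The English original comes first, and the Chinese translation is appended at the end as a separate section.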