LLM Inference Benchmarking

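LLM inference benchmarking is a measurement methodology for evaluating a specific combination of inference engine, hardware, and workload. A useful benchmark describes the model, weights, engine, hardware, load shape, and metrics, rather than reporting only single-user tokens per second.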

Required context

Model: model name, architecture, parameter count, and active parameter count for MoE models.
Weights: dtype, quantization format, group size, calibration.
Engine: version, commit, backend, flags.
Hardware: GPU SKU, memory capacity and bandwidth, interconnect, CPU, RAM.
Workload: input and output length distributions, concurrency, streaming, shared prefixes, structured output.
Metrics: TTFT (time to first token), TPOT (time per output token), end-to-end latency, p50/p95/p99, tokens per second, requests per second, GPU memory usage, KV cache hit rate, prefill throughput, decode throughput, cost per 1M tokens.
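
As an illustration of how the latency metrics above relate to raw measurements, here is a minimal Python sketch. It assumes the benchmark client records a send timestamp and one arrival timestamp per streamed output token. The names RequestTrace, ttft, tpot, e2e_latency, and percentile are hypothetical and not taken from any particular benchmarking tool, and the nearest-rank percentile is only one of several common definitions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (in seconds) for one streamed request. Hypothetical record type."""
    send_time: float          # when the client sent the request
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.send_time


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return trace.token_times[-1] - trace.send_time


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile, for p in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Computing ttft and tpot per request and then calling percentile over all requests gives the p50/p95/p99 figures. Reporting the tail percentiles next to the median matters because a mean can hide slow requests under high concurrency.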

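The goal of an inference benchmark is not to find the "fastest engine" but to confirm whether a given combination of inference engine, hardware, and workload meets the product's constraints. Different products may optimize for latency, throughput, cost, privacy, portability, or developer speed.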
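
A minimal Python sketch of such a check follows. It assumes every constraint can be written as an upper bound on a measured metric and that cost is derived from a per-GPU hourly price. The function names, metric keys, and all numbers in the usage note are hypothetical placeholders, not measurements.

```python
from typing import Dict, List


def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    """Cost in USD per 1M tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


def violated_constraints(measured: Dict[str, float],
                         limits: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose measured value exceeds its upper bound."""
    return [name for name, limit in limits.items() if measured[name] > limit]
```

For example, with made-up values, violated_constraints({"ttft_p95_s": 0.4, "tpot_p95_s": 0.05}, {"ttft_p95_s": 0.5, "tpot_p95_s": 0.04}) returns ["tpot_p95_s"]: this configuration meets the TTFT bound but misses the TPOT bound, whatever its raw throughput.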