
Defeating Nondeterminism In LLM Inference

Batch-Invariant Kernels As A Practical Path To Reproducible Generation
Horace He, Thinking Machines Lab

Reproducible outputs at temperature 0 should be straightforward in principle: the sampler always picks the highest-probability token. Yet production LLM endpoints still produce different completions for the same prompt.

A recent article by Horace He and collaborators at Thinking Machines argues that the root cause is not generic GPU concurrency but the lack of batch invariance in core kernels. 

In short: numerics change when batch size or sequence slicing changes, so two identical prompts can follow different numerical paths to logits and diverge during greedy decoding.

This article distills the main arguments, verifies key claims against external sources, and summarizes practical steps to make inference deterministic. The original post can be found here: Defeating Nondeterminism in LLM Inference (He, 2025).

Key Takeaways

  • Floating-point non-associativity exists, but most forward-pass kernels used in LLMs are run-to-run deterministic on fixed shapes/configs; the big swings arise when the batch context changes (PyTorch, 2025).

  • Dynamic batching and sequence chunking alter dispatch decisions and reduction strategies, leading to different numerics for the same request (vLLM Team, 2024; SGLang Team, 2025).

  • To eliminate this source of drift, make reduction-heavy ops batch-invariant: RMSNorm, matrix multiplication, and attention. Keep each element's reduction order fixed regardless of batch size.

  • For attention decoding, use a fixed split-size (not a fixed number of splits) for Split-KV so the reduction order is invariant to how many tokens are processed at once.

  • Deterministic mode incurs moderate overhead but enables identical temperature-0 completions and unlocks true on-policy RL training.

  • Reference implementation and vLLM integration are available: thinking-machines-lab/batch_invariant_ops (He, 2025).

Overview: Why Batch Invariance Beats "Concurrency + Floating Point"

One popular claim is that floating-point arithmetic is non-associative: (a + b) + c ≠ a + (b + c). Changing the order of additions can change results. That said, many GPU kernels used in transformer forward passes avoid atomics and use fixed reduction trees, which makes them bitwise repeatable for the same input shape and algorithm.

You can repeatedly run mm(A,B) and get identical outputs when all else is constant. The nondeterminism shows up when the context changes - notably the batch size, the position of an element within the batch, or the way a sequence is chunked - which causes the library to choose different reduction strategies or instruction shapes.
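
This lack of batch invariance is easy to probe directly. The sketch below (assuming a CUDA-capable PyTorch install; whether the outputs actually differ depends on GPU, library version, and dtype) multiplies the same row by the same matrix once on its own and once as part of a full batch:

import torch

# Illustrative sketch: the same row multiplied alone vs. inside a larger batch.
# A nonzero difference indicates the matmul kernel is not batch-invariant.
torch.manual_seed(0)
A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

row_alone = torch.mm(A[:1], B)      # "batch size" 1
row_in_batch = torch.mm(A, B)[:1]   # same row, computed inside the full batch

print((row_alone - row_in_batch).abs().max())  # often nonzero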

Modern inference engines rely on dynamic batching and speculative decoding to maximize throughput. As the vLLM documentation notes, "the same requests might be batched differently" and can "lead to slightly different logit/logprob values at each step" (vLLM Team, 2024). SGLang echoes this, attributing most nondeterminism to dynamic batching and prefix caching, and recommending single-request execution for greater determinism (SGLang Team, 2025).

Example (credit: Thinking Machines Lab): 1.23 × 10³ + 2.34 × 10¹ = 1.2534 × 10³ (exact: 1253.4). With 3 digits of precision we can represent 1230 and 23.4, but their sum requires 5 digits of precision to represent (1253.4). Our floating-point format must therefore drop the 34 off the end and store 1.25 × 10³; in some sense, we have effectively rounded the original 23.4 to 20.0 before adding it.

Atomic Adds And Run-To-Run Determinism

A common hypothesis is that GPU concurrency plus floating-point non-associativity causes nondeterminism via atomic adds. In practice, atomic adds are rarely used in the LLM forward pass; libraries achieve determinism and performance by (1) exploiting parallelism along batch-like axes so each reduction stays within a core, and (2) using tree/split reductions with a deterministic clean-up or semaphore ordering (NVIDIA CUTLASS, 2024). 
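
As a conceptual illustration in plain Python (not CUDA), a split reduction stays run-to-run deterministic as long as the partial sums are combined in a fixed order rather than in whichever order workers finish; it remains sensitive to the number of splits, which is exactly the batch-invariance problem:

def deterministic_split_sum(values, num_splits=8):
    """Sum values via fixed chunks whose partials are combined in index order.

    Run-to-run deterministic because the clean-up loop always adds partials in
    the same order, regardless of which "worker" would finish first on a GPU.
    It is NOT invariant to num_splits, which is why tying the split strategy
    to batch size breaks batch invariance.
    """
    n = len(values)
    chunk = (n + num_splits - 1) // num_splits
    partials = [sum(values[i:i + chunk]) for i in range(0, n, chunk)]
    total = 0.0
    for p in partials:  # fixed, index-ordered clean-up
        total += p
    return total

vals = [0.1] * 1000 + [1e7, -1e7]
print(deterministic_split_sum(vals, 8) == deterministic_split_sum(vals, 8))   # True: repeatable
print(deterministic_split_sum(vals, 8) == deterministic_split_sum(vals, 16))  # may be False: split-dependent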

FlashAttention backward is a notable case where avoiding atomics changes the algorithm (extra recomputation versus the original paper; Dao, 2023), but the forward pass of typical transformer inference is run-to-run deterministic given fixed shapes and kernels (He, 2025).

Why, then, do users observe nondeterminism? Because batch size and sequence slicing vary with server load and scheduling. If kernel numerics are not batch-invariant, the same request can see different reduction orders and instruction choices depending on co-scheduled work, yielding different logits and greedy-decoding paths (He, 2025).

Why It's Important

Deterministic inference strengthens scientific and engineering workflows: repeatable evaluation, reliable regression testing, reproducible A/B results, and exact replay for debugging. 

It also enables stronger training setups. If sampling and training are bitwise identical, reinforcement learning remains truly on-policy instead of drifting off-policy and needing importance weighting to correct the mismatch. The article demonstrates this in an RLVR setup where the "True On-Policy" run maintains zero KL-divergence between sampler and trainer (He, 2025).

Discussion: Evidence, Figures, And Kernel Design

The article opens with a crisp explanation of floating-point non-associativity and illustrates how changing addition orders can yield many possible sums for the same multiscale vector. 

This frames the core question: when do the kernel implementations change the reduction order? Contrary to the idea that "parallel threads finish in random order," the authors argue that most forward-pass kernels (matmul, normalization, pointwise ops) are engineered to be run-to-run deterministic. The larger system becomes nondeterministic because kernels often lack batch invariance.
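
The order dependence is easy to reproduce directly. The sketch below (illustrative; the count of distinct sums depends on the values, dtype, and permutations drawn) sums the same multiscale vector in many random orders:

import random
import torch

# Sum the same multiscale vector in different orders and count distinct results.
vals = [1e-10, 1e-5, 1e-2, 1.0] * 2500  # 10,000 values spanning many magnitudes
sums = set()
for _ in range(100):
    random.shuffle(vals)
    sums.add(torch.tensor(vals, dtype=torch.float32).sum().item())
print(len(sums), "distinct float32 sums from 100 permutations")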

RMSNorm. Figures 5-7 in the post compare a purely data-parallel approach (one batch row per core, reduction contained within a core) with split reductions introduced when batch is too small to saturate the GPU. The latter improves occupancy but changes the order of additions, breaking batch invariance. A batch-invariant RMSNorm keeps a single reduction strategy across batch sizes - potentially sacrificing some utilization at tiny batches to preserve identical numerics per row.
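
A minimal PyTorch-level sketch of that design (illustrative only; bitwise invariance ultimately depends on the underlying kernel using one reduction strategy for every batch size, which eager-mode PyTorch does not promise):

import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm whose per-row reduction never leaves the row.

    The hidden-dimension mean is computed within each row in a fixed order and
    is never split across cores just because the batch is small, so a row's
    numerics do not depend on how many other rows arrive with it.
    """
    rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x.float() * rms).to(x.dtype) * weight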

Matrix Multiplication. Figures 8-11 explain that GEMM is most naturally data-parallel along output tiles when M and N are sufficiently large, keeping K-reductions local to a core. When M and N are small, many libraries switch to Split-K (parallelizing the reduction along K) to increase occupancy, which alters the reduction order and even instruction selection (tensor-core tile sizes). CUTLASS documents Split-K as a standard performance tactic for small M/N shapes (NVIDIA, 2025). A batch-invariant GEMM compiles a single configuration without Split-K and reuses it across shapes, trading a slice of peak throughput (~20% in the post's example) for invariant numerics across batch sizes. Some strategies, such as Stream-K, vary the reduction strategy across tiles and are therefore not even batch-position-invariant (He, 2025).
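
To make the Split-K effect concrete, the toy sketch below (CPU PyTorch; shapes and split counts are arbitrary) computes the same product with the K dimension partitioned into different numbers of partial reductions:

import torch

def gemm_split_k(a: torch.Tensor, b: torch.Tensor, split_k: int) -> torch.Tensor:
    """Toy Split-K GEMM: partition K, compute partial products, reduce afterwards.

    Changing split_k changes the order of additions, so the same inputs can give
    results that differ in the last bits; this is why a batch-invariant GEMM
    pins the Split-K policy (typically to split_k=1) for every shape.
    """
    K = a.shape[1]
    chunk = (K + split_k - 1) // split_k
    partials = [a[:, i:i + chunk] @ b[i:i + chunk, :] for i in range(0, K, chunk)]
    out = partials[0]
    for p in partials[1:]:
        out = out + p
    return out

a = torch.randn(4, 4096)
b = torch.randn(4096, 4)
print(torch.equal(gemm_split_k(a, b, 1), gemm_split_k(a, b, 8)))  # often False in float32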

Attention. Figures 12-15 treat attention as two matmuls with reductions across both feature and sequence dimensions, then analyze how inference optimizations (KV cache layout, prefill chunking) can reorder reductions. A key insight is that handling cached KV and current tokens with different block boundaries creates boundary-condition changes that reorder reductions. The fix is to standardize layout before the kernel and, in decoding, to adopt a fixed split-size Split-KV scheme. Unlike "balanced" strategies that choose the number of splits to saturate the GPU (and thereby vary the reduction order with batch context), a fixed per-split size produces the same reduction order whether you process one token or many at a time (He, 2025).
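
The difference between the two split policies fits in a few lines (a sketch; the 256-token split size is an arbitrary placeholder rather than a value from the post):

def kv_splits_fixed_count(kv_len: int, num_splits: int):
    """'Balanced' Split-KV: the kernel picks num_splits to saturate the GPU, so the
    split size (and hence the reduction order) shifts with batch context and load."""
    size = (kv_len + num_splits - 1) // num_splits
    return [(start, min(start + size, kv_len)) for start in range(0, kv_len, size)]

def kv_splits_fixed_size(kv_len: int, split_size: int = 256):
    """Batch-invariant Split-KV: every split covers split_size tokens (plus a tail),
    so a KV position always lands in the same split with the same reduction order."""
    return [(start, min(start + split_size, kv_len)) for start in range(0, kv_len, split_size)]

# The same 1000-token KV cache under both policies:
print(kv_splits_fixed_count(1000, 3))  # boundaries depend on the chosen split count
print(kv_splits_fixed_size(1000))      # boundaries depend only on split_size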

Empirical Results. The post backs the kernel-level analysis with a determinism experiment on Qwen/Qwen3-235B-A22B-Instruct-2507 and a performance comparison on Qwen-3-8B; the numbers are summarized in the Experiments section below (He, 2025).

Related System Notes. In distributed training, even collective operations can have determinism caveats depending on the algorithm and hardware path. NVIDIA's NCCL issue tracker notes that NVLink Sharp (NVLS) all-reduce was historically non-deterministic on some stacks; newer drivers make it deterministic on Blackwell and Hopper with CUDA 12.8+ (NVIDIA NCCL, 2024-2025). While this is orthogonal to single-GPU inference, it underscores how algorithm choices and hardware paths affect determinism.

Implementation Sketch

The demonstration builds on vLLM's FlexAttention backend and PyTorch's torch.Library to replace reduction ops with batch-invariant versions. The guiding rule: for each op, choose a single tiling/instruction mix whose reduction order is independent of batch size and sequence chunking. In decoding, fix the split size along KV. 

The reference implementation is available at thinking-machines-lab/batch_invariant_ops. Note: achieving fixed split-size Split-KV required internal FlexAttention changes that were not included in the initial code release and will be upstreamed (He, 2025).

# Pseudocode sketch for a batch-invariant GEMM dispatch.
# Always use the same tile sizes and disable Split-K: this sacrifices some
# throughput on small M, N but ensures a fixed reduction order.
# (select_cfg and launch_gemm are placeholders for a real kernel launcher.)

def deterministic_gemm(a, b):
    # shape checks elided
    cfg = select_cfg(tile_m=128, tile_n=128, tile_k=32, use_tensor_cores=True)
    # explicitly disable the Split-K path: split_k=1 keeps the K reduction
    # within a single threadblock, so its order never depends on batch context
    return launch_gemm(a, b, cfg, split_k=1)
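
To connect this to the torch.Library mechanism mentioned above, the sketch below shows how an override could be registered. The mm_batch_invariant body is a deliberately naive stand-in, not the reference library's kernel; it avoids re-dispatching into aten::mm by using a broadcast multiply and a single fixed-axis reduction.

import torch

# Sketch: register a batch-invariant implementation of aten::mm for CUDA tensors.
# The real batch_invariant_ops library registers Triton kernels pinned to one
# tiling configuration; this stand-in only illustrates the wiring.
_lib = torch.library.Library("aten", "IMPL")

def mm_batch_invariant(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # (M, K, 1) * (1, K, N) -> (M, K, N), reduced over K with a single strategy.
    # Memory-hungry and slow, but it never re-enters the aten::mm override.
    return (a.unsqueeze(-1) * b.unsqueeze(0)).sum(dim=1)

_lib.impl("mm", mm_batch_invariant, "CUDA")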

Experiments

How nondeterministic are completions? Using Qwen/Qwen3-235B-A22B-Instruct-2507 at temperature 0, 1000 runs of “Tell me about Richard Feynman” produced 80 unique completions; all runs matched for the first 102 tokens and diverged at token 103 (e.g., “Queens, New York” vs “New York City”). With batch-invariant kernels, 1000/1000 completions match (He, 2025).
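
A determinism check of this kind is easy to rerun against any OpenAI-compatible endpoint (a sketch; the base URL, run count, and use of the openai client package are assumptions, not details from the post):

from openai import OpenAI

# Count unique temperature-0 completions for a fixed prompt.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completions = set()
for _ in range(100):
    resp = client.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        prompt="Tell me about Richard Feynman",
        temperature=0,
        max_tokens=1000,
    )
    completions.add(resp.choices[0].text)

print(len(completions), "unique completions")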

Performance. On Qwen-3-8B with a single-GPU API server generating 1000 sequences of ~100 tokens each, the reported end-to-end times were roughly 26 s for default vLLM, 55 s for the unoptimized deterministic mode, and 42 s with an improved attention kernel. The overhead is moderate and driven largely by current FlexAttention integration costs (He, 2025).

True On-Policy RL. Deterministic inference enables bitwise-identical sampling and training, eliminating off-policy drift. In an RLVR setup on BigMath initialized from Qwen 2.5-VL Instruct 8B (rollouts up to 4096), the “True On-Policy” run maintained KL divergence at 0 between sampler and trainer, while a run without importance weighting collapsed and a run with importance weighting stabilized but showed nonzero KL (see background on off-policy corrections in Fengyao, 2023; details in He, 2025).
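
For reference, the off-policy correction that non-deterministic runs rely on is a per-token importance ratio between the trainer's and the sampler's policies; a minimal sketch (not the article's training code) looks like this:

import torch

def importance_weighted_pg_loss(logp_train: torch.Tensor,
                                logp_sample: torch.Tensor,
                                advantages: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss with the ratio pi_train / pi_sample as a correction.

    With bitwise-identical sampler and trainer the ratio is exactly 1 and this
    collapses to the plain on-policy loss; otherwise it corrects the mismatch
    only in expectation and can still add variance.
    """
    ratio = torch.exp(logp_train - logp_sample.detach())
    return -(ratio * advantages).mean()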

Conclusion

The familiar explanation - "GPUs are parallel so outputs vary" - misses the practical lever. What determines repeatability is whether kernel numerics are batch-invariant. By holding reduction order fixed across batch sizes and sequence slicing, the authors show that temperature-0 inference can be made reproducible with modest overhead. This improves testing and evaluation, enables true on-policy RL, and makes production debugging cleaner. For the deep dive, figures, and code, see (He, 2025) and thinking-machines-lab/batch_invariant_ops.


Definitions

Batch invariance: The numerical result for an element is independent of batch size and position in the batch; reduction order and instruction choices are fixed.

Split-K (GEMM): Partitioning the K dimension across cores/threadblocks and reducing partial sums later to improve occupancy when M,N are small; changes reduction order (NVIDIA, 2025).

Split-KV (attention decoding): Partitioning the KV axis to gain parallelism when query length is small. A fixed split-size policy preserves invariance.

Deterministic inference: Given the same input, model parameters, and environment, outputs are bitwise identical across runs at temperature 0.

References

(He, 2025) Defeating Nondeterminism in LLM Inference. Thinking Machines Lab: Connectionism. DOI: 10.64434/tml.20250910.

(vLLM Team, 2024) Frequently Asked Questions.

(SGLang Team, 2025) Troubleshooting and FAQ.

(NVIDIA, 2025) Efficient GEMM in CUDA - Parallelized Reductions.

(Dao, 2023) FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.

(PyTorch, 2025) Numerical Accuracy - Batched or Slice Computations.

(NVIDIA NCCL, 2024-2025) Issue 1497: Determinism of NVLink Sharp.

(NVIDIA CUTLASS, 2024) Deterministic semaphore ordering for split reductions.

(Fengyao, 2023) Notes on off-policy RL and importance weighting.

(BigMath, 2025) BigMath: Benchmark and RLVR setup details.


Publication Title: Defeating Nondeterminism in LLM Inference
Authors: Horace He, Thinking Machines Lab
Joshua Berkowitz, September 23, 2025