No major provider currently guarantees fully deterministic outputs, and this is widely recognised as a limitation of today’s LLM technology. It becomes a significant challenge when implementing agentic systems at scale in production.
In theory, an LLM's output probabilities are deterministic for a given input,
but in practice they are not exactly repeatable because of implementation and
infrastructure realities. Variability arises
from two sources: sampling randomness (token selection) and computation
variability (the probability distribution itself). The former can be controlled
via greedy decoding (argmax), but the latter is significantly harder to
eliminate.
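To make the two sources concrete, here is a minimal sketch in plain Python. The three-token vocabulary and the logit values are hypothetical and chosen only for illustration. It shows that greedy decoding removes sampling randomness for a fixed set of logits, while a very small change in the logits themselves can still change the token that argmax selects.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sampled_pick(logits, rng):
    """Stochastic decoding: draw an index according to the softmax probabilities."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.9, 0.5]

# Sampling randomness: repeated draws can return different tokens.
rng = random.Random()
samples = [sampled_pick(logits, rng) for _ in range(1000)]

# Greedy decoding removes that randomness for a fixed set of logits.
greedy = [greedy_pick(logits) for _ in range(1000)]

# Computation variability: a tiny shift in one logit changes the argmax.
perturbed = [2.0, 2.0 + 1e-6, 0.5]

print("distinct sampled tokens:", sorted(set(samples)))
print("distinct greedy tokens:", sorted(set(greedy)))
print("greedy token after perturbation:", greedy_pick(perturbed))
```

The perturbation here is applied by hand. In a real deployment the shift would come from the serving stack rather than from the caller, which is why it is harder to control.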
Many popular LLM APIs are widely believed to use Mixture-of-Experts (MoE)
architectures, where each token is routed to a subset of experts. In production
environments, batching, routing thresholds, and infrastructure scheduling can
cause slight variations in expert selection, leading to different computations
across runs. At a lower level, floating-point non-associativity,
non-deterministic GPU kernels, and multi-threaded execution can alter values
during neural computations. Because user requests are batched together on shared
infrastructure, the execution path for one request can depend on the other
requests in its batch, so the same experts may not be selected consistently
across runs. These effects can shift the selected argmax token, and once one
token differs, the rest of the output can diverge. As a result, absolute
determinism is generally not achievable with public APIs. This proved
challenging in my agentic AI implementations using Google Gemini and Amazon
Nova, both of which are widely considered to follow MoE-based designs.
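Floating-point non-associativity is easy to reproduce on a CPU. The sketch below is a deliberately exaggerated illustration, not a model of any real GPU kernel: the terms and the rival logit are hypothetical values picked so that summing the same numbers in a different order produces a different logit and therefore a different argmax. In real systems the differences are far smaller, but they matter whenever two candidate tokens are nearly tied.

```python
def sequential_sum(values):
    """Add values strictly left to right, i.e. one fixed reduction order."""
    total = 0.0
    for value in values:
        total += value
    return total

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Grouping alone changes the result of a floating-point sum.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# The same three terms, reduced in two different orders.
terms = [1e17, 1.0, -1e17]
order_a = sequential_sum(terms)                           # the 1.0 is lost to rounding
order_b = sequential_sum([terms[0], terms[2], terms[1]])  # the 1.0 survives

# Hypothetical competing logit that sits between the two results.
rival_logit = 0.5
pick_a = argmax([order_a, rival_logit])
pick_b = argmax([order_b, rival_logit])
print(order_a, order_b)  # 0.0 1.0
print(pick_a, pick_b)    # 1 0
```

A hand-written loop is used instead of the built-in `sum`, because recent Python versions apply compensated summation to floats, which would hide the effect.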
Improving consistency
1. Control sampling variability
- Use temperature = 0 to enforce greedy decoding (argmax selection).
- Set top_p = 1, and disable other stochastic sampling parameters.
- Use a seed where supported (improves repeatability but does not guarantee determinism).
2. Reduce variability in probability computation
- Tighten prompts to keep them consistent (including punctuation, casing, spacing, and formatting).
- Tighten context and stabilise retrieval (if using RAG) with deterministic ranking (e.g. hybrid semantic + lexical scoring) and fixed thresholds.
- Break tasks into smaller, deterministic steps and replace open-ended reasoning with rules where possible.
- Use structured outputs (e.g. JSON schema) to stabilise results.
- Apply post-processing, guardrails, and retry strategies.
- If consistency is paramount, consider self-hosted models on fixed infrastructure.
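Several of the measures above can be sketched together in a few lines of Python. Everything named here is an assumption made for illustration: the `SAMPLING_SETTINGS` keys (real parameter names vary by provider), the two-field schema, the `call_model` callable, and the canned fake model that stands in for a real API client. The sketch shows fixed sampling settings, retrieval ranking with a fixed threshold and a stable tie-break, and structured-output validation with a bounded retry.

```python
import json

# Hypothetical request settings; real parameter names vary by provider.
SAMPLING_SETTINGS = {"temperature": 0, "top_p": 1, "seed": 42}

# Hypothetical output schema: field name mapped to its expected Python type.
REQUIRED_FIELDS = {"intent": str, "confidence": float}

def rank_documents(scored_docs, threshold=0.5, limit=3):
    """Rank (doc_id, score) pairs with a fixed threshold and a stable tie-break."""
    kept = [(doc_id, score) for doc_id, score in scored_docs if score >= threshold]
    # Sort by descending score, then by doc_id, so equal scores always order the same way.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in kept[:limit]]

def parse_structured(raw_text):
    """Return the parsed dict if it matches the expected fields, else None."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            return None
    return data

def call_with_retry(call_model, prompt, max_attempts=3):
    """Call the model until its output validates or the attempts run out."""
    for _ in range(max_attempts):
        result = parse_structured(call_model(prompt, **SAMPLING_SETTINGS))
        if result is not None:
            return result
    raise ValueError("no valid structured output after retries")

def make_fake_model(responses):
    """Stand-in for a real API client; returns canned responses in order."""
    queue = iter(responses)
    def fake_model(prompt, **settings):
        return next(queue)
    return fake_model

fake_model = make_fake_model(["not json", '{"intent": "refund", "confidence": 0.9}'])
result = call_with_retry(fake_model, "Classify: I want my money back")
ranked = rank_documents([("b", 0.8), ("a", 0.8), ("c", 0.4), ("d", 0.95)])
print(result)
print(ranked)
```

The explicit tie-break on `doc_id` is the part that is easiest to overlook: without it, documents with equal scores can reach the prompt in a different order depending on how the retriever returned them, and the model then sees a different context.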


