Mej

After completing 12 years in software QA, working with a variety of test data, I was tempted to make a career shift into data science and decided to pursue this through a structured master's program. Though I love the three pillars (math, statistics and programming), I did not have an easy start, as I was getting back to studies after a long gap of 14 years. As I began learning machine learning, visual analytics, data science, Python, Matlab, R, Tableau, Mondrian etc., I got excited about blogging as a way to summarise my learning. I will try to post frequently and keep it simple. Looking forward to a good time learning and sharing... Cheers, Mej!

Thursday, 25 June 2026

Same Input, Different Output: The Biggest Reliability Challenge in Agentic AI

No major provider currently guarantees fully deterministic outputs, and this is widely recognised as a limitation of today’s LLM technology. That limitation becomes a significant challenge when implementing agentic systems at scale in production.

In theory, an LLM's output probabilities are a deterministic function of its input, but in practice they are not exactly repeatable due to implementation and infrastructure realities. Variability arises from two sources: sampling randomness (which token is selected from the distribution) and computation variability (how the probability distribution itself is calculated). The former can be controlled via greedy decoding (argmax), but the latter is significantly harder to eliminate.

Many popular LLM APIs are believed to be built on Mixture-of-Experts (MoE) architectures, where each token is routed to a subset of experts. In production environments, batching, routing thresholds, and infrastructure scheduling can cause slight variations in expert selection, leading to different computations across runs. Low-level numerical effects add to this: floating-point non-associativity, non-deterministic GPU kernels, and multi-threaded execution can alter intermediate values during neural computations. Furthermore, because user requests are batched together on shared infrastructure, the execution path can depend on what else is in the batch, so the same experts may not be selected consistently across runs. Even tiny differences can change which token has the highest probability (the argmax), and once one token differs, the rest of the output diverges. As a result, absolute determinism is generally not achievable with public APIs. This proved challenging in my agentic AI implementations using Google Gemini and Amazon Nova, both of which are widely considered to follow MoE-based designs.
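To make the floating-point point concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how any provider computes logits: the three terms, the competing logit of 0.6, and the token names "A" and "B" are all made up for the example. It shows that adding the same numbers in a different order can give a slightly different total, and that this is enough to flip a greedy (argmax) choice when two candidates are close.

```python
def accumulate(terms, order):
    """Add the same terms in the given index order, as a parallel reduction might."""
    total = 0.0
    for i in order:
        total += terms[i]
    return total


def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical contributions to the logit of token "A".
terms = [0.1, 0.2, 0.3]

# Same terms, two summation orders (e.g. two different kernel schedules).
left_to_right = accumulate(terms, [0, 1, 2])
right_to_left = accumulate(terms, [2, 1, 0])

# Hypothetical competing token "B" with a logit very close to A's.
tokens = ["B", "A"]
logit_b = 0.6

choice_1 = tokens[argmax([logit_b, left_to_right])]
choice_2 = tokens[argmax([logit_b, right_to_left])]

print(left_to_right == right_to_left)  # False: the two totals differ in the last bit
print(choice_1, choice_2)              # the greedy pick differs between the two runs
```

In a real model the sums involve thousands of terms per logit, so near-ties like this are rare for any single token but become likely somewhere over a long generation.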

 

Improving consistency

1. Control sampling variability

  • Use temperature = 0 to enforce greedy decoding (argmax selection).
  • Set top_p = 1, and disable other stochastic sampling parameters.
  • Use a seed where supported (improves repeatability but does not guarantee determinism).
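The effect of these three settings can be sketched with a toy decoding step. This is an illustration under my own assumptions, not any provider's implementation: `pick_token`, its parameters, and the example logits are hypothetical, and real APIs apply the seed per request rather than per token.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=1.0, top_p=1.0, seed=None):
    """Toy single decoding step with temperature, nucleus (top_p) and seed."""
    if temperature == 0:
        # Greedy decoding: no randomness, always the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    probs = softmax(logits, temperature)

    # Nucleus filter: keep the most likely tokens until their
    # cumulative probability reaches top_p (top_p = 1 keeps everything).
    ranked = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # A fixed seed makes the random draw repeatable; seed=None does not.
    rng = random.Random(seed)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.5, 0.3]  # made-up logits for three candidate tokens

greedy = [pick_token(logits, temperature=0) for _ in range(5)]
seeded = [pick_token(logits, temperature=1.0, seed=42) for _ in range(5)]
unseeded = [pick_token(logits, temperature=1.0) for _ in range(5)]

print(greedy)    # always index 0
print(seeded)    # the same index five times
print(unseeded)  # may vary from run to run
```

Note that both the greedy and the seeded runs are only repeatable here because the logits are identical on every call. If the logits themselves shift slightly between runs, neither setting can prevent a different token from being chosen.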

2. Reduce variability in probability computation

  • Keep prompts exactly the same across runs (including punctuation, casing, spacing, and formatting), since even small differences change the model's input.
  • Tighten context and stabilise retrieval (if using RAG) with deterministic ranking (e.g. hybrid semantic + lexical scoring) and fixed thresholds.
  • Break tasks into smaller, deterministic steps and replace open-ended reasoning with rules where possible.
  • Use structured outputs (e.g. JSON schema) to stabilise results.
  • Apply post-processing, guardrails, and retry strategies.
  • If consistency is paramount, consider self-hosted models on fixed infrastructure.
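A few of the points above (consistent prompts, deterministic ranking with a fixed threshold, structured output, and retries) can be sketched together as follows. This is only a rough sketch under my own assumptions: `call_model` stands in for whatever client function you use, and the required keys, threshold, and rounding precision are hypothetical values chosen for the example.

```python
import json
import re

REQUIRED_KEYS = {"label", "confidence"}  # hypothetical output schema


def normalise_prompt(text):
    """Collapse whitespace so the same request always yields the same string."""
    return re.sub(r"\s+", " ", text).strip()


def rank_documents(scored_docs, threshold=0.5, top_k=3):
    """Rank (doc_id, score) pairs with a fixed cut-off and a stable tie-break.

    Scores are rounded so that tiny numeric jitter is less likely to reorder
    results; near-equal scores fall back to doc_id order.
    """
    kept = [(doc_id, score) for doc_id, score in scored_docs if score >= threshold]
    kept.sort(key=lambda pair: (-round(pair[1], 4), pair[0]))
    return [doc_id for doc_id, _ in kept[:top_k]]


def parse_structured(raw):
    """Return the parsed object if it is valid JSON with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and REQUIRED_KEYS.issubset(data):
        return data
    return None


def call_with_retries(call_model, prompt, max_attempts=3):
    """Call the model until it returns valid structured output, or give up."""
    clean_prompt = normalise_prompt(prompt)
    for _ in range(max_attempts):
        result = parse_structured(call_model(clean_prompt))
        if result is not None:
            return result
    raise ValueError("no valid structured output after retries")


# Usage with a stand-in model that fails once, then returns valid JSON.
replies = iter(["not json", '{"label": "bug", "confidence": 0.9}'])
result = call_with_retries(lambda prompt: next(replies), "  Classify   this\n ticket ")
ranking = rank_documents([("b", 0.80001), ("a", 0.80002), ("c", 0.4)])

print(result["label"])  # bug
print(ranking)          # ['a', 'b']
```

Rounding and tie-breaking reduce reordering from numeric jitter but cannot remove it entirely, since a score sitting right on a rounding boundary or the threshold can still land on either side.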

 
