SCIence aND ART of data

Thursday, 25 June 2026

Same Input, Different Output: The Biggest Reliability Challenge in Agentic AI

No major provider currently guarantees fully deterministic outputs, and this is widely recognised as a limitation of today’s LLM technology. It becomes a significant challenge when implementing agentic systems at scale in production.

LLM output probabilities are theoretically deterministic, but not exactly repeatable in practice due to implementation and infrastructure realities. Variability arises from two sources: sampling randomness (token selection) and computation variability (the probability distribution itself). The former can be controlled via greedy decoding (argmax), but the latter is significantly harder to eliminate.

Many popular LLM APIs use Mixture-of-Experts (MoE) architectures, where each token is routed to a subset of experts. In production environments, batching, routing thresholds, and infrastructure scheduling can cause slight variations in expert selection, leading to different computations across runs. Independently of routing, floating-point non-associativity, non-deterministic GPU kernels, and multi-threaded execution can alter intermediate values during neural computations. Because requests from many users are batched together on shared infrastructure, the execution path for the same input can differ between runs, so the same experts may not be selected consistently. These effects can shift the selected argmax token, and once a single token differs, the rest of the output diverges. As a result, achieving absolute determinism is generally not possible with public APIs, and this proved challenging in my agentic AI implementations using Google Gemini and Amazon Nova, both of which are widely considered to follow MoE-based designs.
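The effect of floating-point non-associativity on greedy decoding can be illustrated with a toy sketch. The tokens, logits and contribution values below are made up for illustration; the only point is that adding the same numbers in a different order can change the result by one bit, which is enough to flip an argmax between two near-tied tokens.

```python
def accumulate(values):
    """Add floats one by one, in the order given (a fixed reduction order)."""
    total = 0.0
    for value in values:
        total += value
    return total


def greedy_pick(logits):
    """Greedy decoding: return the token with the highest logit (first one wins ties)."""
    return max(logits, key=logits.get)


# Hypothetical partial contributions to the logit of the token "yes".
contributions = [0.1, 0.2, 0.3]

yes_forward = accumulate(contributions)             # (0.1 + 0.2) + 0.3
yes_reversed = accumulate(reversed(contributions))  # (0.3 + 0.2) + 0.1

# The competing token "no" has a fixed logit very close to that of "yes".
run_1 = greedy_pick({"no": 0.6, "yes": yes_forward})
run_2 = greedy_pick({"no": 0.6, "yes": yes_reversed})

print(yes_forward == yes_reversed)  # False: same numbers, different order
print(run_1, run_2)                 # the greedy choice differs between the two runs
```

On real hardware the summation order is decided by the GPU kernel and the batch composition rather than by an explicit `reversed()`, but the mechanism is the same.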

Improving consistency

1. Control sampling variability

Use temperature = 0 to enforce greedy decoding (argmax selection).
Set top_p = 1, and disable other stochastic sampling parameters.
Use a seed where supported (improves repeatability but does not guarantee determinism).
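As a rough illustration of what these settings do, here is a local toy sampler rather than any provider's API. The function `choose_token` and the example logits are assumptions made for this sketch; hosted APIs expose similar controls under their own parameter names.

```python
import math
import random


def choose_token(logits, temperature, seed=None):
    """Pick a token from a dict of logits.

    temperature == 0 -> greedy decoding (argmax), no randomness involved.
    temperature > 0  -> sample from the softmax distribution, optionally seeded.
    """
    if temperature == 0:
        return max(logits, key=logits.get)

    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    top = max(logits.values())
    # Softmax with temperature; subtracting the max keeps exp() numerically stable.
    weights = [math.exp((value - top) / temperature) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


logits = {"approve": 2.0, "escalate": 1.5, "reject": 0.2}

greedy = {choose_token(logits, temperature=0) for _ in range(100)}
seeded = {choose_token(logits, temperature=1.0, seed=42) for _ in range(100)}
unseeded = {choose_token(logits, temperature=1.0) for _ in range(100)}

print(greedy)    # always the single highest-logit token
print(seeded)    # a single token: the same seed gives the same draw
print(unseeded)  # may contain several different tokens
```

Note that this only removes sampling randomness. If the logits themselves vary between runs, greedy decoding can still return different tokens.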

2. Reduce variability in probability computation

Tighten prompts to keep them consistent (including punctuation, casing, spacing, and formatting).
Tighten context and stabilise retrieval (if using RAG) with deterministic ranking (e.g. hybrid semantic + lexical scoring) and fixed thresholds.
Break tasks into smaller, deterministic steps and replace open-ended reasoning with rules where possible.
Use structured outputs (e.g. JSON schema) to stabilise results.
Apply post-processing, guardrails, and retry strategies.
If consistency is paramount, consider self-hosted models on fixed infrastructure.
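Two of the points above, consistent prompts and structured outputs with retries, can be sketched as follows. This is a minimal sketch: `llm` is a hypothetical callable (prompt in, text out) standing in for a real client, and the required keys are invented for the example.

```python
import json
import re


def normalise_prompt(text):
    """Collapse whitespace so that equivalent prompts become identical strings."""
    return re.sub(r"\s+", " ", text).strip()


def call_with_validation(llm, prompt, required_keys, max_attempts=3):
    """Call `llm` until it returns a JSON object containing all `required_keys`."""
    clean_prompt = normalise_prompt(prompt)
    for _ in range(max_attempts):
        raw = llm(clean_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            return parsed
    raise ValueError(f"No valid response after {max_attempts} attempts")


# Stub model: the first reply is free text, the second is valid JSON.
replies = iter(["Sure! Here is the answer", '{"intent": "billing", "confidence": 0.9}'])


def fake_llm(prompt):
    return next(replies)


result = call_with_validation(fake_llm, "  Classify   this\nquery ", {"intent", "confidence"})
print(result["intent"])
```

In a real system the retry would usually also log the rejected output, since repeated schema failures are a useful signal that the prompt needs tightening.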

Lessons Learned from Agentic AI Implementations

When building with Agentic AI, most enterprise value today concentrates around two foundational archetypes — the “Doers” (workflow automation) and the “Thinkers” (intelligence and reasoning). The highest-value use cases increasingly combine both—using intelligence to determine the optimal flow, and structured automation to execute it end to end, often at varying levels of intensity along the spectrum. While this practical hybrid pattern drives immediate business ROI, a smaller, highly specialised frontier of autonomous multi-agent networks is emerging to push the boundaries of open-ended planning and adaptation.

I shall share key learnings from two such Agentic AI systems I recently implemented — one focused on operational execution, and the other on high-stakes cognitive reasoning:

The "Doer" (Workflow Automation): Implementing an agentic chatbot that autonomously resolved simpler queries and seamlessly diverted complex cases to Genesys for human intervention — significantly reducing manual agent workload.

The "Thinker" (Intelligent Reasoning): Building a multi-agent solution that analysed open order notes to infer the root causes of service delays, helping avoid SLA violation fines for issues beyond operational control.

What is an Agent?

It is useful to begin with a clear definition of what an agent is. At its core, an agent is a software component that takes an input and produces an output. It is uniquely identifiable (often by a name) and operates within defined safety guardrails.

An agent leverages tools, APIs, and large language models (LLMs) to perform tasks, guided by business logic encoded in its instructions. Its execution is governed by configuration settings such as model selection, thresholds, and control parameters, while its behaviour is continuously monitored and analysed through metrics, logs, and traces.

Crucially, an agent functions with awareness of its broader context. This context is retrieved dynamically, either fetched on demand or drawn from persistent knowledge sources such as vectorised or graph-based knowledge bases. These repositories are typically built from existing business documentation and enriched through ongoing collaboration with domain experts.
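One way to make this definition concrete is a small data structure. The sketch below is an assumption about how the pieces could fit together, not a framework API: the field names, the `run` method and the example agent are all hypothetical, and the LLM call and context retrieval are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    name: str                                                   # unique identifier
    instructions: str                                           # business logic in natural language
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    config: dict[str, object] = field(default_factory=dict)     # model, thresholds, ...
    guardrails: list[Callable[[str], bool]] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)              # stand-in for logs and traces

    def run(self, user_input: str, tool_name: str) -> str:
        """Check guardrails, call one tool, and record what happened."""
        if not all(check(user_input) for check in self.guardrails):
            self.trace.append(f"{self.name}: input blocked")
            return "Request blocked by guardrail."
        output = self.tools[tool_name](user_input)
        self.trace.append(f"{self.name}: {tool_name} -> {output}")
        return output


order_agent = Agent(
    name="order_status_agent",
    instructions="Answer order status queries only.",
    tools={"lookup": lambda order_id: f"Order {order_id} is in transit"},
    config={"model": "example-model", "temperature": 0},
    guardrails=[lambda text: len(text) < 50],
)

print(order_agent.run("A123", "lookup"))
print(order_agent.trace)
```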

Technical Tips and Lessons Learned from Agentic AI Implementations:

1. Start with Business Clarity

Establish a clear understanding of the business problem, measurable outcomes, and expected business ROI.
Engage a knowledgeable product owner or end user early to capture domain knowledge in the form of RAG, knowledge graphs, and agent instructions.
Continuously validate that the solution aligns with business outcomes and ROI.
Start with use cases that need workflow or reasoning agents before moving into autonomous systems.

2. Redesign Workflows, Don’t Just Automate

Map the end-to-end workflow — especially for workflow automation use cases.
Be bold in challenging and redesigning workflows with the business team, not just digitising existing ones.
Decompose workflows into agent responsibilities, factoring in access controls, tools and APIs.
If an intent recognition layer is required, start with LLM-based classification and move to a dedicated ML model as labelled data matures.

3. Agent Architecture & Orchestration

Design modular agents with a single responsibility for better reliability and interpretability.
Implement the main orchestrator agent such that it delegates tasks to sub-agents, regains control after execution and supports sequential or parallel execution with retries.
Balance deterministic vs non-deterministic behaviour by prioritising deterministic logic wherever possible. Implement core logic through tools/APIs and reserve LLM reasoning for resolving ambiguity or making contextual decisions.
Avoid deep tool chaining to maintain simplicity and debuggability.
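The orchestration pattern described above (delegate, regain control, run sequentially or in parallel, retry on failure) might look roughly like this. It is a simplified sketch in which sub-agents are plain Python callables; the agent names and the flaky behaviour are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(agent, task, max_retries=2):
    """Run one sub-agent, retrying on failure; control always returns to the caller."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return agent(task)
        except Exception as error:  # a real system would catch narrower error types
            last_error = error
    return f"FAILED: {last_error}"


def orchestrate(task, sequential_agents, parallel_agents):
    """Run some sub-agents in order, then fan out to the others in parallel."""
    results = {}
    for name, agent in sequential_agents.items():
        results[name] = run_with_retries(agent, task)
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_with_retries, agent, task)
                   for name, agent in parallel_agents.items()}
        for name, future in futures.items():
            results[name] = future.result()
    return results


attempts = {"count": 0}


def flaky_lookup(task):
    """Hypothetical sub-agent that fails once before succeeding."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TimeoutError("temporary failure")
    return f"looked up {task}"


results = orchestrate(
    "order A123",
    sequential_agents={"lookup": flaky_lookup},
    parallel_agents={"summarise": lambda task: f"summary of {task}",
                     "classify": lambda task: "delivery_query"},
)
print(results)
```

Keeping retries inside the orchestrator, rather than inside each sub-agent, keeps the sub-agents single-purpose and makes failures visible in one place.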

4. Data & Context Engineering

Ensure availability of relevant data to identify patterns (e.g. frequent queries) and generate test cases.
Dedicate time to analysing how context shapes expected outcomes in sample data using NLP and LLMs; these insights are critical for optimizing reasoning and output quality.
Pre-process data (e.g. acronym expansion, text clean-up) to keep the context lean and relevant and to reduce token usage and cost.
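A pre-processing step of this kind can be quite small. The sketch below assumes a hand-made acronym glossary and a sample note, both invented for illustration; in practice the glossary would come from domain experts.

```python
import re

# Hypothetical acronym glossary.
ACRONYMS = {"SLA": "service level agreement", "ETA": "estimated time of arrival"}


def preprocess(text):
    """Clean free text and expand known acronyms before it is sent to an LLM."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML-like tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

    def expand(match):
        word = match.group(0)
        return ACRONYMS.get(word, word)

    # Replace whole upper-case words only, so ordinary words are left alone.
    return re.sub(r"\b[A-Z]{2,}\b", expand, text)


note = "<p>Customer   asked for ETA;\n\n SLA breach likely.</p>"
print(preprocess(note))
```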

5. Memory, State & Persistence Strategy

Store session metadata for sharing across agents.
Persist session state with session caching (e.g. Redis) for fast access.
Persist session state within a NoSQL quick retrieval DB (like Firestore) for session restoration in low-latency scenarios.
Apply semantic caching (e.g. backed by a FAISS vector index) for general or repeatable queries to reduce latency and optimise token cost.
For predominantly user-specific queries, semantic caching offers limited benefit. Instead, persist session context in long-term storage (GCS/S3) with an appropriate expiry, enabling retrieval via composite indexing (e.g. Firestore) when needed.
Store transaction-level data in an analytical data warehouse (e.g. BigQuery on GCP) for downstream analytics.
Implement session state management, archival and transaction logging as a bare minimum.
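To show the idea behind semantic caching without external dependencies, the sketch below uses a toy bag-of-words embedding and a brute-force similarity search. Both are stand-ins chosen for this example: a real system would use an embedding model and a vector index, and the threshold value is an assumption.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, query):
        """Return the cached answer of the most similar stored query, or None."""
        vector = embed(query)
        best = max(self.entries, key=lambda entry: cosine(vector, entry[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


cache = SemanticCache()
cache.put("What are your opening hours?", "We are open 9am to 5pm.")

print(cache.get("what are your opening hours"))   # cache hit: no LLM call needed
print(cache.get("How do I reset my password?"))   # None: fall through to the LLM
```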

6. Performance, Cost & Scalability Optimisation

Implement monitoring callbacks early to track latency (e.g. time to first token, end-to-end), throughput, token usage and cost.
Handle LLM rate limits by reducing LLM API calls through deterministic flows and routing low-priority requests to smaller models. When a single LLM serves several parallel agents, retry rate-limited calls with a backoff whose delays follow the Fibonacci sequence (1, 1, 2, 3, 5, ...), capped at a maximum number of retries.
Leverage batch pre-generation where applicable for speed and cost savings.
Regularly analyse usage patterns to optimise cost vs performance trade-offs.
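The Fibonacci-based retry for rate limits mentioned above could be sketched like this. `RateLimitError` is a hypothetical exception standing in for a provider's rate-limit response, and the `sleep` parameter is injectable only so that the example runs instantly.

```python
import time


def fibonacci_delays(max_retries, base_seconds=1.0):
    """Yield waiting times that follow the Fibonacci sequence: 1, 1, 2, 3, 5, ..."""
    previous, current = 0, 1
    for _ in range(max_retries):
        yield current * base_seconds
        previous, current = current, previous + current


class RateLimitError(Exception):
    """Hypothetical error type standing in for a provider's rate-limit response."""


def call_with_backoff(request, max_retries=5, base_seconds=1.0, sleep=time.sleep):
    """Call `request`; on a rate-limit error wait and retry, up to `max_retries` times."""
    for delay in fibonacci_delays(max_retries, base_seconds):
        try:
            return request()
        except RateLimitError:
            sleep(delay)
    return request()  # final attempt: let any error propagate to the caller


# Stub request that is rate-limited twice before succeeding.
waits = []
outcomes = iter([RateLimitError(), RateLimitError(), "ok"])


def fake_request():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


answer = call_with_backoff(fake_request, sleep=waits.append)
print(answer, waits)
```

Fibonacci delays grow more gently than doubling delays, which can help when many parallel agents share one quota. Adding random jitter to each delay is a common refinement not shown here.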

7. Safety, Security & Governance

Implement multi-layer guardrails for pre-input validation, pre-tool invocation checks and post-output validation.
Use centralised guardrails where possible; otherwise build reusable safety components.
Enforce secure access controls using token-based tool authorisation (e.g. JWT) with an authentication server, token rotation for sensitive use cases and Secrets Manager via IAM for high security.
Continuously validate adherence to policies and constraints: use agent instructions for simple constraints and dedicated validation agents for complex policies. Since an additional sequential agent adds latency, run the validation agent in parallel if the workflow permits.
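The three guardrail layers (pre-input, pre-tool and post-output) can be pictured as three small checks wrapped around a tool call. The rules below, a length limit, a tool allow-list and a 16-digit number filter, are deliberately simple placeholders chosen for this sketch.

```python
import re

ALLOWED_TOOLS = {"order_lookup", "faq_search"}   # hypothetical allow-list
CARD_NUMBER = re.compile(r"\b\d{16}\b")


def check_input(user_text):
    """Pre-input validation: reject empty or oversized requests."""
    return 0 < len(user_text.strip()) <= 500


def check_tool_call(tool_name):
    """Pre-tool check: only allow tools this agent is authorised to call."""
    return tool_name in ALLOWED_TOOLS


def check_output(answer):
    """Post-output validation: block answers that contain a 16-digit number."""
    return CARD_NUMBER.search(answer) is None


def guarded_run(user_text, tool_name, tool):
    """Run `tool` only if every guardrail layer passes."""
    if not check_input(user_text):
        return "Rejected: invalid input"
    if not check_tool_call(tool_name):
        return "Rejected: tool not permitted"
    answer = tool(user_text)
    if not check_output(answer):
        return "Rejected: unsafe output"
    return answer


print(guarded_run("Where is my order?", "order_lookup", lambda text: "It ships tomorrow."))
print(guarded_run("Where is my order?", "delete_account", lambda text: "done"))
print(guarded_run("Show my card", "faq_search", lambda text: "Card 4111111111111111"))
```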

8. Observability, Logging & Debugging

Enable comprehensive logging across user interactions, agent behaviour and cloud services for post go-live analysis.
Make it a practice to capture and analyse prompts, generated outputs and agent reasoning traces to derive insights that help debug, fine-tune prompts, and improve system behaviour during development.

9. Human-in-the-Loop & Trust Design

Use confidence scoring and reflection to route low-confidence responses for human review.
Provide clear explanations to end users to enable informed decision-making and aid debugging of incorrect workflows.
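Confidence-based routing with an explanation attached can be sketched in a few lines. How the confidence score is produced (self-reported, reflection-based, or from a separate scorer) is left open here, and the threshold of 0.75 is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    destination: str   # "user" or "human_review"
    answer: str
    explanation: str   # shown to the end user or reviewer


def route(answer, confidence, threshold=0.75):
    """Send low-confidence answers to human review and explain the decision."""
    if confidence >= threshold:
        reason = f"Confidence {confidence:.2f} meets the threshold of {threshold:.2f}."
        return Decision("user", answer, reason)
    reason = f"Confidence {confidence:.2f} is below the threshold of {threshold:.2f}."
    return Decision("human_review", answer, reason)


print(route("Delay caused by customer rescheduling.", 0.91))
print(route("Delay cause unclear from notes.", 0.40))
```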

10. Developer Experience & Operational Discipline

Standardise agent structure with the folder structure below instead of a single Python file:

/agent-name (folder)
├── agent.py (file)
├── description.md (file)
└── instructions.md (file)

Version control prompts to maintain history across releases and compare outputs for regression analysis.
Build a regression test suite incrementally from Sprint 1 to ensure consistency.
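With the folder layout above, prompts live in their own files and can be loaded, versioned and compared like any other artefact. The sketch below is illustrative only: the helper names, the example agent folder and the test-case ids are all hypothetical.

```python
import tempfile
from pathlib import Path


def load_agent_definition(agent_dir):
    """Read description.md and instructions.md from an agent folder."""
    folder = Path(agent_dir)
    return {
        "name": folder.name,
        "description": (folder / "description.md").read_text(encoding="utf-8").strip(),
        "instructions": (folder / "instructions.md").read_text(encoding="utf-8").strip(),
    }


def find_regressions(baseline, current):
    """Return the test-case ids whose output changed between two prompt versions."""
    return sorted(case for case in baseline if current.get(case) != baseline[case])


# Build a throwaway agent folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    agent_dir = Path(tmp) / "billing-agent"
    agent_dir.mkdir()
    (agent_dir / "description.md").write_text("Handles billing queries.\n", encoding="utf-8")
    (agent_dir / "instructions.md").write_text("Answer only billing questions.\n", encoding="utf-8")
    definition = load_agent_definition(agent_dir)

baseline_outputs = {"case-1": "refund approved", "case-2": "escalate to human"}
current_outputs = {"case-1": "refund approved", "case-2": "refund approved"}

print(definition)
print(find_regressions(baseline_outputs, current_outputs))
```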

11. Continuous Improvement & Value Tracking

Regularly analyse monitoring metrics and transaction data during development and post go-live.
Use insights to refine agent behaviour, validate business impact and identify new optimisation opportunities.

12. Prototyping & User Feedback

Build UI prototypes early to allow users to interact with the system and capture early real-world feedback.
Iterate rapidly based on user experience and observed behaviour.

To wrap up, the most successful agentic AI implementations are not those that maximise autonomy, but those that strike the right balance between structured control and intelligent flexibility — anchored firmly in business value.

The Agentic Data Scientist: How Code Assist LLMs Drive Peak Productivity

Organizations have long chased the myth of the "unicorn" data scientist — a hybrid professional possessing elite statistical expertise, deep software engineering mastery, and sharp business acumen. In practice, this profile was nearly impossible to find. Traditionally, math and statistics specialists focused heavily on modelling and interpreting outputs, leaving deployment to software engineers. While this handoff phase worked, it naturally introduced operational friction, extended development cycles, and fractured end-to-end project ownership.

Large Language Models (LLMs) have shattered this paradigm, finally making the unicorn data scientist concept achievable through virtual pair programming. Furthermore, agentic AI enables a stronger emphasis on building reliable systems for end users while seamlessly embedding AI capabilities within the existing IT landscape.

Importantly, data scientists do not need to alter their preferred workflows; they can still experiment and develop code freely within Jupyter Notebooks. LLMs act as an automated systems engineering partner, restructuring and hardening that experimental code to a level much closer to production-grade software. Because the generated code bakes in development best practices alongside comprehensive documentation, it relieves both data scientists and software engineers of much of the tedious manual refactoring.

This efficiency gain directly tackles the long-standing MLOps bottleneck in which ~70% of POCs failed to reach production. The transition is demonstrated by the repository agentic_chatbot, where experimental notebook segments were automatically refactored into an enterprise-grade codebase using Amazon Q (Claude Sonnet 4.6) in a couple of hours.

Here is how LLM technologies systematically eliminate traditional deployment bottlenecks.

1. Reducing Technical Debt & Engineering Friction

LLMs instantly bridge the gap between analytics-focused logic and production software patterns, turning notebook cells into robust applications.

Automated Production-Grade Code: LLMs transform experimental notebook code into production-ready modules, automatically embedding core software practices such as structured logging, type hints, and error handling.
Rapid Prototyping: LLMs accelerate the generation of back-end APIs (e.g., via FastAPI) to serve the model. This allows frontend developers to quickly build user interfaces, bringing forward the end-user experience and securing early validation feedback.
Quick Containerisation: LLMs scan central framework or local repositories, deduce necessary package versions, and generate accurate Dockerfiles and container configurations to guarantee managed execution in the cloud.
Automated Infrastructure and DevOps Setup: LLMs orchestrate full DevOps workflows by parsing source code to automatically deliver production-ready CI/CD configurations and Terraform scripts complete with integrated linting, testing, and cloud provisioning.
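As a small illustration of what "production-grade" means in the first point above, a one-line notebook calculation might be turned into a function like the one below. The function name, the logger name and the 15-minute threshold are assumptions made for the example; it is not taken from the agentic_chatbot repository.

```python
import logging

logger = logging.getLogger("delay_model")


def delay_rate(delays_minutes: list[float], threshold: float = 15.0) -> float:
    """Return the share of flights delayed by more than `threshold` minutes."""
    if not delays_minutes:
        raise ValueError("delays_minutes must not be empty")
    delayed = sum(1 for delay in delays_minutes if delay > threshold)
    rate = delayed / len(delays_minutes)
    # Extra fields travel with the log record for structured log handlers.
    logger.info("delay_rate computed", extra={"n": len(delays_minutes), "rate": rate})
    return rate


print(delay_rate([5, 20, 30, 10]))
```

The notebook version of this would typically be a single expression with no validation, typing or logging; the refactored version adds all three without changing the result.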

2. Streamlining Data Infrastructure & Pipeline Complexity

A POC runs on clean, static data snapshots, whereas production requires ingesting messy, live streaming data. LLMs automate and abstract the engineering overhead needed to keep models running smoothly.

Accelerated Data Generation: LLMs excel at generating realistic synthetic data, allowing teams to present functional prototypes and help stakeholders visualize outcomes without waiting for lengthy organizational approvals to access production data.
Efficient Data Pre-processing: LLMs drastically slash feature engineering time by rapidly generating high-quality code for data wrangling, statistical analysis, and plotting.
Workflow Optimization: LLMs analyse query patterns to optimize database logic (e.g., via partitions and joins), rewrite resource-heavy segments for maximum execution speed, and document data steps.
Mitigating Data Skew: LLMs save time by generating code that continuously tracks, tags, and aligns feature definitions across both training and production databases for schema validation and skew monitoring.

3. Resolving Organisational & Operational Misalignment

Traditional engineering handoffs often slow down due to missing context and misaligned objectives. LLMs automate translation tasks that used to drain technical resources.

Instant Documentation & Code Explanations: LLMs analyse complex code repositories to instantly generate comprehensive docstrings, README files, architectural flowcharts, and clear semantic explanations of how the underlying algorithm works.
Dismantling Infrastructure Silos: Because LLMs automatically output clean and documented code, the time engineering teams take to review and approve a model is radically reduced.
Automated Regulatory Auditing: LLMs parse data pipelines to construct automated data-lineage audits, enforce data governance, and apply automated guardrails such as masking Personally Identifiable Information (PII).
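A PII-masking guardrail of the kind mentioned above can start as a pair of regular expressions. The patterns below are deliberately simplified assumptions that cover only emails and long phone-like digit runs; real deployments usually rely on a dedicated PII detection service or library.

```python
import re

# Simplified patterns for illustration; they will miss many real-world formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def mask_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(mask_pii("Contact jane.doe@example.com or +44 20 7946 0958."))
```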

4. Automating Governance, Security, & Compliance

Adhering to strict enterprise security and legal compliance frequently stalls model deployment. LLMs automate these defensive checks to accelerate risk reviews.

Accelerated Security Reviews: LLMs scan codebases to flag accidental hardcoded secrets (such as API keys and passwords) and verify that permissible role-based access controls are strictly enforced.
Beyond "Black Box" Solutions: LLMs instantly translate technical model behaviours and outputs into plain-language explanations. This quickens corporate adoption by providing immediate documentation for legal clearances and making results highly relatable for business users.
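The hardcoded-secret check described under Accelerated Security Reviews is, at its simplest, a pattern scan over source lines. The sketch below uses two illustrative patterns (the well-known `AKIA` prefix of AWS access key IDs, and quoted values assigned to credential-like names); the pattern set and sample source are assumptions, and dedicated scanners cover far more cases.

```python
import re

SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_credential": re.compile(
        r"\b(password|passwd|secret|api_key|token)\b\s*=\s*['\"][^'\"]{4,}['\"]",
        re.IGNORECASE,
    ),
}


def scan_source(source):
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


sample = "\n".join([
    "import os",
    'API_KEY = "sk-test-1234567890"',
    'password = os.environ["DB_PASSWORD"]',
    'aws_key = "AKIAIOSFODNN7EXAMPLE"',
])

print(scan_source(sample))
```

Line 3 of the sample is not flagged because the value is read from the environment rather than written as a literal, which is the behaviour a reviewer would want.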

The Productivity Payoff

By leveraging LLM technologies to bridge the gap between mathematics and software engineering, data scientists are no longer bound by traditional organizational silos. They are finally empowered to operate as true end-to-end "unicorns," driving peak productivity and ensuring their models deliver rapid, tangible business value.

Wednesday, 24 June 2026

From Classical Data Science to Agentic AI Engineering: Why It Feels Less Exciting Than It Should

After nearly 10 years in classical machine learning, I now find myself working on agentic AI systems.

It’s an exciting space, no doubt.
And yet, I feel… bored at times.

This isn’t something many people openly admit, especially when working with “cutting-edge” technologies. But if you’ve made a similar transition, you might recognise this feeling too. This isn’t about the technology being less powerful. It’s about how the nature of the work has changed.

From Craft to Assembly

Classical machine learning felt like a craft.

You worked closely with data — engineering features, tuning models, and understanding behaviour through clear cause and effect. The satisfaction came from building something that you could reason about deeply.

With agentic AI, the work has shifted.

Instead of building intelligence, we often orchestrate it:

· Designing prompts

· Integrating APIs

· Managing workflows across tools and agents

It can feel less like engineering from first principles, and more like assembling systems from pre-built components. The intellectual depth hasn’t disappeared — but it has shifted away from where many of us derived satisfaction.

Abstraction and Loss of Control

Modern AI systems are extremely powerful — but also highly abstracted.

In classical ML, when something failed, you could trace it back — data leakage, model choice, tuning gaps, bias/variance trade-offs.

In agentic systems, failures are often opaque:

· Is it the prompt?

· Context ambiguity?

· The model’s internal reasoning?

You lose the tight feedback loop between hypothesis and outcome, and that can make the work feel less intellectually satisfying.

The Rise of “Glue Work”

A significant part of agentic AI today involves:

· Prompt iteration

· Output formatting

· Guardrails and validation

· Tool integration

This “glue engineering” is necessary, but often repetitive. The challenge shifts from solving domain problems to making systems behave reliably.

Determinism vs. Probabilistic Behaviour

Another shift is from relative determinism to probabilistic behaviour.

Traditional ML systems were not perfectly predictable, but they were understandable within known bounds.

Agentic systems are more variable:

· The same input may produce different outputs across invocations

· Small prompt changes can have large effects

Instead of debugging logic, you spend time managing uncertainty. And that can be frustrating.

Expectation vs. Reality

Agentic AI is often described as autonomous and self-directed.

In practice, much of the work involves:

· Adding constraints

· Designing safeguards

· Keeping humans in the loop

Much of the work becomes about limiting the system rather than unleashing it — which can feel like a step back from the original promise.

So Why Does It Feel Less Engaging?

Putting it all together, it isn’t about the technology being trivial — it’s about a mismatch in where the intellectual challenges lie.

In classical ML, the focus was on: Understanding and modelling the problem.

In agentic AI, the challenge is increasingly: Managing reliability, orchestration, and behaviour of black-box intelligence.

If you were energised by depth, optimisation, and mathematical clarity, the current landscape can feel like it lacks those same hooks.

Where the Real Challenge Is Moving

The depth hasn’t disappeared — it has relocated.

Areas that feel genuinely challenging include:

· Evaluation of LLM and agent outputs

· Building reliable and observable systems

· Designing hybrid architectures combining ML and agents

This is where prior experience in classical ML becomes valuable again.

A Personal Reflection

Maybe this isn’t boredom — maybe it’s transition.

I’ve moved from being a builder of models to a designer of intelligent systems. And somewhere along the way, I’ve lost that familiar sense of control and mastery I built over years in classical machine learning — that shift feels noticeable, even though it isn’t about Agentic AI being less interesting.

If you come from a world of deep modelling and mathematical clarity, it’s natural to feel like something is missing at first.

Perhaps the real shift is this: to deliberately seek out the new depth in this space — rather than expecting it to appear where it once did.

Monday, 23 January 2017

My Works

Welcome!

This page contains details of my academic work, i.e. project and coursework reports. Please use the navigation on the right to read my write-ups on data science and machine learning. Sample code for some of my work is available at: GitHub

Excerpts from my MBA dissertation in behavioural finance published by Taylor & Francis in Journal of Gender Studies: Click for report

SPSS was used for data analysis

Data Science (DS) coursework on yelp dataset: Click for report

Python was mainly used for analysis
A suite of other tools (refer to the picture below) was also used; the details are provided in the report

Visual Analytics (VA) coursework on the impact of financial consolidation in the UK, using Bank of England household survey data and related datasets from the Office for National Statistics, UK: Click for report

Only R programming was used for this coursework

Computer Vision (CV) coursework on face recognition, emotion recognition and augmented reality. Feature extraction was done with HOG, SURF and LBP. Classification was done with SVM, Random Forest and Feedforward neural network. Click for report and code.

MATLAB was used for this coursework

A binary classification model was created using Random Forest to predict airline delays. A flight is classified as delayed if the delay exceeds 15 minutes. Click for summary and code.