
Everyone can prototype with AI now. Here is why that is just the start.

No-code tools have made it possible for anyone to build a working AI prototype. But the gap between a prototype and a product is where most ideas die. This is what we have learned so far.

Tremoli

What you will learn

  • Why anyone can now build a working AI prototype in days
  • The real gap between a demo that impresses and a product that lasts
  • What we learned building CLARA, from prototype to 1,000+ users
  • What "production-ready" actually means in practice
  • How we think about what to prototype and what to engineer properly

Something remarkable has happened in the last two years. The tools for building with AI have become so accessible that almost anyone can create a working prototype. A marketing manager can build an email triage bot. A teacher can create a document summariser. A founder can wire together an AI-powered workflow over a weekend.

This is genuinely exciting. The confidence that comes from building something that actually works is transformative, and more people should experience it.

But starting and finishing are different disciplines. The gap between a prototype that works on your laptop and a product that works reliably for thousands of users is where most ideas quietly die. We have been on both sides of that gap, and this is what we have learned.

Key Insight

The no-code revolution has not eliminated the need for engineering. It has moved the starting line. The best ideas now begin with a working prototype, not a requirements document. What happens after that prototype is what determines whether they succeed.

01

The prototype revolution is real

This is not hype. No-code and low-code platforms have matured to the point where non-technical people can build genuinely useful things:

  • Automated workflows that sort emails, extract key data, and update systems without human input
  • AI assistants trained on internal documents and knowledge bases
  • Content tools that draft, summarise, and translate at speed
  • Internal apps with forms, dashboards, and approval flows
  • Data pipelines that pull from multiple sources and generate reports

McKinsey's 2024 Global Survey found that while 72% of organisations had adopted AI in at least one function, the biggest barrier to capturing value was not technology. It was the distance between the people who understood the problems and the people who could build solutions [1]. No-code tools are compressing that distance. For the first time, the person with the idea can also be the person who builds the first version. As Chris Anderson argued in Makers, we are entering an era where the tools of production are being democratised at an unprecedented rate [11].

Ethan Mollick, professor at Wharton and author of Co-Intelligence, argues that the best way to understand AI is to build with it, not to read about it [2]. We experienced this ourselves. When we started building CLARA, an AI research product we collaborated on with the University of Oxford, the early prototypes taught us more about what researchers actually needed than any amount of upfront planning could have. Once something is working, the conversation changes from "is this possible?" to "what else could I do?"

02

Where the gap opened up for us

This is where most articles about no-code stop. For us, this is where the real work started.

The hallucination problem at scale

When we were testing early versions of CLARA with 20 queries, an AI that was right 92% of the time felt impressive. But researchers do not need impressive. They need correct. At 2,000 queries a day, an 8% error rate means 160 wrong answers. Some of those would be confidently wrong, which is worse than having no answer at all. Research from Stanford's Institute for Human-Centered AI has shown that even state-of-the-art models produce factually incorrect outputs at rates that are unacceptable for high-stakes applications without additional engineering safeguards [5].

Getting AI to behave reliably was not about writing better prompts. It required structured evaluation frameworks, retrieval-augmented generation to ground responses in real data, systematic testing across edge cases, and ongoing monitoring. Chip Huyen covers this comprehensively in AI Engineering, arguing that the real challenge of deploying AI is not building models but building the systems around them [12]. These are engineering problems, not prompt engineering problems.
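To make "structured evaluation framework" concrete, here is a minimal sketch of where one starts: a fixed test set run on every change, with failures collected rather than eyeballed. The `ask_model` function and the normalised containment check are hypothetical stand-ins; real graders are usually fuzzier and domain-specific.

```python
# Minimal evaluation harness sketch. `ask_model` is a hypothetical
# callable that returns the system's answer for a query.

def normalise(text: str) -> str:
    # Crude normalisation so trivial formatting differences do not fail a case.
    return " ".join(text.lower().split())

def evaluate(ask_model, test_cases):
    """Run every (query, expected) pair; return accuracy and the failures."""
    failures = []
    for query, expected in test_cases:
        answer = ask_model(query)
        if normalise(expected) not in normalise(answer):
            failures.append((query, expected, answer))
    accuracy = 1 - len(failures) / len(test_cases)
    return accuracy, failures

# Example with a stub model that only knows one answer:
cases = [
    ("capital of France?", "Paris"),
    ("author of Hamlet?", "Shakespeare"),
]
accuracy, failures = evaluate(
    lambda q: "The answer is Paris." if "France" in q else "Unknown", cases
)
```

The point is not the grading logic, which is deliberately naive here, but that every change to prompts, retrieval, or models gets scored against the same cases, so regressions show up as numbers rather than anecdotes.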

CLARA Research is an AI research assistant built as a collaboration with the University of Oxford [6]. It helps researchers explore, search, and analyse large archival document collections and returns answers with full citation trails so every response can be verified.

The first prototype could search documents and return relevant passages. It was impressive in a demo. But it was not a product. Making it into one meant solving a completely different set of problems: building retrieval pipelines that handled the nuances of academic documents, implementing citation verification so researchers could trace every answer back to a source, designing for the kind of precision that academic work demands, and testing relentlessly against edge cases where a wrong answer would undermine trust.

That journey, from a prototype that searched documents to a product that over a thousand researchers now use, is the clearest example we can give of what "the gap" actually looks like. A prototype can find answers. A product has to be right, and has to prove that it is right.
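The citation-verification idea above can be sketched as a simple containment check: before an answer is shown, every quoted passage must be findable, verbatim, in the document it cites. The data shapes and function here are illustrative assumptions, not CLARA's actual implementation.

```python
# Sketch of citation verification: reject any citation whose quoted
# passage does not appear in the cited source document.

def verify_citations(citations, documents):
    """citations: list of (doc_id, quoted_passage) pairs.
    documents: mapping of doc_id -> full source text.
    Returns the citations that could not be verified."""
    unverified = []
    for doc_id, passage in citations:
        source = documents.get(doc_id, "")
        if passage.strip() not in source:
            unverified.append((doc_id, passage))
    return unverified

docs = {"letter_1912": "The expedition departed in March under heavy snow."}
ok = verify_citations([("letter_1912", "departed in March")], docs)
missing = verify_citations([("letter_1912", "departed in April")], docs)
```

A real pipeline has to cope with OCR noise, paraphrase, and page-level anchors, but even this crude check converts "trust the model" into "trace the answer".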

Security and data governance

When a tool handles real data, a set of questions emerges that no-code platforms are not designed to answer:

  • Where exactly is data being stored and processed? Which country? Which provider?
  • Is the AI model retaining inputs for training? Are you inadvertently feeding proprietary data into a public model?
  • Who has access, and how is that access controlled, logged, and audited?
  • Does this comply with GDPR, SOC 2, or sector-specific regulations?

The EU AI Act, which came into force in 2024, introduces specific requirements for high-risk AI systems around data governance, transparency, and human oversight [7]. The UK's ICO has issued clear guidance on the data protection implications of AI tools that process personal data [8]. These questions need architecture, not configuration, to answer properly.

Integration with the real world

Prototypes typically live in isolation. They take an input, process it, and give an output. Production systems have to live inside existing ecosystems:

  • Authenticating with enterprise identity providers
  • Reading from and writing to databases that other systems depend on
  • Handling API rate limits, timeouts, and partial failures gracefully
  • Maintaining state across sessions and users
  • Producing audit logs that meet compliance requirements

Each of these is solvable. None of them are solved by dragging and dropping components in a visual builder. Sam Newman's Building Microservices remains one of the best references for understanding the architectural complexity that emerges when systems need to communicate reliably at scale [13].
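One small, concrete example of the resilience work in the list above is retry logic for flaky upstream APIs. A hedged sketch, assuming transient failures (rate limits, timeouts) can be distinguished from permanent ones:

```python
# Bounded retries with exponential backoff: do not assume every
# upstream call succeeds, and do not retry forever either.

import time

class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. a rate limit or timeout."""

def call_with_retries(fn, max_attempts=4, base_delay=0.5):
    """Retry fn() on transient failures, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))

# A flaky stub that fails twice, then succeeds:
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("rate limited")
    return "ok"

result = call_with_retries(flaky, base_delay=0.01)
```

Production versions usually add jitter, per-endpoint budgets, and circuit breaking, but the shape is the same: failure is an expected input, not an exception to the design.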

The 2am test

When something breaks at 2am on a Tuesday, can anyone diagnose it? No-code tools abstract away the internals, which is a feature during building and a risk during debugging. Production systems need observability: structured logs, error alerting, performance metrics, and the ability to trace a failure back to its root cause.
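Structured logs are the cheapest piece of that observability to add early. A minimal sketch using Python's standard `logging` module, emitting one JSON object per event; the field names are our own illustrative convention, not a standard:

```python
# JSON-per-line structured logging: every event carries enough context
# (component, request id, parameters) to trace a 2am failure.

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured context attached via logging's `extra=`.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("retrieval")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("upstream search timed out",
             extra={"context": {"request_id": "req-123", "timeout_s": 30}})
```

Machine-readable events like these are what alerting and tracing tools consume; free-text print statements are what you grep through at 2am instead.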

Key Insight

The gap is not about intelligence or capability. Plenty of brilliant people build prototypes. The gap is about the accumulated engineering knowledge that makes software reliable, secure, and maintainable over time. That knowledge takes years to build and it cannot be replaced by a better interface.

03

What "production-ready" actually means

"Production-ready" is one of those phrases that gets used without much precision. When we were building CLARA, we had to get specific about what it meant for us. In practice, it meant a system that satisfied all of these:

The production-ready scorecard

  • Reliability: Works consistently under unexpected inputs, slow APIs, and load spikes. Degrades gracefully.
  • Security: Data encrypted in transit and at rest. Access controlled and auditable. No leaked attack surfaces.
  • Observability: Logs, metrics, and alerts that tell you what happened and where. No guesswork.
  • Maintainability: Someone other than the builder can understand, modify, and extend it. Documented and tested.
  • Compliance: Meets regulatory and policy requirements. Data handling, retention, and processing accounted for.
  • Scalability: Handles growth in users, data volume, and complexity without architectural rework.

A prototype might tick one or two of these. Production requires all of them. As Martin Fowler argues in Refactoring, the cost of changing software that was not designed for change increases significantly over time, and retrofitting quality into a system is almost always harder than building it in [10].

Practical Tip

Not everything needs to be production-ready. If a prototype is for personal use or a small team, most of these do not apply. But if it is going to serve real users, handle real data, or run unattended, they all do.

04

Why the prototype was still the best starting point

Despite everything we have just said about the gap, the prototype phase was the most valuable part of building CLARA. Not because the prototype became the product, but because it taught us what the product needed to be.

Before we had a working prototype, our understanding of what researchers needed was theoretical. Once they could actually use something, we learned things that no amount of planning would have uncovered. Which queries tripped up the AI. Which citation formats mattered. Where trust broke down.

The Standish Group's CHAOS Report has consistently found that fewer than 35% of software projects are completed on time, on budget, and with the required features, with poor requirements being a leading cause of failure [9]. A working prototype changes this completely:

  • The person with the problem has already worked through the edge cases
  • The engineering work is informed by real usage, not assumptions
  • Scope is grounded in reality, not aspiration

Step 1: Build a prototype. Use no-code tools. Prove the core idea works. Do not worry about polish.

Step 2: Use it yourself. Live with it for a week. Notice what breaks. Notice what you wish it could do.

Step 3: Understand the real requirements. The prototype tells you what the production version needs to do. Working examples beat documents.

Step 4: Engineer and ship. Build the production version informed by real usage, real edge cases, real requirements.

The prototype does not need to be elegant. It needs to be specific. A messy working example communicates more than a polished proposal ever could.

05

How we think about what to build

Not every idea needs production engineering. Not every idea should stay as a prototype. When we are deciding what to build at Tremoli, we think about three things:

  1. Who is going to use it? If it is just us or a small team, a well-maintained no-code workflow might be all it ever needs. If real users are going to depend on it, it needs more.
  2. What happens when it fails? If a failure is a minor inconvenience, the tolerance for imperfection is high. If it means wrong answers, lost data, or broken trust, the tolerance is zero.
  3. How long does it need to last? A three-month experiment has different requirements than a product that will be running in five years.

Tier 1: Keep it simple. Personal automations, team tools, non-sensitive data. Build and maintain it yourself. Most ideas start here.

Tier 2: Validate then decide. Value is uncertain. Build a prototype, test with real users, measure impact. Days, not months.

Tier 3: Engineer for production. Users depend on it. Data is sensitive. Reliability matters. This is what we did with CLARA.

We have found that the mistake is treating every idea as production-grade from the start. The cost and timeline become too high to justify experimentation, and nothing gets built.

06

What is ahead

The line between what anyone can build and what requires serious engineering is moving fast. Gartner has identified AI-augmented development as one of the defining technology trends through 2030 [3]. Microsoft's 2024 Work Trend Index found that the majority of AI users in organisations were already experimenting ahead of any formal strategy [4]. The tools will keep getting better.

But capability is not the same as readiness. The fact that a model can generate code does not mean that code is secure, tested, or maintainable. The fact that a visual builder can create a workflow does not mean that workflow handles failure gracefully. The gap between "it works" and "it works in production" is narrowing, but it has not closed.

We will keep building. CLARA was our first product to make this journey from prototype to production, and it will not be the last. Every product we build teaches us something new about what it takes to make AI work reliably in the real world. Eric Ries made the case for rapid prototyping a decade ago in The Lean Startup [14]. What has changed is that AI tools have made the prototyping phase accessible to everyone, not just developers. Andrew Ng's AI For Everyone course made a similar argument from the technical side: the biggest value comes from deploying AI broadly, not from perfecting individual models [15].

We are excited about what comes next.

Key Takeaways

  • Anyone can build a working AI prototype with today's tools. The learning is immediate and the confidence is lasting.
  • The gap between prototype and production is real. We experienced it firsthand building CLARA: reliability, security, observability, compliance, and scalability all require engineering discipline.
  • Prototypes are not throwaway work. CLARA's prototype taught us more about what researchers needed than any amount of planning could have.
  • Not everything needs to be production-grade. Most ideas should start small and graduate when the evidence supports it.
  • The tools are getting better every year. The line is moving. But for now, building products people depend on still takes serious engineering.

References

  1. McKinsey & Company (2024). The state of AI in early 2024: Gen AI adoption spikes and starts to generate value. McKinsey Global Survey on AI.
  2. Mollick, E. (2024). Co-Intelligence: Living and Working with AI. Portfolio/Penguin.
  3. Gartner (2024). Top Strategic Technology Trends for 2025. Gartner Research.
  4. Microsoft (2024). 2024 Work Trend Index Annual Report. Microsoft WorkLab.
  5. Stanford Institute for Human-Centered Artificial Intelligence (2024). AI Index Report 2024. Stanford HAI.
  6. CLARA Research (2025). AI research assistant, a collaboration with the University of Oxford. clara-research.com
  7. European Commission (2024). Regulation (EU) 2024/1689: The Artificial Intelligence Act. Official Journal of the European Union.
  8. Information Commissioner's Office (2024). Guidance on AI and Data Protection. ICO, United Kingdom.
  9. The Standish Group (2024). CHAOS Report 2024: Decision Latency Theory. The Standish Group International.
  10. Fowler, M. (2018). Refactoring: Improving the Design of Existing Code (2nd ed.). Addison-Wesley Professional.
  11. Anderson, C. (2012). Makers: The New Industrial Revolution. Crown Business.
  12. Huyen, C. (2025). AI Engineering: Building Applications with Foundation Models. O'Reilly Media.
  13. Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems (2nd ed.). O'Reilly Media.
  14. Ries, E. (2011). The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.
  15. Ng, A. (2019). AI For Everyone. DeepLearning.AI / Coursera.
