The Pilot Trap: Why AI Surveillance POCs Succeed and Deployments Don't | The Vigilant

The successful pilot that never became a deployment. Between 40 and 60 percent of AI POCs are abandoned before production — and the reasons are structural, not algorithmic.

We start with the phenomenon that has come up in more client conversations this year than any other single topic: the successful pilot that never became a deployment.

The pattern is consistent. A proof of concept runs for six to twelve weeks. Detection rates look strong. The vendor's team is responsive. Operators are engaged. Leadership signs off. And then something happens between that sign-off and operational reality — or, more often, nothing happens at all. The pilot sits as a successful experiment while the organisation returns to running security the way it always has.

This edition covers why that happens, which conditions make pilots succeed artificially, and what high-performing organisations do differently when they want pilot results to predict production performance.

Deep Dive

The failure rate for AI pilots in enterprise environments is not a secret. Depending on which analyst data you use, between 40 and 60 percent of AI proofs of concept are abandoned before reaching production. Gartner data from 2024 puts the share of generative AI POCs abandoned after completion at 60 percent. MIT research on enterprise AI more broadly finds that 95 percent of pilots, including those that technically reach production, deliver zero measurable business value.

Those figures cover AI broadly. There is no equivalent large-sample dataset for physical security AI specifically. But from conversations with security leaders across European enterprise environments this year, the pattern is recognisable and consistent: pilots that performed well on paper, that vendors cited as case studies, that never changed how security actually ran.

The reason is structural, and it starts before the pilot begins.

Why pilots are designed to succeed — and why that is the problem

A well-run AI surveillance pilot, from a vendor's perspective, is a sales process with a technical wrapper. That is not a criticism — it is an accurate description of the incentive structure. The vendor's A-team runs the engagement. Data is selected or pre-screened for conditions that favour the model. The evaluation window is short enough that seasonal drift, attacker adaptation, and infrastructure degradation never show up. Known-problem cameras — the backlit ones, the low-bitrate ones, the ones mounted at angles no one would choose in retrospect — are excluded in favour of the handful where detection will look cleanest.

Behind the scenes, false positives are filtered before they reach operators. Tuning happens daily, based on real-time feedback, at a pace and with a level of vendor involvement that will never exist once the contract is signed and support moves to a ticketing system.

The result is a set of metrics that accurately describe performance under those specific conditions and say almost nothing about performance in your environment, on your full camera estate, with your overnight shift, after eighteen months of model drift.

Pilots are also typically scoped to avoid complexity. A single use case — perimeter breach detection, loitering in a specific zone — on a small subset of cameras in a controlled area. That is a reasonable way to demonstrate technical feasibility. It is not a reasonable basis for a deployment decision covering a multi-site estate with diverse scene conditions, varying camera quality, and operators who will be managing twenty other responsibilities alongside the alert stream.

The evaluation criteria reinforce this. Most pilots are declared successful if they demonstrate technical feasibility and generate positive operator sentiment in a controlled setting. Production deployments need to meet hard operational SLAs — false alert rates per camera per shift, incident detection within defined time windows, measurable impact on the security outcomes that justified the investment. Those are different standards. Pilots almost never test for the second set.

The structural gap: pilot design versus production reality

Beyond the conditions that inflate pilot metrics, there is a deeper mismatch between how pilots are structured and what production deployments actually require.

Pilots typically run on isolated infrastructure, bypassing the enterprise identity systems, logging requirements, SIEM integration, and data governance controls that production will demand. They operate without the legal scrutiny (DPIAs, works council consultation, DPO sign-off) that the EU AI Act, GDPR, and national labour law can require for high-risk AI systems in operational deployment. In European contexts, those governance processes regularly force re-scoping once legal and compliance teams engage: analytics that drove pilot performance may not survive proportionality review.

The economic model shifts completely. Pilot pricing is discounted or absorbed. Compute is bundled. Vendor time is underpriced as a commercial investment. Production requires multi-year OPEX and CAPEX for licences, compute, integration, change management, and internal support capability. Organisations frequently discover, after a successful pilot, that the business case at estate scale was never established — and cannot be established without a level of investment that the pilot results do not justify.

The people and process requirements at production scale are categorically different. Pilots run on a small champion team with informal training and personalised support. Full deployment requires formal SOPs, union and HR alignment in European environments, training for operational staff who were not part of the pilot, 24/7 support structures, and — critically — internal capability to tune, monitor, and maintain the AI without permanent vendor involvement. When that capability does not exist, the system underperforms quietly until operator trust erodes and workarounds become standard practice.

What high-performing organisations do differently

The organisations that consistently move AI surveillance from pilot to production share a set of practices that are visible before the vendor conversation even starts.

They define hard success criteria before the pilot begins. Not "demonstrates capability" but specific, production-grade thresholds: maximum false alerts per camera per shift, minimum detection rate for the incident types that justify the investment, maximum latency under peak load, explicit no-go criteria that would prevent go-live regardless of other metrics. These criteria are defined by the people who will own the system in production — security operations, legal, IT — not by the project team running the pilot.
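As a minimal sketch of what "hard success criteria" can look like once written down, the snippet below encodes thresholds and explicit no-go flags and returns a go/no-go decision with reasons. Every name and number in it (`PilotCriteria`, `PilotResults`, `evaluate`, the threshold values) is a hypothetical placeholder for illustration, not a recommended value or an industry standard.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PilotCriteria:
    """Production-grade thresholds agreed before the pilot starts.

    The default values are illustrative placeholders only.
    """
    max_false_alerts_per_camera_shift: float = 2.0
    min_detection_rate: float = 0.90
    max_p95_latency_seconds: float = 5.0


@dataclass(frozen=True)
class PilotResults:
    """Measured pilot outcomes plus any explicit no-go conditions raised."""
    false_alerts_per_camera_shift: float
    detection_rate: float
    p95_latency_seconds: float
    no_go_flags: Tuple[str, ...] = ()  # e.g. ("DPIA not approved",)


def evaluate(criteria: PilotCriteria, results: PilotResults) -> Tuple[bool, List[str]]:
    """Return (go, reasons). Any missed threshold or no-go flag blocks go-live."""
    reasons: List[str] = []
    if results.false_alerts_per_camera_shift > criteria.max_false_alerts_per_camera_shift:
        reasons.append("false alert rate above threshold")
    if results.detection_rate < criteria.min_detection_rate:
        reasons.append("detection rate below threshold")
    if results.p95_latency_seconds > criteria.max_p95_latency_seconds:
        reasons.append("peak-load latency above threshold")
    reasons.extend(f"no-go: {flag}" for flag in results.no_go_flags)
    return (not reasons, reasons)


# Strong metrics do not override an explicit no-go condition.
go, reasons = evaluate(
    PilotCriteria(),
    PilotResults(1.4, 0.93, 3.2, no_go_flags=("DPIA not approved",)),
)
print(go, reasons)
```

The point of the design is that the decision is mechanical once the pilot ends: the people who own the system in production set the thresholds up front, and good detection numbers cannot outvote a no-go flag.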

They insist on environment realism. Pilots run on live production feeds, against a representative subset of the camera estate — old cameras and new ones, busy scenes and quiet ones, indoor and outdoor, day and night. Load is scaled to a realistic fraction of the target deployment. Network, storage, and failover constraints are active. The vendor's ability to tune against real conditions, not curated clips, is what gets evaluated.
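One way to make "a representative subset of the camera estate" concrete is stratified sampling: group cameras by the attributes that affect detection, then draw from every group so the awkward cameras cannot be quietly excluded. The sketch below assumes a simple camera inventory with hypothetical keys (`id`, `age`, `scene`, `location`); the function name and the attribute set are illustrative, and a real estate would likely need more strata.

```python
import random
from collections import defaultdict


def stratified_camera_sample(cameras, fraction, seed=0):
    """Pick roughly `fraction` of the cameras in every stratum, at least one each.

    `cameras` is a list of dicts with hypothetical keys:
    "id", "age" ("old"/"new"), "scene" ("busy"/"quiet"),
    "location" ("indoor"/"outdoor").
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    strata = defaultdict(list)
    for cam in cameras:
        strata[(cam["age"], cam["scene"], cam["location"])].append(cam["id"])

    selected = []
    for key in sorted(strata):
        ids = sorted(strata[key])
        # Small strata still contribute one camera, so rare conditions are covered.
        k = max(1, round(len(ids) * fraction))
        selected.extend(rng.sample(ids, k))
    return sorted(selected)


estate = [
    {"id": f"old-{i}", "age": "old", "scene": "busy", "location": "outdoor"}
    for i in range(40)
] + [
    {"id": f"new-{i}", "age": "new", "scene": "quiet", "location": "indoor"}
    for i in range(2)
]
print(stratified_camera_sample(estate, 0.1))
```

Day and night coverage is a matter of how long and when the pilot runs rather than which cameras are chosen, so it is not modelled here.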

They embed governance from day one. Rather than treating the pilot as a sandbox that will be cleaned up later, high-performing organisations use the pilot as the first phase of compliance. DPIAs and AI risk assessments are initiated at pilot stage. Logging and auditability requirements are in scope. EU AI Act obligations — which apply to high-risk surveillance systems from August 2026 — are built into pilot design rather than retrofitted at deployment.

They design for handover, not demonstration. The operating model — who owns tuning, who monitors performance, who manages retraining, who handles escalation — is defined before the pilot starts. Internal staff are embedded in pilot operations rather than observing the vendor perform. Runbooks are built during the pilot. The vendor's role in production support is defined contractually, not assumed.

They test failure modes deliberately. Chaos exercises during the pilot — network outages, camera failures, policy changes, unusual scene conditions — test both the AI layer and the organisation's ability to respond. Benign anomalies that generate false positives in production environments are tested explicitly: maintenance crews, cleaning schedules, seasonal changes, wildlife, weather. The false positive rate against these conditions tells you more about operational fit than any detection rate on a curated test set.
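The false positive measurement described above can be sketched as a small calculation over operator-reviewed alerts: false alerts per camera per shift, plus a breakdown by benign cause. The record layout (`is_true_incident`, `cause`) and the function name are assumptions made for illustration; a real alert export will have its own schema.

```python
from collections import Counter


def false_alert_breakdown(alerts, camera_count, shift_count):
    """Summarise false alerts from operator-reviewed pilot alerts.

    `alerts` is an iterable of dicts with hypothetical keys:
    "is_true_incident" (bool) and "cause" (an operator-assigned label such
    as "maintenance", "cleaning", "wildlife", or "weather").
    Returns (false alerts per camera per shift, Counter of causes).
    """
    if camera_count <= 0 or shift_count <= 0:
        raise ValueError("camera_count and shift_count must be positive")

    false_alerts = [a for a in alerts if not a["is_true_incident"]]
    by_cause = Counter(a["cause"] for a in false_alerts)
    rate = len(false_alerts) / (camera_count * shift_count)
    return rate, by_cause


reviewed = [
    {"is_true_incident": False, "cause": "wildlife"},
    {"is_true_incident": False, "cause": "wildlife"},
    {"is_true_incident": False, "cause": "cleaning"},
    {"is_true_incident": True, "cause": "intrusion"},
]
rate, by_cause = false_alert_breakdown(reviewed, camera_count=3, shift_count=2)
print(rate, by_cause.most_common())
```

Tracking the breakdown by cause, not only the headline rate, shows which benign conditions the system struggles with and whether tuning or an operating procedure is the better fix.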

They tie the pilot to a funded path to scale. Before the pilot starts, budget and timelines exist for the next phase if success criteria are met. The "successful but stranded" pattern — a pilot that performed well, with no organisational or financial path to deployment — is one of the most common failure modes in enterprise AI, and it is entirely preventable with pre-commitment at the right level.

The incentive problem underneath the technical one

It is worth being direct about something that sits beneath the technical story.

A lot of AI surveillance pilots exist to demonstrate that an organisation is innovating, not to change how security actually operates. Leaders get internal visibility and conference material from announcing pilots, regardless of whether they scale. Vendors get logos and reference cases from a well-run POC that will never become a deployment. Security operations inherit the system and the operational burden, but they were rarely the primary audience for the pilot's success metrics.

Scaling AI surveillance forces questions that pilots can avoid: who owns the false negatives, who is accountable when the system misses an incident, how union consultation and works council engagement are handled in European environments, and who explains the model to a regulator. At pilot scale, these questions can be deferred. At production scale, they cannot. And because no one wants to be the named owner of a system that might end up in a regulator's crosshairs, ownership stays fuzzy and the pilot quietly stalls.

The organisations that navigate this successfully treat pilot design as risk-transfer design. They use the pilot to answer: is this vendor capable of delivering production performance under our conditions? Can we build internal capability to own this without permanent vendor dependency? Can we pass governance and compliance scrutiny? Those are different questions from "can this model detect the things we showed it during the evaluation."

The answers to the first set of questions determine whether a deployment actually changes how security runs. The answer to the last one determines whether a vendor gets to add your logo to their case study page.

Industry Signal — The Numbers Behind the Pattern

The pilot-to-production failure rate is well documented in enterprise AI broadly, even if data specific to physical security is limited.

S&P Global Market Intelligence's 2025 survey of over one thousand respondents across North America and Europe found that the average organisation scraps 46 percent of AI projects between proof-of-concept and broad adoption. The share of companies abandoning most of their AI initiatives before production jumped from 17 percent to 42 percent year-on-year — rising as organisations accelerated adoption without building the governance and integration infrastructure that scaling requires.

Gartner's analysis of 2024 AI deployments found that 60 percent of generative AI POCs were abandoned after completion. The primary causes were not model performance but poor data readiness and the difficulty of organisational change, exactly the categories that curated pilot conditions are designed to avoid testing.

MIT's NANDA Initiative research — synthesising 150 leadership interviews, 350 employee surveys, and 300 public AI deployments — found that 95 percent of enterprise AI pilots deliver zero measurable business ROI, including those that reach production. The framing is pointed: the issue is not model quality. It is organisational capability gaps — the absence of operating models, roles, and processes to embed AI into actual workflows.

For European enterprises specifically, the picture has additional texture. An analysis drawing on Eurostat and AWS survey data from 2025 found that while 41 percent of large EU enterprises were using AI in 2024, relatively few had fully scaled AI across operations, with respondents citing fragmented tools, integration challenges, skills shortages, and cost. Ipsos' 2025 "Making AI Work for Europe" report identifies pronounced regional disparities, with Nordic states at 35 to 42 percent adoption and parts of Eastern and Southern Europe at 5 to 9 percent. The same report finds that 68 percent of European businesses cite EU AI Act uncertainty as a source of hesitation, a dynamic that is particularly acute in AI surveillance, where the high-risk classification creates compliance obligations that many organisations are not yet structured to meet.

The data specific to physical security remains largely qualitative: vendor case studies, integration post-mortems, operator feedback. But the cross-industry pattern is consistent enough to serve as a credible baseline: roughly half of pilots do not reach production, and most of those that do fail to deliver measurable value. The reasons are organisational and structural, not algorithmic.

From the Field

Luís Lamy, CEO — SafetyScope

What I have been thinking about after this week's conversations.

The most revealing question you can ask a vendor during a pilot is not "what is your detection accuracy" — it is "who will be doing this work in eighteen months."

In almost every pilot, the vendor team you meet is not the team you will have in production. The sales engineer who knows your site, the product specialist who tuned the model to your camera angles, the account manager who returns your calls — they move on. What you actually buy is a support model, a ticketing system, and a contract. The pilot is a relationship. The deployment is a service level agreement.

The organisations I have seen navigate this well do one thing consistently: they use the pilot to build their own capability, not to observe the vendor's. They put their own people inside the pilot operations, not as observers but as decision-makers. They own the tuning conversations. They manage the feedback loops. By the time the vendor engagement scales back, they know how the system behaves, where it struggles, and what it takes to keep it performing.

The organisations that struggle treat the pilot as a demonstration. They watch the vendor perform. And then, when the vendor's A-team moves on and the system is running on standard support, they discover that no one inside their organisation knows how to own it.

The transition from pilot to deployment is really a transition from vendor-led to internally owned. The pilots that become successful deployments are the ones that build toward that transition from day one.

One to Watch

SafetyScope has launched the Security Knowledge Hub at safetyscope.eu/knowledge — a growing library of glossary definitions, integration guides, technical explainers, and vendor comparison frameworks covering AI, physical security, and surveillance.

For security leaders navigating procurement decisions, the Knowledge Hub is designed to close the information gap that vendor conversations tend to exploit: the gap between marketing language and technical reality, between demo conditions and production performance. The integration guides cover VMS compatibility, ONVIF compliance, and edge compute architecture. The comparison frameworks give procurement teams the language to ask questions vendors are not used to being asked.

The Knowledge Hub is updated continuously. For teams in the middle of a procurement process or designing a pilot framework, it is a useful reference to have open alongside the vendor's sales materials.

Published: 2026-04-01 · Updated: 2026-04-01
