From engineered demos to contract language that shifts risk — this edition covers the full anatomy of the vendor problem in AI surveillance, and the buyer-controlled evaluation that changes the equation.
This week's posts covered the full anatomy of the vendor problem in AI surveillance: how demos are technically engineered to show best-case performance; the contract language that leaves buyers absorbing risk they never agreed to accept; the sales-to-engineering handover that breaks exactly where promise meets production; the metrics designed to win evaluations rather than predict deployment outcomes; and the integration promises that collapse once real VMS versions, real API endpoints, and real camera counts are involved.
This edition ties all of it together with the one thing that changes the equation: a buyer-controlled evaluation designed to predict production, not to witness the vendor's best day.
We also cover what is happening to the vendor landscape structurally, and what consolidation means for buyers on three-to-five-year contracts with platforms that may be acquired, deprioritised, or pushed toward end-of-life before the ink is dry on the renewal.
The standard AI surveillance vendor demo answers one question: can this system perform in the conditions we control?
That is not the question you need answered before committing to a three-to-five-year contract. The question you need answered is whether the system can perform in the conditions you cannot control: your lighting, your scene complexity, your camera estate, your VMS version, your network constraints, your operators, and the environment as it exists in six months when the seasons change and the store gets refitted.
A buyer-controlled evaluation is how you force that question into the room before signature. Here is what it looks like from start to finish.
The first decision is framing. Most procurement processes treat an AI surveillance evaluation as a product comparison — which vendor has the best detection capability, the most integrations, the most compelling roadmap. That framing produces impressive demos and unreliable deployments.
A rigorous evaluation treats the procurement as a high-risk AI assessment exercise. Before any vendor is invited to demonstrate anything, the buyer defines — in writing — the use cases being evaluated, the operational constraints that must be honoured in production, what good enough looks like per use case with specific metric thresholds, and which evaluation outcomes are disqualifying regardless of other performance.
This evaluation charter is the document the vendor signs off on before the evaluation begins. It defines how performance will be judged — not the vendor's preferred metrics, not the aggregate accuracy figure from a benchmark dataset, but the specific numbers that determine whether the system generates security value in your environment.
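One way to make the charter concrete is to write it as structured data that both parties sign and that later scoring can read mechanically. The sketch below is a minimal illustration in Python. The class names, field names, threshold values, and the example use case are all hypothetical and are not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCaseCriteria:
    """Acceptance criteria for one use case. All names are illustrative."""
    name: str
    min_recall: float          # share of real events that must be detected
    min_precision: float       # share of alerts that must be real events
    max_latency_s: float       # event onset to alert in the operator console
    disqualifiers: tuple = ()  # outcomes that fail the vendor outright


@dataclass(frozen=True)
class EvaluationCharter:
    """Written by the buyer and signed off by the vendor before any testing."""
    use_cases: tuple
    operational_constraints: tuple

    def problems(self):
        """Return a list of gaps; an empty list means the charter is complete."""
        found = []
        if not self.use_cases:
            found.append("no use cases defined")
        if not self.operational_constraints:
            found.append("no operational constraints defined")
        for uc in self.use_cases:
            if not 0.0 < uc.min_recall <= 1.0:
                found.append(f"{uc.name}: min_recall must be in (0, 1]")
            if not 0.0 < uc.min_precision <= 1.0:
                found.append(f"{uc.name}: min_precision must be in (0, 1]")
            if uc.max_latency_s <= 0:
                found.append(f"{uc.name}: max_latency_s must be positive")
        return found


# Hypothetical example values; real thresholds come from your own risk analysis.
charter = EvaluationCharter(
    use_cases=(
        UseCaseCriteria(
            name="perimeter_intrusion",
            min_recall=0.90,
            min_precision=0.70,
            max_latency_s=5.0,
            disqualifiers=("silent analytics outage",),
        ),
    ),
    operational_constraints=("streams via existing VMS", "production compute budget"),
)
print(charter.problems())  # an empty list: nothing missing
```

Freezing the dataclasses reflects the intent of the charter: once signed, the thresholds should not be editable in place after results start coming in.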
The fundamental design principle of a buyer-controlled evaluation is that the test environment should reflect your worst credible normal operating conditions — not a pristine lab designed to show the technology at its best.
This means using your existing camera estate. The mixed vintages, the compressed streams, the dirty domes, the cameras mounted at angles that made sense for recording but were never evaluated for analytics inference. If the vendor wants to test with an ideal camera as a separate comparison, that is acceptable — clearly labelled as a separate track, not as the primary evaluation.
It means feeding the AI system exactly how production will feed it — through your VMS streams, with your transcoding settings, your network paths, your bandwidth constraints. Not a direct RTSP connection from camera to analytics engine if that is not how your production environment is architected.
It means constraining compute to realistic production budgets. Not the high-end GPU the vendor brought to the proof of concept that will not appear anywhere in the actual deployment.
And it means running the evaluation inside or alongside your real operator console, so you see the actual user experience, the triage friction, and how AI alerts coexist with your existing alarm noise.
The footage corpus the evaluation runs on must be yours, not the vendor's curated demo clips and not footage selected to show the system well. The corpus should cover the categories below.
Boring normal — long stretches of nothing, weather changes, benign movement, normal staff workflows. This is the dominant class in any real surveillance environment and the category that determines your false positive rate and operator fatigue level. Demos almost never test this.
Typical incidents for each use case, with natural variation — different speeds, angles, levels of occlusion, clothing types, crowding levels. Not the clean, unobstructed, front-facing examples that appear in vendor demonstration reels.
Edge cases — heavy occlusion, crowds, partial views, reflections, lens flare, rain and fog and night with IR, people with hoods and hats and masks that are entirely lawful. These are the conditions where model performance degrades most sharply and where vendor headline figures diverge most severely from production reality.
The vendor never curates or cleans this corpus. It reflects your reality, not their best day.
The metric stack that determines how the evaluation is scored must be defined and vendor-approved before any footage is analysed. Not after. Not adjusted based on what the results show. Before.
At the model level: recall and precision per use case per camera family, latency from event onset to alert in the operator console, and robustness bands sliced by lighting, weather, camera angle, and crowding. Not a single aggregate accuracy figure. Condition-bound numbers that show where performance holds and where it degrades.
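To illustrate why condition-bound numbers matter, the following sketch computes precision and recall per slice from labelled outcomes. It assumes each evaluated event or alert has already been marked as a true positive, false positive, or false negative against your ground truth. The record layout, the function name, and the sample counts are hypothetical.

```python
from collections import defaultdict


def precision_recall_by_slice(records, slice_keys):
    """Group labelled outcomes by slice and compute precision and recall.

    Each record is a dict with an "outcome" of "tp", "fp", or "fn" plus
    whatever slice attributes the evaluation defines.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for rec in records:
        key = tuple(rec[k] for k in slice_keys)
        counts[key][rec["outcome"]] += 1

    results = {}
    for key, c in counts.items():
        alerts = c["tp"] + c["fp"]   # everything the system raised
        events = c["tp"] + c["fn"]   # everything that really happened
        results[key] = {
            "precision": c["tp"] / alerts if alerts else None,
            "recall": c["tp"] / events if events else None,
            "events": events,
        }
    return results


def _rows(n, **attrs):
    """Build n identical sample records (illustration only)."""
    return [dict(attrs) for _ in range(n)]


# Hypothetical outcomes for one use case on one camera family.
records = (
    _rows(8, use_case="intrusion", lighting="day", outcome="tp")
    + _rows(2, use_case="intrusion", lighting="day", outcome="fp")
    + _rows(2, use_case="intrusion", lighting="day", outcome="fn")
    + _rows(3, use_case="intrusion", lighting="night_ir", outcome="tp")
    + _rows(3, use_case="intrusion", lighting="night_ir", outcome="fp")
    + _rows(7, use_case="intrusion", lighting="night_ir", outcome="fn")
)

overall = precision_recall_by_slice(records, ["use_case"])
by_lighting = precision_recall_by_slice(records, ["use_case", "lighting"])
for key, stats in {**overall, **by_lighting}.items():
    print(key, stats)
```

In this made-up data the aggregate recall sits between the two slices and hides the fact that night-time recall is far lower than daytime recall, which is the pattern a single headline figure conceals.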
At the operational level: validated alert rate — alerts per hour per operator, with proportion actively reviewed and dispositioned. Mean time to review and act. False negative tracking against the labelled corpus. Escalation quality — over-escalation and under-escalation rates. These are the metrics that predict whether operators engage with the system or learn to ignore it.
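The operational metrics can be derived from an export of the operator console's alert log. The sketch below assumes each alert carries a raised timestamp and, if an operator acted on it, a disposition timestamp. The field names and the sample shift are hypothetical, and a real console export will need mapping into this shape.

```python
from datetime import datetime, timedelta
from statistics import mean


def operational_summary(alerts, shift_hours, operators):
    """Summarise alert load and review behaviour for one shift.

    alerts: list of dicts with "raised" (datetime) and "dispositioned"
    (datetime, or None when nobody reviewed the alert).
    """
    reviewed = [a for a in alerts if a["dispositioned"] is not None]
    review_seconds = [
        (a["dispositioned"] - a["raised"]).total_seconds() for a in reviewed
    ]
    return {
        "alerts_per_operator_hour": len(alerts) / (shift_hours * operators),
        "reviewed_share": len(reviewed) / len(alerts) if alerts else None,
        "mean_time_to_review_s": mean(review_seconds) if review_seconds else None,
    }


# Hypothetical shift: four alerts, one of which was never reviewed.
start = datetime(2026, 1, 5, 8, 0, 0)
alerts = [
    {"raised": start, "dispositioned": start + timedelta(seconds=30)},
    {"raised": start + timedelta(hours=1),
     "dispositioned": start + timedelta(hours=1, seconds=60)},
    {"raised": start + timedelta(hours=2),
     "dispositioned": start + timedelta(hours=2, seconds=90)},
    {"raised": start + timedelta(hours=3), "dispositioned": None},
]
summary = operational_summary(alerts, shift_hours=8, operators=2)
print(summary)
```

A falling reviewed share over the evaluation period is an early sign that operators are starting to ignore the system, even if model-level metrics look stable.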
At the governance level: model transparency — version, training date, change log available at any point. Drift indicators — trend lines over the evaluation period showing whether performance is holding or declining as conditions vary.
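A drift indicator can be as simple as the slope of a metric across evaluation weeks. The sketch below fits an ordinary least-squares line through weekly values and flags a decline steeper than an agreed tolerance. The function names, the weekly recall figures, and the tolerance are assumptions for illustration, and a real evaluation would want more data points than four.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (change per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def drift_flag(values, max_decline_per_period):
    """True when the metric is falling faster than the agreed tolerance."""
    return trend_slope(values) < -max_decline_per_period


# Hypothetical weekly recall for one use case during the evaluation.
weekly_recall = [0.90, 0.88, 0.86, 0.84]
print(round(trend_slope(weekly_recall), 3))
print(drift_flag(weekly_recall, max_decline_per_period=0.01))
```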
The vendor who cannot or will not operate within this metric framework is telling you something important about what post-deployment performance management will look like.
A rigorous proof of concept has five phases.
Design and baseline — one to two weeks. Lock metrics, success thresholds, and constraints. Run baseline logging with no analytics or your current solution to establish the status quo noise and detection level. This is the benchmark everything gets compared against.
Vendor integration and configuration — one to two weeks. The vendor can tune zones, sensitivity, and rules within constraints. Configuration freezes after the tuning window. The system that performs after that freeze is the system that represents production — not the system that three resident engineers are continuously adjusting.
Stability and load phase — two to four weeks. Introduce normal operational changes without warning. Apply scenario scripts intermittently and without pre-announcement. Run shortlisted vendors in parallel where the camera estate allows.
Failure mode exploration — one to two weeks, overlapping with stability phase. Deliberately stress the system — network jitter and brief outages, camera reboots, temporary occlusions, time synchronisation drift. The question is not whether failures occur. It is whether the system fails gracefully with clear status signals, or silently with no indication that analytics have degraded.
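The graceful-versus-silent distinction can be checked mechanically by comparing the faults you injected with the degraded-status signals the system actually raised. The sketch below is one hedged way to do that. The tuple layouts, the camera identifiers, and the 60-second grace period are hypothetical, and times are plain seconds from the start of the test for simplicity.

```python
def classify_faults(injected_faults, status_events, grace_s=60):
    """Label each injected fault as "signalled" or "silent".

    injected_faults: list of (camera_id, start_s, end_s) the buyer caused.
    status_events: list of (camera_id, t_s) when the system reported that
    analytics on that camera were degraded or offline.
    A fault counts as signalled if a matching status event arrives between
    the fault start and grace_s seconds after the fault end.
    """
    results = {}
    for camera_id, start_s, end_s in injected_faults:
        signalled = any(
            cam == camera_id and start_s <= t_s <= end_s + grace_s
            for cam, t_s in status_events
        )
        results[(camera_id, start_s)] = "signalled" if signalled else "silent"
    return results


# Hypothetical test run: two injected outages, one status signal received.
faults = [("cam-01", 100, 160), ("cam-02", 300, 330)]
status = [("cam-01", 130)]
outcome = classify_faults(faults, status)
print(outcome)
```

Any fault labelled silent is a case where operators would have kept trusting analytics that were not actually running.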
Post-mortem and scoring — one week. Compute metrics from your telemetry and ground truth. Score per use case against defined thresholds. Document vendor-specific operational overhead and quirks.
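Scoring against defined thresholds can be kept deliberately mechanical so that results cannot be reinterpreted after the fact. The following sketch compares measured values with agreed thresholds per use case and reports which criteria failed. The dictionary keys, use case names, and numbers are hypothetical.

```python
def score_use_case(measured, thresholds):
    """Compare measured results with the agreed thresholds for one use case."""
    failed = []
    if measured["recall"] < thresholds["min_recall"]:
        failed.append("recall")
    if measured["precision"] < thresholds["min_precision"]:
        failed.append("precision")
    if measured["latency_s"] > thresholds["max_latency_s"]:
        failed.append("latency")
    return {"verdict": "fail" if failed else "pass", "failed_criteria": failed}


def score_vendor(measured_by_use_case, thresholds_by_use_case):
    """Score every use case in the thresholds; unmeasured use cases fail."""
    scores = {}
    for name, thresholds in thresholds_by_use_case.items():
        measured = measured_by_use_case.get(name)
        if measured is None:
            scores[name] = {"verdict": "fail", "failed_criteria": ["not measured"]}
        else:
            scores[name] = score_use_case(measured, thresholds)
    return scores


# Hypothetical thresholds and results for two use cases.
thresholds = {
    "perimeter_intrusion": {"min_recall": 0.90, "min_precision": 0.70, "max_latency_s": 5.0},
    "loitering": {"min_recall": 0.80, "min_precision": 0.60, "max_latency_s": 10.0},
}
measured = {
    "perimeter_intrusion": {"recall": 0.93, "precision": 0.74, "latency_s": 3.2},
    "loitering": {"recall": 0.71, "precision": 0.65, "latency_s": 12.0},
}
scores = score_vendor(measured, thresholds)
print(scores)
```

Treating an unmeasured use case as a failure is a design choice in this sketch: it prevents a vendor from passing by leaving a difficult use case out of the test.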
How a vendor behaves during evaluation is often a better predictor of deployment outcome than any metric from the evaluation itself.
Refusal to test on your footage or in your environment — insistence on curated demo reels, their cameras, or their lab — is the single strongest negative signal. It means the performance they are willing to stand behind only exists in conditions they control.
Extreme resistance to metrics beyond aggregate accuracy. If a vendor pushes back on recall and precision together, on latency requirements, on false negative tracking, or on operator workload metrics, expect difficulty with performance management throughout the contract term.
Over-reliance on manual tuning during the proof of concept. If the only path to acceptable performance requires continuous white-glove adjustment by vendor engineers, you are seeing the true operating cost and fragility of the system — and an honest picture of what happens when those engineers move on to the next deployment.
No clear story on model lifecycle — vague answers about training and retraining, no versioning, no willingness to discuss performance degradation or drift management. This is the operational reality you are buying into for three to five years.
Refusal to put claimed performance into contract language. If numbers appear only in slide decks and disappear when contract terms are discussed, those numbers are marketing, not engineering.
The positive signals matter too. Willingness to share independent benchmark results and map them to your specific use cases. Clear explanation of known failure modes and conditions where the system is not suitable. Engineering in the room during evaluation, answering feasibility questions rather than deferring to sales.
The specific failure rate for AI surveillance proofs of concept is not publicly tracked as a distinct category. But the data on enterprise AI projects broadly tells a consistent enough story to be operationally useful.
IDC research reports that 88% of AI proofs of concept did not reach wide-scale deployment by the end of 2024. Other analyst commentary puts the share of AI projects that miss expectations or fail to scale beyond pilots at 70 to 85%. AWS claims its supported projects reach production at sixteen times the industry average, which implies an industry average in the single digits: no success rate can exceed 100%, and 100% divided by sixteen is about 6%.
If you treat AI surveillance proofs of concept as enterprise AI projects with additional physical-world complexity, the defensible assumption is that something between 10 and 30% of proofs of concept proceed to meaningful production use without major rescoping. Most of the failure literature is consistent that the failure pattern is not primarily model-centric. It is operational: data quality and domain shift, false positive burden, integration complexity, and governance gaps are the dominant causes. These are the same conditions that vendor demos are specifically designed to avoid.
The EU AI Act is the regulatory response to the gap between vendor claims and production reality. High-risk AI surveillance applications — biometric identification, critical infrastructure monitoring, access control analytics — now face binding obligations on technical documentation, conformity assessment, post-market monitoring, and incident reporting that effectively force providers to substantiate performance claims with auditable evidence rather than benchmark figures from controlled test environments.
The provision with the most immediate commercial implication for buyers in Europe is the re-qualification clause (Article 25 of the Act). Distributors, importers, or deployers who substantially modify a high-risk system, change its intended purpose, or place it on the market under their own name become providers under the Act, with full provider obligations, including responsibility for performance documentation and conformity assessment. For an integrator bundling third-party analytics into a branded solution, this means the obligation to substantiate and document performance falls on you, not just on the upstream vendor.
The practical implication for European procurement teams in 2025 and 2026 is that AI Act compliance is not a future consideration. Any system deployed this year that falls into a high-risk category will be operating under conformity obligations within twelve to eighteen months of go-live. Requiring AI Act-aligned documentation — intended purpose, risk classification, technical file, performance metrics, post-market monitoring plan — from vendors before signature is not a regulatory burden. It is the most direct available method of separating vendors whose systems are built for production accountability from those whose systems are built for demo performance.
Something I have been sitting with after this week's posts.
Most of the procurement conversations we have been in at SafetyScope — on both sides of the table — have started with price. Not with the problem the buyer is trying to solve, not with what success looks like twelve months in, not with which capabilities actually matter for their specific environment. Price.
The uncomfortable thing about that from a vendor's position is that we know our competitors. We know where they are strong and where they are not. We know which use cases they handle well and which ones they handle with a threshold setting that produces demo accuracy and production noise. And we know that none of that knowledge surfaces in a procurement conversation that starts and ends with who will discount furthest.
The conversation needs to move from price to problem resolution, and it needs to move on the buyer's side as much as the vendor's. A vendor who knows they cannot win on price has every incentive to redirect the conversation toward where they actually add value. But if the buyer's evaluation process has no mechanism for that conversation, it never happens.
The most useful thing a security director can do before the next vendor evaluation is not research the market. It is define the problem precisely enough that price becomes a secondary question. What specific incident type, at what frequency, with what false positive tolerance, integrated into which existing workflows, with what governance requirements. That specification is what forces vendors to compete on capability rather than margin.
The EU AI Act's re-qualification provision is the regulatory development most likely to reshape how AI surveillance is sold in Europe over the next eighteen months — and it is the least discussed in current vendor and integrator conversations.
The provision means that an integrator who takes a third-party analytics engine, packages it into a branded security solution, and sells it under their own name becomes a provider under the Act with full obligations for conformity assessment, technical documentation, and post-market monitoring. Not the upstream analytics vendor. The integrator.
The commercial implication is significant. Integrators who have been selling AI surveillance solutions built on third-party models without conducting their own conformity assessments are now potentially exposed to provider-level obligations they did not anticipate when they structured their vendor relationships.
The organisations that will navigate this transition well are those that have already demanded AI Act-aligned documentation from their upstream vendors — intended purpose, risk classification, performance characteristics, known limitations — and have built that documentation into their own solution delivery process. The organisations that will struggle are those that have been reselling capability claims without the evidentiary backing that the Act now requires.
This is not a 2026 problem. The design decisions and vendor relationships being established in 2025 determine the compliance exposure of 2026 deployments.
Published: 2026-03-25 · Updated: 2026-03-25