Industry Audits · 14 min read

Voice Agent Audit: We Reviewed 47 Mid-Market Deployments and 78% Were Failing

We audited 47 mid-market voice agent deployments running production traffic in 2026. 37 were leaking value or actively harming caller experience. Five failure patterns explain almost all of them.

A 47-deployment voice agent audit in 2026 found 78% of mid-market voice agents leaking value or actively harming caller experience. Five predictable failure patterns explain almost all of them: script overconfidence (78%), PMS/CRM integration drift (54%), compliance gaps in HIPAA/TCPA/PCI (37%), no human escalation path (29%), and voice persona mismatch (24%). The 22% of deployments that work share three traits: documented escalation paths with named owners, weekly transcript review meetings, and an on-call operator with a 4-minute ack SLA.

TL;DR

  • 78% of mid-market voice agent deployments are failing. Not "could be improved" - measurably leaking value or harming caller experience.
  • Five failure patterns cover 90%+ of cases. Script overconfidence, integration drift, compliance gaps, no human escalation, persona mismatch.
  • The 22% that work all do three things. Documented escalation runbooks, weekly transcript review, on-call coverage.
  • Restaurants are the worst vertical at 100%. Legal is the best at 60%. Most verticals cluster at 75-85%.
  • Median fix time: 4-7 weeks. Most of the work is process discipline, not better prompts.

Audit method · Script overconfidence · Integration drift · Compliance gaps · No escalation · Persona mismatch · Per-vertical breakdown · The 22% pattern · Self-audit checklist · Remediation playbooks · Vendor due-diligence · FAQ

1. The audit method

Forty-seven mid-market firms with $5-50M ARR running production voice agents handling at least 200 inbound calls per week. Sample composition across the eight voice-agent verticals luup actively builds for: real estate 8, dental 7, medspa 6, restaurants 6, home services 5, ecommerce support 5, legal 5, automotive 5. Geographic mix: 28 US, 14 EU, 5 UK. We excluded deployments running fewer than 200 calls/week because failure patterns at low volume often look like statistical noise rather than systemic issues.

Each audit ran 90 minutes. Topics: full script inventory, PMS/CRM integration map, compliance posture (HIPAA where applicable, TCPA always, PCI for any payment-handling), escalation paths to humans, on-call coverage, voice persona choice, transcript review cadence. We then sampled 50 randomised call transcripts from the prior 30 days and scored each against five dimensions: script handling, integration write success, compliance flags, escalation correctness, caller satisfaction signal. Sampling baseline borrowed from Gartner's customer-service AI tracker; the qualitative scoring rubric mirrored the Atlassian post-incident framework applied per call.
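For operators who want to re-run the method on their own transcripts, this is the shape of the per-call record - a minimal sketch with illustrative field names and thresholds, not our exact rubric schema:

```python
# Sketch of the per-call scoring record. Field names and thresholds are
# illustrative; the rubric weights varied by vertical in the real audit.
from dataclasses import dataclass

@dataclass
class CallScore:
    call_id: str
    script_handling: int         # 0 = off-script failure, 1 = partial, 2 = clean
    integration_write_ok: bool   # did the booking/update land in the PMS/CRM?
    compliance_flags: int        # disclosure, PHI, or PCI issues on this call
    escalation_correct: bool     # transferred when the runbook said to
    satisfaction_signal: int     # -1 negative, 0 neutral, +1 positive

def looks_failing(scores: list[CallScore]) -> bool:
    """Rough screen: any compliance flag, or more than 20% of sampled
    calls with a broken write or a missed escalation."""
    if any(s.compliance_flags for s in scores):
        return True
    bad = sum(1 for s in scores
              if not s.integration_write_ok or not s.escalation_correct)
    return bad / len(scores) > 0.20
```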

2. Failure 1: Script overconfidence (78%)

The most common failure across the entire audit. Operators ship the agent with a happy-path script that cleanly handles 70-80% of calls. The script gets demoed, approved, and deployed. The other 20-30% of calls stack up as silent drop-offs, mis-routed transfers, or compliance-flag transcripts that nobody reviews until a customer escalates.

Three sub-patterns:

  • The "we'll handle edge cases later" trap. 31 of 37 failing deployments shipped without documented edge-case handlers. Common edge cases: caller speaks a non-default language, caller is calling on behalf of someone else, caller is asking about a service the script does not cover, caller is in distress.
  • The single-intent assumption. Real callers often combine intents in one call ("I want to book and also ask about insurance and also check my last appointment"). 18 of 37 deployments could not handle multi-intent calls cleanly.
  • The accent and speech-pattern blind spot. 14 of 37 deployments scored materially worse on transcripts from non-native speakers, callers with stutters, elderly callers, or callers in noisy environments. Most operators discovered this only after a customer complaint reached ownership.

Fix is documenting the long tail explicitly. Catalogue the top 30 edge cases per vertical (the voice-agent failure patterns guide covers real-estate; the same exercise generalises). Train the agent to escalate cleanly when off-script rather than guess. Review weekly.
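A minimal sketch of the escalate-don't-guess rule, assuming your platform exposes a classified intent and a confidence score per turn. The catalogue entries and the 0.75 threshold are placeholders, not recommendations:

```python
# Escalate instead of guessing when a turn falls outside the documented
# catalogue. Catalogue contents and the threshold are placeholders.
HANDLED_INTENTS = {
    "book_appointment", "reschedule", "cancel",
    "insurance_question", "third_party_caller", "non_default_language",
    # ...the rest of your top-30 edge-case list for the vertical
}

def route_turn(intent: str, confidence: float) -> str:
    if intent not in HANDLED_INTENTS or confidence < 0.75:
        return "escalate_to_human"  # a clean hand-off beats a confident guess
    return "continue_script"
```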

3. Failure 2: PMS / CRM integration drift (54%)

The agent talks to the caller, agrees to book or update, and then writes to the wrong record, the wrong field, or no record at all. 25 of 47 deployments had at least one production integration silently broken when we pulled the logs.

Three sub-patterns:

  • Schema drift. The PMS admin renamed a custom field. The agent's integration still wrote to the old field name. Booking confirmations went to a void column. 11 of 25 cases.
  • Authentication rotation. Service-account credentials rotated per security policy. Integration credential reference did not. Every monthly rotation broke the same loop until somebody noticed. 8 of 25 cases.
  • API deprecation. Vendor announced v1 API end-of-life. v1 started returning 410 Gone. Agent kept retrying. 6 of 25 cases.

Fix is the same monitoring discipline that closes the same loops in the broader 50-firm AI stack audit: synthetic daily check per integration, Slack alerts on failure, on-call ack SLA. Voice has higher stakes because callers are live humans, not pipeline records waiting in a queue.
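What the synthetic daily check can look like, assuming a REST-style PMS API and a Slack incoming webhook. Every URL, field, and identifier below is a placeholder, and the write should target a sandbox record, never a live one:

```python
# Daily synthetic write-and-read-back check for one integration.
# Run from cron/scheduler once per day per integration.
import requests

PMS_URL = "https://pms.example.com/api/v2/appointments"   # placeholder endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"    # placeholder webhook

def synthetic_check() -> None:
    payload = {"patient_id": "SYNTHETIC-0001", "slot": "2026-05-05T09:00:00"}
    try:
        created = requests.post(PMS_URL, json=payload, timeout=10)
        created.raise_for_status()
        # Read it back: a 200 on the write is not proof the record landed.
        record_id = created.json()["id"]
        readback = requests.get(f"{PMS_URL}/{record_id}", timeout=10)
        readback.raise_for_status()
    except Exception as exc:
        requests.post(SLACK_WEBHOOK, timeout=10, json={
            "text": f":rotating_light: Voice-agent PMS check failed: {exc}"
        })

if __name__ == "__main__":
    synthetic_check()
```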

4. Failure 3: Compliance gaps - HIPAA, TCPA, PCI (37%)

Voice deployments touch three compliance surfaces simultaneously: HIPAA for any healthcare context, TCPA for any outbound calling, PCI DSS for any payment handling. 17 of 47 deployments had at least one documented compliance gap.

Common gaps:

  • No BAA at the LLM tier. Standard OpenAI API does not cover PHI; standard Anthropic API does not either. Most healthcare voice deployments we audited used a developer-tier LLM API thinking the platform-level voice BAA covered it. It does not.
  • Missing recording disclosure. Several US states require two-party consent for call recording. 9 deployments shipped scripts that opened with the AI greeting but no recording disclosure.
  • Outbound TCPA violations. 6 deployments running outbound recall or sales sequences had no documented TCPA opt-in trail. The PMS intake form did not contain current TCPA-compliant opt-in language.
  • PCI exposure on payment intake. 4 deployments collected payment information by voice without proper SAQ-D compliance posture. The transcripts contained card numbers in plain text.

Compliance is not a feature you add later. It is a property of the entire stack the day you ship.

5. Failure 4: No human escalation path (29%)

14 of 47 deployments had no working transfer-to-human path, or a path that worked only during business hours. Real callers asking for a human got either a polite "I can help you with that" loop, an awkward voicemail, or silence.

The escalation gap shows up most painfully on emergencies. We documented 7 cases across the audit where a true emergency caller (medical urgency, post-op complication, time-critical contract issue, urgent security issue) was kept on the line by the agent rather than escalated. Two led to legal complaints; one led to a settlement.

Fix is straightforward in concept, hard in practice: every script has a documented escalation criterion, a named on-call human, a 4-minute ack SLA, and a fallback path if the on-call does not ack. Test the path weekly with a synthetic call. The dental voice-agent guide documents the emergency triage flow in detail; the same shape applies across verticals.
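One runbook entry expressed as data - a sketch of the shape rather than a prescription. The names and values are examples of the pattern from the working deployments, to be adapted to whatever your team actually uses:

```python
# One escalation runbook entry as structured data. All values are examples.
RUNBOOK_ENTRY = {
    "scenario": "caller_reports_post_op_complication",
    "trigger": "any mention of pain, bleeding, or swelling after a procedure",
    "severity": 1,                        # 1 = emergency: transfer and page now
    "on_call": "Maria (front-desk lead)", # a named person, not a role alias
    "ack_sla_seconds": 240,               # the 4-minute number from the audit
    "fallback": "page practice owner, then forward the call to their cell",
    "synthetic_test_cadence": "weekly",
}
```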

6. Failure 5: Voice persona mismatch (24%)

11 of 47 deployments shipped with a voice that did not match the caller's expectation of the brand. Common patterns: dental clinic agent voiced as the dentist (patients expected the front desk), restaurant agent voiced too formal for the brand tone, real-estate developer agent voiced too casual for the price band, legal-intake agent voiced too robotic for the empathy expected.

The numeric impact is real. Across the 11 mismatched deployments, mid-call drop-off rates ran 22-38% versus 8-15% for persona-aligned deployments. Acceptance scores on post-call surveys ran 30-50 points lower.

Fix is one extra sprint day on persona work during deployment. Voice cloning of an actual front-desk team member (with written consent) lifts acceptance 12-18 points in our test data. Do not voice the agent as the highest-status human in the practice; voice it as the front-desk persona callers expect.

7. Per-vertical breakdown

Failure rates cluster by vertical because the structural pressure differs by vertical. Restaurants run high call volume on fragile reservation systems; legal-intake scripts are narrow and well-defined. The audit-derived breakdown:

Vertical            Deployments   % failing   Top failure                                    Companion read
Restaurants         6             100%        Reservation system fragility + peak volume     Restaurant voice agents
Real estate         8             88%         Lead routing drift + multi-language            RE failure patterns
Medspa              6             83%         HIPAA gaps + persona mismatch                  Medspa voice agents
Home services       5             80%         Dispatch routing + emergency triage            Home-services voice agents
Dental              7             71%         HIPAA gaps + insurance verification            Dental voice agents
Ecommerce support   5             80%         Order-status integration drift                 Ecom support voice agents
Automotive          5             80%         Service-bay scheduling + parts lookup          Automotive voice agents
Legal               5             60%         Empathy gap on intake                          Legal voice agents

The pattern across verticals is consistent even when the surface differs. Operational discipline beats vertical specificity. The same closed-loop pattern shows up in our parallel audits of automation (50-firm AI stack audit) and creative ops (47-agency creative audit).

8. The 22% pattern: what working deployments actually do

Ten deployments in our 47 ran with zero documented failures across the audit. Their tooling was not impressive. Vapi and Retell split roughly 60/40 across the working set; Vapi dominated the simpler scripts and Retell dominated the complex multi-turn flows. ElevenLabs Enterprise (with BAA) handled TTS in every HIPAA deployment. The constants were operational, not technical.

8.1 Documented escalation runbook per script scenario

Every script had a 1-page runbook: trigger condition, severity score, named on-call human, ack SLA, fallback path. Stored in Notion or Confluence. Reviewed quarterly with the on-call rotation. The 8 deployments that had no runbook all hit failure 4 (no escalation path) at least once.

8.2 Weekly transcript review meeting

30-50 randomised call transcripts reviewed every week by the operator who owns the deployment plus one rotating human reviewer (front desk for dental/medspa, sales lead for real estate, support lead for ecom, etc.). 30 minutes per week. Catches edge cases before customers escalate.

8.3 On-call coverage with named owner and ack SLA

Either an in-house ops hire or external partner with a written SLA. 4-minute ack SLA on emergency-routed calls during business hours; 15-minute SLA after hours. None of the working deployments had "ops is part of everyone's job" - that anti-pattern correlated 1:1 with failures, same as the broader 50-firm audit found.

9. Self-audit checklist for your own voice deployment

Run this on your stack today. If you cannot answer any one of these inside 90 seconds, that is the audit signal.

  1. Pull last week's call transcripts. Sample 30 random ones. How many had compliance flags, off-script handling, mis-routed transfers, or caller frustration signals? (Scriptable - see the sketch at the end of this section.)
  2. List every PMS/CRM integration the agent writes to. Confirm the synthetic daily check exists for each. Confirm the on-call gets paged when it fails.
  3. Read the BAA chain. Voice platform, STT, LLM, TTS, storage, PMS. Every link signed? Every one current?
  4. Map the escalation path. What triggers a transfer? Who gets paged? What's the ack SLA? What's the fallback if no ack?
  5. Pull the on-call schedule. Who is on-call right now? When did they last test the path? Is there a written SLA?
  6. Test a synthetic call. Pick the most common edge case from your transcripts. Call the agent. Score the handling.
  7. Confirm weekly transcript review. When did the last meeting happen? What was reviewed? What changed in the script as a result?

Score: 7 yeses puts you in the 22%. 5-6 is recoverable inside a quarter. 4 or fewer is the 78% bucket; start with the playbooks below. The Agency Audit tool automates this scoring and benchmarks you against the 47-deployment sample.
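Item 1 is the easiest to script. A sketch, assuming your platform exports transcripts as JSON files with a flags field - adapt the loader to whatever yours actually emits:

```python
# Sample 30 random transcripts from last week and count flagged calls.
import json
import random
from pathlib import Path

transcripts = list(Path("exports/last_week").glob("*.json"))  # placeholder path
sample = random.sample(transcripts, k=min(30, len(transcripts)))

flagged = 0
for path in sample:
    call = json.loads(path.read_text())
    if call.get("flags"):  # compliance, off-script, mis-route, frustration
        flagged += 1

print(f"{flagged}/{len(sample)} sampled calls flagged")
```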

10. Remediation playbooks for the five failures

Each failure has a known fix. Below are the 4-7-week playbooks we ran with the 14 post-audit remediation engagements.

10.1 Playbook for script overconfidence

Week 1: pull the last 200 call transcripts. Categorise by intent + outcome. Identify the top 30 edge cases not handled by the current script. Week 2: write explicit handlers for the top 10. Weeks 3-4: handlers 11-30, plus training the agent to escalate cleanly when off-script. Week 5: the weekly transcript review meeting becomes the standing cadence.
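The Week 1 tally is a short script if your platform exports an intent and an outcome per call - both field names below are assumptions:

```python
# Rank the unhandled tail: tally (intent, outcome) pairs across the last
# 200 calls and surface the most frequent non-handled combinations.
from collections import Counter

def top_edge_cases(calls: list[dict], n: int = 30) -> list[tuple]:
    tail = Counter(
        (call["intent"], call["outcome"])
        for call in calls
        if call["outcome"] != "handled"  # drop-offs, mis-routes, escalations
    )
    return tail.most_common(n)  # these become Weeks 2-4's explicit handlers
```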

10.2 Playbook for integration drift

Week 1: catalogue every active integration plus API version dependency. Week 2: subscribe one ops inbox to every vendor changelog. Synthetic daily check per integration with Slack alerts on failure. Week 3: configure quarterly review of API versions; anything more than two minor versions behind gets prioritised.

10.3 Playbook for compliance gaps

Week 1: BAA audit across the entire stack. Anything missing gets escalated to procurement. Week 2: TCPA opt-in audit on every outbound caller record. Week 3: recording disclosure script localised per jurisdiction. Week 4: PCI scoping for any payment-handling agents (most should not handle payment by voice; redirect to a secure payment portal).

10.4 Playbook for no escalation path

Week 1: write the escalation runbook. Trigger conditions, named owners, ack SLA, fallback. Week 2: implement the page logic in Twilio plus Slack plus SMS, all three. Week 3: test the path with synthetic calls weekly. Week 4: document the post-incident review template for any escalation that goes wrong.
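A sketch of the three-channel page with the fallback timer, assuming Twilio's Python SDK and a Slack incoming webhook. The ack lookup is a stub for however you record acknowledgements (Slack reaction, web callback, SMS reply):

```python
# Page on-call via Slack + SMS + voice call; fall back if no ack inside SLA.
import time
import requests
from twilio.rest import Client

twilio = Client("ACCOUNT_SID", "AUTH_TOKEN")             # placeholders
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # placeholder
FROM_NUMBER = "+15550100"                                # placeholder

def acked(person: dict) -> bool:
    return False  # stub: query your ack store (reaction, callback, SMS reply)

def page(person: dict, summary: str) -> None:
    requests.post(SLACK_WEBHOOK, timeout=10,
                  json={"text": f"@{person['slack']} ESCALATION: {summary}"})
    twilio.messages.create(to=person["phone"], from_=FROM_NUMBER, body=summary)
    twilio.calls.create(to=person["phone"], from_=FROM_NUMBER,
                        url="https://example.com/announce.xml")  # TwiML placeholder

def escalate(on_call: dict, fallback: dict, summary: str,
             ack_sla: int = 240) -> None:
    page(on_call, summary)
    time.sleep(ack_sla)          # a real pager would poll, not sleep once
    if not acked(on_call):
        page(fallback, summary)  # the fallback path from the runbook
```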

10.5 Playbook for persona mismatch

Week 1: gather caller satisfaction signal from prior 30 days (post-call surveys, complaints, drop-off rates by call segment). Week 2: voice work with a real front-desk team member (with written consent). Week 3: clone via ElevenLabs Voice Lab plus tone-match the script. Week 4: A/B test old voice vs new on a routed sample.

11. Vendor due-diligence for voice deployments

Half the failures we audited were baked in at vendor selection. Six questions to ask any voice-platform vendor before signing:

  1. Show me a customer running this product at my scale for 18 months. Three-month logos prove the demo went well, not that integration debt is manageable.
  2. What does your BAA cover, exactly? Verify it includes transcripts, recordings, and PHI fields - not just the platform-layer voice routing. Many vendors offer a BAA that does not cover the LLM tier.
  3. How do you announce breaking API changes? A public changelog, an email list, and a 6-month deprecation window are the right answers.
  4. What is your status page for the last 12 months? Real status pages have history; theatre status pages are always green.
  5. What is your CSM rotation rate? Voice deployments are operational relationships. CSMs that rotate every 6 months produce orphan-script deployments at the 12-18 month mark.
  6. What is the PCI / HIPAA / TCPA story end-to-end? If they cannot answer in plain English in two minutes, they are not the vendor. Pattern matches the broader vendor-selection rules in our 47-agency audit and automation vendor comparison.

Vendor-selection mistakes compound. Every voice agent failure pattern in this audit had a precursor at vendor selection: chose the platform without verifying BAA chain, chose the TTS without checking enterprise-tier coverage, chose the LLM without confirming PHI eligibility, chose the integration partner without 18-month references.

12. What to ship this week

Pull last week's call transcripts. Sample 30. Read them. Count compliance flags. If you find one in the first 30, you are in the 78%. Run the Agency Audit tool to formalise the scoring, or the Phantom Lead Test to probe your inbound funnel for the integration drift described in failure 2. Or book a 30-minute review with a luup operator who has run remediation across the failure patterns above.

13. Frequently asked questions

How was the 47-deployment audit conducted?

47 mid-market firms ($5-50M ARR) running production voice agents at 200+ calls/week. 90-minute audits each, 50 random transcripts per deployment, scored across script handling, integration write success, compliance, escalation, satisfaction.

What does "failing" mean?

Either leaking measurable value (caller drop-off, mis-routes, compliance flags) or actively harming caller experience (negative reviews, complaints, escalations). 37 of 47 hit at least one. 19 hit two or more.

Why is script overconfidence the top failure?

Operators ship the happy path and assume the long tail will resolve itself. It does not. The 20-30% off-script calls become silent drop-offs, mis-routes, or angry customers.

What do the 22% do differently?

Documented escalation runbooks, weekly transcript review, on-call coverage with 4-minute ack SLA. None had magic prompts.

How does this compare to the 50-firm AI stack audit?

Companion data set. Both share the same operational pattern: documented runbooks, named owners, monitored loops. Voice has higher stakes per failure because callers are live humans.

Which vertical is worst?

Restaurants at 100% (reservation fragility + peak volume). Best is legal at 60% (narrow well-defined intake scripts).

How long does it take to fix?

4-7 weeks of focused operational work plus indefinite weekly review cadence. Most of the work is process discipline, not better prompts.

What stack do the 22% run?

Vapi and Retell split 60/40. ElevenLabs Enterprise with BAA for HIPAA verticals. Constants are operational: runbooks, transcript review, on-call coverage.

14. Field notes from 47 voice-agent deployments

Five patterns surfaced in this audit that did not show up as cleanly in our parallel audits. They track the structural specifics of voice operations - the live-caller dynamic, the per-second cost economics, the multi-vendor stack reality.

Note 1 - the founder hears one bad call and shuts the agent off. 6 of the 47 deployments we audited had been disabled at least once after a single bad call reached ownership. Operationally this is a rational response, but it does not address the underlying failure pattern. The 22% have a different reflex: bad call triggers a transcript review, a script update, a runbook entry, and a synthetic test of the fix. The agent stays on; the system improves.

Note 2 - voice cost economics force ruthless prioritisation. Voice agents cost roughly 10-50x per interaction what a chatbot does because TTS, STT, and LLM inference all run in real time on every call. That cost profile forces operators to focus on the highest-value loops. The deployments that fail are usually the ones that tried to handle every possible call from Day 1; the deployments that work scoped tightly and expanded carefully.
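The multiplier is easy to reconstruct. Illustrative arithmetic below - every rate is an assumption to swap for your contracted pricing, not a vendor quote:

```python
# Back-of-envelope per-call cost for a 4-minute voice call vs a chat session.
# All rates are assumptions for illustration, not vendor pricing.
STT_PER_MIN = 0.010   # speech-to-text
TTS_PER_MIN = 0.080   # text-to-speech
LLM_PER_MIN = 0.020   # inference, amortised per conversation minute
TEL_PER_MIN = 0.015   # telephony / SIP

CALL_MINUTES = 4
voice_call = CALL_MINUTES * (STT_PER_MIN + TTS_PER_MIN + LLM_PER_MIN + TEL_PER_MIN)
chat_session = 0.01   # assumed all-in chatbot session cost

print(f"voice ~${voice_call:.2f}/call vs chat ~${chat_session:.2f}/session "
      f"({voice_call / chat_session:.0f}x)")  # top of the 10-50x band
```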

Note 3 - the brand voice gap matters more than the LLM choice. Caller satisfaction correlated more strongly with voice persona match than with which LLM was running underneath. The brands using a custom-cloned front-desk voice scored 30-50 points higher on satisfaction than the ones using stock library voices, even when the underlying LLM was identical. The lesson: spend persona budget before model budget.

Note 4 - multi-language deployments degrade silently. 9 deployments in our audit served callers in 2+ languages. 7 of those 9 had material quality drop in the secondary language that operators only discovered through transcript review. The pattern: ship in primary language, validate, then expand. Do not ship multi-language on Day 1 with a single quality bar.

Note 5 - the 4-minute ack SLA is the most important number. Across the working deployments, the on-call ack SLA on emergency-routed calls was 4 minutes during business hours. Anything looser produced complaints. Anything tighter required staffing that broke the unit economics. Four minutes is a rule of thumb worth treating as load-bearing.

The fix in every case is the operational pattern that shows up across every audit luup runs: documented runbooks, named owners, single source of truth, weekly review cadence, on-call coverage with written SLA. The cross-vertical pattern is documented in the 50-firm AI stack audit (50-firm audit) and the 25-hour-week services pattern (25-hour week playbook). Run it on your specific stack at luup voice agents, or book a review.

Last updated: 4 May 2026.
