AI Voice Agents for Inbound Order Calls: How Far Are We

A typical inbound order call at a distributor lasts three to six minutes. A customer calls in, the inside sales rep recognizes the company from the caller ID, pulls up the account, asks what is needed, looks up the parts, checks inventory at the relevant warehouse, confirms a contract price, places the order in the ERP, and gives the customer an ETA. The whole exchange has maybe twenty-five sentences and a fair amount of tab-flipping on the rep's screen.

The question this post answers is: in May 2026, how much of that call can an AI voice agent actually do? Not as a phone-tree replacement that routes to a human after gathering the customer's account number, but as the primary handler, the way a junior inside sales rep handles it after their first three months.

The short answer is "more than people think, less than vendors claim." The interesting answer is in the architecture that determines which side of the gap a given deployment lands on.

What inbound order calls actually contain

Before talking about what an agent can do, it helps to be specific about what the call contains. A typical call decomposes into roughly five segments:

1. Identification. The agent identifies the caller (usually from ANI / caller ID, sometimes by asking) and the account. Trivial in 2026; most CRMs do this with a caller-ID lookup against contact records. Maybe ten seconds.
2. Intent capture. "I need to reorder the 50-amp breakers I got last week" or "I want to check on PO 7842." Natural language, occasionally jargon-heavy, sometimes with a brand-name vs. part-number ambiguity ("the Cooper ones": which model? The customer does not always know either).
3. Lookup. The agent looks up the relevant parts, inventory, pricing, or order status in the ERP. The system-side work is straightforward; the customer-facing work is the conversation while the lookup is happening.
4. Decision. The customer decides: yes ship that, yes change quantity, no I want the alternate. The agent often guides the decision ("we have the brand-equivalent in stock at the Cleveland warehouse, or the original brand will be in tomorrow at Akron; which would you like?").
5. Confirmation and close. Order placed in ERP, confirmation read back, ETA stated, ticket created if there is a downstream task. Maybe twenty seconds.

Of those five, segments 1, 3, and 5 are largely deterministic. They are system calls with predictable shape. Segments 2 and 4 are the genuinely conversational parts, where the rep is doing the work of being a person.

AI voice agents are good today at segments 1, 3, and 5. They are not yet good at the trickier corners of 2 and 4.

The state of the art, concretely

The realtime voice stack in May 2026 looks roughly like this:

Speech recognition. Sub-300ms streaming ASR with word error rates near 4% on enterprise jargon. Solved problem for English; competitive for Spanish, Mandarin, and a handful of others. The remaining gaps are heavy accents, names, and shop-floor background noise, all of which still hurt accuracy in distributor environments specifically.
Speech synthesis. Studio-quality voices with sub-200ms time-to-first-byte. Realistic enough that customers do not usually clock the agent as synthetic in the first few seconds. The remaining gaps are emotional register (the difference between "neutral confirming an order" and "warm reassuring a frustrated customer") and clean handling of stuttered or hedged customer speech.
Realtime LLMs. OpenAI, Anthropic, Google, and several open-weights labs ship realtime conversational models with audio I/O. Latency from end-of-customer-speech to first-byte-of-agent-speech ranges from 350 to 900ms in practice. Sub-500ms feels conversational; anything past 700ms feels like a phone tree.
Tool use and function calling. Mature. The model can call a function (ERP lookup, inventory check, order placement) and incorporate the result into the next turn within ~250ms once the tool call completes. The bottleneck is rarely the model; it is the tool call latency, which is bounded by the ERP's own response time.
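
One way to reason about the numbers above is as a per-turn latency budget. The sketch below is a simplification under stated assumptions: it sums the stages serially (real streaming pipelines overlap them), the class and field names are hypothetical, and the example values are illustrative rather than measured. The 500ms and 700ms thresholds are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """Per-turn latency in milliseconds. Example values are illustrative."""
    model_ms: int          # end of customer speech to first byte of agent speech
    tool_ms: int = 0       # ERP round trip, if the turn needs a tool call
    post_tool_ms: int = 0  # time to fold the tool result into the reply

    def total_ms(self) -> int:
        # Simplification: stages are summed serially; real pipelines overlap.
        return self.model_ms + self.tool_ms + self.post_tool_ms


def feel(total_ms: int) -> str:
    """Bucket a turn using the thresholds quoted in the text."""
    if total_ms < 500:
        return "conversational"
    if total_ms > 700:
        return "phone tree"
    return "borderline"


no_tool = TurnLatency(model_ms=400)
slow_erp = TurnLatency(model_ms=400, tool_ms=1200, post_tool_ms=250)

print(feel(no_tool.total_ms()))   # conversational
print(feel(slow_erp.total_ms()))  # phone tree
```

The second case is the one that matters in practice: a fast model cannot hide a slow ERP, which is why the agent usually needs something to say while the lookup runs.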

The composite system (ASR, then realtime LLM with tool use, then TTS) is fluent enough today that a customer who is not paying close attention will treat it as a human for the first minute or two. The question is what happens at minute three when something goes wrong.

What works today (and ships)

There is a concrete set of inbound call types where AI voice agents are genuinely production-ready:

Order status checks. "Where is PO 7842?" The agent identifies the customer, looks up the PO in the ERP, reads back the status and ETA, offers to text a tracking link. End-to-end in under 60 seconds. Customer satisfaction matches or exceeds the rep version because the agent never has to put them on hold.
Reorders of recently-ordered parts. "I need to reorder what I got last week." The agent pulls the recent orders, confirms the parts and quantities, checks inventory, places the order. Works because the search space is bounded; the agent does not need to disambiguate part numbers, it just confirms last week's selection.
In-stock checks. "Do you have the 50-amp Cooper breaker in Cleveland?" The agent looks up the part, the warehouse, reads back the count. Trivially deterministic.
Quote requests on known SKUs. The agent generates a quote on a contract price, emails or texts it to the customer's contact-on-file, marks the conversation in the CRM. Works because the pricing model is deterministic; the agent is not negotiating, it is reading from a contract.
After-hours coverage. The agent handles segments 1-3 (identification, intent, lookup), captures detail, and either places the simple order or queues the complex one for a rep callback first thing in the morning. Here the agent replaces the voicemail that nobody listens to until 10 AM.

These five together cover something like 50-70% of inbound order calls at a distributor with mature digital ordering. The customers who call instead of ordering online tend to be doing so for one of the above reasons. The remaining 30-50% are the genuinely complex calls where a rep adds judgment, and we will get to those.
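
The reorder case is a good illustration of why a bounded search space helps. A minimal sketch, assuming hypothetical in-memory stand-ins for the ERP (the account ID, SKUs, and quantities are invented for illustration): the agent never searches the catalog, it only checks last week's lines against current inventory.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    sku: str
    quantity: int


# Hypothetical stand-ins for ERP data; a real deployment would call the ERP
# through its integration layer instead.
RECENT_ORDERS = {"ACME-001": [OrderLine("BRK-50A", 12), OrderLine("LUG-2", 40)]}
INVENTORY = {"BRK-50A": 30, "LUG-2": 10}


def plan_reorder(account_id: str) -> dict[str, list[OrderLine]]:
    """Split the most recent order into lines that can ship now and lines
    that need a decision from the customer (or a rep)."""
    lines = RECENT_ORDERS.get(account_id, [])
    ready = [ln for ln in lines if INVENTORY.get(ln.sku, 0) >= ln.quantity]
    short = [ln for ln in lines if INVENTORY.get(ln.sku, 0) < ln.quantity]
    return {"ready": ready, "short": short}


plan = plan_reorder("ACME-001")
print([ln.sku for ln in plan["ready"]])  # ['BRK-50A']
print([ln.sku for ln in plan["short"]])  # ['LUG-2']
```

The "short" lines are where the conversation moves into the decision segment: the agent offers an alternate or a later date rather than silently placing a partial order.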

What does not work (yet)

The list of things the agent should not be handling alone:

Ambiguous part identification. "I need the part that goes on top of the thing, you know what I mean." A human rep with five years of experience knows what the customer means and asks the right disambiguating question. An agent will guess wrong about 30% of the time, with the failure mode of cheerfully ordering the wrong thing.
Negotiation. "Can you do better on the price?" The agent does not have the authority to negotiate, and even if it did, the customer wants to feel like they negotiated. This is a human-rep job for the foreseeable future.
Account-level decisions. "Can I get net-60 terms instead of net-30?" Credit decisions, payment terms, contract changes: these are not something a rep handles autonomously either; they escalate to a manager. The agent should escalate the same way.
Complaints and refunds. Frustrated customers are not the place to deploy an agent. Even if the agent could technically resolve the issue, the customer needs to feel heard by a person.
Net-new product education. "I'm trying to figure out what I need for this job." The customer wants a consultative conversation, not an order-taker. The agent fails badly here.

The pattern is consistent: the agent is good at deterministic, bounded conversations and bad at conversations that require judgment, authority, or genuine empathy. That is the line.
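
That line can be encoded as an allowlist. The sketch below assumes an upstream intent classifier that emits string labels; the label names are hypothetical and simply mirror the "works today" list. The design choice worth noting is the default: anything not explicitly allowed goes to a human, including intents the classifier has never seen.

```python
from enum import Enum


class Route(Enum):
    AGENT = "agent"
    HUMAN = "human"


# Hypothetical intent labels for the bounded, deterministic call types.
AGENT_INTENTS = {"order_status", "reorder", "stock_check", "quote_known_sku"}


def route(intent: str) -> Route:
    """Default-deny routing: only known, bounded intents reach the agent."""
    return Route.AGENT if intent in AGENT_INTENTS else Route.HUMAN


print(route("reorder"))       # Route.AGENT
print(route("negotiation"))   # Route.HUMAN
print(route("never_seen"))    # Route.HUMAN
```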

The architecture that actually matters

What separates a deployment that works from one that fails is rarely the model choice. It is the surrounding architecture. Four pieces matter most:

Tool access through a CRM-grade integration layer. The agent's tools (ERP read, inventory check, order placement, ticket create) have to run with the same guardrails as a human rep: same access controls, same audit log, same error handling. A bolt-on voice agent that talks to the ERP through its own bespoke API client is going to create permission drift, audit gaps, and silent failure modes. The agent needs to be a tenant of the CRM, sharing the same integration layer.

The Factory architecture uses MCP for this. The voice agent is an MCP client that calls the same tools the chat assistant calls, which call the same gateway services the CRM UI calls. One toolchain, one audit log, one set of guardrails.
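
To make "one toolchain, one audit log, one set of guardrails" concrete, here is a minimal sketch in plain Python. It is not the MCP SDK and not the Factory implementation; the class, caller names, and tool names are all hypothetical. The point it illustrates is that the permission check and the audit entry live in the gateway, so every client gets them whether it is the voice agent or the chat assistant.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolGateway:
    """Single choke point: permissions are checked and calls are logged here,
    regardless of which client is calling."""
    tools: dict[str, Callable[..., object]] = field(default_factory=dict)
    permissions: dict[str, set[str]] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def call(self, caller: str, tool: str, **args) -> object:
        allowed = tool in self.permissions.get(caller, set())
        # Denied attempts are logged too, so the audit trail has no gaps.
        self.audit_log.append(
            {"caller": caller, "tool": tool, "args": args, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{caller} may not call {tool}")
        return self.tools[tool](**args)


gateway = ToolGateway(
    tools={"check_inventory": lambda sku: {"sku": sku, "on_hand": 30}},
    permissions={
        "voice_agent": {"check_inventory"},
        "chat_assistant": {"check_inventory"},
    },
)

print(gateway.call("voice_agent", "check_inventory", sku="BRK-50A"))
```

A bolt-on agent with its own ERP client would bypass `call` entirely, which is exactly how permission drift and audit gaps appear.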

Graceful escalation. The agent has to know when to hand off. Detection signals: customer frustration in voice tone, third repetition of the same question, any of the "does not work" cases above, customer explicitly asking for a human. The handoff has to be clean; the agent introduces the situation to the rep before transferring, the rep picks up the call with full context, no "let me start over" from the customer. Building this handoff is harder than building the agent itself.
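
The detection signals above can be sketched as a single check that returns reasons rather than a bare yes/no, so the reasons can travel with the handoff. Everything here is an assumption for illustration: the frustration score is taken to be a 0-to-1 output from some tone classifier, the threshold is arbitrary and would need tuning on real calls, and the intent labels are hypothetical names for the "does not work" cases.

```python
from dataclasses import dataclass


@dataclass
class CallState:
    frustration_score: float = 0.0  # assumed 0..1 output of a tone classifier
    repeat_count: int = 0           # times the customer asked the same question
    intent: str = ""
    asked_for_human: bool = False


# Hypothetical labels for the cases the agent should not handle alone.
HUMAN_ONLY_INTENTS = {
    "ambiguous_part", "negotiation", "account_terms",
    "complaint", "product_education",
}
FRUSTRATION_THRESHOLD = 0.7  # illustrative value, not a recommendation


def escalation_reasons(state: CallState) -> list[str]:
    """Return every reason to hand off; an empty list means keep going."""
    reasons = []
    if state.asked_for_human:
        reasons.append("customer asked for a human")
    if state.frustration_score >= FRUSTRATION_THRESHOLD:
        reasons.append("frustration detected")
    if state.repeat_count >= 3:
        reasons.append("question repeated three times")
    if state.intent in HUMAN_ONLY_INTENTS:
        reasons.append(f"human-only intent: {state.intent}")
    return reasons


print(escalation_reasons(CallState(repeat_count=3, intent="negotiation")))
```

The returned list is what the agent would read to the rep before transferring, which is what spares the customer from starting over.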

Conversation transcripts as first-class CRM records. Every call produces a full transcript, plus a structured summary, plus the tool calls the agent made. These land in the CRM as activities on the account, searchable, reportable, auditable. The reps trust the agent if they can see what it did; they distrust it if it is a black box. The omnichannel section covers this surface.
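
As a rough picture of what such a record might hold, here is a sketch of the three parts named above (transcript, structured summary, tool calls) as one serializable object. The field names and sample values are hypothetical, not the actual CRM schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict
    result_summary: str


@dataclass
class CallRecord:
    account_id: str
    transcript: list[str]  # speaker-labelled turns, in order
    summary: str
    tool_calls: list[ToolCall] = field(default_factory=list)


record = CallRecord(
    account_id="ACME-001",
    transcript=["customer: Where is PO 7842?", "agent: It ships tomorrow."],
    summary="Order status check for PO 7842.",
    tool_calls=[ToolCall("order_status", {"po": "7842"}, "ships tomorrow")],
)

# asdict recurses into nested dataclasses, so the record serializes as-is.
payload = json.dumps(asdict(record), indent=2)
print(payload)
```

Keeping the tool calls next to the transcript is what lets a rep check what the agent said against what it actually did.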

Per-tenant deployment, not shared infrastructure. Distributors are skittish about a SaaS vendor's agent talking to their ERP on shared infrastructure that handles competitors' calls too. The cleaner deployment is a per-tenant voice agent: same model, isolated config, isolated audit trail. This is also how you avoid one tenant's prompt-injected customer breaking another tenant's agent.
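
A minimal sketch of "same model, isolated config, isolated audit trail", assuming hypothetical tenant IDs, URLs, and a placeholder model name. Each tenant gets its own agent object with its own immutable config and its own log; nothing mutable is shared between them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    erp_base_url: str   # hypothetical; each tenant points at its own ERP
    system_prompt: str


@dataclass
class TenantAgent:
    config: TenantConfig
    model: str = "shared-realtime-model"  # placeholder: same model everywhere
    audit_log: list[str] = field(default_factory=list)  # one list per agent

    def record(self, event: str) -> None:
        self.audit_log.append(f"{self.config.tenant_id}: {event}")


agents = {
    tid: TenantAgent(TenantConfig(tid, f"https://erp.{tid}.example", "Be concise."))
    for tid in ("tenant-a", "tenant-b")
}

agents["tenant-a"].record("order placed")
print(agents["tenant-a"].audit_log)  # ['tenant-a: order placed']
print(agents["tenant-b"].audit_log)  # []
```

In a real deployment the isolation would be enforced at the infrastructure level as well; the object boundary here only illustrates the shape.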

What we expect in the next 18 months

The trajectory is steep. Two specific changes are likely to land:

Sub-300ms end-to-end latency as realtime models get faster and TTS pipelines mature. At 300ms, the agent feels indistinguishable from a person on the conversational dimension. The agent is no longer "the AI" to the customer.
Cross-call memory at the account level. The agent remembers what the customer ordered last month, what their preferences are, and where the last conversation ended. This is one of the largest leaps from "phone tree replacement" to "junior rep," and it is mostly a data-layer problem, not a model problem.
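
To show why cross-call memory is mostly a data-layer problem, here is a sketch of an account-level store that produces context for the next call. It is an in-memory stand-in with hypothetical names; a real version would live in the CRM's data layer and would need retention and access rules this sketch ignores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallNote:
    date: str     # ISO date, so string order matches chronological order
    outcome: str


class AccountMemory:
    """In-memory stand-in for an account-level store of past call outcomes."""

    def __init__(self) -> None:
        self._notes: dict[str, list[CallNote]] = defaultdict(list)

    def remember(self, account_id: str, note: CallNote) -> None:
        self._notes[account_id].append(note)

    def context_for(self, account_id: str, limit: int = 3) -> list[str]:
        """Most recent outcomes first, ready to prepend to the agent's prompt."""
        notes = sorted(
            self._notes.get(account_id, []), key=lambda n: n.date, reverse=True
        )
        return [f"{n.date}: {n.outcome}" for n in notes[:limit]]


memory = AccountMemory()
memory.remember("ACME-001", CallNote("2026-04-02", "reordered 50-amp breakers"))
memory.remember("ACME-001", CallNote("2026-05-11", "asked about Akron delivery"))
print(memory.context_for("ACME-001"))
```

The model's job is unchanged; it simply starts the call with the last few outcomes already in context.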

We expect the 50-70% of calls AI handles today to be 75-85% in 18 months, with the remaining 15-25% being the ones that genuinely require a human. The economics will be undeniable for most distributors. Every call the agent handles is one a rep is free to spend on consultative work instead.

The thing to avoid in the meantime is a deployment that overpromises. An agent that fails on the complex 30% of calls but is marketed as "handles all your inbound calls" is going to erode customer trust faster than the wins on the other 70% can build it. The honest framing is "handles the deterministic calls so the reps can focus on the consultative ones," and it is true today.

Getting started

If you are evaluating voice agents for inbound order work, the Factory Labs telephony stack (browser softphone, voice agent, conversation transcripts) is part of Professional. The architecture is what this post described: MCP-tooled, per-tenant, integrated with the same CRM gateway the chat assistant and the human reps use.

/contact is the way to talk through a specific deployment. The /docs/guides/telephony section has the technical setup.