AI Voice Agent

An AI voice agent is a software system that handles voice phone calls end-to-end using a large language model and speech models. In B2B distribution it covers inbound order calls, customer service, and structured handoff to human reps when the conversation needs it.

Definition

An AI voice agent is a software system that handles voice phone calls end-to-end using a large language model (LLM) for reasoning, a speech-to-text (STT) model for incoming audio, and a text-to-speech (TTS) model for outgoing audio. The agent runs autonomously: it picks up calls, identifies the caller, captures intent, performs actions in connected systems (a CRM, an ERP, a knowledge base), and decides when to escalate to a human.
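The STT, LLM, and TTS stages described above can be pictured as a single turn loop. The following is a minimal sketch, not a production design: the `VoiceAgent` class and the stub functions are hypothetical names introduced here, and the stubs stand in for real speech and language models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class VoiceAgent:
    """One conversational turn: audio in -> transcript -> reply text -> audio out."""
    stt: Callable[[bytes], str]                         # speech-to-text model
    llm: Callable[[List[Tuple[str, str]], str], str]    # reasoning over history + new utterance
    tts: Callable[[str], bytes]                         # text-to-speech model
    history: List[Tuple[str, str]] = field(default_factory=list)

    def handle_audio(self, audio: bytes) -> bytes:
        caller_text = self.stt(audio)
        reply_text = self.llm(self.history, caller_text)
        # Keep both sides of the turn so later turns (and any handoff) have context.
        self.history.append(("caller", caller_text))
        self.history.append(("agent", reply_text))
        return self.tts(reply_text)


# Stub models (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(history: List[Tuple[str, str]], utterance: str) -> str:
    return f"You said: {utterance}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = agent.handle_audio(b"reorder SKU 123")
```

A real deployment would stream audio and call tools between the STT and TTS steps; the sketch only shows the order in which the three models are involved.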

This is distinct from older IVR ("press 1 for sales") systems, which route callers through fixed menus and keyword matching rather than a language model.

What an AI voice agent actually does (today, well)

In 2026 the realistic, production-shipping use cases for B2B voice agents are:

  • Inbound order calls. A customer calls to reorder a familiar SKU; the agent identifies the customer, looks up the account in the ERP, takes the order, confirms available-to-promise (ATP) quantity and ship date, and creates the sales order. Volume: 30-60% of inbound order traffic for a typical B2B distributor falls in this lane.
  • Status inquiries. "Where is my order PO12345?" The agent looks up the order in the ERP, reads back the status, and gives the customer the tracking number. Volume: high.
  • Routing. The caller says what they need, and the agent routes the call to the right rep with context already in the rep's screen pop. Volume: very high.
  • After-hours handling. When humans are not available, the agent takes a structured message that lands in the CRM as a case for the next-day team. Volume: significant.
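As an illustration of the status-inquiry lane, here is a small sketch that pulls an order number out of an utterance and answers from an order store. Everything in it is an assumption made for the example: the in-memory `ORDERS` dictionary stands in for an ERP lookup, the tracking value is a placeholder, and the regular expression is a simple substitute for the entity extraction an LLM would normally perform.

```python
import re
from typing import Optional

# Hypothetical stand-in for an ERP query result.
ORDERS = {"PO12345": {"status": "shipped", "tracking": "TRACK-0001"}}

# Matches "PO12345", "po 12345", or "PO-12345".
PO_PATTERN = re.compile(r"\bPO[\s-]?(\d+)\b", re.IGNORECASE)


def extract_po(utterance: str) -> Optional[str]:
    """Return a normalized PO number such as 'PO12345', or None if absent."""
    match = PO_PATTERN.search(utterance)
    return f"PO{match.group(1)}" if match else None


def answer_status(utterance: str) -> str:
    po = extract_po(utterance)
    if po is None:
        return "Which order number are you asking about?"
    order = ORDERS.get(po)
    if order is None:
        # Unknown order: hand off rather than guess.
        return f"I could not find order {po}. Let me get a rep to help."
    return f"Order {po} is {order['status']}. The tracking number is {order['tracking']}."
```

For example, `answer_status("Where is my order PO12345?")` reads back the stored status and tracking value, while an utterance with no order number triggers a clarifying question.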

What it does not do (yet, reliably)

  • Complex part identification. "I need the one with the brass fitting, the kind we got last summer." Humans are still better at this.
  • Negotiation. Price negotiation, terms negotiation, custom commercial conversations. Humans win.
  • Sensitive escalation. Angry customer, complex complaint, financial dispute. Hand off cleanly to a human.
  • Long-form discovery. Multi-step solution selling. Voice agents are good at transactional flows, not at trusted-advisor conversations.

A production-ready voice agent recognizes these limits and escalates without making the customer fight to get a human.

What "escalation" looks like done well

The handoff from agent to human is the single hardest part. Done badly, it leaves both the customer and the rep worse off than having no agent at all. Done well, it looks like:

  1. The trigger. A frustrated tone, a third repetition of the same request, a complex part-ID question, or an explicit "let me talk to a human."
  2. The handshake. The agent says clearly: "I am going to transfer you to a rep who can help with that. Hold on for just a moment."
  3. The screen pop. The receiving rep gets the caller's identity, the conversation transcript so far, the order context the agent has already pulled, and a one-line summary of why the agent escalated.
  4. The continuation. The rep picks up the conversation without asking "what is your account number" because they already have it.

When this works, the customer experience is "the system handled the easy part fast and got me to a human when I needed one." When it does not, the customer experience is "the system wasted my time and the rep had to start over."
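The trigger and screen-pop steps can be sketched as two small functions. This is a hedged illustration, not Factory Labs' implementation: the phrase list, intent names, field names, and the `frustrated` flag (assumed to come from some upstream tone classifier) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Illustrative values only.
HUMAN_REQUEST_PHRASES = ("talk to a human", "speak to a person", "representative")
ALWAYS_ESCALATE_INTENTS = {"part_identification", "negotiation", "complaint"}
MAX_REPEATS = 3


@dataclass
class CallState:
    caller_id: str
    account_number: str
    transcript: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    order_context: Dict[str, str] = field(default_factory=dict)


def escalation_reason(state: CallState, utterance: str, intent: str,
                      frustrated: bool) -> Optional[str]:
    """Return a one-line reason to escalate, or None to keep handling the call."""
    lowered = utterance.strip().lower()
    if any(phrase in lowered for phrase in HUMAN_REQUEST_PHRASES):
        return "caller asked for a human"
    if intent in ALWAYS_ESCALATE_INTENTS:
        return f"intent '{intent}' is routed to humans"
    if frustrated:
        return "caller sounds frustrated"
    # Count earlier identical caller utterances, plus the current one.
    repeats = 1 + sum(
        1 for speaker, text in state.transcript
        if speaker == "caller" and text.strip().lower() == lowered
    )
    if repeats >= MAX_REPEATS:
        return "caller repeated the same request three times"
    return None


def build_screen_pop(state: CallState, reason: str) -> dict:
    """Bundle what the receiving rep needs so they do not have to start over."""
    return {
        "caller_id": state.caller_id,
        "account_number": state.account_number,
        "transcript": list(state.transcript),
        "order_context": dict(state.order_context),
        "escalation_summary": reason,
    }
```

The point of returning a reason string rather than a boolean is that the same value becomes the one-line summary on the rep's screen.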

How Factory Labs handles voice agents

Factory Labs ships per-tenant voice agents that share the CRM's Model Context Protocol (MCP) tool surface. The agent identifies the caller against the contact directory, queries the ERP through the gateway, can place orders, can update cases, and escalates with full context to a human rep through the same softphone the rep already uses.
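For readers unfamiliar with MCP, a tool invocation is a JSON-RPC 2.0 request using the `tools/call` method. The sketch below builds such a request; the tool name `erp_get_order_status` and its arguments are hypothetical and are not taken from Factory Labs' actual tool surface.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


# Hypothetical tool name and argument, for illustration only.
message = build_tool_call(1, "erp_get_order_status", {"po_number": "PO12345"})
```

Because the voice agent and the CRM share the same tool surface, the same request shape would serve both; only the transport and the caller differ.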

Per-tenant configuration:

  • Greeting prompts, hold prompts, escalation prompts (per language).
  • Allowed actions (some tenants enable order placement; others limit the agent to read-only).
  • Escalation triggers (which intent categories always go to a human).
  • Recording posture (per jurisdiction).
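One way to picture this configuration is as a small typed record per tenant. The sketch below is an assumption about shape only: the class, field names, action names, jurisdiction labels, and policy labels are all hypothetical placeholders, not the product's schema or legal guidance.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class TenantVoiceConfig:
    tenant_id: str
    prompts: Dict[str, Dict[str, str]]        # language -> prompt name -> text
    allowed_actions: FrozenSet[str]           # e.g. read-only vs. order placement
    always_escalate_intents: FrozenSet[str]   # intents that always go to a human
    recording_policy: Dict[str, str]          # jurisdiction -> policy label
    default_recording_policy: str = "ask_consent"

    def can(self, action: str) -> bool:
        return action in self.allowed_actions

    def prompt(self, language: str, name: str, fallback_language: str = "en") -> str:
        # Fall back to the default language if this one lacks the prompt.
        by_language = self.prompts.get(language, {})
        if name in by_language:
            return by_language[name]
        return self.prompts[fallback_language][name]

    def recording_policy_for(self, jurisdiction: str) -> str:
        # Unknown jurisdictions get the most conservative default.
        return self.recording_policy.get(jurisdiction, self.default_recording_policy)


read_only_tenant = TenantVoiceConfig(
    tenant_id="tenant-a",
    prompts={"en": {"greeting": "Thanks for calling.", "hold": "One moment, please."},
             "es": {"greeting": "Gracias por llamar."}},
    allowed_actions=frozenset({"lookup_order", "create_case"}),
    always_escalate_intents=frozenset({"negotiation", "complaint"}),
    recording_policy={"region-a": "announce_only"},
)
```

With this shape, a tenant that limits the agent to read-only work simply omits order placement from `allowed_actions`, and the agent checks `can(...)` before attempting any write.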

See /integrations/twilio and the voice agents post for the operational detail.

Trade-offs

  • First-call resolution depth. A well-tuned voice agent can handle 40-60% of B2B inbound traffic end to end. The rest still needs humans.
  • Per-call cost. STT + LLM + TTS for a 3-minute call costs roughly $0.05-0.20 depending on model choices. Cheaper than a human rep for transactional calls, more expensive than a touch-tone IVR.
  • Compliance. Recording laws vary by state; consent flows have to be configured correctly.

These trade-offs are usually acceptable: the math is favorable for transactional inbound traffic and less favorable for high-touch advisory calls.
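The per-call cost figure is a sum of three per-minute rates multiplied by call length. The sketch below shows that arithmetic; the rates are placeholder values chosen only to land inside the $0.05-0.20 range quoted above, not prices from any vendor.

```python
def call_cost(minutes: float, stt_per_min: float, llm_per_min: float,
              tts_per_min: float) -> float:
    """Cost of one call in dollars: minutes x (STT + LLM + TTS per-minute rates)."""
    return minutes * (stt_per_min + llm_per_min + tts_per_min)


def monthly_cost(calls_per_month: int, cost_per_call: float) -> float:
    return calls_per_month * cost_per_call


# Placeholder rates in dollars per minute (assumptions, not vendor pricing).
cheap_stack = call_cost(3, stt_per_min=0.005, llm_per_min=0.005, tts_per_min=0.010)
premium_stack = call_cost(3, stt_per_min=0.02, llm_per_min=0.02, tts_per_min=0.02)

# Hypothetical volume of 10,000 agent-handled calls per month.
monthly_low = monthly_cost(10_000, cheap_stack)
monthly_high = monthly_cost(10_000, premium_stack)
```

Under these assumed rates a 3-minute call costs about $0.06 on the cheaper stack and about $0.18 on the more expensive one, so model choice moves the monthly bill by roughly a factor of three at the same volume.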

Related terms

  • IVR (Interactive Voice Response). The pre-LLM pattern-matching predecessor. Often combined with AI agents today (the IVR handles the first menu, the agent handles the conversation).
  • CTI (Computer Telephony Integration). The wiring between phone systems and CRM, which AI voice agents rely on for screen pops and routing.
  • Model Context Protocol (MCP). How the voice agent calls the CRM and ERP through standardized tools.

Further reading