Most voice AI stacks stall at the same point: getting audio in and out of an existing PBX without ripping it out.
This guide covers the full setup: SIP gateway config, 3CX trunk registration, LLM and TTS selection for low latency, and production monitoring for voice AI agents running on 3CX.
According to the LiveKit engineering blog, SIP-based voice AI deployments using IP-based authentication achieve 30–40% lower call setup latency compared to credential-based alternatives — a meaningful advantage in real-time conversation scenarios where every millisecond of delay is perceptible.
- 3CX uses IP-based SIP auth; Vapi, Retell, and Bland use credential-based auth, so they can't connect directly to 3CX. LiveKit's SIP gateway is the bridge.
- Deploy LiveKit in the same cloud region as your LLM to achieve sub-300ms latency
- Use G.711 ulaw (PCMU) as the codec on the 3CX trunk: negligible codec delay and no transcoding step before STT
- Silero VAD with min_silence_duration=0.3s is the sweet spot for English business conversations
- Production failover requires two LiveKit instances behind a load balancer — not just one server
- GPT-4o-mini or Llama 3 70B via Groq are the LLM choices for sub-150ms time-to-first-token
Why LiveKit Works for 3CX Voice AI Integration
The authentication mismatch between 3CX and AI voice platforms isn't a configuration problem. It's structural. 3CX SIP trunks use IP authentication — the carrier trusts your IP with no username or password. Vapi, Retell, and Bland all run credential-based SIP; they expect a username and password on every inbound call leg. 3CX sends none.
LiveKit's SIP gateway doesn't have this problem. It runs its own SIP endpoint that accepts inbound calls from whitelisted IP addresses. You add your 3CX public IP to the trunk allowlist, and 3CX sends calls over a SIP trunk to LiveKit the same way it would to any IP-authenticated PSTN carrier. There is no credential negotiation and no registration step to fail.
Beyond auth compatibility, LiveKit gives you direct access to the audio stream inside a room. Your AI agent subscribes to the audio track, runs STT, generates a response via LLM, and publishes TTS audio back into the room. The whole pipeline runs in your infrastructure. Call audio never touches a third-party AI platform's managed environment — important for teams with data residency or compliance requirements.
The other advantage is WebRTC under the hood. LiveKit uses WebRTC for internal room transport, giving you adaptive bitrate, packet loss recovery, and jitter buffering without building any of it yourself. For voice AI agents handling real calls, that resilience matters more than it does in lab tests.
LiveKit 3CX Voice AI Architecture: How Calls Route to Your Agent
The call path from a 3CX phone system to a LiveKit voice AI agent works like this:
- Customer dials your DID
- 3CX receives the call on its configured DID, routes it through a queue or ring group
- When the ring timeout expires, 3CX forwards the call out via the LiveKit SIP trunk
- LiveKit's SIP gateway receives the call, matches it to a dispatch rule, and places the call into a LiveKit room
- Your AI agent, running in that room, picks up the audio stream and handles the conversation
- Call outcome is logged to your CRM via webhook at call end
Audio Codec Path
3CX sends G.711 ulaw (PCMU) over the SIP trunk by default. LiveKit accepts PCMU and handles any internal transcoding needed for your AI pipeline. Keep G.711 on the 3CX side: it is simple 8-bit companded PCM at 64 kbps, adds negligible codec processing latency, and is accepted by the major STT providers without a conversion step. Disable G.729 and G.722 on the LiveKit-facing trunk so call setup never has to negotiate or fall back between codecs.
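The numbers behind that codec choice are easy to check. The sketch below works out the payload bitrate and per-packet payload size for a G.711 stream. The 20 ms packet interval is an assumption (a common default), and your trunk may negotiate a different value.

```python
# G.711 (PCMU) stream arithmetic.
SAMPLE_RATE_HZ = 8000   # G.711 samples narrowband audio at 8 kHz
BITS_PER_SAMPLE = 8     # each sample is companded to 8 bits
PTIME_MS = 20           # assumed packetization interval, not taken from 3CX


def g711_bitrate_bps() -> int:
    """Payload bitrate of one direction of a G.711 stream."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def payload_bytes_per_packet(ptime_ms: int = PTIME_MS) -> int:
    """Audio bytes carried in one RTP packet at the given packet interval."""
    samples = SAMPLE_RATE_HZ * ptime_ms // 1000
    return samples * BITS_PER_SAMPLE // 8


print(g711_bitrate_bps())          # 64000
print(payload_bytes_per_packet())  # 160
```

These figures cover audio payload only. RTP, UDP, and IP headers add to the on-wire bandwidth.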
AI Agent Components Inside the LiveKit Room
- STT (speech-to-text): Deepgram Nova-2 or AssemblyAI for real-time transcription with word-level timestamps
- LLM: GPT-4o-mini, Llama 3 70B via Groq, or Mistral for fast first-token response
- TTS: Cartesia Sonic (~50ms), Deepgram Aura (~80ms), or ElevenLabs eleven_turbo_v2_5 (~80ms)
- VAD (voice activity detection): Silero VAD or WebRTC VAD for turn detection and interruption handling
Each component runs in a separate process or container. LiveKit's agent framework (Python SDK or Node.js) handles room connection, track subscription, and the event loop that ties them together.
Third Rock Techkno has deployed real-time voice AI agents on 3CX for businesses across USA, UK, and Australia. We handle architecture, LiveKit deployment, LLM pipeline, and ongoing latency tuning. Talk to our team →
How to Connect LiveKit to 3CX: Step-by-Step SIP Trunk Configuration
Complete the LiveKit setup before touching 3CX. You need the SIP endpoint address before you can configure the 3CX trunk.
Step 1: Deploy LiveKit Server
Deploy LiveKit on a cloud VM in the same region as your LLM API provider. This is the most important infrastructure decision for latency. If your LLM endpoint is in AWS us-east-1, run LiveKit in us-east-1. Cross-region audio adds 40–80ms per hop and makes sub-300ms latency very hard to achieve consistently.
Use LiveKit's official Docker image or Helm chart for Kubernetes. Minimum spec for production:
- 4 vCPU, 8GB RAM, dedicated network interface with a static public IP
- Same cloud region as your primary LLM API provider
- Port 5060 open for SIP (UDP + TCP)
- Ports 50000–60000 open for RTP media
- Static public IP: the 3CX trunk targets a fixed address and has no registration step to track a changing one, so a dynamic IP breaks the trunk
Step 2: Create an Inbound SIP Trunk
Use the LiveKit CLI to create an inbound trunk that accepts calls from your 3CX server's public IP:
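A minimal sketch of the trunk definition, written as a Python script that prints the JSON you would save to a file and pass to the CLI. The field names (`trunk`, `numbers`, `allowed_addresses`) and the command (assumed here to be `lk sip inbound create <file>`) are assumptions about LiveKit's SIP trunk schema, so verify them against the version you have installed. The IP address and phone number are placeholders.

```python
import json
from typing import Dict, List

# Hypothetical values: replace with your 3CX server's public IP and your DID.
THREECX_PUBLIC_IP = "203.0.113.10"  # documentation-range placeholder
DID_NUMBER = "+15105550100"         # placeholder number


def build_inbound_trunk(name: str, numbers: List[str], allowed_ips: List[str]) -> Dict:
    """Return an inbound-trunk definition that only accepts calls from allowed_ips."""
    if not allowed_ips:
        raise ValueError("an IP-authenticated trunk needs at least one allowed address")
    return {
        "trunk": {
            "name": name,
            "numbers": numbers,
            "allowed_addresses": allowed_ips,  # assumed field name
        }
    }


trunk = build_inbound_trunk("3cx-inbound", [DID_NUMBER], [THREECX_PUBLIC_IP])
print(json.dumps(trunk, indent=2))
```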
Note the trunk ID returned — you'll reference it in the dispatch rule. The
Step 3: Create a SIP Dispatch Rule
The dispatch rule tells LiveKit what to do when a call arrives on the inbound trunk — which room to place it in and which agent to connect:
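As a sketch, the rule can be generated the same way as the trunk file. The JSON shape (`dispatch_rule`, `trunk_ids`, `dispatchRuleIndividual`, `roomPrefix`) is an assumption about LiveKit's dispatch rule schema, and the trunk ID and rule name are hypothetical placeholders. The `is_call_room` helper is illustrative only and shows how an agent process might recognize rooms created by this rule.

```python
import json
from typing import Dict

ROOM_PREFIX = "call-"


def build_dispatch_rule(trunk_id: str, room_prefix: str = ROOM_PREFIX) -> Dict:
    """Return a rule that places each inbound call in its own prefixed room."""
    return {
        "dispatch_rule": {
            "name": "3cx-to-agent",  # hypothetical rule name
            "trunk_ids": [trunk_id],
            "rule": {"dispatchRuleIndividual": {"roomPrefix": room_prefix}},
        }
    }


def is_call_room(room_name: str, room_prefix: str = ROOM_PREFIX) -> bool:
    """True if a room name looks like one created by the rule above."""
    return room_name.startswith(room_prefix)


rule = build_dispatch_rule("ST_placeholder")  # use the trunk ID the CLI returned
print(json.dumps(rule, indent=2))
```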
Each inbound call gets its own room with a unique name prefixed with "call-". Your agent process watches for new rooms matching this prefix and joins them automatically. This one-room-per-call model is what makes concurrent call handling clean — there's no shared state between calls.
Step 4: Configure the 3CX SIP Trunk
In 3CX Admin Console, go to SIP Trunks and create a new trunk using these settings:
- Type: Generic SIP Trunk (IP Based) — not Registration Based
- Registrar/Server: Your LiveKit server's public IP
- Port: 5060
- Authentication: Do not require (IP-based auth only)
- Codec: G.711 ulaw (PCMU) as primary — disable G.729 and G.722
- Outbound caller ID: Set $OriginatorCallerID to pass original caller number to LiveKit
- SIP Options: Enable SIP OPTIONS keepalive if your 3CX version supports it
Step 5: Build Your AI Agent
Use LiveKit's Python agent framework to connect your LLM and TTS pipeline. This is the minimal working agent that handles a 3CX-routed call:
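The framework wiring varies between livekit-agents releases, so the sketch below shows only the shape of one conversational turn: transcribe, generate, synthesize. Every function here is a stand-in (none of them is a LiveKit or provider API). In a real agent you would replace each with the corresponding STT, LLM, and TTS plugin and let the framework feed it audio from the room.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CallSession:
    """Per-call state; one instance per room, so calls share nothing."""
    room_name: str
    history: List[Tuple[str, str]] = field(default_factory=list)


async def transcribe(audio: bytes) -> str:
    """Stand-in for streaming STT; here the 'audio' is just UTF-8 text."""
    await asyncio.sleep(0)
    return audio.decode("utf-8")


async def generate_reply(session: CallSession, user_text: str) -> str:
    """Stand-in for the LLM call; echoes so the example stays deterministic."""
    await asyncio.sleep(0)
    return f"You said: {user_text}"


async def synthesize(text: str) -> bytes:
    """Stand-in for TTS; returns bytes that would be published to the room."""
    await asyncio.sleep(0)
    return text.encode("utf-8")


async def handle_turn(session: CallSession, caller_audio: bytes) -> bytes:
    """Run one caller turn through STT, LLM, and TTS, recording the exchange."""
    user_text = await transcribe(caller_audio)
    session.history.append(("caller", user_text))
    reply = await generate_reply(session, user_text)
    session.history.append(("agent", reply))
    return await synthesize(reply)


session = CallSession(room_name="call-demo")
audio_out = asyncio.run(handle_turn(session, b"I need to reschedule"))
print(audio_out.decode("utf-8"))
```

Keeping state in a per-call object mirrors the one-room-per-call model: nothing is shared between concurrent conversations.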
Hitting Sub-300ms: Latency Tuning for LiveKit 3CX Voice AI
Sub-300ms end-to-end latency is achievable, but it requires getting several things right simultaneously. Any single bad decision — wrong region, wrong model, wrong TTS — adds enough latency to push you over. Here's where each millisecond goes and how to control it.
According to Google Research on conversational latency perceptual thresholds, human callers perceive responses above 300ms as noticeably delayed. Below 300ms, the interaction feels natural. This is your hard production target.
LLM Selection: Time-to-First-Token Under 150ms
GPT-4o-mini via OpenAI API delivers time-to-first-token (TTFT) of 80–140ms for most conversational responses. Llama 3 70B via Groq runs even faster at 60–120ms TTFT because Groq's LPU hardware is optimised for inference throughput. Both are good choices for real-time voice AI. Model comparison:
- GPT-4o-mini: 80–140ms TTFT, reliable, good instruction following — recommended default
- Llama 3 70B (Groq): 60–120ms TTFT, lower cost per token, strong for structured tasks
- Mistral 7B: 50–100ms TTFT, lowest cost, trade-off on reasoning quality
- GPT-4o: 200–500ms TTFT under load — exceeds your latency budget, avoid for voice
- Claude Opus / Sonnet: 180–400ms TTFT — too slow for sub-300ms voice target in production
TTS Selection: First Audio Byte Under 100ms
Cartesia Sonic is currently the fastest production TTS for voice AI, with time-to-first-audio-byte around 50ms. Deepgram Aura and ElevenLabs eleven_turbo_v2_5 both run around 80ms. All three are viable for sub-300ms targets. TTS comparison:
- Cartesia Sonic: ~50ms first audio byte, excellent naturalness, streaming output — top choice
- Deepgram Aura: ~80ms first audio byte, competitive naturalness, lower cost than Cartesia
- ElevenLabs eleven_turbo_v2_5: ~80ms first audio byte, best voice naturalness for brand personas
- ElevenLabs standard: 250–400ms — exceeds budget, do not use for voice AI
- Google WaveNet: 100–150ms — borderline, works only if your LLM is very fast
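To see how these choices combine, here is a rough budget check using the TTFT and first-audio-byte ranges quoted in this guide. It is a sketch under a simplifying assumption: only LLM and TTS latency plus an optional cross-region penalty are counted, so STT finalization, end-of-turn detection, and network transit are excluded and need to be measured separately.

```python
from dataclasses import dataclass
from typing import List, Tuple

BUDGET_MS = 300


@dataclass(frozen=True)
class Stage:
    name: str
    best_ms: int
    worst_ms: int


def total_range(stages: List[Stage], cross_region_ms: int = 0) -> Tuple[int, int]:
    """Sum best-case and worst-case latency across all stages."""
    best = sum(s.best_ms for s in stages) + cross_region_ms
    worst = sum(s.worst_ms for s in stages) + cross_region_ms
    return best, worst


def fits_budget(stages: List[Stage], cross_region_ms: int = 0,
                budget_ms: int = BUDGET_MS) -> bool:
    """True only if even the worst case stays within the budget."""
    return total_range(stages, cross_region_ms)[1] <= budget_ms


gpt4o_mini = Stage("GPT-4o-mini TTFT", 80, 140)
gpt4o = Stage("GPT-4o TTFT", 200, 500)
cartesia = Stage("Cartesia Sonic first audio byte", 50, 50)

print(total_range([gpt4o_mini, cartesia]))                       # (130, 190)
print(fits_budget([gpt4o_mini, cartesia]))                       # True
print(fits_budget([gpt4o, cartesia]))                            # False
print(fits_budget([gpt4o_mini, cartesia], cross_region_ms=120))  # False
```

The last line shows why co-location matters: a 120 ms cross-region round trip is enough to push an otherwise comfortable pairing over the 300 ms target.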
VAD Configuration: Silero Sweet Spot Settings
Your VAD settings directly control how fast the agent responds after the caller stops speaking. These Silero VAD settings work well for production English business calls:
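The exact plugin call depends on the Silero plugin version you have installed, so what follows is a library-free sketch of what `min_silence_duration` controls rather than the plugin configuration itself. It takes per-frame speech probabilities (as a VAD model would emit) and reports the frame at which the caller's turn is declared over. The 0.3 s value comes from this guide. The activation threshold and 20 ms frame size are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VadSettings:
    min_silence_duration: float = 0.3  # seconds, value recommended in this guide
    activation_threshold: float = 0.5  # assumed; tune per deployment
    frame_duration: float = 0.02       # assumed 20 ms analysis frames


def end_of_turn_index(speech_probs: List[float],
                      settings: VadSettings = VadSettings()) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None.

    The turn ends once speech has been seen and enough consecutive frames
    fall below the threshold to cover min_silence_duration.
    """
    frames_needed = round(settings.min_silence_duration / settings.frame_duration)
    seen_speech = False
    silent_run = 0
    for i, prob in enumerate(speech_probs):
        if prob >= settings.activation_threshold:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# 10 frames of speech followed by 20 frames of silence.
probs = [0.9] * 10 + [0.1] * 20
print(end_of_turn_index(probs))                                          # 24
print(end_of_turn_index(probs, VadSettings(min_silence_duration=0.5)))   # None
```

A shorter silence window makes the agent respond sooner but increases the chance of cutting in while the caller is pausing mid-sentence.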
The
Region Co-location: The Biggest Single Lever
This is the single most impactful latency decision. Run LiveKit, your STT provider, your LLM API, and your TTS provider in the same cloud region. The round-trip from LiveKit to an out-of-region LLM endpoint alone adds 80–120ms. Recommended regions:
- USA: AWS us-east-1 or GCP us-central1 — widest selection of co-located services
- UK: AWS eu-west-2 (London) for lowest latency to UK callers
- Australia: AWS ap-southeast-2 (Sydney)
- GCC / UAE: AWS me-central-1 (UAE) or me-south-1 (Bahrain), with AWS eu-west-1 (Ireland) as a fallback if the Middle East regions lack your LLM provider
Production Checklist: Monitoring, DTMF, and Failover
A working LiveKit 3CX voice AI demo and a production system are different things. These are the gaps that cause incidents at 2am.
Call Recording and Transcription
LiveKit supports egress recording via its Egress API. For each call room, start a track composite egress to record both sides. Pipe Deepgram's real-time transcription output to your logging system or CRM via webhook at call end. Store recordings in S3 or GCS with a lifecycle policy. 90-day retention is a common starting point, but confirm the period against your own regulatory requirements. Under GDPR, businesses in the UK and EU need a lawful basis for recording calls and must tell callers that recording is taking place, so make sure your AI agent's opening greeting includes a recording disclosure.
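For S3, the lifecycle policy can be a single expiration rule. The sketch below builds a lifecycle configuration document in the shape S3 expects. The `recordings/` prefix and the rule ID are hypothetical, so adjust them to wherever your egress output lands.

```python
import json
from typing import Dict

RETENTION_DAYS = 90  # retention period named in this guide


def recording_lifecycle(prefix: str, days: int = RETENTION_DAYS) -> Dict:
    """Build an S3 lifecycle configuration that expires objects under prefix."""
    if days <= 0:
        raise ValueError("retention must be a positive number of days")
    return {
        "Rules": [
            {
                "ID": f"expire-recordings-{days}d",  # hypothetical rule ID
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }


policy = recording_lifecycle("recordings/")
print(json.dumps(policy, indent=2))
```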
DTMF Handling
3CX sends DTMF as RTP telephone-events (RFC 2833, since superseded by RFC 4733) by default, rather than as in-band audio tones. LiveKit exposes DTMF events on the SIP participant object. Subscribe to DTMF in your agent to handle keypad inputs:
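How you subscribe depends on your SDK version, so this sketch covers only the handling side: mapping telephone-event codes to keypad characters and collecting digits until the caller presses `#`. The `DtmfCollector` class and its `feed` method are hypothetical helpers, not part of LiveKit. You would call `feed` from whatever DTMF callback your SDK provides.

```python
from typing import List, Optional

# RFC 4733 telephone-event codes 0-15 map to these keypad characters.
EVENT_TO_CHAR = "0123456789*#ABCD"


def dtmf_char(event_code: int) -> str:
    """Translate a telephone-event code into its keypad character."""
    if not 0 <= event_code < len(EVENT_TO_CHAR):
        raise ValueError(f"not a DTMF event code: {event_code}")
    return EVENT_TO_CHAR[event_code]


class DtmfCollector:
    """Accumulates keypad digits until the terminator key is pressed."""

    def __init__(self, terminator: str = "#", max_digits: int = 16) -> None:
        self.terminator = terminator
        self.max_digits = max_digits
        self._digits: List[str] = []

    def feed(self, event_code: int) -> Optional[str]:
        """Add one key press; return the completed entry, or None if still collecting."""
        char = dtmf_char(event_code)
        if char == self.terminator:
            return self._flush()
        self._digits.append(char)
        if len(self._digits) >= self.max_digits:
            return self._flush()
        return None

    def _flush(self) -> str:
        entry = "".join(self._digits)
        self._digits.clear()
        return entry


collector = DtmfCollector()
results = [collector.feed(code) for code in (1, 2, 3, 11)]  # caller keys 1 2 3 #
print(results[-1])  # 123
```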
Monitoring: Prometheus and Grafana
LiveKit exports Prometheus metrics. Track these in Grafana for production visibility. Critical metrics to alert on:
- livekit_room_duration_seconds: average call length — sudden drops indicate audio issues
- livekit_participant_joined_total: call volume — unexpected spikes may indicate trunk misconfiguration
- livekit_track_publish_latency_ms: audio pipeline latency — alert on p95 > 400ms
- STT word error rate via Deepgram dashboard — flag anything above 8% for your primary language
- LLM error rate via OpenAI / Groq API metrics — 429 rate limit errors cause call failures
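In practice the p95 alert would live in your Prometheus or Grafana alerting rules. The sketch below only illustrates the check itself on a batch of latency samples, using the nearest-rank percentile method, which is one of several definitions and may differ slightly from what your monitoring stack computes.

```python
from typing import List

P95_ALERT_MS = 400  # alert threshold from the metric list above


def percentile_nearest_rank(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def should_alert(samples: List[float], threshold_ms: float = P95_ALERT_MS) -> bool:
    """True when the p95 of the samples exceeds the alert threshold."""
    return percentile_nearest_rank(samples, 95) > threshold_ms


mostly_fast = [200.0] * 19 + [900.0]      # one outlier in twenty calls
slow_tail = [200.0] * 18 + [450.0, 500.0]  # two slow calls in twenty

print(should_alert(mostly_fast))  # False
print(should_alert(slow_tail))    # True
```

Alerting on p95 rather than the mean keeps a single outlier from paging you while still catching a tail that callers would notice.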
Failover Architecture
Run two LiveKit instances behind an AWS NLB or GCP Network Load Balancer. Configure your 3CX SIP trunk with both IPs in the allowed addresses list and use SRV records or a static failover IP for the registrar. If the primary LiveKit instance goes down, 3CX retries to the secondary within the SIP retry timeout (typically 3–5 seconds). For AI agent failover, run multiple agent worker processes and let LiveKit's dispatch mechanism distribute rooms across available workers.
- Minimum production setup: 2 LiveKit nodes behind NLB, 2 agent workers per node
- 3CX trunk config: Both LiveKit IPs in allowed-addresses, SRV record pointing to NLB
- Agent recovery: Workers auto-restart with systemd or Kubernetes deployment restart policy
- Database failover: Use managed Redis (ElastiCache / Cloud Memorystore) for dispatch state — avoid single-node Redis
Security and Compliance Considerations
A LiveKit 3CX deployment handles real call audio. Security isn't optional. These are the non-negotiable baseline controls for any production deployment handling business calls.
SIP Security
- Restrict the LiveKit SIP port (5060) to your 3CX server's public IP only, and never open it to all sources (0.0.0.0/0)
- Enable SIP TLS (SIPS) if your 3CX version and carrier both support it
- Rotate LiveKit API keys on a quarterly schedule
- Use a firewall rule to block all SIP traffic from unexpected IPs — SIP brute-force attacks are common
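The allowlist logic behind those rules is simple enough to sanity-check in code before you apply it to a firewall or security group. This sketch uses Python's standard `ipaddress` module. The allowlist entry is a placeholder from the documentation address range, and both helper functions are illustrative.

```python
import ipaddress
from typing import List

# Hypothetical allowlist: replace with your 3CX server's public address(es).
ALLOWED_SIP_SOURCES = ["203.0.113.10/32"]


def is_allowed_source(source_ip: str, allowed: List[str] = ALLOWED_SIP_SOURCES) -> bool:
    """True if source_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed)


def is_open_to_world(allowed: List[str]) -> bool:
    """True if any entry matches every address, such as 0.0.0.0/0."""
    return any(ipaddress.ip_network(net).prefixlen == 0 for net in allowed)


print(is_allowed_source("203.0.113.10"))        # True
print(is_allowed_source("198.51.100.7"))        # False
print(is_open_to_world(["0.0.0.0/0"]))          # True
print(is_open_to_world(ALLOWED_SIP_SOURCES))    # False
```

A check like `is_open_to_world` is a cheap guard to run in CI against your infrastructure config so a wide-open SIP rule never reaches production.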
Data Residency and GDPR
For UK and EU deployments subject to GDPR, and GCC deployments subject to UAE PDPL or Saudi PDPL:
- Run LiveKit, STT, and storage entirely within the required geographic region
- Use Deepgram's self-hosted deployment if your compliance policy prohibits sending audio to third-party APIs
- Implement caller consent recording disclosures in the AI agent's opening greeting
- Document your data flow: caller audio → LiveKit → STT → LLM → TTS → caller. Each third-party service in that chain acts as a data processor under GDPR
According to the UK ICO's GDPR guidance, voice recordings of identifiable individuals are personal data — full data processing agreements are required with each service provider in your AI call pipeline.
Cost Estimation: What a LiveKit 3CX Voice AI Deployment Actually Costs
Before committing to a LiveKit deployment over phone forwarding, run your own numbers. The break-even point depends on your call volume and your PSTN forwarding rate.
- Cloud VM (2 nodes): $80–$200/month depending on provider and region (AWS, GCP, Azure)
- Deepgram STT: ~$0.0043/minute — at 500 calls/month of 3 minutes avg = $6.45/month
- GPT-4o-mini LLM: ~$0.15 per 1M tokens — a 3-minute call uses ~500 tokens, cost negligible at SMB volumes
- Cartesia Sonic TTS: ~$0.015/minute — at 500 calls of 3 minutes avg = $22.50/month
- Total infra cost at 500 calls/month: Approx. $110–$230/month all-in
- Phone forwarding equivalent (Option 1): 500 calls × 3 minutes × $0.02/min PSTN = $30/month plus platform subscription ($50–$300)
At low call volumes (under 300 calls/month), phone forwarding almost always wins on cost. LiveKit starts paying off at 500+ calls/month when your PSTN forwarding costs exceed the fixed VM infrastructure cost. According to Statista's enterprise telephony market data, the average SMB handling inbound customer calls processes 400–800 calls per month — putting most businesses right at the LiveKit break-even threshold.
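To run your own numbers, the arithmetic from the list above fits in a few lines. This sketch uses the per-minute rates and monthly ranges quoted in this guide and leaves out LLM token cost, which the guide treats as negligible at these volumes. Swap in your actual rates before drawing conclusions.

```python
from typing import Tuple

AVG_MINUTES = 3                  # average call length used in this guide

STT_PER_MIN = 0.0043             # Deepgram STT, USD per minute
TTS_PER_MIN = 0.015              # Cartesia Sonic TTS, USD per minute
VM_RANGE = (80.0, 200.0)         # two cloud VMs, USD per month
PSTN_PER_MIN = 0.02              # PSTN forwarding, USD per minute
PLATFORM_RANGE = (50.0, 300.0)   # hosted platform subscription, USD per month


def livekit_monthly_cost(calls: int, avg_minutes: float = AVG_MINUTES) -> Tuple[float, float]:
    """Low and high monthly cost for the self-hosted LiveKit stack."""
    usage = calls * avg_minutes * (STT_PER_MIN + TTS_PER_MIN)
    return VM_RANGE[0] + usage, VM_RANGE[1] + usage


def forwarding_monthly_cost(calls: int, avg_minutes: float = AVG_MINUTES) -> Tuple[float, float]:
    """Low and high monthly cost for PSTN forwarding to a hosted platform."""
    pstn = calls * avg_minutes * PSTN_PER_MIN
    return PLATFORM_RANGE[0] + pstn, PLATFORM_RANGE[1] + pstn


for calls in (300, 500, 1000):
    lk_low, lk_high = livekit_monthly_cost(calls)
    fw_low, fw_high = forwarding_monthly_cost(calls)
    print(f"{calls} calls: LiveKit ${lk_low:.2f}-${lk_high:.2f}, "
          f"forwarding ${fw_low:.2f}-${fw_high:.2f}")
```

At 500 calls the sketch gives roughly $109 to $229 for LiveKit against $80 to $330 for forwarding. The ranges overlap, so your VM sizing and platform subscription tier decide which side wins.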
Conclusion
LiveKit 3CX voice AI integration works because LiveKit's SIP gateway accepts IP-authenticated trunks, which credential-based voice platforms do not, and because it hands your agent the call audio inside infrastructure you control.
Sub-300ms latency is a real target, not a marketing number. It requires three things done right together: co-located infrastructure, a fast LLM (GPT-4o-mini or Llama 3 via Groq), and a low-latency TTS provider (Cartesia Sonic or Deepgram Aura). Get any one of those wrong and you're looking at 400–600ms, which callers notice.
Start with a single queue routing calls to LiveKit, measure your p95 latency in production, and tune from there. The SIP plumbing in this guide holds at scale. The tuning is where the engineering work actually lives.

