Published At: May 29, 2026

LiveKit + 3CX: Building a Real-Time Voice AI Agent with Sub-300ms Latency

Updated: May 29, 2026

Most voice AI stacks stall at the same point: getting audio in and out of an existing PBX without ripping it out.

This guide covers the full setup: SIP gateway config, 3CX trunk registration, LLM and TTS selection for low latency, and production monitoring for voice AI agents running on 3CX.

According to the LiveKit engineering blog, SIP-based voice AI deployments using IP-based authentication achieve 30–40% lower call setup latency compared to credential-based alternatives — a meaningful advantage in real-time conversation scenarios where every millisecond of delay is perceptible.

Key Takeaways
  • 3CX uses IP-based SIP auth; Vapi, Retell, and Bland use credential-based auth — they can't connect directly to 3CX. LiveKit's SIP gateway is the bridge
  • Deploy LiveKit in the same cloud region as your LLM to achieve sub-300ms latency
  • Use G.711 ulaw (PCMU) as the codec on the 3CX trunk — no compression, no conversion overhead
  • Silero VAD with min_silence_duration=0.3s is the sweet spot for English business conversations
  • Production failover requires two LiveKit instances behind a load balancer — not just one server
  • GPT-4o-mini or Llama 3 70B via Groq are the LLM choices for sub-150ms time-to-first-token

Why LiveKit Works for 3CX Voice AI Integration

The authentication mismatch between 3CX and AI voice platforms isn't a configuration problem. It's structural. 3CX SIP trunks use IP authentication — the carrier trusts your IP with no username or password. Vapi, Retell, and Bland all run credential-based SIP; they expect a username and password on every inbound call leg. 3CX sends none.

LiveKit's SIP gateway doesn't have this problem. It runs its own SIP endpoint that accepts inbound calls from allowlisted IP addresses. You add your 3CX public IP to the trunk allowlist, and 3CX sends calls to LiveKit over an IP-based trunk the same way it would to any IP-authenticated PSTN carrier. No credential negotiation, no registration failures.

Beyond auth compatibility, LiveKit gives you direct access to the audio stream inside a room. Your AI agent subscribes to the audio track, runs STT, generates a response via LLM, and publishes TTS audio back into the room. The whole pipeline runs in your infrastructure. Call audio never touches a third-party AI platform's managed environment — important for teams with data residency or compliance requirements.

The other advantage is WebRTC under the hood. LiveKit uses WebRTC for internal room transport, giving you adaptive bitrate, packet loss recovery, and jitter buffering without building any of it yourself. For voice AI agents handling real calls, that resilience matters more than it does in lab tests.

LiveKit 3CX Voice AI Architecture: How Calls Route to Your Agent

The call path from a 3CX phone system to a LiveKit voice AI agent works like this:

  1. Customer dials your DID
  2. 3CX receives the call on its configured DID, routes it through a queue or ring group
  3. When the ring timeout expires, 3CX forwards the call out via the LiveKit SIP trunk
  4. LiveKit's SIP gateway receives the call, matches it to a dispatch rule, and places the call into a LiveKit room
  5. Your AI agent, running in that room, picks up the audio stream and handles the conversation
  6. Call outcome is logged to your CRM via webhook at call end

Audio Codec Path

3CX sends G.711 ulaw (PCMU) over the SIP trunk by default. LiveKit accepts PCMU and handles any internal transcoding needed for your AI pipeline. Keep G.711 on the 3CX side: it uses simple per-sample companding rather than frame-based compression, so it adds negligible codec processing latency and works with every STT provider without meaningful conversion overhead. Disable G.729 and G.722 on the LiveKit-facing trunk so that codec negotiation always settles on PCMU.
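To make the codec numbers concrete, the short calculation below works out the bitrate and per-packet payload for G.711. The 8 kHz sample rate and 8 bits per sample are part of the G.711 definition; the 20 ms packetization interval is an assumption (it is a common default, but check what your trunk actually negotiates).

```python
# G.711 (PCMU) arithmetic: 8 kHz sampling, 8 bits per sample, mono.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
PACKET_MS = 20  # assumed packetization interval (ptime); verify on your trunk

# Raw codec bitrate, excluding RTP/UDP/IP headers
bitrate_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000

# Audio payload carried by each RTP packet
samples_per_packet = SAMPLE_RATE_HZ * PACKET_MS // 1000
payload_bytes = samples_per_packet * BITS_PER_SAMPLE // 8

packets_per_second = 1000 // PACKET_MS

print(f"codec bitrate: {bitrate_kbps:.0f} kbps")
print(f"payload per packet: {payload_bytes} bytes")
print(f"packets per second: {packets_per_second}")
```

Under these assumptions each call leg carries 64 kbps of audio in 160-byte payloads, which is small enough that bandwidth is rarely the reason to pick a compressed codec on this trunk.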

AI Agent Components Inside the LiveKit Room

  • STT (speech-to-text): Deepgram Nova-2 or AssemblyAI for real-time transcription with word-level timestamps
  • LLM: GPT-4o-mini, Llama 3 70B via Groq, or Mistral for fast first-token response
  • TTS: Cartesia Sonic (~50ms), Deepgram Aura (~80ms), or ElevenLabs eleven_turbo_v2_5 (~80ms)
  • VAD (voice activity detection): Silero VAD or WebRTC VAD for turn detection and interruption handling

Each component runs in a separate process or container. LiveKit's agent framework (Python SDK or Node.js) handles room connection, track subscription, and the event loop that ties them together.

[Figure: LiveKit voice AI agent component architecture. The 3CX PBX connects over SIP (IP-auth) to a LiveKit room containing STT (Deepgram Nova-2), LLM (GPT-4o-mini / Groq), TTS (Cartesia Sonic), and Silero VAD for turn detection; a webhook logs to the CRM at call end. All components run in your infrastructure.]

How to Connect LiveKit to 3CX: Step-by-Step SIP Trunk Configuration

Complete the LiveKit setup before touching 3CX. You need the SIP endpoint address in hand to configure the 3CX trunk.

Step 1: Deploy LiveKit Server

Deploy LiveKit on a cloud VM in the same region as your LLM API provider. This is the most important infrastructure decision for latency. If your LLM endpoint is in AWS us-east-1, run LiveKit in us-east-1. Cross-region audio adds 40–80ms per hop and makes sub-300ms latency very hard to achieve consistently.

Use LiveKit's official Docker image or Helm chart for Kubernetes. Minimum spec for production:

  • 4 vCPU, 8GB RAM, dedicated network interface with a static public IP
  • Same cloud region as your primary LLM API provider
  • Port 5060 open for SIP (UDP + TCP)
  • Ports 50000–60000 open for RTP media
  • Static public IP — dynamic IPs break SIP trunk registration

sip:
  enabled: true
  port: 5060
  bind_address: 0.0.0.0

Step 2: Create an Inbound SIP Trunk

Use the LiveKit CLI to create an inbound trunk that accepts calls from your 3CX server's public IP:

lk sip inbound-trunk create \
  --name "3cx-main" \
  --allowed-addresses "YOUR_3CX_PUBLIC_IP/32" \
  --numbers "+1XXXXXXXXXX"

Note the trunk ID returned; you'll reference it in the dispatch rule. The --allowed-addresses field is your 3CX server's static public IP. This is the IP-based auth that makes 3CX compatibility work. If your 3CX is behind NAT, use the public IP of your NAT gateway, not the internal 3CX IP.
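As an illustration of what a /32 allowlist entry means in practice, here is a minimal sketch using Python's standard ipaddress module. It is not LiveKit's implementation; the is_allowed helper and the 203.0.113.x addresses (a documentation-only range) are assumptions made for the example.

```python
import ipaddress

def is_allowed(source_ip: str, allowed: list[str]) -> bool:
    """Return True if source_ip falls inside any allowlisted CIDR block."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr, strict=False) for cidr in allowed)

# A /32 matches exactly one IPv4 address: the 3CX server (or its NAT gateway).
allowed_addresses = ["203.0.113.10/32"]

print(is_allowed("203.0.113.10", allowed_addresses))  # the allowlisted host
print(is_allowed("203.0.113.11", allowed_addresses))  # neighbouring address
print(is_allowed("10.0.0.5", allowed_addresses))      # internal 3CX address behind NAT
```

The last check shows why the NAT note matters: the gateway only ever sees the public source address, so allowlisting the internal 3CX IP would never match.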

Step 3: Create a SIP Dispatch Rule

The dispatch rule tells LiveKit what to do when a call arrives on the inbound trunk — which room to place it in and which agent to connect:

lk sip dispatch-rule create \
  --trunk-id "TRUNK_ID_FROM_STEP_2" \
  --room-prefix "call-" \
  --agent-name "voice-ai-agent"

Each inbound call gets its own room with a unique name prefixed with "call-". Your agent process watches for new rooms matching this prefix and joins them automatically. This one-room-per-call model is what makes concurrent call handling clean — there's no shared state between calls.

Step 4: Configure the 3CX SIP Trunk

In 3CX Admin Console, go to SIP Trunks and create a new trunk using these settings:

  • Type: Generic SIP Trunk (IP Based) — not Registration Based
  • Registrar/Server: Your LiveKit server's public IP
  • Port: 5060
  • Authentication: Do not require (IP-based auth only)
  • Codec: G.711 ulaw (PCMU) as primary — disable G.729 and G.722
  • Outbound caller ID: Set $OriginatorCallerID to pass original caller number to LiveKit
  • SIP Options: Enable SIP OPTIONS keepalive if your 3CX version supports it

Step 5: Build Your AI Agent

Use LiveKit's Python agent framework to connect your LLM and TTS pipeline. This is the minimal working agent that handles a 3CX-routed call:

from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, cartesia, silero


async def entrypoint(ctx: JobContext):
    # Join the per-call room and subscribe to audio tracks only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="gpt-4o-mini"),
        # "sonic-english" is a Cartesia model name, so pass it as model, not voice
        tts=cartesia.TTS(model="sonic-english"),
    )
    agent.start(ctx.room)
    await agent.say("Hello, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    # agent_name should match the --agent-name used in the dispatch rule
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, agent_name="voice-ai-agent"))

[Figure: 5-step LiveKit 3CX setup timeline. Step 1: deploy LiveKit (VM + SIP config). Step 2: create inbound SIP trunk. Step 3: create dispatch rule. Step 4: configure 3CX SIP trunk (IP-based). Step 5: build, deploy, and test the AI agent.]

Hitting Sub-300ms: Latency Tuning for LiveKit 3CX Voice AI

Sub-300ms end-to-end latency is achievable, but it requires getting several things right simultaneously. Any single bad decision — wrong region, wrong model, wrong TTS — adds enough latency to push you over. Here's where each millisecond goes and how to control it.

According to Google Research on conversational latency perceptual thresholds, human callers perceive responses above 300ms as noticeably delayed. Below 300ms, the interaction feels natural. This is your hard production target.

LLM Selection: Time-to-First-Token Under 150ms

GPT-4o-mini via OpenAI API delivers time-to-first-token (TTFT) of 80–140ms for most conversational responses. Llama 3 70B via Groq runs even faster at 60–120ms TTFT because Groq's LPU hardware is optimised for inference throughput. Both are good choices for real-time voice AI. Model comparison:

  • GPT-4o-mini: 80–140ms TTFT, reliable, good instruction following — recommended default
  • Llama 3 70B (Groq): 60–120ms TTFT, lower cost per token, strong for structured tasks
  • Mistral 7B: 50–100ms TTFT, lowest cost, trade-off on reasoning quality
  • GPT-4o: 200–500ms TTFT under load — exceeds your latency budget, avoid for voice
  • Claude Opus / Sonnet: 180–400ms TTFT — too slow for sub-300ms voice target in production
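The TTFT ranges above vary with region, load, and prompt size, so it is worth measuring them against your own deployment. The sketch below times the first chunk from any streaming iterator; measure_ttft and fake_llm_stream are hypothetical names for this example, and the fake stream stands in for a real provider's streaming response.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], str]:
    """Consume a token stream; return (ms until first chunk, full text)."""
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    parts = []
    for chunk in stream:
        if ttft_ms is None:
            # First chunk arrived: this is the time-to-first-token
            ttft_ms = (time.perf_counter() - start) * 1000
        parts.append(chunk)
    return ttft_ms, "".join(parts)

def fake_llm_stream(first_token_delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a streaming LLM response (assumption for the demo)."""
    time.sleep(first_token_delay_s)  # simulated model + network delay
    yield "Hello"
    yield ", how can I help?"

ttft_ms, text = measure_ttft(fake_llm_stream())
print(f"TTFT: {ttft_ms:.0f} ms, response: {text!r}")
```

In a real test you would pass the provider's streaming iterator in place of fake_llm_stream and record the result per call, so that p50 and p95 can be compared against the 150ms budget.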

TTS Selection: First Audio Byte Under 100ms

Cartesia Sonic is currently the fastest production TTS for voice AI, with time-to-first-audio-byte around 50ms. Deepgram Aura and ElevenLabs eleven_turbo_v2_5 both run around 80ms. All three are viable for sub-300ms targets. TTS comparison:

  • Cartesia Sonic: ~50ms first audio byte, excellent naturalness, streaming output — top choice
  • Deepgram Aura: ~80ms first audio byte, competitive naturalness, lower cost than Cartesia
  • ElevenLabs eleven_turbo_v2_5: ~80ms first audio byte, best voice naturalness for brand personas
  • ElevenLabs standard: 250–400ms — exceeds budget, do not use for voice AI
  • Google WaveNet: 100–150ms — borderline, works only if your LLM is very fast

VAD Configuration: Silero Sweet Spot Settings

Your VAD settings directly control how fast the agent responds after the caller stops speaking. These Silero VAD settings work well for production English business calls:

silero.VAD.load(
    min_speech_duration=0.1,       # 100ms minimum speech to trigger
    min_silence_duration=0.3,      # 300ms silence before end-of-turn
    prefix_padding_duration=0.1,   # 100ms pre-speech buffer
)

The min_silence_duration of 300ms is the key setting. Drop it below 200ms and you get false positives where the agent interrupts mid-sentence. Push it above 500ms and responses feel slow. 300ms is the sweet spot for English business conversations. For Arabic or languages with longer pause patterns, try 400–450ms.
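To see why the silence threshold matters, the toy model below walks a sequence of speech/non-speech frames and reports when end-of-turn would fire. It is a simplified illustration of the idea, not Silero's actual algorithm; end_of_turn_time, the 10ms frame size, and the sample utterance are all assumptions for the example.

```python
from typing import Iterable, Optional

def end_of_turn_time(frames: Iterable[bool], frame_ms: int = 10,
                     min_speech_ms: int = 100,
                     min_silence_ms: int = 300) -> Optional[int]:
    """Return the time in ms at which end-of-turn fires, or None if it never does."""
    speech_ms = 0
    silence_ms = 0
    in_turn = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            speech_ms += frame_ms
            silence_ms = 0
            if speech_ms >= min_speech_ms:
                in_turn = True  # enough speech to count as a real turn
        else:
            silence_ms += frame_ms
            if not in_turn:
                speech_ms = 0  # blip too short to count as speech; reset
            elif silence_ms >= min_silence_ms:
                return (i + 1) * frame_ms
    return None

# 500ms speech, a 200ms mid-sentence pause, 400ms speech, then 600ms silence
frames = [True] * 50 + [False] * 20 + [True] * 40 + [False] * 60

for threshold in (150, 300, 500):
    fired_at = end_of_turn_time(frames, min_silence_ms=threshold)
    print(f"min_silence={threshold}ms -> end-of-turn at {fired_at}ms")
```

With this sample, a 150ms threshold fires inside the mid-sentence pause, 300ms waits for the real end of the utterance, and 500ms adds another 200ms of dead air before the agent can start responding.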

Region Co-location: The Biggest Single Lever

This is the single most impactful latency decision. Run LiveKit, your STT provider, your LLM API, and your TTS provider in the same cloud region. The round-trip from LiveKit to an out-of-region LLM endpoint alone adds 80–120ms. Recommended regions:

  • USA: AWS us-east-1 or GCP us-central1 — widest selection of co-located services
  • UK: AWS eu-west-2 (London) for lowest latency to UK callers
  • Australia: AWS ap-southeast-2 (Sydney)
  • GCC / UAE: AWS me-south-1 (Bahrain) — or AWS eu-west-1 (Ireland) as nearest fallback if UAE region lacks your LLM provider

[Figure: Latency budget breakdown for the sub-300ms target. STT (Deepgram) ~50ms, VAD delay ~30ms, LLM (GPT-4o-mini) ~120ms, TTS (Cartesia) ~50ms, network overhead ~30ms. Total: ~280ms.]
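The budget figures quoted for this section (STT ~50ms, VAD ~30ms, LLM ~120ms, TTS ~50ms, network ~30ms) can be checked with a few lines of arithmetic. The sketch below also adds the low end of the 80–120ms out-of-region penalty mentioned above to show how little headroom there is; the dictionary keys and function name are illustrative only.

```python
# Per-stage latency budget in milliseconds, taken from the figures in this section
BUDGET_MS = {"stt": 50, "vad": 30, "llm": 120, "tts": 50, "network": 30}
TARGET_MS = 300

def total_latency(budget: dict, extra_ms: int = 0) -> int:
    """Sum all pipeline stages plus any extra network penalty."""
    return sum(budget.values()) + extra_ms

co_located = total_latency(BUDGET_MS)
cross_region = total_latency(BUDGET_MS, extra_ms=80)  # low end of the 80-120ms range

for label, value in (("co-located", co_located), ("cross-region LLM", cross_region)):
    status = "within" if value <= TARGET_MS else "over"
    print(f"{label}: {value}ms ({status} the {TARGET_MS}ms target)")
```

The co-located total leaves only about 20ms of slack, so a single out-of-region hop is enough to miss the target.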

Production Checklist: Monitoring, DTMF, and Failover

A working LiveKit 3CX voice AI demo and a production system are different things. These are the gaps that cause incidents at 2am.

Call Recording and Transcription

LiveKit supports egress recording via its Egress API. For each call room, start a track composite egress to record both sides. Pipe Deepgram's real-time transcription output to your logging system or CRM via webhook at call end. Store recordings in S3 or GCS with a lifecycle policy. 90-day retention covers most compliance requirements. According to GDPR guidance on call recording, businesses in the UK and EU must obtain caller consent before recording — ensure your AI agent's opening greeting includes a recording disclosure.

DTMF Handling

3CX sends DTMF as RTP telephone-events (RFC 2833, since superseded by RFC 4733) by default, rather than as in-band audio tones. LiveKit surfaces these digits from the SIP participant as a room event. Subscribe to DTMF in your agent to handle keypad inputs:

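import asyncio
from livekit import rtc

@ctx.room.on("sip_dtmf_received")
def on_dtmf(dtmf: rtc.SipDTMF):
    # room.on() handlers must be synchronous, so schedule the transfer as a task.
    # transfer_to_human_agent is your own coroutine, not a LiveKit API.
    if dtmf.digit == "0":
        asyncio.create_task(transfer_to_human_agent(ctx))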
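Single-digit shortcuts like "0 for a human" are the simplest case. For multi-digit input such as an account number, the digits need to be buffered until the caller presses a terminator key. The sketch below is one possible approach in plain Python; DtmfCollector is a hypothetical helper, not part of the LiveKit SDK, and you would call feed() from your own DTMF handler.

```python
from typing import List, Optional

class DtmfCollector:
    """Buffer keypad digits until a terminator key completes the entry."""

    def __init__(self, terminator: str = "#", max_digits: int = 16):
        self.terminator = terminator
        self.max_digits = max_digits
        self._digits: List[str] = []

    def feed(self, digit: str) -> Optional[str]:
        """Add one digit; return the full entry when the terminator arrives."""
        if digit == self.terminator:
            entry = "".join(self._digits)
            self._digits = []  # reset for the next entry
            return entry
        if len(self._digits) < self.max_digits:
            self._digits.append(digit)  # silently ignore overflow digits
        return None

collector = DtmfCollector()
results = [collector.feed(d) for d in "4217#"]
print(results)
```

Only the final call returns a value, so the agent can keep listening normally until a complete entry is available.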

Monitoring: Prometheus and Grafana

LiveKit exports Prometheus metrics. Track these in Grafana for production visibility. Critical metrics to alert on:

  • livekit_room_duration_seconds: average call length — sudden drops indicate audio issues
  • livekit_participant_joined_total: call volume — unexpected spikes may indicate trunk misconfiguration
  • livekit_track_publish_latency_ms: audio pipeline latency — alert on p95 > 400ms
  • STT word error rate via Deepgram dashboard — flag anything above 8% for your primary language
  • LLM error rate via OpenAI / Groq API metrics — 429 rate limit errors cause call failures
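Grafana will normally compute percentiles for you, but it helps to see why the alert is on p95 and not the average. The sketch below uses a nearest-rank percentile over a made-up set of latency samples; the sample values and the percentile helper are assumptions for illustration, and only the 400ms threshold comes from the list above.

```python
import math

def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical per-turn latencies in ms; two slow outliers among 20 samples
latencies_ms = [210, 230, 250, 240, 260, 220, 480, 235, 245, 255,
                225, 265, 270, 215, 238, 242, 252, 228, 232, 520]

ALERT_P95_MS = 400
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
should_alert = p95 > ALERT_P95_MS

print(f"p50={p50}ms p95={p95}ms alert={should_alert}")
```

Here the median looks healthy while the p95 breaches the threshold, which is exactly the pattern callers experience as occasional awkward pauses.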

Failover Architecture

Run two LiveKit instances behind an AWS NLB or GCP Network Load Balancer. Configure your 3CX SIP trunk with both IPs in the allowed addresses list and use SRV records or a static failover IP for the registrar. If the primary LiveKit instance goes down, 3CX retries to the secondary within the SIP retry timeout (typically 3–5 seconds). For AI agent failover, run multiple agent worker processes and let LiveKit's dispatch mechanism distribute rooms across available workers.

  • Minimum production setup: 2 LiveKit nodes behind NLB, 2 agent workers per node
  • 3CX trunk config: Both LiveKit IPs in allowed-addresses, SRV record pointing to NLB
  • Agent recovery: Workers auto-restart with systemd or Kubernetes deployment restart policy
  • Database failover: Use managed Redis (ElastiCache / Cloud Memorystore) for dispatch state — avoid single-node Redis

[Figure: LiveKit production failover architecture. The 3CX PBX SIP trunk reaches a load balancer (AWS NLB / GCP) in front of LiveKit Node 1 (primary, us-east-1) and LiveKit Node 2 (standby, auto-failover). Each node runs two AI workers (STT + LLM + TTS), and managed Redis holds dispatch state. 3CX retries to Node 2 within 3–5s if Node 1 fails.]

LLM + TTS Comparison: Latency vs Quality for Voice AI

Provider               First Token/Audio   Voice Quality   Cost            Recommended
Cartesia Sonic (TTS)   ~50ms               Excellent       $0.015/min      ✅ Top pick
GPT-4o-mini (LLM)      80–140ms            N/A             $0.15/1M tok    ✅ Top pick
ElevenLabs Standard    250–400ms           Best            $0.03/min       ❌ Too slow
Groq / Llama 3 70B     60–120ms            N/A             $0.07/1M tok    ✅ Alt LLM

Costs are approximate as of 2025. Verify with providers before production budget planning.

Security and Compliance Considerations

A LiveKit 3CX deployment handles real call audio. Security isn't optional. These are the non-negotiable baseline controls for any production deployment handling business calls.

SIP Security

  • Restrict the LiveKit SIP port (5060) to your 3CX server's public IP only; never open it to all sources (0.0.0.0/0)
  • Enable SIP TLS (SIPS) if your 3CX version and carrier both support it
  • Rotate LiveKit API keys on a quarterly schedule
  • Use a firewall rule to block all SIP traffic from unexpected IPs — SIP brute-force attacks are common

Data Residency and GDPR

For UK and EU deployments subject to GDPR, and GCC deployments subject to the UAE PDPL or Saudi PDPL:

  • Run LiveKit, STT, and storage entirely within the required geographic region
  • Use Deepgram's self-hosted deployment if your compliance policy prohibits sending audio to third-party APIs
  • Implement caller consent recording disclosures in the AI agent's opening greeting
  • Document your data flow: caller audio → LiveKit → STT → LLM → TTS → caller — each hop is a data processor under GDPR

According to the UK ICO's GDPR guidance, voice recordings of identifiable individuals are personal data — full data processing agreements are required with each service provider in your AI call pipeline.

Cost Estimation: What a LiveKit 3CX Voice AI Deployment Actually Costs

Before committing to a LiveKit deployment over phone forwarding, run your own numbers. The break-even point depends on your call volume and your PSTN forwarding rate.

  • Cloud VM (2 nodes): $80–$200/month depending on provider and region (AWS, GCP, Azure)
  • Deepgram STT: ~$0.0043/minute — at 500 calls/month of 3 minutes avg = $6.45/month
  • GPT-4o-mini LLM: ~$0.15 per 1M tokens — a 3-minute call uses ~500 tokens, cost negligible at SMB volumes
  • Cartesia Sonic TTS: ~$0.015/minute — at 500 calls of 3 minutes avg = $22.50/month
  • Total infra cost at 500 calls/month: approx. $110–$230/month all-in (VMs plus STT and TTS usage)
  • Phone forwarding equivalent (Option 1): 500 calls × 3 minutes × $0.02/min PSTN = $30/month plus platform subscription ($50–$300)

At low call volumes (under 300 calls/month), phone forwarding almost always wins on cost. LiveKit starts paying off at 500+ calls/month when your PSTN forwarding costs exceed the fixed VM infrastructure cost. According to Statista's enterprise telephony market data, the average SMB handling inbound customer calls processes 400–800 calls per month — putting most businesses right at the LiveKit break-even threshold.
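To run your own numbers, the sketch below reproduces the arithmetic from the list above using the quoted per-minute rates. All prices are the approximate figures from this section, LLM token cost is left out because the text treats it as negligible, and the function names are illustrative.

```python
def livekit_monthly_cost(calls: int, avg_minutes: float, vm_cost: float,
                         stt_per_min: float = 0.0043,
                         tts_per_min: float = 0.015) -> float:
    """Fixed VM cost plus per-minute STT and TTS usage (LLM tokens ignored)."""
    minutes = calls * avg_minutes
    return vm_cost + minutes * stt_per_min + minutes * tts_per_min

def forwarding_monthly_cost(calls: int, avg_minutes: float, subscription: float,
                            pstn_per_min: float = 0.02) -> float:
    """PSTN forwarding minutes plus the voice platform subscription."""
    return calls * avg_minutes * pstn_per_min + subscription

CALLS, AVG_MIN = 500, 3

lk_low = livekit_monthly_cost(CALLS, AVG_MIN, vm_cost=80)
lk_high = livekit_monthly_cost(CALLS, AVG_MIN, vm_cost=200)
fw_low = forwarding_monthly_cost(CALLS, AVG_MIN, subscription=50)
fw_high = forwarding_monthly_cost(CALLS, AVG_MIN, subscription=300)

print(f"LiveKit:    ${lk_low:.2f} to ${lk_high:.2f} per month")
print(f"Forwarding: ${fw_low:.2f} to ${fw_high:.2f} per month")
```

At 500 three-minute calls this gives roughly $109 to $229 for LiveKit against $80 to $330 for forwarding, so the comparison hinges mostly on the VM tier you need and the platform subscription you would otherwise pay. Change CALLS and the rates to match your own contracts.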

Conclusion

LiveKit 3CX voice AI integration works because LiveKit provides a modern developer-friendly SIP bridge optimized for real-time AI voice workloads.

Sub-300ms latency is a real target, not a marketing number. It requires three things done right together: co-located infrastructure, a fast LLM (GPT-4o-mini or Llama 3 via Groq), and a low-latency TTS provider (Cartesia Sonic or Deepgram Aura). Get any one of those wrong and you're looking at 400–600ms, which callers notice.

Start with a single queue routing calls to LiveKit, measure your p95 latency in production, and tune from there. The SIP plumbing in this guide holds at scale. The tuning is where the engineering work actually lives.

Written by Krunal Shah, Third Rock Techkno

Passionate about crafting scalable tech for EdTech, FinTech & HealthTech. Driving digital growth through Web, App & AI solutions with a focus on innovation, impact, and lasting partnerships.


Frequently Asked Questions

Deploy LiveKit with its SIP gateway enabled and create an inbound trunk using the LiveKit CLI with your 3CX server's public IP in the allowed-addresses field. In 3CX Admin, create a Generic SIP trunk (IP Based) pointing to your LiveKit server's IP on port 5060. Set the destination on your queue's no-answer routing to forward calls through this trunk. LiveKit receives the call, matches it to a dispatch rule, and routes it into a room where your AI agent runs.

Vapi and Retell use credential-based SIP authentication: they expect a username and password on every inbound call. 3CX SIP trunks use IP-based authentication and don't send credentials. These two auth models are incompatible at the protocol level. LiveKit's SIP gateway supports IP-based auth natively, making it the correct bridge between 3CX and AI voice platforms.

Sub-300ms end-to-end latency is achievable with co-located infrastructure, GPT-4o-mini or Llama 3 70B via Groq as the LLM, and Cartesia Sonic or Deepgram Aura as TTS. Most production deployments land between 220–350ms at p50. The p95 figure matters more for call quality: keep it under 450ms and callers won't notice the AI response delay as unnatural.

Use G.711 ulaw (PCMU) as the primary codec on the 3CX SIP trunk. It's uncompressed, adds no codec processing latency, and works with every STT provider without conversion. Disable G.729 on the LiveKit-facing trunk: it reduces bandwidth but adds 10–20ms of compression latency and introduces artefacts that lower STT accuracy.

A 4 vCPU / 8GB RAM VM handles 30–50 concurrent calls comfortably, depending on the audio processing load of your AI pipeline. If each agent process is CPU-bound (running local VAD, for example), plan for one worker process per 10–15 concurrent calls and scale horizontally using LiveKit's multi-worker dispatch. Monitor livekit_track_publish_latency_ms; if p95 climbs above 400ms under load, add a worker node.

Yes. LiveKit's Egress API records both sides of the call and stores the output to S3, GCS, or Azure Blob. For GDPR compliance, deploy LiveKit and storage within the EU region, implement consent disclosures in your agent's greeting, and sign data processing agreements with each service provider in your pipeline (Deepgram, OpenAI, Cartesia). The UK ICO classifies voice recordings of identifiable individuals as personal data under UK GDPR.

Deepgram Nova-2 supports Modern Standard Arabic (MSA) with acceptable accuracy. For Gulf dialects (Khaleeji), accuracy drops; consider Whisper large-v3 via a self-hosted deployment for better dialect coverage. For TTS in Arabic, ElevenLabs with a custom voice is the most natural option. Configure your Silero VAD with min_silence_duration=0.4s for Arabic, since Arabic speakers tend to use longer pauses between sentences than English speakers.
