High Availability Setup

For mission-critical communication workflows, Exotel supports high availability (HA) configurations that provide redundancy, automatic failover, and multi-region resilience. This guide covers the platform's built-in redundancy, how to configure your integration for failover, a recommended server-side architecture, and disaster recovery and monitoring targets.

Note: High availability features are available on Enterprise plans. Contact your account manager to discuss HA requirements for your deployment.

Exotel Platform Availability

Built-In Redundancy

Exotel's platform includes the following built-in redundancy features:

Component	Redundancy	Details
API Gateway	Active-active cluster	Multiple API servers behind load balancers
Call Processing	Active-active	Distributed call processing across multiple servers
Database	Primary-replica replication	Automatic failover to replica on primary failure
Storage	Multi-AZ replication	Call recordings replicated across availability zones
Network	Multi-carrier	Multiple telecom carrier connections for voice/SMS

Platform SLA

Plan	Uptime SLA	Monthly Downtime Budget
Starter	Best effort	Not guaranteed
Growth	99.5%	~3.6 hours/month
Enterprise	99.9%	~43 minutes/month
Enterprise (custom)	99.95%+	~22 minutes/month or less
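
The downtime budgets in the table follow directly from the uptime percentage. The sketch below reproduces the arithmetic, assuming a 30-day month; the function name is illustrative and not part of any Exotel tooling.

```python
def monthly_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Return the downtime allowed per month, in minutes, for an uptime SLA."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - uptime_percent / 100)


# SLA values taken from the table above.
for plan, sla in [("Growth", 99.5), ("Enterprise", 99.9), ("Enterprise (custom)", 99.95)]:
    minutes = monthly_downtime_minutes(sla)
    print(f"{plan}: {sla}% uptime allows about {minutes:.1f} min/month")
```

Use the same function with your own target to see how much downtime your servers can absorb before they become the limiting factor.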

Configuring Your Integration for HA

Webhook Failover

Configure multiple webhook endpoints so that if your primary server is down, Exotel can deliver events to a backup:

Primary + Failover Setup

Configure your primary webhook URL in the Exotel dashboard
Set up a failover webhook URL that points to a different server or region
If the primary endpoint fails (non-200 response or timeout), Exotel retries the primary and, after 2 failed attempts, falls back to the failover URL

Setting	Value
Primary URL	`https://primary.your-server.com/exotel/callback`
Failover URL	`https://backup.your-server.com/exotel/callback`
Failover after	2 failed attempts on primary
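
To rehearse this behaviour against your own endpoints, the delivery order can be modelled in a few lines. This is a simplified model of the settings above, not Exotel's implementation: the `send` callable, the single attempt on the failover URL, and the function name are all assumptions made for illustration.

```python
from typing import Callable, Optional

PRIMARY = "https://primary.your-server.com/exotel/callback"
FAILOVER = "https://backup.your-server.com/exotel/callback"


def deliver_with_failover(
    send: Callable[[str], int],
    primary_url: str,
    failover_url: str,
    primary_attempts: int = 2,
) -> Optional[str]:
    """Try the primary URL `primary_attempts` times, then the failover URL once.

    `send` returns an HTTP status code or raises TimeoutError/ConnectionError.
    Returns the URL that accepted the event, or None if every attempt failed.
    """
    for url in [primary_url] * primary_attempts + [failover_url]:
        try:
            if send(url) == 200:
                return url
        except (TimeoutError, ConnectionError):
            pass  # a timeout counts as a failed attempt
    return None


def primary_is_down(url: str) -> int:
    """Stub sender for the demo: the primary answers 503, the backup answers 200."""
    return 503 if "primary" in url else 200


print(deliver_with_failover(primary_is_down, PRIMARY, FAILOVER))
```

Swapping the stub for a real HTTP call lets you check that the backup endpoint accepts the same payload as the primary.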

Multiple Carrier Routing

Exotel routes calls through multiple telecom carriers. In case of a carrier outage:

Exotel automatically detects the carrier failure
Calls are rerouted through an alternate carrier
The switch is transparent -- no action required on your end
Call quality and connectivity are maintained

ExoPhone Redundancy

For critical inbound numbers, maintain backup ExoPhones:

Strategy	Implementation
Multiple ExoPhones	Publish multiple contact numbers; if one goes down, callers use the other
Number forwarding	Configure carrier-level forwarding from one number to another
Geographic redundancy	Use ExoPhones from different regions/circles

Server-Side HA Architecture

Recommended Architecture

                   ┌─────────────────────┐
                   │   Exotel Platform   │
                   │  (Multi-AZ, Multi-  │
                   │      Carrier)       │
                   └──────────┬──────────┘
                              │
                   ┌──────────┴──────────┐
                   │    Load Balancer    │
                   │  (Health-checked)   │
                   └──────────┬──────────┘
                              │
             ┌────────────────┼────────────────┐
             │                │                │
      ┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐
      │  Server 1   │  │  Server 2   │  │  Server 3   │
      │ (Region A)  │  │ (Region A)  │  │ (Region B)  │
      └─────────────┘  └─────────────┘  └─────────────┘

Key Components

Component	Purpose	Recommendation
Load Balancer	Distribute API calls and webhook traffic	Use a health-checked load balancer (e.g., AWS ALB/NLB) with multiple targets
Application Servers	Process API calls and webhook events	Minimum 2 servers in different availability zones
Database	Store call data, CRM integration data	Primary-replica with automatic failover
Queue	Buffer webhook events for processing	Use a durable message queue (e.g., Amazon SQS or RabbitMQ)
DNS	Route traffic to healthy endpoints	Use DNS failover (Route 53, Cloudflare)

Webhook Processing Architecture

Use an event-driven architecture for webhook processing:

Exotel Webhook ──► Load Balancer ──► API Server ──► Message Queue ──► Worker
                                        │                               │
                              HTTP 200 (immediate)             (async processing)

This ensures:

Exotel always receives a 200 response quickly
Event processing happens asynchronously
If a worker fails, the event stays in the queue and is retried
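
The pattern above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: an in-memory `queue.Queue` stands in for a durable broker, the event is assumed to arrive as an already-parsed dictionary, the `Status` field is hypothetical, and the retry cap of 3 is arbitrary.

```python
import queue
import threading
from typing import Callable, Dict

Event = Dict[str, str]


def handle_webhook(params: Event, events: "queue.Queue") -> int:
    """Request-handler body: enqueue the event and acknowledge at once."""
    events.put({"params": params, "attempts": 0})
    return 200  # status code to send back to Exotel


def run_worker(events: "queue.Queue", process: Callable[[Event], None],
               max_attempts: int = 3) -> None:
    """Consume events until a None sentinel arrives; re-queue failed events."""
    while True:
        item = events.get()
        try:
            if item is None:
                return
            try:
                process(item["params"])
            except Exception:
                item["attempts"] += 1
                if item["attempts"] < max_attempts:
                    events.put(item)  # leave it in the queue for another try
        finally:
            events.task_done()


events: "queue.Queue" = queue.Queue()
done = []
worker = threading.Thread(
    target=run_worker, args=(events, lambda p: done.append(p["CallSid"])), daemon=True
)
worker.start()

status = handle_webhook({"CallSid": "abc123", "Status": "completed"}, events)
events.join()     # wait until the queued event has been processed
events.put(None)  # stop the worker
worker.join()
print(status, done)
```

With a real broker, the re-queue step is usually handled by the broker itself (for example through visibility timeouts or negative acknowledgements), so the worker only needs to avoid acknowledging a message it failed to process.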

Disaster Recovery

Recovery Time Objective (RTO)

Component	Target RTO
API access	< 5 minutes (Exotel platform)
Webhook delivery	< 20 minutes (including retries)
Your server failover	Depends on your infrastructure
Call flow recovery	< 1 minute (automatic carrier failover)

Recovery Point Objective (RPO)

Data Type	Target RPO
Call detail records	Zero data loss (synchronous replication)
Call recordings	< 5 minutes (async replication lag)
Webhook events	< 20 minutes (retry window)

DR Checklist

Webhook failover URLs configured -- Backup endpoint in a different region
API client with retry logic -- Exponential backoff with jitter
Reconciliation process -- Periodic API polling to catch missed webhooks
Monitoring and alerting -- Alerts for webhook failures, API errors, and connectivity issues
Runbook documented -- Step-by-step recovery procedures for common failure scenarios
Regular DR testing -- Test failover quarterly
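
For the retry-logic item in the checklist, one common approach is exponential backoff with "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The sketch below is generic and not tied to any Exotel SDK; the base delay, cap, attempt count, and retried exception types are assumptions to tune for your client.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on the listed exceptions with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Wrap each outbound API request in `call_with_retry`, for example `call_with_retry(lambda: client.get_call(call_sid))`, where `client.get_call` is a placeholder for whatever function your integration uses. Retry only errors that are safe to repeat, such as timeouts and connection failures.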

Monitoring for HA

Key Metrics to Monitor

Metric	Healthy Range	Alert Threshold
Webhook success rate	> 99%	< 95%
API response time	< 500ms	> 2000ms
Concurrent call utilization	< 80% of limit	> 90%
Webhook retry rate	< 5%	> 15%
Active calls	Within expected range	Sudden drop or spike
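
The numeric rows of this table translate naturally into alert rules. The sketch below encodes them as data and classifies a reading; the metric keys and the intermediate "warning" state (for values between the healthy range and the alert threshold) are assumptions, and the "Active calls" row is omitted because it has no fixed numeric threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    healthy: float           # boundary of the healthy range
    alert: float             # boundary beyond which an alert fires
    higher_is_better: bool   # True for success rates, False for latency and load


# Values copied from the table above; key names are illustrative.
THRESHOLDS = {
    "webhook_success_rate_pct": Threshold(99.0, 95.0, True),
    "api_response_time_ms": Threshold(500.0, 2000.0, False),
    "concurrent_call_utilization_pct": Threshold(80.0, 90.0, False),
    "webhook_retry_rate_pct": Threshold(5.0, 15.0, False),
}


def classify(metric: str, value: float) -> str:
    """Return 'healthy', 'warning', or 'alert' for a metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_better:
        if value > t.healthy:
            return "healthy"
        if value < t.alert:
            return "alert"
    else:
        if value < t.healthy:
            return "healthy"
        if value > t.alert:
            return "alert"
    return "warning"


print(classify("webhook_success_rate_pct", 97.0))
print(classify("api_response_time_ms", 2500.0))
```

Keeping the thresholds in one table-like structure makes it easy to review them alongside this page when limits change.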

Setting Up Health Monitoring

Use the Heartbeat feature to monitor your endpoint health
Configure Exotel to send periodic health checks to your webhook URL
Set up your own monitoring to track Exotel API availability
Implement alerts for degraded performance
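
For the "set up your own monitoring" step, a small probe that records status and latency is often enough to start with. The sketch below uses only the standard library; the URL is a placeholder rather than a documented Exotel endpoint, and the 2000 ms "slow" cut-off is borrowed from the API response time alert threshold in the metrics table.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

SLOW_MS = 2000.0  # alert threshold for API response time


def http_get_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(
    url: str,
    fetch: Callable[[str], int] = http_get_status,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, object]:
    """Check one URL and report reachability, status code, and latency."""
    start = clock()
    status: Optional[int] = None
    try:
        status = fetch(url)
    except OSError:
        pass  # DNS failures, refused connections, and timeouts all land here
    latency_ms = (clock() - start) * 1000.0
    ok = status is not None and 200 <= status < 300
    return {"url": url, "ok": ok, "status": status,
            "latency_ms": latency_ms, "slow": latency_ms > SLOW_MS}


# Demo with a stubbed fetch so that no network call is made.
result = probe("https://api.example.invalid/health", fetch=lambda url: 200)
print(result["ok"], result["status"])
```

Run the probe on a schedule from outside your primary region so that a regional outage does not also silence the monitor.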

Best Practices

Design for failure -- Assume any component can fail and build accordingly
Use multiple availability zones -- Deploy your servers across at least 2 AZs
Implement idempotent processing -- Handle duplicate webhook events gracefully using CallSid as a key
Queue webhook events -- Never process webhooks synchronously in the request handler
Test failover regularly -- Simulate failures and verify recovery
Monitor proactively -- Set up alerts for degraded metrics before they become outages
Keep runbooks updated -- Document and rehearse recovery procedures
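
Idempotent processing can be sketched as a check against the set of CallSid values already handled. In this illustration an in-memory set stands in for a durable store such as a database table with a unique constraint; the function name and the `Status` field are assumptions. If your integration receives several distinct events for the same call, a composite key would be needed instead.

```python
from typing import Callable, Dict, Set

Event = Dict[str, str]


def process_once(event: Event, seen: Set[str], handler: Callable[[Event], None]) -> bool:
    """Run `handler` only for the first event with a given CallSid.

    Returns True if the event was handled and False if it was a duplicate.
    """
    key = event["CallSid"]
    if key in seen:
        return False
    handler(event)
    seen.add(key)  # record only after success so a failed attempt can be retried
    return True


seen: Set[str] = set()
handled = []
event = {"CallSid": "abc123", "Status": "completed"}

first = process_once(event, seen, handled.append)
second = process_once(event, seen, handled.append)  # duplicate delivery
print(first, second, len(handled))
```

Recording the key after the handler succeeds means a crash mid-processing leads to a retry rather than a silently dropped event; the trade-off is that the handler itself must tolerate being run twice in that case.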

Related Topics

Webhooks Setup -- Webhook configuration and retry logic
Concurrent Calls -- Managing call capacity
Network Requirements -- Ports and protocols for connectivity
Heartbeat -- Endpoint health monitoring
