Resilience Patterns
Resilience patterns are essential for building robust APIs that can gracefully handle unexpected failures or delays in dependent systems. This section provides guidelines for implementing retries, timeouts, circuit breakers, bulkheads, and fallbacks in API design. These patterns SHOULD be applied where appropriate to ensure reliability, scalability, and user experience.
Retries
Retries allow APIs to recover from transient failures.
- APIs SHOULD implement retries for idempotent operations (e.g.,
GET
,PUT
,DELETE
) where a transient failure is likely to succeed on subsequent attempts. - Retries SHOULD NOT be used for non-idempotent operations (e.g.,
POST
) unless specifically designed for retry safety. - A backoff strategy (e.g., exponential backoff with jitter) SHOULD be used to prevent cascading failures.
- Retries MUST be capped with a maximum retry count to avoid infinite loops or unnecessary resource consumption.
sequenceDiagram
participant Client
participant API
participant Dependency
Client->>API: Request
activate API
API->>Dependency: Initial Request
activate Dependency
Dependency--xAPI: Transient Failure
deactivate Dependency
Note over API: Apply backoff strategy
API->>Dependency: Retry #1
activate Dependency
Dependency--xAPI: Transient Failure
deactivate Dependency
Note over API: Increase backoff with jitter
API->>Dependency: Retry #2
activate Dependency
Dependency-->>API: Success
deactivate Dependency
API-->>Client: Response
deactivate API
Note over Client,Dependency: If max retries reached, return error
Example
In the following python example, we show data retrieval from a distributed database service that occasionally experiences network issues and use the Tenacity library to handle retries with exponential backoff and jitter. The @retry
decorator configures the retry behaviour to attempt up to 3 times for transient exceptions only, with exponential backoff starting at 100ms and built-in jitter.
import logging
from tenacity import (
retry,
stop_after_attempt,
wait_random_exponential,
retry_if_exception_type,
before_sleep_log
)
logger = logging.getLogger(__name__)
class TransientException(Exception):
"""Represents a temporary failure that may succeed on retry"""
pass
class ServiceException(Exception):
"""Represents a permanent failure or failure after retries"""
pass
@retry(
retry=retry_if_exception_type(TransientException),
stop=stop_after_attempt(3),
wait=wait_random_exponential(multiplier=0.1, max=2),
before_sleep=before_sleep_log(logger, logging.WARNING),
reraise=True
)
def get_customer_data(customer_id):
"""
Retrieve customer data with automatic retry handling for transient failures.
This function will retry up to 2 times (3 attempts total) when transient
exceptions occur, using exponential backoff with jitter to prevent
overwhelming the database.
"""
try:
return database_service.get_customer(customer_id)
except TransientException:
# Will be automatically retried by Tenacity
raise
except Exception as e:
# Convert unexpected exceptions to ServiceException (not retried)
raise ServiceException(f"Failed to retrieve customer data: {str(e)}") from e
Timeouts
Timeouts prevent operations from hanging indefinitely.
- APIs SHOULD define timeouts for all external calls to dependent systems or services.
- Timeout values SHOULD be carefully chosen based on the performance characteristics of the dependent system and the API's SLA requirements.
- APIs SHOULD NOT rely on default or unspecified timeout settings, as these can vary widely across libraries and tools.
- A timeout SHOULD trigger a fallback mechanism or propagate an appropriate error to the client.
sequenceDiagram
participant Client
participant API
participant Dependency
Client->>API: Request
activate API
API->>Dependency: Send Request
activate Dependency
Note over API: Start Timeout Timer
alt Response within timeout period
Dependency-->>API: Timely Response
API-->>Client: Success Response
else Timeout occurs
Note over API: Timeout Threshold Exceeded
API--xDependency: Cancel Request (if possible)
deactivate Dependency
API-->>Client: Timeout Error Response
end
deactivate API
Example
In this python example, we set a 5-second timeout for the request to an external API using the Requests library. If the request takes longer than this, a Timeout
exception is raised, allowing us to handle it gracefully with a fallback mechanism or propagate an appropriate error to the client.
import requests
from requests.exceptions import Timeout, RequestException
try:
# Set a 5-second timeout for the request
response = requests.get('https://api.example.com/data', timeout=5)
data = response.json()
except Timeout:
print("The request timed out")
except RequestException as e:
print(f"Request error: {e}")
Circuit Breakers
Circuit breakers protect systems from cascading failures by halting requests to unhealthy dependencies.
- Circuit breakers SHOULD be implemented for calls to external systems that are critical to the API's operation.
- APIs SHOULD configure circuit breakers with thresholds for failure rates and recovery intervals.
- When a circuit breaker is open, the API MUST provide a meaningful error response or fallback mechanism.
- Circuit breakers MUST NOT be used for internal components that are highly reliable and tightly coupled, as they introduce unnecessary complexity.
flowchart TD
CLOSED((CLOSED)) -->|Failure threshold exceeded| OPEN((OPEN))
OPEN -->|Delay| HALF((HALF OPEN))
HALF -->|Success| CLOSED
HALF -->|Failure| OPEN
classDef closed fill:#59b259,stroke:#004d00,color:#fff
classDef open fill:#ff6666,stroke:#800000,color:#fff
classDef half fill:#ffcc00,stroke:#cc8800,color:#000
class CLOSED closed
class OPEN open
class HALF half
sequenceDiagram
participant Client
participant API with Circuit Breaker
participant Dependency
Note over API with Circuit Breaker: Circuit State: CLOSED
Client->>API with Circuit Breaker: Request 1
activate API with Circuit Breaker
API with Circuit Breaker->>Dependency: Forward Request
activate Dependency
Dependency-->>API with Circuit Breaker: Success Response
deactivate Dependency
API with Circuit Breaker-->>Client: Response
deactivate API with Circuit Breaker
Client->>API with Circuit Breaker: Request 2
activate API with Circuit Breaker
API with Circuit Breaker->>Dependency: Forward Request
activate Dependency
Dependency--xAPI with Circuit Breaker: Failure
deactivate Dependency
API with Circuit Breaker-->>Client: Error Response
deactivate API with Circuit Breaker
Client->>API with Circuit Breaker: Request 3
activate API with Circuit Breaker
API with Circuit Breaker->>Dependency: Forward Request
activate Dependency
Dependency--xAPI with Circuit Breaker: Failure
deactivate Dependency
API with Circuit Breaker-->>Client: Error Response
deactivate API with Circuit Breaker
Note over API with Circuit Breaker: Failure threshold exceeded
Note over API with Circuit Breaker: Circuit State: OPEN
Client->>API with Circuit Breaker: Request 4
activate API with Circuit Breaker
Note over API with Circuit Breaker: Request rejected without calling dependency
API with Circuit Breaker-->>Client: Circuit Open Error
deactivate API with Circuit Breaker
Note over API with Circuit Breaker: After timeout period
Note over API with Circuit Breaker: Circuit State: HALF-OPEN
Client->>API with Circuit Breaker: Request 5
activate API with Circuit Breaker
API with Circuit Breaker->>Dependency: Test Request
activate Dependency
Dependency-->>API with Circuit Breaker: Success Response
deactivate Dependency
API with Circuit Breaker-->>Client: Response
deactivate API with Circuit Breaker
Note over API with Circuit Breaker: Circuit State: CLOSED
Example
In this python example, we use the circuitbreaker library to protect calls to a recommendation service. The circuit breaker is configured to open after 3 failures out of 5 attempts (60% failure rate) and will stay open for 30 seconds before allowing a test request. When the circuit is open, we fallback to a cache of popular products instead of personalised recommendations.
import logging
from circuitbreaker import circuit, CircuitBreakerError
logger = logging.getLogger(__name__)
# Configure the circuit breaker:
# - fails when 3 out of 5 attempts fail (60% failure rate)
# - resets after 30 seconds in open state
@circuit(failure_threshold=3, recovery_timeout=30, expected_exception=Exception)
def get_product_recommendations(user_id):
"""
Retrieve product recommendations from the recommendation service.
This function is protected by a circuit breaker that will open after
3 failures out of 5 attempts, preventing further calls to the potentially
failing service for 30 seconds.
"""
try:
return recommendation_service.get_recommendations(user_id)
except Exception as e:
logger.error(f"Recommendation service error: {str(e)}")
raise # The circuit breaker will catch this
def get_recommendations_with_fallback(user_id):
"""
Get product recommendations with circuit breaker protection and fallback.
"""
try:
# This call is protected by the circuit breaker decorator
return get_product_recommendations(user_id)
except CircuitBreakerError:
logger.warning(f"Circuit breaker open, using fallback for user {user_id}")
# Fallback to a simpler recommendation strategy
return get_fallback_recommendations(user_id)
except Exception as e:
logger.error(f"Unexpected error in recommendations: {str(e)}")
return []
def get_fallback_recommendations(user_id):
"""
Provides a fallback when the recommendation service is unavailable.
Returns popular products instead of personalised recommendations.
"""
return popular_products_cache.get_popular_items(5)
Bulkheads
Bulkheads isolate failures to prevent them from impacting the entire system.
- APIs SHOULD use bulkheads to limit the impact of resource exhaustion (e.g., thread pools, connection pools) caused by a specific dependency or client.
- Bulkheads MUST be configured to allocate capacity proportionate to the criticality of the resource or operation.
- APIs MUST NOT allow a single poorly performing client or dependency to consume all available resources, degrading the experience for others.
Examples
In an e-commerce platform, the payment service, user service, and search service can be isolated using bulkheads. If the search service experiences high traffic or failure, the payment and user services remain unaffected, ensuring critical operations like checkout continue to function.
Without Bulkhead Pattern
When Search
service fails, it consumes all available resources in the shared pool, causing Payment
and User
services to suffer as well.
graph TD
Client[Client Requests] --> API[API Gateway]
API --> Pool[Shared Thread Pool]
Pool --> Service1[Payment Service]
Pool --> Service2[User Service]
Pool --> Service3[Search Service]
Service3 -. "Failure/Overload" .-> Pool
Pool -. "Resources Exhausted" .-> Service1
Pool -. "Resources Exhausted" .-> Service2
subgraph "Without Bulkhead Pattern"
Pool
end
classDef overloaded fill:#FF6347,stroke:#8B0000;
class Pool,Service1,Service2,Service3 overloaded
%% Note: When Search Service fails,
%% it consumes all available resources in the shared pool,
%% causing Payment and User services to suffer as well
With Bulkhead Pattern
Even though Search
service has failed, Payment
and User
services continue to function, because resources are isolated with bulkheads.
graph TD
Client[Client Requests] --> API[API Gateway]
API --> Pool1[Thread Pool 1]
API --> Pool2[Thread Pool 2]
API --> Pool3[Thread Pool 3]
Pool1 --> Service1[Payment Service]
Pool2 --> Service2[User Service]
Pool3 --> Service3[Search Service]
Service3 -. "Failure/Overload" .-> Pool3
subgraph "Bulkhead Pattern"
Pool1
Pool2
Pool3
end
classDef failed fill:#FF6347,stroke:#8B0000;
class Service3,Pool3 failed
%% Note: Even though Search Service has failed,
%% Payment and User services continue to function
%% because resources are isolated with bulkheads
Python Example
In this Python example, we implement the bulkhead pattern using ThreadPoolExecutor
from the concurrent.futures
module. The code creates separate thread pools for critical and non-critical operations, preventing failures in one service from consuming resources needed by others.
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial
class ServiceExecutors:
def __init__(self):
# Dedicated pool for critical operations
self.critical_pool = ThreadPoolExecutor(
max_workers=4,
thread_name_prefix="critical"
)
# Pool for non-critical operations
self.normal_pool = ThreadPoolExecutor(
max_workers=10,
thread_name_prefix="normal"
)
async def execute_critical(self, func, *args):
return await asyncio.get_event_loop().run_in_executor(
self.critical_pool,
partial(func, *args)
)
async def execute_normal(self, func, *args):
return await asyncio.get_event_loop().run_in_executor(
self.normal_pool,
partial(func, *args)
)
Usage example:
executors = ServiceExecutors()
# Payment processing - uses the critical pool (4 threads max)
async def process_payment(payment_id):
return await executors.execute_critical(payment_service.process, payment_id)
# Product search - uses the normal pool (10 threads max)
async def search_products(query):
return await executors.execute_normal(search_service.find, query)
# Even if search_products overloads its thread pool,
# payment processing remains unaffected
This implementation demonstrates how:
- Critical operations like payments get dedicated resources (4 threads)
- Non-critical operations like search get separate resources (10 threads)
- If the search service becomes overloaded, payment processing continues normally
- Each service has its failure domain contained within its own thread pool
Fallbacks
Fallbacks provide alternative behaviour when a dependency fails.
- APIs MUST implement fallbacks for critical operations where failure would significantly impact the user experience.
- Fallbacks SHOULD provide meaningful degraded functionality (e.g., cached data, placeholder values) rather than returning generic errors.
- APIs MUST NOT use fallbacks that violate business logic, security, or data integrity requirements.
- Where fallbacks are implemented, the API SHOULD log the use of fallback mechanisms for monitoring and debugging purposes.
Example
If a weather API fails, the fallback could provide cached weather data from the last successful response. For a stock price API, a fallback might return the last known price or a default value.
import requests
# Simulate a cache (in a real app, this would be persistent storage)
weather_cache = {
"London": {"temperature": 15, "condition": "Cloudy"},
"New York": {"temperature": 20, "condition": "Sunny"}
}
def get_weather(city):
"""Get weather data with fallback to cache if API fails"""
try:
# Try to get fresh data from the API
response = requests.get(
f"https://api.weather.example.com/current?city={city}",
timeout=2
)
response.raise_for_status()
return response.json()
except Exception:
# API call failed, use fallback
print(f"Weather API failed. Using cached data for {city}")
# Return cached data if available, or a default
if city in weather_cache:
return weather_cache[city]
else:
return {"temperature": None, "condition": "Unknown"}
# Example usage
weather = get_weather("London")
print(f"Weather: {weather['temperature']}°C, {weather['condition']}")
General Guidance
- Resilience patterns MUST be chosen based on the specific context and requirements of the API.
- Combinations of patterns SHOULD be used to address complex failure scenarios (e.g., retries with timeouts and circuit breakers).
- APIs MUST log and monitor resilience events (e.g., retries, circuit breaker state changes) to enable proactive troubleshooting and optimisation.
- Overuse or misuse of resilience patterns MUST NOT degrade overall performance or introduce unnecessary latency.