Resilience Patterns
Overview
Resilience patterns are essential for building robust APIs that can gracefully handle unexpected failures or delays in dependent systems. This section provides guidelines for implementing retries, timeouts, circuit breakers, bulkheads, and fallbacks in API design. These patterns SHOULD be applied where appropriate to improve reliability, scalability, and user experience.
Retries
Retries allow APIs to recover from transient failures.
- APIs SHOULD implement retries for idempotent operations (e.g., GET, PUT, DELETE) where a request that failed transiently is likely to succeed on a subsequent attempt.
- Retries SHOULD NOT be used for non-idempotent operations (e.g., POST) unless specifically designed for retry safety.
- A backoff strategy (e.g., exponential backoff with jitter) SHOULD be used to prevent cascading failures.
- Retries MUST be capped with a maximum retry count to avoid infinite loops or unnecessary resource consumption.
sequenceDiagram
    participant Client
    participant API
    participant Dependency
    Client->>API: Request
    activate API
    API->>Dependency: Initial Request
    activate Dependency
    Dependency--xAPI: Transient Failure
    deactivate Dependency
    Note over API: Apply backoff strategy
    API->>Dependency: Retry #1
    activate Dependency
    Dependency--xAPI: Transient Failure
    deactivate Dependency
    Note over API: Increase backoff with jitter
    API->>Dependency: Retry #2
    activate Dependency
    Dependency-->>API: Success
    deactivate Dependency
    API-->>Client: Response
    deactivate API
    Note over Client,Dependency: If max retries reached, return error
Example
In the following Python example, we retrieve data from a distributed database service that occasionally experiences network issues, and use the Tenacity library to handle retries with exponential backoff and jitter. The @retry decorator configures the retry behaviour to attempt up to 3 times for transient exceptions only, with a randomised exponential backoff whose upper bound starts at 100ms.

import logging

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)


class TransientException(Exception):
    """Represents a temporary failure that may succeed on retry"""


class ServiceException(Exception):
    """Represents a permanent failure or failure after retries"""


# database_service is assumed to be a client object defined elsewhere.
# If all 3 attempts fail, Tenacity raises tenacity.RetryError.
@retry(
    retry=retry_if_exception_type(TransientException),  # retry transient failures only
    stop=stop_after_attempt(3),  # 3 attempts in total
    wait=wait_random_exponential(multiplier=0.1),  # random wait, upper bound 100ms then 200ms
    before_sleep=before_sleep_log(logger, logging.WARNING),  # log before each retry
)
def get_customer_data(customer_id):
    """
    Retrieve customer data with automatic retry handling for transient failures.

    This function will retry up to 2 times (3 attempts total) when transient
    exceptions occur, using exponential backoff with jitter to prevent
    overwhelming the database.
    """
    try:
        return database_service.get_customer(customer_id)
    except TransientException:
        # Will be automatically retried by Tenacity
        raise
    except Exception as e:
        # Convert unexpected exceptions to ServiceException (not retried)
        raise ServiceException(f"Failed to retrieve customer data: {e}") from e
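To make the retry guidance above concrete without a library, the following is a minimal standard-library sketch of a capped retry loop with exponential backoff and full jitter. The names `TransientError`, `backoff_delay`, and `call_with_retries`, as well as the 100ms base and 2-second cap, are assumptions chosen for illustration, not part of any library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=0.1, cap=2.0, rng=random.random):
    """Full-jitter delay in seconds before retry number `attempt` (1-based)."""
    # The ceiling doubles on every attempt but never exceeds `cap`
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    # Pick a random point below the ceiling so clients do not retry in lockstep
    return rng() * ceiling


def call_with_retries(operation, max_attempts=3, sleep=time.sleep):
    """Call `operation`, allowing at most `max_attempts` attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            sleep(backoff_delay(attempt))
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised deterministically; production code would normally leave them at their defaults.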
Timeouts
Timeouts prevent operations from hanging indefinitely.
- APIs SHOULD define timeouts for all external calls to dependent systems or services.
- Timeout values SHOULD be carefully chosen based on the performance characteristics of the dependent system and the API’s SLA requirements.
- APIs SHOULD NOT rely on default or unspecified timeout settings, as these can vary widely across libraries and tools.
- A timeout SHOULD trigger a fallback mechanism or propagate an appropriate error to the client.
sequenceDiagram
    participant Client
    participant API
    participant Dependency
    Client->>API: Request
    activate API
    API->>Dependency: Send Request
    activate Dependency
    Note over API: Start Timeout Timer
    alt Response within timeout period
        Dependency-->>API: Timely Response
        API-->>Client: Success Response
    else Timeout occurs
        Note over API: Timeout Threshold Exceeded
        API--xDependency: Cancel Request (if possible)
        deactivate Dependency
        API-->>Client: Timeout Error Response
    end
    deactivate API
Example
In this Python example, we set a 5-second timeout for the request to an external API using the Requests library. If the server takes longer than this to accept the connection or to send data, a Timeout exception is raised, allowing us to handle it gracefully with a fallback mechanism or propagate an appropriate error to the client.

import requests
from requests.exceptions import Timeout, RequestException

try:
    # Set a 5-second timeout for the request
    response = requests.get('https://api.example.com/data', timeout=5)
    data = response.json()
except Timeout:
    print("The request timed out")
except RequestException as e:
    print(f"Request error: {e}")
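A socket-level timeout limits how long a single connect or read may stall, but it does not by itself bound the total duration of a call. Where an overall deadline is needed, one option is `asyncio.wait_for`, sketched below. `UpstreamTimeout`, `call_with_deadline`, and the two stand-in dependencies are hypothetical names used only for this illustration.

```python
import asyncio


class UpstreamTimeout(Exception):
    """Hypothetical error that an API layer could map to a timeout response."""


async def call_with_deadline(coro_factory, deadline_seconds):
    """Await the dependency call, cancelling it once the deadline passes."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout=deadline_seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending call at this point
        raise UpstreamTimeout(f"no response within {deadline_seconds}s") from None


async def slow_dependency():
    """Stand-in for a dependency that responds too slowly."""
    await asyncio.sleep(1.0)
    return {"status": "late"}


async def fast_dependency():
    """Stand-in for a dependency that responds in time."""
    await asyncio.sleep(0)
    return {"status": "ok"}
```

Calling `asyncio.run(call_with_deadline(slow_dependency, 0.05))` raises `UpstreamTimeout`, which the caller can translate into a fallback or an error response as described in the guidance above.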
Circuit Breakers
Circuit breakers protect systems from cascading failures by halting requests to unhealthy dependencies.
- Circuit breakers SHOULD be implemented for calls to external systems that are critical to the API’s operation.
- APIs SHOULD configure circuit breakers with thresholds for failure rates and recovery intervals.
- When a circuit breaker is open, the API MUST provide a meaningful error response or fallback mechanism.
- Circuit breakers MUST NOT be used for internal components that are highly reliable and tightly coupled, as they introduce unnecessary complexity.
Flowchart Diagram
flowchart TD
    CLOSED((CLOSED)) -->|Failure threshold exceeded| OPEN((OPEN))
    OPEN -->|Delay| HALF((HALF OPEN))
    HALF -->|Success| CLOSED
    HALF -->|Failure| OPEN
    classDef closed fill:#59b259,stroke:#004d00,color:#fff
    classDef open fill:#ff6666,stroke:#800000,color:#fff
    classDef half fill:#ffcc00,stroke:#cc8800,color:#000
    class CLOSED closed
    class OPEN open
    class HALF half
Sequence Diagram
sequenceDiagram
    participant Client
    participant API with Circuit Breaker
    participant Dependency
    Note over API with Circuit Breaker: Circuit State: CLOSED
    Client->>API with Circuit Breaker: Request 1
    activate API with Circuit Breaker
    API with Circuit Breaker->>Dependency: Forward Request
    activate Dependency
    Dependency-->>API with Circuit Breaker: Success Response
    deactivate Dependency
    API with Circuit Breaker-->>Client: Response
    deactivate API with Circuit Breaker
    Client->>API with Circuit Breaker: Request 2
    activate API with Circuit Breaker
    API with Circuit Breaker->>Dependency: Forward Request
    activate Dependency
    Dependency--xAPI with Circuit Breaker: Failure
    deactivate Dependency
    API with Circuit Breaker-->>Client: Error Response
    deactivate API with Circuit Breaker
    Client->>API with Circuit Breaker: Request 3
    activate API with Circuit Breaker
    API with Circuit Breaker->>Dependency: Forward Request
    activate Dependency
    Dependency--xAPI with Circuit Breaker: Failure
    deactivate Dependency
    API with Circuit Breaker-->>Client: Error Response
    deactivate API with Circuit Breaker
    Note over API with Circuit Breaker: Failure threshold exceeded
    Note over API with Circuit Breaker: Circuit State: OPEN
    Client->>API with Circuit Breaker: Request 4
    activate API with Circuit Breaker
    Note over API with Circuit Breaker: Request rejected without calling dependency
    API with Circuit Breaker-->>Client: Circuit Open Error
    deactivate API with Circuit Breaker
    Note over API with Circuit Breaker: After timeout period
    Note over API with Circuit Breaker: Circuit State: HALF-OPEN
    Client->>API with Circuit Breaker: Request 5
    activate API with Circuit Breaker
    API with Circuit Breaker->>Dependency: Test Request
    activate Dependency
    Dependency-->>API with Circuit Breaker: Success Response
    deactivate Dependency
    API with Circuit Breaker-->>Client: Response
    deactivate API with Circuit Breaker
    Note over API with Circuit Breaker: Circuit State: CLOSED
Example
In this Python example, we use the circuitbreaker library to protect calls to a recommendation service. The library counts consecutive failures rather than a failure rate, so the circuit breaker is configured to open after 3 consecutive failures and will stay open for 30 seconds before allowing a test request. When the circuit is open, we fall back to a cache of popular products instead of personalised recommendations.

import logging

from circuitbreaker import circuit, CircuitBreakerError

logger = logging.getLogger(__name__)

# recommendation_service and popular_products_cache are assumed to be
# client objects defined elsewhere.


# Configure the circuit breaker:
# - opens after 3 consecutive failures
# - allows a test request after 30 seconds in the open state
@circuit(failure_threshold=3, recovery_timeout=30)
def get_product_recommendations(user_id):
    """
    Retrieve product recommendations from the recommendation service.

    This function is protected by a circuit breaker that will open after
    3 consecutive failures, preventing further calls to the potentially
    failing service for 30 seconds.
    """
    try:
        return recommendation_service.get_recommendations(user_id)
    except Exception as e:
        logger.error(f"Recommendation service error: {e}")
        raise  # The circuit breaker will count this failure


def get_recommendations_with_fallback(user_id):
    """
    Get product recommendations with circuit breaker protection and fallback.
    """
    try:
        # This call is protected by the circuit breaker decorator
        return get_product_recommendations(user_id)
    except CircuitBreakerError:
        logger.warning(f"Circuit breaker open, using fallback for user {user_id}")
        # Fall back to a simpler recommendation strategy
        return get_fallback_recommendations(user_id)
    except Exception as e:
        logger.error(f"Unexpected error in recommendations: {e}")
        return []


def get_fallback_recommendations(user_id):
    """
    Provides a fallback when the recommendation service is unavailable.

    Returns popular products instead of personalised recommendations.
    """
    return popular_products_cache.get_popular_items(5)
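The CLOSED, OPEN, and HALF OPEN transitions described in this section can also be sketched as a small state machine. The class below is an illustrative, single-threaded sketch, not the implementation of any particular library. `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are assumptions made for this example, and it does not limit the half-open state to a single trial call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is not open

    @property
    def state(self):
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "OPEN":
            # Reject immediately without touching the dependency
            raise CircuitOpenError("dependency unavailable, call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit
            if state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call as `breaker.call(fetch, arg)` and treat `CircuitOpenError` as the signal to return a meaningful error or a fallback.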
Bulkheads
Bulkheads isolate failures to prevent them from impacting the entire system.
- APIs SHOULD use bulkheads to limit the impact of resource exhaustion (e.g., thread pools, connection pools) caused by a specific dependency or client.
- Bulkheads MUST be configured to allocate capacity proportionate to the criticality of the resource or operation.
- APIs MUST NOT allow a single poorly performing client or dependency to consume all available resources, degrading the experience for others.
Examples
In an e-commerce platform, the payment service, user service, and search service can be isolated using bulkheads. If the search service experiences high traffic or failure, the payment and user services remain unaffected, ensuring critical operations like checkout continue to function.
Without Bulkhead Pattern
When the Search service fails, it consumes all available resources in the shared pool, causing the Payment and User services to suffer as well.
graph TD
    Client[Client Requests] --> API[API Gateway]
    API --> Pool[Shared Thread Pool]
    Pool --> Service1[Payment Service]
    Pool --> Service2[User Service]
    Pool --> Service3[Search Service]
    Service3 -. "Failure/Overload" .-> Pool
    Pool -. "Resources Exhausted" .-> Service1
    Pool -. "Resources Exhausted" .-> Service2
    subgraph "Without Bulkhead Pattern"
        Pool
    end
    classDef overloaded fill:#FF6347,stroke:#8B0000;
    class Pool,Service1,Service2,Service3 overloaded
    %% Note: When Search Service fails,
    %% it consumes all available resources in the shared pool,
    %% causing Payment and User services to suffer as well
With Bulkhead Pattern
Even though the Search service has failed, the Payment and User services continue to function, because resources are isolated with bulkheads.
graph TD
    Client[Client Requests] --> API[API Gateway]
    API --> Pool1[Thread Pool 1]
    API --> Pool2[Thread Pool 2]
    API --> Pool3[Thread Pool 3]
    Pool1 --> Service1[Payment Service]
    Pool2 --> Service2[User Service]
    Pool3 --> Service3[Search Service]
    Service3 -. "Failure/Overload" .-> Pool3
    subgraph "Bulkhead Pattern"
        Pool1
        Pool2
        Pool3
    end
    classDef failed fill:#FF6347,stroke:#8B0000;
    class Service3,Pool3 failed
    %% Note: Even though Search Service has failed,
    %% Payment and User services continue to function
    %% because resources are isolated with bulkheads
Python Example
In this Python example, we implement the bulkhead pattern using ThreadPoolExecutor from the concurrent.futures module. The code creates separate thread pools for critical and non-critical operations, preventing failures in one service from consuming resources needed by others.
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial


class ServiceExecutors:
    def __init__(self):
        # Dedicated pool for critical operations
        self.critical_pool = ThreadPoolExecutor(
            max_workers=4,
            thread_name_prefix="critical"
        )
        # Pool for non-critical operations
        self.normal_pool = ThreadPoolExecutor(
            max_workers=10,
            thread_name_prefix="normal"
        )

    async def execute_critical(self, func, *args):
        # get_running_loop() returns the loop that is running this coroutine
        return await asyncio.get_running_loop().run_in_executor(
            self.critical_pool,
            partial(func, *args)
        )

    async def execute_normal(self, func, *args):
        return await asyncio.get_running_loop().run_in_executor(
            self.normal_pool,
            partial(func, *args)
        )
Usage example:
# payment_service and search_service are assumed to be defined elsewhere
executors = ServiceExecutors()


# Payment processing - uses the critical pool (4 threads max)
async def process_payment(payment_id):
    return await executors.execute_critical(payment_service.process, payment_id)


# Product search - uses the normal pool (10 threads max)
async def search_products(query):
    return await executors.execute_normal(search_service.find, query)

# Even if search_products overloads its thread pool,
# payment processing remains unaffected
This implementation demonstrates how:
- Critical operations like payments get dedicated resources (4 threads)
- Non-critical operations like search get separate resources (10 threads)
- If the search service becomes overloaded, payment processing continues normally
- Each service has its failure domain contained within its own thread pool
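A ThreadPoolExecutor queues excess work rather than rejecting it, so a saturated compartment still accumulates a backlog. Where failing fast is preferable, a concurrency-limiting bulkhead is one alternative. The sketch below assumes an asyncio application, and `Bulkhead` and `BulkheadFullError` are hypothetical names introduced for this illustration.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Limits how many calls to one dependency may be in flight at once."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._in_flight = 0

    async def run(self, coro_factory):
        # There is no await between the check and the increment, so another
        # task on the same event loop cannot interleave between them.
        if self._in_flight >= self._max:
            raise BulkheadFullError("compartment at capacity")
        self._in_flight += 1
        try:
            return await coro_factory()
        finally:
            # Always release the slot, even if the call fails or is cancelled
            self._in_flight -= 1
```

One `Bulkhead` instance per dependency or per client keeps a slow consumer from occupying capacity that belongs to the others. A rejected call can be mapped to an error response or a fallback.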
Fallbacks
Fallbacks provide alternative behaviour when a dependency fails.
- APIs MUST implement fallbacks for critical operations where failure would significantly impact the user experience.
- Fallbacks SHOULD provide meaningful degraded functionality (e.g., cached data, placeholder values) rather than returning generic errors.
- APIs MUST NOT use fallbacks that violate business logic, security, or data integrity requirements.
- Where fallbacks are implemented, the API SHOULD log the use of fallback mechanisms for monitoring and debugging purposes.
Example
If a weather API fails, the fallback could provide cached weather data from the last successful response. For a stock price API, a fallback might return the last known price or a default value.
import logging

import requests

logger = logging.getLogger(__name__)

# Simulate a cache (in a real app, this would be persistent storage)
weather_cache = {
    "London": {"temperature": 15, "condition": "Cloudy"},
    "New York": {"temperature": 20, "condition": "Sunny"}
}


def get_weather(city):
    """Get weather data with fallback to cache if API fails"""
    try:
        # Try to get fresh data from the API; params handles URL encoding
        response = requests.get(
            "https://api.weather.example.com/current",
            params={"city": city},
            timeout=2
        )
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError):
        # API call failed or returned invalid JSON: log it and use the fallback
        logger.warning("Weather API failed. Using cached data for %s", city)
        # Return cached data if available, or a default
        if city in weather_cache:
            return weather_cache[city]
        else:
            return {"temperature": None, "condition": "Unknown"}


# Example usage
weather = get_weather("London")
print(f"Weather: {weather['temperature']}°C, {weather['condition']}")
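Cached fallbacks raise two practical questions: how stale may the data be, and how does the client know it is degraded? The sketch below shows one possible approach. It logs every fallback, refuses entries older than a maximum age, and flags the response as degraded. `StaleCache`, `fetch_with_fallback`, and the `degraded` field are assumptions for this example, not an established convention.

```python
import logging
import time

logger = logging.getLogger("fallbacks")


class StaleCache:
    """Hypothetical in-memory cache that remembers when each entry was stored."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)

    def get(self, key):
        """Return the cached value, or None if it is missing or too old."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._max_age:
            return None
        return value


def fetch_with_fallback(key, fetch, cache):
    """Return fresh data when possible, otherwise sufficiently recent cached data."""
    try:
        value = fetch(key)
    except Exception as exc:
        cached = cache.get(key)
        if cached is None:
            raise  # no acceptable fallback: propagate the original error
        logger.warning("fallback used for %r: %s", key, exc)
        return {"data": cached, "degraded": True}
    cache.put(key, value)
    return {"data": value, "degraded": False}
```

Bounding the age of cached data is one way to keep a fallback from returning values that would breach data integrity requirements, such as a stock price that is hours old.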
General Guidance
- Resilience patterns MUST be chosen based on the specific context and requirements of the API.
- Combinations of patterns SHOULD be used to address complex failure scenarios (e.g., retries with timeouts and circuit breakers).
- APIs MUST log and monitor resilience events (e.g., retries, circuit breaker state changes) to enable proactive troubleshooting and optimisation.
- Overuse or misuse of resilience patterns MUST NOT degrade overall performance or introduce unnecessary latency.
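As an illustration of combining patterns without adding unbounded latency, the sketch below gives each retry only the time remaining in an overall deadline. It is a simplified standard-library sketch and omits jitter for brevity. `retry_within_deadline`, `DeadlineExceeded`, and the use of `ConnectionError` as the transient failure type are assumptions for this example.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical error raised when the time or attempt budget is spent."""


def retry_within_deadline(operation, deadline_seconds, max_attempts=3,
                          base_delay=0.1, clock=time.monotonic, sleep=time.sleep):
    """Retry `operation(timeout)` while staying inside one overall deadline."""
    started = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        remaining = deadline_seconds - (clock() - started)
        if remaining <= 0:
            break  # budget spent: stop retrying
        try:
            # Each attempt is given only the time that is still left
            return operation(remaining)
        except ConnectionError as exc:  # stand-in for transient failures
            last_error = exc
        if attempt < max_attempts:
            remaining = deadline_seconds - (clock() - started)
            # Exponential backoff, never sleeping past the deadline
            sleep(max(0.0, min(base_delay * 2 ** (attempt - 1), remaining)))
    raise DeadlineExceeded(
        f"gave up after {deadline_seconds}s or {max_attempts} attempts"
    ) from last_error
```

Here `operation` is expected to accept a timeout in seconds and pass it on to the underlying call, so that the client never waits longer than `deadline_seconds` regardless of how many retries occur.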
Published: 2 May 2025
Last updated: 14 August 2025