
Troubleshooting runbook

This page documents common failure modes and their likely causes. It is not exhaustive, but it aims to capture the most frequently encountered issues.

If you are unfamiliar with the service you are troubleshooting, see its service page for an overview of its architecture.

Common issues

| # | Failure Mode | User-Facing Symptoms | Internal Symptoms | Likely Cause | Checks / Actions |
|---|---|---|---|---|---|
| 1 | Database permission issues | New users may see no content. Direct API access may return 502/503 errors. | ECS tasks restart repeatedly. Django permission errors appear in logs. | Application database user lacks required permissions, often following migrations or role changes. | Check CloudWatch logs for Django errors. Verify RDS user grants. Review recent schema or feature changes. |
| 2 | Incorrect S3 bucket permissions | Media (images/audio) missing for users. | Media links return access denied errors. API responses otherwise return successfully. | IAM role or bucket policy does not permit access to required S3 objects. | Review S3 bucket and CloudFront policies. Check CloudTrail for denied S3 actions. |
| 3 | DNS resolution failures | Browser displays “website cannot be found”. | dig / nslookup returns NXDOMAIN. No traffic reaches the ALB. | Route 53 hosted zone recreated without updating parent DNS zone delegations. | Confirm NS records in the parent zone match the Route 53 hosted zone. Check recent DNS changes and propagation status. |
| 4 | Redis cache unreachable | Slower responses for NHS login-related APIs. | Errors appear in backend (not CMS) logs. Increased database load observed. | Redis unavailable or network connectivity blocked. | Check Redis cluster health. Verify security groups and routing. Review backend logs for cache connection failures. |
| 5 | TLS certificate expiry | Applications may show missing content. Direct API access shows browser security warnings. | HTTPS failures observed at the ALB. | Expired or invalid TLS certificate attached to the ALB listener. | Check ACM certificate expiry. Validate ALB listener configuration. Confirm renewal status or recent certificate updates. |
| 6 | Missing container images | Users may receive 503 errors if no tasks are running. | ECS tasks continuously restart with image pull errors. | Image tag incorrect or missing from ECR. Parameter Store references a non-existent tag. | Verify image tag in Parameter Store. Confirm image exists in ECR. Rebuild and push image if required. If the image pull succeeds but tasks still fail early, see row 8. |
| 7 | WAF blocking requests | Users receive 403 Forbidden errors. | Requests blocked before reaching the application. WAF logs show rule matches. | Web Application Firewall rules blocking legitimate traffic, possibly due to rule changes or false positives. | Review WAF rules and recent changes. Check WAF metrics and logs to identify which rules are blocking requests. Validate whether blocks are expected or require rule tuning. |
| 8 | ECS task execution role lacks Secrets Manager permissions | Users may receive 503 errors if tasks fail to start (similar to row 6). | ECS tasks fail during startup. Errors indicate inability to retrieve secrets. | ECS task execution role does not have sufficient permissions to read secrets from Secrets Manager. | Check ECS task execution role IAM policy. Verify required secretsmanager:GetSecretValue permissions. Review task definition secret references. If tasks fail before pulling images, also review row 6 for image-related failures. |
| 9 | Security group or network misconfiguration | Users may see timeouts or 5xx errors. | ALB targets marked unhealthy. Connection timeouts in logs. | Security group or routing changes blocking traffic between ALB, ECS, or downstream services. | Verify security group rules between ALB and ECS. Check outbound rules from ECS to RDS, Redis, and external services. Review recent network changes. |
| 10 | ALB health check failures | Intermittent 503 errors. | Targets marked unhealthy in ALB target group. ECS tasks may be running but not receiving traffic. | Health check path, port, or success codes misconfigured, or application too slow to respond. | Check target group health status. Verify health check path and expected response codes. Review application startup and response times. Check that Gunicorn/Uvicorn is running inside the containers and is not being terminated. |
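
For row 3, the delegation check comes down to comparing two sets of name servers. The following is a minimal sketch, assuming you have already copied the NS answers for the parent zone and for the Route 53 hosted zone (for example from dig output); the function names and the sample name servers are hypothetical.

```python
def normalise_ns(names):
    """Lower-case each name server and drop the trailing dot so the sets compare cleanly."""
    return {name.strip().lower().rstrip(".") for name in names if name.strip()}


def delegation_diff(parent_ns, hosted_zone_ns):
    """Return (missing_from_parent, stale_in_parent) as sorted lists."""
    parent = normalise_ns(parent_ns)
    zone = normalise_ns(hosted_zone_ns)
    return sorted(zone - parent), sorted(parent - zone)


# Hypothetical values: replace with the NS records you actually observe.
parent_ns = ["ns-1.example.net.", "ns-2.example.org."]
hosted_zone_ns = ["NS-1.example.net", "ns-3.example.com"]

missing, stale = delegation_diff(parent_ns, hosted_zone_ns)
print("In the hosted zone but missing from the parent zone:", missing)
print("In the parent zone but not in the hosted zone:", stale)
```

If either list is non-empty, the parent zone is probably still delegating to the name servers of an older hosted zone.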
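
For rows 4 and 9, a quick way to separate "service down" from "network path blocked" is a plain TCP connection attempt. This is a minimal sketch using only the Python standard library; the helper name is hypothetical, and it only tells you something useful when run from inside the same network as the ECS tasks (for example from a running container), not from a laptop.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


# Example call (hypothetical endpoint; 6379 is the default Redis port):
# can_connect("my-redis.example.internal", 6379)
```

A timeout usually points at security groups or routing, whereas an immediate refusal suggests the port is reachable but nothing is listening on it.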
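
For row 5, the expiry date can also be read from the certificate the endpoint actually serves, which is a useful cross-check against what ACM reports. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the endpoint is reachable from where you run it.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter time, e.g. 'Jul 16 12:00:00 2026 GMT'."""
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter field of the certificate served by host:port.

    Verification stays enabled, so an already expired certificate raises
    ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A typical call would be `days_until_expiry(fetch_not_after("api.example.org"))`, where the hostname is a placeholder. A verification error from `fetch_not_after` is itself a strong hint that the certificate is expired or otherwise invalid.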
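
For row 8, a first pass over the task execution role's policy can be automated. The sketch below assumes you have exported the policy document as JSON (for example by copying it from the IAM console); the helper names and the sample policy are hypothetical. It is deliberately simplified and is not a substitute for the IAM policy simulator.

```python
import json
from fnmatch import fnmatchcase


def as_list(value):
    """IAM allows a single value or a list in most places; always work with a list."""
    return value if isinstance(value, list) else [value]


def allows_action(policy_document, action="secretsmanager:GetSecretValue"):
    """True if any Allow statement lists an Action pattern matching `action`.

    Simplification: Deny statements, NotAction, Resource and Condition
    elements are ignored, so a True result means "probably fine, check the
    resource scope" and not "access is guaranteed".
    """
    policy = json.loads(policy_document) if isinstance(policy_document, str) else policy_document
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in as_list(statement.get("Action", [])):
            # IAM action names are case-insensitive and may contain * or ? wildcards.
            if fnmatchcase(action.lower(), pattern.lower()):
                return True
    return False


# Hypothetical policy that covers image pulls and logging but not secrets.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ecr:GetAuthorizationToken", "logs:PutLogEvents"], "Resource": "*"}
    ],
}
print(allows_action(example_policy))  # False
```

A False result for every policy attached to the execution role is consistent with the startup failures described in row 8.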
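
For row 10, a common misconfiguration is a health check path that responds, but with a status code outside the target group's success codes (for example a 302 redirect to a login page when only 200 is accepted). The sketch below evaluates a status code against a matcher string in the forms the ALB accepts: a single value, a comma-separated list, or a range. The function name is hypothetical.

```python
def matches_success_codes(status, matcher):
    """True if an HTTP status satisfies a matcher such as '200', '200,302' or '200-299'."""
    for part in matcher.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= status <= int(high):
                return True
        elif status == int(part):
            return True
    return False


# A redirecting health endpoint fails a matcher of "200" but passes "200,302".
print(matches_success_codes(302, "200"))
print(matches_success_codes(302, "200,302"))
```

Request the health check path from inside the network, note the status code you get, and compare it with the matcher configured on the target group.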
This page was last reviewed on 29 January 2026. It needs to be reviewed again on 16 July 2026.