
Troubleshooting

This guide covers common issues and troubleshooting procedures for Runlayer. For complex issues or enterprise support, contact our technical team.

Quick Diagnostics

System Health Check

Kubernetes (Helm/EKS)
kubectl get pods -n anysource
kubectl get svc -n anysource
kubectl get events -n anysource --sort-by=.metadata.creationTimestamp

ECS (Terraform)
aws ecs list-services --cluster <cluster>
aws ecs list-tasks --cluster <cluster> --service-name <service>
aws ecs describe-services --cluster <cluster> --services <service>
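
The listings above can be narrowed to what is actually unhealthy. The following is a minimal sketch, not part of Runlayer itself: unhealthy_pods and ecs_service_counts are hypothetical helper names, and the awk filter assumes the default kubectl column layout (NAME READY STATUS RESTARTS AGE).

```shell
# List pods whose STATUS column is not Running or Completed.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  namespace="${1:-anysource}"
  kubectl get pods -n "$namespace" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Print service name, desired task count, and running task count.
# A running count below the desired count points at tasks failing to start.
ecs_service_counts() {
  aws ecs describe-services --cluster "$1" --services "$2" \
    --query 'services[].[serviceName,desiredCount,runningCount]' \
    --output text
}

# Usage:
#   unhealthy_pods                    # defaults to the anysource namespace
#   ecs_service_counts <cluster> <service>
```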

Database Connectivity

# Test PostgreSQL connection
psql "postgresql://<user>:<password>@<rds-endpoint>:5432/<db>"

# Test Redis connection
redis-cli -h <redis-endpoint> -p 6379 ping
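
For scripted checks, the two commands above can be wrapped so that they return a pass/fail exit status. This is a sketch under assumptions: postgres_ok and redis_ok are hypothetical helper names, and the PostgreSQL check assumes the supplied user is allowed to run SELECT 1.

```shell
# Succeeds only if PostgreSQL accepts the connection and runs a trivial query.
# $1 is a connection URI: postgresql://<user>:<password>@<rds-endpoint>:5432/<db>
postgres_ok() {
  # -t: rows only, -A: unaligned output, -c: run one command and exit
  psql "$1" -tAc 'SELECT 1' 2>/dev/null | grep -qx 1
}

# Succeeds only if Redis answers PING with PONG.
# $1 is the Redis host, $2 the port (default 6379).
redis_ok() {
  [ "$(redis-cli -h "$1" -p "${2:-6379}" ping 2>/dev/null)" = "PONG" ]
}

# Usage:
#   postgres_ok "postgresql://<user>:<password>@<rds-endpoint>:5432/<db>" && echo "postgres ok"
#   redis_ok <redis-endpoint> && echo "redis ok"
```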

Common Issues

Service Won’t Start

Symptoms: Services fail to start or immediately exit.
Solutions:
  1. Kubernetes: Inspect pod status and events
    • kubectl describe pod <pod> -n anysource
    • kubectl logs <pod> -n anysource
  2. ECS: Inspect stopped tasks and CloudWatch logs
    • aws ecs describe-tasks --cluster <cluster> --tasks <task-id>
    • Review the task’s CloudWatch log group
  3. Verify configuration values and secrets (Kubernetes Secrets or AWS SSM/Secrets Manager)
  4. Check security groups and network connectivity between services
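
When a container exits immediately, the most useful details are usually the logs of the previous container instance (Kubernetes) or the recorded stop reason (ECS). The helpers below are a sketch of steps 1 and 2; the function names are hypothetical, and the commands use only standard kubectl and AWS CLI options.

```shell
# Show the last lines logged by the previous (crashed) container of a pod.
previous_logs() {
  kubectl logs "$1" -n "${2:-anysource}" --previous --tail=100
}

# List stopped tasks for a service. ECS keeps stopped tasks visible
# only for a limited time, so run this soon after the failure.
stopped_tasks() {
  aws ecs list-tasks --cluster "$1" --service-name "$2" --desired-status STOPPED
}

# Print why a task stopped, plus each container's exit code and reason.
ecs_stop_reason() {
  aws ecs describe-tasks --cluster "$1" --tasks "$2" \
    --query 'tasks[].{stopped:stoppedReason,containers:containers[].{name:name,exit:exitCode,reason:reason}}' \
    --output json
}

# Usage:
#   previous_logs <pod>
#   stopped_tasks <cluster> <service>
#   ecs_stop_reason <cluster> <task-id>
```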

Database Connection Issues

Symptoms: Application cannot connect to database.
Solutions:
  1. Verify the database endpoint and credentials
  2. Check security groups / network policies for database access
  3. Review database logs in AWS (RDS logs / CloudWatch)
  4. Validate environment variables or secret values used by the backend
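
Steps 2 and 4 can be checked from a shell. This is a sketch only: secret_value and port_open are hypothetical helpers, the Secret and key names depend on your release and are not specified in this guide, and the nc options assume a netcat build that supports -z and -w.

```shell
# Decode one key of a Kubernetes Secret so it can be compared with the
# expected endpoint or credentials. Keys containing dots need escaping
# in the jsonpath expression.
# $1 = secret name, $2 = key, $3 = namespace (default anysource)
secret_value() {
  kubectl get secret "$1" -n "${3:-anysource}" -o "jsonpath={.data.$2}" | base64 -d
}

# Succeeds if a TCP connection to host:port opens within 5 seconds.
# A failure here usually points at security groups or network policies
# rather than credentials.
port_open() {
  nc -z -w 5 "$1" "$2"
}

# Usage:
#   secret_value <secret-name> <key>
#   port_open <rds-endpoint> 5432
```

Avoid pasting decoded secret values into tickets or chat.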

Performance Issues

Symptoms: Slow response times or high resource usage.
Solutions:
  1. Check resource utilization (CloudWatch metrics or Kubernetes metrics)
  2. Review database performance and slow query logs
  3. Verify Redis cache connectivity and hit rate
  4. Inspect application logs for errors and timeouts
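
Redis reports cache hits and misses as the keyspace_hits and keyspace_misses counters in its INFO stats output, so the hit rate in step 3 can be computed directly. A minimal sketch follows; redis_hit_rate is a hypothetical helper that reads the INFO output from standard input.

```shell
# Compute the Redis cache hit rate (percent) from `INFO stats` output on stdin.
redis_hit_rate() {
  # redis-cli output uses CRLF line endings; strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}

# Usage:
#   redis-cli -h <redis-endpoint> -p 6379 info stats | redis_hit_rate
#
# Pod-level CPU and memory (requires metrics-server in the cluster):
#   kubectl top pods -n anysource
```

The counters are cumulative since the last Redis restart, so a low value on a freshly started cache is expected.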

ACM Certificate Issues

Symptoms: Terraform fails when looking up a certificate, or ACM certificates remain in PENDING_VALIDATION.
Wildcard lookup fails (default behavior): The ECS module derives a wildcard domain from your domain (e.g., ecs.staging.runlayer.com → *.staging.runlayer.com) and looks up an existing certificate.
  1. Verify a wildcard certificate exists: aws acm list-certificates --query "CertificateSummaryList[?contains(DomainName, '*')]"
  2. If no wildcard certificate exists, either create one manually or set enable_acm_dns_validation = true to have Terraform create it.
DNS validation fails (when creating new certificates):
  1. Confirm the Route53 hosted zone exists in the AWS account running Terraform.
  2. Ensure hosted_zone_name matches the zone name (e.g., staging.runlayer.com).
  3. Re-run terraform apply so the ACM validation CNAME records are created automatically.
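
The checks above can be sketched as shell helpers. The function names are hypothetical, and wildcard_for only mirrors the example in this section (replace the first label with *); the module's actual derivation logic is not shown here and may differ.

```shell
# Derive the wildcard name by replacing the first DNS label with "*".
wildcard_for() {
  printf '*.%s\n' "${1#*.}"
}

# Print each domain's validation status and the DNS record ACM expects.
acm_validation_records() {
  aws acm describe-certificate --certificate-arn "$1" \
    --query 'Certificate.DomainValidationOptions[].[DomainName,ValidationStatus,ResourceRecord.Name,ResourceRecord.Value]' \
    --output text
}

# Confirm the hosted zone is visible to the credentials running Terraform.
hosted_zone_lookup() {
  aws route53 list-hosted-zones-by-name --dns-name "$1" --max-items 1
}

# Usage:
#   wildcard_for ecs.staging.runlayer.com
#   acm_validation_records <certificate-arn>
#   hosted_zone_lookup staging.runlayer.com
```

If the record printed by acm_validation_records is missing from the hosted zone, the certificate stays in PENDING_VALIDATION.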

Authentication Problems

Symptoms: Users cannot log in or access resources.
Solutions:
  1. Verify authentication configuration
  2. Check external identity provider connectivity
  3. Review user permissions and roles
  4. Check JWT token configuration
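
For step 4, it often helps to look at the claims a token actually carries (for example iss, aud, and exp) and compare them with the configured values. The sketch below decodes the payload segment of a JWT without verifying its signature; jwt_payload is a hypothetical helper, and it assumes a base64 command that accepts -d (GNU coreutils and recent macOS).

```shell
# Print the decoded JSON payload (second segment) of a JWT.
# This does NOT verify the signature; it is for inspection only.
jwt_payload() {
  # Take the payload segment and convert base64url to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Usage:
#   jwt_payload "<token>"
```

Treat tokens as credentials: decode them locally rather than pasting them into online tools.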

Log Analysis

Application Logs

Kubernetes (Helm/EKS)
kubectl logs -n anysource deploy/backend -f
kubectl logs -n anysource deploy/webapp -f

ECS (Terraform)
aws logs tail /aws/ecs/<service> --follow
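
Following logs live is useful while reproducing a problem; for a first pass it can be quicker to bound the time window and filter. A minimal sketch follows: count_errors is a hypothetical helper, and the keyword list is an assumption about what the application logs look like.

```shell
# Count log lines on stdin that mention an error, exception, or timeout
# (case-insensitive). Prints 0 when nothing matches.
count_errors() {
  grep -ciE 'error|exception|timeout' || true
}

# Usage (Kubernetes): errors logged by the backend in the last hour
#   kubectl logs -n anysource deploy/backend --since=1h | count_errors
#
# Usage (ECS): let CloudWatch filter server-side; the pattern is case-sensitive
#   aws logs tail /aws/ecs/<service> --since 1h --filter-pattern ERROR
```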

Database Logs

Review database logs in AWS (RDS logs / CloudWatch).

Enterprise Support

For complex issues, performance optimization, or enterprise-level troubleshooting:

Enterprise Technical Support

Contact our technical team for advanced troubleshooting and 24/7 support.

Support Information

When contacting support, please include:
  • System Information: OS, deployment method, AWS region
  • Error Messages: Complete error messages and stack traces
  • Log Files: Relevant application and system logs
  • Configuration: Sanitized configuration files (remove secrets)
  • Steps to Reproduce: Detailed steps that led to the issue
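
For Kubernetes deployments, most of the items above can be gathered in one pass. The following is a sketch, not an official Runlayer tool: collect_support_bundle is a hypothetical helper, and it assumes the backend and webapp deployment names used elsewhere in this guide.

```shell
# Collect pod status, events, and recent logs into a tarball for a support ticket.
# $1 = namespace (default anysource), $2 = output directory name (optional)
collect_support_bundle() {
  namespace="${1:-anysource}"
  out_dir="${2:-runlayer-support-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out_dir" || return 1

  kubectl get pods -n "$namespace" -o wide > "$out_dir/pods.txt" 2>&1
  kubectl get events -n "$namespace" --sort-by=.metadata.creationTimestamp \
    > "$out_dir/events.txt" 2>&1

  # Last hour of logs from each deployment.
  for deploy in backend webapp; do
    kubectl logs -n "$namespace" "deploy/$deploy" --since=1h \
      > "$out_dir/$deploy.log" 2>&1
  done

  # Print the archive path on success.
  tar -czf "$out_dir.tar.gz" "$out_dir" && echo "$out_dir.tar.gz"
}

# Usage:
#   collect_support_bundle
```

Review the collected files and remove any secrets before sending the archive.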

Escalation Process

  1. Level 1: Basic troubleshooting (this guide)
  2. Level 2: Advanced diagnostics (contact support)
  3. Level 3: Engineering escalation (critical issues)

Preventive Measures

  • Regular Monitoring: Set up health checks and alerting
  • Log Rotation: Configure proper log management
  • Resource Monitoring: Monitor CPU, memory, and disk usage
  • Backup Verification: Regularly test backup and restore procedures

Contact our support team for comprehensive monitoring setup and proactive issue prevention.