Troubleshooting Guide
Common issues and solutions for the SeaGit platform. Find quick fixes for deployment failures, networking problems, and infrastructure errors.
Quick Navigation: Use browser search (Ctrl+F or Cmd+F) to find specific error messages or symptoms.
Infrastructure Issues
Cluster Creation Failed
Symptoms:
- Cluster stuck in "Creating" state for >30 minutes
- Error: "Failed to create control plane"
- AWS CloudFormation stack failed
Common Causes:
- Insufficient AWS service quotas (VPCs, EIPs, NAT gateways)
- IAM permissions missing for EKS operations
- Selected region doesn't support EKS
- Subnet CIDR blocks conflict with existing resources
Solutions:
- Check Service Quotas:
  ```
  # In the AWS Console, check:
  #   Service Quotas → Amazon VPC → VPCs per region (need at least 1 available)
  #   Service Quotas → Amazon EC2 → NAT gateways per AZ (need 2 for HA)
  #   Service Quotas → Amazon EC2 → Elastic IPs (need at least 2)
  ```

- Verify IAM Permissions: Ensure the IAM user has `AmazonEKSClusterPolicy` and `AmazonEKSVPCResourceController` attached
- Try a Different Region: Some regions have limited capacity. Try us-east-1, us-west-2, or eu-west-1
- Check Logs: View cluster creation logs in SeaGit for specific error messages
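The quota check can also be scripted. A read-only sketch with the AWS CLI (assumes the CLI is installed and configured with credentials for the target region):

```
# List VPC-related quotas and their current values
aws service-quotas list-service-quotas --service-code vpc \
  --query "Quotas[?contains(QuotaName, 'VPC')].[QuotaName, Value]" \
  --output table

# Count VPCs already in use, to compare against the quota
aws ec2 describe-vpcs --query "length(Vpcs)"
```

Both commands only read state, so they are safe to run against a production account.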
Node Not Ready
Symptoms:
- Nodes show status "NotReady" in cluster view
- Pods remain in "Pending" state indefinitely
- `kubectl get nodes` shows NotReady
Common Causes:
- Network connectivity issues between nodes and control plane
- Kubelet service not running or crashed
- Disk pressure or out of disk space
- Container runtime (containerd/docker) not running
Solutions:
- Check Node Status:
  ```
  kubectl describe node <node-name>
  # Look at the Conditions section for specific errors
  ```

- Restart Node: Terminate the EC2 instance; the autoscaling group will launch a replacement
- Check Security Groups: Ensure nodes can communicate with EKS control plane on port 443
- Disk Space: Nodes need at least 20% free disk. Increase disk size if needed
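To see every node's Ready condition at a glance, a kubectl one-liner can help (read-only sketch; requires kubectl access to the cluster):

```
# Print each node name with the status of its Ready condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```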
Network Creation Failed
Symptoms:
- VPC creation timeout
- Error: "CIDR block conflicts"
- NAT gateway creation failed
Solutions:
- Choose Different CIDR: Use a non-conflicting range (10.0.0.0/16, 172.16.0.0/16, 192.168.0.0/16)
- Check Existing VPCs: You may have hit the VPC limit (default 5 per region)
- Elastic IP Limit: Request increase if you need more than 5 EIPs
Deployment Issues
ImagePullBackOff
Symptoms:
- Deployment stuck in "Starting" state
- Pod events show ImagePullBackOff or ErrImagePull
- Logs contain "Failed to pull image"
Common Causes:
- Image name or tag is incorrect
- Image doesn't exist in the registry
- Registry requires authentication (private repo)
- Docker Hub rate limit exceeded (100 pulls/6hrs for anonymous users)
- No network connectivity to the registry
Solutions:
- Verify Image Exists:
  ```
  # Test the pull locally
  docker pull your-image:tag

  # Common mistakes:
  #   nginx:latest     ✓ Correct
  #   nginx:1.21       ✓ Correct
  #   nginx            ✗ Missing tag
  #   nginxinc/nginx   ✗ Wrong org
  #   nginx:latests    ✗ Typo in tag
  ```

- Add Registry Credentials: If using a private registry:
  - Navigate to Providers → Add Docker Registry provider
  - Enter username and password/token
  - SeaGit will create the imagePullSecret automatically
- Docker Hub Rate Limit: Use authenticated pulls:
  ```
  # Add Docker Hub credentials (authenticated pulls get 200 pulls/6hrs)
  # OR use a private registry/mirror to avoid limits
  ```

- Check Image Pull Logs:
  ```
  kubectl describe pod <pod-name>
  # Look at the Events section for the specific error
  ```
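If you manage registry credentials outside SeaGit, the pull secret can also be created by hand. A sketch (the secret name `regcred` and the server URL are placeholders):

```
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password-or-token>
```

The secret must then be referenced from the pod spec under `imagePullSecrets`, which SeaGit's automatic setup otherwise handles for you.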
CrashLoopBackOff
Symptoms:
- Pod starts but immediately crashes and restarts repeatedly
- Restart count keeps increasing
- Status shows CrashLoopBackOff
Common Causes:
- Application error on startup (uncaught exception)
- Missing required environment variables
- Cannot connect to database or external service
- Port already in use or misconfigured
- File permissions or missing files
Solutions:
- Check Container Logs:
  ```
  # In SeaGit, view deployment logs
  # OR use kubectl:
  kubectl logs <pod-name> --previous
  # --previous shows logs from the crashed container
  ```

- Common Fixes by Error Type:
  - "Cannot connect to database": check the DATABASE_URL variable and database connectivity
  - "Missing required environment variable": add the variable at the application or instance level
  - "Port 3000 already in use": ensure the container port matches the PORT env variable
  - "ENOENT: no such file": check file paths; you may need absolute paths or a working directory
- Test Locally: Run same image locally with same env variables to reproduce issue
- Add Startup Delay: If app needs time to initialize, increase health check initialDelaySeconds
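Probe timing lives in the pod spec. An illustrative fragment (the path, port, and delay values are placeholders; SeaGit may manage this section for you):

```
livenessProbe:
  httpGet:
    path: /health          # assumed health endpoint
    port: 3000             # assumed container port
  initialDelaySeconds: 30  # give the app time to initialize before the first probe
  periodSeconds: 10
  failureThreshold: 3
```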
OOMKilled (Out of Memory)
Symptoms:
- Pod restarts with reason "OOMKilled"
- Application suddenly terminates without error
- Exit code 137 in pod status
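Exit code 137 is not arbitrary: exit codes above 128 mean the process was killed by a signal, with code = 128 + signal number. A quick check in any POSIX shell:

```shell
# 137 - 128 = 9, i.e. SIGKILL, the signal the kernel OOM killer sends
code=137
echo "killed by signal $((code - 128))"
```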
Common Causes:
- Memory limit set too low for application needs
- Memory leak in application code
- Traffic spike causing high memory usage
- Large file processing or caching
Solutions:
- Increase Memory Limit:
  - Go to Application → Edit → Resources
  - Increase memory limit (e.g., 512Mi → 1Gi)
  - Redeploy application
- Profile Memory Usage:
  ```
  # Check current usage
  kubectl top pods
  # If usage is near the limit, an increase is definitely needed
  # If usage is low but the pod is still OOMKilled, suspect a spike or a leak
  ```

- Check for Memory Leaks:
  - Use profiling tools (Node.js: --inspect, Python: memory_profiler)
  - Look for growing memory over time
  - Common causes: unclosed connections, large caches, circular references
- Enable Horizontal Pod Autoscaling: Spread load across more pods instead of increasing single pod memory
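Autoscaling can also be enabled from the CLI. A minimal sketch (the deployment name and thresholds are illustrative; requires the Kubernetes metrics server):

```
kubectl autoscale deployment <deployment-name> --min=2 --max=6 --cpu-percent=70
kubectl get hpa   # verify the autoscaler was created
```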
Pods Pending
Symptoms:
- Pods stuck in "Pending" state
- Deployment never reaches "Running"
- Nodes exist but pods won't schedule
Common Causes:
- Insufficient cluster resources (CPU or memory)
- No nodes match pod requirements (affinity, taints)
- Persistent volume claim cannot be fulfilled
- Image pull in progress (may just need more time)
Solutions:
- Check Pod Events:
  ```
  kubectl describe pod <pod-name>
  # Look for a specific error like:
  #   "Insufficient cpu"
  #   "Insufficient memory"
  #   "No nodes available"
  ```

- Insufficient Resources:
  - Scale up the cluster (increase max nodes in the node group)
  - Reduce resource requests in application config
  - Remove resource-heavy pods to free capacity
- Node Affinity Issues:
  - Remove or adjust node selectors
  - Add nodes with required labels
  - Check for taints on nodes
- PVC Issues:
  - Verify storage class exists and can provision volumes
  - Check EBS volume quota in AWS
DNS & Ingress Issues
Domain Not Resolving
Symptoms:
- Browser shows "DNS_PROBE_FINISHED_NXDOMAIN"
- curl fails with "Could not resolve host"
- nslookup returns no records
Solutions:
- Check DNS Provider Configuration:
  - Verify DNS provider added in SeaGit (Cloudflare, Route53, PowerDNS)
  - Check API credentials are valid
  - Ensure the External DNS add-on is installed on the cluster
- Manual DNS Setup: If not using automated DNS:
  ```
  # Get the ALB endpoint from the deployment details
  ALB_ENDPOINT=abc123-12345.us-east-1.elb.amazonaws.com

  # Create a CNAME record:
  #   api.yourdomain.com → CNAME → <ALB_ENDPOINT>

  # Wait 5-10 minutes for DNS propagation
  ```

- Verify DNS Propagation:
  ```
  # Check if the DNS record exists
  nslookup api.yourdomain.com

  # Or use dig
  dig api.yourdomain.com

  # Check from multiple locations:
  #   https://dnschecker.org
  ```
Certificate Pending
Symptoms:
- HTTPS not working (connection refused on 443)
- Certificate shows as "Pending" for >10 minutes
- Browser SSL error
Solutions:
- Check Cert-Manager Status:
  ```
  kubectl get certificates -A
  kubectl describe certificate <cert-name>
  # Look for errors in the Status section
  ```

- DNS Validation: Let's Encrypt needs to verify domain ownership:
  - Ensure the domain resolves to the ALB
  - Check that port 80 is accessible (needed for the HTTP-01 challenge)
  - Verify the ALB security group allows inbound 80 and 443
- Rate Limits: Let's Encrypt has limits:
  - 50 certificates per domain per week
  - 5 failed validations per hour
  - Wait if you hit a limit, or use the staging environment to test
502 Bad Gateway or 504 Gateway Timeout
Symptoms:
- ALB returns 502 Bad Gateway error
- Request times out with 504 Gateway Timeout
- Intermittent connectivity issues
Common Causes:
- No healthy targets in the target group
- Health check failing on backend pods
- Application response taking too long
- Network connectivity between ALB and pods
Solutions:
- Check Pod Health:
  ```
  kubectl get pods
  # All pods should show Running and 1/1 ready

  kubectl logs <pod-name>
  # Check for application errors
  ```

- Fix Health Check Endpoint:
  - Ensure the /health endpoint returns a 200 status
  - Check that the health endpoint doesn't depend on external services
  - Test it directly:

    ```
    curl http://<pod-ip>:<port>/health
    ```
- Increase Timeouts: If application is slow:
  - Edit the application configuration
  - Increase the health check timeout and interval
  - Consider increasing the ALB idle timeout (default 60s)
- Check Target Group:
  - In the AWS Console, go to EC2 → Target Groups
  - Find the target group for your ingress
  - Check the Targets tab; it should show healthy targets
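With the AWS Load Balancer Controller, the ALB idle timeout can be raised through an ingress annotation. An illustrative fragment (120 seconds is an example value):

```
metadata:
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=120
```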
Add-on Issues
ALB Not Creating Load Balancer
Symptoms:
- Ingress created but no ALB appears in AWS
- Deployment shows no external URL
- Ingress has no ADDRESS
Solutions:
- Check ALB Controller Status:
  ```
  kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
  # Should show 2 running pods

  kubectl logs -n kube-system <alb-controller-pod>
  # Check for errors
  ```

- Verify IAM Permissions:
  - The ALB controller needs IAM permissions to create ALBs
  - Check the IAM role has the ElasticLoadBalancingFullAccess policy
- Check Ingress Annotations:
  ```
  kubectl describe ingress <ingress-name>
  # Should have the annotation:
  #   kubernetes.io/ingress.class: alb
  ```

- Subnet Tags: ALB requires specific subnet tags:
  ```
  # Public subnets need:
  #   kubernetes.io/role/elb: 1

  # SeaGit auto-adds these, but verify in the AWS VPC console
  ```
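Subnet tagging can also be verified from the CLI (read-only sketch; assumes the AWS CLI is configured for the cluster's region):

```
# List subnets that carry the public-ELB role tag
aws ec2 describe-subnets \
  --filters "Name=tag-key,Values=kubernetes.io/role/elb" \
  --query "Subnets[].[SubnetId, AvailabilityZone]" --output table
```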
Variables & Secrets Issues
Variable Changes Not Reflected
Symptoms:
- Updated a variable but the application still uses the old value
- Environment variable not available in the container
Solutions:
- Restart Required: Environment variables are injected at container start
  - Stop and start the deployment, OR
  - Create a new deployment (redeploy)
  - Variables don't hot-reload in running containers
- Check Inheritance: Variable may be overridden at lower level
  - Use the API to get the effective config:

    ```
    GET /api/v1/instances/{id}/config
    ```

    Shows the final merged values with all inheritance applied
- Verify in Container:
  ```
  kubectl exec -it <pod-name> -- env | grep VARIABLE_NAME
  # Check if the variable is actually present
  ```
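The restart itself can be done with kubectl if you prefer the CLI (the deployment name is a placeholder):

```
kubectl rollout restart deployment <deployment-name>
kubectl rollout status deployment <deployment-name>   # wait for the new pods to roll out
```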
Build & Source Issues
Build Failed
Symptoms:
- Deployment fails during the "Starting - Build" phase
- Build logs show errors
- Buildpack detection failed
Solutions:
- Check Build Logs: View detailed error in deployment logs
- Common Issues by Language:
Node.js:
  - Missing package.json or package-lock.json
  - Node version mismatch (specify in package.json: `"engines": {"node": ">=18"}`)
  - npm install failures (check dependencies)
Python:
  - Missing requirements.txt
  - Dependency conflicts or unavailable packages
  - Python version issues (use runtime.txt to specify)
Go:
  - Missing go.mod file
  - Build errors (fix in code)
  - Private module dependencies (need git credentials)
- Use Dockerfile Instead: If buildpack auto-detection fails, create Dockerfile:
  ```
  # Add a Dockerfile to the repo root
  FROM node:18
  WORKDIR /app
  COPY package*.json ./
  RUN npm install
  COPY . .
  EXPOSE 3000
  CMD ["npm", "start"]
  ```
Performance Issues
Slow Response Times
Diagnostic Steps:
- Check Resource Usage:
  ```
  kubectl top pods
  # Look for pods near their CPU or memory limits
  ```

- Common Causes & Fixes:
  - CPU throttling: Increase CPU limit
  - Database slow: Optimize queries, add indexes, scale database
  - Cold starts: Increase min replicas to keep pods warm
  - Network latency: Deploy closer to users (multi-region)
  - Unoptimized code: Profile and optimize application
- Enable Autoscaling: Handle traffic spikes automatically
Getting Additional Help
If you're still experiencing issues after trying these solutions:
1. Gather Diagnostics
Collect this information before reaching out:
- Deployment logs from the SeaGit UI
- Pod status: `kubectl get pods -n <namespace>`
- Pod events: `kubectl describe pod <pod-name>`
- Application logs: `kubectl logs <pod-name>`
- Cluster info: Kubernetes version, node types, add-ons installed
2. Contact Support
- Community Support: Join our community for peer support and discussions
- Email Support: For account-specific issues and bug reports (support@seagit.com)
- GitHub Issues: Report bugs or request features (github.com/seagits/platform/issues)