
Troubleshooting Guide

Common issues and solutions for the SeaGit platform. Find quick fixes for deployment failures, networking problems, and infrastructure errors.

Quick Navigation: Use browser search (Ctrl+F or Cmd+F) to find specific error messages or symptoms.

Infrastructure Issues

Cluster Creation Failed

Symptoms:

  • Cluster stuck in "Creating" state for >30 minutes
  • Error: "Failed to create control plane"
  • AWS CloudFormation stack failed

Common Causes:

  • Insufficient AWS service quotas (VPCs, EIPs, NAT gateways)
  • IAM permissions missing for EKS operations
  • Selected region doesn't support EKS
  • Subnet CIDR blocks conflict with existing resources

Solutions:

  1. Check Service Quotas:
    # In AWS Console, check:
    Service Quotas → Amazon VPC → VPCs per region (need at least 1 available)
    Service Quotas → Amazon EC2 → NAT gateways per AZ (need 2 for HA)
    Service Quotas → Amazon EC2 → Elastic IPs (need at least 2)
  2. Verify IAM Permissions: Ensure the IAM user has AmazonEKSClusterPolicy and AmazonEKSVPCResourceController
  3. Try Different Region: Some regions have limited capacity. Try us-east-1, us-west-2, or eu-west-1
  4. Check Logs: View cluster creation logs in SeaGit for specific error messages

Node Not Ready

Symptoms:

  • Nodes show status "NotReady" in cluster view
  • Pods remain in "Pending" state indefinitely
  • kubectl get nodes shows NotReady

Common Causes:

  • Network connectivity issues between nodes and control plane
  • Kubelet service not running or crashed
  • Disk pressure or out of disk space
  • Container runtime (containerd/docker) not running

Solutions:

  1. Check Node Status:
    kubectl describe node <node-name>
    # Look for Conditions section for specific errors
  2. Restart Node: Terminate the EC2 instance; the autoscaling group will launch a replacement
  3. Check Security Groups: Ensure nodes can communicate with EKS control plane on port 443
  4. Disk Space: Nodes need at least 20% free disk. Increase disk size if needed

Network Creation Failed

Symptoms:

  • VPC creation timeout
  • Error: "CIDR block conflicts"
  • NAT gateway creation failed

Solutions:

  1. Choose Different CIDR: Use a non-conflicting range (10.0.0.0/16, 172.16.0.0/16, 192.168.0.0/16)
  2. Check Existing VPCs: You may have hit the VPC limit (default 5 per region)
  3. Elastic IP Limit: Request increase if you need more than 5 EIPs

Deployment Issues

ImagePullBackOff

Symptoms:

  • Deployment stuck in "Starting" state
  • Pod events show ImagePullBackOff or ErrImagePull
  • Logs contain "Failed to pull image"

Common Causes:

  • Image name or tag is incorrect
  • Image doesn't exist in registry
  • Registry requires authentication (private repo)
  • Docker Hub rate limit exceeded (100 pulls/6hrs for anonymous)
  • No network connectivity to the registry

Solutions:

  1. Verify Image Exists:
    # Test pull locally
    docker pull your-image:tag
    
    # Common mistakes:
    nginx:latest        ✓ Correct
    nginx:1.21          ✓ Correct  
    nginx               ✗ Missing tag
    nginxinc/nginx      ✗ Wrong org
    nginx:latests       ✗ Typo in tag
  2. Add Registry Credentials: If using private registry:
    • Navigate to Providers → Add Docker Registry provider
    • Enter username and password/token
    • SeaGit will create imagePullSecret automatically
  3. Docker Hub Rate Limit: Use authenticated pulls:
    # Add Docker Hub credentials (gets 200 pulls/6hrs)
    # OR use private registry/mirror to avoid limits
  4. Check Image Pull Logs:
    kubectl describe pod <pod-name>
    # Look at Events section for specific error
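
For reference, the pull secret that SeaGit creates is wired into the pod spec roughly like this; the names below are illustrative, and SeaGit manages this for you automatically:

```yaml
# Hypothetical names for illustration; SeaGit creates and attaches the
# secret automatically when a Docker Registry provider is configured.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  imagePullSecrets:
    - name: my-registry-secret   # a docker-registry type secret
  containers:
    - name: app
      image: registry.example.com/my-app:1.0.0
```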

CrashLoopBackOff

Symptoms:

  • Pod starts but immediately crashes and restarts repeatedly
  • Restart count keeps increasing
  • Status shows CrashLoopBackOff

Common Causes:

  • Application error on startup (uncaught exception)
  • Missing required environment variables
  • Cannot connect to database or external service
  • Port already in use or misconfigured
  • File permissions or missing files

Solutions:

  1. Check Container Logs:
    # In SeaGit, view deployment logs
    # OR use kubectl:
    kubectl logs <pod-name> --previous
    # --previous shows logs from crashed container
  2. Common Fixes by Error Type:

    Error: "Cannot connect to database"

    Check DATABASE_URL variable and database connectivity

    Error: "Missing required environment variable"

    Add the variable at application or instance level

    Error: "Port 3000 already in use"

    Ensure container port matches PORT env variable

    Error: "ENOENT: no such file"

    Check file paths - may need absolute paths or working directory

  3. Test Locally: Run same image locally with same env variables to reproduce issue
  4. Add Startup Delay: If app needs time to initialize, increase health check initialDelaySeconds
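
The probe change in step 4 looks roughly like this in a pod spec; the path, port, and timings below are illustrative and should be tuned to your app:

```yaml
# Illustrative probe settings; adjust path, port, and timings to your app.
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30   # give the app time to initialize before probing
  periodSeconds: 10
  failureThreshold: 3       # restarts only after 3 consecutive failures
```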

OOMKilled (Out of Memory)

Symptoms:

  • Pod restarts with reason "OOMKilled"
  • Application suddenly terminates without error
  • Exit code 137 (128 + SIGKILL) in pod status

Common Causes:

  • Memory limit set too low for application needs
  • Memory leak in application code
  • Traffic spike causing high memory usage
  • Large file processing or caching

Solutions:

  1. Increase Memory Limit:
    • Go to Application → Edit → Resources
    • Increase memory limit (e.g., 512Mi → 1Gi)
    • Redeploy application
  2. Profile Memory Usage:
    # Check current usage
    kubectl top pods
    
    # If usage is near the limit, you likely need an increase
    # If usage is low but pods are still OOMKilled, suspect a spike or leak
  3. Check for Memory Leaks:
    • Use profiling tools (Node.js: --inspect, Python: memory_profiler)
    • Look for growing memory over time
    • Common causes: unclosed connections, large caches, circular references
  4. Enable Horizontal Pod Autoscaling: Spread load across more pods instead of increasing single pod memory
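
In manifest terms, the change from step 1 is a bump to the container's memory limit; the values below are illustrative:

```yaml
# Illustrative values; size the limit to your app's observed usage.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"   # raised from 512Mi after observing OOMKills
```

Setting the request below the limit lets the scheduler pack pods efficiently while still giving each container headroom for spikes.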

Pods Pending

Symptoms:

  • Pods stuck in "Pending" state
  • Deployment never reaches "Running"
  • Nodes exist but pods won't schedule

Common Causes:

  • Insufficient cluster resources (CPU or memory)
  • No nodes match pod requirements (affinity, taints)
  • Persistent volume claim cannot be fulfilled
  • Image pull in progress (may just need more time)

Solutions:

  1. Check Pod Events:
    kubectl describe pod <pod-name>
    # Look for specific error like:
    # - "Insufficient cpu"
    # - "Insufficient memory"  
    # - "No nodes available"
  2. Insufficient Resources:
    • Scale up cluster (increase max nodes in node group)
    • Reduce resource requests in application config
    • Remove resource-heavy pods to free capacity
  3. Node Affinity Issues:
    • Remove or adjust node selectors
    • Add nodes with required labels
    • Check for taints on nodes
  4. PVC Issues:
    • Verify storage class exists and can provision volumes
    • Check EBS volume quota in AWS
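
If step 3 reveals a taint blocking scheduling, one option is to add a matching toleration to the pod spec; the key and value below are hypothetical placeholders:

```yaml
# Illustrative: tolerate a hypothetical "dedicated=gpu" taint so pods
# can schedule onto those nodes. Replace key/value with your taint.
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```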

DNS & Ingress Issues

Domain Not Resolving

Symptoms:

  • Browser shows "DNS_PROBE_FINISHED_NXDOMAIN"
  • curl fails with "Could not resolve host"
  • nslookup returns no records

Solutions:

  1. Check DNS Provider Configuration:
    • Verify DNS provider added in SeaGit (Cloudflare, Route53, PowerDNS)
    • Check API credentials are valid
    • Ensure External DNS add-on is installed on cluster
  2. Manual DNS Setup: If not using automated DNS:
    # Get ALB endpoint from deployment details
    ALB_ENDPOINT=abc123-12345.us-east-1.elb.amazonaws.com
    
    # Create CNAME record:
    api.yourdomain.com → CNAME → <ALB_ENDPOINT>
    
    # Wait 5-10 minutes for DNS propagation
  3. Verify DNS Propagation:
    # Check if DNS record exists
    nslookup api.yourdomain.com
    
    # Or use dig
    dig api.yourdomain.com
    
    # Check from multiple locations
    # https://dnschecker.org
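
When the External DNS add-on is installed, it can create records from a hostname annotation; a sketch of an annotated Service (names and domain are illustrative) looks like this:

```yaml
# Illustrative Service; external-dns reads the hostname annotation and
# manages the DNS record in the configured provider.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.yourdomain.com
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 3000
```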

Certificate Pending

Symptoms:

  • • HTTPS not working (connection refused on 443)
  • • Certificate shows as "Pending" for >10 minutes
  • • Browser SSL error

Solutions:

  1. Check Cert-Manager Status:
    kubectl get certificates -A
    kubectl describe certificate <cert-name>
    
    # Look for errors in Status section
  2. DNS Validation: Let's Encrypt needs to verify domain ownership:
    • Ensure domain resolves to ALB
    • Check that port 80 is accessible (needed for HTTP-01 challenge)
    • Verify ALB security group allows inbound 80 and 443
  3. Rate Limits: Let's Encrypt has limits:
    • 50 certificates per registered domain per week
    • 5 failed validations per hour
    • Wait if you hit a limit, or use the staging environment to test
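
For testing against the Let's Encrypt staging environment, a cert-manager issuer can point at the staging CA; this is a sketch, and the issuer name, email, and secret name are placeholders:

```yaml
# Illustrative cert-manager issuer using the Let's Encrypt staging CA,
# which has much higher rate limits (staging certs are not browser-trusted).
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: alb
```

Once validation succeeds against staging, switch the issuer back to the production ACME server.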

502 Bad Gateway or 504 Gateway Timeout

Symptoms:

  • ALB returns 502 Bad Gateway error
  • Request times out with 504 Gateway Timeout
  • Intermittent connectivity issues

Common Causes:

  • No healthy targets in target group
  • Health check failing on backend pods
  • Application response taking too long
  • Network connectivity problems between the ALB and pods

Solutions:

  1. Check Pod Health:
    kubectl get pods
    # All pods should show Running and 1/1 ready
    
    kubectl logs <pod-name>
    # Check for application errors
  2. Fix Health Check Endpoint:
    • Ensure /health endpoint returns 200 status
    • Check health endpoint doesn't depend on external services
    • Test: curl http://<pod-ip>:<port>/health
  3. Increase Timeouts: If application is slow:
    • Edit application configuration
    • Increase health check timeout and interval
    • Consider increasing ALB idle timeout (default 60s)
  4. Check Target Group:
    • In AWS Console, go to EC2 → Target Groups
    • Find target group for your ingress
    • Check the Targets tab; it should show healthy targets
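
The health check and timeout changes from steps 2 and 3 map to AWS Load Balancer Controller ingress annotations roughly like this; the path and numbers are illustrative:

```yaml
# Illustrative ingress annotations; adjust path and values to your app.
metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "10"
    # Raise the ALB idle timeout above the default 60s for slow endpoints:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=120
```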

Add-on Issues

ALB Not Creating Load Balancer

Symptoms:

  • Ingress created but no ALB appears in AWS
  • Deployment shows no external URL
  • Ingress has no ADDRESS

Solutions:

  1. Check ALB Controller Status:
    kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
    # Should show 2 running pods
    
    kubectl logs -n kube-system <alb-controller-pod>
    # Check for errors
  2. Verify IAM Permissions:
    • ALB controller needs IAM permissions to create ALBs
    • Check IAM role has ElasticLoadBalancingFullAccess policy
  3. Check Ingress Annotations:
    kubectl describe ingress <ingress-name>
    # Should have annotation:
    # kubernetes.io/ingress.class: alb
  4. Subnet Tags: ALB requires specific subnet tags:
    # Public subnets need:
    kubernetes.io/role/elb: 1
    
    # SeaGit auto-adds these, but verify in AWS VPC console

Variables & Secrets Issues

Variable Changes Not Reflected

Symptoms:

  • Updated variable but application still uses old value
  • Environment variable not available in container

Solutions:

  1. Restart Required: Environment variables are injected at container start
    • Stop and start deployment, OR
    • Create new deployment (redeploy)
    • Variables don't hot-reload in running containers
  2. Check Inheritance: Variable may be overridden at a lower level
    • Use API to get effective config: GET /api/v1/instances/{id}/config
    • Shows final merged values with all inheritance
  3. Verify in Container:
    kubectl exec -it <pod-name> -- env | grep VARIABLE_NAME
    # Check if variable is actually present

Build & Source Issues

Build Failed

Symptoms:

  • Deployment fails during "Starting - Build" phase
  • Build logs show errors
  • Buildpack detection failed

Solutions:

  1. Check Build Logs: View detailed error in deployment logs
  2. Common Issues by Language:

    Node.js:

    • Missing package.json or package-lock.json
    • Node version mismatch (specify in package.json: "engines": {"node": ">=18"})
    • npm install failures (check dependencies)

    Python:

    • Missing requirements.txt
    • Dependency conflicts or unavailable packages
    • Python version issues (use runtime.txt to specify)

    Go:

    • Missing go.mod file
    • Build errors (fix in code)
    • Private module dependencies (need git credentials)
  3. Use Dockerfile Instead: If buildpack auto-detection fails, create Dockerfile:
    # Add Dockerfile to repo root
    FROM node:18
    WORKDIR /app
    COPY package*.json ./
    RUN npm install
    COPY . .
    EXPOSE 3000
    CMD ["npm", "start"]

Performance Issues

Slow Response Times

Diagnostic Steps:

  1. Check Resource Usage:
    kubectl top pods
    # Look for pods near CPU or memory limits
  2. Common Causes & Fixes:
    • CPU throttling: Increase CPU limit
    • Database slow: Optimize queries, add indexes, scale database
    • Cold starts: Increase min replicas to keep pods warm
    • Network latency: Deploy closer to users (multi-region)
    • Unoptimized code: Profile and optimize application
  3. Enable Autoscaling: Handle traffic spikes automatically
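
A HorizontalPodAutoscaler covering both fixes, warm minimum replicas and automatic scale-out, looks roughly like this; names and thresholds are placeholders:

```yaml
# Illustrative HPA; replace names and thresholds with your own.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2        # keep pods warm to avoid cold starts
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```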

Getting Additional Help

If you're still experiencing issues after trying these solutions:

1. Gather Diagnostics

Collect this information before reaching out:

  • Deployment logs from SeaGit UI
  • Pod status: kubectl get pods -n <namespace>
  • Pod events: kubectl describe pod <pod-name>
  • Application logs: kubectl logs <pod-name>
  • Cluster info: Kubernetes version, node types, add-ons installed

2. Contact Support