
Troubleshooting Guide

Common issues and solutions for the SeaGit platform. Find quick fixes for deployment failures, networking problems, and infrastructure errors.

Quick Navigation: Use browser search (Ctrl+F or Cmd+F) to find specific error messages or symptoms.

Infrastructure Issues

Cluster Creation Failed

Symptoms:

  • Cluster stuck in "Creating" state for >30 minutes
  • Error: "Failed to create control plane"
  • AWS CloudFormation stack failed

Common Causes:

  • Insufficient AWS service quotas (VPCs, EIPs, NAT gateways)
  • IAM permissions missing for EKS operations
  • Selected region doesn't support EKS
  • Subnet CIDR blocks conflict with existing resources

Solutions:

  1. Check Service Quotas:
    # In AWS Console, check:
    Service Quotas → Amazon VPC → VPCs per region (need at least 1 available)
    Service Quotas → Amazon EC2 → NAT gateways per AZ (need 2 for HA)
    Service Quotas → Amazon EC2 → Elastic IPs (need at least 2)
  2. Verify IAM Permissions: Ensure the IAM user has AmazonEKSClusterPolicy and AmazonEKSVPCResourceController
  3. Try Different Region: Some regions have limited capacity. Try us-east-1, us-west-2, or eu-west-1
  4. Check Logs: View cluster creation logs in SeaGit for specific error messages

Node Not Ready

Symptoms:

  • Nodes show status "NotReady" in cluster view
  • Pods remain in "Pending" state indefinitely
  • kubectl get nodes shows NotReady

Common Causes:

  • Network connectivity issues between nodes and control plane
  • Kubelet service not running or crashed
  • Disk pressure or out of disk space
  • Container runtime (containerd/docker) not running

Solutions:

  1. Check Node Status:
    kubectl describe node <node-name>
    # Look for Conditions section for specific errors
  2. Restart Node: Terminate the EC2 instance; the autoscaling group will launch a replacement
  3. Check Security Groups: Ensure nodes can communicate with EKS control plane on port 443
  4. Disk Space: Nodes need at least 20% free disk. Increase disk size if needed

Network Creation Failed

Symptoms:

  • VPC creation timeout
  • Error: "CIDR block conflicts"
  • NAT gateway creation failed

Solutions:

  1. Choose Different CIDR: Use a non-conflicting range (10.0.0.0/16, 172.16.0.0/16, 192.168.0.0/16)
  2. Check Existing VPCs: You may have hit the VPC limit (default 5 per region)
  3. Elastic IP Limit: Request increase if you need more than 5 EIPs

Deployment Issues

ImagePullBackOff

Symptoms:

  • Deployment stuck in "Starting" state
  • Pod events show ImagePullBackOff or ErrImagePull
  • Logs contain "Failed to pull image"

Common Causes:

  • Image name or tag is incorrect
  • Image doesn't exist in registry
  • Registry requires authentication (private repo)
  • Docker Hub rate limit exceeded (100 pulls/6hrs for anonymous)
  • No network connectivity to the registry

Solutions:

  1. Verify Image Exists:
    # Test pull locally
    docker pull your-image:tag
    
    # Common mistakes:
    nginx:latest        ✓ Correct
    nginx:1.21          ✓ Correct  
    nginx               ✗ Missing tag
    nginxinc/nginx      ✗ Wrong org
    nginx:latests       ✗ Typo in tag
  2. Add Registry Credentials: If using private registry:
    • Navigate to Providers → Add Docker Registry provider
    • Enter username and password/token
    • SeaGit will create imagePullSecret automatically
  3. Docker Hub Rate Limit: Use authenticated pulls:
    # Add Docker Hub credentials (gets 200 pulls/6hrs)
    # OR use private registry/mirror to avoid limits
  4. Check Image Pull Logs:
    kubectl describe pod <pod-name>
    # Look at Events section for specific error
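
For reference, the pull secret that SeaGit creates is wired into the pod spec roughly like this; the names below are illustrative, and SeaGit manages this for you automatically:

```yaml
# Hypothetical names for illustration; SeaGit creates and attaches the
# secret automatically when a Docker Registry provider is configured.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  imagePullSecrets:
    - name: my-registry-secret   # a docker-registry type secret
  containers:
    - name: app
      image: registry.example.com/my-app:1.0.0
```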

CrashLoopBackOff

Symptoms:

  • Pod starts but immediately crashes and restarts repeatedly
  • Restart count keeps increasing
  • Status shows CrashLoopBackOff

Common Causes:

  • Application error on startup (uncaught exception)
  • Missing required environment variables
  • Cannot connect to database or external service
  • Port already in use or misconfigured
  • File permissions or missing files

Solutions:

  1. Check Container Logs:
    # In SeaGit, view deployment logs
    # OR use kubectl:
    kubectl logs <pod-name> --previous
    # --previous shows logs from crashed container
  2. Common Fixes by Error Type:

    Error: "Cannot connect to database"

    Check DATABASE_URL variable and database connectivity

    Error: "Missing required environment variable"

    Add the variable at application or instance level

    Error: "Port 3000 already in use"

    Ensure container port matches PORT env variable

    Error: "ENOENT: no such file"

    Check file paths - may need absolute paths or working directory

  3. Test Locally: Run same image locally with same env variables to reproduce issue
  4. Add Startup Delay: If app needs time to initialize, increase health check initialDelaySeconds
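
The probe change in step 4 looks roughly like this in a pod spec; the path, port, and timings below are illustrative and should be tuned to your app:

```yaml
# Illustrative probe settings; adjust path, port, and timings to your app.
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30   # give the app time to initialize before probing
  periodSeconds: 10
  failureThreshold: 3       # restarts only after 3 consecutive failures
```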

OOMKilled (Out of Memory)

Symptoms:

  • Pod restarts with reason "OOMKilled"
  • Application suddenly terminates without error
  • Exit code 137 (128 + SIGKILL) in pod status

Common Causes:

  • Memory limit set too low for application needs
  • Memory leak in application code
  • Traffic spike causing high memory usage
  • Large file processing or caching

Solutions:

  1. Increase Memory Limit:
    • Go to Application → Edit → Resources
    • Increase memory limit (e.g., 512Mi → 1Gi)
    • Redeploy application
  2. Profile Memory Usage:
    # Check current usage
    kubectl top pods
    
    # If usage is near the limit, you likely need an increase
    # If usage is low but pods are still OOMKilled, suspect a spike or leak
  3. Check for Memory Leaks:
    • Use profiling tools (Node.js: --inspect, Python: memory_profiler)
    • Look for growing memory over time
    • Common causes: unclosed connections, large caches, circular references
  4. Enable Horizontal Pod Autoscaling: Spread load across more pods instead of increasing single pod memory
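
In manifest terms, the change from step 1 is a bump to the container's memory limit; the values below are illustrative:

```yaml
# Illustrative values; size the limit to your app's observed usage.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"   # raised from 512Mi after observing OOMKills
```

Setting the request below the limit lets the scheduler pack pods efficiently while still giving each container headroom for spikes.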

Pods Pending

Symptoms:

  • Pods stuck in "Pending" state
  • Deployment never reaches "Running"
  • Nodes exist but pods won't schedule

Common Causes:

  • Insufficient cluster resources (CPU or memory)
  • No nodes match pod requirements (affinity, taints)
  • Persistent volume claim cannot be fulfilled
  • Image pull in progress (may just need more time)

Solutions:

  1. Check Pod Events:
    kubectl describe pod <pod-name>
    # Look for specific error like:
    # - "Insufficient cpu"
    # - "Insufficient memory"  
    # - "No nodes available"
  2. Insufficient Resources:
    • Scale up cluster (increase max nodes in node group)
    • Reduce resource requests in application config
    • Remove resource-heavy pods to free capacity
  3. Node Affinity Issues:
    • Remove or adjust node selectors
    • Add nodes with required labels
    • Check for taints on nodes
  4. PVC Issues:
    • Verify storage class exists and can provision volumes
    • Check EBS volume quota in AWS
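
If step 3 reveals a taint blocking scheduling, one option is to add a matching toleration to the pod spec; the key and value below are hypothetical placeholders:

```yaml
# Illustrative: tolerate a hypothetical "dedicated=gpu" taint so pods
# can schedule onto those nodes. Replace key/value with your taint.
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```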

DNS & Ingress Issues

Domain Not Resolving

Symptoms:

  • Browser shows "DNS_PROBE_FINISHED_NXDOMAIN"
  • curl fails with "Could not resolve host"
  • nslookup returns no records

Solutions:

  1. Check DNS Provider Configuration:
    • Verify DNS provider added in SeaGit (Cloudflare, Route53, PowerDNS)
    • Check API credentials are valid
    • Ensure External DNS add-on is installed on cluster
  2. Manual DNS Setup: If not using automated DNS:
    # Get ALB endpoint from deployment details
    ALB_ENDPOINT=abc123-12345.us-east-1.elb.amazonaws.com
    
    # Create CNAME record:
    api.yourdomain.com → CNAME → <ALB_ENDPOINT>
    
    # Wait 5-10 minutes for DNS propagation
  3. Verify DNS Propagation:
    # Check if DNS record exists
    nslookup api.yourdomain.com
    
    # Or use dig
    dig api.yourdomain.com
    
    # Check from multiple locations
    # https://dnschecker.org
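
When the External DNS add-on is installed, it can create records from a hostname annotation; a sketch of an annotated Service (names and domain are illustrative) looks like this:

```yaml
# Illustrative Service; external-dns reads the hostname annotation and
# manages the DNS record in the configured provider.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.yourdomain.com
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 3000
```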

Certificate Pending

Symptoms:

  • • HTTPS not working (connection refused on 443)
  • • Certificate shows as "Pending" for >10 minutes
  • • Browser SSL error

Solutions:

  1. Check Cert-Manager Status:
    kubectl get certificates -A
    kubectl describe certificate <cert-name>
    
    # Look for errors in Status section
  2. DNS Validation: Let's Encrypt needs to verify domain ownership:
    • Ensure domain resolves to ALB
    • Check that port 80 is accessible (needed for HTTP-01 challenge)
    • Verify ALB security group allows inbound 80 and 443
  3. Rate Limits: Let's Encrypt has limits:
    • 50 certificates per registered domain per week
    • 5 failed validations per hour
    • Wait if you hit a limit, or use the staging environment to test
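
For testing against the Let's Encrypt staging environment, a cert-manager issuer can point at the staging CA; this is a sketch, and the issuer name, email, and secret name are placeholders:

```yaml
# Illustrative cert-manager issuer using the Let's Encrypt staging CA,
# which has much higher rate limits (staging certs are not browser-trusted).
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: alb
```

Once validation succeeds against staging, switch the issuer back to the production ACME server.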

502 Bad Gateway or 504 Gateway Timeout

Symptoms:

  • ALB returns 502 Bad Gateway error
  • Request times out with 504 Gateway Timeout
  • Intermittent connectivity issues

Common Causes:

  • No healthy targets in target group
  • Health check failing on backend pods
  • Application response taking too long
  • Network connectivity problems between the ALB and pods

Solutions:

  1. Check Pod Health:
    kubectl get pods
    # All pods should show Running and 1/1 ready
    
    kubectl logs <pod-name>
    # Check for application errors
  2. Fix Health Check Endpoint:
    • Ensure /health endpoint returns 200 status
    • Check health endpoint doesn't depend on external services
    • Test: curl http://<pod-ip>:<port>/health
  3. Increase Timeouts: If application is slow:
    • Edit application configuration
    • Increase health check timeout and interval
    • Consider increasing ALB idle timeout (default 60s)
  4. Check Target Group:
    • In AWS Console, go to EC2 → Target Groups
    • Find target group for your ingress
    • Check the Targets tab; it should show healthy targets
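
The health check and timeout changes from steps 2 and 3 map to AWS Load Balancer Controller ingress annotations roughly like this; the path and numbers are illustrative:

```yaml
# Illustrative ingress annotations; adjust path and values to your app.
metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "10"
    # Raise the ALB idle timeout above the default 60s for slow endpoints:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=120
```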

Add-on Issues

ALB Not Creating Load Balancer

Symptoms:

  • Ingress created but no ALB appears in AWS
  • Deployment shows no external URL
  • Ingress has no ADDRESS

Solutions:

  1. Check ALB Controller Status:
    kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
    # Should show 2 running pods
    
    kubectl logs -n kube-system <alb-controller-pod>
    # Check for errors
  2. Verify IAM Permissions:
    • ALB controller needs IAM permissions to create ALBs
    • Check IAM role has ElasticLoadBalancingFullAccess policy
  3. Check Ingress Annotations:
    kubectl describe ingress <ingress-name>
    # Should have annotation:
    # kubernetes.io/ingress.class: alb
  4. Subnet Tags: ALB requires specific subnet tags:
    # Public subnets need:
    kubernetes.io/role/elb: 1
    
    # SeaGit auto-adds these, but verify in AWS VPC console

Variables & Secrets Issues

Variable Changes Not Reflected

Symptoms:

  • Updated variable but application still uses old value
  • Environment variable not available in container

Solutions:

  1. Restart Required: Environment variables are injected at container start
    • Stop and start deployment, OR
    • Create new deployment (redeploy)
    • Variables don't hot-reload in running containers
  2. Check Inheritance: Variable may be overridden at a lower level
    • Use API to get effective config: GET /api/v1/instances/{id}/config
    • Shows final merged values with all inheritance
  3. Verify in Container:
    kubectl exec -it <pod-name> -- env | grep VARIABLE_NAME
    # Check if variable is actually present

Build & Source Issues

Build Failed

Symptoms:

  • Deployment fails during "Starting - Build" phase
  • Build logs show errors
  • Buildpack detection failed

Solutions:

  1. Check Build Logs: View detailed error in deployment logs
  2. Common Issues by Language:

    Node.js:

    • Missing package.json or package-lock.json
    • Node version mismatch (specify in package.json: "engines": {"node": ">=18"})
    • npm install failures (check dependencies)

    Python:

    • Missing requirements.txt
    • Dependency conflicts or unavailable packages
    • Python version issues (use runtime.txt to specify)

    Go:

    • Missing go.mod file
    • Build errors (fix in code)
    • Private module dependencies (need git credentials)
  3. Use Dockerfile Instead: If buildpack auto-detection fails, create Dockerfile:
    # Add Dockerfile to repo root
    FROM node:18
    WORKDIR /app
    COPY package*.json ./
    RUN npm install
    COPY . .
    EXPOSE 3000
    CMD ["npm", "start"]

Performance Issues

Slow Response Times

Diagnostic Steps:

  1. Check Resource Usage:
    kubectl top pods
    # Look for pods near CPU or memory limits
  2. Common Causes & Fixes:
    • CPU throttling: Increase CPU limit
    • Database slow: Optimize queries, add indexes, scale database
    • Cold starts: Increase min replicas to keep pods warm
    • Network latency: Deploy closer to users (multi-region)
    • Unoptimized code: Profile and optimize application
  3. Enable Autoscaling: Handle traffic spikes automatically
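
A HorizontalPodAutoscaler covering both fixes, warm minimum replicas and automatic scale-out, looks roughly like this; names and thresholds are placeholders:

```yaml
# Illustrative HPA; replace names and thresholds with your own.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2        # keep pods warm to avoid cold starts
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```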

Getting Additional Help

If you're still experiencing issues after trying these solutions:

1. Gather Diagnostics

Collect this information before reaching out:

  • Deployment logs from SeaGit UI
  • Pod status: kubectl get pods -n <namespace>
  • Pod events: kubectl describe pod <pod-name>
  • Application logs: kubectl logs <pod-name>
  • Cluster info: Kubernetes version, node types, add-ons installed

2. Contact Support