# Troubleshooting Guide Reference

This document provides troubleshooting procedures for common issues in KubeZero deployments.

## Common Issues and Solutions

### Stack Deployment Issues

#### Stack Fails to Deploy

**Symptoms:**

- Stack status shows `Failed` or `Degraded`
- ArgoCD application is out of sync
- Pods are in `CrashLoopBackOff` state

**Diagnosis:**

```bash
# Check stack status
kubectl get stack my-stack -o yaml

# Check ArgoCD application
kubectl get application my-stack -n argocd

# Review pod logs
kubectl logs -f deployment/my-app -n my-namespace
```

**Solutions:**

- Resource constraints:

  ```bash
  # Check node resources
  kubectl top nodes

  # Check pod resource requests/limits
  kubectl describe pod failing-pod
  ```

- Configuration errors:

  ```bash
  # Validate the manifest with a server-side dry run
  kubectl apply --dry-run=server -f stack-config.yaml

  # Check configmap/secret references
  kubectl get configmap,secret -n my-namespace
  ```

#### Module Dependencies Not Met

**Symptoms:**

- Modules fail to start in the correct order
- Dependency errors in the logs

**Solutions:**

```yaml
# Add explicit dependencies in the stack configuration
spec:
  modules:
    - name: cert-manager
      dependencies: []
    - name: ingress-nginx
      dependencies: ["cert-manager"]
```

### Networking Issues

#### Pod-to-Pod Communication Failures

**Diagnosis:**

```bash
# Test connectivity from a pod to a service
# (note: .svc.cluster.local names resolve services, not individual pods)
kubectl exec -it pod1 -- ping my-service.my-namespace.svc.cluster.local

# Check network policies
kubectl get networkpolicy -A

# Verify service endpoints
kubectl get endpoints my-service
```

**Solutions:**

- Network policy issues:

  ```yaml
  # Allow ingress from a specific namespace
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-cross-namespace
  spec:
    podSelector: {}
    policyTypes:
      - Ingress
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                # kubernetes.io/metadata.name is set automatically on every namespace
                kubernetes.io/metadata.name: allowed-namespace
  ```

- DNS resolution:

  ```bash
  # Test DNS resolution
  kubectl exec -it pod -- nslookup kubernetes.default

  # Check CoreDNS configuration
  kubectl get configmap coredns -n kube-system -o yaml
  ```

#### Ingress Not Working

**Diagnosis:**

```bash
# Check ingress resource
kubectl describe ingress my-ingress

# Verify ingress controller
kubectl get pods -n ingress-nginx

# Check certificate status
kubectl describe certificate my-cert
```

**Solutions:**

- Missing TLS certificates:

  ```yaml
  # Ensure the certificate is referenced in the ingress spec
  spec:
    tls:
      - hosts:
          - example.com
        secretName: example-tls
  ```

- Ingress class issues:

  ```yaml
  # Set the ingress class via spec (the kubernetes.io/ingress.class
  # annotation is deprecated since Kubernetes 1.18)
  spec:
    ingressClassName: nginx
  ```

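Putting both fixes together, a complete Ingress might look like the sketch below; the host, the backend service, and the `letsencrypt` cert-manager issuer are placeholders for your own values:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    # Assumes cert-manager is installed with a ClusterIssuer named "letsencrypt"
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - example.com
      secretName: example-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # placeholder backend service
                port:
                  number: 80
```
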
### Storage Issues

#### Persistent Volume Claims Stuck in Pending

**Diagnosis:**

```bash
# Check PVC status and events
kubectl describe pvc my-pvc

# Check available storage classes
kubectl get storageclass

# Verify node capacity
kubectl describe nodes
```

**Solutions:**

- Storage class not available:

  ```yaml
  # Specify a storage class that exists in the cluster
  spec:
    storageClassName: fast-ssd
  ```

- Insufficient node storage:

  ```bash
  # Clean up unused volumes
  kubectl delete pvc unused-pvc

  # Check node disk usage (the node's root filesystem is mounted
  # at /host inside the debug pod)
  kubectl debug node/my-node -it --image=busybox -- df -h /host
  ```

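For reference, a complete claim combining the pieces above; `fast-ssd` is a placeholder and must match a class listed by `kubectl get storageclass`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # must name an existing StorageClass
  resources:
    requests:
      storage: 10Gi
```
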
### Security Issues

#### RBAC Permission Denied

**Diagnosis:**

```bash
# Check current user permissions
kubectl auth can-i create pods

# Review service account permissions
kubectl describe serviceaccount my-sa

# Check role bindings
kubectl get rolebindings,clusterrolebindings -A
```

**Solutions:**

```yaml
# Create an appropriate role binding (RoleBindings are namespaced)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-binding
  namespace: my-namespace
subjects:
  - kind: User
    name: my-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: my-role
  apiGroup: rbac.authorization.k8s.io
```

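The binding references a Role that must exist as well; a minimal sketch granting the pod permissions checked in the diagnosis step (all names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-role
  namespace: my-namespace
rules:
  # Allow reading and creating pods, matching the "kubectl auth can-i" check above
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create"]
```
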
#### Pod Security Violations

**Diagnosis:**

```bash
# Check pod security violations (PodSecurityPolicy was removed in Kubernetes 1.25;
# current clusters enforce Pod Security Standards instead)
kubectl get events --field-selector reason=FailedCreate

# Review the namespace's Pod Security Standard labels
kubectl describe namespace my-namespace
```

**Solutions:**

```yaml
# Update the pod security context
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
```

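If the namespace enforces a Pod Security Standard, the level is controlled by labels on the namespace itself; a sketch using the standard `pod-security.kubernetes.io` labels (the `restricted` level is an example choice):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```
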
### Performance Issues

#### High CPU/Memory Usage

**Diagnosis:**

```bash
# Check resource usage
kubectl top pods -A
kubectl top nodes

# Inspect configured requests, limits, and recent events
kubectl describe pod high-usage-pod
```

**Solutions:**

- Adjust resource requests and limits:

  ```yaml
  spec:
    containers:
      - name: my-app
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
  ```

- Horizontal Pod Autoscaling:

  ```yaml
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: my-app-hpa
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
  ```

#### Slow Application Response

**Diagnosis:**

```bash
# Measure application response times
# (curl-format.txt is a user-supplied curl -w timing template)
kubectl port-forward svc/my-app 8080:80
curl -w "@curl-format.txt" http://localhost:8080/health

# Review ingress controller logs for slow upstreams
kubectl logs -f deployment/ingress-nginx-controller -n ingress-nginx
```

**Solutions:**

- Enable caching for static assets:

  ```yaml
  # Add caching annotations to the ingress
  metadata:
    annotations:
      nginx.ingress.kubernetes.io/server-snippet: |
        location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg)$ {
          expires 1y;
          add_header Cache-Control "public, immutable";
        }
  ```

### Monitoring and Alerting Issues

#### Metrics Not Appearing

**Diagnosis:**

```bash
# Check Prometheus targets
kubectl port-forward svc/prometheus 9090:9090
# Then visit http://localhost:9090/targets

# Verify service monitors
kubectl get servicemonitor -A

# Check pod annotations
kubectl describe pod my-app
```

**Solutions:**

```yaml
# Add Prometheus scrape annotations (annotation-based discovery)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

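If metrics are scraped via the Prometheus Operator instead of annotations, a ServiceMonitor is needed; a minimal sketch, assuming the target Service is labeled `app: my-app` and exposes a port named `metrics` (some installations also require a label such as `release: <helm-release>` on the monitor itself so Prometheus picks it up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app      # must match the Service's labels
  endpoints:
    - port: metrics    # must match a named port on the Service
      path: /metrics
```
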
#### Alerts Not Firing

**Diagnosis:**

```bash
# Check Alertmanager status
kubectl get pods -n monitoring

# Review alert rules
kubectl get prometheusrule -A

# Test alert expressions by pasting them into the Prometheus UI (Graph tab)
```

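If no rule covers the condition at all, add one; a minimal PrometheusRule sketch with a placeholder expression and threshold:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # Placeholder expression: adjust the metric name and threshold to your app
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx rate on my-app"
```
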
## Advanced Troubleshooting

### Debug Mode Activation

```bash
# Enable debug logging for ArgoCD
kubectl patch configmap argocd-cmd-params-cm -n argocd --patch '{"data":{"server.log.level":"debug"}}'

# Enable Kubernetes API audit logging by adding to the kube-apiserver configuration:
# --audit-log-path=/var/log/audit.log
# --audit-policy-file=/etc/kubernetes/audit-policy.yaml
```

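The audit policy file referenced above has to exist on the control-plane node; a minimal sketch that logs request metadata for every request:

```yaml
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who did what to which resource, without request/response bodies
  - level: Metadata
```
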
### Resource Dump Collection

```bash
#!/bin/bash
# Collect comprehensive cluster state
mkdir -p cluster-debug

# Core resources
kubectl get all -A -o yaml > cluster-debug/all-resources.yaml
kubectl get events -A --sort-by='.lastTimestamp' > cluster-debug/events.txt
kubectl describe nodes > cluster-debug/nodes.txt

# KubeZero specific
kubectl get stacks -A -o yaml > cluster-debug/stacks.yaml
kubectl get applications -n argocd -o yaml > cluster-debug/argocd-apps.yaml

# Logs
kubectl logs -n argocd deployment/argocd-application-controller > cluster-debug/argocd-controller.log
kubectl logs -n kubezero-system deployment/kubezero-controller > cluster-debug/kubezero-controller.log

echo "Debug information collected in cluster-debug/"
```

### Performance Analysis

```bash
# CPU profiling for Go applications exposing net/http/pprof
# (no -t flag: a TTY would corrupt the binary profile output)
kubectl exec my-go-app -- curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu.prof

# Memory profiling
kubectl exec my-go-app -- curl -s http://localhost:6060/debug/pprof/heap > mem.prof

# Network analysis
kubectl exec -it debug-pod -- tcpdump -i eth0 -w network.pcap
```

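The `debug-pod` used for packet capture is not created for you; a sketch of a throwaway pod with capture tooling (`nicolaka/netshoot` is one commonly used image, not a KubeZero requirement):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: nicolaka/netshoot   # assumption: any image that ships tcpdump works
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]   # needed for raw packet capture
```
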
## Recovery Procedures

### Cluster Recovery

```bash
# Emergency cluster recovery

# 1. Identify failed components (componentstatuses is deprecated since v1.19;
#    also check the control-plane pods directly)
kubectl get componentstatuses
kubectl get pods -n kube-system

# 2. Restart core components on the affected node
sudo systemctl restart kubelet
sudo systemctl restart containerd   # or docker, depending on the container runtime

# 3. Restore from backup if needed
kubectl apply -f cluster-backup.yaml

# 4. Verify cluster health
kubectl cluster-info
kubectl get nodes
```

### Data Recovery

```bash
# Restore from volume snapshots
kubectl apply -f volume-snapshot-restore.yaml

# Database backup mode: pg_start_backup/pg_stop_backup bracket a file-level copy
# of the data directory (PostgreSQL 15+ renamed these to pg_backup_start/pg_backup_stop)
kubectl exec -it postgres-pod -- psql -c "SELECT pg_start_backup('recovery');"
# ... copy the database files while backup mode is active ...
kubectl exec -it postgres-pod -- psql -c "SELECT pg_stop_backup();"
```

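The `volume-snapshot-restore.yaml` referenced above is typically a PVC whose `dataSource` points at an existing VolumeSnapshot; a sketch, assuming the CSI snapshot controller is installed and a snapshot named `my-snapshot` exists in the namespace:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
spec:
  storageClassName: fast-ssd   # placeholder; must use the snapshot's CSI driver
  dataSource:
    name: my-snapshot          # existing VolumeSnapshot in the same namespace
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
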
## Prevention Strategies

### Health Checks

```yaml
# Liveness and readiness probes
spec:
  containers:
    - name: my-app
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
```

### Resource Quotas

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "1000m"
    requests.memory: 2Gi
    limits.cpu: "2000m"
    limits.memory: 4Gi
    persistentvolumeclaims: "10"
```

### Chaos Engineering

```yaml
# chaoskube deployment for randomized pod termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-monkey
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-monkey
  template:
    metadata:
      labels:
        app: chaos-monkey        # must match spec.selector
    spec:
      serviceAccountName: chaos-monkey   # see the RBAC sketch below
      containers:
        - name: chaos-monkey
          image: quay.io/linki/chaoskube:v0.21.0
          args:
            - --interval=10m
            - --dry-run=false
            - --log-level=INFO
```

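chaoskube also needs RBAC permission to list and delete pods; a minimal sketch, assuming it runs under the dedicated ServiceAccount named in the deployment above (names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-monkey
rules:
  # chaoskube selects victim pods (list) and terminates them (delete)
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-monkey
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-monkey
subjects:
  - kind: ServiceAccount
    name: chaos-monkey   # placeholder ServiceAccount
    namespace: default
```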