Troubleshooting Guide Reference

This document provides comprehensive troubleshooting procedures for common issues in KubeZero deployments.

Common Issues and Solutions

Stack Deployment Issues

Stack Fails to Deploy

Symptoms:

  • Stack status shows Failed or Degraded
  • ArgoCD application is out of sync
  • Pods are in CrashLoopBackOff state

Diagnosis:

# Check stack status
kubectl get stack my-stack -o yaml

# Check ArgoCD application
kubectl get application my-stack -n argocd

# Review pod logs
kubectl logs -f deployment/my-app -n my-namespace

Solutions:

  1. Resource constraints:
# Check node resources
kubectl top nodes

# Check pod resource requests/limits
kubectl describe pod failing-pod
  2. Configuration errors:
# Validate YAML syntax
kubectl apply --dry-run=server -f stack-config.yaml

# Check configmap/secret references
kubectl get configmap,secret -n my-namespace
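
If the ArgoCD application remains out of sync after the configuration is corrected, a manual sync can resolve the drift. A minimal sketch, assuming the argocd CLI is installed and logged in and the application is named my-stack:

# Inspect sync/health details and trigger a manual sync with pruning of removed resources
argocd app get my-stack
argocd app sync my-stack --prune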

Module Dependencies Not Met

Symptoms:

  • Modules fail to start in correct order
  • Dependency errors in logs

Solutions:

# Add explicit dependencies in stack configuration
spec:
  modules:
    - name: cert-manager
      dependencies: []
    - name: ingress-nginx
      dependencies: ["cert-manager"]

Networking Issues

Pod-to-Pod Communication Failures

Diagnosis:

# Test connectivity between pods
kubectl exec -it pod1 -- ping pod2.namespace.svc.cluster.local

# Check network policies
kubectl get networkpolicy -A

# Verify service endpoints
kubectl get endpoints my-service

Solutions:

  1. Network Policy Issues:
# Allow communication between namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cross-namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: allowed-namespace
  2. DNS Resolution:
# Test DNS resolution
kubectl exec -it pod -- nslookup kubernetes.default

# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml
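
If the application image lacks debugging tools, a disposable utility pod can be used for the DNS checks instead. A minimal sketch, assuming the cluster uses the default cluster.local domain (the busybox image is just an example):

# Run a throwaway pod with DNS tools; it is deleted when the command exits
kubectl run dns-debug --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local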

Ingress Not Working

Diagnosis:

# Check ingress resource
kubectl describe ingress my-ingress

# Verify ingress controller
kubectl get pods -n ingress-nginx

# Check certificate status
kubectl describe certificate my-cert

Solutions:

  1. Missing TLS certificates:
# Ensure certificate is properly configured
spec:
  tls:
    - hosts:
        - example.com
      secretName: example-tls
  2. Ingress class issues:
# Specify correct ingress class
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
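
On current Kubernetes versions the kubernetes.io/ingress.class annotation is deprecated in favor of the ingressClassName field on the Ingress spec; the equivalent configuration (class name nginx assumed) would be:

# Preferred form on recent Kubernetes versions
spec:
  ingressClassName: nginx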

Storage Issues

Persistent Volume Claims Stuck in Pending

Diagnosis:

# Check PVC status
kubectl describe pvc my-pvc

# Check available storage classes
kubectl get storageclass

# Verify node capacity
kubectl describe nodes

Solutions:

  1. Storage class not available:
# Specify correct storage class
spec:
  storageClassName: fast-ssd
  2. Insufficient node storage:
# Clean up unused volumes
kubectl delete pvc unused-pvc

# Check node disk usage
kubectl exec -it node-shell -- df -h
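
If PVCs omit storageClassName entirely, marking an existing class as the cluster default allows them to bind. A sketch, assuming the class is named fast-ssd:

# Mark a storage class as the default for PVCs that do not specify one
kubectl patch storageclass fast-ssd -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'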

Security Issues

RBAC Permission Denied

Diagnosis:

# Check current user permissions
kubectl auth can-i create pods

# Review service account permissions
kubectl describe serviceaccount my-sa

# Check role bindings
kubectl get rolebindings,clusterrolebindings -A

Solutions:

# Create appropriate role binding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-binding
subjects:
  - kind: User
    name: my-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: my-role
  apiGroup: rbac.authorization.k8s.io
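
After applying the binding, impersonation can confirm that the subject now has the expected access (the user name comes from the binding above; the namespace is illustrative):

# Verify the new permissions by impersonating the bound user
kubectl auth can-i create pods --as=my-user -n my-namespace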

Pod Security Policy Violations

Diagnosis:

# Check pod security violations
kubectl get events --field-selector reason=FailedCreate

# Review pod security standards
kubectl describe namespace my-namespace

Solutions:

# Update pod security context
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
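
If the namespace enforces Pod Security Standards, the enforcement level itself may need adjusting rather than the pod. A sketch using the standard pod-security labels (the baseline level is an assumption; pick the level your workloads actually require):

# Inspect and adjust the Pod Security Standards enforcement level on the namespace
kubectl get namespace my-namespace --show-labels
kubectl label namespace my-namespace pod-security.kubernetes.io/enforce=baseline --overwrite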

Performance Issues

High CPU/Memory Usage

Diagnosis:

# Check resource usage
kubectl top pods -A
kubectl top nodes

# Monitor resource limits
kubectl describe pod high-usage-pod

Solutions:

  1. Adjust resource limits:
spec:
  containers:
    - name: my-app
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
  2. Horizontal Pod Autoscaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
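
The HPA only works if the metrics API is served (typically by metrics-server); whether it is registered and whether the autoscaler reacts can be checked with:

# Confirm the metrics API is registered and watch the HPA scale
kubectl get apiservices | grep metrics.k8s.io
kubectl get hpa my-app-hpa --watch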

Slow Application Response

Diagnosis:

# Check application metrics (run the port-forward in a separate terminal or background it)
kubectl port-forward svc/my-app 8080:80 &
curl -w "@curl-format.txt" http://localhost:8080/health

# Review ingress performance
kubectl logs -f deployment/ingress-nginx-controller -n ingress-nginx

Solutions:

  1. Enable caching:
# Add caching annotations to ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
      }

Monitoring and Alerting Issues

Metrics Not Appearing

Diagnosis:

# Check Prometheus targets
kubectl port-forward svc/prometheus 9090:9090
# Visit http://localhost:9090/targets

# Verify service monitors
kubectl get servicemonitor -A

# Check pod annotations
kubectl describe pod my-app

Solutions:

# Add Prometheus annotations
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
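
The prometheus.io/* annotations only take effect if the Prometheus scrape configuration honors them; with the Prometheus Operator (implied by the servicemonitor check above), targets are discovered through ServiceMonitor resources instead. A minimal sketch, assuming the application's Service carries the label app: my-app, exposes a port named http, and Prometheus runs in the monitoring namespace:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - my-namespace
  endpoints:
    - port: http
      path: /metrics
      interval: 30s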

Alerts Not Firing

Diagnosis:

# Check AlertManager status
kubectl get pods -n monitoring

# Review alert rules
kubectl get prometheusrule -A

# Test alert expressions
# Use Prometheus UI to test queries
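
Solutions:

If rules never appear in Prometheus at all, the PrometheusRule resources may be missing the labels the Prometheus instance selects on. A minimal sketch (the release: prometheus label and the monitoring namespace are assumptions; match them to your install's ruleSelector):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "my-app has no healthy scrape targets"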

Advanced Troubleshooting

Debug Mode Activation

# Enable debug logging for ArgoCD
kubectl patch configmap argocd-cmd-params-cm -n argocd --patch '{"data":{"server.log.level":"debug"}}'

# Enable Kubernetes API audit logging
# Add to kube-apiserver configuration:
# --audit-log-path=/var/log/audit.log
# --audit-policy-file=/etc/kubernetes/audit-policy.yaml
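
A minimal audit policy to pair with those flags might look like the following (Metadata-level logging for everything is just an illustrative starting point):

# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    omitStages:
      - RequestReceived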

Resource Dump Collection

#!/bin/bash
# Collect comprehensive cluster state
mkdir -p cluster-debug

# Core resources
kubectl get all -A -o yaml > cluster-debug/all-resources.yaml
kubectl get events -A --sort-by='.lastTimestamp' > cluster-debug/events.txt
kubectl describe nodes > cluster-debug/nodes.txt

# KubeZero specific
kubectl get stacks -A -o yaml > cluster-debug/stacks.yaml
kubectl get applications -n argocd -o yaml > cluster-debug/argocd-apps.yaml

# Logs
kubectl logs -n argocd deployment/argocd-application-controller > cluster-debug/argocd-controller.log
kubectl logs -n kubezero-system deployment/kubezero-controller > cluster-debug/kubezero-controller.log

echo "Debug information collected in cluster-debug/"

Performance Analysis

# CPU profiling for Go applications
kubectl exec -it my-go-app -- curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Memory profiling
kubectl exec -it my-go-app -- curl http://localhost:6060/debug/pprof/heap > mem.prof

# Network analysis
kubectl exec -it debug-pod -- tcpdump -i eth0 -w network.pcap
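
The collected artifacts are then analyzed off-cluster; a short sketch assuming a local Go toolchain and the pod and file names used above:

# Analyze the profiles locally and copy the packet capture off the pod
go tool pprof cpu.prof
kubectl cp debug-pod:network.pcap ./network.pcap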

Recovery Procedures

Cluster Recovery

# Emergency cluster recovery
# 1. Identify failed components
kubectl get componentstatuses   # deprecated in newer Kubernetes; also check control-plane pods in kube-system

# 2. Restart core components on the affected node
sudo systemctl restart kubelet
sudo systemctl restart containerd   # or docker, depending on the container runtime

# 3. Restore from backup if needed
kubectl apply -f cluster-backup.yaml

# 4. Verify cluster health
kubectl cluster-info
kubectl get nodes

Data Recovery

# Restore from volume snapshots
kubectl apply -f volume-snapshot-restore.yaml

# Database recovery (PostgreSQL example; on PostgreSQL 15+ the functions are pg_backup_start/pg_backup_stop)
kubectl exec -it postgres-pod -- psql -U postgres -c "SELECT pg_start_backup('recovery');"
# Restore database files
kubectl exec -it postgres-pod -- psql -U postgres -c "SELECT pg_stop_backup();"

Prevention Strategies

Health Checks

# Comprehensive health checks
spec:
  containers:
    - name: my-app
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
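
For applications that take a long time to start, a startupProbe keeps the liveness probe from restarting the pod prematurely; the thresholds below are illustrative and would sit alongside the probes above:

      startupProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 30
        periodSeconds: 10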

Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "1000m"
    requests.memory: 2Gi
    limits.cpu: "2000m"
    limits.memory: 4Gi
    persistentvolumeclaims: "10"
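
Resource quotas only act on pods that declare requests and limits; a LimitRange can inject defaults for containers that omit them (the values below are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi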

Chaos Engineering

# Chaos Monkey deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-monkey
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-monkey
  template:
    metadata:
      labels:
        app: chaos-monkey
    spec:
      containers:
        - name: chaos-monkey
          image: quay.io/linki/chaoskube:v0.21.0
          args:
            - --interval=10m
            - --dry-run=false
            - --log-level=INFO
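
chaoskube needs RBAC permission to list and delete pods; without it the deletions are denied by the API server. A minimal sketch (the default namespace is an assumption; add serviceAccountName: chaos-monkey to the pod spec above to use it):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-monkey
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-monkey
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-monkey
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-monkey
subjects:
  - kind: ServiceAccount
    name: chaos-monkey
    namespace: default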