# Troubleshooting Guide Reference

This document provides troubleshooting procedures for common issues in KubeZero deployments.

## Common Issues and Solutions

### Stack Deployment Issues

#### Stack Fails to Deploy

**Symptoms:**

- Stack status shows `Failed` or `Degraded`
- ArgoCD application is out of sync
- Pods are in `CrashLoopBackOff` state

**Diagnosis:**

```bash
# Check stack status
kubectl get stack my-stack -o yaml

# Check ArgoCD application
kubectl get application my-stack -n argocd

# Review pod logs
kubectl logs -f deployment/my-app -n my-namespace
```

**Solutions:**

- Resource constraints:

  ```bash
  # Check node resources
  kubectl top nodes

  # Check pod resource requests/limits
  kubectl describe pod failing-pod
  ```

- Configuration errors:

  ```bash
  # Validate the manifest with a server-side dry run
  kubectl apply --dry-run=server -f stack-config.yaml

  # Check configmap/secret references
  kubectl get configmap,secret -n my-namespace
  ```

#### Module Dependencies Not Met

**Symptoms:**

- Modules fail to start in the correct order
- Dependency errors in the logs

**Solutions:**

```yaml
# Add explicit dependencies in the stack configuration
spec:
  modules:
    - name: cert-manager
      dependencies: []
    - name: ingress-nginx
      dependencies: ["cert-manager"]
```

### Networking Issues

#### Pod-to-Pod Communication Failures

**Diagnosis:**

```bash
# Test connectivity from a pod to a service
# (note: .svc.cluster.local names resolve services, not individual pods)
kubectl exec -it pod1 -- ping my-service.my-namespace.svc.cluster.local

# Check network policies
kubectl get networkpolicy -A

# Verify service endpoints
kubectl get endpoints my-service
```

**Solutions:**

- Network policy issues:

  ```yaml
  # Allow ingress from a specific namespace
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-cross-namespace
  spec:
    podSelector: {}
    policyTypes:
      - Ingress
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                # kubernetes.io/metadata.name is set automatically on every namespace
                kubernetes.io/metadata.name: allowed-namespace
  ```

- DNS resolution:

  ```bash
  # Test DNS resolution
  kubectl exec -it pod -- nslookup kubernetes.default

  # Check CoreDNS configuration
  kubectl get configmap coredns -n kube-system -o yaml
  ```

#### Ingress Not Working

**Diagnosis:**

```bash
# Check ingress resource
kubectl describe ingress my-ingress

# Verify ingress controller
kubectl get pods -n ingress-nginx

# Check certificate status
kubectl describe certificate my-cert
```

**Solutions:**

- Missing TLS certificates:

  ```yaml
  # Ensure the certificate is referenced in the ingress spec
  spec:
    tls:
      - hosts:
          - example.com
        secretName: example-tls
  ```

- Ingress class issues:

  ```yaml
  # Set the ingress class via spec (the kubernetes.io/ingress.class
  # annotation is deprecated since Kubernetes 1.18)
  spec:
    ingressClassName: nginx
  ```

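Putting both fixes together, a complete Ingress might look like the sketch below; the host, the backend service, and the `letsencrypt` cert-manager issuer are placeholders for your own values:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    # Assumes cert-manager is installed with a ClusterIssuer named "letsencrypt"
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - example.com
      secretName: example-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # placeholder backend service
                port:
                  number: 80
```
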
### Storage Issues

#### Persistent Volume Claims Stuck in Pending

**Diagnosis:**

```bash
# Check PVC status and events
kubectl describe pvc my-pvc

# Check available storage classes
kubectl get storageclass

# Verify node capacity
kubectl describe nodes
```

**Solutions:**

- Storage class not available:

  ```yaml
  # Specify a storage class that exists in the cluster
  spec:
    storageClassName: fast-ssd
  ```

- Insufficient node storage:

  ```bash
  # Clean up unused volumes
  kubectl delete pvc unused-pvc

  # Check node disk usage (the node's root filesystem is mounted
  # at /host inside the debug pod)
  kubectl debug node/my-node -it --image=busybox -- df -h /host
  ```

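For reference, a complete claim combining the pieces above; `fast-ssd` is a placeholder and must match a class listed by `kubectl get storageclass`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # must name an existing StorageClass
  resources:
    requests:
      storage: 10Gi
```
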
### Security Issues

#### RBAC Permission Denied

**Diagnosis:**

```bash
# Check current user permissions
kubectl auth can-i create pods

# Review service account permissions
kubectl describe serviceaccount my-sa

# Check role bindings
kubectl get rolebindings,clusterrolebindings -A
```

**Solutions:**

```yaml
# Create an appropriate role binding (RoleBindings are namespaced)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-binding
  namespace: my-namespace
subjects:
  - kind: User
    name: my-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: my-role
  apiGroup: rbac.authorization.k8s.io
```

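The binding references a Role that must exist as well; a minimal sketch granting the pod permissions checked in the diagnosis step (all names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-role
  namespace: my-namespace
rules:
  # Allow reading and creating pods, matching the "kubectl auth can-i" check above
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create"]
```
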
#### Pod Security Violations

**Diagnosis:**

```bash
# Check pod security violations (PodSecurityPolicy was removed in Kubernetes 1.25;
# current clusters enforce Pod Security Standards instead)
kubectl get events --field-selector reason=FailedCreate

# Review the namespace's Pod Security Standard labels
kubectl describe namespace my-namespace
```

**Solutions:**

```yaml
# Update the pod security context
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
```

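If the namespace enforces a Pod Security Standard, the level is controlled by labels on the namespace itself; a sketch using the standard `pod-security.kubernetes.io` labels (the `restricted` level is an example choice):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```
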
### Performance Issues

#### High CPU/Memory Usage

**Diagnosis:**

```bash
# Check resource usage
kubectl top pods -A
kubectl top nodes

# Inspect configured requests, limits, and recent events
kubectl describe pod high-usage-pod
```

**Solutions:**

- Adjust resource requests and limits:

  ```yaml
  spec:
    containers:
      - name: my-app
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
  ```

- Horizontal Pod Autoscaling:

  ```yaml
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: my-app-hpa
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
  ```

#### Slow Application Response

**Diagnosis:**

```bash
# Measure application response times
# (curl-format.txt is a user-supplied curl -w timing template)
kubectl port-forward svc/my-app 8080:80
curl -w "@curl-format.txt" http://localhost:8080/health

# Review ingress controller logs for slow upstreams
kubectl logs -f deployment/ingress-nginx-controller -n ingress-nginx
```

**Solutions:**

- Enable caching for static assets:

  ```yaml
  # Add caching annotations to the ingress
  metadata:
    annotations:
      nginx.ingress.kubernetes.io/server-snippet: |
        location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg)$ {
          expires 1y;
          add_header Cache-Control "public, immutable";
        }
  ```

### Monitoring and Alerting Issues

#### Metrics Not Appearing

**Diagnosis:**

```bash
# Check Prometheus targets
kubectl port-forward svc/prometheus 9090:9090
# Then visit http://localhost:9090/targets

# Verify service monitors
kubectl get servicemonitor -A

# Check pod annotations
kubectl describe pod my-app
```

**Solutions:**

```yaml
# Add Prometheus scrape annotations (annotation-based discovery)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

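If metrics are scraped via the Prometheus Operator instead of annotations, a ServiceMonitor is needed; a minimal sketch, assuming the target Service is labeled `app: my-app` and exposes a port named `metrics` (some installations also require a label such as `release: <helm-release>` on the monitor itself so Prometheus picks it up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app      # must match the Service's labels
  endpoints:
    - port: metrics    # must match a named port on the Service
      path: /metrics
```
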
#### Alerts Not Firing

**Diagnosis:**

```bash
# Check Alertmanager status
kubectl get pods -n monitoring

# Review alert rules
kubectl get prometheusrule -A

# Test alert expressions by pasting them into the Prometheus UI (Graph tab)
```

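If no rule covers the condition at all, add one; a minimal PrometheusRule sketch with a placeholder expression and threshold:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # Placeholder expression: adjust the metric name and threshold to your app
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx rate on my-app"
```
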
## Advanced Troubleshooting

### Debug Mode Activation

```bash
# Enable debug logging for ArgoCD
kubectl patch configmap argocd-cmd-params-cm -n argocd --patch '{"data":{"server.log.level":"debug"}}'

# Enable Kubernetes API audit logging by adding to the kube-apiserver configuration:
# --audit-log-path=/var/log/audit.log
# --audit-policy-file=/etc/kubernetes/audit-policy.yaml
```

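The audit policy file referenced above has to exist on the control-plane node; a minimal sketch that logs request metadata for every request:

```yaml
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who did what to which resource, without request/response bodies
  - level: Metadata
```
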
### Resource Dump Collection

```bash
#!/bin/bash
# Collect comprehensive cluster state
mkdir -p cluster-debug

# Core resources
kubectl get all -A -o yaml > cluster-debug/all-resources.yaml
kubectl get events -A --sort-by='.lastTimestamp' > cluster-debug/events.txt
kubectl describe nodes > cluster-debug/nodes.txt

# KubeZero specific
kubectl get stacks -A -o yaml > cluster-debug/stacks.yaml
kubectl get applications -n argocd -o yaml > cluster-debug/argocd-apps.yaml

# Logs
kubectl logs -n argocd deployment/argocd-application-controller > cluster-debug/argocd-controller.log
kubectl logs -n kubezero-system deployment/kubezero-controller > cluster-debug/kubezero-controller.log

echo "Debug information collected in cluster-debug/"
```

### Performance Analysis

```bash
# CPU profiling for Go applications exposing net/http/pprof
# (no -t flag: a TTY would corrupt the binary profile output)
kubectl exec my-go-app -- curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" > cpu.prof

# Memory profiling
kubectl exec my-go-app -- curl -s http://localhost:6060/debug/pprof/heap > mem.prof

# Network analysis
kubectl exec -it debug-pod -- tcpdump -i eth0 -w network.pcap
```

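The `debug-pod` used for packet capture is not created for you; a sketch of a throwaway pod with capture tooling (`nicolaka/netshoot` is one commonly used image, not a KubeZero requirement):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: nicolaka/netshoot   # assumption: any image that ships tcpdump works
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["NET_ADMIN", "NET_RAW"]   # needed for raw packet capture
```
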
## Recovery Procedures

### Cluster Recovery

```bash
# Emergency cluster recovery

# 1. Identify failed components (componentstatuses is deprecated since v1.19;
#    also check the control-plane pods directly)
kubectl get componentstatuses
kubectl get pods -n kube-system

# 2. Restart core components on the affected node
sudo systemctl restart kubelet
sudo systemctl restart containerd   # or docker, depending on the container runtime

# 3. Restore from backup if needed
kubectl apply -f cluster-backup.yaml

# 4. Verify cluster health
kubectl cluster-info
kubectl get nodes
```

### Data Recovery

```bash
# Restore from volume snapshots
kubectl apply -f volume-snapshot-restore.yaml

# Database backup mode: pg_start_backup/pg_stop_backup bracket a file-level copy
# of the data directory (PostgreSQL 15+ renamed these to pg_backup_start/pg_backup_stop)
kubectl exec -it postgres-pod -- psql -c "SELECT pg_start_backup('recovery');"
# ... copy the database files while backup mode is active ...
kubectl exec -it postgres-pod -- psql -c "SELECT pg_stop_backup();"
```

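The `volume-snapshot-restore.yaml` referenced above is typically a PVC whose `dataSource` points at an existing VolumeSnapshot; a sketch, assuming the CSI snapshot controller is installed and a snapshot named `my-snapshot` exists in the namespace:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
spec:
  storageClassName: fast-ssd   # placeholder; must use the snapshot's CSI driver
  dataSource:
    name: my-snapshot          # existing VolumeSnapshot in the same namespace
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
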
## Prevention Strategies

### Health Checks

```yaml
# Liveness and readiness probes
spec:
  containers:
    - name: my-app
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
```

### Resource Quotas

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "1000m"
    requests.memory: 2Gi
    limits.cpu: "2000m"
    limits.memory: 4Gi
    persistentvolumeclaims: "10"
```

### Chaos Engineering

```yaml
# chaoskube deployment for randomized pod termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-monkey
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-monkey
  template:
    metadata:
      labels:
        app: chaos-monkey        # must match spec.selector
    spec:
      serviceAccountName: chaos-monkey   # see the RBAC sketch below
      containers:
        - name: chaos-monkey
          image: quay.io/linki/chaoskube:v0.21.0
          args:
            - --interval=10m
            - --dry-run=false
            - --log-level=INFO
```

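chaoskube also needs RBAC permission to list and delete pods; a minimal sketch, assuming it runs under the dedicated ServiceAccount named in the deployment above (names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-monkey
rules:
  # chaoskube selects victim pods (list) and terminates them (delete)
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-monkey
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-monkey
subjects:
  - kind: ServiceAccount
    name: chaos-monkey   # placeholder ServiceAccount
    namespace: default
```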