Monitoring Configuration Reference

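This reference collects configuration examples for monitoring and observability in KubeZero environments. It covers metrics (Prometheus, Grafana), alerting (Alertmanager), logging (Fluent Bit, Loki), and tracing (Jaeger, OpenTelemetry), followed by performance tuning and security settings.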

Core Monitoring Stack

Prometheus Configuration

Basic Setup

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kubezero-prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: "30d"
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast-ssd

Service Discovery

spec:
  serviceMonitorSelector:
    matchLabels:
      team: platform
  podMonitorSelector:
    matchLabels:
      monitoring: enabled
  ruleSelector:
    matchLabels:
      prometheus: kubezero

Grafana Configuration

Dashboard Provisioning

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  kubernetes-overview.json: |
    {
      "dashboard": {
        "id": null,
        "title": "Kubernetes Overview",
        "tags": ["kubernetes"],
        "panels": []
      }
    }

Data Source Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus:9090
        access: proxy
        isDefault: true

Alerting Configuration

Alertmanager Setup

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: kubezero-alertmanager
spec:
  replicas: 3
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

Alert Rules

Cluster Health Alerts

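apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-health
  labels:
    prometheus: kubezero  # matched by the ruleSelector shown under Service Discovery
spec:
  groups:
    - name: cluster.rules
      rules:
        - alert: NodeDown
          # Fires when a node-exporter target has failed its scrape for 5 minutes
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"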

Application Alerts

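apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  labels:
    prometheus: kubezero  # matched by the ruleSelector shown under Service Discovery
spec:
  groups:
    - name: apps.rules
      rules:
        - alert: HighErrorRate
          # Per-series rate of 5xx responses above 0.1 requests per second
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High error rate detected"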
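
The HighErrorRate expression compares an absolute request rate against 0.1, so it measures 5xx responses per second rather than a share of traffic. If a ratio is what you are after, a variant along the following lines may be closer. It is only a sketch: it assumes every http_requests_total series carries a status label, and the rule name, group name, alert name, and 10% threshold are illustrative choices, not KubeZero defaults.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-error-ratio      # hypothetical name
  labels:
    prometheus: kubezero             # matched by the ruleSelector shown under Service Discovery
spec:
  groups:
    - name: apps.ratio.rules         # hypothetical group name
      rules:
        - alert: HighErrorRatio      # hypothetical alert name
          # Share of all requests that returned a 5xx status over the last 5 minutes
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m])) > 0.1
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "More than 10% of requests are failing with 5xx"
```

Aggregating with sum() on both sides yields a single cluster-wide ratio. Adding a by (...) clause with the same labels on both sides would give one ratio per service instead.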

Notification Channels

Slack Integration

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: 'KubeZero Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Email Notifications

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'app-password'
        # email_configs has no subject field; the subject is set as a header
        headers:
          Subject: 'KubeZero Alert: {{ .GroupLabels.alertname }}'
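
A receiver only gets notifications once a route points at it. As a rough sketch, assuming the slack-notifications and email-notifications receivers are both defined in the same Alertmanager configuration, a route tree could send alerts labelled severity: critical to email and leave everything else on Slack. The matchers syntax assumes Alertmanager 0.22 or newer, and the split by severity is an illustrative policy, not a KubeZero default.

```yaml
route:
  receiver: 'slack-notifications'        # default for alerts not matched by a child route
  group_by: ['alertname']
  routes:
    - matchers:
        - severity="critical"            # label set by rules such as NodeDown
      receiver: 'email-notifications'
```

Child routes are evaluated in order and the first match wins unless continue: true is set on the route, so more specific routes belong above broader ones.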

Logging Configuration

Fluent Bit Setup

apiVersion: logging.coreos.com/v1
kind: FluentBit
metadata:
  name: kubezero-fluent-bit
spec:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      parser: cri
      tag: kubernetes.*
  filters:
    - name: kubernetes
      match: kubernetes.*
      kube_url: https://kubernetes.default.svc:443
  outputs:
    - name: elasticsearch
      match: '*'
      host: elasticsearch.logging.svc.cluster.local
      port: 9200

Loki Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
data:
  loki.yaml: |
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: filesystem
          schema: v11

Tracing Configuration

Jaeger Setup

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: kubezero-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy

OpenTelemetry Collector

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: kubezero-otel-collector
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]

Custom Metrics

ServiceMonitor Examples

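apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  labels:
    team: platform  # matched by the serviceMonitorSelector shown under Service Discovery
spec:
  selector:
    matchLabels:
      app: my-application  # label on the Service to scrape
  endpoints:
    - port: metrics        # name of the Service port
      interval: 30s
      path: /metrics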
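
A ServiceMonitor selects Services rather than Pods, and its port field refers to a Service port by name, not by number. The following is a minimal sketch of a Service that the app-metrics monitor would pick up. The Service name and port number 8080 are assumptions for illustration, and the sketch assumes the application Pods carry the app: my-application label.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-application          # hypothetical name
  labels:
    app: my-application         # matched by spec.selector.matchLabels of the ServiceMonitor
spec:
  selector:
    app: my-application         # assumes the Pods carry this label
  ports:
    - name: metrics             # must equal endpoints[].port in the ServiceMonitor
      port: 8080                # illustrative port number
      targetPort: 8080
```

Without a namespaceSelector, a ServiceMonitor only matches Services in its own namespace, so the Service and the ServiceMonitor would need to live side by side.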

PodMonitor Examples

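apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pod-metrics
  labels:
    monitoring: enabled  # matched by the podMonitorSelector shown under Service Discovery
spec:
  selector:
    matchLabels:
      monitoring: enabled  # label on the Pods to scrape
  podMetricsEndpoints:
    - port: metrics        # name of the container port
      interval: 15s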

Performance Tuning

Prometheus Optimization

spec:
  prometheus:
    prometheusSpec:
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "8Gi"
          cpu: "4"
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: fast-ssd
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 200Gi

Grafana Optimization

spec:
  grafana:
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1"
    persistence:
      enabled: true
      size: 10Gi
      storageClassName: fast-ssd

Security Configuration

RBAC for Monitoring

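apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  # Read-only access to the core objects Prometheus discovers targets from
  - apiGroups: [""]
    resources: ["nodes", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  # Ingress is served from networking.k8s.io; the extensions group no longer serves it since Kubernetes 1.22
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]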
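
A ClusterRole grants nothing until it is bound to a subject. As a sketch of the missing piece, the manifests below bind the prometheus ClusterRole to a ServiceAccount. The ServiceAccount name prometheus and its placement in the monitoring namespace are assumptions for illustration.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus              # hypothetical name
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus              # the ClusterRole defined above
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

For the binding to take effect, the Prometheus resource would also need to run under this account, for example by setting spec.serviceAccountName: prometheus.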

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: monitoring-network-policy
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: monitoring
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
  egress:
    - to:
        - namespaceSelector: {}