Edge stabilization (make edge a safe proving ground)
This runbook is about getting kubernetes/edge to a state where we can safely trial changes before rolling them into main.
Definition of done
- flux check is clean on edge.
- All Flux sources and kustomizations are Ready.
- Core baseline apps are healthy (CNI/DNS/metrics/reloader at minimum).
- You can reconcile any one app and understand failures quickly (events/logs workflow).
Quick context: how edge is wired
Repo paths of interest:
- Flux config and root Kustomizations:
  - kubernetes/edge/flux/config/cluster.yaml (GitRepository + cluster Kustomization)
  - kubernetes/edge/flux/apps.yaml (cluster-apps Kustomization → ./kubernetes/edge/apps)
  - kubernetes/edge/flux/repositories/kustomization.yaml (applies kubernetes/shared/repositories)
- Baseline apps currently in edge (not exhaustive):
  - kubernetes/edge/apps/kube-system/ (cilium, coredns, metrics-server, reloader, etc.)
  - kubernetes/edge/apps/flux-system/addons/ (notifications/webhooks/monitoring)
  - kubernetes/edge/apps/observability/prometheus-operator-crds/
Baseline verification commands
Run these on your workstation against the edge kubecontext.
Flux health
flux check
flux get sources all -A
flux get ks -A
flux get hr -A
If something is failing, pull the details:
flux get ks -A --status-selector ready=false
flux get hr -A --status-selector ready=false
Cluster baseline pods
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running
kubectl -n flux-system get pods
kubectl -n kube-system get pods
Events (last ~50)
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
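On a busy cluster the last 50 events can be mostly Normal entries. As a minimal sketch, the wrapper below narrows the same query to Warning events using the events field selector type=Warning. The function name warning_events is a hypothetical helper, not something that exists in the repo.

```shell
# warning_events [COUNT]
# Show the most recent Warning events across all namespaces.
# Hypothetical helper; COUNT defaults to 50 to match the command above.
warning_events() {
  kubectl get events -A \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp | tail -n "${1:-50}"
}
```

Usage: warning_events for the last 50 warnings, or warning_events 20 for a shorter tail.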
Reconcile workflow (standard approach)
Reconcile the cluster “entry points”
Start at the top and work downward:
flux reconcile kustomization cluster -n flux-system --with-source
flux reconcile kustomization cluster-apps -n flux-system --with-source
Then reconcile repositories if needed:
flux reconcile kustomization repositories -n flux-system --with-source
Reconcile one app
flux reconcile kustomization <app-ks-name> -n flux-system --with-source
Or if it is a HelmRelease-driven app:
flux reconcile helmrelease <hr-name> -n <namespace> --with-source
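The two reconcile forms above can be folded into one helper so the same command works for either kind of app. This is only a sketch: reconcile_app is a hypothetical name, and it assumes app Kustomizations live in flux-system (as in the commands above) while HelmReleases need an explicit namespace.

```shell
# reconcile_app KIND NAME [NAMESPACE]
# KIND is "ks" (Kustomization) or "hr" (HelmRelease). Hypothetical helper.
reconcile_app() {
  case "$1" in
    ks)
      # Kustomizations default to flux-system, matching the commands above.
      flux reconcile kustomization "$2" -n "${3:-flux-system}" --with-source
      ;;
    hr)
      # HelmReleases live in the app namespace, so require it explicitly.
      if [ -z "${3:-}" ]; then
        echo "reconcile_app: hr needs a namespace" >&2
        return 2
      fi
      flux reconcile helmrelease "$2" -n "$3" --with-source
      ;;
    *)
      echo "usage: reconcile_app ks|hr NAME [NAMESPACE]" >&2
      return 2
      ;;
  esac
}
```

Usage: reconcile_app ks <app-ks-name>, or reconcile_app hr <hr-name> <namespace>.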
Known failure classes (and what to do)
1) Source / artifact errors
Symptoms:
- GitRepository not ready
- HelmRepository / OCIRepository not ready
What to do:
- Check the output of flux get sources all -A
- Verify that namespaced sources set their namespace explicitly (e.g., an OCIRepository must have metadata.namespace)
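When a source is not ready, the message on its Ready condition usually states why (authentication, missing tag, unreachable URL). The sketch below reads that message with a kubectl JSONPath query. The helper name ready_message and the flux-system default namespace are assumptions for illustration.

```shell
# ready_message KIND NAME [NAMESPACE]
# Print the message of the Ready condition for one Flux object.
# Hypothetical helper; NAMESPACE defaults to flux-system.
ready_message() {
  kubectl -n "${3:-flux-system}" get "$1" "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}
```

Usage: ready_message gitrepository <name>, or ready_message ocirepository <name> <namespace>.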
2) Kustomization build errors
Symptoms:
- “accumulating resources” errors
- missing files / bad patches
What to do:
- flux logs --kind Kustomization --name <name> -n flux-system
- Inspect the directory referenced by path: (spec.path) in the Kustomization
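Many build errors can be reproduced locally before pushing a fix. The sketch below renders a directory with kubectl kustomize and reports whether it builds. build_check is a hypothetical helper, and the result is only an approximation: the Flux kustomize-controller can apply extra steps (for example post-build substitutions) that a plain local build does not.

```shell
# build_check DIR
# Render a kustomize directory locally and report whether it builds.
# Hypothetical helper; the rendered output is discarded, errors stay visible.
build_check() {
  if kubectl kustomize "$1" >/dev/null; then
    echo "OK: $1 builds"
  else
    echo "FAIL: $1 does not build (see error above)" >&2
    return 1
  fi
}
```

Usage, from the repo root: build_check ./kubernetes/edge/apps/kube-system (assuming that directory contains a kustomization.yaml).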
3) Helm upgrade failures / rollbacks
Symptoms:
- HelmRelease stuck, repeated rollbacks
- “field is immutable” errors
What to do:
- kubectl -n <ns> describe helmrelease <name>
- kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -n 50
Immutable selector remediation (Deployment/StatefulSet/DaemonSet)
If you see spec.selector ... field is immutable, the selector cannot be patched in place: the workload controller (the Deployment, StatefulSet, or DaemonSet) must be deleted and recreated.
flux suspend helmrelease <name> -n <namespace>
kubectl -n <namespace> get deploy,sts,ds,cronjob -l helm.toolkit.fluxcd.io/name=<name>
kubectl -n <namespace> delete <deployment|statefulset|daemonset> <workload-name> --wait=true
flux resume helmrelease <name> -n <namespace>
flux reconcile helmrelease <name> -n <namespace> --with-source
Notes:
- Deleting the controller does not delete PVCs unless you delete PVCs separately.
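The remediation steps above can be wrapped so they always run in the same order and stop at the first failure. This is a sketch under stated assumptions: recreate_workload is a hypothetical helper, and it expects you to have already identified the workload name from the label query.

```shell
# recreate_workload HR NAMESPACE KIND WORKLOAD
# Suspend the HelmRelease, delete one workload, then resume and reconcile.
# Hypothetical helper; KIND must be deployment, statefulset, or daemonset.
recreate_workload() {
  if [ "$#" -ne 4 ]; then
    echo "usage: recreate_workload HR NAMESPACE KIND WORKLOAD" >&2
    return 2
  fi
  case "$3" in
    deployment|statefulset|daemonset) ;;
    *) echo "recreate_workload: unsupported kind: $3" >&2; return 2 ;;
  esac
  # Each step runs only if the previous one succeeded.
  flux suspend helmrelease "$1" -n "$2" &&
    kubectl -n "$2" delete "$3" "$4" --wait=true &&
    flux resume helmrelease "$1" -n "$2" &&
    flux reconcile helmrelease "$1" -n "$2" --with-source
}
```

If the delete step fails, the chain stops and the HelmRelease stays suspended, so resume it by hand once the problem is understood.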
“Edge is fixed” checklist (copy/paste)
- [ ] flux check passes
- [ ] flux get ks -A shows no Ready=False
- [ ] flux get hr -A shows no Ready=False
- [ ] kubectl -n flux-system get pods shows all pods Running/Ready
- [ ] kubectl -n kube-system get pods shows all pods Running/Ready (cilium + coredns healthy)
- [ ] You can reconcile one app end-to-end without guessing (events/logs workflow works)
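The mechanical items in the checklist can be scripted. The sketch below is one possible approach, not the team's established tooling: run_check and edge_checklist are hypothetical names, it uses kubectl wait on the Ready condition instead of reading flux get output, and the 30s timeout is an arbitrary choice.

```shell
# run_check LABEL CMD...
# Run one command quietly and print PASS or FAIL for it.
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# edge_checklist
# Run every scripted check; return 1 if any of them failed.
edge_checklist() {
  failed=0
  run_check "flux check" flux check || failed=1
  run_check "kustomizations Ready" \
    kubectl wait kustomizations.kustomize.toolkit.fluxcd.io --all -A \
    --for=condition=Ready --timeout=30s || failed=1
  run_check "helmreleases Ready" \
    kubectl wait helmreleases.helm.toolkit.fluxcd.io --all -A \
    --for=condition=Ready --timeout=30s || failed=1
  run_check "flux-system pods Ready" \
    kubectl -n flux-system wait pod --all \
    --for=condition=Ready --timeout=30s || failed=1
  run_check "kube-system pods Ready" \
    kubectl -n kube-system wait pod --all \
    --for=condition=Ready --timeout=30s || failed=1
  return "$failed"
}
```

Two caveats: pods that have run to completion (for example from Jobs) never report Ready, so a pod check can fail on an otherwise healthy namespace, and kubectl wait reports an error when no objects of the requested kind exist. The last checklist item still needs a human.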