Health signals with wait: false (reference repo patterns)
This documents the patterns used in the reference repo (example-ops/onedr0p-home-ops) to get strong health signals even when many Flux Kustomizations are set to wait: false.
The goal is not to copy blindly, but to capture the “signal sources” we may want to adopt in haynes-ops later.
The core idea
wait: false prevents a Flux Kustomization from blocking on health checks for everything it applies. That reduces “deadlocks” (one broken app shouldn’t stall the whole cluster-apps tree), but it means you need other mechanisms to surface failures quickly and reliably.
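As a minimal sketch of what this looks like on a single Flux Kustomization (the name, path, and interval below are hypothetical and not taken from the reference repo):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-app            # hypothetical name
  namespace: flux-system
spec:
  interval: 1h                 # assumed interval
  path: ./kubernetes/apps/default/example-app/app   # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumed source name
  # Report Ready once the apply succeeds; do not block on workload health.
  wait: false
```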
The reference repo gets those signals from:
- Flux Notifications (events → Alertmanager / GitHub status)
- Prometheus scraping + alert rules for Flux itself
- HelmRelease remediation defaults (retries/rollback behavior is consistent)
- Targeted healthChecks/healthCheckExprs for a small set of “critical infra” Kustomizations
Pattern 1: Flux Notifications (errors → Alertmanager)
The reference repo uses Flux notifications to send error events for most Flux objects to Alertmanager.
Files:
- example-ops/onedr0p-home-ops/kubernetes/components/alerts/alertmanager/provider.yaml
- example-ops/onedr0p-home-ops/kubernetes/components/alerts/alertmanager/alert.yaml
Notable behaviors:
- Broad coverage: the Alert watches errors from FluxInstance, GitRepository, Kustomization, HelmRelease, OCIRepository, etc.
- Noise control: an exclusionList drops known-flaky patterns (e.g., transient GitHub DNS/timeout lookups).
Why it matters with wait: false:
- Even if a Kustomization “applies clean”, a failing HelmRelease still emits events and flips READY=False. Notifications catch that without relying on Kustomization health gating.
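A rough sketch of the Provider/Alert pair this pattern describes. The Alertmanager address, object names, and exclusion regexes are assumptions for illustration, and the event source list is a subset; the reference repo's files may differ.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager
spec:
  type: alertmanager
  # Assumed in-cluster Alertmanager endpoint; adjust to the real service.
  address: http://alertmanager-operated.observability.svc.cluster.local:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: alertmanager
spec:
  providerRef:
    name: alertmanager
  # Forward only error events.
  eventSeverity: error
  eventSources:
    - kind: GitRepository
      name: "*"
    - kind: Kustomization
      name: "*"
    - kind: HelmRelease
      name: "*"
    - kind: OCIRepository
      name: "*"
  # Illustrative regexes for transient lookup/timeout noise.
  exclusionList:
    - "error.*lookup github\\.com"
    - "dial.*tcp.*timeout"
```

Both objects need to live in a namespace the notification-controller watches, and the wildcard sources only match objects in the Alert's own namespace unless a namespace is set on each source.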
Pattern 2: Flux Notifications (Kustomization → GitHub status)
The reference repo also drives GitHub status from Kustomizations.
Files:
- example-ops/onedr0p-home-ops/kubernetes/components/alerts/github-status/provider.yaml
- example-ops/onedr0p-home-ops/kubernetes/components/alerts/github-status/alert.yaml
Why it matters:
- This is a “developer feedback loop” signal. It’s not cluster correctness on its own, but it makes breakage visible even when wait: false is used widely.
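A minimal sketch of this pattern, assuming a hypothetical repository URL and a Secret named github-status-token that holds a GitHub token under the token key. None of these names come from the reference repo.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: github-status
spec:
  # The "github" provider type posts commit statuses.
  type: github
  address: https://github.com/example-owner/example-repo   # hypothetical repo
  secretRef:
    name: github-status-token                               # hypothetical Secret
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: github-status
spec:
  providerRef:
    name: github-status
  # "info" forwards both successful and failed reconciliations.
  eventSeverity: info
  eventSources:
    - kind: Kustomization
      name: "*"
```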
Pattern 3: Flux Instance monitoring (PodMonitor + PrometheusRule)
The reference repo monitors Flux controllers and alerts if Flux itself isn’t healthy.
Files:
- example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/podmonitor.yaml
- example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/prometheusrule.yaml
- example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/grafanadashboard.yaml
What it does:
- PodMonitor scrapes metrics from:
  - source-controller
  - kustomize-controller
  - helm-controller
  - notification-controller
- PrometheusRule alerts on:
  - FluxInstanceAbsent (no metrics)
  - FluxInstanceNotReady (ready != True for 5m)
- GrafanaDashboard imports Flux dashboards from upstream URLs (Flux Operator + flux2 monitoring example).
Why it matters with wait: false:
- If Flux controllers are degraded, wait settings on downstream Kustomizations are irrelevant. This gives a first-line signal that “GitOps is broken”.
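The PodMonitor and PrometheusRule described above could look roughly like the sketch below. It assumes the Prometheus Operator CRDs are installed, that the controllers expose metrics on a port named http-prom, and that the Flux Operator's flux_instance_info metric (with its ready label) is being scraped. The object names, label values, severities, and the 5m window on the absent rule are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-controllers          # hypothetical name
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom             # assumed metrics port name
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-instance-rules       # hypothetical name
  namespace: flux-system
spec:
  groups:
    - name: flux-instance.rules
      rules:
        # Fires when no FluxInstance metric is reported at all.
        - alert: FluxInstanceAbsent
          expr: absent(flux_instance_info{exported_namespace="flux-system", name="flux"})
          for: 5m
          labels:
            severity: critical
        # Fires when the FluxInstance reports a ready label other than "True".
        - alert: FluxInstanceNotReady
          expr: flux_instance_info{exported_namespace="flux-system", name="flux", ready!="True"}
          for: 5m
          labels:
            severity: critical
```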
Pattern 4: “Fast apply” + “strong remediation” defaults
The reference repo uses a root cluster-apps Kustomization with wait: false, but patches all child Kustomizations to inject HelmRelease remediation defaults.
File:
example-ops/onedr0p-home-ops/kubernetes/flux/cluster/ks.yaml
What it injects (high level):
- Install/upgrade strategies (RetryOnFailure, RemediateOnFailure)
- CRD install/upgrade behavior (CreateReplace)
- Rollback cleanup + recreate
- Upgrade retries/remediation behavior
Why it matters with wait: false:
- You get consistent “self-healing” without relying on Kustomization health gating.
- Failures become visible primarily in HelmRelease status/events, and are forwarded via notifications.
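The nested-patch idea can be sketched as follows. This is an illustration under assumptions, not the reference repo's ks.yaml: the path, interval, source name, and retry counts are invented, and the install/upgrade strategy fields are left out because their exact shape depends on the Flux version in use.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 1h                 # assumed interval
  path: ./kubernetes/apps      # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumed source name
  wait: false
  patches:
    # Outer patch: applied to every child Flux Kustomization.
    - target:
        group: kustomize.toolkit.fluxcd.io
        kind: Kustomization
      patch: |-
        apiVersion: kustomize.toolkit.fluxcd.io/v1
        kind: Kustomization
        metadata:
          name: not-used       # placeholder; the target selector picks the objects
        spec:
          patches:
            # Inner patch: applied to every HelmRelease the child renders.
            - target:
                group: helm.toolkit.fluxcd.io
                kind: HelmRelease
              patch: |-
                apiVersion: helm.toolkit.fluxcd.io/v2
                kind: HelmRelease
                metadata:
                  name: not-used
                spec:
                  install:
                    crds: CreateReplace
                    remediation:
                      retries: 3          # illustrative value
                  upgrade:
                    crds: CreateReplace
                    remediation:
                      retries: 3          # illustrative value
                      remediateLastFailure: true
                  rollback:
                    cleanupOnFail: true
                    recreate: true
```

A child Kustomization that already sets its own spec.patches may have that list overwritten by the outer patch, so that interaction is worth checking before adopting this.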
Pattern 5: Targeted healthChecks / healthCheckExprs (selected infra)
The reference repo adds explicit healthChecks and healthCheckExprs on some infrastructure Kustomizations (examples include cert-manager, onepassword, and cloudflare-dns).
Examples:
- example-ops/onedr0p-home-ops/kubernetes/apps/cert-manager/cert-manager/ks.yaml
  - health check a HelmRelease plus a ClusterIssuer
  - expression-based readiness for ClusterIssuer conditions
- example-ops/onedr0p-home-ops/kubernetes/apps/external-secrets/onepassword/ks.yaml
  - expression-based readiness for ClusterSecretStore conditions
- example-ops/onedr0p-home-ops/kubernetes/apps/network/cloudflare-dns/ks.yaml
  - health check a HelmRelease plus a CRD
Important nuance:
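healthChecks only gates a Kustomization on the resources it lists, and Flux evaluates that list even when wait: false. Per the Flux Kustomization API, setting wait: true makes Flux health-check every resource it applies and ignore the healthChecks list, so the two are alternatives rather than a pair. In haynes-ops, if we adopt this pattern, we should be explicit about which Kustomizations are “gated” (targeted healthChecks, or wait: true where we truly want readiness to reflect all downstream health) vs “fast apply” (wait: false with no health checks).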
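A hedged sketch of a “gated” infra Kustomization in the cert-manager style. The ClusterIssuer name, path, interval, and timeout are hypothetical. Flux's Kustomization API documents that healthChecks is ignored when wait: true, so this sketch leaves wait at false and relies on the explicit list.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 1h                        # assumed interval
  timeout: 5m                         # assumed health check timeout
  path: ./kubernetes/apps/cert-manager/cert-manager/app   # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                 # assumed source name
  wait: false
  # Only these objects are included in the health assessment.
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: cert-manager
      namespace: cert-manager
    - apiVersion: cert-manager.io/v1
      kind: ClusterIssuer
      name: letsencrypt-production    # hypothetical issuer name
  # CEL expressions telling Flux how to read readiness for a custom kind.
  healthCheckExprs:
    - apiVersion: cert-manager.io/v1
      kind: ClusterIssuer
      failed: status.conditions.filter(e, e.type == 'Ready').all(e, e.status == 'False')
      current: status.conditions.filter(e, e.type == 'Ready').all(e, e.status == 'True')
```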
Receiver pattern: GitHub webhook triggers immediate reconcile
The reference repo creates a Receiver that listens for GitHub push events and triggers reconciliation of the GitRepository and root Kustomization.
File:
example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/receiver.yaml
Why it matters:
- Reduces time-to-detect for broken commits without requiring very tight intervals.
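A minimal sketch of such a Receiver, assuming the GitRepository and root Kustomization are both named flux-system and that a Secret named github-webhook-token holds the shared webhook token under the token key. These names are illustrative and may not match the reference repo.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-webhook
  namespace: flux-system
spec:
  type: github
  # GitHub event types that trigger reconciliation.
  events:
    - ping
    - push
  secretRef:
    name: github-webhook-token        # hypothetical Secret
  # Objects to reconcile when a matching event arrives.
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: flux-system
      namespace: flux-system
    - apiVersion: kustomize.toolkit.fluxcd.io/v1
      kind: Kustomization
      name: flux-system
      namespace: flux-system
```

The notification-controller's webhook endpoint also has to be reachable from GitHub (for example through an ingress or tunnel), and that exposure is outside what this sketch covers.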
What we could adopt in haynes-ops (later)
If we want “strong health signals” while keeping wait: false broadly:
- Flux → Alertmanager notifications: likely the highest value and lowest risk.
- Flux controller metrics scraping + rules: also high value; requires Prometheus stack integration.
- HelmRelease remediation defaults via root patches: medium risk (behavior change), but powerful.
- Selective wait: true or targeted healthChecks on critical infra: use sparingly to avoid deadlocks.
- GitHub status provider/alert: optional; most useful if you care about PR/commit status feedback.