Health signals with wait false

Health signals with `wait: false` (reference repo patterns)

This documents the patterns used in the reference repo (example-ops/onedr0p-home-ops) to get strong health signals even when many Flux Kustomizations are set to wait: false.

The goal is not to copy blindly, but to capture the “signal sources” we may want to adopt in haynes-ops later.

The core idea

wait: false prevents a Flux Kustomization from blocking on health checks for everything it applies. That reduces “deadlocks” (one broken app shouldn’t stall the whole cluster-apps tree), but it means you need other mechanisms to surface failures quickly and reliably.

The reference repo gets those signals from:

Flux Notifications (events → Alertmanager / GitHub status)
Prometheus scraping + alert rules for Flux itself
HelmRelease remediation defaults (retries/rollback behavior is consistent)
Targeted healthChecks / healthCheckExprs for a small set of “critical infra” Kustomizations

Pattern 1: Flux Notifications (errors → Alertmanager)

Reference repo uses Flux notifications to send error events for most Flux objects to Alertmanager.

Files:

example-ops/onedr0p-home-ops/kubernetes/components/alerts/alertmanager/provider.yaml
example-ops/onedr0p-home-ops/kubernetes/components/alerts/alertmanager/alert.yaml

Notable behaviors:

Broad coverage: the Alert watches errors from FluxInstance, GitRepository, Kustomization, HelmRelease, OCIRepository, etc.
Noise control: an exclusionList drops known-flaky patterns (e.g., transient GitHub DNS/timeout lookups).

Why it matters with wait: false:

Even if a Kustomization “applies clean”, a failing HelmRelease still emits events and flips READY=False. Notifications catch that without relying on KS health gating.

Pattern 2: Flux Notifications (Kustomization → GitHub status)

Reference repo also drives GitHub status from Kustomizations.

Files:

example-ops/onedr0p-home-ops/kubernetes/components/alerts/github-status/provider.yaml
example-ops/onedr0p-home-ops/kubernetes/components/alerts/github-status/alert.yaml

Why it matters:

This is a “developer feedback loop” signal. It’s not cluster correctness on its own, but it makes breakage visible even when wait: false is used widely.

Pattern 3: Flux Instance monitoring (PodMonitor + PrometheusRule)

The reference repo monitors Flux controllers and alerts if Flux itself isn’t healthy.

Files:

example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/podmonitor.yaml
example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/prometheusrule.yaml
example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/grafanadashboard.yaml

What it does:

PodMonitor scrapes metrics from:
source-controller
kustomize-controller
helm-controller
notification-controller
PrometheusRule alerts on:
FluxInstanceAbsent (no metrics)
FluxInstanceNotReady (ready != True for 5m)
GrafanaDashboard imports Flux dashboards from upstream URLs (Flux Operator + flux2 monitoring example).

Why it matters with wait: false:

If Flux controllers are degraded, wait settings on downstream Kustomizations are irrelevant. This gives a first-line signal that “GitOps is broken”.

Pattern 4: “Fast apply” + “strong remediation” defaults

Reference repo uses a root cluster-apps Kustomization with wait: false, but patches all child Kustomizations to inject HelmRelease remediation defaults.

File:

example-ops/onedr0p-home-ops/kubernetes/flux/cluster/ks.yaml

What it injects (high level):

Install/upgrade strategies (RetryOnFailure, RemediateOnFailure)
CRD install/upgrade behavior (CreateReplace)
Rollback cleanup + recreate
Upgrade retries/remediation behavior

Why it matters with wait: false:

You get consistent “self-healing” without relying on Kustomization health gating.
Failures become visible primarily in HelmRelease status/events, and are forwarded via notifications.

Pattern 5: Targeted `healthChecks` / `healthCheckExprs` (selected infra)

The reference repo adds explicit healthChecks and healthCheckExprs on some infrastructure Kustomizations (examples include cert-manager, onepassword, and cloudflare-dns).

Examples:

example-ops/onedr0p-home-ops/kubernetes/apps/cert-manager/cert-manager/ks.yaml
health check a HelmRelease plus a ClusterIssuer
expression-based readiness for ClusterIssuer conditions
example-ops/onedr0p-home-ops/kubernetes/apps/external-secrets/onepassword/ks.yaml
expression-based readiness for ClusterSecretStore conditions
example-ops/onedr0p-home-ops/kubernetes/apps/network/cloudflare-dns/ks.yaml
health check a HelmRelease plus a CRD

Important nuance:

healthChecks are only useful if Flux is actually performing health evaluation for that Kustomization. In haynes-ops, if we adopt this pattern, we should be explicit about which Kustomizations are “gated” (set wait: true where we truly want readiness to reflect downstream health) vs “fast apply” (keep wait: false).

Receiver pattern: GitHub webhook triggers immediate reconcile

The reference repo creates a Receiver that listens for GitHub push events and triggers reconciliation of the GitRepository and root Kustomization.

File:

example-ops/onedr0p-home-ops/kubernetes/apps/flux-system/flux-instance/app/receiver.yaml

Why it matters:

Reduces time-to-detect for broken commits without requiring very tight intervals.

What we could adopt in `haynes-ops` (later)

If we want “strong health signals” while keeping wait: false broadly:

Flux → Alertmanager notifications: likely the highest value and lowest risk.
Flux controller metrics scraping + rules: also high value; requires Prometheus stack integration.
HelmRelease remediation defaults via root patches: medium risk (behavior change), but powerful.
Selective wait: true + healthChecks on critical infra: use sparingly to avoid deadlocks.
GitHub status provider/alert: optional; most useful if you care about PR/commit status feedback.

Health signals with wait false