Standardization roadmap (haynes-ops)

This folder is the planning + runbook hub for modernizing haynes-ops toward the conventions used by the broader home-ops community, using two concrete references:

The reference repo included in this workspace: example-ops/onedr0p-home-ops
The widely-forked template many home-ops repos are based on: onedr0p/cluster-template

The goal is to make it easier to:

Lift patterns directly from kubesearch examples
Reduce YAML duplication (use components + global defaults)
Manage risk during reconciles (batching, rollback paths, edge-first where needed)

Principles

Docs first: we do not refactor manifests until the docs/runbooks are agreed on.
GitOps strictly: changes happen through commits; Flux applies them.
Blast radius control: change one dimension at a time (structure vs behavior vs versions).
Edge as proving ground: if the change is risky or hard to roll back, validate on edge first.
Assume immutable fields exist: Deployments and StatefulSets can require a delete/recreate when an immutable field such as spec.selector changes (for example, when a chart upgrade changes its selector labels).
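
To make the immutable-field principle concrete, the fragment below marks the field that most often triggers it. This is a minimal sketch: the workload name, labels, and image are hypothetical and not taken from this repo.

```yaml
# Hypothetical Deployment; names, labels, and image are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      # spec.selector is immutable in apps/v1: once the object exists, an
      # update that changes this value is rejected by the API server.
      app.kubernetes.io/name: example-app
  template:
    metadata:
      labels:
        # Pod template labels may change, but must keep matching the selector.
        app.kubernetes.io/name: example-app
    spec:
      containers:
        - name: app
          image: example.invalid/app:1.0.0   # placeholder image
```

If a chart upgrade renders a different selector value, the only way forward is to delete the workload and let Flux recreate it.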

Current repo realities (important constraints)

Two clusters: kubernetes/main and kubernetes/edge.
Flux entry points:
kubernetes/*/flux/config/cluster.yaml (GitRepository + cluster Kustomization)
kubernetes/*/flux/apps.yaml (the cluster-apps Kustomization that points at kubernetes/*/apps)
kubernetes/*/flux/repositories/kustomization.yaml (applies kubernetes/shared/repositories)
Shared resources live under kubernetes/shared/ (repositories, components, etc.).
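
As a rough orientation, the cluster-apps Kustomization in kubernetes/main/flux/apps.yaml probably resembles the sketch below. The interval, the prune setting, and the GitRepository name are assumptions; only the Kustomization name and the apps path come from the description above.

```yaml
# Sketch of kubernetes/main/flux/apps.yaml; field values marked "assumed"
# are illustrative and may differ from the real file.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 30m                 # assumed
  path: ./kubernetes/main/apps  # the edge cluster would point at ./kubernetes/edge/apps
  prune: true                   # assumed
  sourceRef:
    kind: GitRepository
    name: flux-system           # assumed name of the GitRepository in cluster.yaml
```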

Target direction (adapted, not copied blindly)

We’re aiming for:

Components-first (Kustomize Components for repeatable patterns like volsync, alerts, gatus)
Fewer “templates” (avoid duplicated YAML that drifts)
Chart sources standardized (clear, consistent HelmRelease chart sourcing)
Global defaults applied by Flux root Kustomization patches (where safe), similar to the patterns in the reference repo

Important nuance (about the reference patterns):

example-ops/onedr0p-home-ops uses per-app OCIRepository objects and strong global patching defaults.
haynes-ops currently uses shared repositories under kubernetes/shared/repositories/.

Neither is “the one true way” for all home-ops repos; we’re choosing what fits this repo best while reducing operational risk.
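
For comparison, a per-app chart source in the style of the reference repo might look roughly like this. The object name, namespace, interval, registry path, and tag are placeholders or assumptions; check the chart's actual registry and your Flux version before copying.

```yaml
# Sketch of a per-app OCIRepository; values are assumptions.
# source.toolkit.fluxcd.io/v1 needs a Flux release that serves OCIRepository v1;
# older installs use v1beta2.
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: example-app        # hypothetical: one source object per app
  namespace: default       # hypothetical
spec:
  interval: 15m            # assumed
  layerSelector:
    # Select the Helm chart layer from the OCI artifact.
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
  ref:
    tag: "<chart-version>"                    # placeholder
  url: oci://<registry>/<path>/app-template   # placeholder
```

With the shared approach, a single object like this would live under kubernetes/shared/repositories/ and be referenced by many HelmReleases instead.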

Work phases (ordered)

Phase 0: Documentation set (now)

Definition of done:

This README explains the ordering, risk, and how to run/verify each phase.
Each risky/tedious phase has a breakout runbook.

Phase 1: Edge stabilization (prereq)

We want edge to be a reliable proving ground before we take on risky refactors.

Runbook: edge-stabilization.md

Phase 2: Quick wins (low risk, high value)

These should be mostly additive or purely structural:

Remove/merge obvious duplication between kubernetes/shared/templates/ and kubernetes/shared/components/ where it doesn’t change output.
Adopt a consistent component usage approach and document it.

Runbook: components-over-templates.md
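
A Kustomize Component and an app that consumes it can be sketched as follows. The file names, the included resource, and the relative path depth are hypothetical; the real layout under kubernetes/shared/components/ may differ.

```yaml
# kubernetes/shared/components/volsync/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  - ./replicationsource.yaml   # hypothetical file shipped by the component
---
# An app's kustomization.yaml that pulls the component in instead of
# copying a template. The relative path is an assumption.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./helmrelease.yaml
components:
  - ../../../../shared/components/volsync
```

Because the component is referenced rather than copied, a fix made once in the shared directory reaches every app that lists it.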

Phase 3: Standardize chart sourcing (medium risk)

Standardize HelmRelease chart sourcing patterns, starting with the app-template fleet, with careful batching and known recovery procedures.

Runbook: helmrelease-chartref-migration.md
Related decision: repository-source-strategy.md
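
The shape of the change can be sketched as a before/after pair. This assumes the current HelmReleases use spec.chart with a shared HelmRepository and that the target is an OCIRepository named app-template; all names, namespaces, and versions below are placeholders.

```yaml
# Before (assumed current shape): chart resolved through spec.chart.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app          # hypothetical
  namespace: default         # hypothetical
spec:
  interval: 30m
  chart:
    spec:
      chart: app-template
      version: "<chart-version>"    # placeholder
      sourceRef:
        kind: HelmRepository
        name: <shared-repo-name>    # placeholder
        namespace: flux-system
---
# After: chart resolved through spec.chartRef. spec.chart and spec.chartRef
# are mutually exclusive, and the version is pinned on the OCIRepository
# (ref.tag) rather than on the HelmRelease.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app
  namespace: default
spec:
  interval: 30m
  chartRef:
    kind: OCIRepository
    name: app-template       # assumed name of the shared source
    namespace: flux-system   # assumed
```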

Phase 4: Flux global defaults / patches (medium to high risk)

Expand Flux root Kustomization patching to reduce boilerplate and make remediation behavior consistent. Some defaults can materially change reconcile behavior, so we stage this carefully.

Runbook: flux-global-patches.md
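
One way to express global HelmRelease defaults is a nested patch on the root Kustomization: the outer patch targets every child Flux Kustomization and injects an inner patch that targets every HelmRelease. The remediation values below are illustrative assumptions, not agreed defaults.

```yaml
# Fragment of the cluster-apps Kustomization; interval, path, prune, and
# sourceRef are omitted for brevity. Values are illustrative.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  patches:
    - target:
        group: kustomize.toolkit.fluxcd.io
        kind: Kustomization
      patch: |-
        apiVersion: kustomize.toolkit.fluxcd.io/v1
        kind: Kustomization
        metadata:
          name: not-used   # ignored; the target selector decides what is patched
        spec:
          patches:
            - target:
                group: helm.toolkit.fluxcd.io
                kind: HelmRelease
              patch: |-
                apiVersion: helm.toolkit.fluxcd.io/v2
                kind: HelmRelease
                metadata:
                  name: not-used
                spec:
                  install:
                    remediation:
                      retries: 3              # assumed value
                  upgrade:
                    cleanupOnFail: true
                    remediation:
                      retries: 3              # assumed value
                      remediateLastFailure: true
```

Defaults like these change what happens after a failed upgrade, which is why they are staged rather than applied to every app at once.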

Phase 4.5: Flux Operator migration (very high risk, edge first)

Migrating to Flux Operator changes how Flux itself is installed and configured. Keep this as a separate project with a dedicated rollout and rollback plan.

Runbook: flux-operator-migration.md
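
With Flux Operator, the installation is described by a FluxInstance object rather than bootstrap manifests. The sketch below shows the general shape for the edge cluster; the version range, component list, Git URL, branch, and sync path are all assumptions to be settled in the runbook.

```yaml
# Sketch of a FluxInstance for edge; every value here is an assumption.
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux                 # the operator expects the instance to be named "flux"
  namespace: flux-system
spec:
  distribution:
    version: "2.x"
    registry: ghcr.io/fluxcd
  components:
    - source-controller
    - kustomize-controller
    - helm-controller
    - notification-controller
  cluster:
    type: kubernetes
  sync:
    kind: GitRepository
    url: "<git-url-of-haynes-ops>"   # placeholder
    ref: refs/heads/main             # assumed branch
    path: kubernetes/edge/flux       # hypothetical sync path
```

The sync block takes over the role of the GitRepository and cluster Kustomization currently defined in flux/config/cluster.yaml, which is the main reason the rollback plan has to be worked out first.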

Phase 5: Network modernization (high risk)

Migrating from Ingress to Gateway API (Traefik provider changes + resource type migrations) can cause downtime if done incorrectly. Keep this separate from other refactors.

Existing notes (to be split later): todo-refactor.md
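
To give a sense of the resource type migration, here is one Ingress and a roughly equivalent HTTPRoute. The host, service, Gateway name, namespaces, and ingress class are hypothetical, and the sketch assumes a Gateway already exists and that Traefik's Gateway API provider is enabled.

```yaml
# Before: a classic Ingress (hypothetical app).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  namespace: default
spec:
  ingressClassName: traefik      # assumed class name
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 8080
---
# After: an HTTPRoute attached to a shared Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app
  namespace: default
spec:
  parentRefs:
    - name: internal             # hypothetical Gateway name
      namespace: network         # hypothetical Gateway namespace
  hostnames:
    - app.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: example-app
          port: 8080
```

Routing, TLS, and listener configuration move from per-app annotations to the Gateway, so both objects may need to coexist during a cutover.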

Standard commands (copy/paste)

Flux status

flux check
flux get ks -A
flux get hr -A

Reconcile a specific resource

flux reconcile helmrelease <name> -n <namespace> --with-source
flux reconcile kustomization <name> -n flux-system --with-source

Immutable selector remediation (pattern)

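When a HelmRelease upgrade fails with an error like spec.selector ... field is immutable, suspend the release, delete the blocking workload, then resume and reconcile: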

flux suspend helmrelease <name> -n <namespace>
kubectl -n <namespace> get deploy,sts,ds,cronjob -l helm.toolkit.fluxcd.io/name=<name>
kubectl -n <namespace> delete <deployment|statefulset|daemonset> <workload-name> --wait=true
flux resume helmrelease <name> -n <namespace>
flux reconcile helmrelease <name> -n <namespace> --with-source

Incident note: `comfyui` rollback during `app-template` v3 → v4

During the chartRef migration / app-template v3→v4 upgrade, HelmRelease/ai/comfyui failed and rolled back even though the Flux Kustomization looked “clean” at a glance.

Why it failed: Kubernetes forbids changes to most StatefulSet.spec fields after creation (only a few, such as replicas, template, and updateStrategy, are mutable). The chart upgrade attempted a forbidden StatefulSet change, so Helm failed the upgrade and rolled back to the previous release.
Why KS didn’t obviously show it: comfyui is applied with wait: false, so the KS primarily reflects “applied manifests”, not “Helm upgrade succeeded”. The HelmRelease status is the source of truth for chart upgrade outcomes.
Remediation pattern: suspend HR → delete the blocking workload (StatefulSet for comfyui, Deployment for ollama-*) → resume + reconcile HR, then reconcile KS to refresh its health.
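
The visibility gap described above follows from how the app's Flux Kustomization is configured. The sketch below shows a wait: false Kustomization plus one possible way to surface HelmRelease failures on the KS, using spec.healthChecks. The path, interval, timeout, Kustomization namespace, and GitRepository name are assumptions, and the health check is an option to evaluate, not something this repo or the reference repo is known to use.

```yaml
# Sketch only: names, path, and timings are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: comfyui
  namespace: flux-system                         # assumed
spec:
  interval: 30m                                  # assumed
  path: ./kubernetes/main/apps/ai/comfyui/app    # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                            # assumed
  targetNamespace: ai
  # With wait: false and no healthChecks, the KS reports Ready as soon as the
  # manifests are applied, regardless of the Helm upgrade result.
  wait: false
  timeout: 5m                                    # assumed; bounds the health check
  # Optional: make KS readiness follow the HelmRelease's Ready condition.
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: comfyui
      namespace: ai
```

Without the healthChecks entry, flux get hr -A remains the place to confirm whether an upgrade actually succeeded.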

Breakout documents (index)

edge-stabilization.md: get edge to a trustworthy baseline
components-over-templates.md: converge on components; stop template drift
sops-scope-and-kustomization-namespacing.md: keep SOPS confined to flux/vars while allowing app Kustomizations outside flux-system
seed-secrets-and-removing-sops.md: future task — external-secrets seed strategy + removing per-app *.sops.yaml
helmrelease-chartref-migration.md: migrate HelmRelease to chartRef safely (batching + recovery)
flux-global-patches.md: staged approach to onedr0p-style global defaults
flux-operator-migration.md: edge-first migration plan to Flux Operator + Flux Instance
health-signals-with-wait-false.md: how the reference repo gets strong health signals without relying on KS wait: true
gatus-deployment-alignment.md: align Gatus deployment + substitution behavior to the reference repo
repository-source-strategy.md: decide shared vs per-app OCI sources, and how that affects migrations
todo-refactor.md: backlog (includes Gateway API migration ideas; treat as high risk)