Standardization roadmap (haynes-ops)
This folder is the planning + runbook hub for modernizing haynes-ops toward the conventions used by the broader home-ops community, using two concrete references:
- The reference repo included in this workspace: example-ops/onedr0p-home-ops
- The widely-forked template many home-ops repos are based on: onedr0p/cluster-template
The goal is to make it easier to:
- Lift patterns directly from kubesearch examples
- Reduce YAML duplication (use components + global defaults)
- Manage risk during reconciles (batching, rollback paths, edge-first where needed)
Principles
- Docs first: we do not refactor manifests until the docs/runbooks are agreed.
- GitOps strictly: changes happen through commits; Flux applies them.
- Blast radius control: change one dimension at a time (structure vs behavior vs versions).
- Edge as proving ground: if the change is risky or hard to roll back, validate on edge first.
- Assume immutable fields exist: Deployments/StatefulSets can require delete/recreate when selectors/labels change.
Current repo realities (important constraints)
- Two clusters: kubernetes/main and kubernetes/edge.
- Flux entry points:
  - kubernetes/*/flux/config/cluster.yaml (GitRepository + cluster Kustomization)
  - kubernetes/*/flux/apps.yaml (the cluster-apps Kustomization that points at kubernetes/*/apps)
  - kubernetes/*/flux/repositories/kustomization.yaml (applies kubernetes/shared/repositories)
- Shared resources live under kubernetes/shared/ (repositories, components, etc.).
Target direction (adapted, not copied blindly)
We’re aiming for:
- Components-first (Kustomize Components for repeatable patterns like volsync, alerts, gatus)
- Fewer “templates” (avoid duplicated YAML that drifts)
- Chart sources standardized (clear, consistent HelmRelease chart sourcing)
- Global defaults applied by Flux root Kustomization patches (where safe), similar to the patterns in the reference repo
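As a rough illustration of the components-first direction, a Kustomize Component and an app that consumes it might look like the sketch here. The component path, the file names inside it, and the app path are hypothetical and only meant to show the shape; the two documents would live in separate files.

```yaml
# File 1 (hypothetical path): kubernetes/shared/components/gatus/kustomization.yaml
# A Component bundles resources/patches that several apps can opt into.
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  - ./configmap.yaml   # hypothetical file holding the shared Gatus endpoint config
---
# File 2 (hypothetical path): kubernetes/main/apps/<namespace>/<app>/app/kustomization.yaml
# The app pulls the component in instead of copying a template.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./helmrelease.yaml
components:
  - ../../../../../shared/components/gatus
```

Because the component is referenced rather than copied, a fix made in the component directory reaches every app that lists it, which is the drift reduction this section is after.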
Important nuance (about the reference patterns):
- example-ops/onedr0p-home-ops uses per-app OCIRepository objects and strong global patching defaults.
- haynes-ops currently uses shared repositories under kubernetes/shared/repositories/.
Neither is “the one true way” for all home-ops repos; we’re choosing what fits this repo best while reducing operational risk.
Work phases (ordered)
Phase 0: Documentation set (now)
Definition of done:
- This README explains the ordering, risk, and how to run/verify each phase.
- Each risky/tedious phase has a breakout runbook.
Phase 1: Edge stabilization (prereq)
We want edge to be a reliable proving ground before we take on risky refactors.
- Runbook: edge-stabilization.md
Phase 2: Quick wins (low risk, high value)
These should be mostly additive or purely structural:
- Remove/merge obvious duplication between kubernetes/shared/templates/ and kubernetes/shared/components/ where it doesn’t change output.
- Adopt a consistent component usage approach and document it.
- Runbook: components-over-templates.md
Phase 3: Standardize chart sourcing (medium risk)
Standardize HelmRelease chart sourcing patterns, starting with the app-template fleet, with careful batching and known recovery procedures.
- Runbook: helmrelease-chartref-migration.md
- Related decision: repository-source-strategy.md
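For orientation, a chartRef-style HelmRelease pointing at an OCIRepository could look roughly like the following. This is a sketch under assumptions: the registry URL, tag, object names, and the choice of flux-system as the source namespace are placeholders, and the apiVersion of OCIRepository depends on the installed Flux version. Whether the OCIRepository is shared or per-app is the decision tracked in repository-source-strategy.md.

```yaml
# Sketch only: URL, tag, names and namespaces are placeholders, not this repo's values.
apiVersion: source.toolkit.fluxcd.io/v1   # older Flux releases serve this kind as v1beta2
kind: OCIRepository
metadata:
  name: app-template
  namespace: flux-system
spec:
  interval: 1h
  url: oci://<registry>/<path>/app-template
  ref:
    tag: <chart-version>
  # Select the Helm chart layer of the OCI artifact so helm-controller can consume it.
  layerSelector:
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: <app>
  namespace: <namespace>
spec:
  interval: 1h
  # chartRef replaces spec.chart.spec.sourceRef and points straight at the source object.
  chartRef:
    kind: OCIRepository
    name: app-template
    namespace: flux-system
  values: {}
```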
Phase 4: Flux global defaults / patches (medium to high risk)
Expand Flux root Kustomization patching to reduce boilerplate and make remediation behavior consistent. Some defaults can materially change reconcile behavior, so we stage this carefully.
- Runbook: flux-global-patches.md
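To make "root Kustomization patching" concrete, the sketch here shows one commonly used shape: the root cluster-apps Kustomization patches every child Flux Kustomization so that each child, in turn, patches the HelmReleases it applies. The remediation values, the interval, and the GitRepository name are assumptions for illustration, not settings agreed for this repo.

```yaml
# Sketch only: default values and the source name are illustrative assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 1h
  path: ./kubernetes/main/apps
  prune: true
  sourceRef:
    kind: GitRepository
    name: <git-repository-name>
  patches:
    # Outer patch: applied to every child Flux Kustomization under the path.
    - target:
        group: kustomize.toolkit.fluxcd.io
        kind: Kustomization
      patch: |-
        apiVersion: kustomize.toolkit.fluxcd.io/v1
        kind: Kustomization
        metadata:
          name: not-used
        spec:
          # Inner patch: each child then applies these defaults to its HelmReleases.
          # Note: this replaces any spec.patches list a child already defines.
          patches:
            - target:
                group: helm.toolkit.fluxcd.io
                kind: HelmRelease
              patch: |-
                apiVersion: helm.toolkit.fluxcd.io/v2
                kind: HelmRelease
                metadata:
                  name: not-used
                spec:
                  install:
                    remediation:
                      retries: 3
                  upgrade:
                    cleanupOnFail: true
                    remediation:
                      retries: 3
```

Defaults like upgrade remediation change what happens on a failed upgrade across the whole fleet at once, which is why this phase is staged rather than applied in one commit.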
Phase 4.5: Flux Operator migration (very high risk, edge first)
Migrating to Flux Operator changes how Flux itself is installed and configured. Keep this as a separate project with a dedicated rollout and rollback plan.
- Runbook: flux-operator-migration.md
Phase 5: Network modernization (high risk)
Ingress → Gateway API (Traefik provider changes + resource type migrations) can cause downtime if done incorrectly. Keep this separate from other refactors.
- Existing notes (to be split later): todo-refactor.md
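For a sense of the resource type migration involved, an Ingress rule would typically become an HTTPRoute attached to a Gateway. The sketch here uses placeholder names throughout (gateway, hostname, service, port) and does not reflect any decision made for this repo.

```yaml
# Sketch only: every name, hostname and port is a placeholder.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <app>
  namespace: <namespace>
spec:
  # The Gateway this route attaches to (replaces the Ingress class selection).
  parentRefs:
    - name: <gateway-name>
      namespace: <gateway-namespace>
  hostnames:
    - <app>.example.com
  rules:
    # Send all matching traffic to the app's Service.
    - backendRefs:
        - name: <service-name>
          port: 80
```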
Standard commands (copy/paste)
Flux status
flux check
flux get ks -A
flux get hr -A
Reconcile a specific resource
flux reconcile helmrelease <name> -n <namespace> --with-source
flux reconcile kustomization <name> -n flux-system --with-source
Immutable selector remediation (pattern)
When you see spec.selector ... field is immutable:
flux suspend helmrelease <name> -n <namespace>
kubectl -n <namespace> get deploy,sts,ds,cronjob -l helm.toolkit.fluxcd.io/name=<name>
kubectl -n <namespace> delete <deployment|statefulset|daemonset> <workload-name> --wait=true
flux resume helmrelease <name> -n <namespace>
flux reconcile helmrelease <name> -n <namespace> --with-source
Incident note: comfyui rollback during app-template v3 → v4
During the chartRef migration / app-template v3→v4 upgrade, HelmRelease/ai/comfyui failed and rolled back even though the Flux Kustomization looked “clean” at a glance.
- Why it failed: Kubernetes forbids changes to most StatefulSet.spec fields. The chart upgrade attempted a forbidden StatefulSet change, so Helm failed the upgrade and rolled back to the previously deployed release.
- Why KS didn’t obviously show it: comfyui is applied with wait: false, so the KS primarily reflects “applied manifests”, not “Helm upgrade succeeded”. The HelmRelease status is the source of truth for chart upgrade outcomes.
- Remediation pattern: suspend HR → delete the blocking workload (StatefulSet for comfyui, Deployment for ollama-*) → resume + reconcile HR, then reconcile KS to refresh its health.
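One possible way to surface a failed Helm upgrade on the KS while keeping wait: false is an explicit health check on the HelmRelease. This is a sketch of a Flux feature, not necessarily what the reference repo does (see health-signals-with-wait-false.md for that); the Kustomization name and namespace, path, timeout, and GitRepository name are assumptions, while HelmRelease/ai/comfyui comes from the incident described here.

```yaml
# Sketch only: path, timeout, KS name/namespace and source name are illustrative assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: comfyui
  namespace: flux-system
spec:
  interval: 1h
  path: <path-to-app>
  prune: true
  wait: false          # do not wait on every applied object
  timeout: 5m          # upper bound for the health check
  sourceRef:
    kind: GitRepository
    name: <git-repository-name>
  # With wait: false, only the objects listed here gate the KS Ready condition.
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2
      kind: HelmRelease
      name: comfyui
      namespace: ai
```

With a check like this, a HelmRelease that is not Ready after a failed upgrade and rollback would leave the KS not Ready as well, instead of the KS looking clean.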
Breakout documents (index)
- edge-stabilization.md: get edge to a trustworthy baseline
- components-over-templates.md: converge on components; stop template drift
- sops-scope-and-kustomization-namespacing.md: keep SOPS confined to flux/vars while allowing app Kustomizations outside flux-system
- seed-secrets-and-removing-sops.md: future task, covering the external-secrets seed strategy + removing per-app *.sops.yaml
- helmrelease-chartref-migration.md: migrate HelmRelease to chartRef safely (batching + recovery)
- flux-global-patches.md: staged approach to onedr0p-style global defaults
- flux-operator-migration.md: edge-first migration plan to Flux Operator + Flux Instance
- health-signals-with-wait-false.md: how the reference repo gets strong health signals without relying on KS wait: true
- gatus-deployment-alignment.md: align Gatus deployment + substitution behavior to the reference repo
- repository-source-strategy.md: decide shared vs per-app OCI sources, and how that affects migrations
- todo-refactor.md: backlog (includes Gateway API migration ideas; treat as high risk)