From 0a205c1d28cb81cb714e442aa83af350367b7353 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Sat, 30 May 2026 14:52:33 +0300 Subject: [PATCH 1/5] fix(cluster-install): correct v1.4.x namespace, storageclass gate, dashboard OIDC Three corrections to the cluster-install skill found while installing Cozystack v1.4.2 on a fresh Talos cluster end to end. cozy-installer namespace: install into cozy-system with --create-namespace. v1.4.0 stopped templating the cozy-system Namespace in the chart and moved namespace creation to --create-namespace plus a pre-install labeler hook (in kube-system, hostNetwork, NotReady-tolerant) that stamps PSA=privileged. The previous --namespace kube-system form (no --create-namespace) makes the labeler hook fail with 'namespaces "cozy-system" not found' and aborts the install before the operator deploys. Both v1.3.x and v1.4.x forms are now documented side by side. StorageClasses: gate on a live 'kubectl get storageclass' check instead of a version branch. The prior guidance skipped this phase on installer_version >= 1.4.0 assuming the tenant CRD exposes spec.storageClasses and the operator auto-creates them. On v1.4.2 the shipped tenant CRD has no storageClasses field and nothing creates the classes, so the cluster reaches all-HRs-Ready with zero StorageClasses and every stateful PVC stuck Pending. The live check creates the linstor defaults whenever the cluster comes up empty and self-skips if a future release starts creating them. Also documents that SC creation belongs inside the Phase 8 watch loop, gated on linstor-controller Ready, to avoid deadlocking on PVC-dependent HRs. dashboard OIDC: enable authentication.oidc, set the root tenant spec.host (it does not inherit publishing.host), and expose keycloak. The non-OIDC token-proxy dashboard path is broken on v1.4.2 (container never binds its port and CrashLoops on its own liveness probe), so a working web dashboard requires Keycloak. 
This mirrors what the upstream e2e install does. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- .../cozystack/skills/cluster-install/SKILL.md | 72 ++++++++++++++++--- .../references/values-template.md | 42 +++++++++-- 2 files changed, 100 insertions(+), 14 deletions(-) diff --git a/plugins/cozystack/skills/cluster-install/SKILL.md b/plugins/cozystack/skills/cluster-install/SKILL.md index 6ad9a62..6d670bc 100644 --- a/plugins/cozystack/skills/cluster-install/SKILL.md +++ b/plugins/cozystack/skills/cluster-install/SKILL.md @@ -623,16 +623,25 @@ Namespace adoption first if `cozy-system` exists and lacks Helm metadata (see `r # Normalise: v1.3.3 → 1.3.3 (Helm's OCI client matches the registry tag as-is) INSTALLER_VERSION_OCI="${INSTALLER_VERSION#v}" +# v1.4.0+ (current): release lives in cozy-system; --create-namespace is REQUIRED. +# The chart no longer templates the cozy-system Namespace (changed in cozystack/cozystack#2508); +# a pre-install hook Job (cozy-system-labeler, in kube-system, hostNetwork, +# tolerant of NotReady/CNI-not-ready) stamps the new namespace with +# PSA enforce=privileged + cozystack.io/system=true. Passing the old +# `--namespace kube-system` without --create-namespace makes the labeler hook +# fail with `namespaces "cozy-system" not found` and aborts the install. helm --kube-context $CTX upgrade --install cozy-installer \ oci://ghcr.io/cozystack/cozystack/cozy-installer \ --version "$INSTALLER_VERSION_OCI" \ - --namespace kube-system \ + --namespace cozy-system --create-namespace \ --set cozystackOperator.variant=$INSTALLER_VARIANT \ --set cozystack.apiServerHost=$API_HOST \ --set cozystack.apiServerPort=$API_PORT \ --wait --timeout 10m ``` +For a v1.3.x install the form is `--namespace kube-system` with NO `--create-namespace` (the v1.3 chart templates the namespace itself). See `references/values-template.md` for both forms side by side — pick the one matching `installer_version`. 
+ For `talos` and `hosted`, drop `cozystack.apiServerHost` / `apiServerPort` if not required by the chart. Verify: @@ -671,20 +680,45 @@ This phase merges what used to be Phase 7.5 (root Tenant ingress patch) into the **Why the patch is needed**: cozystack's dashboard ships gatekeeper (oauth2-proxy) which, on startup, does OIDC discovery against the **public FQDN** `https://keycloak.${HOST}/realms/cozy/.well-known/openid-configuration` — not an in-cluster service. Without the root ingress controller running, nothing listens on 443, gatekeeper CrashLoopBackOffs, the `cozy-dashboard/dashboard` HR sits in `Unknown: Running 'install' action with timeout of 10m0s` and then `InstallFailed: context deadline exceeded`, Flux remediates and retries forever. `cozy-fluxcd/flux-plunger` has a hard dependency on `cozy-dashboard/dashboard` and stays `False: dependency is not ready`. The phase would never go green. +**The dashboard requires OIDC/Keycloak — it is not optional on the supported path.** The "Why the patch is needed" note above describes gatekeeper doing OIDC discovery against Keycloak. But Keycloak only deploys when `authentication.oidc.enabled: true` in the Platform Package — and that key defaults to `false`, with the `isp-full*` overlays NOT turning it on. If the skill enables `ingress` + sets `host` but never enables OIDC, the result on v1.4.2 is: no `cozy-keycloak` namespace, no Keycloak HR, and the dashboard falls back to its non-OIDC `token-proxy` container, which is **broken on v1.4.2** — the container starts, never binds `:8000` (connection refused, zero logs), and is killed by its own `/ping` liveness probe every ~45 s → CrashLoopBackOff → the `cozy-dashboard/dashboard` HR fails install → `flux-plunger` and the rest of the chain hang. The cluster reports 88/90 HRs Ready and looks "almost done" forever. + +So for a usable dashboard the skill must enable OIDC. 
This is exactly what cozystack's own e2e (`hack/e2e-install-cozystack.bats`) does — patch the root tenant host, then enable OIDC and expose Keycloak: + +```bash +# Enable OIDC and expose keycloak (do this once, after the Package exists). +kubectl --context $CTX patch package cozystack.cozystack-platform --type merge \ + --patch '{"spec":{"components":{"platform":{"values":{"authentication":{"oidc":{"enabled":true}}}}}}}' + +# keycloak must be in publishing.exposedServices so its public ingress (and +# therefore its LE cert + issuer URL) exists; api+dashboard alone are not enough. +kubectl --context $CTX patch package cozystack.cozystack-platform --type merge \ + --patch '{"spec":{"components":{"platform":{"values":{"publishing":{"exposedServices":["api","dashboard","keycloak"]}}}}}}' +``` + +Better: bake both into the Platform Package CR written in Phase 4/7 from the start (`authentication.oidc.enabled: true` and `keycloak` in `exposedServices`) so there is no second reconcile. The Phase 4 intake should collect a **dashboard auth** decision — OIDC/Keycloak (recommended; the only path with a working dashboard on 1.4.2) vs none (API-only, no web dashboard) — and only enable OIDC when the operator wants the dashboard. When OIDC is enabled, Keycloak needs a working LE cert for `keycloak.${HOST}`, so the same DNS/port-80 preconditions as the dashboard host apply (Phase 4 publishing gate already covers this — just make sure `keycloak.${HOST}` is inside the wildcard). + Skip the root-Tenant patch entirely on `isp-hosted` or when the `system` bundle was disabled in Phase 4 — there is no root Tenant CR in those modes. Watch loop (per 30 s poll): ```bash # 1) Has the root Tenant CR landed? If yes and not yet patched, patch it. +# Set BOTH spec.host and spec.ingress. The root tenant ships with +# spec.host: "" and does NOT inherit publishing.host from the Platform +# Package (verified on v1.4.2). 
With an empty host the per-tenant ingress +# objects (dashboard.${HOST}, keycloak.${HOST}, …) render against an empty +# domain and Keycloak/dashboard never get usable URLs. $HOST is the +# publishing.host collected in Phase 4. if kubectl --context $CTX --namespace tenant-root get tenants.apps.cozystack.io root \ --output jsonpath='{.metadata.name}' 2>/dev/null | grep -q '^root$'; then - CURRENT=$(kubectl --context $CTX --namespace tenant-root get tenants.apps.cozystack.io root \ - --output jsonpath='{.spec.ingress}') - if [ "$CURRENT" != "true" ]; then + CUR_INGRESS=$(kubectl --context $CTX --namespace tenant-root get tenants.apps.cozystack.io root \ + --output jsonpath='{.spec.ingress}') + CUR_HOST=$(kubectl --context $CTX --namespace tenant-root get tenants.apps.cozystack.io root \ + --output jsonpath='{.spec.host}') + if [ "$CUR_INGRESS" != "true" ] || [ "$CUR_HOST" != "$HOST" ]; then kubectl --context $CTX --namespace tenant-root patch tenants.apps.cozystack.io root \ - --type=merge --patch '{"spec":{"ingress":true}}' - echo "patched tenants/root.spec.ingress=true at $(TZ=UTC date -Iseconds)" + --type=merge --patch "{\"spec\":{\"ingress\":true,\"host\":\"${HOST}\"}}" + echo "patched tenants/root spec.host=${HOST} ingress=true at $(TZ=UTC date -Iseconds)" fi fi @@ -693,6 +727,8 @@ kubectl --context $CTX get hr --all-namespaces \ --output jsonpath='{range .items[?(@.status.conditions[?(@.type=="Ready" && @.status!="True")])]}{.metadata.namespace}/{.metadata.name} {end}' ``` +On the full `system`-bundle path you may also want the root tenant's `etcd`/`monitoring`/`seaweedfs` services (this is what cozystack's own `hack/e2e-install-cozystack.bats` patches): extend the patch to `{"spec":{"ingress":true,"host":"","monitoring":true,"etcd":true,"seaweedfs":true}}` when those were selected in Phase 4. Leave them at their defaults otherwise. + ```text HelmRelease $NS/$NAME has been Failing for $T minutes. 
Last condition: @@ -775,11 +811,27 @@ kubectl --context $CTX --namespace cozy-linstor exec deploy/linstor-controller - # Expect one ZFS row per storage-providing node with non-zero Capacity. ``` -## Phase 8.6 — Default StorageClasses (cozystack v1.3.x compatibility) +## Phase 8.6 — Default StorageClasses -Skip on `cluster.cozystack.installer_version` ≥ `1.4.0`. The cozystack `tenants.apps.cozystack.io` CRD in v1.4+ exposes `spec.storageClasses` and the operator creates the StorageClasses based on the tenant declaration. v1.3.x does **not** do this — the cluster reaches "all HRs Ready" with zero StorageClasses, and every stateful tenant workload sits in `Pending: pod has unbound immediate PersistentVolumeClaims` until the operator applies them by hand. +**Gate on the live cluster, not on a version number.** Earlier guidance skipped this phase on `installer_version ≥ 1.4.0` on the assumption that the `tenants.apps.cozystack.io` CRD exposes `spec.storageClasses` and the operator creates the StorageClasses from the tenant declaration. That assumption is **false on at least v1.4.2** — the shipped tenant CRD has no `storageClasses` field (`kubectl get crd tenants.apps.cozystack.io -o yaml | grep -c storageClass` → `0`), nothing auto-creates StorageClasses, and the cluster reaches "all HRs Ready" with `kubectl get storageclass` empty. Every stateful workload (keycloak-db, etcd, seaweedfs, vmstorage/vlstorage) then sits in `Pending: unbound immediate PersistentVolumeClaims`, which cascades: keycloak CrashLoops with no DB → cozystack-api/controller/dashboard never go Ready. -The skill writes two StorageClasses by default for v1.3.x: +So the correct gate is a live check, not a version branch: + +```bash +# Only create defaults if the cluster has none AND nothing else owns the names. 
+EXISTING_SC=$(kubectl --context $CTX get storageclass --output name 2>/dev/null | wc -l | tr -d ' ') +if [ "$EXISTING_SC" -gt 0 ]; then + echo "StorageClasses already present — skip (operator or a future chart created them):" + kubectl --context $CTX get storageclass +else + echo "No StorageClasses — applying linstor defaults (see manifest below)." + # apply storageclasses-default.yaml +fi +``` + +If a future cozystack release does start auto-creating StorageClasses, the live check skips this phase automatically — no version bump to the skill needed. Until then, the skill creates them on every version where the cluster comes up empty. + +The skill writes two StorageClasses: ```yaml # /storageclasses-default.yaml @@ -823,6 +875,8 @@ kubectl --context $CTX get storageclass # replicated (default) linstor.csi.linbit.com ... true ``` +**Timing — create the StorageClasses inside the Phase 8 watch loop, not after it.** Apply them as soon as `local`/`replicated` are absent and the LINSTOR pools are registered (same gate as the inline pool registration), NOT after "all HRs Ready". Stateful HRs in the `paas`/`monitoring` bundles (keycloak, etcd, seaweedfs, vmstorage) request PVCs that stay `Pending` until a default StorageClass exists, so an "all-HRs-Ready → then create SCs" ordering deadlocks the watch loop the same way the LINSTOR pool registration would. Folding SC creation into the loop (gated on `linstor-controller` Ready) lets those PVCs bind and the dependent HRs converge. One subtlety: a PVC created with no `storageClassName` **before** a default SC exists is left without a class. Kubernetes fills the class in retroactively once a default appears only on releases with retroactive default StorageClass assignment (GA in v1.28), and never for a PVC that sets `storageClassName: ""` explicitly. Every cozystack chart pins `storageClassName` explicitly (`replicated`), so in practice the Pending PVCs bind as soon as the named class appears. If you do hit a genuinely class-less Pending PVC that stays unbound, it must be recreated after the default exists.
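The timing rule above can be sketched as a small per-poll decision. This is only an illustration: the helper name `should_apply_default_sc` is hypothetical and not part of the skill, and it assumes the caller has already counted the existing StorageClasses and determined whether `linstor-controller` is Ready with its pools registered.

```bash
# Hypothetical helper (not part of the skill): decide on each watch-loop poll
# whether the default StorageClasses should be applied now.
#   $1 = number of StorageClasses currently in the cluster
#   $2 = "true" once linstor-controller is Ready and the pools are registered
should_apply_default_sc() {
  # Apply only when the cluster is empty AND LINSTOR can actually serve PVCs.
  [ "$1" -eq 0 ] && [ "$2" = "true" ]
}

# Usage inside the loop (assumes $CTX is set and LINSTOR_READY was computed
# earlier in the same poll; both are assumptions of this sketch):
#   SC_COUNT=$(kubectl --context "$CTX" get storageclass --output name 2>/dev/null | wc -l | tr -d ' ')
#   if should_apply_default_sc "$SC_COUNT" "$LINSTOR_READY"; then
#     echo "applying default StorageClasses"
#   fi
```

Keeping the decision separate from the `kubectl` calls means the gate can be checked without a live cluster.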
+ `replicated` is marked as the default; `local` is a single-replica fallback for system workloads that don't need replication. On clusters with fewer than 3 storage-providing nodes, drop `placementCount` for `replicated` to match — the skill auto-derives this from `cozystack.storage.nodes[]` count. ## Phase 9 — Post-install verification diff --git a/plugins/cozystack/skills/cluster-install/references/values-template.md b/plugins/cozystack/skills/cluster-install/references/values-template.md index 9cd6484..903963b 100644 --- a/plugins/cozystack/skills/cluster-install/references/values-template.md +++ b/plugins/cozystack/skills/cluster-install/references/values-template.md @@ -2,7 +2,11 @@ ## Installer chart values -The cozy-installer Helm release lives in `kube-system` (the chart itself templates `Namespace cozy-system`, so the release secret can't live there). Two keys matter at install time: +The cozy-installer Helm release lives in `cozy-system`. Since cozystack v1.4.0 (cozystack/cozystack#2508) the chart no longer ships a `Namespace cozy-system` resource on the helm-install path — instead the caller passes `--create-namespace` and a pre-install hook Job (`cozy-system-labeler`, running in `kube-system`, hostNetwork, tolerant of NotReady/CNI-not-ready nodes) stamps the freshly-created namespace with the PodSecurity `enforce=privileged` and `cozystack.io/system=true` labels. + +> **v1.3.x vs v1.4.x — do not mix these up.** On v1.3.x the release lived in `kube-system` and the chart templated the `Namespace cozy-system` itself, so `--create-namespace` was forbidden (it collided with the chart's own Namespace). On v1.4.0+ that is inverted: the release lives in `cozy-system`, the chart does NOT template the namespace, and `--create-namespace` is REQUIRED. 
Passing the old `--namespace kube-system` (no `--create-namespace`) against a v1.4 chart makes the `cozy-system-labeler` pre-install hook fail with `namespaces "cozy-system" not found`, and the whole install aborts before the operator deploys. Pick the form that matches `installer_version`. + +Two keys matter at install time: ```yaml cozystackOperator: @@ -24,17 +28,31 @@ cozystack: Install command shape: ```bash +# v1.4.0+ (current): release in cozy-system, --create-namespace REQUIRED, +# the pre-install labeler hook stamps PSA=privileged on the new namespace. helm --kube-context $CTX upgrade --install cozy-installer \ oci://ghcr.io/cozystack/cozystack/cozy-installer \ --version $INSTALLER_VERSION \ - --namespace kube-system \ + --namespace cozy-system --create-namespace \ --set cozystackOperator.variant=$INSTALLER_VARIANT \ --set cozystack.apiServerHost=$API_HOST \ --set cozystack.apiServerPort=$API_PORT \ --wait --timeout 10m ``` -If `cozy-system` already exists, the chart refuses with `invalid ownership metadata`. Adopt it first: +For a v1.3.x install the form differs (release in `kube-system`, NO `--create-namespace` — the chart templates the namespace itself): + +```bash +# v1.3.x ONLY — do not use against a v1.4+ chart. 
+helm --kube-context $CTX upgrade --install cozy-installer \ + oci://ghcr.io/cozystack/cozystack/cozy-installer \ + --version $INSTALLER_VERSION \ + --namespace kube-system \ + --set cozystackOperator.variant=$INSTALLER_VARIANT \ + --wait --timeout 10m +``` + +If `cozy-system` already exists **and lacks helm ownership metadata** (a stale bare namespace from a previous attempt), the v1.4 `--create-namespace` install adopts it cleanly only when the labels don't conflict; if helm refuses with `invalid ownership metadata`, adopt it first: ```bash kubectl --context $CTX patch namespace cozy-system --type=merge --patch '{ @@ -42,13 +60,13 @@ kubectl --context $CTX patch namespace cozy-system --type=merge --patch '{ "labels": {"app.kubernetes.io/managed-by": "Helm"}, "annotations": { "meta.helm.sh/release-name": "cozy-installer", - "meta.helm.sh/release-namespace": "kube-system" + "meta.helm.sh/release-namespace": "cozy-system" } } }' ``` -(Skip adoption if the namespace doesn't exist; `--create-namespace` would conflict with the chart's own `Namespace` template, so don't pass it.) +If the namespace is owned by another release, **refuse** — do not relabel. ## Platform Package CR @@ -73,6 +91,15 @@ spec: enabled: true naas: enabled: true + authentication: + oidc: + enabled: true # REQUIRED for a working web dashboard. + # Defaults to false; the isp-full* overlays do NOT turn it on. + # When false, no Keycloak is deployed and the dashboard falls back + # to its token-proxy container, which is broken on v1.4.2 (never + # binds :8000, CrashLoops on its own liveness probe). Set true to + # deploy Keycloak and switch the dashboard to oauth2-proxy. Omit + # (leave false) only for an API-only install with no web dashboard. 
networking: podCIDR: "10.244.0.0/16" # cozystack default, from packages/core/platform/values.yaml podGateway: "10.244.0.1" # first IP of podCIDR @@ -86,11 +113,16 @@ spec: exposedServices: - api - dashboard + - keycloak # REQUIRED when authentication.oidc.enabled + # is true — gives Keycloak its public + # ingress + LE cert + issuer URL. externalIPs: - 192.0.2.10 exposure: externalIPs # or "loadBalancer" ``` +> **Root tenant `spec.host` does not inherit `publishing.host`.** On v1.4.2 the root tenant CR ships with `spec.host: ""` and is not back-filled from the Package's `publishing.host`. The skill must patch it explicitly in the Phase 8 watch loop (`spec.host` + `spec.ingress: true`), otherwise the per-tenant ingress objects render against an empty domain and Keycloak/dashboard never get usable URLs. See SKILL.md Phase 8. + ## extractedprism (generic kube-apiserver HA) On the `generic` variant, `cozystack:cluster-install` Phase 5.6 installs the extractedprism DaemonSet **before** the cozy-installer chart so the operator's apiServerHost already resolves to a healthy CP endpoint when cozystack-operator starts dialing. From 4a4da4778c342560103da09548bbdcbbdde7854a Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Mon, 1 Jun 2026 12:19:22 +0300 Subject: [PATCH 2/5] fix(cluster-install): normalise installer OCI tag in values-template helm commands The cozy-installer OCI chart tags are published as X.Y.Z with no leading v (verified against the ghcr registry), but both helm command examples in the reference passed $INSTALLER_VERSION verbatim. An operator-supplied v1.4.2 would not match the registry tag 1.4.2 and the install would fail. Strip the leading v into $INSTALLER_VERSION_OCI before --version, matching what SKILL.md Phase 6 already does. 
Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- .../skills/cluster-install/references/values-template.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/plugins/cozystack/skills/cluster-install/references/values-template.md b/plugins/cozystack/skills/cluster-install/references/values-template.md index 903963b..bfc0133 100644 --- a/plugins/cozystack/skills/cluster-install/references/values-template.md +++ b/plugins/cozystack/skills/cluster-install/references/values-template.md @@ -30,9 +30,11 @@ Install command shape: ```bash # v1.4.0+ (current): release in cozy-system, --create-namespace REQUIRED, # the pre-install labeler hook stamps PSA=privileged on the new namespace. +# OCI chart tags are X.Y.Z (no leading v) — normalise before --version. +INSTALLER_VERSION_OCI="${INSTALLER_VERSION#v}" helm --kube-context $CTX upgrade --install cozy-installer \ oci://ghcr.io/cozystack/cozystack/cozy-installer \ - --version $INSTALLER_VERSION \ + --version "$INSTALLER_VERSION_OCI" \ --namespace cozy-system --create-namespace \ --set cozystackOperator.variant=$INSTALLER_VARIANT \ --set cozystack.apiServerHost=$API_HOST \ @@ -44,9 +46,11 @@ For a v1.3.x install the form differs (release in `kube-system`, NO `--create-na ```bash # v1.3.x ONLY — do not use against a v1.4+ chart. +# OCI chart tags are X.Y.Z (no leading v) — normalise before --version. 
+INSTALLER_VERSION_OCI="${INSTALLER_VERSION#v}" helm --kube-context $CTX upgrade --install cozy-installer \ oci://ghcr.io/cozystack/cozystack/cozy-installer \ - --version $INSTALLER_VERSION \ + --version "$INSTALLER_VERSION_OCI" \ --namespace kube-system \ --set cozystackOperator.variant=$INSTALLER_VARIANT \ --wait --timeout 10m From bd6727179e7fe54d14e0c0aa511072df923c2d52 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Mon, 1 Jun 2026 12:20:49 +0300 Subject: [PATCH 3/5] fix(cluster-install): align operator-facing snippets to v1.4 cozy-system namespace Phase 6 was updated to the v1.4 release layout (cozy-system, --create-namespace, pre-install labeler hook), but the plan-view summary, the artifacts summary, and the issue reproduction template still described the v1.3 form where cozy-installer lived in kube-system and the chart templated the namespace itself. An operator following those leftover snippets would run --namespace kube-system against a v1.4 chart and the labeler hook would abort the install. Bring all three in line with Phase 6. 
Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- plugins/cozystack/skills/cluster-install/SKILL.md | 8 ++++---- .../skills/cluster-install/references/issue-templates.md | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/plugins/cozystack/skills/cluster-install/SKILL.md b/plugins/cozystack/skills/cluster-install/SKILL.md index 6d670bc..bddbf6b 100644 --- a/plugins/cozystack/skills/cluster-install/SKILL.md +++ b/plugins/cozystack/skills/cluster-install/SKILL.md @@ -373,7 +373,7 @@ cozystack:cluster-install plan context: $CTX ($API_URL) installer release: oci://ghcr.io/cozystack/cozystack/cozy-installer:$INSTALLER_VERSION_OCI (OCI tag = git tag with the v stripped) installer variant: $INSTALLER_VARIANT -helm release ns: kube-system (chart templates Namespace cozy-system itself) +helm release ns: cozy-system (--create-namespace; labeler hook stamps PSA — v1.4+) platform variant: $PLATFORM_VARIANT bundles: $BUNDLES_CSV @@ -408,8 +408,8 @@ storage (ZFS): actions on Continue: 1. Storage provisioning per node (Phase 5.5; one approval per node) 2. (generic only, unless --no-extractedprism) install extractedprism DaemonSet for kube-apiserver HA (~1 min) - 3. (if cozy-system namespace exists but unowned) adopt namespace into kube-system/cozy-installer - 4. helm upgrade --install cozy-installer ... --namespace kube-system (~2 min) + 3. (if cozy-system namespace exists but unowned) adopt namespace into cozy-system/cozy-installer + 4. helm upgrade --install cozy-installer ... --namespace cozy-system --create-namespace (~2 min) 5. wait deploy/cozystack-operator Available; wait CRD packages.cozystack.io Established 6. kubectl apply --filename /tmp/.../platform-package.yaml 7. 
wait root Tenant CR, patch spec.ingress=true (~3 min — required for Phase 8 to ever finish; breaks the OIDC chicken-and-egg) @@ -1045,7 +1045,7 @@ credentials: artifacts on disk: values file: /cozystack-platform-package.yaml - helm release: kube-system/cozy-installer + helm release: cozy-system/cozy-installer cluster-scoped: package.cozystack.io/cozystack.cozystack-platform handy commands: diff --git a/plugins/cozystack/skills/cluster-install/references/issue-templates.md b/plugins/cozystack/skills/cluster-install/references/issue-templates.md index c5b9245..9a7f91f 100644 --- a/plugins/cozystack/skills/cluster-install/references/issue-templates.md +++ b/plugins/cozystack/skills/cluster-install/references/issue-templates.md @@ -62,7 +62,7 @@ All public. English. Singular first person. No private cluster names or client i ### Steps to reproduce 1. Fresh v cluster, bootstrapped per `docs/v/install/kubernetes//`. -2. `helm upgrade --install cozy-installer oci://ghcr.io/cozystack/cozystack/cozy-installer --version --namespace kube-system --set cozystackOperator.variant= --set cozystack.apiServerHost=` +2. `helm upgrade --install cozy-installer oci://ghcr.io/cozystack/cozystack/cozy-installer --version --namespace cozy-system --create-namespace --set cozystackOperator.variant= --set cozystack.apiServerHost=` (v1.4+; on v1.3.x use `--namespace kube-system` with no `--create-namespace`) 3. Apply Platform Package with `spec.variant: `. Full Package YAML attached. 4. Observe `kubectl get hr --all-namespaces` — . From 70ca303b21ccaaf2f61427ea682892b4cecc6755 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Mon, 1 Jun 2026 12:22:36 +0300 Subject: [PATCH 4/5] fix(cluster-install): correct StorageClass auto-creation claim for v1.4 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 8.6 was changed to create default StorageClasses whenever the live cluster comes up empty, on the finding that v1.4.2 does not auto-create them. 
But two sibling references still asserted the opposite — that v1.4+ exposes tenants.apps.cozystack.io spec.storageClasses and the operator creates the classes from the tenant declaration. That field is absent from the shipped tenant CRD on v1.4.2 and from the monorepo source through current HEAD, so nothing auto-creates StorageClasses on v1.4 either. Align the pitfall note and the wizard preflight entry with the live-gate guidance so the bundle no longer carries contradictory instructions. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- .../skills/cluster-install/references/provider-pitfalls.md | 6 +++--- plugins/cozystack/skills/wizard/SKILL.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/plugins/cozystack/skills/cluster-install/references/provider-pitfalls.md b/plugins/cozystack/skills/cluster-install/references/provider-pitfalls.md index 5067f87..eca3267 100644 --- a/plugins/cozystack/skills/cluster-install/references/provider-pitfalls.md +++ b/plugins/cozystack/skills/cluster-install/references/provider-pitfalls.md @@ -78,13 +78,13 @@ wipefs --all "$DEVICE" The skill's Phase 5.5 step 7 (pre-existing-data check) catches this before `zpool create` and refuses to proceed without operator approval of the wipe. -## Cozystack v1.3.x does not create StorageClasses automatically +## Cozystack does not create StorageClasses automatically (v1.3.x and v1.4.2) **Symptom**: cluster reaches "all HRs Ready", but every stateful tenant workload sits in `Pending: pod has unbound immediate PersistentVolumeClaims`. `kubectl get storageclass` returns no rows. -**Mechanism**: in v1.3.x, neither the cozy-installer chart nor the Platform Package emits StorageClasses; they expect the operator to apply them by hand after `linstor storage-pool create`. v1.4+ exposes `tenants.apps.cozystack.io spec.storageClasses` and the operator creates them based on the tenant declaration. 
+**Mechanism**: neither the cozy-installer chart nor the Platform Package emits StorageClasses; the operator must apply them by hand after `linstor storage-pool create`. An earlier assumption that v1.4+ exposes `tenants.apps.cozystack.io spec.storageClasses` and auto-creates the classes is **false** — the field is absent from the shipped tenant CRD on v1.4.2 (`kubectl get crd tenants.apps.cozystack.io -o yaml | grep -c storageClass` → `0`) and from the monorepo source through current HEAD, so nothing auto-creates them on v1.4 either. -**Fix**: SKILL.md Phase 8.6 creates `local` (placementCount=1) and `replicated` (placementCount=3, isDefaultClass=true) for v1.3.x. Skip on v1.4+. +**Fix**: SKILL.md Phase 8.6 creates `local` (placementCount=1) and `replicated` (placementCount=3, isDefaultClass=true) whenever the live cluster comes up with no StorageClasses. The gate is the live `kubectl get storageclass` check, not a version number, so it self-skips if a future release ever starts creating them. ## Cozystack v1.3.3 `isp-full` bundle does not include Keycloak diff --git a/plugins/cozystack/skills/wizard/SKILL.md b/plugins/cozystack/skills/wizard/SKILL.md index 7c6007f..c73ba5f 100644 --- a/plugins/cozystack/skills/wizard/SKILL.md +++ b/plugins/cozystack/skills/wizard/SKILL.md @@ -453,7 +453,7 @@ cozystack:wizard — known landmines for your specific combination 3. [MEDIUM] cozystack v1.3.3 does not create StorageClasses automatically source: cozystack/cozystack@v1.3.3 packages/system/linstor/ templates/storageclass.yaml.disabled - why: tenants CRD spec.storageClasses lands only in v1.4. + why: tenants CRD has no spec.storageClasses field (absent through v1.4.2) — nothing auto-creates SCs. mitigation: cluster-install Phase 8.6 will apply local + replicated SCs. 4. 
[LOW / stale-check] "OCI dashboard install fails after 5 min" From eed8670e3d04b59c1b964fe6ebf797efed92c9d4 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Mon, 1 Jun 2026 12:23:47 +0300 Subject: [PATCH 5/5] fix(cluster-install): patch root tenant host alongside ingress everywhere The Phase 8 watch loop was updated to set both spec.host and spec.ingress on the root tenant, because it ships with spec.host: "" and does not inherit publishing.host. Every other spot still patched or described ingress alone: the plan-view action list, the standing rule in the lessons section, the system-bundle extended-patch example (which also used a placeholder instead of ${HOST}), the After Package apply reference snippet, and the stalled-install recovery snippet in known-failures. Set host in all of them so no example leaves the per-tenant ingress objects rendering against an empty domain. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- plugins/cozystack/skills/cluster-install/SKILL.md | 6 +++--- .../skills/cluster-install/references/known-failures.md | 6 +++--- .../skills/cluster-install/references/values-template.md | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/plugins/cozystack/skills/cluster-install/SKILL.md b/plugins/cozystack/skills/cluster-install/SKILL.md index bddbf6b..8605e17 100644 --- a/plugins/cozystack/skills/cluster-install/SKILL.md +++ b/plugins/cozystack/skills/cluster-install/SKILL.md @@ -412,7 +412,7 @@ actions on Continue: 4. helm upgrade --install cozy-installer ... --namespace cozy-system --create-namespace (~2 min) 5. wait deploy/cozystack-operator Available; wait CRD packages.cozystack.io Established 6. kubectl apply --filename /tmp/.../platform-package.yaml - 7. wait root Tenant CR, patch spec.ingress=true (~3 min — required for Phase 8 to ever finish; breaks the OIDC chicken-and-egg) + 7. 
wait root Tenant CR, patch spec.host + spec.ingress=true (~3 min — required for Phase 8 to ever finish; breaks the OIDC chicken-and-egg) 8. poll HRs every 30s until all Ready=True (~30–60 min) 9. print access summary @@ -727,7 +727,7 @@ kubectl --context $CTX get hr --all-namespaces \ --output jsonpath='{range .items[?(@.status.conditions[?(@.type=="Ready" && @.status!="True")])]}{.metadata.namespace}/{.metadata.name} {end}' ``` -On the full `system`-bundle path you may also want the root tenant's `etcd`/`monitoring`/`seaweedfs` services (this is what cozystack's own `hack/e2e-install-cozystack.bats` patches): extend the patch to `{"spec":{"ingress":true,"host":"","monitoring":true,"etcd":true,"seaweedfs":true}}` when those were selected in Phase 4. Leave them at their defaults otherwise. +On the full `system`-bundle path you may also want the root tenant's `etcd`/`monitoring`/`seaweedfs` services (this is what cozystack's own `hack/e2e-install-cozystack.bats` patches): extend the patch to `{"spec":{"ingress":true,"host":"${HOST}","monitoring":true,"etcd":true,"seaweedfs":true}}` when those were selected in Phase 4. Leave them at their defaults otherwise. ```text HelmRelease $NS/$NAME has been Failing for $T minutes. @@ -1086,7 +1086,7 @@ If any phase hits a fatal failure that looks like an upstream bug or doc gap, fo - NEVER bootstrap Talos nodes or invoke `boot-to-talos` / `talm` from inside this skill — that flow lives in `/cozystack:talos-bootstrap`. Refuse and hand off. - NEVER auto-rollback a partially provisioned storage state — print backout commands and let the operator decide. - NEVER accept a custom `publishing.host` without an explicit operator confirmation that they own the domain and will configure wildcard DNS — the HTTP-01 cert solver fails silently otherwise. nip.io patterns skip this gate because nip.io is publicly hosted DNS. 
-- ALWAYS patch `tenants/root.spec.ingress=true` from inside the Phase 8 watch loop as soon as the CR appears, on `system`-bundle installs. The OIDC chicken-and-egg makes Phase 8 unreachable otherwise — dashboard / keycloak / flux-plunger loop forever, every other downstream HR stalls on the missing root ingress. The CR can appear at any point during the watch loop; do not gate the patch behind a fixed pre-Phase-8 wait. +- ALWAYS patch `tenants/root` with BOTH `spec.host=$HOST` and `spec.ingress=true` from inside the Phase 8 watch loop as soon as the CR appears, on `system`-bundle installs. The root tenant ships with `spec.host: ""` and does not inherit `publishing.host`, so an ingress-only patch leaves every per-tenant ingress object rendering against an empty domain. The OIDC chicken-and-egg makes Phase 8 unreachable otherwise — dashboard / keycloak / flux-plunger loop forever, every other downstream HR stalls on the missing root ingress. The CR can appear at any point during the watch loop; do not gate the patch behind a fixed pre-Phase-8 wait. - ALWAYS read variant overlays and `requirements.md` before declaring "this looks fine" — variant-specific checks (CP-label value, ZFS availability, KubeOVN MASTER_NODES) are easy to miss. - ALWAYS pull live data over cached assumption: `kubectl get` over "I think this is …". - ALWAYS write Phase 4 collected values to disk in `/cozystack-platform-package.yaml` before applying — the file is part of the diagnostic bundle if Phase 8 fails. ZFS pool registration is stored separately under `/.state.yaml` `cozystack.storage.nodes[]` and replayed by the Phase 8 post-Ready hook (there is no `LinstorSatelliteConfiguration` CR for the ZFS path). 
diff --git a/plugins/cozystack/skills/cluster-install/references/known-failures.md b/plugins/cozystack/skills/cluster-install/references/known-failures.md index 33fa08e..965d7a2 100644 --- a/plugins/cozystack/skills/cluster-install/references/known-failures.md +++ b/plugins/cozystack/skills/cluster-install/references/known-failures.md @@ -36,11 +36,11 @@ The root ingress controller doesn't start until `tenants.apps.cozystack.io/root` This is a chicken-and-egg of the `isp-full*` variant + OIDC combination, not a bug in any single component: -- Platform Package does not patch `tenant root.spec.ingress`. +- Platform Package does not patch `tenant root.spec.host` / `spec.ingress`. - The cozystack dependency graph is built so gatekeeper can't come up before ingress, and dashboard can't come up before gatekeeper. - But flux-plunger waits on dashboard, which waits on ingress, which waits on the missing manual patch. -`cozystack:cluster-install` Phase 8 patches `tenants/root.spec.ingress=true` inline as soon as the CR appears in the watch loop, which avoids the trap entirely on a fresh install regardless of when the CRD lands relative to other HRs. +`cozystack:cluster-install` Phase 8 patches `tenants/root` with both `spec.host` and `spec.ingress=true` inline as soon as the CR appears in the watch loop, which avoids the trap entirely on a fresh install regardless of when the CRD lands relative to other HRs. 
**Recovery on an install that has already stalled in Phase 8** @@ -49,7 +49,7 @@ kubectl --context $CTX --namespace tenant-root wait tenants.apps.cozystack.io/ro --for=jsonpath='{.metadata.name}'=root --timeout=300s kubectl --context $CTX --namespace tenant-root patch tenants.apps.cozystack.io root \ - --type=merge --patch '{"spec":{"ingress":true}}' + --type=merge --patch "{\"spec\":{\"ingress\":true,\"host\":\"${HOST}\"}}" ``` Within ~2 min: diff --git a/plugins/cozystack/skills/cluster-install/references/values-template.md b/plugins/cozystack/skills/cluster-install/references/values-template.md index bfc0133..0fcb3a0 100644 --- a/plugins/cozystack/skills/cluster-install/references/values-template.md +++ b/plugins/cozystack/skills/cluster-install/references/values-template.md @@ -217,13 +217,13 @@ Use `nip.io` dash notation: if the LB IP is `192.0.2.10`, set `publishing.host: ## After Package apply -If `system` bundle is on and `cozystack_tenant_root_ingress` semantics are desired, patch the root tenant after the operator creates it: +If `system` bundle is on and `cozystack_tenant_root_ingress` semantics are desired, patch the root tenant after the operator creates it — set BOTH `spec.host` and `spec.ingress`, since the root tenant ships with `spec.host: ""` and does not inherit `publishing.host` (see the note above). The skill does this from inside the Phase 8 watch loop (SKILL.md Phase 8), not as a separate post-apply step: ```bash kubectl --context $CTX wait tenants.apps.cozystack.io/root --namespace tenant-root \ --for=jsonpath='{.metadata.name}'=root --timeout=300s kubectl --context $CTX --namespace tenant-root patch tenants.apps.cozystack.io root \ - --type=merge --patch '{"spec":{"ingress":true}}' + --type=merge --patch "{\"spec\":{\"ingress\":true,\"host\":\"${HOST}\"}}" ``` This is what creates the `IngressClass` and brings up `ingress-nginx`.
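A minimal sketch of the combined patch described above, assuming a hypothetical helper named `build_root_tenant_patch` (it is not part of the skill or of Cozystack). It only builds the JSON merge patch that carries both `spec.host` and `spec.ingress`, and it refuses an empty host so the ingress-only mistake cannot recur. `example.org` is a placeholder value.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): emit the JSON merge patch for
# tenants/root with BOTH spec.ingress and spec.host set.
build_root_tenant_patch() {
  local host="$1"
  # An empty host would reproduce the original problem (ingress objects
  # rendered against an empty domain), so reject it up front.
  if [ -z "$host" ]; then
    echo "host must not be empty" >&2
    return 1
  fi
  printf '{"spec":{"ingress":true,"host":"%s"}}' "$host"
}

# Print the patch for a placeholder host.
build_root_tenant_patch "example.org"
echo

# Against a live cluster the output would be passed to the same kubectl
# call the diff uses (left commented so the sketch has no side effects):
# kubectl --context "$CTX" --namespace tenant-root patch tenants.apps.cozystack.io root \
#   --type=merge --patch "$(build_root_tenant_patch "$HOST")"
```

Building the patch in one place keeps the quoting in a single `printf` instead of repeating the escaped double-quoted string in each snippet.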