catalyst-onboard CLI

Automates the full onboarding of HTTP apps on the Catalyst Kubernetes Platform — from Harbor project creation to DNS, OPNsense firewall rules, and ArgoCD GitOps deployment on two clusters.

⚡ Python 3.11+ · Typer · httpx · pydantic-settings

📖 What is catalyst-onboard?

A single CLI command that replaces a multi-step manual process: creating a Harbor registry project, scaffolding a Helm chart from a template, pushing to Git, creating ArgoCD Applications on two clusters, syncing until healthy, creating Cloudflare DNS records, and adding OPNsense WAN firewall rules.

🔁 Idempotent

Every step can be re-run safely. Re-run new on a partially failed onboard — it will skip what already exists.

🎯 Selective

Use --skip-harbor, --skip-scaffold, --skip-dns, --skip-opnsense to jump to a specific step.

🏗️ Dual-cluster

Every app is deployed on catalyst-01 and catalyst-02, named <app>-c1 / <app>-c2.

🔍 Dry-run

Use --dry-run to preview the full plan without applying any changes.

🏛️ Architecture

The CLI runs six steps in sequence against five external systems (Harbor, Git, ArgoCD, Cloudflare, OPNsense). No local state is stored; everything is discovered at runtime from ArgoCD.

1. Harbor

Creates the container registry project and grants argocd Maintainer access.

2. Git / Helm Chart

Clones the repo, copies example-https-app, replaces all references with <app-name>, commits on feature/<app>, pushes.

3. ArgoCD Applications

Creates <app>-c1 and <app>-c2 with Helm parameter overrides (image, tag, port, VIP, hostname, CF token, Harbor creds).

4. Sync + Wait

Triggers ArgoCD sync on both clusters and polls until Healthy (up to SYNC_TIMEOUT_SECONDS).

5. Cloudflare DNS

Idempotent upsert of A <hostname> → <gatewayIp> in the catalyst.xcoinfra.net zone.

6. OPNsense WAN rules

Clones the template firewall rule for ports 80/443 targeted at the new VIP, then applies.
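
Several of the steps above are described as idempotent. Taking the DNS upsert from step 5 as an example, the idea can be sketched as follows. This is a minimal illustration, not the CLI's code: `FakeDnsZone` and `upsert_a_record` are hypothetical names, and the zone is an in-memory dictionary rather than the Cloudflare API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDnsZone:
    """In-memory stand-in for a DNS zone (hypothetical, not the Cloudflare API)."""

    records: dict[str, str] = field(default_factory=dict)  # hostname -> IPv4


def upsert_a_record(zone: FakeDnsZone, hostname: str, ip: str) -> str:
    """Create, update, or leave an A record; return which action was taken."""
    current = zone.records.get(hostname)
    if current == ip:
        return "unchanged"  # a re-run finds the record in place and does nothing
    zone.records[hostname] = ip
    return "created" if current is None else "updated"


zone = FakeDnsZone()
first = upsert_a_record(zone, "my-app.catalyst.xcoinfra.net", "185.33.182.35")
second = upsert_a_record(zone, "my-app.catalyst.xcoinfra.net", "185.33.182.35")
print(first, second)  # created unchanged
```

Because the second call changes nothing, re-running a partially failed onboard is safe for this step.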

🚀 Quick Start

# Enter the CLI directory (the commands below call the venv's interpreter directly)
cd git-catalyst-apps/.tooling/catalyst-onboard

# Check connectivity to all systems
.venv/bin/python -m catalyst_onboard.cli doctor

# List used / free VIPs
.venv/bin/python -m catalyst_onboard.cli list-ips

# Preview an onboard without applying
.venv/bin/python -m catalyst_onboard.cli new \
  -n my-app -i my-image -p 8080 -t 1.0.0 --ip 185.33.182.X \
  --dry-run

⌨️ All Commands

doctor

.venv/bin/python -m catalyst_onboard.cli doctor

Checks connectivity and authentication for ArgoCD, Harbor, Cloudflare, and OPNsense.

list-apps

.venv/bin/python -m catalyst_onboard.cli list-apps [--json]

Lists all ArgoCD Applications across both clusters with sync/health status and VIP.

list-ips

.venv/bin/python -m catalyst_onboard.cli list-ips [--json]

Shows used and free VIPs in the BGP pool 185.33.182.30–60. IPs are discovered at runtime from ArgoCD — no local state.
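
As a rough illustration of how used and free VIPs can be derived without local state, the sketch below splits the pool given a mapping of IPs already in use. The mapping stands in for whatever the CLI reads from ArgoCD; `pool_addresses` and `split_pool` are hypothetical names, not functions from the tool.

```python
import ipaddress


def pool_addresses(start: str, end: str) -> list[str]:
    """Every IPv4 address from start to end, inclusive."""
    first = int(ipaddress.IPv4Address(start))
    last = int(ipaddress.IPv4Address(end))
    return [str(ipaddress.IPv4Address(n)) for n in range(first, last + 1)]


def split_pool(
    start: str, end: str, used_by_app: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return (used IPs inside the pool mapped to their app, free IPs in order)."""
    pool = pool_addresses(start, end)
    used = {ip: app for ip, app in used_by_app.items() if ip in pool}
    free = [ip for ip in pool if ip not in used]
    return used, free


# Assumed input: VIPs discovered from existing ArgoCD Applications.
used, free = split_pool(
    "185.33.182.30", "185.33.182.60", {"185.33.182.30": "demo-app"}
)
print(len(used), len(free))  # 1 30
```

The pool bounds are inclusive, so 185.33.182.30 to 185.33.182.60 holds 31 addresses.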

status

.venv/bin/python -m catalyst_onboard.cli status <app-name> [--json]

new

.venv/bin/python -m catalyst_onboard.cli new \
  -n <app-name> -i <harbor-image-name> -p <port> -t <tag> --ip 185.33.182.X \
  [--replicas N] [--dry-run] [--update] \
  [--skip-harbor] [--skip-scaffold] [--skip-dns] [--skip-opnsense]

delete

.venv/bin/python -m catalyst_onboard.cli delete <app-name> [--yes] [--dry-run]

post-deploy-check

.venv/bin/python -m catalyst_onboard.cli post-deploy-check <app-name>

Probes ArgoCD health, DNS resolution, TCP:443, and HTTPS response.

Onboard Workflow (new)

Steps run in a fixed order. Most steps have a --skip-* flag (see the table). All steps are idempotent, so re-running after a partial failure is the standard recovery path.

| # | Step | What it does | Skip flag |
|---|------|--------------|-----------|
| 1 | Preflight | Validates IP is free in ArgoCD; rejects if app already exists | (none) |
| 2 | Harbor | Creates project <app>, adds HARBOR_MAINTAINER_USER as Maintainer | --skip-harbor |
| 3 | Scaffold | Copies example-https-app Helm chart, string-replaces, commits on feature/<app> | --skip-scaffold |
| 4 | ArgoCD | Creates <app>-c1 + <app>-c2 Applications with Helm overrides | --skip-argocd |
| 5 | Sync + wait | Syncs and polls until Healthy | --skip-sync |
| 6 | Cloudflare DNS | Upserts A <hostname> → <VIP> | --skip-dns |
| 7 | OPNsense | Clones WAN rules for ports 80/443 on the new VIP | --skip-opnsense |
| 8 | Final verification | Confirm the app answers HTTPS 200 end-to-end (see Final Deployment Verification) | (none) |

⚠️ Always pass --ip explicitly. Never guess: run list-ips first to confirm the VIP is free.

⚠️ --skip-harbor skips the entire Harbor step, including add_project_member. The step is idempotent, so skipping it is rarely necessary; only use --skip-harbor if you are sure the argocd user is already a Maintainer on the project.
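
The fixed order plus skip flags can be pictured as a simple loop over named steps. The following is a sketch under assumptions: `run_steps` and the step names are illustrative, and the placeholder actions do nothing, whereas the real steps call Harbor, Git, ArgoCD and so on.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_steps(steps: list[Step], skip: set[str]) -> list[str]:
    """Run steps in order, skipping any whose name is in `skip`."""
    log: list[str] = []
    for name, action in steps:
        if name in skip:
            log.append(f"skip {name}")
            continue
        action()  # each action is expected to be idempotent
        log.append(f"done {name}")
    return log


# Placeholder actions; names loosely mirror the --skip-* flags.
steps: list[Step] = [
    ("harbor", lambda: None),
    ("scaffold", lambda: None),
    ("dns", lambda: None),
    ("opnsense", lambda: None),
]
print(run_steps(steps, skip={"harbor", "scaffold"}))
```

Since every action is idempotent, running with an empty skip set after a failure is equivalent to resuming from the first incomplete step.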

Final Deployment Verification

After new or --update, always verify the app is actually serving HTTP 200. ArgoCD reports Healthy based on pod health — it does not prove end-to-end reachability.

# Built-in check (DNS + HTTPS + ArgoCD health)
.venv/bin/python -m catalyst_onboard.cli post-deploy-check <app-name>

# Manual curl (ground truth — bypass DNS with --resolve)
curl -o /dev/null -sw "%{http_code}\n" \
  --resolve "<app>.catalyst.xcoinfra.net:443:<VIP>" \
  https://<app>.catalyst.xcoinfra.net/

If you don't get 200, diagnose in this order:

| Symptom | Likely cause | Action |
|---------|--------------|--------|
| Could not resolve host | DNS not propagated yet | Use --resolve; wait up to 300 s |
| Failed to connect | OPNsense rule missing or BGP not announcing VIP | Check get_all_rules; check list-ips |
| 000 (no response) | Pods not running / ImagePullBackOff | Run status <app>, inspect ArgoCD resource tree |
| ImagePullBackOff | Wrong image arch or Harbor pull secret missing | Check arch is amd64; check argocd is Harbor Maintainer |
| 502 / 503 with pods Ready | HTTPRoute port mismatch or NetworkPolicy blocking | Check httproutes.yml and network-policies.yml use nginx.containerPort |
| 200 on --resolve but not on DNS | DNS propagation lag | Wait; use dig @1.1.1.1 to confirm |

Testing before DNS propagates

After a new DNS record is created, Cloudflare's resolver (1.1.1.1) returns it within seconds. Google (8.8.8.8) and corporate DNS servers can take up to 300 s to pick it up.

# Check which resolver sees the record
dig <app>.catalyst.xcoinfra.net @1.1.1.1 +short   # Cloudflare
dig <app>.catalyst.xcoinfra.net @8.8.8.8 +short   # Google (may lag)

# Force resolution to known VIP — bypasses DNS entirely
curl -o /dev/null -sw "%{http_code}\n" \
  --resolve "<app>.catalyst.xcoinfra.net:443:185.33.182.X" \
  https://<app>.catalyst.xcoinfra.net/
ℹ️ post-deploy-check uses --resolve internally, so it passes even when DNS is still propagating.
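
For readers who want the same check from Python, the sketch below shows what --resolve amounts to: open the TCP connection to the VIP, but use the hostname for TLS SNI, certificate verification, and the Host header. It is an illustration using only the standard library, not the implementation of post-deploy-check; `parse_status` and `https_status_via_ip` are hypothetical names.

```python
import socket
import ssl


def parse_status(response_head: bytes) -> int:
    """Extract the status code from an HTTP/1.x status line."""
    status_line = response_head.split(b"\r\n", 1)[0]  # e.g. b"HTTP/1.1 200 OK"
    return int(status_line.split(b" ")[1])


def https_status_via_ip(hostname: str, vip: str, timeout: float = 5.0) -> int:
    """GET / over TLS, connecting to `vip` but presenting `hostname`."""
    context = ssl.create_default_context()
    request = (
        f"GET / HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    with socket.create_connection((vip, 443), timeout=timeout) as raw:
        # server_hostname drives both SNI and certificate hostname checks.
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            tls.sendall(request)
            head = b""
            while b"\r\n" not in head:  # read until the status line is complete
                chunk = tls.recv(4096)
                if not chunk:
                    break
                head += chunk
            return parse_status(head)
```

A call such as `https_status_via_ip("my-app.catalyst.xcoinfra.net", "185.33.182.35")` would return the status code when the app is reachable, and raise a socket or TLS error otherwise.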

🔄 Update Workflow (new --update)

Use --update to redeploy an existing app — e.g. after a new image tag or replica change.

| Step | Fresh new | --update |
|------|-----------|----------|
| Preflight app exists | Fails | Warns, continues |
| Harbor project | Create | No-op if exists |
| Git scaffold | New branch | Re-uses existing branch |
| ArgoCD Applications | Create | Upsert (overwrite params) |
| DNS / OPNsense | Upsert | Same (idempotent) |
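
The first row of the comparison, where preflight fails on a fresh run but only warns under --update, could be sketched like this. The function name, exception, and message wording are illustrative assumptions, not taken from the CLI.

```python
class AppAlreadyExists(Exception):
    """Raised when a fresh onboard targets an app that already exists."""


def preflight(app: str, existing_apps: set[str], update: bool) -> list[str]:
    """Return warnings; raise if the app exists and update mode is off."""
    warnings: list[str] = []
    if app in existing_apps:
        if not update:
            raise AppAlreadyExists(f"{app} already exists")
        warnings.append(f"{app} already exists; continuing in update mode")
    return warnings


print(preflight("my-app", {"my-app"}, update=True))
```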

Building the image (important on Apple Silicon)

# Always --platform linux/amd64 — K8S nodes are amd64.
# Plain 'docker build' on Mac M1/M2/M3/M4 produces arm64 → ImagePullBackOff.
docker buildx build --platform linux/amd64 \
  -t harbor.catalyst.xcoinfra.net/<project>/<image>:<tag> \
  --push .

New image tag only

.venv/bin/python -m catalyst_onboard.cli new \
  -n myapp -i myimage -p 8080 -t 2.0.0 --ip 185.33.182.35 \
  --update --skip-harbor --skip-scaffold --skip-dns --skip-opnsense
⚠️ Changing the VIP via --update is not supported. Delete the app first, then re-create with the new IP.

🗑️ Delete Workflow

Removes all resources created by new. The Git branch is never touched, by design (audit trail).

.venv/bin/python -m catalyst_onboard.cli delete <app-name> [--yes] [--dry-run]
  [--keep-harbor]    # do NOT delete the Harbor project
  [--keep-dns]       # do NOT delete the Cloudflare DNS A record
  [--keep-opnsense]  # do NOT delete the OPNsense WAN rules

Order of operations:

1. ArgoCD cascade delete (c1 + c2)
2. Harbor: delete all repositories then the project
3. Cloudflare DNS A record
4. OPNsense WAN rules (CATALYST_AUTO_<app>_*)

ℹ️ Harbor delete requires HARBOR_USER=admin in .env. The argocd user gets 403 on project deletion.

🔀 Rename / Move to a New IP

The CLI has no rename command. To rename an app or move it to a new VIP:

1. Delete the old app

.venv/bin/python -m catalyst_onboard.cli delete <old-app> --yes

2. Clean up the git branch (not done by delete; audit trail by design)

cd git-catalyst-apps
git push origin --delete feature/<old-app>

3. Build + push the new image (with --platform linux/amd64)

docker buildx build --platform linux/amd64 \
  -t harbor.catalyst.xcoinfra.net/<new-app>/<image>:latest --push .

4. Onboard the new app

.venv/bin/python -m catalyst_onboard.cli new \
  -n <new-app> -i <image> -p <port> -t latest --ip 185.33.182.X

If the old app used the same VIP, wait ~30s for ArgoCD to garbage-collect before reusing that IP.

⚙️ Environment (.env)

Located at .tooling/catalyst-onboard/.env. Gitignored, chmod 600.

# ── ArgoCD ──────────────────────────────────────────
ARGOCD_URL=https://argocd.catalyst.xcoinfra.net
ARGOCD_USER=admin
ARGOCD_PASSWORD=<password>
ARGOCD_PROJECT=dev-project
ARGOCD_VERIFY_TLS=false
ARGOCD_CLUSTER_1_URL=https://172.16.3.29:6443
ARGOCD_CLUSTER_2_URL=https://172.16.3.119:6443

# ── Harbor (use admin — argocd can't delete projects) ─
HARBOR_URL=https://harbor.catalyst.xcoinfra.net
HARBOR_USER=admin
HARBOR_PASSWORD=<password>
HARBOR_VERIFY_TLS=true
HARBOR_MAINTAINER_USER=argocd

# ── Cloudflare ──────────────────────────────────────
CLOUDFLARE_ZONE_ID=<zone-id>
CLOUDFLARE_TOKEN=<token>

# ── OPNsense ────────────────────────────────────────
OPNSENSE_URL=https://109.234.111.251
OPNSENSE_API_KEY=<key>
OPNSENSE_API_SECRET=<secret>
OPNSENSE_VERIFY_TLS=false
OPNSENSE_TEMPLATE_RULE_DESCRIPTION=CATALYST_TEMPLATE_CHANGE_IP

# ── Git ─────────────────────────────────────────────
GITHUB_TOKEN=<PAT>
GIT_CATALYST_APPS_REPO=https://github.com/centralnicgroup/git-catalyst-apps.git
GIT_CATALYST_APPS_DEFAULT_BRANCH=main
GIT_USER_NAME=catalyst-onboard
GIT_USER_EMAIL=catalyst-onboard@xcoinfra.net

# ── BGP pool ─────────────────────────────────────────
BGP_POOL_START=185.33.182.30
BGP_POOL_END=185.33.182.60

# ── Tuning ──────────────────────────────────────────
SYNC_TIMEOUT_SECONDS=300
SYNC_POLL_INTERVAL_SECONDS=5
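
The two tuning values bound the wait in the Sync + wait step. A minimal sketch of such a polling loop is shown below, assuming a `get_health` callable that returns the ArgoCD health status as a string. The function name and the injectable `sleep` and `clock` parameters are illustrative choices that make the loop testable without real delays.

```python
import time
from typing import Callable


def wait_until_healthy(
    get_health: Callable[[], str],
    timeout_s: float = 300,  # SYNC_TIMEOUT_SECONDS
    interval_s: float = 5,  # SYNC_POLL_INTERVAL_SECONDS
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the status is 'Healthy'; return False once the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        if get_health() == "Healthy":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a scripted sequence of statuses and no real sleeping.
states = iter(["Progressing", "Progressing", "Healthy"])
print(wait_until_healthy(lambda: next(states), sleep=lambda s: None))  # True
```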

🐛 Known Bugs & Fixes

| Bug | Cause | Status |
|-----|-------|--------|
| 415 on delete | ArgoCD DELETE requires Content-Type: application/json | Fixed in clients/argocd.py |
| 403 on Harbor project check | argocd user can't GET /projects/{name} | Fixed: HEAD probe + 403 stub fallback |
| 403 CSRF token not found on Harbor mutations | Harbor enforces CSRF after first Set-Cookie | Fixed: _clear_session() before and after each mutating call |
| containerPort hardcoded to 80 | example-https-app had literal 80 instead of Helm value | Fixed on feature/paul-is-happy; needs merge to main |
| Harbor delete 403 | argocd user not sysadmin | Fixed: use HARBOR_USER=admin |
| Harbor delete 412 "project contains repositories" | delete_project called without emptying repos first | Fixed: delete_project now deletes all repos first |
| 400 "use the upsert flag" on --update | ArgoCD rejects duplicate POST without upsert param | Fixed: ?upsert=true added in clients/argocd.py |
| 400 "another operation in progress" | ArgoCD rejects sync while previous still running | Fixed: retry loop 10×, 5 s sleep |
| ImagePullBackOff / ErrImagePull: NotFound (wrong arch) | On Apple Silicon, docker build without --platform produces an arm64 image. K8S nodes are linux/amd64. Push succeeds so it's easy to miss. | User error: always use docker buildx build --platform linux/amd64 --push |
| Pods Running but 0/1 Ready, probe returns 503 | App returns 503 on / (maintenance page); default probe path hits / | Fixed: use a dedicated /healthz endpoint as probe path |
| 503 upstream connect error / timeout after pods Ready | HTTPRoute backendRef.port hardcoded to 80 | Fixed: use nginx.containerPort Helm value in httproutes.yml |
| 503 despite pods Ready and route correct | CiliumNetworkPolicy port "80" hardcoded, blocked non-80 traffic | Fixed: templated in network-policies.yml |
| Service port hardcoded to 80 | nginx.yml Service port: 80 routed to wrong port | Fixed: templated in nginx.yml |
| --skip-harbor skips Maintainer add too | The whole Harbor step is skipped, including add_project_member | Gotcha: Harbor step is idempotent; only skip if argocd is already Maintainer |
| OPNsense searchRule returns total: 0 | Broken in OPNsense 25.x even when rules exist | Worked around: use GET /firewall/filter/get and walk filter.rules.rule |
| OPNsense fields read as "N/A" | Reading nested destination.network, which doesn't exist | Gotcha: fields are flat: destination_net, destination_port, source_net |
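
The last two rows describe walking filter.rules.rule and reading flat field names. A sketch of that walk is shown below. The payload shape (rules keyed by UUID, plain string field values) and the sample description suffix are assumptions made for the example; check them against a real response before relying on them.

```python
def rules_for_vip(payload: dict, vip: str) -> list[dict]:
    """Collect rules whose flat destination_net field equals `vip`."""
    rules = payload.get("filter", {}).get("rules", {}).get("rule", {})
    matches = []
    for uuid, rule in rules.items():
        # Flat keys, not a nested destination.network structure.
        if rule.get("destination_net") == vip:
            matches.append(
                {
                    "uuid": uuid,
                    "description": rule.get("description", ""),
                    "destination_port": rule.get("destination_port", ""),
                    "source_net": rule.get("source_net", ""),
                }
            )
    return matches


# Hand-written sample in the assumed shape (not captured from a real firewall).
sample = {
    "filter": {
        "rules": {
            "rule": {
                "uuid-1": {
                    "description": "CATALYST_AUTO_my-app_443",
                    "destination_net": "185.33.182.35",
                    "destination_port": "443",
                    "source_net": "any",
                },
                "uuid-2": {
                    "description": "CATALYST_TEMPLATE_CHANGE_IP",
                    "destination_net": "185.33.182.30",
                    "destination_port": "443",
                    "source_net": "any",
                },
            }
        }
    }
}
print([r["description"] for r in rules_for_vip(sample, "185.33.182.35")])
```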

⚠️ OPNsense: where to see our rules in the UI

Rules created by the CLI via the os-firewall plugin API do NOT appear on the classic Firewall → Rules → WAN page (firewall_rules.php?if=wan). This is by design — the plugin stores its rules separately.

| System | Storage | UI location | API |
|--------|---------|-------------|-----|
| Legacy (manual) | /conf/config.xml <filter><rule> | Firewall → Rules → WAN | no public API |
| os-firewall plugin (ours) | separate XML | Firewall → Automation → Filter | /api/firewall/filter/* |

Both compile into the same pf kernel ruleset — both are active. They just live on different UI pages. If you don't see your rules in Rules → WAN, go to Automation → Filter.

Cancelling a stuck ArgoCD operation

If sync fails with "another operation is already in progress" after all retries, cancel it manually:

TOKEN=$(curl -sk -X POST https://argocd.catalyst.xcoinfra.net/api/v1/session \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"<pwd>"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['token'])")

curl -sk -X DELETE \
  "https://argocd.catalyst.xcoinfra.net/api/v1/applications/<app>-c1/operation" \
  -H "Authorization: Bearer $TOKEN"

curl -sk -X DELETE \
  "https://argocd.catalyst.xcoinfra.net/api/v1/applications/<app>-c2/operation" \
  -H "Authorization: Bearer $TOKEN"

Then re-run the CLI command that failed.
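
For context, the retry behaviour listed under Known Bugs (10 attempts, 5 s apart) might look roughly like the sketch below. `OperationInProgress` and `sync_with_retry` are hypothetical names; the real client presumably maps ArgoCD's 400 response onto its own error type.

```python
import time
from typing import Callable


class OperationInProgress(Exception):
    """Stand-in for ArgoCD's 'another operation is already in progress' error."""


def sync_with_retry(
    trigger_sync: Callable[[], None],
    attempts: int = 10,
    delay_s: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call trigger_sync until it succeeds; return the attempt number used."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            trigger_sync()
            return attempt
        except OperationInProgress:
            if attempt == attempts:
                raise  # retries exhausted: cancel the operation manually
            sleep(delay_s)
    raise AssertionError("unreachable")
```

When the final attempt still raises, the manual cancellation shown above is the next step.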