catalyst-onboard CLI
Automates the full onboarding of HTTP apps on the Catalyst Kubernetes Platform — from Harbor project creation to DNS, OPNsense firewall rules, and ArgoCD GitOps deployment on two clusters.
📖 What is catalyst-onboard?
A single CLI command that replaces a multi-step manual process: creating a Harbor registry project, scaffolding a Helm chart from a template, pushing to Git, creating ArgoCD Applications on two clusters, syncing until healthy, creating Cloudflare DNS records, and adding OPNsense WAN firewall rules.
Every step can be re-run safely. Re-run new on a partially failed onboard — it will skip what already exists.
Use --skip-harbor, --skip-scaffold, --skip-dns, --skip-opnsense to jump to a specific step.
Every app is deployed on catalyst-01 and catalyst-02, named <app>-c1 / <app>-c2.
Use --dry-run to preview the full plan without applying any changes.
🏛️ Architecture
The CLI orchestrates six external systems in sequence. No local state is stored — everything is discovered at runtime from ArgoCD.
Harbor: creates the container registry project and grants argocd Maintainer access.
Git scaffold: clones the repo, copies example-https-app, replaces all references with <app-name>, commits on feature/<app>, and pushes.
ArgoCD Applications: creates <app>-c1 and <app>-c2 with Helm parameter overrides (image, tag, port, VIP, hostname, CF token, Harbor creds).
Sync and wait: triggers ArgoCD sync on both clusters and polls until Healthy (up to SYNC_TIMEOUT_SECONDS).
Cloudflare DNS: idempotent upsert of A <hostname> → <gatewayIp> in the catalyst.xcoinfra.net zone.
OPNsense: clones the template firewall rule for ports 80/443 targeted at the new VIP, then applies it.
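The sync-and-wait step can be pictured as a bounded polling loop. The sketch below is an illustration only, not the CLI's actual code: `wait_until_healthy` and the `get_health` callback are hypothetical names, and the defaults simply mirror SYNC_TIMEOUT_SECONDS and SYNC_POLL_INTERVAL_SECONDS from the .env file.

```python
import time
from typing import Callable


def wait_until_healthy(
    get_health: Callable[[], str],
    timeout_s: float = 300.0,   # mirrors SYNC_TIMEOUT_SECONDS
    interval_s: float = 5.0,    # mirrors SYNC_POLL_INTERVAL_SECONDS
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_health() until it returns "Healthy" or the timeout expires.

    get_health is a hypothetical callback that would ask ArgoCD for the
    application's health status. Returns True on Healthy, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_health() == "Healthy":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; in normal use the defaults apply.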
🚀 Quick Start
# Enter the CLI directory (the commands below call the venv's Python directly)
cd git-catalyst-apps/.tooling/catalyst-onboard
# Check connectivity to all systems
.venv/bin/python -m catalyst_onboard.cli doctor
# List used / free VIPs
.venv/bin/python -m catalyst_onboard.cli list-ips
# Preview an onboard without applying
.venv/bin/python -m catalyst_onboard.cli new \
-n my-app -i my-image -p 8080 -t 1.0.0 --ip 185.33.182.X \
--dry-run
⌨️ All Commands
doctor
.venv/bin/python -m catalyst_onboard.cli doctor
Checks connectivity and authentication for ArgoCD, Harbor, Cloudflare, and OPNsense.
list-apps
.venv/bin/python -m catalyst_onboard.cli list-apps [--json]
Lists all ArgoCD Applications across both clusters with sync/health status and VIP.
list-ips
.venv/bin/python -m catalyst_onboard.cli list-ips [--json]
Shows used and free VIPs in the BGP pool 185.33.182.30–60. IPs are discovered at runtime from ArgoCD — no local state.
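As a rough illustration of how free VIPs can be derived without local state, the sketch below subtracts a set of in-use addresses from the pool range. The function names are hypothetical, and the set of used IPs is assumed to come from whatever ArgoCD reports at runtime.

```python
from ipaddress import IPv4Address


def pool_addresses(start: str, end: str) -> list[str]:
    """Return every address from start to end, inclusive."""
    first, last = int(IPv4Address(start)), int(IPv4Address(end))
    if first > last:
        raise ValueError("pool start must not be after pool end")
    return [str(IPv4Address(n)) for n in range(first, last + 1)]


def free_vips(start: str, end: str, used: set[str]) -> list[str]:
    """Return pool addresses that are not in the used set, in ascending order."""
    return [ip for ip in pool_addresses(start, end) if ip not in used]


if __name__ == "__main__":
    # The used set here is made up for the example.
    in_use = {"185.33.182.30", "185.33.182.35"}
    print(free_vips("185.33.182.30", "185.33.182.60", in_use)[:3])
```

With BGP_POOL_START=185.33.182.30 and BGP_POOL_END=185.33.182.60 the pool holds 31 addresses.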
status
.venv/bin/python -m catalyst_onboard.cli status <app-name> [--json]
new
.venv/bin/python -m catalyst_onboard.cli new \
-n <app-name> -i <harbor-image-name> -p <port> -t <tag> --ip 185.33.182.X
[--replicas N] [--dry-run] [--update]
[--skip-harbor] [--skip-scaffold] [--skip-dns] [--skip-opnsense]
delete
.venv/bin/python -m catalyst_onboard.cli delete <app-name> [--yes] [--dry-run]
post-deploy-check
.venv/bin/python -m catalyst_onboard.cli post-deploy-check <app-name>
Probes ArgoCD health, DNS resolution, TCP:443, and HTTPS response.
✨ Onboard Workflow (new)
Steps run in a fixed order. Steps 2 through 7 each have a --skip-* flag (see the table). All steps are idempotent, so re-running after a partially failed onboard is the standard recovery path.
| # | Step | What it does | Skip flag |
|---|---|---|---|
| 1 | Preflight | Validates IP is free in ArgoCD; rejects if app already exists | — |
| 2 | Harbor | Creates project <app>, adds HARBOR_MAINTAINER_USER as Maintainer | --skip-harbor |
| 3 | Scaffold | Copies example-https-app Helm chart, string-replaces, commits on feature/<app> | --skip-scaffold |
| 4 | ArgoCD | Creates <app>-c1 + <app>-c2 Applications with Helm overrides | --skip-argocd |
| 5 | Sync + wait | Syncs and polls until Healthy | --skip-sync |
| 6 | Cloudflare DNS | Upserts A <hostname> → <VIP> | --skip-dns |
| 7 | OPNsense | Clones WAN rules for ports 80/443 on the new VIP | --skip-opnsense |
| 8 | Final verification | Confirm the app answers HTTPS 200 end-to-end (see Final Verification) | — |
Always pass --ip explicitly. Never guess: run list-ips first to confirm the VIP is free.
--skip-harbor skips the entire Harbor step, including add_project_member. The step is idempotent — only use --skip-harbor if you are sure the argocd user is already a Maintainer on the project.
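The interaction between the fixed step order, the --skip-* flags, and --dry-run can be sketched as a small runner. This is a simplified model under assumptions: `Step` and `run_steps` are hypothetical names, and the real CLI may structure its steps differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    skip_flag: Optional[str] = None  # None means the step cannot be skipped


def run_steps(steps: list[Step], skip: set[str], dry_run: bool = False) -> list[str]:
    """Run steps in order, honouring skip flags; in dry-run mode only plan."""
    log = []
    for step in steps:
        if step.skip_flag is not None and step.skip_flag in skip:
            log.append(f"skip {step.name}")
            continue
        if dry_run:
            # Record what would happen without calling the step.
            log.append(f"plan {step.name}")
            continue
        step.run()
        log.append(f"done {step.name}")
    return log
```

Because each step is expected to be idempotent, re-running the same list after a failure is safe: steps that already succeeded become no-ops.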
✅ Final Deployment Verification
After new or --update, always verify the app is actually serving HTTP 200. ArgoCD reports Healthy based on pod health — it does not prove end-to-end reachability.
# Built-in check (DNS + HTTPS + ArgoCD health)
.venv/bin/python -m catalyst_onboard.cli post-deploy-check <app-name>
# Manual curl (ground truth — bypass DNS with --resolve)
curl -o /dev/null -sw "%{http_code}\n" \
--resolve "<app>.catalyst.xcoinfra.net:443:<VIP>" \
https://<app>.catalyst.xcoinfra.net/
If you don't get 200, diagnose in this order:
| Symptom | Likely cause | Action |
|---|---|---|
| Could not resolve host | DNS not propagated yet | Use --resolve; wait up to 300 s |
| Failed to connect | OPNsense rule missing or BGP not announcing VIP | Check get_all_rules; check list-ips |
| 000 (no response) | Pods not running / ImagePullBackOff | Run status <app>, inspect ArgoCD resource tree |
| ImagePullBackOff | Wrong image arch or Harbor pull secret missing | Check arch is amd64; check argocd is Harbor Maintainer |
| 502 / 503 with pods Ready | HTTPRoute port mismatch or NetworkPolicy blocking | Check httproutes.yml and network-policies.yml use nginx.containerPort |
| 200 on --resolve but not on DNS | DNS propagation lag | Wait; use dig @1.1.1.1 to confirm |
Testing before DNS propagates
After a new DNS record is created, Cloudflare's resolver (1.1.1.1) sees it within seconds. Google (8.8.8.8) and corporate DNS servers can take up to 300 s.
# Check which resolver sees the record
dig <app>.catalyst.xcoinfra.net @1.1.1.1 +short # Cloudflare
dig <app>.catalyst.xcoinfra.net @8.8.8.8 +short # Google (may lag)
# Force resolution to known VIP — bypasses DNS entirely
curl -o /dev/null -sw "%{http_code}\n" \
--resolve "<app>.catalyst.xcoinfra.net:443:185.33.182.X" \
https://<app>.catalyst.xcoinfra.net/
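The same DNS-bypassing probe can be approximated in Python with the standard library: connect to the VIP directly, but present the real hostname for SNI, certificate validation, and the Host header. This is a minimal sketch, not the CLI's implementation; `probe_https` and `parse_status_code` are hypothetical helper names.

```python
import socket
import ssl


def parse_status_code(response_head: bytes) -> int:
    """Extract the status code from the first line of an HTTP/1.x response."""
    status_line = response_head.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/") or not parts[1].isdigit():
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return int(parts[1])


def probe_https(hostname: str, vip: str, timeout_s: float = 5.0) -> int:
    """Connect to vip:443 while presenting hostname, and return the HTTP status.

    The certificate is still validated against hostname, much like
    curl --resolve does.
    """
    context = ssl.create_default_context()
    request = (
        f"GET / HTTP/1.1\r\nHost: {hostname}\r\n"
        "User-Agent: probe-sketch\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    with socket.create_connection((vip, 443), timeout=timeout_s) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            tls.sendall(request)
            # The status line normally arrives in the first chunk.
            head = tls.recv(4096)
    return parse_status_code(head)
```

A call such as `probe_https("<app>.catalyst.xcoinfra.net", "185.33.182.X")` would return the status code, or raise on connection, TLS, or parsing failure.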
post-deploy-check uses --resolve internally, so it passes even when DNS is still propagating.
🔄 Update Workflow (new --update)
Use --update to redeploy an existing app — e.g. after a new image tag or replica change.
| Step | Fresh new | --update |
|---|---|---|
| Preflight app exists | Fails | Warns, continues |
| Harbor project | Create | No-op if exists |
| Git scaffold | New branch | Re-uses existing branch |
| ArgoCD Applications | Create | Upsert (overwrite params) |
| DNS / OPNsense | Upsert | Same (idempotent) |
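The preflight difference between a fresh new and --update might look roughly like the sketch below. It is an assumption-laden illustration: `preflight` is a hypothetical function, `apps_by_ip` stands in for whatever the CLI discovers from ArgoCD, and the rule that an app may keep its own VIP on update is inferred rather than confirmed.

```python
def preflight(
    app: str,
    ip: str,
    apps_by_ip: dict[str, str],
    update: bool = False,
) -> list[str]:
    """Validate an onboard request and return any warnings.

    apps_by_ip maps each VIP in use to the app that owns it (hypothetical
    shape). Raises ValueError when the request must be rejected.
    """
    warnings = []
    owner = apps_by_ip.get(ip)
    # Assumption: a VIP held by the same app is acceptable, any other owner is not.
    if owner is not None and owner != app:
        raise ValueError(f"VIP {ip} is already used by {owner}")
    if app in apps_by_ip.values():
        if not update:
            raise ValueError(f"app {app} already exists; pass --update to redeploy")
        warnings.append(f"app {app} already exists, continuing because of --update")
    return warnings
```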
Building the image (important on Apple Silicon)
# Always --platform linux/amd64 — K8S nodes are amd64.
# Plain 'docker build' on Mac M1/M2/M3/M4 produces arm64 → ImagePullBackOff.
docker buildx build --platform linux/amd64 \
-t harbor.catalyst.xcoinfra.net/<project>/<image>:<tag> \
--push .
New image tag only
.venv/bin/python -m catalyst_onboard.cli new \
-n myapp -i myimage -p 8080 -t 2.0.0 --ip 185.33.182.35 \
--update --skip-harbor --skip-scaffold --skip-dns --skip-opnsense
Changing the VIP with --update is not supported. Delete the app first, then re-create it with the new IP.
🗑️ Delete Workflow
Removes all resources created by new. Git branch is never touched by design (audit trail).
.venv/bin/python -m catalyst_onboard.cli delete <app-name> [--yes] [--dry-run]
[--keep-harbor] # do NOT delete the Harbor project
[--keep-dns] # do NOT delete the Cloudflare DNS A record
[--keep-opnsense] # do NOT delete the OPNsense WAN rules
Order of operations:
ArgoCD cascade delete (c1 + c2)
→ Harbor: delete all repositories then the project
→ Cloudflare DNS A record
→ OPNsense WAN rules (CATALYST_AUTO_<app>_*)
Deleting the Harbor project requires HARBOR_USER=admin in .env. The argocd user gets 403 on project deletion.
🔀 Rename / Move to a New IP
The CLI has no rename command. To rename an app or move it to a new VIP:
# Delete the old app
.venv/bin/python -m catalyst_onboard.cli delete <old-app> --yes
# Delete the old Git branch (left in place by delete; audit trail by design)
cd git-catalyst-apps
git push origin --delete feature/<old-app>
# Build and push the image for the new app (always --platform linux/amd64)
docker buildx build --platform linux/amd64 \
-t harbor.catalyst.xcoinfra.net/<new-app>/<image>:latest --push .
# Onboard under the new name or IP
.venv/bin/python -m catalyst_onboard.cli new \
-n <new-app> -i <image> -p <port> -t latest --ip 185.33.182.X
If the old app used the same VIP, wait ~30s for ArgoCD to garbage-collect before reusing that IP.
⚙️ Environment (.env)
Located at .tooling/catalyst-onboard/.env. Gitignored, chmod 600.
# ── ArgoCD ──────────────────────────────────────────
ARGOCD_URL=https://argocd.catalyst.xcoinfra.net
ARGOCD_USER=admin
ARGOCD_PASSWORD=<password>
ARGOCD_PROJECT=dev-project
ARGOCD_VERIFY_TLS=false
ARGOCD_CLUSTER_1_URL=https://172.16.3.29:6443
ARGOCD_CLUSTER_2_URL=https://172.16.3.119:6443
# ── Harbor (use admin — argocd can't delete projects) ─
HARBOR_URL=https://harbor.catalyst.xcoinfra.net
HARBOR_USER=admin
HARBOR_PASSWORD=<password>
HARBOR_VERIFY_TLS=true
HARBOR_MAINTAINER_USER=argocd
# ── Cloudflare ──────────────────────────────────────
CLOUDFLARE_ZONE_ID=<zone-id>
CLOUDFLARE_TOKEN=<token>
# ── OPNsense ────────────────────────────────────────
OPNSENSE_URL=https://109.234.111.251
OPNSENSE_API_KEY=<key>
OPNSENSE_API_SECRET=<secret>
OPNSENSE_VERIFY_TLS=false
OPNSENSE_TEMPLATE_RULE_DESCRIPTION=CATALYST_TEMPLATE_CHANGE_IP
# ── Git ─────────────────────────────────────────────
GITHUB_TOKEN=<PAT>
GIT_CATALYST_APPS_REPO=https://github.com/centralnicgroup/git-catalyst-apps.git
GIT_CATALYST_APPS_DEFAULT_BRANCH=main
GIT_USER_NAME=catalyst-onboard
GIT_USER_EMAIL=catalyst-onboard@xcoinfra.net
# ── BGP pool ─────────────────────────────────────────
BGP_POOL_START=185.33.182.30
BGP_POOL_END=185.33.182.60
# ── Tuning ──────────────────────────────────────────
SYNC_TIMEOUT_SECONDS=300
SYNC_POLL_INTERVAL_SECONDS=5
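For orientation, a file in this KEY=VALUE format can be read and sanity-checked with a few lines of standard-library Python. This is a minimal sketch under assumptions: the helper names are hypothetical, the CLI may well use a dedicated dotenv library instead, and the permission check assumes a POSIX filesystem.

```python
import stat
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return required keys that are absent or empty, in the given order."""
    return [key for key in required if not env.get(key)]


def is_private(path: Path) -> bool:
    """True when group and others have no permissions (for example chmod 600)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & 0o077 == 0
```

A startup check could combine the three: refuse to run when `is_private` is false or `missing_keys` returns anything.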
🐛 Known Bugs & Fixes
| Bug | Cause | Status |
|---|---|---|
| 415 on delete | ArgoCD DELETE requires Content-Type: application/json | Fixed in clients/argocd.py |
| 403 on Harbor project check | argocd user can't GET /projects/{name} | Fixed: HEAD probe + 403 stub fallback |
| 403 CSRF token not found on Harbor mutations | Harbor enforces CSRF after first Set-Cookie | Fixed: _clear_session() before and after each mutating call |
| containerPort hardcoded to 80 | example-https-app had literal 80 instead of Helm value | Fixed on feature/paul-is-happy; needs merge to main |
| Harbor delete 403 | argocd user not sysadmin | Fixed: use HARBOR_USER=admin |
| Harbor delete 412 "project contains repositories" | delete_project called without emptying repos first | Fixed: delete_project now deletes all repos first |
| 400 "use the upsert flag" on --update | ArgoCD rejects duplicate POST without upsert param | Fixed: ?upsert=true added in clients/argocd.py |
| 400 "another operation in progress" | ArgoCD rejects sync while previous still running | Fixed: retry loop 10×, 5 s sleep |
| ImagePullBackOff / ErrImagePull: NotFound (wrong arch) | On Apple Silicon, docker build without --platform produces an arm64 image. K8S nodes are linux/amd64. Push succeeds so it's easy to miss. | User error: always use docker buildx build --platform linux/amd64 --push |
| Pods Running but 0/1 Ready, probe returns 503 | App returns 503 on / (maintenance page); default probe path hits / | Fixed: use a dedicated /healthz endpoint as probe path |
| 503 upstream connect error / timeout after pods Ready | HTTPRoute backendRef.port hardcoded to 80 | Fixed: use nginx.containerPort Helm value in httproutes.yml |
| 503 despite pods Ready and route correct | CiliumNetworkPolicy port "80" hardcoded, blocked non-80 traffic | Fixed: templated in network-policies.yml |
| Service port hardcoded to 80 | nginx.yml Service port: 80 routed to wrong port | Fixed: templated in nginx.yml |
| --skip-harbor skips Maintainer add too | The whole Harbor step is skipped, including add_project_member | Gotcha: Harbor step is idempotent; only skip if argocd is already Maintainer |
| OPNsense searchRule returns total: 0 | Broken in OPNsense 25.x even when rules exist | Worked around: use GET /firewall/filter/get and walk filter.rules.rule |
| OPNsense fields read as "N/A" | Reading nested destination.network, which doesn't exist | Gotcha: fields are flat (destination_net, destination_port, source_net) |
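The "another operation in progress" fix listed above (retry 10 times with a 5 s sleep) follows a common bounded-retry pattern. The sketch below shows that pattern in isolation; `OperationInProgress` and `sync_with_retry` are hypothetical names, not identifiers taken from clients/argocd.py.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class OperationInProgress(Exception):
    """Hypothetical error for ArgoCD's 400 'another operation in progress'."""


def sync_with_retry(
    trigger_sync: Callable[[], T],
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call trigger_sync, retrying while another operation is still running."""
    for attempt in range(1, attempts + 1):
        try:
            return trigger_sync()
        except OperationInProgress:
            if attempt == attempts:
                raise  # out of retries: surface the error to the caller
            sleep(delay_s)
    raise ValueError("attempts must be at least 1")
```

Only the in-progress error is retried; any other failure propagates immediately so real problems are not masked.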
⚠️ OPNsense: where to see our rules in the UI
Rules created by the CLI via the os-firewall plugin API do NOT appear on the classic Firewall → Rules → WAN page (firewall_rules.php?if=wan). This is by design — the plugin stores its rules separately.
| System | Storage | UI location | API |
|---|---|---|---|
| Legacy (manual) | /conf/config.xml → <filter><rule> | Firewall → Rules → WAN | no public API |
| os-firewall plugin (ours) | separate XML | Firewall → Automation → Filter | /api/firewall/filter/* |
Both compile into the same pf kernel ruleset — both are active. They just live on different UI pages. If you don't see your rules in Rules → WAN, go to Automation → Filter.
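To find the CLI's rules programmatically, the response of GET /firewall/filter/get can be walked at filter.rules.rule and filtered on the CATALYST_AUTO_<app>_ description prefix. The sketch below assumes the rules are a mapping keyed by UUID with flat fields; that payload shape and the `rules_for_app` helper are illustrative assumptions, so verify them against a real response.

```python
from typing import Any


def rules_for_app(payload: dict[str, Any], app: str) -> list[dict[str, str]]:
    """Return the automation rules whose description starts with CATALYST_AUTO_<app>_.

    Assumes payload["filter"]["rules"]["rule"] maps rule UUIDs to flat rule
    dicts (destination_net, destination_port, source_net).
    """
    rules = payload.get("filter", {}).get("rules", {}).get("rule", {})
    prefix = f"CATALYST_AUTO_{app}_"
    found = []
    for uuid, rule in rules.items():
        description = str(rule.get("description", ""))
        if not description.startswith(prefix):
            continue
        found.append(
            {
                "uuid": uuid,
                "description": description,
                # Fields are flat, not nested under "destination".
                "destination_net": str(rule.get("destination_net", "")),
                "destination_port": str(rule.get("destination_port", "")),
                "source_net": str(rule.get("source_net", "")),
            }
        )
    return found
```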
Cancelling a stuck ArgoCD operation
If sync fails with "another operation is already in progress" after all retries, cancel it manually:
TOKEN=$(curl -sk -X POST https://argocd.catalyst.xcoinfra.net/api/v1/session \
-H 'Content-Type: application/json' \
-d '{"username":"admin","password":"<pwd>"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['token'])")
curl -sk -X DELETE \
"https://argocd.catalyst.xcoinfra.net/api/v1/applications/<app>-c1/operation" \
-H "Authorization: Bearer $TOKEN"
curl -sk -X DELETE \
"https://argocd.catalyst.xcoinfra.net/api/v1/applications/<app>-c2/operation" \
-H "Authorization: Bearer $TOKEN"
Then re-run the CLI command that failed.
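If a scripted version is preferred over curl, the same two calls can be prepared with Python's urllib. The sketch below only builds the requests, using the endpoints shown in the curl commands; the function names are hypothetical, and sending them (plus the decision whether to verify TLS) is left to the caller.

```python
import json
import urllib.request


def build_login_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build the POST /api/v1/session request that returns a bearer token."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/session",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_cancel_requests(base_url: str, app: str, token: str) -> list[urllib.request.Request]:
    """Build DELETE .../operation requests for <app>-c1 and <app>-c2."""
    return [
        urllib.request.Request(
            f"{base_url}/api/v1/applications/{app}-{suffix}/operation",
            headers={"Authorization": f"Bearer {token}"},
            method="DELETE",
        )
        for suffix in ("c1", "c2")
    ]
```

Each request can then be passed to `urllib.request.urlopen`; the login response is JSON with the token under the `token` key, as in the curl example.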