feat(ci): CI & runner monitoring — capacity view, live Actions CI, thin alerting #3

Merged
gmackie merged 15 commits from feat/ci-runner-monitoring into main 2026-06-05 05:15:12 +00:00

Implements docs/plans/2026-06-05-ci-runner-monitoring. Phase 1: runner capacity view. Phase 2: live Forgejo Actions CI. Phase 3: thin alerting + truthful deploy status. Full pnpm typecheck (10/10), 94 web tests, and Go tests are green.

Reads action_runner via psql (self-gating on non-forgejo nodes), derives
online/offline from last_online staleness, POSTs to the runner-report
endpoint. Wired into the agent run cycle on a 60s cadence.
Cross-compile + deploy to hetzner-fg is a separate ops step.
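
As a rough illustration of the staleness check described above, here is a minimal sketch in TypeScript (the agent itself may be written in another language). The `RunnerRow` shape, the assumption that `last_online` is a Unix timestamp in seconds, and the 120-second threshold are all hypothetical and not taken from this PR.

```typescript
// Hypothetical subset of an action_runner row; the real table has more columns.
interface RunnerRow {
  name: string;
  lastOnlineUnix: number; // assumed: last_online stored as seconds since epoch
}

type RunnerStatus = "online" | "offline";

// Assumed threshold; the PR does not state the actual value.
const STALE_AFTER_SECONDS = 120;

// A runner counts as online if it was seen within the staleness window.
function deriveStatus(
  row: RunnerRow,
  nowUnix: number,
  staleAfter: number = STALE_AFTER_SECONDS,
): RunnerStatus {
  return nowUnix - row.lastOnlineUnix <= staleAfter ? "online" : "offline";
}
```

Passing `nowUnix` in explicitly keeps the function pure, so it can be tested without mocking the clock.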
Polls action_run/action_run_job and tails action_task .log files every 2.5s,
pushing per-job status + new log lines. NOTE: verified against a real log on
hetzner-fg — the format is plain newline-delimited '<RFC3339Nano> <content>'
text, NOT the length-prefixed format the plan assumed. Confirmed the live
Forgejo DB is 'forgejo'. Agent POSTs progress to the server (relay in 2.2).
Cross-compile + deploy to hetzner-fg is a separate ops step.
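
The log format noted above ('<RFC3339Nano> <content>', one entry per line) could be parsed along these lines. This is a sketch only: `parseLogLine` and `splitChunk` are hypothetical names, and the handling of lines without a space separator is an assumption.

```typescript
interface LogLine {
  timestamp: string; // kept as the raw RFC3339Nano string, not parsed into a Date
  content: string;
}

// Splits one line at the first space: timestamp on the left, content on the right.
// Lines without a space separator are dropped (assumed behaviour).
function parseLogLine(line: string): LogLine | null {
  const sep = line.indexOf(" ");
  if (sep <= 0) return null;
  return { timestamp: line.slice(0, sep), content: line.slice(sep + 1) };
}

// Combines leftover text from the previous poll with a newly read chunk,
// returns the complete lines and keeps any trailing partial line for next time.
function splitChunk(
  buffered: string,
  chunk: string,
): { lines: LogLine[]; rest: string } {
  const parts = (buffered + chunk).split("\n");
  const rest = parts.pop() ?? "";
  const lines = parts
    .map(parseLogLine)
    .filter((l): l is LogLine => l !== null);
  return { lines, rest };
}
```

Carrying `rest` between polls avoids emitting a half-written line when a poll lands in the middle of a write.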
New POST /api/fg/ci/actions/progress maps the agent's commitSha -> changeset ->
app and broadcasts an ephemeral pipeline_event (no DB row) onto the app's hub
channel, reusing use-pipeline-stream. Maps via commit (not build) because the
workflow_run webhook only lands a build at completion.
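
The commit-to-app mapping and ephemeral broadcast might look roughly like the sketch below. The in-memory maps stand in for database lookups, and `relayProgress`, the payload shape, and the `app:<id>` channel name are hypothetical, not the actual server code.

```typescript
interface ProgressPayload {
  commitSha: string;
  jobs: { name: string; status: string }[];
}

interface PipelineEvent {
  type: "ci-actions";
  appId: string;
  payload: ProgressPayload;
}

type Broadcast = (channel: string, event: PipelineEvent) => void;

// Resolves commitSha -> changeset -> app and broadcasts without persisting anything.
// Returns false when the commit cannot be mapped to an app.
function relayProgress(
  payload: ProgressPayload,
  changesetByCommit: Map<string, string>,
  appByChangeset: Map<string, string>,
  broadcast: Broadcast,
): boolean {
  const changesetId = changesetByCommit.get(payload.commitSha);
  if (changesetId === undefined) return false;
  const appId = appByChangeset.get(changesetId);
  if (appId === undefined) return false;
  // Hypothetical channel naming; the event is ephemeral (no DB row is written).
  broadcast(`app:${appId}`, { type: "ci-actions", appId, payload });
  return true;
}
```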
Subscribes via use-pipeline-stream; renders per-job status chips
(pending->running->green) and an auto-tailing log pane fed by the relayed
ci-actions progress events. Live runs surface here (no build row exists until
the workflow_run webhook lands at completion).
fireAlert dedupes on a firing dedupe_key (one row + one notify per transition);
resolveAlert flips firing rows resolved. Adds alerts.dedupe_key column. Extends
the alerts-client AlertType union + label map for the new types.
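
To illustrate the dedupe semantics described above (one row and one notification per firing transition), here is an in-memory sketch. `AlertStore` and its fields are hypothetical; the real implementation presumably works against the alerts table and its `dedupe_key` column.

```typescript
interface AlertRow {
  id: number;
  type: string;
  dedupeKey: string;
  status: "firing" | "resolved";
}

class AlertStore {
  rows: AlertRow[] = [];
  notified: string[] = []; // dedupe keys we sent a notification for

  // Inserts a firing row and notifies, unless one is already firing for this key.
  fireAlert(type: string, dedupeKey: string): boolean {
    const alreadyFiring = this.rows.some(
      (r) => r.dedupeKey === dedupeKey && r.status === "firing",
    );
    if (alreadyFiring) return false;
    this.rows.push({ id: this.rows.length + 1, type, dedupeKey, status: "firing" });
    this.notified.push(dedupeKey);
    return true;
  }

  // Flips every firing row for this key to resolved; returns how many changed.
  resolveAlert(dedupeKey: string): number {
    let changed = 0;
    for (const r of this.rows) {
      if (r.dedupeKey === dedupeKey && r.status === "firing") {
        r.status = "resolved";
        changed++;
      }
    }
    return changed;
  }
}
```

Because only firing rows block a new insert, a condition that recurs after being resolved produces a fresh row and a fresh notification.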
Adds ci_webhook_status table; agent now reports Forgejo webhook last_status
alongside runners. POST /api/fg/ci/monitoring/evaluate evaluates runner_offline
/ webhook_delivery_failed / ci_failed into fire/resolve via a pure tested
evaluator. Scheduler: documented systemd timer on hetzner-master (CF cron
fallback noted) per the plan.
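
A pure evaluator of the kind mentioned above could be structured like this sketch. The `Snapshot` shape, the `<type>:<id>` dedupe-key convention, and the function names are assumptions for illustration only.

```typescript
type AlertType = "runner_offline" | "webhook_delivery_failed" | "ci_failed";

// Hypothetical summary of the currently observed problem conditions.
interface Snapshot {
  offlineRunners: string[];
  failedWebhooks: string[];
  failedRuns: string[];
}

interface Action {
  kind: "fire" | "resolve";
  type: AlertType;
  dedupeKey: string;
}

// Builds the set of dedupe keys that should be firing right now.
function activeKeys(s: Snapshot): Map<string, AlertType> {
  const keys = new Map<string, AlertType>();
  s.offlineRunners.forEach((r) => keys.set(`runner_offline:${r}`, "runner_offline"));
  s.failedWebhooks.forEach((w) => keys.set(`webhook_delivery_failed:${w}`, "webhook_delivery_failed"));
  s.failedRuns.forEach((r) => keys.set(`ci_failed:${r}`, "ci_failed"));
  return keys;
}

// Pure diff: fire what is active but not yet firing, resolve what is firing but no longer active.
function evaluate(s: Snapshot, firing: Map<string, AlertType>): Action[] {
  const active = activeKeys(s);
  const actions: Action[] = [];
  active.forEach((type, key) => {
    if (!firing.has(key)) actions.push({ kind: "fire", type, dedupeKey: key });
  });
  firing.forEach((type, key) => {
    if (!active.has(key)) actions.push({ kind: "resolve", type, dedupeKey: key });
  });
  return actions;
}
```

Keeping the evaluator free of I/O means the scheduler (timer or cron) only has to gather a snapshot, call it, and apply the returned actions.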
notifyAlertFired now resolves the alert's node workspace, sends push, and emails
workspace members via a new lib/email.ts (no-ops with a warn when RESEND_API_KEY
is unset — it currently is). Respects alert_rules: a type whose rules are all
disabled is suppressed. Fires once per transition (dedupe in fireAlert).
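
The suppression rule and the email no-op described above can be sketched as follows. The `AlertRule` shape and both function names are hypothetical, and treating a type with no rules at all as not suppressed is an assumption rather than something this PR states.

```typescript
interface AlertRule {
  type: string;
  enabled: boolean;
}

// Suppressed only when the type has rules and every one of them is disabled.
// Assumption: a type with no rules is NOT suppressed.
function isSuppressed(type: string, rules: AlertRule[]): boolean {
  const forType = rules.filter((r) => r.type === type);
  return forType.length > 0 && forType.every((r) => !r.enabled);
}

// Decides whether email can be sent; warns and skips when the API key is unset.
function emailMode(
  apiKey: string | undefined,
  warn: (msg: string) => void,
): "send" | "skip" {
  if (!apiKey) {
    warn("RESEND_API_KEY is unset; skipping email");
    return "skip";
  }
  return "send";
}
```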
fix(deploy): truthful deploy status — secret-sync no longer masks a healthy deploy
Some checks failed
AI Code Review / review (pull_request) Failing after 9s
CI / ci (pull_request) Failing after 6m18s
d09659d6f8
deploy.yml: secret-sync is now continue-on-error; an explicit success signal is
POSTed to /api/fg/deploys/complete after wrangler deploy + health check, and a
failure signal only when the deploy/health steps (not secret-sync) fail. New
endpoint resolves/fires the deploy_failed alert accordingly.
Note: requires an FG_API_TOKEN Forgejo secret (curl is non-blocking if unset).
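
The decision the workflow now makes can be summarised by the sketch below. It models step outcomes as plain values; `DeployOutcomes` and `deploySignal` are hypothetical names, and the real logic lives in deploy.yml step conditions rather than in TypeScript.

```typescript
type StepOutcome = "success" | "failure" | "skipped";

interface DeployOutcomes {
  secretSync: StepOutcome; // continue-on-error: never decides the signal
  wranglerDeploy: StepOutcome;
  healthCheck: StepOutcome;
}

// Success requires both the deploy and the health check to pass;
// the secret-sync outcome is deliberately ignored.
function deploySignal(o: DeployOutcomes): "success" | "failure" {
  return o.wranglerDeploy === "success" && o.healthCheck === "success"
    ? "success"
    : "failure";
}
```

A skipped health check (for example because the deploy step failed first) counts as a failure in this sketch.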
Reference
gmackie/ForgeGraph!3