I’ve been exploring multi-agent systems and wanted to test a specific idea: can three AI agents with fixed roles actually ship software to production on their own, daily? So I built Utility Forge to find out.

Live demo · Repo

The Three Agents

  • Ava PO (Product Owner): generates one tool idea per day, scores candidates, creates a GitHub issue with acceptance criteria
  • Eve SE (Software Engineer): picks up the issue, implements the tool, runs tests, opens a PR
  • Nora QA (QA Reviewer): validates the PR against acceptance criteria, auto-merges on pass

There’s no shared runtime or message queue. They talk through GitHub: issues, labels, comments, and repository_dispatch events. All state lives in the repo, so you can look at any issue and see exactly what happened.
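To make the handoff mechanism concrete, here’s a minimal sketch of how one agent could trigger the next through GitHub’s `repository_dispatch` REST endpoint. The event names (`se_ready`, `se_pr_ready`) come from the post; the repo slug, helper names, and payload shape are assumptions.

```python
import json
import urllib.request

API = "https://api.github.com"

def build_dispatch_request(repo: str, token: str, event_type: str, payload: dict):
    """Build the POST request for a repository_dispatch event (repo is 'owner/name')."""
    url = f"{API}/repos/{repo}/dispatches"
    body = json.dumps({"event_type": event_type, "client_payload": payload}).encode()
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

def dispatch(repo: str, token: str, event_type: str, payload: dict) -> None:
    """Fire the event; GitHub answers 204 No Content on success."""
    with urllib.request.urlopen(build_dispatch_request(repo, token, event_type, payload)):
        pass
```

A workflow subscribed to that event type then wakes the receiving agent, so the only shared state is what GitHub already stores.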

How It Flows

Ava fires at 9 AM UTC. She generates three tool candidates via OpenAI, scores each one based on value, effort, confidence, and a novelty penalty that discourages repeating recent ideas. The winner becomes a GitHub issue. Ava dispatches se_ready, Eve picks it up, generates the tool under site/tools/, runs tests, and opens a PR. Eve then dispatches se_pr_ready. Nora waits 15 minutes before starting (without that delay she’d sometimes evaluate a PR before GitHub had finished processing it), then runs the test suite, checks the acceptance criteria from the original issue, and auto-merges if everything passes. The merge triggers the Pages deploy.
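The post names the four scoring inputs (value, effort, confidence, novelty penalty) but not how they combine, so the weighting below is an assumption: score candidates as value × confidence ÷ effort, then subtract a penalty proportional to word overlap with recent idea titles.

```python
def word_overlap(a: str, b: str) -> float:
    """Crude similarity: Jaccard overlap of lowercase title words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def score(candidate: dict, recent_titles: list[str]) -> float:
    """value * confidence / effort, minus a penalty for resembling recent ideas.

    The exact formula is assumed; the post only names these four factors.
    """
    base = candidate["value"] * candidate["confidence"] / max(candidate["effort"], 1)
    penalty = max(
        (word_overlap(candidate["title"], t) for t in recent_titles), default=0.0
    )
    return base - penalty

def pick_winner(candidates: list[dict], recent_titles: list[str]) -> dict:
    """Highest-scoring candidate becomes the day's GitHub issue."""
    return max(candidates, key=lambda c: score(c, recent_titles))
```

Dividing by effort is what biases the pipeline toward small tools, and the overlap penalty is one simple way to discourage repeating last week’s idea.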

There’s also a watchdog that runs hourly and re-dispatches any agent that’s been sitting idle too long. It’s what keeps the pipeline from silently stalling after a transient failure.

What’s Shipped So Far

  • JSON Formatter / Minifier / Key Sorter
  • Markdown Table Builder from CSV
  • SQL Formatter and Pretty Printer
  • Cron Expression Explainer

Small, focused, no-install tools. Exactly what the scoring formula favors.

A Few Things I Learned

I didn’t need a message queue or a custom orchestration framework. GitHub Issues + labels + dispatch events turned out to be enough to coordinate three independent agents. That surprised me.

The part I underestimated was how much the PO prompt matters. Ava’s acceptance criteria get re-read by Nora later to drive QA decisions. When Ava writes something vague, Nora makes vague decisions. Getting that first prompt right had more impact than anything I did in the SE or QA workflows.
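Since Nora mechanically re-reads Ava’s criteria, their format effectively becomes the contract between the two agents. Assuming Ava writes them as a Markdown task list in the issue body (the post doesn’t show the actual format), extraction could look like:

```python
import re

# Matches Markdown task-list lines like "- [ ] does X" or "* [x] does Y".
CRITERION = re.compile(r"^\s*[-*]\s*\[[ xX]\]\s+(.*\S)\s*$")

def parse_acceptance_criteria(issue_body: str) -> list[str]:
    """Pull checklist items out of an issue body.

    Assumes a task-list convention; a vague or free-form criteria section
    would come back empty, which is exactly the failure mode described above.
    """
    return [m.group(1) for line in issue_body.splitlines()
            if (m := CRITERION.match(line))]
```

This is also why the PO prompt matters so much: anything Ava writes outside the convention simply never reaches QA.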

I also added fallbacks everywhere: Ava falls back to a seed idea file when OpenAI fails, and Nora retries failed merges with exponential backoff. An autonomous system that hard-crashes on any API hiccup isn’t really autonomous.

Still alpha, but it ships something every day. That was the goal.