Building SplitDecision: Multi-Agent AI Debates in Next.js

Ever been stuck choosing between two options? What if you could watch four AI agents with wildly different personalities argue it out for you, live?

That’s exactly what I built with SplitDecision. You type in two options, hit compare, and a full-blown debate unfolds in real time, token by token, complete with rebuttals, a final verdict, and a confidence score.

Here’s what I learned and how it all works under the hood.

What Is SplitDecision?

You enter two options you’re torn between. Four AI agents debate them in two rounds, then a synthesizer agent delivers a final verdict with a confidence score. The whole thing streams live.

Figure: SplitDecision input form

The stack:

Frontend: Next.js 15 (App Router), React 19, TypeScript
Styling & Animations: Tailwind CSS, Framer Motion
AI: OpenAI API (GPT-4o Mini / GPT-4.1 Nano / GPT-4.1 Mini)
Rate Limiting & Storage: Upstash Redis
Deployment: Vercel

The Four Agents

Each agent has a fixed archetype that shapes how they approach any comparison:

The Analyst (Blue) - Data-driven. Cites specs, benchmarks, and numbers. Won’t give you a vague opinion.
The Contrarian (Red) - Defends the underdog. Exposes hidden costs and overlooked advantages.
The Pragmatist (Green) - Your experienced friend who’s actually used both options.
The Wildcard (Purple) - Sees angles nobody else does. Future trends, philosophical implications, second-order effects.

Figure: The Analyst agent responding in a Mac vs Windows debate

The interactions between them are where it gets interesting. The Contrarian tears apart The Analyst’s data, The Wildcard reframes the entire discussion, and The Pragmatist brings everyone back to earth.

How the Debate Works

The debate flows through four phases:

1. Validation - A validation agent checks whether the comparison makes sense. Temperature is set to 0.0 for near-deterministic output, and OpenAI’s moderation API handles content safety.

2. Round 1 - Initial Takes - Each agent responds in turn with a budget of up to 400 tokens. Temperature is 0.9 for personality-rich responses.

3. Round 2 - Rebuttals - Each agent receives the full Round 1 transcript and responds directly to the others by name. The budget drops to 250 tokens to keep things punchy.

4. Verdict - A synthesizer agent reads everything and produces a structured verdict with a winner, confidence score (50-95%), and conditional recommendations.

The orchestration is a simple loop on the client side:

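// Round 1 output per agent, reused later to build the Round 2 prompts
const round1Results: Record<string, string> = {};

for (const agentKey of AGENT_ORDER) {
  const msgId = `r1-${agentKey}`;
  // Add an empty placeholder message so the UI can render it while it streams
  setMessages(prev => [...prev, {
    id: msgId, agentKey, round: 1, text: '', isStreaming: true
  }]);

  let fullText = '';
  for await (const chunk of streamChat(apiKey, {
    type: 'agent', agentKey, ...
  })) {
    fullText += chunk;
    // Update only this agent's message with the text accumulated so far
    setMessages(prev => prev.map(msg =>
      msg.id === msgId ? { ...msg, text: fullText } : msg
    ));
  }

  round1Results[agentKey] = fullText;
}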

Each agent streams token by token. When Round 1 finishes, the full transcript gets bundled into Round 2 prompts so agents can reference each other.
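The bundling step could look something like the sketch below. It is not the project's actual code: the agent keys, the `AGENT_NAMES` map, the order of the last three agents, and the prompt wording are all assumptions made for illustration. The idea is just to label each Round 1 response with its speaker so Round 2 agents can address each other by name.

```typescript
// Agent keys and display names are assumptions for this sketch.
type AgentKey = 'analyst' | 'contrarian' | 'pragmatist' | 'wildcard';

const AGENT_NAMES: Record<AgentKey, string> = {
  analyst: 'The Analyst',
  contrarian: 'The Contrarian',
  pragmatist: 'The Pragmatist',
  wildcard: 'The Wildcard',
};

// The Analyst goes first (per the article); the rest of the order is assumed.
const AGENT_ORDER: AgentKey[] = ['analyst', 'contrarian', 'pragmatist', 'wildcard'];

type RoundResults = Partial<Record<AgentKey, string>>;

// Turn Round 1 output into a speaker-labeled transcript, skipping missing agents.
function buildTranscript(results: RoundResults): string {
  return AGENT_ORDER
    .filter(key => Boolean(results[key]))
    .map(key => `${AGENT_NAMES[key]}: ${results[key]}`)
    .join('\n\n');
}

// Build the user message for one agent's rebuttal (hypothetical wording).
function buildRebuttalPrompt(
  agentKey: AgentKey, optionA: string, optionB: string, results: RoundResults
): string {
  return [
    `Comparison: ${optionA} vs ${optionB}`,
    `Round 1 transcript:\n${buildTranscript(results)}`,
    `You are ${AGENT_NAMES[agentKey]}. Respond directly to the other agents by name.`,
  ].join('\n\n');
}
```

Labeling each entry with the agent's display name is what makes "respond by name" possible; without it the model only sees an undifferentiated block of text.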

Prompt Engineering

This is where the real work happened.

Personality Through System Prompts

Vague instructions like “be analytical” don’t work. You need to tell the model exactly how to think. Here’s the Analyst’s default system prompt:

You are The Analyst, a data-driven, no-nonsense comparison expert. Focus exclusively on specs, numbers, benchmarks, cost, market data, and measurable differences. Quantify everything you can. Your tone is professional and concise. Never give vague opinions, back claims with data or concrete reasoning. Keep your response under 150 words.

And the same agent in “Startup Bros” theme:

You are The Analyst, a growth-obsessed startup metrics guru. You talk in terms of TAM, CAC, LTV, burn rate, and runway. Every comparison is framed as a market opportunity. You reference Y Combinator, a16z, and Series A benchmarks. You say things like ’the unit economics here are clear.’ Keep your response under 150 words.

Same archetype, completely different personality. This is how I support 9 debate themes with 72 unique agent prompts total.
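One way to organize that many prompts is a lookup table keyed by theme, agent, and round. The sketch below is an assumption about structure, not the project's real data model: the theme ids, the `PROMPTS` table, and the fallback to the default theme are all hypothetical, and the prompt strings are shortened to their first sentence.

```typescript
type AgentKey = 'analyst' | 'contrarian' | 'pragmatist' | 'wildcard';
type Round = 1 | 2;

// theme id -> agent -> round -> system prompt (partial here to keep the sketch short)
type PromptTable = Record<string, Partial<Record<AgentKey, Partial<Record<Round, string>>>>>;

const PROMPTS: PromptTable = {
  'default': {
    analyst: { 1: 'You are The Analyst, a data-driven, no-nonsense comparison expert.' },
  },
  'startup-bros': {
    analyst: { 1: 'You are The Analyst, a growth-obsessed startup metrics guru.' },
  },
};

// Look up a prompt, falling back to the default theme (the fallback is an assumption).
function getSystemPrompt(theme: string, agent: AgentKey, round: Round): string {
  const prompt = PROMPTS[theme]?.[agent]?.[round] ?? PROMPTS['default']?.[agent]?.[round];
  if (!prompt) throw new Error(`No prompt for ${theme}/${agent}/round ${round}`);
  return prompt;
}

// 9 themes x 4 agents x 2 rounds
const totalAgentPrompts = 9 * 4 * 2;
```

Keeping the archetype as the key and the theme as an outer layer means the orchestration code never changes when a theme is added; only the table grows.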

Making AI Sound Human

I added global writing rules to every prompt to avoid AI slop:

Use contractions (don’t, can’t, it’s)
Vary sentence length
Never use em dashes or semicolons
Avoid hedge words (arguably, essentially, fundamentally)
No filler phrases (at the end of the day, when it comes to)

These small constraints made a massive difference. The responses feel like actual personalities instead of four variations of “certainly, here’s my analysis.”
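A minimal sketch of how shared rules might be appended to every system prompt. The `withWritingRules` helper and the "Writing rules:" label are hypothetical names. The `findRuleViolations` check is an extra that the article doesn't describe; it's included only to show that rules like these are easy to test against model output.

```typescript
// The five rules listed above, phrased as instructions.
const WRITING_RULES: string[] = [
  "Use contractions (don't, can't, it's).",
  'Vary sentence length.',
  'Never use em dashes or semicolons.',
  'Avoid hedge words (arguably, essentially, fundamentally).',
  'No filler phrases (at the end of the day, when it comes to).',
];

// Append the shared rules to any agent's system prompt.
function withWritingRules(systemPrompt: string): string {
  const rules = WRITING_RULES.map(rule => `- ${rule}`).join('\n');
  return `${systemPrompt}\n\nWriting rules:\n${rules}`;
}

// Hypothetical post-check: report which of the mechanical rules a response breaks.
function findRuleViolations(text: string): string[] {
  const violations: string[] = [];
  if (/[\u2014;]/.test(text)) violations.push('em dash or semicolon');
  if (/\b(arguably|essentially|fundamentally)\b/i.test(text)) violations.push('hedge word');
  if (/at the end of the day|when it comes to/i.test(text)) violations.push('filler phrase');
  return violations;
}
```

Applying the rules through one helper keeps all 72 prompts consistent: change the list once and every agent picks it up.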

Temperature and Token Budgets

Phase	Temperature	Why
Validation	0.0	Deterministic yes/no
Debates (R1 & R2)	0.9	Creative, personality-rich
Verdict	0.7	Grounded synthesis

Lower temperatures made all four agents sound the same. 0.9 gave each personality room to breathe.

Token limits shape how agents communicate too. Round 1 gets 400 tokens for real arguments, Round 2 gets 250 to force direct engagement instead of restating positions, and the Verdict gets 500 for structured synthesis.
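Put together, the settings above fit in one small config object. This is a sketch under assumptions: the `PHASES` name and its shape are invented for illustration, the call counts assume one validation call and one verdict call per debate, and the validation token limit is left out because the article doesn't state it.

```typescript
type Phase = 'validation' | 'round1' | 'round2' | 'verdict';

interface PhaseConfig {
  temperature: number;
  maxTokens?: number; // omitted where the article gives no figure
  calls: number;      // model calls per debate (assumed)
}

const PHASES: Record<Phase, PhaseConfig> = {
  validation: { temperature: 0.0, calls: 1 },
  round1: { temperature: 0.9, maxTokens: 400, calls: 4 },
  round2: { temperature: 0.9, maxTokens: 250, calls: 4 },
  verdict: { temperature: 0.7, maxTokens: 500, calls: 1 },
};

// Upper bound on generated tokens per debate, ignoring phases with no stated limit.
function maxOutputTokens(phases: Record<Phase, PhaseConfig>): number {
  return Object.values(phases)
    .reduce((sum, phase) => sum + (phase.maxTokens ?? 0) * phase.calls, 0);
}
```

With these numbers the ceiling is 4 × 400 + 4 × 250 + 500 = 3,100 output tokens per debate, excluding validation, which is a useful figure when estimating the cost of a free tier.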

Streaming

I built dual-mode streaming: direct browser calls when users bring their own API key, and server-proxied calls for the free tier.

Direct Browser Streaming

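import OpenAI from 'openai';

async function* streamDirect(
  apiKey: string, req: StreamRequest
): AsyncGenerator<string> {
  // The user's own key is used directly from the browser, so it never touches the server
  const client = new OpenAI({
    apiKey,
    dangerouslyAllowBrowser: true
  });
  // model, messages, max_tokens, and temperature are derived from req (omitted here)
  const stream = await client.chat.completions.create({
    model, messages, max_tokens, temperature, stream: true
  });
  for await (const chunk of stream) {
    // Some chunks carry no text (role or finish markers), so skip empty deltas
    const text = chunk.choices[0]?.delta?.content;
    if (text) yield text;
  }
}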

The AsyncGenerator pattern lets the UI consume tokens with a simple for await loop, keeping orchestration code clean.
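To see the pattern in isolation, here is a small sketch with no network calls. `fakeTokenStream` is a stand-in for the model stream and `collect` mirrors the accumulate-and-update loop from the orchestration code; both names are made up for this example.

```typescript
// Stand-in for a model stream: yields a fixed list of tokens, optionally with a delay.
async function* fakeTokenStream(tokens: string[], delayMs = 0): AsyncGenerator<string> {
  for (const token of tokens) {
    await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    yield token;
  }
}

// Consume any async token source, reporting the accumulated text after each chunk.
async function collect(
  stream: AsyncIterable<string>,
  onUpdate: (text: string) => void
): Promise<string> {
  let fullText = '';
  for await (const chunk of stream) {
    fullText += chunk;
    onUpdate(fullText);
  }
  return fullText;
}
```

Because `collect` only depends on `AsyncIterable<string>`, the same consumer works for the direct and the proxied mode, and a fake stream like this one makes the UI logic easy to test.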

Server-Proxied Streaming

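export async function POST(req: Request) {
  // x-forwarded-for can be missing or hold a comma-separated chain; use the first entry
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'unknown';
  const limit = await rateLimit(ip);
  if (!limit.ok) return new Response('Rate limited', { status: 429 });

  // model, messages, max_tokens, and temperature are built from the request body (omitted here)
  const stream = await client.chat.completions.create({
    model, messages, max_tokens, temperature, stream: true
  });

  const encoder = new TextEncoder();
  return new Response(new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content;
          if (text) controller.enqueue(encoder.encode(text));
        }
        controller.close();
      } catch (err) {
        // Surface upstream failures to the client instead of leaving the stream hanging
        controller.error(err);
      }
    }
  }), { headers: { 'Content-Type': 'text/plain; charset=utf-8' } });
}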

Same experience for the user, but the API key stays on the server. Rate limiting uses Upstash’s fixed-window algorithm, capped at 50 comparisons per IP per 24 hours.
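For readers unfamiliar with fixed-window limiting, here is an in-memory sketch of the algorithm itself. It is not Upstash's API and not the project's code: `FixedWindowLimiter` is a hypothetical class, and a serverless deployment would need the counters in shared storage such as Redis rather than a local `Map`.

```typescript
interface WindowState {
  windowStart: number; // start of the current window, in ms
  count: number;       // requests seen in that window
}

class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Count a request for `key` and report whether it is allowed.
  check(key: string, now: number = Date.now()): { ok: boolean; remaining: number } {
    // Windows are aligned to fixed boundaries, not to the first request.
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const state = this.windows.get(key);

    if (!state || state.windowStart !== windowStart) {
      // First request in a new window resets the counter.
      this.windows.set(key, { windowStart, count: 1 });
      return { ok: true, remaining: this.limit - 1 };
    }
    if (state.count >= this.limit) return { ok: false, remaining: 0 };

    state.count += 1;
    return { ok: true, remaining: this.limit - state.count };
  }
}
```

The article's setting would correspond to `new FixedWindowLimiter(50, 24 * 60 * 60 * 1000)`. Fixed windows are cheap (one counter per key) at the cost of allowing a burst around a window boundary, which is an acceptable trade for a daily quota.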

The Verdict System

The verdict isn’t just “Option A wins.” I enforce a structured format in the prompt:

WINNER: [Option A or Option B]
CONFIDENCE: [50-95]%

[3-4 sentence synthesis]

WHAT WOULD FLIP THIS: [1-2 sentences]

PICK [Option A] IF: [1 sentence]
PICK [Option B] IF: [1 sentence]

Then parse it with regex as the verdict streams in:

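export function parseVerdict(
  text: string, optionA: string, optionB: string
) {
  // Both matches can be null while the verdict is still streaming in
  const winnerMatch = text.match(/WINNER:\s*(.+)/);
  const confMatch = text.match(/CONFIDENCE:\s*(\d+)/);

  let winner: string | null = null;
  if (winnerMatch) {
    const raw = winnerMatch[1].trim().toLowerCase();
    // Test the longer name first so "React Native" is not mistaken for "React"
    const [first, second] = optionA.length >= optionB.length
      ? [optionA, optionB] : [optionB, optionA];
    if (raw.includes(first.toLowerCase())) winner = first;
    else if (raw.includes(second.toLowerCase())) winner = second;
  }

  // Clamp to the 50-95 range the prompt asks for
  const confidence = confMatch
    ? Math.max(50, Math.min(95, parseInt(confMatch[1], 10)))
    : null;

  return { winner, confidence };
}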

Confidence is clamped to 50-95%. Below 50% doesn’t make sense for a binary choice, and above 95% feels dishonest for subjective comparisons.
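The remaining fields of the format can be pulled out the same way. The sketch below is an assumption about how that might look, not the project's parser: `parseVerdictExtras` and `escapeRegExp` are hypothetical helpers, and the sample verdict text is made up for illustration.

```typescript
// Escape user-supplied option names (e.g. "C++") before embedding them in a RegExp.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

interface VerdictExtras {
  flip: string | null;  // WHAT WOULD FLIP THIS
  pickA: string | null; // PICK [Option A] IF
  pickB: string | null; // PICK [Option B] IF
}

// Each field stays null until its line has arrived, so this is safe on partial text.
function parseVerdictExtras(text: string, optionA: string, optionB: string): VerdictExtras {
  const flipMatch = text.match(/WHAT WOULD FLIP THIS:\s*(.+)/);
  const pickIf = (option: string): string | null => {
    const pattern = new RegExp(`PICK ${escapeRegExp(option)} IF:\\s*(.+)`, 'i');
    return text.match(pattern)?.[1]?.trim() ?? null;
  };
  return {
    flip: flipMatch?.[1]?.trim() ?? null,
    pickA: pickIf(optionA),
    pickB: pickIf(optionB),
  };
}

// Made-up sample in the format the prompt enforces.
const sampleVerdict = [
  'WINNER: Mac',
  'CONFIDENCE: 72%',
  '',
  'Synthesis goes here.',
  '',
  'WHAT WOULD FLIP THIS: A big price drop.',
  '',
  'PICK Mac IF: You value battery life.',
  'PICK Windows IF: You play games.',
].join('\n');

const extras = parseVerdictExtras(sampleVerdict, 'Mac', 'Windows');
```

Returning `null` for anything missing, rather than throwing, lets the UI render whatever has arrived so far and fill in the rest as the stream continues.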

The Theming System

There are 9 debate themes that completely reshape agent personalities:

Default Panel - Professional experts giving measured takes
Startup Bros - Everything framed as market opportunity with VC lingo
Academic Panel - Peer-reviewed discourse with citations
Bar Argument - Friends arguing over drinks, one of them definitely googled it
Shark Tank - Is this comparison worth investing in?
Reddit Thread - Upvotes, hot takes, and “this is the way”
Courtroom Trial - Legal drama with expert witnesses and objections
Sports Commentary - Play-by-play analysis of the matchup
Philosophy Seminar - Existential deliberation on the nature of choice

Figure: All 9 debate themes available in SplitDecision

Each theme rewrites all four agent prompts for both rounds. That’s 9 themes × 4 agents × 2 rounds = 72 agent prompts, plus validation and verdict prompts. A “React vs Svelte” debate feels completely different as a Courtroom Trial versus a Bar Argument.

Figure: Trending comparisons feed with winners and confidence scores

What I Learned

Constraints create character. Without strict writing rules, token limits, and specific vocabulary guidance, all four agents sound the same. The more constraints I added, the more distinct each voice became.

Agent ordering matters. The first agent sets the frame and everyone else reacts to it. The Analyst always goes first, which means it has outsized influence on every debate.

Structured LLM output is fragile. LLMs don’t always follow format instructions perfectly. Clear regex patterns with sensible fallbacks are essential.

Token budgets shape behavior. The Round 2 limit of 250 tokens was the breakthrough. It forced agents to actually engage with each other instead of restating their position.

Streaming changes perception. Watching agents “think” token by token feels like a live event. It’s fundamentally different from waiting for a complete response.

Claude Code made this viable as a solo project. From streaming logic to 72 unique agent prompts, AI-assisted development made the scope manageable.

Wrap Up

Multi-agent systems don’t need to be complicated infrastructure projects. Good prompt engineering, clear agent archetypes, and a streaming-first architecture can create AI interactions that feel like watching a real discussion.

The biggest takeaway? Personality in prompts matters more than model selection. GPT-4o Mini with a great system prompt produces more engaging debates than a larger model with a generic one. The constraints you put on your agents are what give them character.

If you’re interested in building multi-agent systems, start with clearly defined roles, invest time in prompt writing, and let your agents actually interact with each other’s output. That’s where the magic happens.