ai ai-tooling fsharp infrastructure lessons-learned

I Built an AI Coding Orchestration System. Here's What I Learned.

6 min read

Most devs use AI tools now. Org productivity hasn't moved. I built a system to fix that, and here's what actually worked.

Over 91% of developers now use AI coding tools. PR volume is surging. And organizational productivity has barely moved.

That gap keeps showing up in the data. Teams adopt AI coding tools, generate more PRs, but review time climbs even faster. More code going in, same bottleneck at review, same deployment queues, same or worse quality gates.

The Bottleneck Shifted

AI coding assistants genuinely speed up code generation. But that just moves the constraint downstream into code review, QA, merge conflicts, and deployment queues. None of those got faster. They got worse, because they’re processing more volume at lower average quality.

It wasn’t a code generation problem. It was a pipeline problem.

Why I Built Hivemind

About a year ago I was running Claude Code manually across three repos at the same time. Environments were contaminating each other, context was getting tangled, and I was spending more time managing the AI than actually writing code. I dealt with that for months before I finally started building something to fix it.

Hivemind is the orchestration platform that came out of that. I’ve been using it daily across four projects for the past four months or so. It doesn’t try to make AI write code faster. Instead, it runs the full development pipeline: plan, implement, verify, review, fix, PR.

Each phase is explicit with defined inputs and outputs. The plan phase constrains what gets implemented. The verify phase catches what the implement phase broke. The review phase looks at the actual diff.
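One way to picture those explicit inputs and outputs is as types that thread through the pipeline. This is a hypothetical sketch (the `Plan`, `Diff`, and `VerifyResult` types and the stub bodies are mine, not Hivemind's actual schema):

```fsharp
// Hypothetical phase types: each phase's output is the next phase's
// input, so the pipeline order is enforced by the type checker.
type Plan = { Task: string; Steps: string list }
type Diff = { PlanUsed: Plan; ChangedFiles: string list }
type VerifyResult = { Diff: Diff; TestsPassed: bool }

// Stub implementations; the real phases would invoke an AI agent.
let plan (task: string) : Plan =
    { Task = task; Steps = [ "locate handler"; "apply fix"; "add test" ] }

let implement (p: Plan) : Diff =
    { PlanUsed = p; ChangedFiles = [ "Api/Handlers.fs" ] }

let verify (d: Diff) : VerifyResult =
    { Diff = d; TestsPassed = true }
```

Because `implement` takes a `Plan` and `verify` takes a `Diff`, you can't skip a phase or run them out of order without a compile error.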

The thing I keep coming back to is that the orchestration matters more than the model. SWE-bench data shows the same model scoring anywhere from 42% to 78% depending on which agent wraps it. Same intelligence, different scaffolding, wildly different results.

Container Isolation

Every phase runs inside its own Docker container. A “drone” is a container that executes a sequence of phases (plan, then implement, then verify) each as its own “thread” with isolated context. Multiple drones run in parallel across different tasks and repos.

If you’re giving an AI write access to your codebase, you need to contain the blast radius. Environment contamination was one of the first problems I hit when running agents manually: one task’s changes bleeding into another, dependencies conflicting, git state getting tangled.

Containers fix this pretty cleanly. Fresh environment every time. If a drone goes sideways, you kill it and nothing else is affected.

In March 2026, Docker shipped Sandboxes, microVM-based isolation designed specifically for AI agents. They arrived at the same conclusion I’d reached months earlier: when you actually run AI agents on real code at any kind of scale, isolation stops being optional.

Why F#

Building an AI orchestrator in F# gets some raised eyebrows, since most of this ecosystem is Python and TypeScript. I considered a few languages early on, and it mostly came down to F# and Rust. Rust would have been the safe pick for infrastructure tooling, but I wanted to try the SAFE stack and see what end-to-end type safety felt like with functional programming across the whole stack. Months in, I’d make the same choice again.

Orchestrating AI agents is a state machine problem. Each drone moves through discrete phases with specific transitions. In F#, discriminated unions make this explicit:

type DronePhase =
    | Planning
    | Implementing
    | Verifying
    | Reviewing
    | Fixing
    | CreatingPR
    | Complete
    | Failed of string

The compiler enforces exhaustive pattern matching. Every state transition is accounted for, every failure mode is handled. When you’re coordinating a fleet of Docker containers that modify your codebase, having the compiler catch state bugs is worth the trade-off of a less popular ecosystem.
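To make that concrete, here's a sketch of a transition function over `DronePhase` (the routing choices are illustrative, not Hivemind's actual logic). Add a new case to the union and every match like this stops compiling until it's handled:

```fsharp
// Illustrative: decide the next phase from the current one and
// whether it succeeded. Complete and Failed are terminal; a failed
// verify or review routes back through Fixing.
let nextPhase (phase: DronePhase) (succeeded: bool) : DronePhase =
    match phase, succeeded with
    | Complete, _       -> Complete
    | Failed msg, _     -> Failed msg
    | Verifying, false  -> Fixing
    | Reviewing, false  -> Fixing
    | p, false          -> Failed (sprintf "%A did not succeed" p)
    | Planning, _       -> Implementing
    | Implementing, _   -> Verifying
    | Verifying, _      -> Reviewing
    | Reviewing, _      -> CreatingPR
    | Fixing, _         -> Verifying   // re-run verification after a fix
    | CreatingPR, _     -> Complete
```

Drop a case and the compiler flags the match as incomplete, which is exactly the safety net you want around container state.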

Three things F# gives me that I’d actually miss:

Immutability by default. Running 10 drones in parallel means shared state is a minefield. Immutable data structures eliminate an entire category of concurrency bugs.

Async computation expressions. Docker operations are I/O-heavy. F#’s async {} workflows handle concurrent container orchestration naturally: starting drones, streaming logs, polling status.

Shared types across the stack. Hivemind runs on the SAFE stack: Fable compiles F# to JavaScript for the frontend, Saturn handles the API, PostgreSQL stores state. The drone lifecycle types are shared between server and client; change a state type and both ends know at compile time.
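A flavor of that async style, with placeholder stubs instead of real Docker calls (`startContainer`, the image name, and the simulated delay are all illustrative):

```fsharp
// Sketch of fan-out/fan-in drone orchestration with async { }.
// startContainer stands in for a real Docker API call.
let startContainer (image: string) (task: string) : Async<string> =
    async {
        do! Async.Sleep 10                  // simulate container startup I/O
        return sprintf "%s/%s" image task   // pretend container id
    }

let runDrone (task: string) : Async<string> =
    async {
        let! containerId = startContainer "hivemind-drone" task
        // ...the real system would stream logs and poll phase status here...
        return containerId
    }

// Launch all drones concurrently; results come back in input order.
let runFleet (tasks: string list) : string[] =
    tasks
    |> List.map runDrone
    |> Async.Parallel
    |> Async.RunSynchronously
```

`Async.Parallel` is what makes "run 10 drones at once" a one-liner instead of hand-rolled thread management.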

What Actually Worked

Fleet operations. Three parallel drones fixed 7 API issues in about 90 minutes. Each picked up a task, ran the full pipeline, and opened a PR. I reviewed and merged. That kind of throughput isn’t something I could match manually.

Verify as a hard gate. Drones that can’t pass their own tests don’t progress to review. This single constraint caught most of the “looks right but isn’t” problems that come up constantly in manual AI workflows.

Plan-first. Giving the AI a planning phase before implementation made a big difference in output quality. A drone with a clear plan produces focused, scoped changes. Without one, you get unfocused exploratory code that touches way more than it should.

Dogfooding. Hivemind has 753+ tests. The first Hivemind CLI PR shipped through the full pipeline end to end. Using it on itself surfaces problems that synthetic benchmarks don’t.

What Didn’t

Cold starts. Docker image pulls take 2-3 minutes on first run. When you’re iterating on drone config and restarting frequently, that adds up. Image caching helps but it’s still the biggest friction point.

Fable’s generic type erasure. F# generics don’t survive compilation to JavaScript cleanly. I hit bugs where the server and client disagreed on type shapes at runtime, which is exactly the kind of thing F#’s type system is supposed to prevent. The fix: explicit type annotations everywhere Fable touches generics.

CI races from parallel drones. Multiple drones opening PRs against the same repo trigger competing CI runs and merge conflicts. Queue-based merging handles it now, but “embarrassingly parallel” has an asterisk when your target is a shared git repository.

The Numbers

I’m not going to claim 10x productivity gains. Here’s what I can actually back up:

  • Three parallel drones, seven resolved issues, 90 minutes. The same work manually would be most of a day.
  • 753+ tests in the Hivemind project, most generated and maintained through the pipeline itself.
  • Daily use across four active projects: Hivemind, its CLI, and two other production applications.
  • The full pipeline (plan through PR) runs end-to-end without human intervention on well-scoped tasks.

Well-scoped tasks with clear specs see big throughput gains. Ambiguous or architectural problems still need a human driving. Hivemind automates the execution of engineering decisions, not the decisions themselves.

What I’d Tell You If You’re Thinking About This

The AI conversation keeps circling around models. Which is smartest, which scores highest on benchmarks, which will “replace developers.” Meanwhile the same model swings 36 points on coding benchmarks depending entirely on the scaffolding around it.

If you’re getting value from AI coding tools individually but not seeing it show up in your team’s delivery metrics, the model probably isn’t the problem. The system around it is.

That’s what I’ve been building, and so far the results have held up.


I’m sharing what I learn as I build this. Connect with me on LinkedIn or follow along here.