Clair Flow

· 3 min read

Cross-platform AI voice dictation — speak naturally, get polished text injected into any app.

The Problem

Voice dictation tools either produce garbage transcripts or lock you into a single platform. The good ones require cloud accounts with zero transparency about what happens to your audio. And none of them understand that “new paragraph” means a paragraph break, not the literal words “new paragraph.”

The Solution

Clair Flow is a voice dictation system built across three codebases: native clients for macOS and Linux, plus a hosted cloud service that handles the heavy lifting.

You press a hotkey, speak naturally, and polished text appears in whatever app you’re focused on. The system captures audio locally, streams it to the cloud for real-time transcription, then runs the raw transcript through LLM post-processing to clean up formatting, punctuation, and dictation commands before injecting the final text.

Architecture

┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│  Native      │────▶│  Cloud Service   │────▶│  Polished    │
│  Client      │ WS  │  (F# / Giraffe)  │     │  Text Back   │
│  (macOS/     │     │                  │     │  to Client   │
│   Linux)     │◀────│  STT + LLM Post  │◀────│              │
└──────────────┘     └──────────────────┘     └──────────────┘
       │                     │
       ▼                     ▼
┌──────────────┐     ┌──────────────────┐
│  Audio       │     │  AssemblyAI /    │
│  Capture     │     │  Deepgram STT    │
│  (local)     │     │  + Gemini /      │
└──────────────┘     │  Anthropic /     │
                     │  Cerebras LLM    │
                     └──────────────────┘

The pipeline in practice:

  1. Local audio capture at 16 kHz mono PCM — native APIs on each platform (AVAudioEngine on macOS, PipeWire/PulseAudio on Linux).
  2. WebSocket stream to the cloud service for real-time speech-to-text via AssemblyAI Universal-3 Pro or Deepgram.
  3. LLM post-processing — the raw transcript passes through a configurable post-processor (Gemini, Anthropic, or Cerebras) that handles punctuation, formatting, dictation commands, and domain-specific cleanup.
  4. Text injection — the polished result streams back to the client and gets pasted into the focused application.
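In the real pipeline, step 3 is an LLM, but the kind of transformation it performs can be sketched deterministically. The rule-based fallback below handles a few spoken commands ("new paragraph", "comma", "period"); the function name and command set are illustrative, not from the codebase:

```rust
/// Minimal rule-based sketch of dictation-command interpretation.
/// Clair Flow's actual post-processor is an LLM; this only shows
/// the shape of the transformation.
fn apply_dictation_commands(raw: &str) -> String {
    let tokens: Vec<&str> = raw.split_whitespace().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < tokens.len() {
        let lower = tokens[i].to_lowercase();
        // Two-token commands: "new paragraph" / "new line"
        if lower == "new" && i + 1 < tokens.len() {
            match tokens[i + 1].to_lowercase().as_str() {
                "paragraph" => { out.push_str("\n\n"); i += 2; continue; }
                "line" => { out.push('\n'); i += 2; continue; }
                _ => {}
            }
        }
        // One-token punctuation commands, else plain text
        match lower.as_str() {
            "comma" => out.push(','),
            "period" => out.push('.'),
            _ => {
                if !out.is_empty() && !out.ends_with('\n') {
                    out.push(' ');
                }
                out.push_str(tokens[i]);
            }
        }
        i += 1;
    }
    out
}
```

So `"hello comma world period new paragraph done"` becomes `"hello, world."`, a paragraph break, then `"done"` — exactly the "new paragraph means a paragraph break" behavior the post-processor exists to provide.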

Native Clients

macOS (Swift)

A menu bar utility app with:

  • Global hotkey for push-to-talk dictation
  • AVAudioEngine microphone capture
  • Keychain-backed device authentication
  • Accessibility-based text insertion with clipboard fallback
  • Permission diagnostics and audio device monitoring
  • Signed, notarized DMG distribution
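The client is Swift, but the push-to-talk lifecycle it manages around the global hotkey is platform-agnostic. A minimal sketch of that session state machine in Rust — the type and method names are invented for illustration, not taken from the app:

```rust
/// Hypothetical push-to-talk session lifecycle: hold the hotkey to
/// record, release to flush audio and await the polished transcript.
#[derive(Debug, PartialEq)]
enum Session {
    Idle,
    Recording,
    Processing,
}

impl Session {
    fn on_hotkey_down(self) -> Session {
        match self {
            Session::Idle => Session::Recording,
            other => other, // ignore presses while busy
        }
    }
    fn on_hotkey_up(self) -> Session {
        match self {
            Session::Recording => Session::Processing, // stream tail, await text
            other => other,
        }
    }
    fn on_text_injected(self) -> Session {
        match self {
            Session::Processing => Session::Idle,
            other => other,
        }
    }
}
```

Keeping the lifecycle this explicit is what lets the client ignore stray hotkey events while a transcript is still in flight.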

Linux (Rust)

A daemon + CLI architecture built for composability:

  • clairflowd — background daemon managing dictation sessions
  • clairflowctl — CLI for sign-in, dictation control, status
  • Unix socket IPC between daemon and CLI
  • systemd user service with Waybar integration
  • First-class Omarchy/Hyprland support
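The daemon/CLI split can be sketched with the standard library's Unix sockets. This is a minimal line-oriented request/response illustration, not the actual clairflowd protocol — the socket path, commands, and replies are all invented:

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::thread;

/// Stand-in for clairflowd: answer one line-oriented request on a
/// Unix socket, then exit. The real daemon serves many sessions.
fn serve(path: &str) -> thread::JoinHandle<()> {
    let _ = std::fs::remove_file(path); // clear stale socket
    let listener = UnixListener::bind(path).expect("bind socket");
    thread::spawn(move || {
        if let Ok((stream, _)) = listener.accept() {
            let mut line = String::new();
            BufReader::new(&stream).read_line(&mut line).ok();
            let reply = match line.trim() {
                "status" => "idle\n",
                "start" => "recording\n",
                _ => "unknown\n",
            };
            (&stream).write_all(reply.as_bytes()).ok();
        }
    })
}

/// Stand-in for clairflowctl: send one command, read one reply.
fn request(path: &str, cmd: &str) -> String {
    let mut stream = UnixStream::connect(path).expect("connect");
    writeln!(stream, "{cmd}").expect("write");
    let mut reply = String::new();
    BufReader::new(stream).read_line(&mut reply).ok();
    reply.trim().to_string()
}
```

A socket under the user's runtime directory plus a text protocol is what makes the CLI composable — Waybar or any script can query daemon state the same way `clairflowctl` does.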

Cloud Backend (F# / SAFE Stack)

The hosted service handles everything the clients shouldn’t:

  • Multi-provider STT — runtime switching between AssemblyAI and Deepgram based on availability and quality.
  • AI post-processing — configurable LLM pipeline with provider fallback. Cerebras for speed, Gemini/Anthropic for quality.
  • Custom glossaries — per-user and per-team word lists for domain-specific accuracy (medical terms, product names, jargon).
  • Device auth — email magic links for browser sessions, long-lived per-device API keys for native clients. No passwords.
  • Billing — Stripe integration with usage-based metering.
  • Dashboard — Fable/Feliz/Elmish SPA for account management, device approvals, usage tracking, and glossary editing.
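One simple way a glossary can be applied is post-hoc substitution over the transcript. The sketch below assumes a per-user map from common mis-transcriptions to canonical terms; the entries, function, and mechanism are illustrative, not necessarily how the service wires glossaries into its STT and LLM providers:

```rust
use std::collections::HashMap;

/// Sketch of glossary-driven cleanup: replace known mis-heard words
/// with their canonical domain-specific spellings. Matching here is
/// naive whole-word lookup, purely for illustration.
fn apply_glossary(transcript: &str, glossary: &HashMap<&str, &str>) -> String {
    transcript
        .split_whitespace()
        .map(|w| *glossary.get(w.to_lowercase().as_str()).unwrap_or(&w))
        .collect::<Vec<_>>()
        .join(" ")
}
```

In practice real STT providers also accept custom vocabulary hints up front, so a production glossary likely feeds both stages rather than only patching output.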

Key Decisions

  • Native clients, not Electron. Dictation needs sub-second latency and deep OS integration (accessibility APIs, audio hardware, system services). Web tech can’t deliver that.
  • Rust for Linux. Low resource overhead for a background daemon, strong async ecosystem (Tokio), and zero-cost abstractions for the audio pipeline.
  • F# for the cloud. Same language family as Hivemind. Strong typing catches streaming state machine bugs at compile time. Fable.Remoting gives type-safe client-server contracts for free.
  • Multi-provider STT. No single provider wins on all dimensions. AssemblyAI handles accuracy, Deepgram handles speed. The system can fail over transparently.
  • LLM post-processing as a feature. Raw STT output is never good enough for professional use. Running it through an LLM to handle formatting, punctuation, and command interpretation is what makes dictation actually usable.
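Transparent failover reduces to trying providers in priority order behind a uniform interface. The trait, error type, and stand-in providers below are invented for illustration — only the provider names come from the post:

```rust
/// Hypothetical uniform interface over STT providers.
trait SttProvider {
    fn name(&self) -> &'static str;
    fn transcribe(&self, audio: &[u8]) -> Result<String, String>;
}

struct Flaky;   // stand-in for a provider that is currently down
struct Healthy; // stand-in for a provider that responds

impl SttProvider for Flaky {
    fn name(&self) -> &'static str { "assemblyai" }
    fn transcribe(&self, _audio: &[u8]) -> Result<String, String> {
        Err("upstream timeout".into())
    }
}

impl SttProvider for Healthy {
    fn name(&self) -> &'static str { "deepgram" }
    fn transcribe(&self, _audio: &[u8]) -> Result<String, String> {
        Ok("hello world".into())
    }
}

/// Try each provider in priority order; return the first success
/// along with which provider produced it.
fn transcribe_with_failover(
    providers: &[&dyn SttProvider],
    audio: &[u8],
) -> Result<(&'static str, String), String> {
    let mut last_err = String::from("no providers configured");
    for p in providers {
        match p.transcribe(audio) {
            Ok(text) => return Ok((p.name(), text)),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

The client never sees which provider answered — the cloud service owns the priority list and can reorder it on availability or quality signals.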

Status

Active development. The macOS client ships with push-to-talk dictation, the Linux client runs as a user service on Omarchy, and the cloud backend handles streaming sessions and post-processing.