Case Study

MetricFlow

The harness wrote the code. Here's what worked and what didn't.

A Phoenix LiveView marketing analytics platform built by CodeMySpec in 13 working days. OAuth integrations across five platforms, multi-tenant accounts with agency white-labeling, correlation analysis, AI insights. If the app holds up, the tool that built it holds up. Here are both sides.

13 Working Days · 51 Stories · 434 Acceptance Criteria · 50+ BDD Specs

What is MetricFlow?

MetricFlow is a Phoenix LiveView marketing analytics platform that correlates advertising spend across Google Ads, Facebook Ads, and Google Analytics with revenue data from QuickBooks. It offers multi-tenant accounts with role-based access, OAuth integrations for five platforms, correlation analysis with AI-powered insights, interactive Vega-Lite dashboards, and an AI chat interface for data exploration.

  • Google Analytics: website traffic, user behavior, and conversion tracking
  • Google Ads: advertising spend, campaign performance, and ROI tracking
  • Google Search Console: search performance, indexing status, and keyword rankings
  • QuickBooks: revenue, expenses, and daily credit/debit breakdowns
  • Facebook Ads: ad performance, reach, and campaign ROI

Key Features

  • Correlation analysis engine: raw and SmartAI modes with goal configuration
  • Interactive Vega-Lite dashboards with date range controls and platform filters
  • Multi-tenant accounts with agency white-labeling and client management
  • Automated daily data sync with Oban job processing
  • LLM-generated custom reports from natural language prompts
  • AI chat interface for data exploration
  • Full BDD specification framework with automated QA

User Stories

The stories that drove MetricFlow's development, from user registration through correlation analysis and AI chat. Every feature started as a story with testable acceptance criteria.

51 Stories · 434 Acceptance Criteria

Open Source Repository

MetricFlow is fully open source. Browse the code, check the commit history, and see the methodology in action.

View on GitHub
Language: Elixir / Phoenix
Database: PostgreSQL
Job Processing: Oban
Auth: OAuth2 + RBAC

The Dev Story

The Good, The Bad, and The Ugly

An honest assessment of building a full-stack Phoenix app with CodeMySpec. The calendar span is January 30 to March 20, 2026, but the actual work happened across just 13 working days. What we learned changed how the entire pipeline works.

The Velocity Was Real

The first commit landed on January 30: a Phoenix scaffold. Two days later, a 48-hour sprint laid down the entire foundation: Oban, Vault, PromEx, the BDD framework, and 25,000+ lines of specifications before a single line of business logic existed. By end of February, the core domain was complete: Accounts, Integrations, Metrics, DataSync with four providers, ~22,000 lines of unit tests, and 50+ BDD spex files.

The Architecture Held Up

Boundary-enforced module dependencies, context modules as the public API surface, %Scope{} threaded through every public function, repository pattern separating queries from business logic. The factory understood Phoenix conventions and produced idiomatic Elixir. The invitation system came with the full lifecycle (send, accept, cancel, role assignment, token invalidation) with appropriate authorization checks.
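The scope-threading pattern described above can be sketched in a few lines. This is an illustrative reconstruction, not MetricFlow's actual code; the module names, struct fields, and function names are all invented:

```elixir
# Hypothetical sketch of the %Scope{} pattern: every public context
# function takes the caller's scope first, so account isolation and
# authorization checks are impossible to forget.
defmodule Demo.Scope do
  @enforce_keys [:user_id, :account_id, :role]
  defstruct [:user_id, :account_id, :role]
end

defmodule Demo.Integrations do
  alias Demo.Scope

  # Returns only the integrations belonging to the caller's account.
  def list_integrations(%Scope{account_id: account_id}, all_integrations) do
    Enum.filter(all_integrations, &(&1.account_id == account_id))
  end

  # Role checks live at the context boundary, not in the LiveView.
  def delete_integration(%Scope{role: :admin}, _id), do: :ok
  def delete_integration(%Scope{}, _id), do: {:error, :unauthorized}
end
```

In a real app the filtering would happen in the repository's query layer rather than in memory, but the shape is the same: no public function is callable without a scope.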

The Dashboard Came Out Clean

The dashboard redesign is a good example of what the factory can do when the spec is clear. It generated a multi-series Vega-Lite chart with date range controls, platform filters, data tables, and proper Phoenix hooks for CSP-safe rendering. The QueryBuilder that feeds it was clean, tested, and worked on the first deploy.
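For a sense of what "multi-series Vega-Lite chart" means concretely, here is a minimal spec of that kind. The field names (`date`, `spend`, `platform`) are assumptions for illustration, not MetricFlow's actual schema:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Daily ad spend by platform (illustrative fields)",
  "mark": "line",
  "encoding": {
    "x": {"field": "date", "type": "temporal"},
    "y": {"field": "spend", "type": "quantitative"},
    "color": {"field": "platform", "type": "nominal"}
  }
}
```

In a LiveView, a spec like this is typically serialized server-side and handed to a JavaScript hook that renders it with vega-embed, which keeps rendering compatible with a strict Content Security Policy because no inline scripts are involved.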

QA Infrastructure Was the Best Output

Over the final two weeks, automated Vibium browser agents ran QA scenarios against 25+ stories, producing hundreds of screenshots and structured pass/fail reports. The QA agents found real bugs (nil account guards, broken token invalidation, missing navigation links, role select defaults) and the coding agents fixed them in the same commit cycle.

What We Changed After MetricFlow

The Pipeline Had Too Many Steps

MetricFlow ran the full six-phase pipeline: BDD specs, component specs, design review, unit tests, implementation, QA. Most of those intermediate steps added ceremony without catching real problems. The specs that mattered were the BDD scenarios at the top and the QA runs at the bottom. Everything in between was overhead that slowed the factory down without making the output better.

The Core Loop Is Three Steps

After MetricFlow, we stripped the pipeline to what actually works: BDD specs, code, QA. That's it. Write the scenarios that define what "done" looks like, let the machine build it, then verify it works in the running application. Component specs, design reviews, and unit tests are still available for teams that want them, but they're optional. The default path is the shortest one that ships working software.

Stories Were Flowing Into Specs Unexamined

The biggest problem wasn't the code generation. It was that ambiguous stories were flowing straight into BDD specs with no human gate. The spec inherited every vague assumption from the story, the step definitions inherited it from the spec, and the implementation inherited it from the steps. Ambiguity compounded through three layers. By the time QA caught a problem, the fix meant rewriting everything back to the story.

Example Mapping is the Missing Step

The fix is a structured discovery conversation between the story and the spec. It's called Example Mapping (from Matt Wynne and the Cucumber team): take a story, extract the business rules, walk through concrete examples, surface open questions, and produce scenario titles. Those titles become your acceptance criteria and your BDD scenarios. One artifact, not two. Then the human reviews the feature file before codegen starts. That single gate prevents most of what went wrong in MetricFlow.
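What an Example Mapping session produces is concrete enough to show. Here is an invented example for a hypothetical invitation story; the rules, scenarios, and strings are illustrative, not MetricFlow's actual feature files:

```gherkin
# Output of an Example Mapping session for one story (names invented).
# Each rule card becomes a Rule, each example card becomes a Scenario,
# and open questions are recorded for the human review gate.
Feature: Account invitations

  Rule: Only admins can send invitations

    Scenario: Admin invites a teammate
      Given an admin of the "Acme" account
      When they invite "dana@example.com" as a "member"
      Then an invitation email is sent to "dana@example.com"

    Scenario: Member cannot send an invitation
      Given a member of the "Acme" account
      When they try to invite "dana@example.com"
      Then they see "not authorized"

  # Open question: do pending invitations expire?
```

The scenario titles double as the story's acceptance criteria, which is what makes it one artifact instead of two.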

13 working days. 51 stories. 434 acceptance criteria. MetricFlow proved the harness works — and proved where it broke. The fix wasn't more automation. It was less pipeline and one better gate. Example Mapping into BDD specs, code, QA. Three steps. One human review. That's the loop that ships.