AI & Machine Learning · sub-niche

LLM eval harnesses.

Reproducible eval suites that an AI-native team can drop into CI and trust by lunchtime.

Month-long build · Hot: multiple deals per month

Why now

Every model swap (GPT → Claude → Gemini → Llama variant) breaks the prompt graph. Teams need an eval layer that survives provider churn.
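One way to picture an eval layer that survives provider churn is to treat every model as a plain prompt-to-text callable and run the same cases against each one. The sketch below is illustrative only: `EvalCase`, `pass_rate`, `compare_providers`, and the two stub providers are hypothetical names invented for this example, and the substring check stands in for whatever grader a real suite would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "provider" is any callable mapping a prompt to a completion.
# Assumption: real SDK clients would be wrapped to fit this shape.
Provider = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # simple substring grader; real suites use richer checks


def pass_rate(provider: Provider, cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        return 0.0
    passed = sum(
        1
        for case in cases
        if case.must_contain.lower() in provider(case.prompt).lower()
    )
    return passed / len(cases)


def compare_providers(
    providers: Dict[str, Provider], cases: List[EvalCase]
) -> Dict[str, float]:
    """Run identical cases against every provider; return per-provider pass rates."""
    return {name: pass_rate(fn, cases) for name, fn in providers.items()}


# Hypothetical stand-ins for two model providers; no network calls are made.
def stub_provider_a(prompt: str) -> str:
    return "The capital of France is Paris." if "France" in prompt else "I am not sure."


def stub_provider_b(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


CASES = [
    EvalCase(prompt="What is the capital of France?", must_contain="paris"),
    EvalCase(prompt="What is 2 + 2?", must_contain="4"),
]

if __name__ == "__main__":
    print(compare_providers({"a": stub_provider_a, "b": stub_provider_b}, CASES))
```

Because the cases never mention a provider, swapping models means registering one more callable rather than rewriting the suite. A CI job could then fail the build when any pass rate drops below a threshold the team chooses.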

What the signal looks like

Repos crossing 1k stars inside a quarter, with the contributor list dominated by ML platform engineers from infra-heavy companies — not researchers.

Public examples

We name public projects and categories only, never founders we track inside the paid product. The buyer’s edge stays inside the product.

Promptfoo-style YAML eval harnesses
DeepEval-style pytest plugins
OpenAI Evals forks tuned to a single vertical

What this displaces

Hand-rolled notebook eval scripts and the prompt engineer's weekly Excel sheet.

Our build-vs-invest call

Build it as a vertical eval (legal, medical, code review) rather than a general harness — the general slot is crowded. The signal that something is breaking out: a single vertical's eval repo getting starred by three or more competing product teams in the same week.
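The breakout heuristic above can be sketched as a small grouping routine: bucket star events by repo and ISO week, then flag any bucket where at least three distinct companies appear. This is a rough illustration, not a description of any production pipeline. `StarEvent`, `breakout_weeks`, and the sample data are hypothetical, and the sketch assumes each star has already been attributed to an employer, which star data alone does not provide.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class StarEvent:
    repo: str
    company: str      # assumed: employer of the person who starred
    starred_on: date


def breakout_weeks(
    events: Iterable[StarEvent], min_companies: int = 3
) -> List[Tuple[str, int, int]]:
    """Return sorted (repo, iso_year, iso_week) keys where at least
    `min_companies` distinct companies starred the repo that week."""
    companies: Dict[Tuple[str, int, int], Set[str]] = defaultdict(set)
    for event in events:
        iso = event.starred_on.isocalendar()  # (year, week, weekday)
        companies[(event.repo, iso[0], iso[1])].add(event.company)
    return sorted(
        key for key, names in companies.items() if len(names) >= min_companies
    )


# Hypothetical sample data: repo and company names are made up.
EVENTS = [
    StarEvent("legal-evals", "company-a", date(2024, 3, 4)),
    StarEvent("legal-evals", "company-b", date(2024, 3, 5)),
    StarEvent("legal-evals", "company-c", date(2024, 3, 8)),
    StarEvent("legal-evals", "company-a", date(2024, 3, 8)),    # repeat company counts once
    StarEvent("medical-evals", "company-a", date(2024, 3, 4)),
    StarEvent("medical-evals", "company-b", date(2024, 3, 5)),
    StarEvent("medical-evals", "company-c", date(2024, 3, 11)),  # falls in the next week
]

if __name__ == "__main__":
    for repo, year, week in breakout_weeks(EVENTS):
        print(f"{repo}: breakout in ISO week {week} of {year}")
```

Counting distinct companies rather than raw stars keeps one enthusiastic team from triggering the flag, and the `min_companies` parameter makes the three-team threshold easy to tune.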

Common questions about this niche

Why is an eval harness a niche, not a feature of every LLM tool?

Because evals are model-agnostic and product-agnostic: they belong in a separate layer that survives provider swaps. Teams that bury evals inside a single product end up with brittle CI.

Should I build or fund?

Build if you already have the vertical's golden dataset. Fund if you don't: the moat is data, not framework.

What's the GitHub signal that an eval repo is going to raise?

Star velocity is a weak signal; what matters is whether engineers from three or more named product companies are filing issues in the same month.

Five breakout startups, every Sunday — before the round gets crowded

The free Acceleration Watch: five venture-backed teams accelerating on the engineering signal, translated into plain English — 21 to 47 days before the deck circulates. No code-reading, no card.


Signed The Data Nerd · pseudonymous narrator · methodology over personality

More inside AI & Machine Learning

Agent orchestration frameworks — The 'LangChain for X' slot is still wide open — pick a vertical, ship the runtime, win the wedge.
Retrieval-augmented search libraries — RAG-as-a-library — bring-your-own embedding, bring-your-own vector store, win on developer ergonomics.
Fine-tuning tools for non-ML teams — Take fine-tuning out of the notebook. Product teams want to point at JSONL and get a deployable adapter.
On-device LLM runtimes — Privacy, latency, cost — three reasons every app eventually wants a 3-8B model running on the user's machine.


