AI & Machine Learning · sub-niche
LLM eval harnesses.
Reproducible eval suites that an AI-native team can drop into CI and trust by lunchtime.
Why now
Every model swap (GPT → Claude → Gemini → Llama variant) breaks the prompt graph. Teams need an eval layer that survives provider churn.
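One way to picture an eval layer that survives provider churn is the minimal sketch below. It assumes each provider can be wrapped as a plain function from prompt to completion; the names `EvalCase`, `run_suite`, `compare`, and the two stub models are hypothetical stand-ins, not any real SDK or harness API. The pass criterion (a case-insensitive substring match) is also an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A provider is any callable mapping a prompt string to a completion string.
# Real SDK calls would be wrapped behind this signature (assumption).
Provider = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # hypothetical pass criterion: case-insensitive substring


def run_suite(provider: Provider, cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(
        1
        for case in cases
        if case.must_contain.lower() in provider(case.prompt).lower()
    )
    return passed / len(cases)


def compare(providers: Dict[str, Provider], cases: List[EvalCase]) -> Dict[str, float]:
    """Score every provider against the same cases, so a swap is a one-line change."""
    return {name: run_suite(fn, cases) for name, fn in providers.items()}


# Stub providers standing in for real model calls (illustrative only).
def stub_model_a(prompt: str) -> str:
    return "Paris is the capital of France." if "France" in prompt else "I am not sure."


def stub_model_b(prompt: str) -> str:
    if "France" in prompt:
        return "The capital is Paris."
    if "2 + 2" in prompt:
        return "The answer is 4."
    return "I am not sure."


CASES = [
    EvalCase("What is the capital of France?", "paris"),
    EvalCase("What is 2 + 2?", "4"),
]

scores = compare({"model_a": stub_model_a, "model_b": stub_model_b}, CASES)
print(scores)
```

Because the suite only depends on the `Provider` signature, the cases and scoring stay untouched when the model behind the function changes. A CI job could fail the build when a score drops below a threshold the team picks.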
What the signal looks like
Repos crossing 1k stars inside a quarter, with the contributor list dominated by ML platform engineers from infra-heavy companies — not researchers.
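The star half of that signal can be checked mechanically. The sketch below takes one reading of "1k stars inside a quarter": the 1,000th star arrives within 90 days of the first. Both that reading and the 90-day window are assumptions, the function names are hypothetical, and the star dates are synthetic rather than pulled from any real repo. The contributor-mix half of the signal still needs a human look.

```python
from datetime import date, timedelta
from typing import List, Optional


def days_to_threshold(star_dates: List[date], threshold: int = 1000) -> Optional[int]:
    """Days from the first star to the threshold-th star, or None if never reached."""
    if len(star_dates) < threshold:
        return None
    ordered = sorted(star_dates)
    return (ordered[threshold - 1] - ordered[0]).days


def crossed_within_quarter(
    star_dates: List[date], threshold: int = 1000, quarter_days: int = 90
) -> bool:
    """True if the repo reached the threshold within quarter_days of its first star."""
    elapsed = days_to_threshold(star_dates, threshold)
    return elapsed is not None and elapsed <= quarter_days


# Synthetic data: one repo gaining 20 stars a day, one gaining 1 star a day.
start = date(2024, 1, 1)
fast_repo = [start + timedelta(days=i // 20) for i in range(1200)]
slow_repo = [start + timedelta(days=i) for i in range(1200)]

print(crossed_within_quarter(fast_repo), crossed_within_quarter(slow_repo))
```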
Public examples
We name public projects and categories only, never founders we track inside the paid product. The buyer’s edge stays inside the product.
- Promptfoo-style YAML eval harnesses
- DeepEval-style pytest plugins
- OpenAI Evals forks tuned to a single vertical
What this displaces
Hand-rolled notebook eval scripts and the prompt engineer's weekly Excel sheet.
Our build-vs-invest call
Build it as a vertical eval (legal, medical, code review) rather than a general harness — the general slot is crowded. The signal that something is breaking out: a single vertical's eval repo getting starred by three or more competing product teams in the same week.
Common questions about this niche
- Why is an eval harness a niche, not a feature of every LLM tool?
  Because evals are model-agnostic and product-agnostic: they belong in a separate layer that survives provider swaps. Teams that bury evals inside a single product end up with brittle CI.
- Should I build or fund?
  Build if you already have the vertical's golden dataset. Fund if you don't. The moat is data, not framework.
- What's the GitHub signal that an eval repo is going to raise?
  Star velocity is a weak signal; what matters is whether engineers from three or more named product companies are filing issues in the same month.
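The issue-filing signal in the last answer can be sketched as a count of distinct companies per calendar month. This is an illustration under stated assumptions: the `IssueEvent` record and `breakout_months` function are hypothetical, the mapping from an issue author to an employer is assumed to be resolved elsewhere, and the events below are made-up sample data.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Set


class IssueEvent(NamedTuple):
    opened_on: date
    author: str
    company: str  # hypothetical field: resolved elsewhere, empty if unknown


def breakout_months(events: List[IssueEvent], min_companies: int = 3) -> List[str]:
    """Return YYYY-MM keys where at least min_companies distinct companies filed issues."""
    companies_by_month: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        name = event.company.strip().lower()
        if name:  # skip authors with no known employer
            companies_by_month[event.opened_on.strftime("%Y-%m")].add(name)
    return sorted(
        month
        for month, companies in companies_by_month.items()
        if len(companies) >= min_companies
    )


# Made-up sample data: three companies in March, only two known companies in April.
events = [
    IssueEvent(date(2024, 3, 2), "dev1", "Acme"),
    IssueEvent(date(2024, 3, 9), "dev2", "Globex"),
    IssueEvent(date(2024, 3, 20), "dev3", "Initech"),
    IssueEvent(date(2024, 4, 1), "dev1", "Acme"),
    IssueEvent(date(2024, 4, 3), "dev4", "acme"),
    IssueEvent(date(2024, 4, 8), "dev2", "Globex"),
    IssueEvent(date(2024, 4, 15), "dev5", ""),
]

print(breakout_months(events))
```

Counting companies rather than issues keeps one noisy team from triggering the signal on its own.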
More inside AI & Machine Learning
- Agent orchestration frameworks — The 'LangChain for X' slot is still wide open — pick a vertical, ship the runtime, win the wedge.
- Retrieval-augmented search libraries — RAG-as-a-library — bring-your-own embedding, bring-your-own vector store, win on developer ergonomics.
- Fine-tuning tools for non-ML teams — Take fine-tuning out of the notebook. Product teams want to point at JSONL and get a deployable adapter.
- On-device LLM runtimes — Privacy, latency, cost — three reasons every app eventually wants a 3-8B model running on the user's machine.