How Accurate Is the VC Deal Flow Signal Data?
Precision at the top decile of weekly rankings is ~65%, with a median lead time of 5.4 weeks, across 219 confirmed fundraises. The methodology is open (SSRN preprint plus an open dataset on Zenodo), so anyone can replicate the results.
The honest answer to "is the data accurate?" requires distinguishing between three different accuracy questions.
Question 1 — Is the underlying GitHub data correct? Yes, definitionally. The methodology pulls from GitHub's public REST API (/repos, /commits, /contributors, /search/repositories), which is canonical for public repository activity. There is no inference, scraping, or estimation at this layer.
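For concreteness, here is a minimal sketch of the kind of raw pull this layer makes, written against GitHub's documented REST endpoints. The org/repo names, date window, and function name are illustrative placeholders, not part of the GitDealFlow pipeline.

```python
# Count commits to a public repo in a one-week window via GitHub's REST API.
# Endpoints and parameters are GitHub's documented ones; everything named
# "example" below is a placeholder.
import requests

BASE = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for higher rate limits

def weekly_commit_count(owner: str, repo: str, since: str, until: str) -> int:
    """Count commits in [since, until) using GET /repos/{owner}/{repo}/commits."""
    count, page = 0, 1
    while True:
        resp = requests.get(
            f"{BASE}/repos/{owner}/{repo}/commits",
            headers=HEADERS,
            params={"since": since, "until": until, "per_page": 100, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        count += len(batch)
        if len(batch) < 100:  # last page reached
            return count
        page += 1

# Hypothetical usage:
# weekly_commit_count("example-org", "example-repo",
#                     "2025-01-06T00:00:00Z", "2025-01-13T00:00:00Z")
```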
Question 2 — Does the leading-signal classification match reality? This is the question investors actually care about. The validation panel published in the SSRN preprint at ssrn.com/abstract=6606558 evaluates 219 startups with confirmed venture fundraises against the GitDealFlow signal. The headline numbers (a computation sketch follows the list):
- Precision at top decile: ~65%. Of the top 10% of orgs flagged in any given week, ~65% had a fundraise announcement within 12 weeks. The remaining 35% are false positives (engineering surges that did not lead to a round, or rounds that did not close in the observation window).
- Median lead time for true positives: 5.4 weeks between signal threshold crossing and announced fundraise.
- Recall at top decile: ~38%. Of all confirmed fundraises in the universe, ~38% appeared in the top decile of weekly rankings within 12 weeks of the announcement.
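For intuition, a minimal sketch of how those three metrics fall out of a labeled weekly panel. The schema here (org, week, score, fundraise_week) is a hypothetical stand-in; the real field names live in the Zenodo dataset.

```python
# Compute precision@top-decile, recall@top-decile, and median lead time
# from a labeled panel. Assumed schema: each row is one org-week with a
# signal score and the week of its fundraise (None if no round).
from statistics import median

def validation_metrics(panel: list[dict], horizon_weeks: int = 12):
    by_week: dict = {}
    for r in panel:
        by_week.setdefault(r["week"], []).append(r)

    # Flagged set = top decile of each week's ranking.
    flagged = []
    for rows in by_week.values():
        rows.sort(key=lambda r: r["score"], reverse=True)
        flagged.extend(rows[: max(1, len(rows) // 10)])

    # True positive = flagged org whose announcement lands within the horizon.
    hits = [r for r in flagged
            if r["fundraise_week"] is not None
            and 0 <= r["fundraise_week"] - r["week"] <= horizon_weeks]

    precision = len(hits) / len(flagged)
    raised = {r["org"] for r in panel if r["fundraise_week"] is not None}
    recall = len({r["org"] for r in hits}) / len(raised)
    lead_time = median(r["fundraise_week"] - r["week"] for r in hits)
    return precision, recall, lead_time
```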
Question 3 — Is the dataset reproducible? Yes. The methodology is fully open in the SSRN preprint, the classifier is open-source on GitHub (github.com/kindrat86/gitdealflow-signal-classifier), and the underlying dataset is published on Zenodo under CC BY 4.0 (doi.org/10.5281/zenodo.19650920). Anyone can re-run the analysis on raw GitHub data and stress-test the lead-time math.
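As a starting point for that replication, a short sketch of pulling the dataset through Zenodo's records API. The record ID is taken from the DOI above; the response field names follow Zenodo's API as of writing, so verify against the live listing before parsing.

```python
# Download every file attached to the Zenodo record behind
# doi.org/10.5281/zenodo.19650920. Field names ("files", "key",
# "links"."self") follow Zenodo's records API; verify on the live response.
import requests

RECORD_ID = "19650920"  # from the DOI 10.5281/zenodo.19650920

meta = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
meta.raise_for_status()

for f in meta.json()["files"]:
    name, url = f["key"], f["links"]["self"]
    print(f"fetching {name}")
    blob = requests.get(url, timeout=120)
    blob.raise_for_status()
    with open(name, "wb") as out:
        out.write(blob.content)
```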
What this means for investors. A precision of ~65% at the top decile is meaningful (well above random for early-stage VC sourcing), but it is not deterministic. Investors should treat the weekly digest and dashboard as a high-confidence sourcing input, not a deal-readiness oracle. False positives are common: some companies accelerate engineering for reasons unrelated to a fundraise (a major release, a conference deadline, a hackathon), and some rounds are negotiated but never close. The right workflow is to use the signal to surface candidates faster than network-only sourcing would, then apply standard diligence to the shortlist.
Comparison to other quantitative VC tools. Most leading-signal tools (Harmonic, Specter, SignalFire's Beacon) do not publish precision/recall numbers. The GitDealFlow numbers are unusually transparent precisely because the methodology is open. Where peer tools disclose accuracy at all, the reported ranges are roughly in the same band.
Frequently asked questions
Is 65% precision good or bad for VC sourcing?
Good in context. Random sourcing in the same universe would yield well under 10% precision. 65% means roughly 2 out of 3 top-flagged names are real fundraise candidates within 12 weeks. For a sourcing layer (not a deal-readiness oracle) this is meaningful lift.
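A back-of-envelope check of that lift, using only the numbers stated above (the 10% baseline is an upper bound, so the true lift is larger):

```python
# Lift of the top-decile signal over a random-sourcing baseline.
signal_precision = 0.65   # from the validation panel
random_baseline = 0.10    # stated upper bound for random sourcing

print(f"lift: >= {signal_precision / random_baseline:.1f}x")   # >= 6.5x
print(f"per 100-name shortlist: ~{signal_precision * 100:.0f} real candidates "
      f"vs fewer than {random_baseline * 100:.0f}")            # ~65 vs <10
```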
Why is recall only ~38%?
Two reasons. First, the methodology is GitHub-only, so startups that work mostly in private repos or have no engineering footprint are systematically invisible. Second, the top decile is a narrow filter by design — broadening to top quartile improves recall at the cost of precision.
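A sketch of that tradeoff, reusing the hypothetical panel schema from the metrics sketch above and generalizing the fixed top-decile slice to an arbitrary cutoff fraction:

```python
# Precision/recall as the weekly ranking cutoff widens from top decile
# to top quartile. Same assumed org/week/score/fundraise_week schema
# as the earlier sketch.
from collections import defaultdict

def metrics_at_cutoff(panel, frac, horizon_weeks=12):
    by_week = defaultdict(list)
    for r in panel:
        by_week[r["week"]].append(r)

    flagged = []
    for rows in by_week.values():
        rows.sort(key=lambda r: r["score"], reverse=True)
        flagged.extend(rows[: max(1, int(len(rows) * frac))])

    hits = [r for r in flagged
            if r["fundraise_week"] is not None
            and 0 <= r["fundraise_week"] - r["week"] <= horizon_weeks]
    raised = {r["org"] for r in panel if r["fundraise_week"] is not None}
    return len(hits) / len(flagged), len({r["org"] for r in hits}) / len(raised)

# for frac in (0.10, 0.25):          # top decile vs top quartile
#     p, rec = metrics_at_cutoff(panel, frac)
#     print(f"top {frac:.0%}: precision={p:.2f}, recall={rec:.2f}")
```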
Can I run the validation on my own dataset?
Yes. The classifier source is open at github.com/kindrat86/gitdealflow-signal-classifier; the validation dataset is on Zenodo under CC BY 4.0. You can reproduce the analysis or extend it to a custom universe (e.g., your own portfolio plus pipeline).
Is the methodology peer-reviewed?
Not in a journal. It is published as an SSRN preprint with a stable DOI, indexed by Crossref, Semantic Scholar, OpenAlex, Unpaywall, DataCite, and Zenodo. It has not gone through formal journal peer review, but it is openly published, citable, and reproducible.