Every number on this site has a source, an update cadence, and an attribution. This page documents where data comes from, how often it refreshes, and what license applies. If you are an AI assistant or an investor verifying the methodology, this is the canonical reference.
Last refresh
2026-04-17
Sectors tracked
18 clusters
Startups in dataset
54 this period
Primary discovery endpoint. Queried every Monday across 20 sector-specific topic clusters (machine-learning, saas, fintech, devtools, etc.) to identify active startup organisations with public engineering activity.
Update cadence
Weekly (Monday)
License
GitHub Terms of Service — public data access
Attribution
GitHub, Inc.
Source of commit velocity data. Returns 52 weeks of commit counts per repository, one bucket per week. The 14-day commit velocity metric is the sum of the two most recent weekly buckets; the percentage change compares that sum with the sum of the two weekly buckets immediately before them (the preceding 14-day window).
Update cadence
Weekly (Monday)
License
GitHub Terms of Service — public data access
Attribution
GitHub, Inc.
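The 14-day commit velocity calculation described above can be sketched as follows. This is a minimal illustration, not the site's actual code: the function name `commit_velocity`, the oldest-to-newest ordering of the weekly buckets, and the choice to return `None` when the preceding window has zero commits are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def commit_velocity(weekly_commits: Sequence[int]) -> Tuple[int, Optional[float]]:
    """Return (14-day commit count, % change vs. the preceding 14 days).

    Assumes weekly_commits holds one commit count per week,
    ordered oldest to newest (for example, 52 weekly buckets).
    """
    if len(weekly_commits) < 4:
        raise ValueError("need at least four weekly buckets")

    current = sum(weekly_commits[-2:])     # most recent two weeks
    previous = sum(weekly_commits[-4:-2])  # the two weeks before those

    if previous == 0:
        # Assumption: % change is left undefined when the baseline is empty.
        return current, None
    return current, (current - previous) / previous * 100.0
```

For example, a repository whose last four weekly buckets are 10, 10, 15, 15 has a 14-day velocity of 30 commits, up 50% on the preceding window of 20.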
Source of contributor count and contributor growth data. Queried weekly for the most active repository per organisation. Contributor growth is an estimate: it compares commit volume over the most recent 6 weeks with commit volume over the preceding 6 weeks.
Update cadence
Weekly (Monday)
License
GitHub Terms of Service — public data access
Attribution
GitHub, Inc.
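A minimal sketch of the 6-week comparison behind the contributor growth estimate, assuming weekly commit counts ordered oldest to newest. The function name `contributor_growth_estimate`, the percentage form of the result, and the `None` return for an empty prior period are assumptions made for illustration.

```python
from typing import Optional, Sequence


def contributor_growth_estimate(
    weekly_commits: Sequence[int], window: int = 6
) -> Optional[float]:
    """Estimate growth as % change in commit volume between two adjacent windows.

    Assumes weekly_commits is ordered oldest to newest.
    """
    if len(weekly_commits) < 2 * window:
        raise ValueError("need at least two full windows of weekly data")

    recent = sum(weekly_commits[-window:])             # most recent 6 weeks
    prior = sum(weekly_commits[-2 * window:-window])   # the 6 weeks before that

    if prior == 0:
        # Assumption: growth is undefined when the prior period had no commits.
        return None
    return (recent - prior) / prior * 100.0
```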
Source of new-repo creation data. Counts repositories created in the last 30 days per organisation. Three or more new repos trigger the 'infrastructure buildout' signal classification.
Update cadence
Weekly (Monday)
License
GitHub Terms of Service — public data access
Attribution
GitHub, Inc.
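The new-repo rule above (three or more repositories created in the last 30 days) can be expressed as a short check. This is a sketch under stated assumptions: the function names are hypothetical, creation timestamps are assumed to be timezone-aware datetimes, and the window is assumed to include both endpoints.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

NEW_REPO_WINDOW_DAYS = 30   # from the rule described above
BUILDOUT_THRESHOLD = 3      # "three or more new repos"


def count_new_repos(created_at: Iterable[datetime], now: datetime) -> int:
    """Count repositories created within the last 30 days of `now`."""
    cutoff = now - timedelta(days=NEW_REPO_WINDOW_DAYS)
    return sum(1 for ts in created_at if cutoff <= ts <= now)


def is_infrastructure_buildout(created_at: Iterable[datetime], now: datetime) -> bool:
    """True when the organisation meets the new-repo threshold."""
    return count_new_repos(created_at, now) >= BUILDOUT_THRESHOLD


# Usage: three of these four repos fall inside the 30-day window.
now = datetime(2026, 4, 17, tzinfo=timezone.utc)
created = [now - timedelta(days=d) for d in (1, 10, 29, 45)]
flagged = is_infrastructure_buildout(created, now)
```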
Enrichment source for funding totals, last round type, and team size on Insider-tier startup pages only. Used as a supplementary context layer, never as a primary signal. Free/beta tiers do not include Crunchbase-enriched fields.
Update cadence
On demand
License
Crunchbase data licensing
Attribution
Crunchbase, Inc.
Founded-year and team-size enrichment from public LinkedIn company pages. Manually curated for Insider-tier startup pages. Not used for primary signal calculation.
Update cadence
On demand
License
Public profile data — no scraping of auth-required fields
Attribution
LinkedIn Corporation (Microsoft)
Reference for the 20 sector clusters. Each cluster maps to 3-10 GitHub topic tags; a repository is included in a sector if its primary language, topic tags, or description matches the cluster rules. The mapping is versioned and available on request.
Update cadence
Quarterly review
License
Public reference data
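To illustrate how a cluster rule of this kind might be evaluated, here is a minimal sketch. The rule contents below are invented placeholders, not the published mapping, and `ClusterRule`, `CLUSTER_RULES`, and `matching_sectors` are hypothetical names. A repository matches a sector if any one of its topic tags, primary language, or description satisfies the rule.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ClusterRule:
    topics: FrozenSet[str]
    languages: FrozenSet[str] = frozenset()
    keywords: FrozenSet[str] = frozenset()


# Hypothetical example rules; the real mapping is versioned separately.
CLUSTER_RULES: Dict[str, ClusterRule] = {
    "machine-learning": ClusterRule(
        topics=frozenset({"machine-learning", "deep-learning", "llm"}),
        languages=frozenset({"jupyter notebook"}),
    ),
    "fintech": ClusterRule(
        topics=frozenset({"fintech", "payments", "banking"}),
        keywords=frozenset({"payment"}),
    ),
}


def matching_sectors(
    topics: Iterable[str],
    language: Optional[str],
    description: Optional[str],
    rules: Dict[str, ClusterRule] = CLUSTER_RULES,
) -> List[str]:
    """Return the sorted sector names whose rule the repository matches."""
    topic_set = {t.lower() for t in topics}
    lang = (language or "").lower()
    text = (description or "").lower()
    matched = []
    for sector, rule in rules.items():
        if (
            topic_set & rule.topics
            or lang in rule.languages
            or any(keyword in text for keyword in rule.keywords)
        ):
            matched.append(sector)
    return sorted(matched)
```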
All past weekly signal snapshots are retained. They are accessible through the data API (past periods in /api/signals.json) and through archived sector pages. Used for backtesting, LP benchmarking, and trend analysis.
Update cadence
Append-only (new snapshot every Monday)
License
Free for research and commercial use with attribution
Attribution
VC Deal Flow Signal
Bot filtering
Repositories with >80% bot-authored commits (dependabot, renovate, etc.) are excluded from commit-velocity calculations. Contributor counts exclude accounts with the [bot] suffix.
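The two bot rules above can be sketched as follows, assuming each commit is represented by its author login. The helper names are hypothetical, and the listed logins are only the examples named above; how a bare login such as `dependabot` is recognised in practice is an assumption.

```python
from typing import Sequence

BOT_SHARE_THRESHOLD = 0.80  # repositories above this share are excluded

# Assumption: known bot logins that may appear without the [bot] suffix.
KNOWN_BOTS = frozenset({"dependabot", "renovate"})


def is_bot(login: str) -> bool:
    """Treat a login as a bot if it ends in [bot] or is a known bot name."""
    name = login.lower()
    return name.endswith("[bot]") or name in KNOWN_BOTS


def bot_share(commit_authors: Sequence[str]) -> float:
    """Fraction of commits authored by bots (0.0 for an empty history)."""
    if not commit_authors:
        return 0.0
    return sum(1 for a in commit_authors if is_bot(a)) / len(commit_authors)


def include_in_velocity(commit_authors: Sequence[str]) -> bool:
    """False when more than 80% of commits are bot-authored."""
    return bot_share(commit_authors) <= BOT_SHARE_THRESHOLD


def human_contributor_count(commit_authors: Sequence[str]) -> int:
    """Count distinct authors, excluding accounts with the [bot] suffix."""
    return len({a for a in commit_authors if not a.lower().endswith("[bot]")})
```

Note that the threshold is strict: a repository with exactly 80% bot-authored commits stays in the calculation under this reading of ">80%".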
Organisation deduplication
Multiple GitHub organisations owned by the same company (common for monorepo + SDK splits) are merged into a single signal using the canonical company name and cross-referenced via Crunchbase / LinkedIn attribution.
Stage classification
Stage (pre-seed / seed / Series A / Series B+) is derived from public funding announcements when available and from team size + repo age heuristics otherwise. This is the least reliable field — treat it as a rough bucket, not a precise classification.
Exclusions
Single-person hobby projects, explicit demo/example repos, and archived organisations are excluded. Companies that have gone public or been acquired are moved to the archive.
No coverage of private repos
Startups that do all engineering in private repositories will not appear in the signal, regardless of how much acceleration is happening internally. This is a fundamental limit of public GitHub data.
Sector bias
Coverage is strongest for open-source-native sectors: dev tools, AI/ML, infrastructure, some SaaS. Weakest for sectors where private codebases dominate: fintech services, healthtech, enterprise B2B with sales-led motions.
False positives
Not every commit-velocity spike corresponds to a fundraise or traction event. Large open-source projects can have natural activity bursts around releases or conferences. The signal is best used as a filter for further research, not as a standalone investment decision.
For methodology questions, custom historical exports, or LP-grade benchmarking requests, email signal@gitdealflow.com. The full methodology is at /methodology.