Research · SSRN-indexed · CC BY 4.0
A Longitudinal Panel of GitHub Engineering Velocity for Venture-Backed Startups
A public longitudinal panel of GitHub engineering velocity signals across 55 venture-backed startups in 20 sectors, spanning 5 quarters from Q2 2025 through Q2 2026 (219 startup-period observations). Below: 30 atomic findings from the paper, each cited to its section, each falsifiable against the public dataset. If your question is whether GitHub activity can surface startup momentum before a round gets crowded, this is the main evidence page.
Citation-ready dataset note
The 219-observation panel is descriptive: it characterizes GitHub engineering-velocity signals (commit velocity, contributor growth, repository creation) across 55 venture-backed startups and deliberately carries no funding-event labels. Our working hypothesis, that sustained acceleration precedes announced venture-fundraise events by roughly three to six weeks, is not yet validated: it is being tracked openly on the public scorecard (ungraded so far), and we invite replications that join the panel to funding data. Dataset source: SSRN abstract 6606558. Cite as: VC Deal Flow Signal (2026) · doi.org/10.2139/ssrn.6606558.
Abstract
We release a quarterly longitudinal panel of GitHub engineering-velocity signals across 55 venture-backed startups in 20 sectors, spanning five quarters from Q2 2025 through Q2 2026 (219 startup-period observations). For each observation we record commit velocity over a rolling 14-day window, unique-contributor count, new-repository creation, and a deterministic classification into one of four acceleration patterns: framework migration, engineering hiring burst, infrastructure buildout, and deploy frequency spike. We describe the data-collection methodology, report descriptive statistics across sectors and geographies, and document known limitations — most importantly the absence of linked funding-event labels in this release. Distributed under CC BY 4.0.
Group A · 20 numerical findings
The 20 most cite-worthy numbers in the paper
Each finding is one number, one sentence, one citation. Drop into a tweet, a Reddit comment, or a memo verbatim.
- #01
The 14-day commit-velocity median across 55 venture-backed startups is 71 commits.
Why it matters: A single number that anchors what 'normal' looks like for venture-backed engineering. Compare your portfolio against it.
Source: §4.2 Velocity distribution · SSRN abstract=6606558
- #02
Mean commit velocity is 173 — over 2.4× the median, indicating a heavy upper tail.
Why it matters: Mean ≠ median is the signature of skewed distributions. VCs need the median, not the average.
Source: §4.2 Velocity distribution · SSRN abstract=6606558
- #03
The 90th percentile commit velocity is 392 commits per 14 days.
Why it matters: What 'top decile' looks like quantitatively. Test where your portfolio sits.
Source: §4.2 Velocity distribution · SSRN abstract=6606558
- #04
Quarter-over-quarter velocity change ranges from −94% to +1,647%.
Why it matters: The +1,647% number is a hook. Pre-launch sprints are visible in commit-velocity data.
Source: §4.2 Velocity change · SSRN abstract=6606558
- #05
49% of observations show positive velocity growth.
Why it matters: Counterintuitive. Most assume 'all venture-backed startups grow fast.' Half do, half don't — even at this stage.
Source: §4.2 Velocity change · SSRN abstract=6606558
- #06
Framework migration is the dominant signal type — 75% of observations (165 of 219).
Why it matters: Counter-narrative to 'engineering velocity = hiring.' The dominant pattern is rewrites, not headcount growth.
Source: §3.3 Signal classification · SSRN abstract=6606558
- #07
Engineering hiring bursts represent only 9% of observations (20 of 219).
Why it matters: Cuts against the common VC heuristic that 'more contributors = momentum.' Hiring bursts are the second-rarest signal type in the panel.
Source: §3.3 Signal classification · SSRN abstract=6606558
- #08
Infrastructure buildouts are even rarer — 4% of observations (8 of 219).
Why it matters: When you see infrastructure buildout, treat it as an outlier event. Possible platform pivot or enterprise launch.
Source: §3.3 Signal classification · SSRN abstract=6606558
- #09
Deploy frequency spikes are 12% of observations (26 of 219).
Why it matters: Small teams sprinting toward a milestone are about 1 in 8. Often correlates with launch dates.
Source: §3.3 Signal classification · SSRN abstract=6606558
- #10
Among observations with identifiable geography (108 of 219, 49%), the US accounts for 60 (56%).
Why it matters: The US share of geography-identified observations is 56%, lower than many would guess for a VC-backed sample.
Source: §4.2 Geography · SSRN abstract=6606558
- #11
EU observations in the panel: 24 (22% of the 108 observations with identified geography).
Why it matters: EU is meaningfully under-represented in venture-backed open-source-active orgs vs population baseline.
Source: §4.2 Geography · SSRN abstract=6606558
- #12
LATAM observations in the panel: 12 (11% of the 108 observations with identified geography).
Why it matters: LATAM punches above weight in venture-backed open-source-active. Under-priced sourcing surface.
Source: §4.2 Geography · SSRN abstract=6606558
- #13
Sector sample size ranges from 1 (Legal Tech) to 8 (Data Infrastructure / Cybersecurity).
Why it matters: Real-world heterogeneity in density of venture-backed open-source-first startups.
Source: §4.2 Sectors · SSRN abstract=6606558
- #14
The two highest-velocity-change observations in the most recent period are castle-engine (+344%) and orbiternassp (+329%).
Why it matters: Specific, falsifiable, public. Anyone can verify on GitHub.
Source: §4.2 Velocity change · SSRN abstract=6606558
- #15
Extreme positive velocity-change outliers cluster in two sectors: Gaming and Space Tech.
Why it matters: Both are under-covered by traditional VC alt-data tools. Sourcing edge for the right fund.
Source: §4.2 Velocity change · SSRN abstract=6606558
- #16
Signal-mix stability: framework-migration share varies by less than 5 percentage points from period to period.
Why it matters: The classification scheme produces stable distributions, suggesting the heuristics capture real structure (not noise).
Source: §4.2 Signal type distribution · SSRN abstract=6606558
- #17
The dataset spans 5 quarters (Q2 2025 through Q2 2026).
Why it matters: First public longitudinal panel at organizational level for venture-backed startups.
Source: §1, abstract · SSRN abstract=6606558
- #18
The classifier is fully deterministic — no ML, no black-box.
Why it matters: Auditable and replicable. Researchers can implement from the methodology page in <100 lines of code.
Source: §3.3 Signal classification · SSRN abstract=6606558
- #19
The 14-day observation window is justified by Mockus, Fielding, and Herbsleb (2002).
Why it matters: Concrete academic anchor: the empirical SE literature is cited as establishing that 2-week windows smooth weekend and holiday noise.
Source: §2 Related work · SSRN abstract=6606558
- #20
The dataset is distributed under CC BY 4.0 with no restrictions on commercial use.
Why it matters: No academic-only license trap. Anyone can build a competing product on this data.
Source: §7 Data availability · SSRN abstract=6606558
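The percentages in findings #02 and #06 through #12 can be re-derived from the counts quoted above. The following is a minimal sketch that uses only numbers stated on this page; the dictionary keys and the helper names `shares` and `qoq_change_pct` are illustrative assumptions, not identifiers from the dataset, and the quarter-over-quarter formula is the conventional percent-change definition rather than one confirmed by the paper.

```python
# Counts quoted in findings #06-#09 (signal types) and #10-#12 (geography).
# Key names are illustrative, not the dataset's own labels.
SIGNAL_COUNTS = {
    "framework_migration": 165,
    "engineering_hiring_burst": 20,
    "infrastructure_buildout": 8,
    "deploy_frequency_spike": 26,
}
GEO_COUNTS = {"US": 60, "EU": 24, "LATAM": 12}
GEO_IDENTIFIED = 108  # observations with identifiable geography (finding #10)


def shares(counts, total):
    """Return each count as a whole-number percentage of total."""
    return {key: round(100 * n / total) for key, n in counts.items()}


def qoq_change_pct(previous, current):
    """Conventional quarter-over-quarter percent change (assumed definition)."""
    if previous == 0:
        raise ValueError("previous velocity must be non-zero")
    return 100 * (current - previous) / previous


signal_total = sum(SIGNAL_COUNTS.values())
signal_shares = shares(SIGNAL_COUNTS, signal_total)
geo_shares = shares(GEO_COUNTS, GEO_IDENTIFIED)
mean_to_median = 173 / 71  # reported mean and median from findings #01-#02

print(signal_total, signal_shares, geo_shares, round(mean_to_median, 2))
```

Under that percent-change definition, a repository going from 100 to 1,747 commits between windows would register +1,647%, and one going from 100 to 6 would register −94%, which gives a feel for the range in finding #04.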
Group B · 7 methodology and structural claims
How the dataset is built
- #21
Each observation is taken on the most-active repository per organization in the trailing 14-day window ending the first day of the quarter.
Why it matters: Reproducible. Every researcher can implement this and check our numbers.
Source: §3.2 Collection pipeline · SSRN abstract=6606558
- #22
The dataset is 219 startup-period observations, not 219 unique startups.
Why it matters: Panel structure (longitudinal). 55 unique startups, each observed in about 4 of the 5 quarters on average (219 / 55 ≈ 3.98), give 219 observations. Permits fixed-effects regressions.
Source: §4.1 Structure · SSRN abstract=6606558
- #23
The dataset is 3 CSV files: startup_signals (219 rows), sector_aggregates (72), signal_type_timeseries (15).
Why it matters: Frictionless Data schema means it's plug-and-play for academic notebooks.
Source: §4.1 Structure · SSRN abstract=6606558
- #24
We deliberately do not pre-report statistical tests on cross-sectional questions.
Why it matters: Epistemic discipline. The paper is data + methodology, not pre-cooked findings to defend.
Source: §4.3 Heterogeneity · SSRN abstract=6606558
- #25
The dataset over-represents sectors where open-source work is conventional and under-represents consumer apps and many fintechs.
Why it matters: Honest about selection bias. Cross-sector comparisons must account for it.
Source: §5 Limitations · SSRN abstract=6606558
- #26
The seed list excludes public companies and non-VC-backed open-source projects.
Why it matters: Targets the specific population of interest to early-stage investors.
Source: §3.1 Seed list · SSRN abstract=6606558
- #27
The data is served from the canonical live API and mirrored on Kaggle, Data.world, and Zenodo.
Why it matters: Multiple distribution surfaces — institutional and indie researchers have a path.
Source: §7 Data availability · SSRN abstract=6606558
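The observation rule in finding #21 (most-active repository in the trailing 14-day window ending the first day of the quarter) can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline: the function names are hypothetical, the window is assumed to be half-open and to end at 00:00 UTC on the quarter's first day, and ties are assumed to break alphabetically. Commit timestamps are passed in directly, so no GitHub API calls are shown.

```python
from datetime import date, datetime, timedelta, timezone

WINDOW_DAYS = 14


def quarter_start(year, quarter):
    """First calendar day of a quarter (quarter is 1-4)."""
    return date(year, 3 * (quarter - 1) + 1, 1)


def window_bounds(year, quarter):
    """Trailing 14-day window as a half-open interval [start, end).

    Assumption: the window ends at 00:00 UTC on the quarter's first day.
    """
    end = datetime.combine(
        quarter_start(year, quarter), datetime.min.time(), tzinfo=timezone.utc
    )
    return end - timedelta(days=WINDOW_DAYS), end


def most_active_repo(commits_by_repo, year, quarter):
    """Return (repo_name, commit_count) for the busiest repo in the window.

    commits_by_repo maps a repo name to a list of timezone-aware commit
    datetimes. Returns None when the mapping is empty.
    """
    start, end = window_bounds(year, quarter)
    counts = {
        repo: sum(start <= ts < end for ts in stamps)
        for repo, stamps in commits_by_repo.items()
    }
    if not counts:
        return None
    # Highest count wins; ties break alphabetically (an assumption) so the
    # selection stays deterministic.
    repo = min(counts, key=lambda name: (-counts[name], name))
    return repo, counts[repo]
```

For Q2 2026 this window runs from 18 March 2026 up to, but not including, 1 April 2026. Whether the paper's window includes the first day of the quarter is not stated on this page, so the boundary choice should be checked against the methodology before comparing numbers.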
Group C · 3 open questions
Questions the panel structure can answer
We deliberately do not pre-report tests on these. They belong to derivative work. If you run them, we want to read your paper.
- #28
Open question: Do hiring-burst signals lead or lag framework-migration signals?
Why it matters: Useful for VCs trying to time outreach. Pre-announcement vs post-announcement signal.
Source: §4.3 Heterogeneity · SSRN abstract=6606558
- #29
Open question: Is velocity change sector-mean-reverting?
Why it matters: Determines whether velocity is signal or noise. Panel structure permits the test.
Source: §4.3 Heterogeneity · SSRN abstract=6606558
- #30
Open question: Do US observations skew toward hiring-burst and deploy-frequency-spike signals while EU observations skew toward framework migration?
Why it matters: Geography × signal-type interaction. A confirmed skew would suggest different 'kinds of momentum' by region.
Source: §4.3 Heterogeneity · SSRN abstract=6606558
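A first descriptive pass at the geography × signal-type question could be a row-normalised cross-tabulation, sketched below. The field names `region` and `signal_type` are assumptions about the CSV schema, not confirmed column names, and the function is a hypothetical helper. It reports shares only; it is not a significance test.

```python
from collections import Counter, defaultdict


def crosstab_shares(rows, row_key="region", col_key="signal_type"):
    """For each region, the fraction of its observations in each signal type.

    rows is an iterable of dicts; row_key and col_key are assumed field names.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[row_key]][row[col_key]] += 1
    return {
        region: {sig: n / sum(by_type.values()) for sig, n in by_type.items()}
        for region, by_type in counts.items()
    }
```

With only 108 geography-identified observations spread over four signal types, several cells will be small, so any derivative test should account for sparse counts.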
Frequently asked
- What is the GitDealFlow research dataset?
- A public longitudinal panel of GitHub engineering velocity signals across 55 venture-backed startups in 20 sectors, spanning 5 quarters from Q2 2025 through Q2 2026 (219 startup-period observations). Distributed under CC BY 4.0.
- What is the typical commit velocity for a venture-backed startup?
- Median 14-day commit velocity across 55 venture-backed startups is 71 commits. Mean is 173 (heavy upper tail). The 90th percentile is 392 commits per 14 days.
- What signal type dominates GitHub activity for venture-backed startups?
- Framework migration is the dominant pattern at 75% of observations (165 of 219). Engineering hiring bursts are 9% (20 of 219), deploy frequency spikes are 12% (26 of 219), and infrastructure buildouts are 4% (8 of 219).
- How was the dataset collected?
- For each organization and quarterly period, the pipeline enumerates public repositories via the GitHub REST API, selects the most active by 14-day commit count, pulls commit and contributor data, and applies a deterministic four-class signal classifier. Full methodology at signals.gitdealflow.com/methodology.
- Where can I download the dataset?
- Live CSV at signals.gitdealflow.com/api/signals.csv, JSON at signals.gitdealflow.com/api/signals.json. Mirrored on Kaggle, Data.world, and Zenodo. License: CC BY 4.0 with no restrictions on commercial use.
- What are the dataset's known limitations?
- (1) No linked funding-event labels in v1.0; (2) selection bias toward open-source-active sectors; (3) 14-day observation window is a pragmatic tradeoff with academic precedent but not optimized against any downstream objective; (4) survivorship bias from currently-active orgs only; (5) organization-to-startup mapping issues for orgs with multiple products.
- How do I cite the paper?
- VC Deal Flow Signal. (2026). A Longitudinal Panel of GitHub Engineering Velocity for Venture-Backed Startups (v1.0.0). https://gitdealflow.com — SSRN abstract=6606558.
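For readers who want to pull the live CSV named in the download answer above, a minimal standard-library sketch follows. The `https://` scheme is an assumption (the page lists the host and path without one), and the function names are hypothetical. No column names are assumed: rows are keyed by whatever header the file provides.

```python
import csv
import io
from urllib.request import urlopen

# Host and path as listed in the FAQ; the https scheme is assumed.
SIGNALS_CSV_URL = "https://signals.gitdealflow.com/api/signals.csv"


def parse_signals(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fetch_signals(url=SIGNALS_CSV_URL, timeout=30):
    """Download and parse the live CSV (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_signals(response.read().decode("utf-8"))
```

A quick sanity check after `rows = fetch_signals()` is to compare `len(rows)` with the 219 rows reported for startup_signals.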
Cite this paper
VC Deal Flow Signal. (2026). A Longitudinal Panel of GitHub Engineering Velocity for Venture-Backed Startups (v1.0.0). https://gitdealflow.com — SSRN abstract=6606558.
Indexed in: Crossref · Semantic Scholar · OpenAlex (W7154916891) · Unpaywall · DataCite · Zenodo (DOI 10.5281/zenodo.19650920) · OpenAIRE (propagating) · Google Scholar (propagating).
Replication studies are welcome. Contact signal@gitdealflow.com about co-authorship on funding-event joins.