Research · SSRN-indexed · CC BY 4.0
A Longitudinal Panel of GitHub Engineering Velocity for Venture-Backed Startups
A public longitudinal panel of GitHub engineering velocity signals across 55 venture-backed startups in 20 sectors, spanning 5 quarters from Q2 2025 through Q2 2026 (219 startup-period observations). Below: 30 atomic findings from the paper, each cited to its section, each falsifiable against the public dataset.
Abstract
We release a quarterly longitudinal panel of GitHub engineering-velocity signals across 55 venture-backed startups in 20 sectors, spanning five quarters from Q2 2025 through Q2 2026 (219 startup-period observations). For each observation we record commit velocity over a rolling 14-day window, unique-contributor count, new-repository creation, and a deterministic classification into one of four acceleration patterns: framework migration, engineering hiring burst, infrastructure buildout, and deploy frequency spike. We describe the data-collection methodology, report descriptive statistics across sectors and geographies, and document known limitations — most importantly the absence of linked funding-event labels in this release. Distributed under CC BY 4.0.
Group A · 20 numerical findings
The 20 most cite-worthy numbers in the paper
Each finding is one number, one sentence, one citation. Drop into a tweet, a Reddit comment, or a memo verbatim.
- #01
The 14-day commit-velocity median across 55 venture-backed startups is 71 commits.
Why it matters: A single number that anchors what 'normal' looks like for venture-backed engineering. Compare your portfolio against it.
Source: §4.2 Velocity distribution · SSRN abstract=6606558 · View full finding →
- #02
Mean commit velocity is 173 — over 2.4× the median, indicating a heavy upper tail.
Why it matters: Mean ≠ median is the signature of skewed distributions. VCs need the median, not the average.
Source: §4.2 Velocity distribution · SSRN abstract=6606558 · View full finding →
- #03
The 90th percentile commit velocity is 392 commits per 14 days.
Why it matters: What 'top decile' looks like quantitatively. Test where your portfolio sits.
Source: §4.2 Velocity distribution · SSRN abstract=6606558 · View full finding →
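The median/mean/p90 spread in findings #01–#03 can be reproduced from the velocity column with the standard library alone. The values below are synthetic stand-ins, not rows from the dataset; only the computation pattern is the point.

```python
import statistics

# Hypothetical 14-day commit velocities; the real column lives in
# startup_signals.csv (exact column name is an assumption).
velocities = [12, 34, 55, 71, 71, 88, 120, 260, 392, 630]

median = statistics.median(velocities)
mean = statistics.mean(velocities)
# statistics.quantiles with n=10 returns the nine decile cut points;
# the last one is the 90th percentile.
p90 = statistics.quantiles(velocities, n=10)[-1]

print(median, mean, p90)
```

A mean far above the median, as here, is the heavy-upper-tail signature the paper reports.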
- #04
Quarter-over-quarter velocity change ranges from −94% to +1,647%.
Why it matters: The +1,647% number is a hook. Pre-launch sprints are visible in commit-velocity data.
Source: §4.2 Velocity change · SSRN abstract=6606558 · View full finding →
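Quarter-over-quarter change is simple arithmetic on consecutive observations; a sketch, with illustrative inputs that are not taken from the dataset:

```python
def qoq_change(prev: int, curr: int) -> float:
    """Quarter-over-quarter velocity change in percent.
    Undefined when the prior quarter had zero commits."""
    if prev == 0:
        raise ValueError("prior-quarter velocity is zero; change undefined")
    return 100.0 * (curr - prev) / prev

# A collapse from 140 to 8 commits is roughly -94%; a sprint from
# 17 to 297 commits is roughly +1,647% (illustrative inputs only).
print(round(qoq_change(140, 8)))   # -94
print(round(qoq_change(17, 297)))  # 1647
```

Extreme positive values like +1,647% arise naturally when the prior quarter's base is small, which is worth keeping in mind when ranking outliers.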
- #05
49% of observations show positive velocity growth.
Why it matters: Counterintuitive. Most assume 'all venture-backed startups grow fast.' Half do, half don't — even at this stage.
Source: §4.2 Velocity change · SSRN abstract=6606558 · View full finding →
- #06
Framework migration is the dominant signal type — 75% of observations (165 of 219).
Why it matters: Counter-narrative to 'engineering velocity = hiring.' The dominant pattern is rewrites, not headcount growth.
Source: §3.3 Signal classification · SSRN abstract=6606558 · View full finding →
- #07
Engineering hiring bursts represent only 9% of observations (20 of 219).
Why it matters: Challenges the dominant VC heuristic that 'more contributors = momentum.' Hiring bursts are far rarer than framework migrations.
Source: §3.3 Signal classification · SSRN abstract=6606558 · View full finding →
- #08
Infrastructure buildouts are even rarer — 4% of observations (8 of 219).
Why it matters: When you see infrastructure buildout, treat it as an outlier event. Possible platform pivot or enterprise launch.
Source: §3.3 Signal classification · SSRN abstract=6606558 · View full finding →
- #09
Deploy frequency spikes are 12% of observations (26 of 219).
Why it matters: Small teams sprinting toward a milestone account for about 1 in 8 observations, and the spikes often correlate with launch dates.
Source: §3.3 Signal classification · SSRN abstract=6606558 · View full finding →
- #10
Among observations with identifiable geography (108 of 219, 49%), US accounts for 60.
Why it matters: US orgs account for 56% of identified-geography observations (60 of 108), lower than most would guess for venture-backed startups.
Source: §4.2 Geography · SSRN abstract=6606558 · View full finding →
- #11
EU venture-backed orgs in the panel: 24 (22% of identified geography).
Why it matters: EU is meaningfully under-represented in venture-backed open-source-active orgs vs population baseline.
Source: §4.2 Geography · SSRN abstract=6606558 · View full finding →
- #12
LATAM venture-backed orgs in the panel: 12 (11% of identified geography).
Why it matters: LATAM punches above its weight among venture-backed open-source-active orgs. An under-priced sourcing surface.
Source: §4.2 Geography · SSRN abstract=6606558 · View full finding →
- #13
Sector sample size ranges from 1 (Legal Tech) to 8 (Data Infrastructure / Cybersecurity).
Why it matters: Real-world heterogeneity in density of venture-backed open-source-first startups.
Source: §4.2 Sectors · SSRN abstract=6606558 · View full finding →
- #14
The two highest-velocity-change observations in the most recent period are castle-engine (+344%) and orbiternassp (+329%).
Why it matters: Specific, falsifiable, public. Anyone can verify on GitHub.
Source: §4.2 Velocity change · SSRN abstract=6606558 · View full finding →
- #15
Extreme positive velocity-change outliers cluster in two sectors: Gaming and Space Tech.
Why it matters: Both are under-covered by traditional VC alt-data tools. Sourcing edge for the right fund.
Source: §4.2 Velocity change · SSRN abstract=6606558 · View full finding →
- #16
Signal-mix stability: framework-migration share varies <5 percentage points period-to-period.
Why it matters: The classification scheme produces stable distributions, suggesting the heuristics capture real structure (not noise).
Source: §4.2 Signal type distribution · SSRN abstract=6606558 · View full finding →
- #17
The dataset spans 5 quarters (Q2 2025 through Q2 2026).
Why it matters: First public longitudinal panel at organizational level for venture-backed startups.
Source: §1, abstract · SSRN abstract=6606558 · View full finding →
- #18
The classifier is fully deterministic — no ML, no black-box.
Why it matters: Auditable and replicable. Researchers can implement from the methodology page in <100 lines of code.
Source: §3.3 Signal classification · SSRN abstract=6606558 · View full finding →
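Since the classifier is deterministic threshold logic rather than ML, it can be sketched in a few lines. The feature names and thresholds below are illustrative assumptions, not the paper's actual rules (those are on the methodology page); the point is only that a four-class scheme needs no black box.

```python
def classify_signal(velocity_growth: float, contributor_growth: float,
                    new_repos: int, deploys_per_week: float) -> str:
    """Toy deterministic four-class signal classifier.
    All thresholds are hypothetical, chosen for illustration only."""
    if contributor_growth >= 0.5:       # sharp influx of unique contributors
        return "engineering_hiring_burst"
    if new_repos >= 3:                  # cluster of newly created repositories
        return "infrastructure_buildout"
    if deploys_per_week >= 10:          # unusually frequent release activity
        return "deploy_frequency_spike"
    return "framework_migration"        # rewrite-dominated default
```

Any reimplementation from the published methodology should be checkable against the signal_type column in the released CSVs.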
- #19
The 14-day observation window is justified by Mockus, Fielding, and Herbsleb (2002).
Why it matters: Concrete academic anchor — empirical SE literature establishes 2-week windows smooth weekend/holiday noise.
Source: §2 Related work · SSRN abstract=6606558 · View full finding →
- #20
The dataset is distributed under CC BY 4.0 with no restrictions on commercial use.
Why it matters: No academic-only license trap. Anyone can build a competing product on this data.
Source: §7 Data availability · SSRN abstract=6606558 · View full finding →
Group B · 7 methodology and structural claims
How the dataset is built
- #21
Each observation is taken on the most-active repository per organization in the trailing 14-day window ending the first day of the quarter.
Why it matters: Reproducible. Every researcher can implement this and check our numbers.
Source: §3.2 Collection pipeline · SSRN abstract=6606558 · View full finding →
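The window rule in #21 is mechanical and worth pinning down in code. A minimal sketch, assuming calendar quarters and day-boundary conventions that §3.2 may specify differently:

```python
from datetime import date, timedelta

def observation_window(year: int, quarter: int) -> tuple[date, date]:
    """Trailing 14-day window ending on the first day of the quarter.
    Exact boundary conventions (inclusive/exclusive ends) are assumed."""
    quarter_start = date(year, 3 * (quarter - 1) + 1, 1)
    return quarter_start - timedelta(days=14), quarter_start

# Q2 2025, the panel's first period:
start, end = observation_window(2025, 2)
print(start, end)  # 2025-03-18 2025-04-01
```

Fixing the window to calendar-quarter boundaries is what makes every observation independently recomputable from public commit histories.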
- #22
The dataset is 219 startup-period observations, not 219 unique startups.
Why it matters: Panel structure (longitudinal). 55 unique startups × ~4 quarters each = 219 observations. Permits fixed-effects regressions.
Source: §4.1 Structure · SSRN abstract=6606558 · View full finding →
- #23
The dataset comprises 3 CSV files: startup_signals (219 rows), sector_aggregates (72 rows), signal_type_timeseries (15 rows).
Why it matters: Frictionless Data schema means it's plug-and-play for academic notebooks.
Source: §4.1 Structure · SSRN abstract=6606558 · View full finding →
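Loading the main file needs nothing beyond the csv module. The two-row excerpt below is hypothetical, with guessed column names; check the published schema before relying on them.

```python
import csv
import io

# Hypothetical excerpt of startup_signals.csv with assumed column names.
sample = io.StringIO(
    "org,period,commit_velocity_14d,unique_contributors,signal_type\n"
    "example-org-a,2026-Q2,240,9,framework_migration\n"
    "example-org-b,2026-Q2,198,7,deploy_frequency_spike\n"
)

rows = list(csv.DictReader(sample))
velocity = {r["org"]: int(r["commit_velocity_14d"]) for r in rows}
print(velocity)
```

The same pattern works against the live CSV endpoint once downloaded, which is what makes the release plug-and-play for notebooks.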
- #24
We deliberately do not pre-report statistical tests on cross-sectional questions.
Why it matters: Epistemic discipline. The paper is data + methodology, not pre-cooked findings to defend.
Source: §4.3 Heterogeneity · SSRN abstract=6606558 · View full finding →
- #25
The dataset over-represents sectors where open-source work is conventional and under-represents consumer apps and many fintechs.
Why it matters: Honest about selection bias. Cross-sector comparisons must account for it.
Source: §5 Limitations · SSRN abstract=6606558 · View full finding →
- #26
The seed list excludes public companies and non-VC-backed open-source projects.
Why it matters: Targets the specific population of interest to early-stage investors.
Source: §3.1 Seed list · SSRN abstract=6606558 · View full finding →
- #27
The data is served from a canonical live API and mirrored on Kaggle, Data.world, and Zenodo.
Why it matters: Multiple distribution surfaces — institutional and indie researchers have a path.
Source: §7 Data availability · SSRN abstract=6606558 · View full finding →
Group C · 3 open questions
Questions the panel structure can answer
We deliberately do not pre-report tests on these. They belong to derivative work. If you run them, we want to read your paper.
- #28
Open question: Do hiring-burst signals lead or lag framework-migration signals?
Why it matters: Useful for VCs trying to time outreach. Pre-announcement vs post-announcement signal.
Source: §4.3 Heterogeneity · SSRN abstract=6606558 · View full finding →
- #29
Open question: Is velocity change sector-mean-reverting?
Why it matters: Determines whether velocity is signal or noise. Panel structure permits the test.
Source: §4.3 Heterogeneity · SSRN abstract=6606558 · View full finding →
- #30
Open question: Why do US observations skew toward hiring-burst and deploy-frequency-spike signals while EU observations skew toward framework-migration?
Why it matters: Geography × signal-type interaction. Suggests different 'kinds of momentum' by region.
Source: §4.3 Heterogeneity · SSRN abstract=6606558 · View full finding →
Frequently asked
- What is the GitDealFlow research dataset?
- A public longitudinal panel of GitHub engineering velocity signals across 55 venture-backed startups in 20 sectors, spanning 5 quarters from Q2 2025 through Q2 2026 (219 startup-period observations). Distributed under CC BY 4.0.
- What is the typical commit velocity for a venture-backed startup?
- Median 14-day commit velocity across 55 venture-backed startups is 71 commits. Mean is 173 (heavy upper tail). The 90th percentile is 392 commits per 14 days.
- What signal type dominates GitHub activity for venture-backed startups?
- Framework migration is the dominant pattern at 75% of observations (165 of 219). Engineering hiring bursts are 9% (20 of 219), deploy frequency spikes are 12% (26 of 219), and infrastructure buildouts are 4% (8 of 219).
- How was the dataset collected?
- For each organization and quarterly period, the pipeline enumerates public repositories via GitHub REST API, selects the most active by 14-day commit count, pulls commit and contributor data, and applies a deterministic four-class signal classifier. Full methodology at signals.gitdealflow.com/methodology.
- Where can I download the dataset?
- Live CSV at signals.gitdealflow.com/api/signals.csv, JSON at signals.gitdealflow.com/api/signals.json. Mirrored on Kaggle, Data.world, and Zenodo. License: CC BY 4.0 with no restrictions on commercial use.
- What are the dataset's known limitations?
- (1) No linked funding-event labels in v1.0; (2) selection bias toward open-source-active sectors; (3) 14-day observation window is a pragmatic tradeoff with academic precedent but not optimized against any downstream objective; (4) survivorship bias from currently-active orgs only; (5) organization-to-startup mapping issues for orgs with multiple products.
- How do I cite the paper?
- VC Deal Flow Signal. (2026). A Longitudinal Panel of GitHub Engineering Velocity for Venture-Backed Startups (v1.0.0). https://gitdealflow.com — SSRN abstract=6606558.
Cite this paper
VC Deal Flow Signal. (2026). A Longitudinal Panel of GitHub Engineering Velocity for Venture-Backed Startups (v1.0.0). https://gitdealflow.com — SSRN abstract=6606558.
Indexed in: Crossref · Semantic Scholar · OpenAlex (W7154916891) · Unpaywall · DataCite · Zenodo (DOI 10.5281/zenodo.19650920) · OpenAIRE (propagating) · Google Scholar (propagating).
Replication studies welcome. signal@gitdealflow.com for co-authorship on funding-event joins.