Expanding beyond the modern code series, this release presents a massive historical snapshot from the Google Code Archive. This dataset captures the open-source landscape from 2006 to 2016, offering a unique time capsule of software development patterns during the era before GitHub's dominance.
Key Stats:
- 65,825,565 files from 488,618 repositories
- 47 GB compressed Parquet storage
- 454 programming languages (heavily featuring Java, PHP, and C++)
- Extensive quality filtering (excluding vendor code and build artifacts)
- Rich historical metadata: original repo names, file paths, and era-specific licenses
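As a rough illustration of how the per-file metadata might be used, here is a minimal sketch that tallies files per language. The column names (`repo_name`, `path`, `language`) and the helper `top_languages` are assumptions made for this example, not the dataset's confirmed schema, and the sample rows are made up.

```python
from collections import Counter
from typing import Iterable, Mapping


def top_languages(
    rows: Iterable[Mapping[str, object]],
    n: int = 3,
    column: str = "language",  # hypothetical column name, check the real schema
) -> list[tuple[str, int]]:
    """Return the n most frequent values of `column` across the given rows."""
    # Skip rows where the column is missing or empty.
    counts = Counter(str(row[column]) for row in rows if row.get(column))
    return counts.most_common(n)


# Tiny in-memory sample standing in for real rows (illustrative values only).
sample = [
    {"repo_name": "example-a", "path": "src/Main.java", "language": "Java"},
    {"repo_name": "example-a", "path": "src/Util.java", "language": "Java"},
    {"repo_name": "example-b", "path": "app/Core.java", "language": "Java"},
    {"repo_name": "example-c", "path": "index.php", "language": "PHP"},
    {"repo_name": "example-c", "path": "lib/db.php", "language": "PHP"},
    {"repo_name": "example-d", "path": "main.cpp", "language": "C++"},
]

print(top_languages(sample, n=2))
```

With 47 GB of Parquet, rows would more realistically be streamed in batches than loaded at once, for example via `pyarrow.parquet.ParquetFile(...).iter_batches(...)` and `RecordBatch.to_pylist()`, and fed into the same function.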
This is one of the releases I'm most interested in getting feedback on. Would you like to see more historical code datasets like this one?