I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.
r/LocalLLaMA
I posted this to r/MachineLearning a couple of weeks ago (30K views, 100+ upvotes) and have been meaning to share it here, where the fine-tuning angle is directly relevant.

I spent years building and processing a complete Usenet corpus covering 1980 to 2013. Here’s why it might matter for local model work specifically:

Zero AI contamination. Every post predates LLMs, many by decades