AI Research Group Releases Large Open-Text Dataset

EleutherAI, an AI research organization, has released the Common Pile v0.1, a large collection of licensed and open-domain text for training AI models. The dataset took around two years to complete and was built in collaboration with AI startups, academic institutions, and other organizations. At roughly 8 terabytes, it was used to train two new AI models that EleutherAI says rival those developed with unlicensed data.
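A corpus of this size is usually streamed rather than downloaded in full. The snippet below is a minimal sketch of that workflow using the Hugging Face datasets library; the repository name and the "text" field are hypothetical placeholders, not details confirmed by the article.

```python
# Minimal sketch: stream a very large open text corpus instead of
# downloading all ~8 TB locally. The repository id is a hypothetical
# placeholder; substitute the actual Common Pile repository name.
from datasets import load_dataset

dataset = load_dataset(
    "common-pile/placeholder-subset",  # hypothetical id, not from the article
    split="train",
    streaming=True,  # yields examples lazily over the network
)

# Inspect a few records without materializing the whole corpus.
for i, example in enumerate(dataset):
    print(example["text"][:200])  # assumes a "text" field, common for text corpora
    if i == 2:
        break
```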

The release comes as AI companies, including OpenAI, face lawsuits over training practices that rely on scraping the web, including copyrighted material, without permission. EleutherAI argues that these lawsuits have decreased transparency across the industry, making it harder to understand how models work and to identify their flaws.

In a blog post, Stella Biderman, EleutherAI’s executive director, wrote that the copyright lawsuits have not meaningfully changed data sourcing practices but have sharply reduced transparency. Researchers at some companies have cited the lawsuits as a reason they cannot release research in data-sensitive areas.

The Common Pile v0.1 was created in consultation with legal experts and draws on sources such as public-domain books digitized by the Library of Congress and the Internet Archive. EleutherAI used Whisper, OpenAI’s open-source speech-to-text model, to transcribe audio content.
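As an illustration of that transcription step, the snippet below is a minimal sketch using the openai-whisper Python package; the model size and the audio filename are illustrative assumptions, not details of EleutherAI's pipeline.

```python
# Minimal sketch of transcribing audio with OpenAI's open-source Whisper model.
# The "base" model size and the input filename are illustrative assumptions.
import whisper

model = whisper.load_model("base")        # downloads weights on first use
result = model.transcribe("lecture.mp3")  # runs speech-to-text on the file
print(result["text"])                     # the full transcript as a string
```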

EleutherAI claims the new models demonstrate that carefully curated open datasets can produce results competitive with proprietary alternatives. The models rival Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

As the amount of accessible openly licensed and public-domain data grows, EleutherAI expects the quality of models trained on open content to improve. The organization says it plans to release open datasets more frequently in collaboration with its partners.

Source: https://techcrunch.com/2025/06/06/eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text