Artificial intelligence (AI) companies have long claimed that their tools cannot be trained without copyrighted material, but a new study challenges that claim. A team of researchers from 14 institutions, including MIT, Carnegie Mellon, and the University of Toronto, assembled an 8 TB dataset using only public domain and openly licensed material and used it to train a large language model (LLM). The resulting model performed similarly to Meta’s Llama 2-7B, a model developed two years prior.
The team faced significant challenges in gathering and processing the data: much of it was not machine-readable and had to be annotated manually. The researchers nevertheless overcame these hurdles, demonstrating that an ethically sourced dataset can produce an LLM comparable to an older industry model.
This study provides a counterpoint to industry claims that capable models cannot be built without copyrighted material. While it may not change the trajectory of AI companies, it does challenge their arguments and could inform regulatory discussions around copyright law and AI development.
Source: https://www.engadget.com/ai/it-turns-out-you-can-train-ai-models-without-copyrighted-material-174016619.html