Databricks Faces Lawsuit Over Copyrighted LLM Training Data

Key Takeaways
- Databricks is being sued over claims that its LLM was trained on about 196,000 copyrighted book titles.
- The legal landscape is shifting as a court allows the authors' lawsuit to proceed.
- The case could affect Databricks' reliance on external data sources for AI training.
Databricks is facing a class action lawsuit filed by a group of authors, including several bestselling writers, who allege that its language model, DBRX, was trained on copyrighted material from about 196,000 book titles. U.S. District Judge Charles Breyer denied Databricks' motion to dismiss, allowing the authors to pursue their claims that the company acquired this data unlawfully, in particular from a dataset called RedPajama, which was previously removed from Hugging Face for copyright violations.
The case raises significant questions about how training data for LLMs is sourced and could set a precedent for copyright disputes in AI development. As the proceedings continue, Databricks may need to re-evaluate how it acquires AI training data, since dependence on external datasets could expose it to further legal action. The situation highlights the difficulty AI companies face in complying with copyright law and underscores the increasing scrutiny of data sovereignty in AI training practices.