Databricks Faces Lawsuit Over Copyrighted LLM Training Data

Global AI Watch · 4 min read · The Register
Databricks faces a class action lawsuit filed by a group of authors, including several bestselling writers, who allege that its language model, DBRX, was trained on copyrighted material from about 196,000 book titles. U.S. District Judge Charles Breyer denied Databricks' motion to dismiss, allowing the authors to pursue their claims that the company acquired this data unlawfully, in particular from a dataset called RedPajama, which was previously removed from Hugging Face for copyright violations.

The case raises significant questions about how training data for LLMs is sourced and could set a precedent for copyright disputes in AI development. As the proceedings continue, Databricks may need to re-evaluate how it acquires AI training data, since reliance on external datasets could expose it to further legal risk. The situation highlights the challenges AI companies face in complying with copyright law and underscores increasing scrutiny of data sovereignty in AI training practices.
