
Microsoft Uses Unlicensed Web Data for New AI Models, Challenging Fair

Global AI Watch · Editorial Team · 5 min read
Editorial Insight

Microsoft's reliance on Common Crawl highlights a wider industry trend toward normalizing the use of unlicensed data.

Key Points

  • Microsoft is the third major provider to follow similar data use policies.
  • The move raises questions about data rights and AI training.
  • It signals reliance on unlicensed data across AI firms.

What Changed

Microsoft is training its new AI models on unlicensed web data, such as Common Crawl, a controversial practice that mirrors the approach of other large language model developers. The decision challenges prior commitments on data usage and raises significant questions about Fair Use assumptions and potential regulatory consequences.

Strategic Implications

By relying on unlicensed data, Microsoft and its peers may face regulatory scrutiny over intellectual property rights. The shift could tilt power toward companies that follow similar practices, increasing pressure on data owners and web platforms. It also puts a spotlight on the need for clearer guidelines for AI training.

What Happens Next

Expect policymakers and internet stakeholders to debate the boundaries of data usage, with potential rulings by late 2026. Regulators might consider revising data protection laws or interpretations of Fair Use to balance innovators' needs with creators' rights.

Second-Order Effects

Reliance on unlicensed data could reshape data supply chains, affecting adjacent markets such as data management and protection services. The use of Common Crawl may also push web operators to strengthen their defenses against data scraping and the redistribution of scraped content.
