Microsoft Uses Unlicensed Web Data for New AI Models, Challenging Fair Use Assumptions

Microsoft's reliance on Common Crawl highlights a wider industry trend toward normalizing the use of unlicensed data.
Key Points
- Third major provider following similar data use policies.
- Raises questions about data rights and AI training.
- Signals reliance on unlicensed data across AI firms.
What Changed
Microsoft has adopted a controversial approach by using unlicensed web data, such as Common Crawl, to train its new AI models. The practice mirrors that of other large language model developers. The decision challenges prior commitments regarding data usage and raises significant questions about Fair Use assumptions and potential regulatory implications.
Strategic Implications
By resorting to unlicensed data, Microsoft and its peers might face regulatory scrutiny over intellectual property rights. The shift potentially reallocates power towards companies using similar practices, increasing pressure on data owners and web platforms. It also highlights the need for clearer guidelines on AI training data.
What Happens Next
Expect discussions among policymakers and internet stakeholders about the boundaries of data usage, with potential rulings by late 2026. Regulatory bodies might consider revising data protection laws or interpretations of Fair Use to balance innovators' needs with creators' rights.
Second-Order Effects
The reliance on unlicensed data could influence data supply chains, affecting adjacent markets like data management and protection services. The use of Common Crawl may also push web operators to strengthen their defenses against data scraping and redistribution.