The AI data gold rush is over. Here's what comes next
A version of this article originally appeared in Quartz’s AI & Tech newsletter.
When OpenAI scraped the internet to train the first version of ChatGPT, the web was a digital buffet laid out for the taking. Years of blog posts, massive book repositories, endless Reddit discussions — all sitting there, unguarded and free. The amount of data needed was staggering, but it was just waiting to be harvested.
That era has ended.
Publishers are suing over past use and demanding payment for future content. Cloudflare has turned into the internet's bouncer, blocking AI crawlers by default unless they pay up. The high-quality data needed to train the next generation of AI models isn't just sitting around anymore — and AI companies are discovering they may have poisoned their own well.
The numbers tell the story of a dramatic shift. According to Cloudflare's analysis, Anthropic's Claude made almost 71,000 page requests for every single referral it sent back to publishers. OpenAI's ratio was 1,600 requests per referral, while Perplexity clocked in at more than 200 to 1. Compare that to traditional search engines such as Google, which maintained a roughly 9-to-1 ratio. For all the grumbling publishers have done about that relationship over the years, it now resembles a fair partnership.
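To put those ratios on a common scale, the short Python sketch below divides each crawler's requests-per-referral figure by Google's roughly 9-to-1 baseline. It uses the rounded figures quoted above ("almost 71,000," "more than 200"), so the results are approximations. The names `CRAWL_TO_REFERRAL` and `multiple_of_google` are illustrative and do not come from Cloudflare.

```python
# Crawl-to-referral ratios as quoted in the article (rounded figures).
CRAWL_TO_REFERRAL = {
    "Anthropic": 71_000,
    "OpenAI": 1_600,
    "Perplexity": 200,
    "Google": 9,
}


def multiple_of_google(ratios: dict[str, float]) -> dict[str, float]:
    """Express each ratio as a multiple of Google's baseline ratio."""
    baseline = ratios["Google"]
    return {name: ratio / baseline for name, ratio in ratios.items()}


if __name__ == "__main__":
    for name, multiple in multiple_of_google(CRAWL_TO_REFERRAL).items():
        print(f"{name}: about {multiple:,.0f}x Google's ratio")
```

On these rounded numbers, Anthropic's ratio works out to several thousand times Google's, OpenAI's to well over a hundred times, and Perplexity's to roughly twenty times.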
Since Google launched AI Overviews in May 2024, the proportion of news searches that result in zero clicks to publisher websites has jumped from 56% to almost 69%. Organic traffic to news sites plummeted from more than 2.3 billion visits at its peak to fewer than 1.7 billion by May 2025.
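The size of those two shifts is easier to compare as percentages. The sketch below is a back-of-the-envelope calculation using the rounded figures quoted above; the helper names `point_change` and `relative_change` are illustrative.

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


if __name__ == "__main__":
    # Share of news searches ending in zero clicks: 56% -> almost 69%.
    print(f"Zero-click share: +{point_change(56, 69):.0f} points, "
          f"{relative_change(56, 69):+.1%} relative")

    # Organic visits to news sites: more than 2.3 billion -> under 1.7 billion.
    print(f"Organic visits: {relative_change(2.3e9, 1.7e9):+.1%}")
```

Because the peak was "more than" 2.3 billion and the May 2025 figure was below 1.7 billion, the traffic decline computed this way is a lower bound. The zero-click figure of "almost" 69% means the true increase is slightly smaller than the one computed here.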
The traffic collapse is devastating publishers who built their business models around advertising revenue and reader engagement. Digital media companies have responded with mass layoffs, and some analysts suggest shutdowns are imminent.
This creates a destructive feedback loop: As publishers produce less content or lock it away, AI companies lose access to the fresh, high-quality information they need to keep their models current and accurate. The web risks becoming increasingly stale and synthetic, populated more by AI-generated content than human expertise and reporting.
Publishers are fighting back in court with a wave of lawsuits that could force major payouts. Anthropic recently agreed to pay at least $1.5 billion to settle a class-action lawsuit from authors, working out to approximately $3,000 per book.
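As a sanity check on those settlement figures, the sketch below divides the total by the per-book amount to see how many books the two numbers imply. Both inputs are the approximate values quoted above ("at least" $1.5 billion, "approximately" $3,000), so the output is an order-of-magnitude estimate and not a count taken from the settlement itself. The function name is illustrative.

```python
def implied_book_count(settlement_usd: int, per_book_usd: int) -> int:
    """Number of books implied by a total payout and a per-book amount."""
    if per_book_usd <= 0:
        raise ValueError("per_book_usd must be positive")
    return settlement_usd // per_book_usd


if __name__ == "__main__":
    books = implied_book_count(1_500_000_000, 3_000)
    print(f"Implied number of books: about {books:,}")
```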
Reddit, which signed licensing agreements worth $203 million in early 2024, is already pushing for more lucrative deals. The company is in talks with Google and OpenAI for agreements that would include dynamic pricing, allowing Reddit to charge more as its data becomes increasingly vital to AI answers.
Even companies that have historically championed free information are joining the tollbooth economy. WalletHub recently pulled 40,000 pages of financial content from public access, making it available only to logged-in users. CEO Odysseas Papadimitriou compared the situation to dealing with the mafia: "Either they shut down the road your restaurant is on and no customers can reach you, or they keep the road clear but open a restaurant next door to yours and make you serve their customers for free."
Cloudflare, which handles about 20% of all internet traffic, responded to the flood of bots scraping the web for AI training by flipping the internet's default setting. Instead of allowing AI crawlers free access unless specifically blocked, the company now blocks them by default unless they pay for content licenses.
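The difference between the old default and the new one can be sketched in a few lines of Python. This is a simplified illustration of a default-deny policy, not Cloudflare's actual implementation: the `CrawlerPolicy` class, its fields, and the hard-coded list of user-agent markers are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Illustrative user-agent substrings for AI crawlers. A real deployment would
# rely on a maintained, verified bot list rather than this hard-coded sample.
AI_CRAWLER_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


@dataclass
class CrawlerPolicy:
    """Hypothetical site policy: AI crawlers are denied unless licensed."""

    licensed: set[str] = field(default_factory=set)
    block_ai_by_default: bool = True

    def is_ai_crawler(self, user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(marker.lower() in ua for marker in AI_CRAWLER_MARKERS)

    def allows(self, user_agent: str) -> bool:
        # Ordinary browsers and non-AI bots are unaffected by the policy.
        if not self.is_ai_crawler(user_agent):
            return True
        # A crawler covered by a license is let through.
        ua = user_agent.lower()
        if any(name.lower() in ua for name in self.licensed):
            return True
        # Otherwise the site-wide default decides.
        return not self.block_ai_by_default


if __name__ == "__main__":
    old_default = CrawlerPolicy(block_ai_by_default=False)
    new_default = CrawlerPolicy()
    with_license = CrawlerPolicy(licensed={"GPTBot"})

    agent = "Mozilla/5.0 (compatible; GPTBot/1.0)"
    print(old_default.allows(agent))   # allowed unless specifically blocked
    print(new_default.allows(agent))   # blocked unless licensed
    print(with_license.allows(agent))  # licensed, so allowed
```

The sketch matches on the user-agent string only, which is self-reported by the crawler, so a production system would need stronger verification than this.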
"The deal that Google made to take content in exchange for sending you traffic just doesn't make sense anymore," Cloudflare CEO Matthew Prince wrote in a blog post.
The industry is trying to create order from chaos. A consortium of publishers and tech companies has introduced the Really Simple Licensing (RSL) standard, which would require AI crawlers to present valid license tokens before accessing content. The system is backed by major players including O'Reilly Media, Reddit, Yahoo, and Medium.
But technical solutions face practical limits. As Pete Pachal, a journalist and media critic, noted, there are "myriad" ways for AI companies to access blocked content through relays, third-party systems, and different types of bots. Even sophisticated blocking measures often amount to "whack-a-mole" games.
Google, caught in its own contradiction, recently admitted in a court filing that "the open web is already in rapid decline" — a stark contrast to executives' public statements that "the web is thriving."
The admission reflects a new reality for AI companies: The era of free lunch is over. The next generation of AI models will require not just better algorithms, but bigger checkbooks. The question isn't whether AI companies will have to pay for data. They will. The question is how much they'll be willing to spend, and whether the internet's creators will finally get their fair share of the AI gold rush.