The Internet is at risk without proper control over web scraping: Cloudflare top exec
We’re intent on democratising content monetisation through the AI Audit tool, which creates an automated marketplace for content licensing
by Poulomi Chatterjee · The Hindu

Cloud and cybersecurity company Cloudflare is waging a war against AI bot crawlers. Web scraping, which accelerated as AI grew more powerful, has crossed legal lines, with AI firms casually dismissing common online etiquette. Some tech companies have ignored ‘robots.txt’ files, the long-standing convention that tells crawlers whether data can or cannot be scraped from a website, though compliance has always been voluntary. Cloudflare is looking to fix such unruly behaviour with a permanent solution. It has launched a bot management tool that monitors and blocks web scrapers, and it has announced a marketplace where websites can negotiate payment terms with AI companies for the use of their data.
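For context, robots.txt is purely advisory: well-behaved crawlers consult it before fetching pages, but nothing technically stops a bot from ignoring it. The minimal sketch below, using Python’s standard urllib.robotparser with an illustrative policy that names OpenAI’s GPTBot crawler token, shows how a compliant crawler would interpret such a file.

```python
# Minimal sketch of how a robots.txt policy is read by a *compliant* crawler.
# The policy below is illustrative only; nothing enforces it server-side.
from urllib import robotparser

SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A well-behaved crawler checks the policy before fetching a page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False: asked to stay out
print(parser.can_fetch("SomeBrowser", "https://example.com/article"))  # True: everyone else allowed
```

The Disallow rule only matters if the crawler chooses to honour it, which is exactly the gap that bot management tools aim to close.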
This year, the company expanded around core offerings such as its Web Application Firewall and Cloudflare Workers, a serverless platform that lets developers run code on Cloudflare’s edge network, and introduced new AI-oriented services. In an interaction with The Hindu, John Engates, Field CTO at Cloudflare, explained how AI scraping can be made fair, the shift from training AI models to inference, and the need for intelligent automation.
THB: How did AI turn web scraping into a battleground? What is Cloudflare’s goal in this area?
John Engates: Web scraping has become contentious because AI systems require massive amounts of training data from public web content, which puts significant resource strain on websites and raises legal concerns around copyright infringement and privacy. Cloudflare has been closely tracking these challenges and raising awareness about AI-driven web scraping trends. We’ve always helped customers block the bad bots, but the rise of AI large language models (LLMs) has created a murkier category of bots that, while not malicious, can burden sites without providing value in return. At Cloudflare, we believe that without proper control over web scraping, the open Internet is at risk. Copyright infringement and privacy violations can discourage site owners from keeping their content public, so more content ends up behind paywalls, limiting availability for smaller AI model providers. High-frequency scraping also burdens website servers, increasing operational costs and reducing performance. To address this, we’ve developed AI Audit tools that enable customers to detect and block unauthorised scraping while providing detailed analytics for monitoring bot activity. Our goal is to protect the open Internet by giving site owners control over their content and allowing them to realise value from it.
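To make the analytics side concrete, here is a generic sketch, not Cloudflare’s actual AI Audit implementation, of how a site might tally requests from known AI crawler user agents in a standard access log; the crawler tokens and the log format in the sample are assumptions chosen for illustration.

```python
# Generic illustration (not Cloudflare's AI Audit): count access-log requests
# whose user agent matches a known AI crawler token. Tokens and log format
# are assumptions for the example.
import re
from collections import Counter

AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

# In the common/combined log format, the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def tally_ai_crawlers(log_lines):
    """Count requests per AI crawler token seen in access-log user agents."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts

sample = [
    '203.0.113.7 - - [27/Nov/2024:10:00:00 +0530] "GET /article HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [27/Nov/2024:10:00:01 +0530] "GET / HTTP/1.1" 200 812 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(tally_ai_crawlers(sample))  # Counter({'GPTBot': 1})
```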
THB: Cloudflare has announced a marketplace where AI companies and websites can negotiate deals. How open do you think AI companies will be to paying smaller websites for web scraping?
JE: Cloudflare’s approach goes beyond just blocking or allowing AI bot access: it gives content creators the power to control and monetise their content. We’re intent on democratising content monetisation through the AI Audit tool, which creates an automated marketplace for content licensing. Traditionally, only major publishers could negotiate such deals, but our system allows sites of any size to set fair prices and receive compensation when AI companies access their content. The process is designed to be seamless and scalable, eliminating the need for individual negotiations and making it much easier for smaller content creators to benefit. We’re also providing advanced analytics to help content creators understand the impact of AI bots on their traffic and revenue.
THB: Prominent AI firms like OpenAI have already signed licensing deals with major publishers to avoid legal trouble. But this means smaller outlets will be ignored while bigger publishers get paid. How can this be avoided?
JE: When only large publishers can negotiate deals, it creates a two-tier Internet where smaller creators get left behind. Our approach levels the playing field by giving every site owner, from individual bloggers to major media outlets, the same tools to analyse AI bot activity, control access, and set fair compensation rates. With millions of Internet properties on our network representing roughly 20% of web traffic, Cloudflare’s scale ensures AI companies can’t afford to ignore smaller content creators. Our automated system eliminates the need for individual negotiations by standardising these tools for every site on our network.
THB: Cloudflare also has a serverless GPU inference platform, Workers AI. What is the growth in this segment, and how much is inference-related compute expected to increase?
JE: If AI training is the foundation, AI inference is the implementation. AI compute has so far been dedicated primarily to training LLMs, but we’re now seeing companies prepare to shift to the implementation phase. As many more companies adopt and integrate AI into their own tools and systems, the demand for GPU compute for inference tasks will eventually surpass the demand for training capacity. Workers AI is gaining traction due to two key advantages: our global network of inference capacity across 180 cities, and significantly higher GPU utilisation rates compared to traditional clouds. The ability to run inference close to where data is produced and consumed is becoming crucial. Our platform delivers performant inference while keeping data secure, which makes it particularly attractive as companies scale their AI implementations. While we’re still early in the adoption curve, we’re seeing strong enthusiasm for our cost-effective, globally distributed approach to AI.
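For developers curious about what edge inference looks like in practice, below is a minimal sketch of calling a Workers AI model over Cloudflare’s REST API. The endpoint shape and model slug reflect Cloudflare’s public documentation but should be treated as assumptions to verify, and the credentials are placeholders.

```python
# Minimal sketch of a Workers AI inference call over Cloudflare's REST API.
# Endpoint shape, model slug and response fields are assumptions to verify
# against current documentation; credentials below are placeholders.
import os
import requests

ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]   # placeholder: your Cloudflare account ID
API_TOKEN = os.environ["CF_API_TOKEN"]     # placeholder: a token with Workers AI access
MODEL = "@cf/meta/llama-3-8b-instruct"     # assumed model slug; any hosted model works

url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"messages": [{"role": "user", "content": "Summarise what edge inference means."}]},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the generated text sits inside the "result" key of the JSON envelope
```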
THB: How beneficial are AI and automation to cybersecurity in the context of a CrowdStrike-like incident?
JE: AI is both a challenge and a solution when it comes to cybersecurity. While AI holds tremendous potential for organizations, it also allows threat actors to rapidly increase their effectiveness and renders one-size-fits-all security offerings obsolete. Having more quality data will make AI more effective at stopping new and evolving types of attacks.
As major tech companies increasingly generate code through AI, we need intelligent automation that can match this pace. The future isn’t about choosing between automation and no automation; it’s about making automation smarter through AI so it can handle the complexity of modern development and threats effectively.
Published - November 27, 2024 03:00 pm IST