Cloudflare has identified and cracked down on Perplexity, an AI-powered answer engine, for its stealth crawling behavior. Perplexity, which initially crawls using a declared user agent, changes its user agent and source ASNs to evade website preferences when faced with network blocks.
To test this, Cloudflare created new domains and added robots.txt files that block automated access. Despite these precautions, Perplexity continued to provide detailed information about the content hosted on the restricted domains. Its crawler used multiple IPs not listed in Perplexity’s official IP range, rotated through them, and switched between different ASNs in an attempt to evade website blocks.
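As a rough illustration of what a blocking robots.txt means for a compliant crawler, the sketch below uses Python's standard `urllib.robotparser` to check whether a fetch is permitted. The robots.txt contents, the `ExampleBot` user agent, the `example.com` URL, and the `is_allowed` helper are all hypothetical and are not taken from Cloudflare's experiment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a test domain: every path is off limits
# to every crawler. (Assumed content, not Cloudflare's actual file.)
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a placeholder crawler name used only for illustration.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page.html"))
```

A crawler that honors robots.txt would run a check like this before every request and skip any URL for which it returns false. Note that the check depends on the crawler identifying itself honestly, which is why changing the user agent undermines it.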
Cloudflare’s bot management system identified Perplexity’s undeclared crawling activity as bot traffic and blocked it, because the behavior is incompatible with the web crawling norms outlined in RFC 9309. The company has also added signature matches for the stealth crawler to its managed rule that blocks AI crawling activity, which is available to all customers.
This incident highlights the need for clear guidelines and principles for well-intentioned bot operators. Cloudflare’s Verified Bots Policy outlines best practices for crawlers, including transparency, well-behaved netizenship, serving a clear purpose, separating bots for separate activities, and following rules. OpenAI is an example of a leading AI company that follows these best practices.
Customers who have existing block rules in place are already protected, while those who don’t want to block traffic can set up rules to challenge requests, giving real humans an opportunity to proceed. Cloudflare’s Content Independence Day initiative has seen over two and a half million websites choose to disallow AI training through its managed robots.txt feature or managed rule blocking AI crawlers.
As the bot landscape continues to evolve, Cloudflare is actively working with technical and policy experts to establish clear principles that well-meaning bot operators should abide by. The company expects the techniques bot operators use to evade detection to keep changing over time, and says its own detection methods will keep evolving as well.
Source: https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives