Cloudflare Accuses Perplexity of Skirting No-Crawl Rules

Cybersecurity firm Cloudflare is accusing Perplexity AI of employing tactics to skirt websites’ no-crawl rules.

    Writing on its blog Monday (Aug. 4), the company said it had gotten complaints from customers who had barred Perplexity scraping bots through the settings in their sites’ robots.txt files and through Web application firewalls. Nevertheless, Perplexity continued to access the sites’ content, the blog post added.

    Cloudflare said Perplexity's behavior violates decades-old internet norms. PYMNTS has contacted Perplexity for comment but has not yet received a reply.

    According to the blog post, Cloudflare researchers tested the customer complaints themselves. They found that when Perplexity's crawlers were blocked by robots.txt files or firewall rules, Perplexity then crawled the sites with a stealth bot that used an array of tactics to conceal its identity.

    “This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare,” the researchers wrote. “In addition to rotating IPs, we observed requests coming from different ASNs in attempts to further evade website blocks. This activity was observed across tens of thousands of domains and millions of requests per day.”

    In July, Cloudflare introduced a tool to prevent bot crawlers from accessing web content without consent.

    “If the Internet is going to survive the age of AI, we need to give publishers the control they deserve and build a new economic model that works for everyone — creators, consumers, tomorrow’s AI founders, and the future of the web itself,” said Matthew Prince, co-founder and CEO of Cloudflare, at the time.

    Cloudflare’s allegation was flagged in a Monday report by Ars Technica.

    In 1994, the report said, engineer Martijn Koster proposed the Robots Exclusion Protocol, a machine-readable format for telling crawlers they weren’t allowed on a given site. The protocol has been widely followed since then, and became an official standard under the Internet Engineering Task Force in 2022, the report said.
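    As a rough illustration of how the protocol works in practice, the sketch below parses a sample robots.txt file with Python's standard-library parser and checks whether a given crawler may fetch a page. The user-agent string "PerplexityBot", the example.com URL and the helper name is_allowed are assumptions made for this example and do not come from the Cloudflare post.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: bar one named crawler from the whole site, allow all others.
# "PerplexityBot" is an assumed user-agent string used only for illustration.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical URL
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", page))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

    The check runs on the crawler's side, so the protocol depends on crawlers identifying themselves honestly and choosing to obey the answer. A bot that presents a different user agent would fall under the rules for everyone else.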

    The report also noted that Perplexity has been accused of breaking norms before. Reddit CEO Steve Huffman spoke with The Verge last year about the difficulty of keeping Perplexity, Microsoft and Anthropic from accessing his site’s content for free. Reddit sued Anthropic in June, accusing that company of unlawfully using its content to train its AI models.

    Perplexity has also been accused by other media companies of plagiarizing their content. That includes the BBC, which took the company to court in June. Perplexity has dismissed the broadcaster’s claims as “manipulative and opportunistic.”