Amazon Investigating Claim of Scraping Abuse

Tech Read Team

Amazon’s cloud division has opened an investigation into Perplexity AI, a startup specializing in AI search. At issue is whether Perplexity is violating Amazon Web Services rules by scraping websites that have attempted to block such scraping, according to an exclusive report by WIRED.

An AWS spokesperson, speaking on the condition of anonymity to WIRED, confirmed that the company is looking into Perplexity. It was previously reported that the startup, which has received backing from the Jeff Bezos family fund and Nvidia and was valued at $3 billion, appears to be using content from websites that explicitly prohibited access through the Robots Exclusion Protocol, a widely recognized web standard. While the Robots Exclusion Protocol itself is not legally binding, respecting website terms of service is typically expected.

The Robots Exclusion Protocol, a long-standing web standard, involves the placement of a plaintext file on a domain (e.g., wired.com/robots.txt) to specify which pages should not be accessed by automated bots and crawlers. While companies using scrapers can choose to overlook this protocol, it has generally been respected. According to the AWS spokesperson, AWS customers are required to adhere to the robots.txt standard when crawling websites.
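
To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt rules before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot names (`ExampleBot`, `OtherBot`), and the URLs are hypothetical and chosen only for illustration; they are not taken from any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; bot names and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot is blocked from the whole site by its own rule group.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/story"))
    # Other bots fall under the wildcard group, which blocks only /private/.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/story"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))
```

Nothing in the protocol enforces the answer: the check is voluntary, and a crawler that skips it or ignores the result can still request the page.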

“AWS’s terms of service prohibit customers from engaging in any illegal activities, and customers are responsible for complying with our terms and all applicable laws,” the spokesperson explained in a statement.

Scrutiny of Perplexity’s practices intensified after a June 11 report from Forbes accused the startup of plagiarizing at least one of its articles. WIRED’s investigation substantiated the claim and uncovered additional instances of scraping and plagiarism by systems associated with Perplexity’s AI-powered search chatbot. Engineers at Condé Nast, the parent company of WIRED, had blocked Perplexity’s crawler across all of the company’s websites using a robots.txt file. However, WIRED found that Perplexity had access to a server using an undisclosed IP address (44.221.181.252), which had visited Condé Nast’s properties numerous times over the past three months, seemingly to scrape the websites.

The machine connected to Perplexity seems to be conducting widespread crawling of news websites that explicitly prohibit bots from accessing their content. Representatives from The Guardian, Forbes, and The New York Times also acknowledged detecting the IP address on their servers on multiple occasions.

WIRED traced the IP address to an Elastic Compute Cloud (EC2) instance hosted on AWS. The cloud provider launched its investigation after WIRED asked whether using AWS infrastructure to scrape websites that forbid it violated the company’s terms of service.
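
The article does not describe how the address was traced. One plausible approach, sketched below, relies on the fact that AWS publishes its address ranges as a JSON file (ip-ranges.json) whose `prefixes` entries carry `ip_prefix`, `region`, and `service` fields. The sample data here is made up and uses reserved documentation ranges rather than real AWS prefixes, and the helper name `matching_prefixes` is hypothetical.

```python
import ipaddress

# Made-up sample shaped like AWS's published ip-ranges.json. The prefixes are
# reserved documentation ranges, NOT real AWS ranges.
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "EC2"},
        {"ip_prefix": "198.51.100.0/24", "region": "eu-west-1", "service": "S3"},
    ]
}


def matching_prefixes(ranges: dict, ip: str, service: str = "EC2") -> list:
    """Return the prefix entries for `service` whose CIDR block contains `ip`."""
    addr = ipaddress.ip_address(ip)
    return [
        entry
        for entry in ranges.get("prefixes", [])
        if entry.get("service") == service
        and addr in ipaddress.ip_network(entry["ip_prefix"])
    ]


if __name__ == "__main__":
    # An address inside the sample EC2 block matches one entry.
    print(matching_prefixes(SAMPLE_RANGES, "203.0.113.7"))
    # An address outside every sample block matches nothing.
    print(matching_prefixes(SAMPLE_RANGES, "192.0.2.1"))
```

A match against the real published file would show only that an address belongs to AWS and which service and region it is assigned to; it would not identify the customer running the instance.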

Perplexity CEO Aravind Srinivas initially dismissed WIRED’s queries, claiming they displayed a lack of understanding of how the Internet functions. Later, in an interview with Fast Company, Srinivas said that the undisclosed IP address observed scraping Condé Nast websites and a test site was operated by a third-party company specializing in web crawling and indexing services. He declined to name the company, citing a nondisclosure agreement. Asked whether he would tell the third party to stop crawling WIRED’s website, Srinivas replied, “It’s complicated.”
