Amazon Web Services is investigating whether Perplexity is using web scraping to train its AI

1 Jul. (Portaltic/EP) –

Amazon Web Services (AWS) has announced that it has opened an investigation into Perplexity, which runs on its servers, to determine whether the company uses the 'web scraping' technique to train its Artificial Intelligence (AI) models.

Web scraping, also known as data scraping, is a process by which content is collected from web pages using software that extracts the HTML code of those sites in order to filter the information and store it; it is often compared to an automated copy-and-paste process.
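By way of illustration, the following is a minimal sketch of that pattern in Python using only the standard library. The URL and user-agent name are placeholders, not tied to any company mentioned in this article:

from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextExtractor(HTMLParser):
    # Collects visible text nodes, skipping <script> and <style> contents.
    def __init__(self):
        super().__init__()
        self.skipping = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

# Fetch the raw HTML of a page (the URL here is a placeholder).
request = Request("https://example.com/", headers={"User-Agent": "demo-bot/0.1"})
html = urlopen(request).read().decode("utf-8", errors="replace")

# Filter and store the text: the "automated copy and paste" step.
extractor = TextExtractor()
extractor.feed(html)
print("\n".join(extractor.chunks))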

Developer Robb Knight and Wired recently found that the AI search startup Perplexity has violated the so-called Robots Exclusion Protocol on certain websites and used this technique to train its AI models.

This protocol is a web standard that consists of placing a plain-text file (robots.txt) in a domain to indicate which pages automated robots and crawlers should not access, as the outlet explains.
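As a rough illustration of how a compliant crawler consults such a file, Python's standard library ships a parser for it. The agent name and URLs below are hypothetical:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt (placeholder domain).
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# can_fetch() reports whether the named agent may request a given URL.
if rp.can_fetch("demo-bot", "https://example.com/private/report.html"):
    print("allowed: robots.txt does not restrict this page for demo-bot")
else:
    print("disallowed: the site asks crawlers to stay away from this page")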

Based on these allegations, Amazon Web Services has launched an investigation to determine whether Perplexity, which uses AWS to train its AI, is breaking the rules by scraping websites that have tried to prevent it from doing so.

This was confirmed to Wired by an AWS spokesperson, who noted that the company's terms prohibit customers from using its services for any illegal activity and that customers are responsible for complying with those terms "and all applicable laws."

The startup, for its part, maintains that Perplexity "respects robots.txt" and that the services it controls "do not engage in crawling in any way that violates the AWS Terms of Service," said spokeswoman Sara Platnick.

However, the company explained that its bot ignores the robots.txt file when a user enters a URL in a query, which it described as a "rare" use case. "When a user enters a specific URL, it does not trigger crawling behavior"; rather, "the agent acts on behalf of the user to retrieve the URL. It works the same as if the user were to go to a page, copy the text of the article and then paste it into the system," Platnick said.
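To make that distinction concrete, here is an illustrative sketch of the two code paths the spokeswoman is drawing. This is an assumption about the general pattern, not Perplexity's actual implementation, and the agent name is made up:

from urllib.parse import urlsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

AGENT = "demo-bot"  # hypothetical agent name

def fetch(url, user_requested):
    """Fetch a URL; in crawling mode, honor robots.txt first."""
    if not user_requested:
        # Crawling mode: check the site's robots.txt before fetching.
        parts = urlsplit(url)
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        if not rp.can_fetch(AGENT, url):
            return None  # a well-behaved crawler stops here
    # User-entered URL: retrieve on the user's behalf, like a manual copy-paste.
    request = Request(url, headers={"User-Agent": AGENT})
    return urlopen(request).read().decode("utf-8", errors="replace")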

In this regard, Wired has stressed that the spokeswoman's description confirms the magazine's own findings: Perplexity's 'chatbot' ignores robots.txt in certain cases to collect information in an unauthorized manner.
