Smolweb and scraper protection

solbear@slrpnk.net · 1 month ago


x1gma@lemmy.world · 1 month ago

You’ve already answered your own question. All of those measures stop, at most, the public and “nice” scrapers. Anything more is close to impossible: if you post something on a public page, it’s public. Public content will be scraped for good and for evil, for LLM and non-LLM use.

You cannot prevent your public data from being scraped into LLM training data. It’s simply not possible. The moment your page gets picked up somewhere, whether by DNS, domain, or TLS registry scanners or anything else, it will get scraped. At some point your defenses will fail and respond to a bot posing as a regular user, and your page will get fed to the money-printing machine.

The goal of Anubis and similar tools is to make LLM scraping more expensive, and at least stop LLM scrapers from freeloading on content entirely. Blocking scrapers is either purely trust-based (user agents and similar identification, which can be faked easily) or heuristic/behaviour-based, which can never achieve 100% correct detection (i.e. you will always have some malicious requests getting through and some legitimate users getting blocked).
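To make the two ideas concrete, here is a rough Python sketch. It is not how Anubis or any real tool is implemented; the blocklist tokens, the user-agent strings, the challenge string, and the function names are all made up for illustration. The first part shows why a user-agent check only stops bots that identify themselves honestly, and the second shows the general "make each request cost some CPU" idea as a generic hash-based proof of work.

```python
import hashlib
from itertools import count

# Hypothetical blocklist; these tokens are illustrative, not a real ruleset.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot", "python-requests")


def is_blocked(user_agent: str) -> bool:
    """Trust-based check: it believes whatever the client claims to be."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def solve_challenge(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the first nonce whose SHA-256 hex digest
    starts with `difficulty` zeros. Cost grows roughly 16x per extra zero."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify_solution(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so checking is cheap while solving is not."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# A scraper that announces itself is caught; one that lies is not.
honest_bot = "Mozilla/5.0 (compatible; GPTBot/1.0)"
spoofed_bot = "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"
print(is_blocked(honest_bot))   # True
print(is_blocked(spoofed_bot))  # False

# Anyone willing to pay the CPU cost still gets in, bot or human.
nonce = solve_challenge("example-challenge", 3)
print(verify_solution("example-challenge", nonce, 3))  # True
```

Neither part actually identifies a scraper. The first trusts a header the client controls, and the second only raises the price per page, which a well-funded scraper can pay and a visitor on a slow phone will feel.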

The more restrictions you apply, the more your regular users will be impacted, and that’s the trade-off you need to make.