A tiny mouse, a hacker.

See here for an introduction, and my link tree for socials.

  • 0 Posts
  • 8 Comments
Joined 2 years ago
Cake day: December 24th, 2023

  • I’m using a setup similar to what you had in mind: I have a small €4/month VPS as my front, with scrapers taken care of by iocaine (it both blocks them and automatically firewalls off the worst offenders). That keeps over 90% of the HTTP(S) traffic from ever making it past the VPS, greatly reducing the traffic into my home network. My actual servers are behind a WireGuard tunnel.

    It does not protect against a non-HTTP DDoS, but that wasn’t part of my threat model to begin with. My VPS provider (Hetzner) has DDoS protection even for €4/month servers. It doesn’t cover the scraper DDoS, but it does cover other kinds. Luckily, I have not been a victim of any, so I have no idea whether it works reliably.

    Against the scrapers, a VPS + bot defense + WireGuard works like a charm. Can recommend.




  • I need to join more communities, because I’m noticing these anti-scraper questions way too late.

    I’d like to direct your attention to iocaine. It’s somewhat similar to Anubis in the sense that it sits between your reverse proxy and the real content, but unlike Anubis, it does not use proof of work. It exploits the fact that most of the scrapers are incredibly dumb, and can be trivially detected:

    • Is it in ai.robots.txt’s list? It’s a crawler.
    • Does it have Firefox/ or Chrome/ in the user agent, but sends no sec-fetch-mode header? Pretty much guaranteed to be a crawler, with few exceptions (e.g., Googlebot, Bingbot - but I’d classify those as hostile crawlers too).
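
    To make those two checks concrete, here is a minimal sketch in Python. It is not iocaine's actual implementation: the function name `looks_like_crawler` and the three-entry `AI_ROBOTS_AGENTS` sample are assumptions for illustration, and a real setup would load the full ai.robots.txt list instead.

```python
# Hypothetical sketch of the two heuristics above; not iocaine's real code.
# Assumed sample of user-agent names; in practice, load the full ai.robots.txt list.
AI_ROBOTS_AGENTS = ("GPTBot", "ClaudeBot", "CCBot")


def looks_like_crawler(headers: dict[str, str]) -> bool:
    """Return True if the request headers match either heuristic."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    user_agent = h.get("user-agent", "")

    # Heuristic 1: the user agent names a crawler from the list.
    if any(agent.lower() in user_agent.lower() for agent in AI_ROBOTS_AGENTS):
        return True

    # Heuristic 2: claims to be Firefox or Chrome, but sent no sec-fetch-mode.
    claims_browser = "Firefox/" in user_agent or "Chrome/" in user_agent
    return claims_browser and "sec-fetch-mode" not in h
```

    Note that this sketch also flags Googlebot and Bingbot when they present a Chrome/ user agent without the header, which matches treating them as hostile crawlers.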

    Serve garbage or a static page with poisoned URLs to these crawlers, and you get rid of 90%+ of the bots. Why the poisoned URLs? Because when they come back riding headless Chromes, they usually crawl URLs the dumb bots collected. If you poison those URLs in a way that you can identify them trivially, you can block the headless Chromes too, which you wouldn’t be able to detect otherwise. Whether they come through residential proxies or not, as long as their queue is collected by the dumb bots, you can catch them.

    On top of this, to reduce the load on your servers, iocaine can also block requests. It can be configured to serve garbage & poisoned URLs to the dumb bots, and then firewall anything that hits a poisoned URL.
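
    The poisoned URL trick can be sketched roughly like this. Everything here is an assumption for illustration rather than how iocaine does it: the `/maze/` prefix, the `SECRET` key, the HMAC tagging scheme, and the in-memory `blocked_ips` set that stands in for a real firewall.

```python
import hashlib
import hmac
import secrets

SECRET = b"change-me"   # hypothetical signing key
PREFIX = "/maze/"       # hypothetical path prefix for poisoned URLs
blocked_ips: set[str] = set()  # stand-in for real firewall rules


def _tag(nonce: str) -> str:
    # Short keyed tag, so only this server can mint valid poisoned URLs.
    return hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:16]


def make_poisoned_url() -> str:
    """Mint a URL to embed in the garbage served to dumb bots."""
    nonce = secrets.token_hex(8)
    return f"{PREFIX}{nonce}-{_tag(nonce)}"


def is_poisoned(path: str) -> bool:
    """Recognize a URL that could only have come from our garbage pages."""
    if not path.startswith(PREFIX):
        return False
    nonce, sep, tag = path[len(PREFIX):].partition("-")
    return bool(sep) and hmac.compare_digest(tag, _tag(nonce))


def handle(path: str, ip: str) -> str:
    """Drop blocked clients; block anyone who requests a poisoned URL."""
    if ip in blocked_ips:
        return "drop"
    if is_poisoned(path):
        blocked_ips.add(ip)
        return "drop"
    return "serve"
```

    Signing the URL is one possible way to make it trivially identifiable. A plain prefix that no human visitor is ever shown would work on the same principle.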

    The false positive rate is surprisingly low.


  • That doesn’t mean it can’t happen, but also, what actual harm does it do to have a dozen scrapers hitting your site every second? (this is an exaggeration it’s likely not going to be that bad) How big is your smolweb page and images?

    If I were hit by a few dozen scrapers, I wouldn’t care. But I host a few dozen small sites (which all opted out of search engine indexing too), and even today, when I firewall off the worst offenders, I’m still averaging 20-25 requests/second over a day. Prior to firewalling those off, I had an average of ~300 requests/sec sustained over months, with weekend waves going up to 1400 requests/second. It would’ve gone higher, but at that point, my €4/month VPS was unable to handle the TLS handshakes. At 1400 req/sec, just doing the handshakes exhausted what little CPU I had, and I didn’t even serve anything. (At one point, before I implemented automatic firewalling, I scaled the server up, and saw 20k req/sec - stupidly high, given that I host nothing particularly lucrative.)
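
    To put those sustained rates in perspective, here is a quick back-of-the-envelope conversion to requests per day. The helper name `requests_per_day` is made up for this sketch; the rates are the ones quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(rate_per_second: int) -> int:
    """Convert a sustained requests/second rate into requests per day."""
    return rate_per_second * SECONDS_PER_DAY


# Sustained rates mentioned above: 20-25 req/s today, ~300 req/s before firewalling.
for rate in (20, 25, 300):
    print(f"{rate} req/s is about {requests_per_day(rate):,} requests/day")
```

    That works out to roughly 1.7 to 2.2 million requests a day now, versus about 26 million a day before the firewalling.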

    But smolweb? Honestly, I hate to break it to you but nobody cares that much, not even LLMs.

    I’m sorry, they do.

    Anubis potentially makes sense on social media sites like Lemmy that are hosting large numbers of users and user-generated content.

    I don’t think it does. You know what can trivially get through Anubis? A real browser. You know what AI companies have in abundance? ~Infinite money to burn. If they want to get through Anubis, they will. Codeberg saw that happen. Proof of Work doesn’t scale well against the crawlers.