Anubis is awesome! Stopping (AI)crawlbots

zoey@lemmy.librebun.com · edit-2 8 months ago

Anubis is awesome! Stopping (AI)crawlbots

lambalicious@lemmy.sdf.org · 8 months ago

Positives: nice uwu art.

Negatives: requires javascript, intrinsically ableist.

phase@lemmy.8th.world · 8 months ago

There’s another challenge available, without javascript.

ohshit604@sh.itjust.works · 8 months ago

How is the art a positive?

e0qdk@reddthat.com · 8 months ago

I don’t like Anubis because it requires me to enable JS – making me less secure. reddthat started using go-away recently as an alternative that doesn’t require JS when we were getting hammered by scrapers.

BakedCatboy@lemmy.ml · 8 months ago

Fwiw Anubis is adding a nojs meta refresh challenge that if it doesn’t have issues will soon be the new default challenge

dan@upvote.au · 8 months ago

Won’t the bots just switch to using that instead of the heavier JS challenge?

Sekoia@lemmy.blahaj.zone · 8 months ago

They can, but it’s not trivial. The challenge uses a bunch of modern browser features that these scrapers don’t use, regarding metadata and compression and a few other things. Things that are annoying to implement and not worth the effort. Check the recent discussion on lobste.rs if you’re interested in the exact details.

baod_rate@programming.dev · 8 months ago

Check the recent discussion on lobste.rs if you’re interested in the exact details.

For those coming from the future: https://lobste.rs/s/aa7ske/anubis_now_supports_non_js_challenges

reddeadhead@awful.systems · edit-2 8 months ago

Anubis just released the no-JS challenge in a update. Page loads for me with JS disabled. https://anubis.techaro.lol/blog/release/v1.20.0/

Mora@pawb.social · edit-2 8 months ago

Besides that point: why tf do they even crawl lemmy. They could just as well create a “read only” instance with an account that subscribes to all communities … and the other instances would send their data. Oh, right, AI has to be as unethical as possible for most companies for some reason.

ZombiFrancis@sh.itjust.works · 8 months ago

See your brain went immediately to a solution based on knowing how something works. That’s not in the AI wheelhouse.

dan@upvote.au · 8 months ago

They’re likely not intentionally crawling Lemmy. They’re probably just crawling all sites they can find.

wizardbeard@lemmy.dbzer0.com · 8 months ago

They crawl wikipedia too, and are adding significant extra load on their servers, even though Wikipedia has a regularly updated torrent to download all its content.

AmbitiousProcess (they/them)@piefed.social · 8 months ago

Because the easiest solution for them is a simple web scraper. If they don’t give a shit about ethics, then something that just crawls every page it can find is loads easier for them to set up than a custom implementation to get torrent downloads for wikipedia, making lemmy/mastodon/pixelfed instances for the fediverse, using rss feeds and checking if they have full or only partial articles, implementing proper checks to prevent double (or more) downloading of the same content, etc.

Possibly linux@lemmy.zip · 8 months ago

It doesn’t stop bots

All it does is make clients do as much or more work than the server which makes it less temping to hammer the web.

rtxn@lemmy.world · 8 months ago

But don’t you know that Anubis is MALWARE?

…according to some of the clowns at the FSF, which is definitely one of the opinions to have. https://www.fsf.org/blogs/sysadmin/our-small-team-vs-millions-of-bots

chihuamaranian@lemmy.ca · 8 months ago

The FSF explanation of why they dislike Anubis could just as easily apply to the process of decrypting TLS/HTTPS. You know, something uncontroversial that every computer is expected to do when they want to communicate securely.

I don’t fundamentally see the difference between “The computer does math to ensure end-to-end privacy” and “The computer does math to mitigate DDoS attempts on the server”. Either way, without such protections the client/server relationship is lacking crucial fundamentals that many interactions depend on.

dan@upvote.au · 8 months ago

tbh I kinda understand their viewpoint. Not saying I agree with it.

The Anubis JavaScript program’s calculations are the same kind of calculations done by crypto-currency mining programs. A program which does calculations that a user does not want done is a form of malware.

Natanox@discuss.tchncs.de · edit-2 8 months ago

That’s guilt by association. Their viewpoint is awful.

I also wished there was no security at the gate of concerts, but I happily accept it if that means actual security (if done reasonably of course). And quite frankly, cute anime girl doing some math is so, so much better than those god damn freaking captchas. Or the service literally dying due to AI DDoS.

Edit: Forgot to mention, proof of work wasn’t invented by or for crypto currency or blockchain. The concept exists since the 90’s (as an idea for Email Spam prevention), making their argument completely nonsensical.

Arghblarg@lemmy.ca · 8 months ago

Ah, hashcash. Wish that had taken off, it was a good idea …

xavier666@lemmy.umucat.day · 8 months ago

TIL of hashcash

xavier666@lemmy.umucat.day · 8 months ago

And quite frankly, cute anime girl doing some math is so, so much better than those god damn freaking captchas

One user complained that a random anime girl popping up is making his gf think he’s watching hentai. So the mascot should be changed to something “normal”.

Natanox@discuss.tchncs.de · 8 months ago

Lol.

“My relationship is fragile and it’s the internets fault.”

dan@upvote.au · 8 months ago

The Anubis site thinks my phone is a bot :/

tbh I would have just configured a reasonable rate limit in Nginx and left it at that.

Won’t the bots just hammer the API instead now?

Flipper@feddit.org · 8 months ago

No. The rate limit doesn’t work as they use huge IP Spaces to crawl. Each IP alone is not bad they just use several thousand of them.

Using the API would assume some basic changes. We don’t do that here. If they wanted that, they could run their own instance and would even get notified about changes. No crawling required at all.

paraphrand@lemmy.world · 8 months ago

I’ve seen some people reject this solution due to the anime.

fireshell@kbin.earth · 8 months ago

The development of Anubis remains a matter of enthusiasm: Zee is funding the project through Patreon and sponsorship on GitHub, but cannot yet afford to pursue it on a full-time basis. He would also like to hire a key community member, budget permitting.

ikidd@lemmy.world · 8 months ago

Something that’s less annoying than Anubis is fail2ban tarpitting the scrapers by putting in a hidden honeypot page link that they follow, and adding the followers to fail2ban.

https://petermolnar.net/article/anti-ai-nepenthes-fail2ban/

N0x0n@lemmy.ml · edit-2 8 months ago

Wow, what a combo ! I guess this would reduce the tarpit’s overall power consumption?

I haven’t looked at your link yet and maybe it already contains my answer, but I wish to customize for how long they are traped into the tarpit before fail2ban kicks in so I can still poison their AI while saving alot of ressources !!

Edit:

block anything that visits it more, than X times with fail2ban

I guess this is it, but I’m not sure how that translates from nepenthese to fail2ban. Need further reading and testing !

Thanks for the link !

blob42@lemmy.ml · edit-2 8 months ago

I am planning to try it out, but for caddy users I came up with a solution that works after being bombarded by AI crawlers for weeks.

It is a custom caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

Now here’s the fun part, the defender plugin can produce garbage as response so when a matching AI crawler fits it will poison their training dataset.

Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL


    abort @bot
    

    defender garbage {

        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
      
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                 # to use with defender
                 #header X-RateLimit-Apply true
                 #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }

}

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

ZeroOne@lemmy.world · 8 months ago

Go_Away is another alternative

Anubis is awesome! Stopping (AI)crawlbots

Anubis is awesome! Stopping (AI)crawlbots

Incoherent rant.

Behold, Anubis.

“Weighs the soul of incoming HTTP requests to stop AI crawlers”