Anubis is awesome and I want to talk about it

SmokeyDope@piefed.social · edit-2 3 months ago

Anubis is awesome and I want to talk about it

krooklochurm@lemmy.ca · 3 months ago

“Anubis has risen, Wendell”

“Are you Jane’s addiction”?

url@feddit.fr · 3 months ago

Honestly im not a big fan of anubis . it fucks users with slow devices

https://lock.cmpxchg8b.com/anubis.html

url@feddit.fr · 3 months ago

Did i forgot to mention it doesnt work without js that i keep disabled

silmarine@discuss.tchncs.de · 3 months ago

Thanks for this! In going to set this up for myself.

A_norny_mousse@feddit.org · 3 months ago

At the time of commenting, this post is 8h old. I read all the top comments, many of them critical of Anubis.

I run a small website and don’t have problems with bots. Of course I know what a DDOS is - maybe that’s the only use case where something like Anubis would help, instead of the strictly server-side solution I deploy?

As a strictly server-side (and low computing) solution I use CrowdSec (it seems to work with caddy btw). It took a little setting up, but it does the job.

Am I missing something here? Why wouldn’t that be enough? Why do I need to heckle my visitors?

Despite all that I still had a problem with bots knocking on my ports

By the time Anubis gets to work, the knocking already happened so I don’t really understand this argument.

spamming my logs.

If spamming the logs is the only concern here, I rest my case.
Otherwise, if the system is set up to reject a certain type of requests, these are microsecond transactions of no (DDOS exception) harm.

quick_snail@feddit.nl · 3 months ago

With varnish and wazuh, I’ve never had a need for Anubis.

My first recommendation for anyone struggling with bots is to fix their cache.

kalleboo@lemmy.world · 3 months ago

Anubis was originally created to protect git web interfaces since they have a lot of heavy-to-compute URLs that aren’t feasible to cache (revision diffs, zip downloads etc).

After that I think it got adopted by a lot of people who didn’t actually need it, they just don’t like seeing AI scrapers in their logs.

quick_snail@feddit.nl · 3 months ago

Yes!

Also, another very simple solution is to authwall expensive pages that can’t be cached.

daniskarma@lemmy.dbzer0.com · edit-2 3 months ago

You are right. For most self-hosting usecases anubis is not only irrelevant, but it actually works against you. False sense of security and making your devices do extra work for nothing.

Anubis is though for public facing services that may get ddos or AI scrapped by some not targeted bot (for a target bot it’s trivial to get over Anubis in order to scrap).

And it’s never a substitute of crowdsec or fail2ban. Getting an Anubis token it’s just a matter of executing the PoW challenge. You still need a way to detect and ban malicious attacks.

Miggi@discuss.tchncs.de · 3 months ago

I also used CrowdSec for almost a year, but as AI scrapers became more aggressive, CrowdSec alone wasn’t enough. The scrapers used distributed IP ranges and spoofed user agents, making them hard to detect and costing my Forgejo instance a lot in expensive routes. I tried custom CrowdSec rules but hit its limits.

Then I discovered Anubis. It’s been an excellent complement to CrowdSec — I now run both. In my experience they work very well together, so the question isn’t “A or B?” but rather “How can I combine them, if needed?”

SmokeyDope@piefed.social · edit-2 3 months ago

If crowdsec works for you thats great but also its a corporate product whos premium sub tier starts at 900$/month not exactly a pure self hosted solution.

I’m not a hypernerd, still figuring all this out among the myriad of possible solutions with different complexity and setup times. All the self hosters in my internet circle started adopting anubis so I wanted to try it. Anubis was relatively plug and play with prebuilt packages and great install guide documentation.

Allow me to expand on the problem I was having. It wasnt just that I was getting a knock or two, its that I was getting 40 knocks every few seconds scraping every page and searching for a bunch that didnt exist that would allow exploit points in unsecured production vps systems.

On a computational level the constant network activity of bytes from webpage, zip files and images downloaded from scrapers pollutes traffic. Anubis stops this by trapping them in a landing page that transmits very little information from the server side. By traping the bot in an Anubis page which spams that 40 times on a single open connection before it gives up, it reduces overall network activity/ data transfered which is often billed as a metered thing as well as the logs.

And this isnt all or nothing. You don’t have to pester all your visitors, only those with sketchy clients. Anubis uses a weighted priority which grades how legit a browser client is. Most regular connections get through without triggering, weird connections get various grades of checks by how sketchy they are. Some checks dont require proof of work or JavaScript.

On a psychological level it gives me a bit of relief knowing that the bots are getting properly sinkholed and I’m punishing/wasting the compute of some asshole trying to find exploits my system to expand their botnet. And a bit of pride knowing I did this myself on my own hardware without having to cop out to a corporate product.

Its nice that people of different skill levels and philosophies have options to work with. One tool can often complement another too. Anubis worked for what I wanted, filtering out bots from wasting network bandwith and giving me peace of mind where before I had no protection. All while not being noticeable for most people because I have the ability to configure it to not heckle every client every 5 minutes like some sites want to do.

A_norny_mousse@feddit.org · edit-2 3 months ago

If crowdsec works for you thats great but also its a corporate product

It’s also fully FLOSS with dozens of contributors (not to speak of the community-driven blocklists). If they make money with it, great.

not exactly a pure self hosted solution.

Why? I host it, I run it. It’s even in Debian repos, but I choose their own more up-to-date ones.

All the self hosters in my internet circle started adopting anubis so I wanted to try it. Anubis was relatively plug and play with prebuilt packages

Yeah…

Allow me to expand on the problem I was having. It wasnt just that I was getting a knock or two, its that I was getting 40 knocks every few seconds scraping every page and searching for a bunch that didnt exist that would allow exploit points in unsecured production vps systems.

Again, a properly set up WAF will deal with this pronto
You should not have exploit points in unsecured production systems, full stop.

On a computational level the constant network activity of bytes from webpage, zip files and images downloaded from scrapers pollutes traffic. Anubis stops this by trapping them in a landing page that transmits very little information from the server side.

And instead you leave the computations to your clients. Which becomes a problem on slow hardware.
Again, with a properly set up WAF there’s no “traffic pollution” or “downloading of zip files”.

Anubis uses a weighted priority which grades how legit a browser client is.

And apart from the user agent and a few other responses, all of which are easily spoofed, this means “do some javascript stuff on the local client” (there’s a link to an article here somewhere that explains this well) which is much less trivial than you make it sound.

SmokeyDope@piefed.social · 3 months ago

why? I run it.

Mmm how to say this. i suppose what I’m getting at is like a philosophy of development and known behaviors of corporate products.

So, here’s what I understand about crowdsec. Its essentially like a centralized collection of continuously updated iptable rules and botscanning detectors that clients install locally.

In a way its crowd sourcing is like a centralized mesh network each client is a scanner node which phones home threat data to the corporate home which updates that.

Notice the optimal word, centralized. The company owns that central home and its their proprietary black box to do what they want with. And so you know what for profit companies like to do to their services over time? Enshittify them by

adding subscription tier price models
putting once free features behind paywalls,
change data sharing requirements as a condition for free access

restricting free api access tighter and tighter to encourage paid tiers,

making paid tiers cost more to do less.
Intentionally ruining features in one service to drive power users to use a different.

They can and do use these tactics to drive up profit or reduce overhead once a critical mass has been reached. I do not expect alturism and respect for usersfrom corporations, I expect bean counters using alturism as a vehicle to attract users in the growing phase and then flip the switch in their tos to go full penny pinching once they’re too big to fail.

At the end of the day its not the thousands of anonymous users contributing their logs or Foss voulenteers on git getting a quarterly payout. They’re the product and free compute + live action pen testing ginnea pigs, no matter what PR they spin saying how much they care about the security of the plebs using their network for free.

Its always about maximizing the money with these people your security can get fucked if they dont get some use out of you. Expect at some point the tos will change so that anonymized data sharing is no longer an option for free tier.

What happens if the company goes bankrupt? Does it just stop working when their central servers shut down? Does their open source security have the possibility of being forked and run from local servers?

It doesnt have to be like this. Peer to peer Decentralized mesh networks like YaCy already show its possible for a crowdsourced network of users can all contribute to an open database. Something that can be completely run as a local Node which federates and updates the information in global node. Something like it that updates a global iptables is already a step in the right direction. In that theoretical system there is no central monopoly its like the fediverse everyone contributes to hosting the global network as a mesh which altruistic hobbyist can contribute free compute to on their own terms.

https://github.com/yacy/yacy_search_server

I"I dont see anything wrong with people getting paid" is something I see often on discussions. Theres nothing wrong with people who do work and make contributions getting paid. What’s wrong is it isnt the open source community on github or the users contributing their precious data getting paid, its a for profit centralized monopoly that controls access to the network which the open source community built for free out of alturism.

The pattern is nearly always the same. The thing that once worked well and which you relied on gets slowly worse each ToS update, while their pricing inches just a dollar higher each quarter, and you get less and less control over how you get to use their product. Its pattern recognition.

The only solution is to cut the head off the snake. If I can’t fully host all of the components, see the source code of the mechanisms at all layers, own a local copy of the global database, then its not really mine.

Again, it’s a philosophy thing. Its very easy to look at all that, shrug, and go “whatever not my problem I’ll just switch If it becomes an issue”. But the problem festers the longer its ignored or enabled for convinence. The community needs to truly own the services they run on every level, it has to be open, and for profit bean counters can’t be part of the equation especially for hosting. There are homelab hobbyist out there who will happily eat cents on a electric bill to serve an open service to a community, get 10,000 of them on a truly open source decentralized mesh network and you can accomplish great things without fear of being the product.

poVoq@slrpnk.net · 3 months ago

AI scraping is a massive issue for specific types of websites, such has git forges, wikis and to a lesser extend Lemmy etc, that rely on complex database operations that can not be easily cached. Unless you massively overprovision your infrastructure these web-applications come to a grinding halt by constantly maxing out the available CPU power.

The vast majority of the critical commenters here seem to talk from a point of total ignorance about this, or assume operators of such web applications have time for hyperviligance to constantly monitor and manually block AI scrapers (that do their best to circumvent more basic blocks). The realistic options for such operators are right now: Anubis (or similar), Cloudflare or shutting down their servers. Of these Anubis is clearly the least bad option.

chunes@lemmy.world · 3 months ago

Sounds like maybe webapps are a bad idea then.

If they need dynamism, how about releasing a desktop application?

quick_snail@feddit.nl · 3 months ago

It’s amazing how few people here are familiar with caching

ORbituary@lemmy.dbzer0.com · 3 months ago

It’s a great service. I hate the character.

SmokeyDope@piefed.social · 3 months ago

You know the thing is that they know the character is a problem/annoyance, thats how they grease the wheel on selling subscription access to a commecial version with different branding.

https://anubis.techaro.lol/docs/admin/botstopper/

pricing from site

Commercial support and an unbranded version

If you want to use Anubis but organizational policies prevent you from using the branding that the open source project ships, we offer a commercial version of Anubis named BotStopper. BotStopper builds off of the open source core of Anubis and offers organizations more control over the branding, including but not limited to:

Custom images for different states of the challenge process (in process, success, failure)
Custom CSS and fonts
Custom titles for the challenge and error pages
“Anubis” replaced with “BotStopper” across the UI
A private bug tracker for issues

In the near future this will expand to:

A private challenge implementation that does advanced fingerprinting to check if the client is a genuine browser or not
Advanced fingerprinting via Thoth-based advanced checks

In order to sign up for BotStopper, please do one of the following:

Sign up on GitHub Sponsors at the $50 per month tier or higher
Email sales@techaro.lol with your requirements for invoicing, please note that custom invoicing will cost more than using GitHub Sponsors for understandable overhead reasons

I have to respect the play tbh its clever. Absolutely the kind of greasy shit play that Julian from the trailer park boys would do if he were an open source developer.

webghost0101@sopuli.xyz · 3 months ago

I wish more projects did stuff like this.

It just feels silly and unprofessional while being seriously useful. Exactly my flavour of software, makes the web feel less corporate.

CoyoteFacts@piefed.ca · 3 months ago

You can customize the images if you want: https://anubis.techaro.lol/docs/admin/botstopper#customizing-images

Nate Cox@programming.dev · 3 months ago

I can’t access the page to validate this because I don’t allow JS; isn’t that gated behind a paywall?

CoyoteFacts@piefed.ca · 3 months ago

It looks like it might be; I just know someone that has a site using it and they use a different mascot, so I thought it would have been trivial. I kind of wonder why it wouldn’t be possible to just docker bind mount a couple images into the right path, but I’m guessing maybe they obfuscate/archive the file they’re reading from or something?

Axolotl@feddit.it · 3 months ago

It’s actually possible, also, it’s open source so nothing stop you from making your fork with your own images and build it

ORbituary@lemmy.dbzer0.com · 3 months ago

Not sure why you’re getting down votes for just asking a question.

Nate Cox@programming.dev · 3 months ago

Lots of idol worship in the dev community, question the current darling and people get upset.

Lemminary@lemmy.world · edit-2 3 months ago

Not idol worship, rather, it’s silly to complain about JS when tools like NoScript allow you to selectively choose what runs instead of guessing what it is. It’s simply a documentation page like it says on the URL. I mean, they’re incredibly tame on the danger scale to leave your guard all the way up and instead take a jab at the entire community that had nothing to do with your personal choices.

Nate Cox@programming.dev · 3 months ago

Who jabbed at anything?

I can’t get to that page, so I asked a question about the contents.

Someone here is being silly, we just disagree about who.

Lemminary@lemmy.world · edit-2 3 months ago

It gets quite silly when you blame the entire dev community for supposedly downvoting you over ideals rather than being overly strict about them. I also prefer HTML-first and think it should be the norm, but I draw the line somewhere reasonable.

I can’t get to that page, so I asked a question

Yeah, and you can run the innocuous JS or figure out what it is from the URL. You’re tying your own hands while dishing it out to everyone else.

M.int@lemmy.zip · 3 months ago

You can just fork it and replace the image.

The authors talks about it here on their blog a bit more.

Arghblarg@lemmy.ca · edit-2 3 months ago

I have a script that watches apache or caddy logs for poison link hits and a set of bot user agents, adding IPs to an ipset blacklist, blocking with iptables. I should polish it up for others to try. My list of unique IPs is well over 10k in just a few days.

git repos seem to be real bait for these damn AI scrapers.

quick_snail@feddit.nl · 3 months ago

You just described what wazuh does ootb

JustTesting@lemmy.hogru.ch · 3 months ago

This is the way. I also have rules for hits to url, without a referer, that should never be hit without a referer, with some threshold to account for a user hitting F5. Plus a whitelist of real users (ones that got a 200 on a login endpoint).

then there’s ratelimiting and banning ip’s that hit the ratelimit regularly.

Dowloading abuse ip lists nightly and banning those, that’s around 60k abusive ip’s gone. At that point you probably need to use nftables though, for the sets, as having 60k rules would be a bad idea.

there’s lists of all datacenter ip ranges out there, so you could block as well, though that’s a pretty nuclear option, so better make sure traffic you want is whitelisted. E.g. for lemmy, you can get a list of the ips of all other instances nightly, so you don’t accidentally block them. Lemmy traffic is very spammy…

there’s so much that can be done with f2b and a bit of scripting/writing filters

iopq@lemmy.world · 3 months ago

Can’t you just bookmark the page?

pedroapero@lemmy.ml · 3 months ago

Hi, there are pre-made ipset lists also, ex: https://github.com/ktsaou/blocklist-ipsets

TerHu@lemmy.dbzer0.com · 3 months ago

yes, please be mindful when using cloudflare. with them you’re possibly inviting in a much much bigger problem

https://www.devever.net/~hl/cloudflare

quick_snail@feddit.nl · edit-2 3 months ago

Great article, but I disagree about WAFs.

Try to secure a nonprofit’s web infrastructure with as 1 IT guy and no budget for devs or security.

It would be nice if we could update servers constantly and patch unmaintained code, but sometimes you just need to front it with something that plugs those holes until you have the capacity to do updates.

But 100% the WAF should be run locally, not a MiTM from evil US corp in bed with DHS.

termaxima@slrpnk.net · 3 months ago

I am very annoyed that I have to enable cloudflare’s JavaScript on so many websites, I would much prefer if more of them used Anubis so I didn’t have third-party JavaScript running as often.

( coming from an annoying user who tries to enable the fewest things possible in NoScript )

Nate Cox@programming.dev · 3 months ago

Counterpoint: Anubis is not awesome: https://lock.cmpxchg8b.com/anubis.html

Cyberflunk@lemmy.world · 3 months ago

thank you! this needed said.

This post is a bit critical of a small well-intentioned project, so I felt obliged to email the maintainer to discuss it before posting it online. I didn’t hear back.

i used to watch the dev on mastodon, they seemed pretty radicalized on killing AI, and anyone who uses it (kidding!!) i’m not even surprised you didn’t hear back

great take on the software, and as far as i can tell, playwright still works/completes the unit of work. at scale anubis still seems to work if you have popular content, but does hasnt stopped me using claude code + virtual browsers

im not actively testing it though. im probably very wrong about a few things, but i know anubis isn’t hindering my personal scraping, it does fuck up perplexity and chatgpt bots, which is fun to see.

good luck Blue team!

Kilgore Trout@feddit.it · 3 months ago

the dev […] seemed pretty radicalized on killing Ai

As one should, to lead a similar project.

SmokeyDope@piefed.social · 3 months ago

What use cases does perplexity do that Claude doesn’t for you?

Nate Cox@programming.dev · 3 months ago

For clarity: I didn’t write the article, it’s just a good reference.

mrbn@lemmy.ca · 3 months ago

When I visit sites on my cellphone, Anubis often doesn’t let me through.

cmnybo@discuss.tchncs.de · 3 months ago

I’ve never had any issues on my phone using Fennec or Firefox. I don’t have many addons installed apart from uBlock Origin. I wouldn’t be surprised if some privacy addons cause issues with Anubis though.

mrbn@lemmy.ca · 3 months ago

Yeah, my setup is almost like yours; I’m also on Firefox with unlock and the only difference is that I’m also using Privacy Badger

quick_snail@feddit.nl · 3 months ago

getting fail2ban to read caddy logs

You should look into wazuh

Victor@lemmy.world · 3 months ago

Seems like they already have a working solution now.

quick_snail@feddit.nl · edit-2 3 months ago

sure, but they have to maintain it.

Wazuh ships with rules that are maintained by wazuh. Less code rot.

Victor@lemmy.world · 3 months ago

That’s really good, could be worth looking into in that case. 👍 Thanks for following up!

panda_abyss@lemmy.ca · 3 months ago

I like the quirky SPH character

Fizz@lemmy.nz · 3 months ago

Its a fun little project and I like the little character but it doesnt actually do anything at this point.

quick_snail@feddit.nl · 3 months ago

Kinda sucks how it makes websites inaccessible to folks who have to disable JavaScript for security.

WhyJiffie@sh.itjust.works · 3 months ago

there’s a fork that has non-js checks. I don’t remember the name but maybe that’s what should be made more known

quick_snail@feddit.nl · 3 months ago

Please share if you know.

The only way I know how to do this is running a Tor Onion Service, since the tor protocol has built-in pow support (without js)

WhyJiffie@sh.itjust.works · edit-2 3 months ago

It’s this one: https://git.gammaspectra.live/git/go-away

the project name is a bit unfortunate to show for users, maybe change that if you will use it.

some known privacy services use it too, including the invidious at nadeko.net, so you can check there how it works. It’s one of the most popular inv servers so I guess it cannot be bad, and they use multiple kinds of checks for each visitor

WhyJiffie@sh.itjust.works · 3 months ago

ps: I was wrong it’s not a fork, but a different thing doing the same and more

poVoq@slrpnk.net · 3 months ago

I kinda sucks how AI scrapers make websites inaccessible to everyone 🙄

El Barto@lemmy.world · 3 months ago

You are both right.

quick_snail@feddit.nl · 3 months ago

Not if the admin has a cache. It’s not a difficult problem for most websites

poVoq@slrpnk.net · 3 months ago

You clearly don’t know what you are talking about.

quick_snail@feddit.nl · edit-2 3 months ago

Lol I’m the sysadmin for many sites that doesn’t have these issues, so obviously I do…

It you’re the one that thinks you need this trash pow fronting for a static site, then clearly you’re the one who is ignorant

poVoq@slrpnk.net · 3 months ago

Obviously I don’t think you need Anubis for a static site. And if that is what your admin experience is limited too, than you have a strong case of dunning krueger.

quick_snail@feddit.nl · edit-2 3 months ago

99% of the pages that Anubis is fronting are static.

It’s an abuse if the tool that’s harming the internet.

Mwa@thelemmy.club · 3 months ago

and they dont respect robots.txt