TL;DR

Can someone give me step-by-step instructions (ELI5) on how to access the LLMs on my rig from my phone?

Jan seems the easiest, but I’ve also tried Ollama, LibreChat, etc.

I’ve taken steps to secure my data and now I’m going the self-hosting route. I don’t care to become a savant with the technical aspects of this stuff, but even the basics are hard to grasp! I’ve been able to install an LLM provider on my rig (Ollama, LibreChat, Jan, all of them) and I can successfully get models running on them. BUT what I would LOVE to do is access the LLMs on my rig from my phone while I’m within proximity. I’ve read that I can do that via Wi-Fi or LAN or something like that, but I’ve had absolutely no luck. Jan seems the easiest because all you have to do is something with an API key, but I can’t even figure that out.
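
(For concreteness: the “something with an API key” bit usually boils down to calling an OpenAI-compatible endpoint that Jan or Ollama serves on the rig. A minimal sketch of what that looks like from another device on the same Wi-Fi; the IP address, port, API key, and model name below are all placeholders to swap for your own values, not anything from a real setup:)

```python
# Minimal sketch of talking to a local LLM server from another device on the
# same Wi-Fi. Assumptions: the server speaks the OpenAI-compatible API (Jan
# and Ollama both can), it listens on all interfaces rather than just
# localhost, and 192.168.1.50:1337, "my-local-key", and the model name are
# placeholders.
import json
import urllib.request

URL = "http://192.168.1.50:1337/v1/chat/completions"  # rig's LAN IP + server port

payload = {
    "model": "llama3.1-8b",  # whatever model the server has loaded
    "messages": [{"role": "user", "content": "Hello from my phone!"}],
}
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer my-local-key",  # the API key shown in the app's settings
    },
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```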

Any help?

  • brucethemoose@lemmy.world
    23 days ago

    At the risk of getting more technical, ik_llama.cpp has a good built-in web UI:

    https://github.com/ikawrakow/ik_llama.cpp/
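
    (To open that web UI from a phone, you just point the phone’s browser at the rig’s LAN address and the server’s port. A tiny sketch for finding that address and checking the port is reachable; the port number and the assumption that the server was launched with something like --host 0.0.0.0 are mine, not anything specific to ik_llama.cpp:)

```python
# Sketch: find this machine's LAN IP and check the inference server's port.
# Assumptions: the server listens on all interfaces and uses port 8080.
import socket

def lan_ip() -> str:
    """Pick the outbound LAN interface via a dummy UDP socket (no packets are sent)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()

PORT = 8080  # placeholder: whatever --port the server was launched with
ip = lan_ip()
try:
    socket.create_connection((ip, PORT), timeout=2).close()
    print(f"Server is up - open http://{ip}:{PORT} in the phone's browser")
except OSError:
    print(f"Nothing answering on {ip}:{PORT} - check the server's --host/--port flags")
```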

    Getting more technical: it’s also way better than Ollama. You can run way smarter models on the same hardware than Ollama can manage.

    For reference, I’m running GLM-4 (667 GB of raw weights) on a single RTX 3090/Ryzen gaming rig, at reading speed, with pretty low quantization distortion.
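
    (Back-of-the-envelope math on why that works; the bits-per-weight figures below are ballpark values for common GGUF quant formats, not exact numbers for any particular file:)

```python
# Rough memory math for running ~667 GB of raw (16-bit) weights on a
# 24 GB GPU plus system RAM. Bits-per-weight values are ballpark figures
# for common GGUF quants, not exact for any specific model file.

RAW_GB = 667
PARAMS_B = RAW_GB / 2  # 2 bytes per weight at 16-bit -> roughly 333 billion parameters

for name, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ3_XXS", 3.1), ("IQ2_XXS", 2.1)]:
    size_gb = PARAMS_B * bits / 8  # billions of params * bytes per param = GB
    print(f"{name:8s} -> ~{size_gb:.0f} GB quantized")

# Whatever doesn't fit in the 24 GB of VRAM gets offloaded to system RAM,
# which is why the low-bit quants are the ones that are practical here.
```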

    And if you want a ‘look this up on the internet for me’ assistant (which you need for them to be truly useful), you need another Docker project as well.

    …That’s just how LLM self-hosting is now. It’s simply too hardware-intensive and ad hoc to be easy, smart, and cheap all at once. You can indeed host a small ‘default’ LLM without much tinkering, but it’s going to be pretty dumb, and pretty slow on Ollama’s defaults.

    • tal@lemmy.today
      22 days ago

      Ollama does have some features that make it easier to use for a first-time user, including:

      • Automatically calculating how many layers fit in VRAM, loading that many onto the GPU, and splitting the rest between main memory/CPU and VRAM/GPU; kobold.cpp can’t do that automatically yet. (A rough sketch of that calculation follows after this list.)

      • Automatically unloading the model from VRAM after a period of inactivity.
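
      (Roughly, that first point boils down to arithmetic like the sketch below; the numbers and the overhead figure are made up for illustration, not Ollama’s actual heuristics.)

```python
# Illustrative sketch of a VRAM layer-split estimate (not Ollama's real logic).

def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  overhead_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    overhead_gb reserves room for the KV cache, CUDA context, etc.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))

# Example: a ~13 GB Q4 quant with 40 layers on an 11 GB card
n = layers_on_gpu(model_gb=13.0, n_layers=40, vram_gb=11.0)
print(f"Offload ~{n} of 40 layers to the GPU; the rest run on the CPU.")
```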

      I had an easier time setting up Ollama than the other options, and OP apparently already has it set up.

    • BlackSnack@lemmy.zip (OP)
      23 days ago

      Bet. Looking into that now. Thanks!

      I believe I have 11 GB of VRAM, so I should be good to run decent models, from what I’ve been told by the other AIs.