Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 17 days ago

Honestly, most LLMs suck at the full 128K. Look up benchmarks like RULER.

In my personal tests over API, LLama 70B is bad out there. Qwen (and any fine tune based on Qwen Instruct, with maybe an exception or two) not only sucks, but is impractical past 32K once its internal rope scaling kicks in. Even GPT-4 is bad out there, with Gemini and some other very large models being the only usable ones I found.

So, ask yourself… Do you really need 128K? Because 32K-64K is a boatload of code with modern tokenizers, and that is perfectly doable on a single 24G GPU like a 3090 or 7900 XTX, and that’s where models actually perform well.

brucethemoose@lemmy.world · edit-2 19 days ago

Late to this post, but shoot for and AMD Strix Halo or Nvidia Digits mini PC.

Prompt processing is just too slow on Apple, and the Nvidia/AMD backends are so much faster with long context.

Otherwise, your only sane option for 128K context in a server with a bunch of big GPUs.

Also… what model are you trying to use? You can fit Qwen coder 32B with like 70K context on a single 3090, but honestly its not good above 32K tokens anyway.

brucethemoose@lemmy.world · edit-2 4 months ago

To go into more detail:

Exllama is faster than llama.cpp with all other things being equal.
exllama’s quantized KV cache implementation is also far superior, and nearly lossless at Q4 while llama.cpp is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality)
With ollama specifically, you get locked out of a lot of knobs like this enhanced llama.cpp KV cache quantization, more advanced quantization (like iMatrix IQ quantizations or the ARM/AVX optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference and such.

It’s not evidence or options… it’s missing features, thats my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than tabbyAPI/EXUI on the same hardware, and there’s no way around it.

Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how llama 3.1 (for instance) was bugged past 8K at launch because it doesn’t properly support its rope scaling. Ollama inherits all these quirks.

I don’t want to go into the issues I have with the ollama devs behavior though, as that’s way more subjective.

brucethemoose@lemmy.world · edit-2 4 months ago

It’s less optimal.

On a 3090, I simply can’t run Command-R or Qwen 2.5 34B well at 64K-80K context with ollama. Its slow even at lower context, the lack of DRY sampling and some other things majorly hit quality.

Ollama is meant to be turnkey, and thats fine, but LLMs are extremely resource intense. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.

brucethemoose@lemmy.world · 4 months ago

Your post is suggesting that the same models with the same parameters generate different result when run on different backends

Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

There are even more exotic weight quantization schemes (ALQM, VPTQ) that are much more VRAM efficient than llama.cpp or exllama, but I skipped mentioning them (unless somedone asked) because they’re so clunky to setup.

Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation or grammar.

brucethemoose@lemmy.world · 4 months ago

So there are multiple ways to split models across GPUs, (layer splitting, which uses one GPU then another, expert parallelism, which puts different experts on different GPUs), but the way you’re interested in is “tensor parallelism”

This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.

It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

But if you simply want to stuff the best/highest quality model you can at VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.

brucethemoose@lemmy.world · edit-2 4 months ago

Absolutely.

Only aphrodite (and other enterprise backends like vllm/sglang) can make use of NVLink, but even exllama or mlc-llm split across GPUs nicely over PCIe, no NVLink needed.

2x 3090s or P40s is indeed a popular config among local runners, and is the perfect size for a 70B model. Some try to squeeze Mistral-Large in, but IMO its too tight a fit.

brucethemoose@lemmy.world · edit-2 4 months ago

Guide to Self Hosting LLMs Faster/Better than Ollama