@brucethemoose

brucethemoose@lemmy.world · 16 minutes

Mostly, yeah.

Sometimes it’s better to “cut it close,” with (for instance) a 27B model that’s nearly OOMing your VRAM fully offloaded, but you know will be fine in regular use without too many programs open.

In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.

brucethemoose@lemmy.world · 3 hours

Or exllama! Vllm, sglang, Lorax. Koboldcpp, Aphrodite, text-generation-webui, LM Studio, powerinfer, ktransformers, mlc-LLM, really whatever floats your boat. Just not ollama, specifically.

brucethemoose@lemmy.world · 4 hours

Mind you, I’m running Mimo, not the big Mimo Pro.

But yeah. I really like the model, even for one of its size. And it hardly feels quantized as a trellis quant.

brucethemoose@lemmy.world · 12 hours

CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.

This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.

brucethemoose@lemmy.world · 12 hours

I completely disagree.

Frankly, I find the description “VC funding a FOSS” offensive. They aren’t funding the engine. I’ve been messing with LLM inference engines since 2022, and Ollama is the worst I’ve seen in the community.

They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatibile with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn’t really work and they’ll never upstream one line. They set a default context size thats basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.

And if that’s all fine, they’re enshittifying the app with closed code, and pointers to cloud models.

They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the “default.” Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama communtity hates their guts, and that’s not even getting into the interpersonal drama they’ve stirred.

They are a leech that’s a net drag to the whole community, that we can’t get rid of because they’re attention grifters. And they’ve gotten worse and worse over time.

It’s more morale to use any cloud API over Ollama, in my eyes. They’re a grift.

EDIT: And, to be clear, I’m not against VC funded downstream stuff.

LM Studio is good! Even though it’s closed source.

Tons of downstream projects are great.

brucethemoose@lemmy.world · 14 hours

Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I’m running a 300B model on a single 3090, and its faster than I can read.

You just need to use the right framework, and the right model.

I’d suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B

And speculative decoding like DFlash or MTP (which you can also get specific models for).

EDIT: Wrong link.

brucethemoose@lemmy.world · 15 hours

Oh, and I just saw you have a 3090.

To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main

Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF

If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.

If you have 64GB, I’d suggest a quantization of Step 3.7.

If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.

brucethemoose@lemmy.world · 15 hours

https://sleepingrobots.com/dreams/stop-using-ollama/

And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.

LM Studio is better, and easy.

If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them without bad results, there can’t be.

But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosing your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

brucethemoose@lemmy.world · 15 hours

An aside for anyone reading this:

https://sleepingrobots.com/dreams/stop-using-ollama/

And that barely scratches the surface. Please.

Use anything but Ollama. Even APIs.

brucethemoose@lemmy.world · 15 hours

How much CPU RAM do you have?

brucethemoose@lemmy.world · 15 hours

Did you serve them with ollama?

It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.

brucethemoose@lemmy.world · 15 hours

Yep.

I have a RTX 3090 + 128GB CPU RAM.

Currently I run my own custom IQ3_KT quantization of MiMo 2.5 300B, and it’s crazy good. It’s better than API models from not that long ago, and it’s served at about reading speed.

Never thought I’d ever run such a thing on my lowly desktop.

For quick scripts or code assistant, sometimes I use Qwen 27B (another custom quant, currently experimenting with exllama). Or Gemini 12B for messing with image/audio input. But TBH MiMo 2.5 with thinking disabled is smarter than 27B with it.

…And honestly, I use GLM 5.2 API a good bit.

I was lucky enough to get a yearly subscription for like $30, 6 months ago. I do self host the UIs or whatever takes the prompts, though.

brucethemoose@lemmy.world · 19 hours

Yep.

That’s because DDG is Bing. To be blunt, it’s search is kinda terrible.

brucethemoose@lemmy.world · 21 hours

Which is the answer to much of “why modern cars are the way they are.”

There’s a bit of a survivor bias for old stuff.

brucethemoose@lemmy.world · 23 hours

It’s not Android. It’s SailfishOS. With first party support.

And even that aside, I don’t see anything comparable on Aliexpress, hardware wise.

brucethemoose@lemmy.world · 24 hours

1st-party supported SailfishOS, to be specific.

That’s huge, to me.

brucethemoose@lemmy.world · 24 hours

This is fine.

But the optics are important. There’s a concerted effort to delegitimize Wikipedia as an information source, as it’s not in Big Tech’s control.

And they don’t have to kill it. They just have to make it less popular than, say, Grokipedia, and every headline like this is a step in that direction.

Hence I have very scientifically minded family who are already saying some strange things about Wikipedia.

brucethemoose@lemmy.world · 1 day

That’s more like it!

And I completely disagree with the people saying it should be much cheaper.

It’s a LTE Linux computer. In 2026. With multiple screens, a 48MP camera, good DAC, enough power to run real Android apps and tons of bells and whistles; what do you expect?

Electronics are expensive, unless it’s cheap garbage, heavily subsidized, or both. That has a huge externalized cost, and avoiding that is the whole point of this phone. R&D, customer service, and continued software support for the translation layer and OS, must crazy expensive too.

I know wages haven’t gone up with inflation, which makes $400 hard to afford, but that’s not in Commodore’s control.

If one wants a cheaper AliExpress Android fliphone, that’s reasonable.

But it’s not the same product. And you’re going to pay for it in other ways.

brucethemoose@lemmy.world · 1 day

The messenger matters.

Would you care about anything I was saying if I was a bot?

Or a Musk/Theil bootlicker?

Especially on this topic. Nodding heads about the loss of the internet to engagement chum on Twitter is the opposite of poetic.

brucethemoose@lemmy.world · 2 days

Well, he tweets many times a day, many posts like this:

…Seems like a “Tech Bro” type to me. He’s just engagement farming; I don’t care what he says, there nothing valid about that.

In fact, I’d wager some of those posts are automated.