Joined 2 years ago
Cake day: March 22nd, 2024
  • Mostly, yeah.

    Sometimes it’s better to “cut it close,” with (for instance) a 27B model that nearly OOMs your VRAM when fully offloaded, but that you know will be fine in regular use without too many programs open.

    In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.

  • CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.

    This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-light to run. In practice, that’s most of the size of modern MoE LLMs.
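
    As a rough sketch of how one might pick a value for that parameter: the numbers below (model size, expert share, layer count, VRAM budget) are made up for illustration, and the helper names are hypothetical. It assumes expert weights are spread evenly across layers and ignores KV cache and compute buffers, so leave headroom in the VRAM budget.

```python
import math


def layers_to_offload(total_gb, expert_fraction, n_layers, vram_budget_gb):
    """Estimate how many layers' expert weights must stay in CPU RAM.

    Assumes (simplification) that expert weights are split evenly
    across layers and that everything else goes to the GPU.
    """
    expert_gb_per_layer = total_gb * expert_fraction / n_layers
    overflow_gb = total_gb - vram_budget_gb
    if overflow_gb <= 0:
        return 0  # whole model fits in VRAM, nothing to offload
    n = math.ceil(overflow_gb / expert_gb_per_layer)
    if n > n_layers:
        # Even with every expert on the CPU, the dense part is too big.
        raise ValueError("non-expert weights alone exceed the VRAM budget")
    return n


def build_command(model_path, n_cpu_moe):
    """Build a llama-server argument list with the computed value."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                      # offload all layers to the GPU...
        "--n-cpu-moe", str(n_cpu_moe),     # ...except experts of the first N layers
    ]


if __name__ == "__main__":
    # Hypothetical 120 GB quant, 90% of it in experts, 48 layers, 20 GB usable VRAM.
    n = layers_to_offload(120, 0.9, 48, 20)
    print(" ".join(build_command("model.gguf", n)))
```

    Treat the result as a starting point only, then nudge the value up or down while watching actual VRAM and RAM usage.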

  • I completely disagree.

    Frankly, I find the description “VC funding a FOSS” offensive. They aren’t funding the engine. I’ve been messing with LLM inference engines since 2022, and Ollama is the worst I’ve seen in the community.

    They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution, yet redirect GH support requests there. They sometimes make their own GGUFs and forked releases which are broken and incompatible with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn’t really work and they’ll never upstream one line. They set a default context size that’s basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.

    And if that’s all fine, they’re enshittifying the app with closed code, and pointers to cloud models.

    They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the “default.” Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama community hates their guts, and that’s not even getting into the interpersonal drama they’ve stirred.

    They are a leech that’s a net drag to the whole community, that we can’t get rid of because they’re attention grifters. And they’ve gotten worse and worse over time.


    It’s more moral to use any cloud API over Ollama, in my eyes. They’re a grift.


    EDIT: And, to be clear, I’m not against VC funded downstream stuff.

    LM Studio is good! Even though it’s closed source.

    Tons of downstream projects are great.

  • https://sleepingrobots.com/dreams/stop-using-ollama/

    And that’s not even all of it. Basically they break models in many ways, and they’re slimy Tech Bros.

    LM Studio is better, and easy.

    If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

    It’s all pretty technical, and… that’s kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them without bad results, there can’t be.

    But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self-hosting your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

  • Yep.

    I have an RTX 3090 + 128GB CPU RAM.

    Currently I run my own custom IQ3_KT quantization of MiMo 2.5 300B, and it’s crazy good. It’s better than API models from not that long ago, and it’s served at about reading speed.

    Never thought I’d ever run such a thing on my lowly desktop.

    For quick scripts or code assistance, sometimes I use Qwen 27B (another custom quant, currently experimenting with exllama). Or Gemini 12B for messing with image/audio input. But TBH MiMo 2.5 with thinking disabled is smarter than 27B with it.


    …And honestly, I use GLM 5.2 API a good bit.

    I was lucky enough to get a yearly subscription for like $30, 6 months ago. I do self host the UIs or whatever takes the prompts, though.

  • That’s more like it!

    And I completely disagree with the people saying it should be much cheaper.

    It’s an LTE Linux computer. In 2026. With multiple screens, a 48MP camera, good DAC, enough power to run real Android apps and tons of bells and whistles; what do you expect?

    Electronics are expensive, unless it’s cheap garbage, heavily subsidized, or both. That has a huge externalized cost, and avoiding that is the whole point of this phone. R&D, customer service, and continued software support for the translation layer and OS must be crazy expensive too.

    I know wages haven’t gone up with inflation, which makes $400 hard to afford, but that’s not in Commodore’s control.


    If one wants a cheaper AliExpress Android flip phone, that’s reasonable.

    But it’s not the same product. And you’re going to pay for it in other ways.