I’m looking to build a low-end ollama LLM server to improve Home Assistant voice control, Immich image recognition, and a few other services. With the current cost of hardware components like memory, I’m looking to build something small but somewhat expandable.
I have an old micro-ATX form factor computer that I’m thinking will be a good option to upgrade. I’d love recommendations on motherboard, processor, and video card combos that would likely be compatible and sufficient to run a decent server while keeping costs low; basically, the best bang for the buck. I have a couple of M.2 SSDs I can repurpose. I’d prefer a motherboard with 2.5Gbit Ethernet, but otherwise I’m open.
Also, recommendations on sites that ship to the US where I can purchase good-quality memory at reasonable prices would be great. I’d be willing to look at lightly used components, too.
Any advice on any of these topics would be greatly appreciated. The advice I’ve found is all out of date, especially with crypto fading (so video cards aren’t as expensive anymore) while LLM data centers are eating up memory and reserving it before it’s even manufactured.
Not a very popular opinion, but if you want an inexpensive, really inexpensive variant, take the AMD RX 9070 XT. AMD cards are not the most popular for AI, but they are not bad with ROCm, and for the price of one 5090 you can put in 5 of them (80 GB of VRAM total).
I suggest using llama.cpp instead of ollama; you can easily squeeze out +10% inference speed and other memory optimizations with llama.cpp. With hardware prices nowadays, I think every % saved on resources matters. Here is a simple Ansible role to set up llama.cpp; it should give you a good idea of how to deploy it.
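If it helps, here’s roughly what talking to llama-server looks like over its OpenAI-compatible API (a minimal sketch; it assumes a llama-server instance already listening on localhost:8080 with a model loaded, and the host, port, and prompt are just placeholders):

```python
# Minimal sketch: query llama.cpp's llama-server through its
# OpenAI-compatible chat endpoint. Assumes it was started with something like:
#   llama-server -m model.gguf --host 0.0.0.0 --port 8080
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Turn off the kitchen lights."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # placeholder host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```

Anything that speaks the OpenAI API can be pointed at that same endpoint, which as far as I know covers most of the current Home Assistant LLM integrations, so swapping ollama for llama.cpp is mostly a URL change.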
A dedicated inference rig is not gonna be cheap. What I did, since I needed a gaming rig anyway, was get 32GB of DDR5 (this was before the current RAMpocalypse; if I had known, I would have bought 64) and an AMD 9070 (16GB VRAM; again, if I had known how crazy prices would get, I’d probably have bought a 24GB VRAM card). The home server runs the usual non-AI stuff, and llama.cpp runs on the gaming desktop (the home server just has a proxy to it). Yeah, the gaming desktop has to be powered up when I want to run inference, but it’s my main desktop so it’s powered on most of the time. No big deal.
Also a reason not to use ollama: https://sleepingrobots.com/dreams/stop-using-ollama/
honestly it’s hard to beat Macs these days in this space for two reasons:
- unified memory means you don’t have to load up on system RAM just to hold the model and then also shell out for a video card with barely enough VRAM to fit a basic language model
- their supply chain is solid and has mostly avoided the constraints that other OEMs and parts manufacturers are struggling with
pricing is tough. sure, crypto is on its way out, but GPUs are still the platform of choice for most neural net workloads (outside of SoCs like Apple M-series). i built a PC in late 2024, and it’s easily worth twice what i paid for it.
Depends what you want to do… For example, I didn’t get Python Whisper running in a container on a Mac in any way that could be called “performant”, and I don’t want my dev workflow to optimize for an OS I despise :D
in a container
well there’s your issue. i get not liking the OS, but actively crippling your project will cripple your project.
containers on macOS do kinda suck
That’s such a Mac answer it’s unbelievable.
Describing “a project aimed to be agnostic of its environment” as a design mistake and not an inherent flaw of the OS is… Just wow.
Remember, this thread is about the pros and cons of macOS as inference hardware. This is a major flaw which comes baked into the hardware. I tested it and found it an unacceptable limitation. It’s important for others to know.
To state “containerization is the issue” though… Just wow.
Unfortunately, containerisation on macOS usually means running virtualized Linux, which of course is going to add overhead and cut off access to Apple APIs and some hardware. So yep. There’s plenty that runs natively, though.
thanks for clarifying. it was hard for me to dignify such a comment with a response.
you’re also going to run into hardware acceleration issues trying to run Metal acceleration with a Linux kernel. i don’t really see a need to containerize these workloads these days anyway with tools like `uv`.

it’s a big pain in my ass at times trying to do web dev work with an `aarch64-darwin` dev env vs the target `x86_64-linux`. adding in hardware acceleration issues just sounds painful.

i also just personally don’t like containers. feels like a bludgeon of a solution.
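for example, with uv’s inline script metadata (PEP 723), a native whisper script carries its own deps and runs with just `uv run transcribe.py audio.wav`, no container in sight. a sketch (the script name and model size are placeholders, and openai-whisper needs ffmpeg installed):

```python
# /// script
# dependencies = ["openai-whisper"]
# ///
# Run natively with: uv run transcribe.py audio.wav
# uv resolves the inline dependencies above on the fly; no container,
# so the native toolchain stays fully reachable.
import sys

import whisper

model = whisper.load_model("base")      # downloads weights on first use
result = model.transcribe(sys.argv[1])  # path to an audio file
print(result["text"])
```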
Yeah, but I don’t want to get locked into a proprietary OS or have to put a lot of effort into hacking it to run Linux.
I haven’t looked into Asahi Linux in a while now, but I figured the experience would be pretty good by now. You don’t need to “hack” anything to get it to run. Last I read, there were just a few driver issues, but I haven’t looked into it in probably 2-3 years now.
super fair. i am a Linux guy normally. i’m just being honest. i wish there were a better, more open alternative.
if you want to go with the Linux alternative it’s going to cost. get at least 32GB of RAM and at least a 4090 to run the kind of models you’re asking for. it’s the way she goes
Apple silicon is more energy efficient, but the latest Intel and AMD CPUs deliver more processing power and can also share a significant amount of RAM with the GPU / AI components.
Going to second this, it’s all my M2 does right now. Putting together a solution for the office with some M4s.
It’s a lot of bang for the buck specifically for LLM use, despite being horribly overpriced otherwise.
i built a PC in late 2024, and it’s easily worth twice what i paid for it.

I wrote the vendor and asked him if the decimal was in the right place, or if this was the model that was beta-testing alien technology. Got to be a misprint.
Until you tell us what your budget is, I’m not sure there’s much to discuss. You’re talking about motherboards, so I guess your choice right now comes down to Strix Halo or not?
Definitely not needing something that high-end. It’s just me and maybe one other person using it periodically for voice commands, which need to be realtime. The rest is background-processed stuff like Immich image recognition and Jellyfin audio/video processing. Nothing fancy is needed. I mention motherboards because the system I’m thinking of using currently runs Plex, which I’m in the process of replacing with Jellyfin on my Kubernetes cluster of mini PCs and Raspberry Pis. That cluster runs most of my stuff pretty well but could benefit from dedicated LLM/ML hardware. So that machine will be freed up, but it’s nearly a decade old and not up to the task as it is.
As for a specific budget, I don’t have numbers in mind. My Kubernetes cluster is super energy efficient since it’s all small systems that only spin up when needed, so I’m thinking about overall cost of ownership vs. benefit. Something too high-end would waste energy as well as the initial investment.
You could consider something from the Radxa Rock 5 series.
It’s hard to say from what you described exactly what your requirements are in terms of VRAM/RAM, but as a general recommendation, whether AMD or Intel, I’d stick with DDR4-generation hardware. DDR5 is extremely expensive, and any non-MoE model that spills into system memory will still be frustratingly slow either way, so the pricier memory doesn’t buy you much for inference.
For GPUs, the best bang for your buck if you want Nvidia is probably the 3060 12GB. It has 360GB/s of memory bandwidth, and one or more of those is a very reasonable starting point for local AI.
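To put that bandwidth number in context: token generation is roughly memory-bandwidth-bound, since every generated token has to stream the active weights through memory. A back-of-envelope sketch (illustrative numbers, not benchmarks):

```python
# Rough ceiling on decode speed for a dense model:
#   tokens/sec ~= memory bandwidth / size of active weights
# Real throughput lands below this, but the ratios are what matter.

def tok_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

MODEL_GB = 4.5  # e.g. a ~7-8B dense model quantized to Q4

for name, bw in [
    ("RTX 3060 12GB     (360 GB/s)", 360.0),
    ("dual-ch DDR5-5600 (~90 GB/s)", 90.0),
    ("dual-ch DDR4-3200 (~51 GB/s)", 51.0),
]:
    print(f"{name}: ~{tok_per_sec_ceiling(bw, MODEL_GB):.0f} tok/s ceiling")
```

Even dual-channel DDR5 has roughly a quarter of a 3060’s bandwidth, which is why a dense model spilling into system RAM crawls no matter how much you spent on the sticks. MoE models are the exception noted above: only a few billion parameters are active per token, so the effective “model size” in that formula shrinks.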
If you’re okay with AMD, there are some really unique cards floating around. I recently picked up a V620 off eBay for $350; it’s an ex-datacenter card with 32GB GDDR6 @ 512GB/s bandwidth. It’s a bit of a power hog, but in my early testing it was running Qwen3 Coder 30B at like 100 tokens/sec.

I run it on an ASUS X570 PRO board, which is the cheapest AM4 board I could find with an optimal PCIe setup: three x16 slots running 4.0x8, 4.0x8, 3.0x4. I have successfully tested it with the V620, a 9060 XT, and a 3060 for 60GB total VRAM, though the third x16 is only single-slot so I had to borrow a PCIe extender cable to try it. I’ve found 48GB of VRAM is plenty for me, so I doubt I’ll actually run a third card unless I find a good deal on a single-slot one.
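For anyone wondering how a model actually gets spread across mismatched cards like that: llama.cpp’s --tensor-split takes per-GPU proportions. A sketch of the launch (the flags are standard llama-server options; the model path and ratios are placeholders for a 32GB + 16GB pairing):

```python
# Sketch: launch llama-server across two mismatched GPUs.
# --tensor-split divides the weights in the given proportions, so the
# 32GB card takes ~2/3 and the 16GB card takes ~1/3; -ngl 99 offloads
# all layers to the GPUs.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "/models/qwen3-coder-30b-q4.gguf",  # placeholder path
    "-ngl", "99",
    "--tensor-split", "2,1",
    "--host", "0.0.0.0",
    "--port", "8080",
], check=True)
```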
Kinda turned into a ramble, but let me know if you’ve got questions.