I've run a comparison benchmark for the smaller models https://gist.github.com/mhitza/f5a8eeb298feb239de10f9f60f841...
I'm comparing it against the RTX 4000 SFF Ada (20GB), which is around $1.2k (if you believe the original price on the Nvidia website https://marketplace.nvidia.com/en-us/enterprise/laptops-work...) and which I have access to on a Hetzner GEX44.
I'm going to ballpark it at 2.5-3x faster than the desktop. Except for the tg128 test, where the difference is "minimal" (but I didn't do the math).
The whole point of these integrated memory designs is to go beyond that 20 GB VRAM.
Actually, you can combine them. Compared to a Mac Studio, the main advantage of these Strix Halo boxes is that you can still add a bunch of eGPUs over USB4/OCuLink for better PP (prompt processing).
Thanks for the excellent writeup. I'm pleasantly surprised that ROCm worked as well as it did. For the price these aren't bad for LLM workloads and some moderate gaming. (Apple is probably still the king of affordable at-home inference, but for games... it's amazing these days, but Linux is so much better.)
I switched to Fedora Sway as my daily driver nearly two years ago. A Windows title wasn’t working on my brand new PC. I switched to Steam+Proton+Fedora and it worked immediately. Valve now offers a more stable and complete Windows API through Proton than Microsoft does through Windows itself.
Jeff - check out the distributed-llama project... you should be able to distribute over the entire cluster.
I've been testing Exo (seems dead), llama.cpp RPC (has a lot of performance limitations) and distributed-llama (faster but has some Vulkan quirks and only works with a few models).
See my AI cluster automation setup here: https://github.com/geerlingguy/beowulf-ai-cluster
I was building that through the course of making this video, because it's insane how much manual labor people put into building home AI clusters :D
https://github.com/b4rtaz/distributed-llama ?
He mentioned that in the video.
For those who are already in the field and doing these things: if I wanted to start running my own local LLM, should I find an Nvidia 5080 GPU for my current desktop, or is it worth trying one of these Framework AMD desktops?
The short answer is that the best value is a used RTX 3090 (the long answer being, naturally, it depends). Most of the time, the bottleneck for running LLMs on consumer grade equipment is memory and memory bandwidth. A 3090 has 24GB of VRAM, while a 5080 only has 16GB of VRAM. For models that can fit inside 16GB of VRAM, the 5080 will certainly be faster than the 3090, but the 3090 can run models that simply won't fit on a 5080. You can offload part of the model onto the CPU and system RAM, but running a model on a desktop CPU is an enormous drag, even when only partially offloaded.
Obviously an RTX 5090 with 32GB of VRAM is even better, but they cost around $2000, if you can find one.
What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.
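To make the sizing argument above concrete, here is a minimal Python sketch, not a benchmark. It checks whether a model's weights fit in a given memory pool and gives a rough upper bound on generation speed, assuming each generated token has to read every weight once from memory. The memory sizes are the ones quoted in this thread; the function names, the 2 GB overhead for KV cache and buffers, the 20 GB and 100 GB model sizes, and the 250 GB/s bandwidth are illustrative assumptions, not measurements.

```python
# Rough sizing sketch. Function names and the default overhead are hypothetical.

def fits_in_memory(model_gb: float, mem_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed overhead fit in the memory pool."""
    return model_gb + overhead_gb <= mem_gb


def decode_tps_upper_bound(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once (dense model)."""
    return bandwidth_gb_s / model_gb


# Memory sizes quoted in the thread (GB).
pools = {"RTX 5080": 16, "RTX 3090": 24, "RTX 5090": 32, "Strix Halo": 128}

for name, mem in pools.items():
    print(f"{name}: fits a 20 GB model? {fits_in_memory(20, mem)}")

# Illustrative only: a 100 GB dense model on an assumed 250 GB/s memory pool.
print(decode_tps_upper_bound(100, 250))
```

The bound ignores compute, prompt processing, and MoE sparsity, so real numbers will be lower for dense models and can be higher for MoE models.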
Used 3090s have been getting expensive in some markets. Another option is dual 5060 Ti 16 GB cards. Mine are lower powered, with a single 8-pin power connector, so they max out around 180W. With that I'm getting 80 t/s on the new Qwen3 30B A3B models, and around 21 t/s on Gemma 27B with vision. Cheap and cheerful setup if you can find the cards at MSRP.
Just a point of clarification. I believe the 128GB Strix Halo can only allocate up to 96GB of RAM to the GPU.
108 GB or so under Linux.
The BIOS allows pre-allocating 96 GB max, and I'm not sure if that's the maximum for Windows, but under Linux, you can use `amdttm.pages_limit` and `amdttm.page_pool_size` to raise the limit [1].
[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
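As a hedged sketch of the arithmetic behind those two parameters: assuming both are counted in memory pages and that a page is 4 KiB (both are assumptions here; check the linked post and your kernel before using any value), the figure for a target allocation can be computed like this. The helper name `gib_to_pages` is hypothetical, and the 108 figure is taken from the comment above and treated as GiB.

```python
# Convert a target GPU-accessible memory size into a page count.
# Assumption: 4 KiB pages, and both parameters take a number of pages.
PAGE_SIZE_BYTES = 4096


def gib_to_pages(gib: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Number of pages needed to cover `gib` GiB."""
    return (gib * 1024**3) // page_size


pages = gib_to_pages(108)
# Print the values in kernel-parameter form for copy/paste.
print(f"amdttm.pages_limit={pages} amdttm.page_pool_size={pages}")
```

Leave enough RAM for the OS; the right ceiling depends on the workload.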
If you think the future is small models (27B) get Nvidia; if you think larger models (70-120B) are worth it then you need AMD or Apple.
I wonder how much MoE will disrupt this. qwen3:30b-a3b is pretty good even on pure CPU, but a lot smarter than a 3B parameter model. If the CPU-GPU bottleneck isn't too tight, a large model might be able to sustainably cache the currently active experts in GPU RAM.
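A rough way to see why a model like this stays usable on CPU: per generated token, only the active parameters need to be read. The sketch below assumes about 30B total and about 3B active parameters (my reading of the "30b-a3b" name) and a 4-bit quantization; the function name and the quantization level are assumptions for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size in GB of `params_billion` parameters at a given bit width."""
    return params_billion * bits_per_weight / 8


total_gb = weights_gb(30, 4)   # must be resident in memory
active_gb = weights_gb(3, 4)   # read per generated token (assumed ~3B active)

print(f"resident: {total_gb} GB, read per token: {active_gb} GB")
print(f"per-token traffic vs. a dense 30B model: {total_gb / active_gb:.0f}x less")
```

So the memory footprint is that of a 30B model, while the per-token bandwidth cost is closer to that of a 3B model, ignoring shared layers and routing overhead.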
The recent qwen3 models run fine on CPU + GPU, and so does gpt-oss. LM Studio and Ollama are turnkey solutions where the user has to know nothing about memory management. But finding benchmarks for these hybrid setups is astonishingly difficult.
I keep thinking that the bottleneck has to be CPU RAM, and for a large model the difference would be minor. For example, with a 100 GB model such as quantised gpt-oss-120B, I imagine that going from 10 GB to 24 GB of VRAM would scale up my tk/s like 1/90 -> 1/76 (the part left in system RAM shrinks from 90 GB to 76 GB), so roughly a 20% advantage? But I can't find much on the high-level scaling math. People seem to either create calculators that oversimplify, or they seem too deep into the weeds.
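The back-of-envelope model in that comment can be written down as a small Python sketch. It assumes a dense model where every weight is read once per token, with the time per token being the sum of the CPU-resident part over CPU memory bandwidth and the GPU-resident part over GPU memory bandwidth. The function names and the 50 GB/s CPU bandwidth are illustrative assumptions; treating the GPU as infinitely fast reproduces the 1/90 -> 1/76 reasoning.

```python
def time_per_token(model_gb: float, vram_gb: float,
                   cpu_bw_gb_s: float, gpu_bw_gb_s: float) -> float:
    """Seconds per token if all weights are read once, split across GPU and CPU RAM."""
    on_gpu = min(model_gb, vram_gb)
    on_cpu = model_gb - on_gpu
    return on_cpu / cpu_bw_gb_s + on_gpu / gpu_bw_gb_s


def speedup(model_gb: float, vram_before: float, vram_after: float,
            cpu_bw_gb_s: float, gpu_bw_gb_s: float) -> float:
    """Ratio of tokens/s after vs. before increasing VRAM."""
    before = time_per_token(model_gb, vram_before, cpu_bw_gb_s, gpu_bw_gb_s)
    after = time_per_token(model_gb, vram_after, cpu_bw_gb_s, gpu_bw_gb_s)
    return before / after


# GPU treated as "free" (infinite bandwidth): only the CPU-resident part counts.
s = speedup(100, 10, 24, cpu_bw_gb_s=50, gpu_bw_gb_s=float("inf"))
print(f"{(s - 1) * 100:.0f}% faster")
```

Under these assumptions the gain is 90/76, a bit under 20%. gpt-oss-120B is an MoE model, so only the active experts are read per token and the real picture depends on which layers are offloaded; this sketch does not model that.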
I'd like a new anandtech please.
The Framework Desktop has at least two M.2 connectors for NVME. I wonder if an interconnect with higher performance than Ethernet or Thunderbolt could be established using one of the M.2 to connect to PCIe via Oculink?
There is also a PCIe x4 slot that you can use for other high throughput network options.
I missed that. Too bad it is under the power cables. It'd be hard to fit something in there using the stock case.
I had been hoping that these would be a bit faster than the 9950X because of the different memory architecture, but it appears that due to the lower power design point the AI Max+ 395 loses across the board, by large margins. So I guess these really are niche products for ML users only, and people with generic workloads that want more than the 9950X offers are shopping for a Threadripper.
Sounds about right.
I’m struggling to justify the cost of a Threadripper (let alone pro!) for a AAA game studio though.
I wonder who can justify these machines. High frequency trading? Data science? Shouldn't that be done on servers?
Threadripper very rarely seems to make any sense. The only times it seems like you want it are for huge memory support/bandwidth and/or a huge number of PCIe slots. But it's not cheap or supported enough compared to EPYC to really make sense to me any time I've been speccing out a system along those lines.
I bought a threadripper pro system out of desperation, trying to get secondhand PCIe 80G A100s to run locally. The huge rebar allocations confused/crashed every Intel/AMD system I had access to.
I think the Xeon systems should have worked and that it was actually a motherboard bios issue, but I had seen a photo of it running in a threadripper and prayed I wasn’t digging an even deeper hole.
Yeah, that makes sense if you just have ~proof that some configuration works and want to just be done with it.
This is why a business like Puget Systems, or a line like HP Z Workstations, persist. You know in advance that your rig will work.
Yeah I don't get it either. To get marginally more resources than the 9950X you have to make a significant leap in price to a $1500+ CPU on a $1000 motherboard.
It also seems like the tools aren't there to fully utilize them. Unless I misunderstood, he was running on CPU only for all the tests, so there's still iGPU and NPU performance that hasn't been utilized in these tests.
No, only a couple initial tests with Ollama used CPU. I ran most tests on Vulkan / iGPU, and some on ROCm (read further down the thread).
I found it difficult to install ROCm on Fedora 42 but after upgrading to Rawhide it was easy, so I re-tested everything with ROCm vs Vulkan.
Ollama, for some silly reason, doesn't support Vulkan, even though I've used a fork many times to get full GPU acceleration with it on Pi, Ampere, and even this AMD system... (moral of the story: just stick with llama.cpp).
Sadly, the reason they give is subjectively terrible:
https://x.com/ollama/status/1952783981000446029
No experimental flag option, no "you can use the fork that works fine but we don't have capacity to support this" just a hard "no, we think it's unreliable". I guess they just want you to drop them and use llama.cpp.
Yeah, my conspiracy theory is Nvidia is somehow influencing the decision. If you can do Vulkan with Ollama, it opens up people to using Intel/AMD/other iGPUs and you might not be incentivized to buy an Nvidia GPU.
ROCm support is not wonderful. It's certainly worse for an end user to deal with than Vulkan, which usually 'just works'.
I agree. AMD should just go all in on Vulkan, I think. The ROCm compatibility list is terrible compared to... every modern device, and probably some ancient GPUs, that can be made to work with Vulkan.
Considering they created Mantle, you would think it would be the obvious move too.
Vulkan is Mantle. Vulkan was developed out of the original Mantle API that AMD brought to Khronos. What do you mean "AMD should just go all in on Vulkan"? They've been "all in" on Vulkan from the beginning because they were one of the lead authors of the API.
Hi Jeff, I'm a Linux ambassador for Framework and I have one of these units. It'd be interesting if you would install ramalama on Fedora and test that. I've been using it as a drop-in replacement for Ollama, and everything was GPU accelerated out of the box. It pulls ROCm from a container and just figures it out. Would love to see actual numbers though.
Great work on this!
Kinda bummed. I get why he used Ollama, but I feel like using llama.cpp directly would provide better and more consistent results.
As the article describes, most of this was done with llama.cpp, not Ollama.
Ahh, good catch. I didn't notice that if you scroll lower, he has the llama.cpp results. The ollama-benchmark repo name is a misnomer.
I'm slowly migrating all my testing to https://github.com/geerlingguy/beowulf-ai-cluster
So, TL;DR?
I saw mixed results but comments suggest very good performance relative to other at-home setups. Can someone summarize?
I put most of the top-line numbers and some graphs on my blog: https://www.jeffgeerling.com/blog/2025/i-clustered-four-fram...
Great! As always fantastic writeup
I was about to be annoyed until you said you got preprod units. I guess I'll have to build on this when my desktop shows up.