The AMD Ryzen AI MAX+ 395 (Strix Halo) is a genuinely interesting machine for running large language models locally: 128 GB of unified memory, a reported ~225 GB/s of real-world memory bandwidth, and an iGPU that can address the full pool. You can run a 35B MoE model fully on-GPU at 40–55 t/s. No cloud, no API bill, no data leaving the machine.

Getting there takes some work. The hardware is capable but the software stack has sharp edges. Here are the five things that cost the most time on a fresh setup, and how to fix them. For the full step-by-step guide, see the Strix Halo setup reference.


1. You are only getting 61 GB of GPU memory

Out of the box, ROCm caps GPU memory allocation to ~61 GB on APUs. The machine has 128 GB. Without the right kernel parameters, roughly half of it is inaccessible to your models.

Add these to GRUB_CMDLINE_LINUX in /etc/default/grub:

iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.no_system_mem_limit=1 amdgpu.cwsr_enable=0
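
As a sanity check, the two big values describe the same 124 GiB GTT pool: amdgpu.gttsize is in MiB, ttm.pages_limit in 4 KiB pages.

echo "126976 / 1024" | bc          # amdgpu.gttsize in GiB: 124
echo "32505856 * 4 / 1024^2" | bc  # ttm.pages_limit in GiB: 124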

Then:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot
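
That path is Fedora-style. On Debian or Ubuntu, the equivalent is:

sudo update-grub && sudo reboot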

Verify afterwards:

echo "scale=1; $(cat /sys/class/drm/card*/device/mem_info_gtt_total | head -1) / 1024^3" | bc
# Should print: 124.0

The key parameter is amdgpu.no_system_mem_limit=1. Without it, the kernel silently caps allocations regardless of what you configure elsewhere.


2. Your GPU is running at 600 MHz

The GPU clocks down to ~600 MHz when idle and does not boost appropriately during inference unless you intervene. The result is roughly 50% of the throughput you should be getting, with no error message — the model just runs slow.

echo "high" | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

The high setting does not persist across reboots. Create a systemd service to set it on startup; the full guide has the complete unit file.
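
A minimal version looks like this (the service name is an assumption; adjust the card path if your iGPU is not card0):

# /etc/systemd/system/gpu-performance.service
[Unit]
Description=Force amdgpu performance level to high

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now gpu-performance.service.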


3. Two llama-server flags are non-negotiable

On Strix Halo, these two flags are mandatory for every llama-server invocation:

-fa 1 --no-mmap

-fa 1 enables flash attention; without it, inference crashes. --no-mmap disables memory mapping; with mmap left enabled, Strix Halo unified memory suffers silent data corruption or hangs. Neither produces a helpful error when missing: you just get crashes or wrong outputs.

Always include both. Put them in whatever launch script or config you use so they cannot be accidentally omitted.
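
For example, a typical launch looks something like this (the model path, context size, and port are placeholders; -fa 1 and --no-mmap are the parts that matter):

llama-server -m /models/your-model.gguf -ngl 99 -c 32768 -fa 1 --no-mmap --port 8080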


4. Toolbox from SSH fails with "Access denied"

If you run llama-server inside a Podman toolbox container from an SSH session or a systemd service, you will hit this:

crun: sd-bus call: Access denied

The cause: Podman defaults to the systemd cgroup manager, which requires polkit interactive auth. On a headless server, that auth never happens.
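
You can confirm which manager Podman is using:

podman info --format '{{.Host.CgroupManager}}'
# Prints "systemd" by default; "cgroupfs" after the fix below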

Fix:

mkdir -p ~/.config/containers
cat >> ~/.config/containers/containers.conf << 'EOF'
[engine]
cgroup_manager = "cgroupfs"
EOF

Test it:

toolbox run echo "cgroup test OK"


5. ROCm containers do not release GPU memory between model swaps

If you are using llama-swap to hot-swap between models, be aware that ROCm/HIP retains the GPU memory pool within a container namespace after the llama-server process exits. Vulkan/RADV releases memory immediately. ROCm does not.

The consequence: if two different ROCm models share the same container, the second model will fail to allocate GPU memory after the first has been evicted. The fix is to create a separate container for each ROCm model, even if they use the same image:

toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344 llama-rocm-pr21344
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344 llama-rocm-pr21344-b

Then assign each model config its own TOOLBOX_CONTAINER. Vulkan models can share a container without issue.
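
In a llama-swap config, that ends up looking roughly like this (a sketch: the model names and paths are placeholders, and it assumes each cmd reaches llama-server directly through toolbox run --container; adapt it to however your setup consumes TOOLBOX_CONTAINER):

models:
  "model-a":
    cmd: toolbox run --container llama-rocm-pr21344 llama-server -m /models/model-a.gguf -fa 1 --no-mmap --port ${PORT}
  "model-b":
    cmd: toolbox run --container llama-rocm-pr21344-b llama-server -m /models/model-b.gguf -fa 1 --no-mmap --port ${PORT}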


Once you have cleared these, the machine runs well. Pre-built containers for gfx1151 are at kyuz0/amd-strix-halo-toolboxes and cover the three backend variants you will need. The full setup guide covers everything from BIOS to model downloads.

One to watch: TurboQuant KV cache compression

TurboQuant (Zandieh et al., ICLR 2026) promises significant KV cache compression. The current implementation in TheTom/llama-cpp-turboquant achieves 5.12x compression vs FP16 with block_size=128, and a recent sparse V dequantization optimization adds another 22.8% to decode speed at 32K context. On a 128 GB machine that translates to meaningfully more usable context before you hit memory limits.
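
To make that concrete, here is a back-of-envelope sizing for a hypothetical dense model (48 layers, 8 KV heads, head dim 128, K and V both cached at 2 bytes per element in FP16):

echo "2 * 2 * 48 * 8 * 128 * 32768 / 1024^3" | bc -l   # ~6.0 GiB of FP16 KV cache at 32K context
echo "6.0 / 5.12" | bc -l                              # ~1.2 GiB at 5.12x compression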

HIP/ROCm support is confirmed in TheTom's fork, so it is usable today on Strix Halo. The flag syntax is --cache-type-k turbo3 --cache-type-v turbo3. Note that turbo3 is a fork-specific cache type; the standard -ctk/-ctv flags in mainline do not accept it.
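
A full launch on the fork then looks something like this (the model path is a placeholder; the section 3 flags still apply):

llama-server -m /models/your-model.gguf -fa 1 --no-mmap \
  --cache-type-k turbo3 --cache-type-v turbo3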

It is not in llama.cpp mainline yet, and no formal upstream PR has been submitted. The main discussion is at #20969 (162 comments, very active) and the feature request is at #20977. An eventual upstream merge is the trigger to watch: once it lands, mainline llama.cpp gets it and the kyuz0 containers will pick it up on the next rebuild.