Step-by-step guide to turning an AMD Ryzen AI MAX+ 395 (Strix Halo) machine into a local LLM inference server. Self-contained — everything needed is inline.


Hardware

Component         Detail
CPU               AMD Ryzen AI MAX+ 395
GPU               Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs) — integrated
RAM               128 GB unified LPDDR5X-8000
GPU memory        Up to 124 GB (unified, shared with system via GTT)
Memory bandwidth  ~256 GB/s theoretical; ~220–230 GB/s observed during LLM inference

The unified memory architecture is the key advantage: the iGPU can access the full 128 GB pool, allowing models far too large for any consumer dGPU to run fully on-GPU. LLM token generation is memory-bandwidth-bound; at ~225 GB/s the system comfortably runs large MoE models at 40–55 t/s.
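As a sanity check on that figure: generation speed is roughly memory bandwidth divided by the bytes read per token, which for an MoE is approximately the active-parameter bytes. A back-of-envelope sketch with illustrative numbers — ~225 GB/s observed bandwidth and a hypothetical MoE with ~4B active parameters at Q8_0 (~1 byte per weight, so ~4 GB touched per token):

echo "scale=0; 225 / 4" | bc
# → ~56 t/s theoretical ceiling; the measured 40–55 t/s sits just below it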


1. BIOS

  • Disable Secure Boot — required for the custom kernel parameters and ROCm GPU access
  • Set GPU memory allocation to maximum — in the BIOS memory/GPU section, set UMA frame buffer to maximum (or "Auto" if no explicit option; the kernel parameters below override it anyway)

2. OS Installation

Fedora Server 43 is recommended — headless, minimal, good kernel support for gfx1151. A desktop install works but adds unnecessary overhead.

After install, update fully and install essential tools:

sudo dnf upgrade -y && sudo reboot
sudo dnf install -y git curl wget aria2 bc podman toolbox

3. Kernel Boot Parameters

This is the most critical step. Without these parameters, ROCm will cap GPU memory at ~61 GB and inference may crash or perform poorly.

Edit /etc/default/grub and add the following to the GRUB_CMDLINE_LINUX line:

iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.no_system_mem_limit=1 amdgpu.cwsr_enable=0
Parameter                     Purpose
iommu=pt                      IOMMU pass-through — required for iGPU performance
amdgpu.gttsize=126976         Allow the GPU to use 124 GB of unified memory (value in MiB)
ttm.pages_limit=32505856      Allow 124 GB of pinned memory pages (value in 4K pages)
amdgpu.no_system_mem_limit=1  Prevents ROCm from capping allocation to ~61 GB on APUs
amdgpu.cwsr_enable=0          Workaround for a CWSR regression in kernels 6.18/6.19 that degrades compute throughput
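If you prefer not to edit the file by hand, a one-liner can append the parameters — a sketch that assumes GRUB_CMDLINE_LINUX is present and double-quoted in /etc/default/grub:

sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.no_system_mem_limit=1 amdgpu.cwsr_enable=0"/' /etc/default/grub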

Apply and reboot:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot

Verify the GPU has access to the full memory pool:

echo "scale=1; $(cat /sys/class/drm/card*/device/mem_info_gtt_total | head -1) / 1024^3" | bc
# Should print: 124.0

4. GPU Power Governor

Without this, the GPU clocks down to ~600 MHz when idle and may not boost appropriately during inference — causing up to 50% throughput loss.

echo "high" | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

Persist across reboots with a systemd service:

sudo tee /etc/systemd/system/gpu-power-high.service > /dev/null << 'EOF'
[Unit]
Description=Set AMD GPU power performance level to high
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now gpu-power-high.service

Verify:

cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# high

5. Podman and Toolbox Setup

LLM inference runs inside rootless Podman toolbox containers. One configuration fix is required before containers will work from SSH sessions or systemd services.

Cgroup Manager Fix

By default Podman uses the systemd cgroup manager, which requires polkit interactive auth to create cgroup scopes. On a headless server this causes toolbox run to fail with crun: sd-bus call: Access denied.

Fix by switching to cgroupfs:

mkdir -p ~/.config/containers
cat >> ~/.config/containers/containers.conf << 'EOF'
[engine]
cgroup_manager = "cgroupfs"
EOF

Verify:

toolbox run echo "cgroup test OK"

GPU Access Flag

Rootless Podman containers must be started with --userns=host to access the GPU. Without it, the GPU is detected but kernel dispatch crashes with a segfault. The toolbox run wrapper passes this flag automatically — if you ever use podman run directly for GPU workloads, always include --userns=host.
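For reference, a direct podman invocation might look like the following — a sketch, assuming the usual AMD device nodes (/dev/dri for graphics, /dev/kfd for ROCm compute):

podman run --rm -it --userns=host \
    --device /dev/dri --device /dev/kfd \
    docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2 \
    llama-cli --list-devices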


6. llama.cpp Toolbox Containers

Pre-built containers for Strix Halo (gfx1151) are maintained at kyuz0/amd-strix-halo-toolboxes, compiled specifically for gfx1151 and rebuilt automatically when llama.cpp master advances.

Available Images

Short name    Container name      Backend             Use for
radv          llama-vulkan-radv   Vulkan RADV         Default; most models
rocm          llama-rocm-7.2      ROCm 7.2            Nemotron (Mamba-2 SSM ops)
rocm-pr21344  llama-rocm-pr21344  ROCm + gfx1151 MMQ  Qwen3.6 MoE models

The rocm-pr21344 image includes custom gfx1151 MMQ tuning that gives ~40% faster prefill on Qwen MoE architectures — important for agentic workloads where prompt processing dominates.

ROCm container isolation: ROCm/HIP retains the GPU memory pool within a container namespace after a process exits, unlike Vulkan/RADV which releases memory immediately. If you run multiple ROCm models via llama-swap, each model needs its own dedicated container. Create them from the same image: llama-rocm-pr21344, llama-rocm-pr21344-b, etc.
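For example, a second dedicated container created from the same image:

toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344 llama-rocm-pr21344-b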

Create Containers

toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv llama-vulkan-radv
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2 llama-rocm-7.2
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344 llama-rocm-pr21344

Verify GPU Detection

toolbox run -c llama-vulkan-radv llama-cli --list-devices
# ggml_vulkan: Found 1 Vulkan devices:
# Vulkan0: Radeon 8060S Graphics (RADV GFX1151) (129024 MiB, 128858 MiB free)

Keeping Containers Updated

#!/bin/bash
# update-toolboxes — update one or all toolbox containers
# Usage: update-toolboxes [radv|rocm|rocm-pr21344|all]

declare -A IMAGES NAMES
IMAGES[radv]="docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv"
NAMES[radv]="llama-vulkan-radv"
IMAGES[rocm]="docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2"
NAMES[rocm]="llama-rocm-7.2"
IMAGES[rocm-pr21344]="docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344"
NAMES[rocm-pr21344]="llama-rocm-pr21344"

# Default to all targets; "all" is also accepted explicitly.
keys=("$@")
[[ ${#keys[@]} -eq 0 || "${keys[0]}" == "all" ]] && keys=(radv rocm rocm-pr21344)

for key in "${keys[@]}"; do
    image="${IMAGES[$key]}"
    name="${NAMES[$key]}"
    [[ -z "$image" ]] && { echo "Unknown target: $key (use radv, rocm, rocm-pr21344, or all)" >&2; exit 1; }
    echo "==> Updating $name"
    podman pull "$image"
    toolbox rm -f "$name" 2>/dev/null || podman rm -f "$name"
    toolbox create --image "$image" "$name"
    echo "    Done: $(toolbox run -c "$name" llama-server --version 2>&1 | grep '^version:')"
done

7. ROCm Environment Variables

The kyuz0 containers set HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0 automatically. For any ROCm containers you launch manually, set these:

Variable                  Value   Purpose
HSA_OVERRIDE_GFX_VERSION  11.5.1  Identifies the GPU as gfx1151 to ROCm
HSA_ENABLE_SDMA           0       Disables SDMA — prevents APU memory crashes
GPU_MAX_HEAP_SIZE         100     Allows ROCm to use 100% of the reported GPU heap
GPU_MAX_ALLOC_PERCENT     100     Allows single allocations up to 100% of GPU memory
GPU_SINGLE_ALLOC_PERCENT  100     Allows single buffers to use all available GPU memory
GPU_FORCE_64BIT_PTR       1       Forces 64-bit GPU pointers — required for >4 GB allocations
ROCBLAS_USE_HIPBLASLT     1       Uses hipBLASLt for matrix multiplication
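For example, in the shell that launches llama-server inside a manually started ROCm container:

export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
export GPU_MAX_HEAP_SIZE=100
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
export GPU_FORCE_64BIT_PTR=1
export ROCBLAS_USE_HIPBLASLT=1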

8. Critical llama-server Flags

Two flags are mandatory on Strix Halo — do not omit them:

Flag       Why
-fa 1      Flash attention — without it, inference crashes on this hardware
--no-mmap  Memory mapping causes silent corruption or hangs on Strix Halo unified memory

Recommended flags for production use:

Flag                 Purpose
-ngl 999             Offload all layers to the GPU
-ctk q8_0 -ctv q8_0  Quantize the KV cache to Q8_0 — halves cache memory with minimal quality loss
--kv-unified         Single shared KV buffer across parallel slots
--jinja              Enable Jinja2 chat templates (required for tool calling)
--metrics            Expose Prometheus metrics at /metrics
--parallel N         Number of concurrent request slots (3 is a good default)
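Putting the two tables together, a typical manual invocation looks like this (the model path is a placeholder):

toolbox run -c llama-vulkan-radv llama-server \
    -m ~/models/<model>.gguf \
    -ngl 999 -fa 1 --no-mmap \
    -ctk q8_0 -ctv q8_0 --kv-unified \
    --parallel 3 --jinja --metrics \
    --host 0.0.0.0 --port 8080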

Backend Selection

  • Vulkan RADV: best for most models. ~20% faster token generation than ROCm for standard transformer/MoE models. Releases GPU memory immediately on process exit.
  • ROCm (standard): use for Mamba-2 SSM models (Nemotron family). ROCm runs SSM scan and conv ops on-GPU; Vulkan falls back to CPU.
  • ROCm pr21344: use for Qwen3.6 MoE models. Custom gfx1151 MMQ tuning gives ~40% faster prefill for long agentic contexts.

Rule of thumb: if the model is purely transformer/MoE (no SSM layers), Vulkan RADV is faster for generation-dominated workloads. Use ROCm when there are SSM ops or when prefill speed matters more than generation speed.


9. llama-swap: Multi-Model Router

llama-swap is a lightweight Go proxy that listens on a single port and hot-swaps llama-server instances on demand. Clients address models by name in the model field; llama-swap loads the right model and evicts the previous one.
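In practice that means every model is served from one endpoint, and the model field selects which one gets loaded — here using a model name defined in the config later in this section:

curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.6-35b", "messages": [{"role":"user","content":"Hi"}]}'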

Install

mkdir -p ~/.local/bin
curl -L https://github.com/mostlygeek/llama-swap/releases/latest/download/llama-swap-linux-amd64 \
  -o ~/.local/bin/llama-swap
chmod +x ~/.local/bin/llama-swap

Launch Script

Save as ~/.local/bin/llama-server-launch:

#!/bin/bash
# Starts llama-server inside the appropriate toolbox container.
# Reads model config from a .conf file passed via --conf.

CONF=""
PORT=8080

while [[ "$#" -gt 0 ]]; do
    case "$1" in
        --conf) CONF="$2"; shift 2 ;;
        --port) PORT="$2"; shift 2 ;;
        *) echo "Unknown argument: $1" >&2; exit 1 ;;
    esac
done

_source_conf() {
    local line key value
    while IFS= read -r line || [[ -n "$line" ]]; do
        [[ "$line" =~ ^[[:space:]]*# ]] && continue
        [[ "$line" =~ ^[[:space:]]*$ ]] && continue
        [[ "$line" =~ ^MODEL_NAME= ]] && continue
        key="${line%%=*}"
        value="${line#*=}"
        export "$key"="$value"
    done < "$1"
}

[[ -z "$CONF" || ! -f "$CONF" ]] && { echo "Missing or unreadable --conf file: $CONF" >&2; exit 1; }
_source_conf "$CONF"

CONTAINER="${TOOLBOX_CONTAINER:-llama-vulkan-radv}"
ENV_PREFIX=""
[ -n "$ROCBLAS_USE_HIPBLASLT" ] && ENV_PREFIX="env ROCBLAS_USE_HIPBLASLT=$ROCBLAS_USE_HIPBLASLT"

eval "extra_flags=($MODEL_EXTRA_FLAGS)"

exec /usr/bin/toolbox run -c "$CONTAINER" \
    $ENV_PREFIX \
    llama-server \
    -m "$MODEL_PATH" \
    -ngl 999 -fa 1 --no-mmap \
    --parallel "$MODEL_PARALLEL" -c "$MODEL_CTX" \
    -ctk q8_0 -ctv q8_0 --kv-unified \
    --host 0.0.0.0 --port "$PORT" \
    --jinja --metrics \
    "${extra_flags[@]}"
chmod +x ~/.local/bin/llama-server-launch

Model Config Files

Each model gets a .conf file at ~/.config/llama-server/models/<name>.conf.

Vulkan RADV example (the heredocs below are unquoted so $USER expands to your username when the file is written):

mkdir -p ~/.config/llama-server/models

cat > ~/.config/llama-server/models/gemma-4-26b-a4b.conf << EOF
MODEL_NAME=Gemma-4-26B-A4B (UD-Q8_K_XL)
MODEL_PATH=/home/$USER/models/gemma-4-26b-a4b/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf
MODEL_PARALLEL=3
MODEL_CTX=262144
MODEL_EXTRA_FLAGS=--reasoning-budget 512 --temp 1.0 --top-p 0.95 --top-k 64 --chat-template-file /home/$USER/models/gemma-4-26b-a4b/chat_template.jinja
EOF

ROCm example:

cat > ~/.config/llama-server/models/qwen3.6-35b.conf << EOF
MODEL_NAME=Qwen3.6-35B-A3B (Q8_0)
MODEL_PATH=/home/$USER/models/qwen3.6-35b-a3b/Q8_0/Qwen3.6-35B-A3B-Q8_0.gguf
MODEL_PARALLEL=3
MODEL_CTX=262144
TOOLBOX_CONTAINER=llama-rocm-pr21344
MODEL_EXTRA_FLAGS=--reasoning-format deepseek --reasoning-budget 2048 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --chat-template-kwargs '{"preserve_thinking":true}'
EOF

llama-swap Config

Create ~/.config/llama-swap/config.yaml, substituting your actual home directory for /home/$USER ($USER is not expanded inside this file):

healthCheckTimeout: 300
logLevel: info
logToStdout: proxy
sendLoadingState: true
startPort: 8100

macros:
  launch: /home/$USER/.local/bin/llama-server-launch
  confs:  /home/$USER/.config/llama-server/models

models:
  qwen3.6-35b:
    cmd: ${launch} --conf ${confs}/qwen3.6-35b.conf --port ${PORT}

  gemma-4-26b-a4b:
    cmd: ${launch} --conf ${confs}/gemma-4-26b-a4b.conf --port ${PORT}

Memory budget: on 128 GB Strix Halo, ~115 GB is the safe ceiling for model weights plus KV cache combined. A 35B MoE model with Q8_0 KV cache, 262K context, and 3 parallel slots uses ~43 GB total.
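To see where a number like that comes from, here is a rough KV-cache estimate with illustrative architecture numbers (not taken from any specific GGUF): 48 layers, 8 KV heads, head dimension 128, a 262144-token context, and a Q8_0 cache at ~1.0625 bytes per element including scales:

echo "scale=1; 2 * 48 * 8 * 128 * 262144 * 1.0625 / 1024^3" | bc
# → ~25.5 GiB for K and V combined (shared across slots with --kv-unified)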

llama-swap systemd Service

mkdir -p ~/.config/systemd/user

cat > ~/.config/systemd/user/llama-swap.service << 'EOF'
[Unit]
Description=llama-swap model router for local LLM inference
After=network.target

[Service]
Type=simple
ExecStart=%h/.local/bin/llama-swap --config %h/.config/llama-swap/config.yaml
Restart=on-failure
RestartSec=10

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now llama-swap

10. Firewall

sudo firewall-cmd --permanent --add-port=8080/tcp   # llama-swap API
sudo firewall-cmd --permanent --add-port=3000/tcp   # Open WebUI
sudo firewall-cmd --reload

11. Downloading Models

uv tool install huggingface-hub --with hf_transfer

Save as ~/bin/hf-aria2 for fast parallel downloads with resume support:

mkdir -p ~/bin
cat > ~/bin/hf-aria2 << 'SCRIPT'
#!/bin/bash
# Download GGUF files from HuggingFace using aria2c (16 parallel connections)
# Usage: hf-aria2 <repo> <pattern> <local-dir>

set -euo pipefail
REPO="${1:-}"; PATTERN="${2:-}"; LOCAL_DIR="${3:-}"
HF_PYTHON="$HOME/.local/share/uv/tools/huggingface-hub/bin/python"

[[ -z "$REPO" || -z "$PATTERN" || -z "$LOCAL_DIR" ]] && {
    echo "Usage: hf-aria2 <repo> <pattern> <local-dir>"; exit 1; }

FILES=$("$HF_PYTHON" -c "
from huggingface_hub import list_repo_files
for f in list_repo_files('$REPO'):
    if '$PATTERN' in f and f.endswith('.gguf'):
        print(f)
" 2>/dev/null)

[[ -z "$FILES" ]] && { echo "No files found matching '$PATTERN' in $REPO"; exit 1; }

BASE_URL="https://huggingface.co/${REPO}/resolve/main"
URL_FILE=$(mktemp)
while IFS= read -r f; do
    echo "${BASE_URL}/${f}" >> "$URL_FILE"
    echo "  out=$(basename "$f")" >> "$URL_FILE"
done <<< "$FILES"

mkdir -p "$LOCAL_DIR"
aria2c -x 16 -s 16 --continue=true --auto-file-renaming=false \
    -d "$LOCAL_DIR" -i "$URL_FILE"
rm -f "$URL_FILE"
echo "Done:"; ls -lh "$LOCAL_DIR"/*.gguf 2>/dev/null
SCRIPT
chmod +x ~/bin/hf-aria2
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc

Usage:

hf-aria2 unsloth/gemma-4-26B-A4B-it-GGUF UD-Q8_K_XL ~/models/gemma-4-26b-a4b/
hf-aria2 Qwen/Qwen3.6-35B-A3B-GGUF Q8_0 ~/models/qwen3.6-35b-a3b/Q8_0/

Downloads resume automatically if interrupted.


12. Open WebUI (Optional)

A chat frontend via rootless Podman with host networking (PORT=3000 keeps it clear of llama-swap on 8080):

mkdir -p ~/.config/containers/systemd

cat > ~/.config/containers/systemd/openwebui.container << 'EOF'
[Unit]
Description=Open WebUI - LLM Chat Frontend

[Container]
Image=ghcr.io/open-webui/open-webui:main
Network=host
Volume=%h/.local/share/open-webui:/app/backend/data
Environment=PORT=3000
Environment=OLLAMA_BASE_URL=
Environment=OPENAI_API_BASE_URL=http://localhost:8080/v1
Environment=OPENAI_API_KEY=none

[Service]
Restart=always

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now openwebui

Access at http://<hostname>:3000. For HTTPS via Tailscale (enables microphone/audio):

sudo tailscale serve --bg 3000

13. Verification Checklist

# GPU memory pool
echo "scale=1; $(cat /sys/class/drm/card*/device/mem_info_gtt_total | head -1) / 1024^3" | bc
# → 124.0

# GPU power governor
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# → high

# Vulkan GPU detection inside container
toolbox run -c llama-vulkan-radv llama-cli --list-devices 2>&1 | grep -E "Found|Vulkan0"

# llama-swap running
systemctl --user status llama-swap
curl -s http://localhost:8080/v1/models | python3 -m json.tool

# Model responds
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR_MODEL", "messages": [{"role":"user","content":"Hi"}], "max_tokens": 10}'

Common Issues

  • crun: sd-bus call: Access denied when running toolbox from SSH or a service — set cgroup_manager = "cgroupfs" in ~/.config/containers/containers.conf (step 5).
  • GPU detected but inference crashes immediately — make sure you are going through toolbox run; if using podman run directly, add --userns=host.
  • ROCm reports only ~61 GB of GPU memory — the amdgpu.no_system_mem_limit=1 kernel parameter is missing; re-check step 3.
  • Model loads but runs at ~50% of expected throughput — the GPU power governor is on auto; fix with step 4.
  • Second ROCm model fails to allocate GPU memory after eviction — ROCm/HIP retains the memory pool within a container namespace after process exit; each ROCm model needs its own dedicated container (step 6).
  • Crashes or silent corruption with no obvious error — check that both -fa 1 and --no-mmap are present in your llama-server invocation; they are non-negotiable on this hardware.