llama.cpp + MiniMax M2.5 (Unsloth GGUF)
First, download the model files:
hf download unsloth/MiniMax-M2.5-GGUF \
--include "UD-Q3_K_XL/*"
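The download lands in the Hugging Face hub cache under a per-download snapshot hash. A quick way to locate the shards (the directory name below follows the standard `models--<org>--<repo>` cache layout; the path will be empty until the download completes):

```shell
# Hub cache layout: models--<org>--<repo>/snapshots/<hash>/<subdir>/<file>
MODEL_DIR=~/.cache/huggingface/hub/models--unsloth--MiniMax-M2.5-GGUF
# List all downloaded GGUF shards, if any
find "$MODEL_DIR" -name '*.gguf' 2>/dev/null | sort
```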
Then launch the server, pointing --model at the first shard (llama.cpp picks up the remaining split-GGUF shards automatically):

llama-server \
--model ~/.cache/huggingface/hub/models--unsloth--MiniMax-M2.5-GGUF/snapshots/7c50dca0e5592483ad308ecffc876aecac725660/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf \
--alias "unsloth/MiniMax-M2.5" \
--prio 3 \
--temp 0.3 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--ctx-size 32768 \
--port 8045 \
--flash-attn on \
--n-gpu-layers auto \
--threads 12
On startup, llama.cpp reports how the layers were fit to available GPU memory:

Vulkan0 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition): 63 layers, 12412 MiB used, 84133 MiB free
Vulkan0 (NVIDIA RTX PRO 6000 Blackwell Workstation Edition): 63 layers ( 8 overflowing), 95122 MiB used, 1423 MiB free
llama_params_fit: successfully fit params to free device memory
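Once the server is up, you can sanity-check it through llama-server's OpenAI-compatible chat endpoint. The port and model name below are taken from the --port and --alias flags above; the prompt is just an example:

```shell
# Send a test chat completion to the local llama-server instance
curl http://localhost:8045/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "unsloth/MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```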