Ask HN: I was curious why MTP affects PP TPS in llama.cpp. My PoC recovers it?

4 points | by i_am_rocoe 2 days ago

1 comment

i_am_rocoe 2 days ago
If you want to run the PoC locally to replicate:
Clone the masked-nextn-skip-catchup branch:
https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...
Run llama-server with at least --n-gpu-layers 99 --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 --no-cache-prompt.
I used Qwen3.6-35B-A3B-UD-Q2_K_XL MTP:
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF?show...
I tested on an L40S (using Modal).
About the parameters:
--parallel 1, see Code review findings:
https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...
--no-cache-prompt, see Known limitations:
https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...
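The steps above could be strung together roughly like this. It is only a sketch under a few assumptions: the clone URL is inferred from the Codeberg link, the build uses the usual llama.cpp CMake steps with CUDA enabled, and MODEL is a placeholder for wherever the downloaded GGUF ends up. By default it only prints the commands; set RUN=1 to actually execute them.

```shell
set -eu

# Assumption: clone URL inferred from the Codeberg link; branch name as given in the comment.
REPO_URL="https://codeberg.org/rocoe/llama.cpp.git"
BRANCH="masked-nextn-skip-catchup"

# Assumption: placeholder path, point it at the downloaded MTP GGUF file.
MODEL="${MODEL:-./model.gguf}"

# Flags exactly as listed in the comment ("at least" these).
SERVER_FLAGS="--n-gpu-layers 99 --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 --no-cache-prompt"

if [ "${RUN:-0}" = "1" ]; then
  git clone --branch "$BRANCH" --single-branch "$REPO_URL" llama.cpp
  # Assumption: standard llama.cpp CMake build with the CUDA backend (NVIDIA GPU).
  cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
  cmake --build llama.cpp/build --config Release
  # SERVER_FLAGS is left unquoted on purpose so it splits into separate arguments.
  llama.cpp/build/bin/llama-server -m "$MODEL" $SERVER_FLAGS
else
  # Dry run: show what would be executed.
  echo "git clone --branch $BRANCH --single-branch $REPO_URL llama.cpp"
  echo "llama-server -m $MODEL $SERVER_FLAGS"
fi
```

For example, `RUN=1 MODEL=/path/to/model.gguf sh replicate.sh` would clone, build, and start the server, where `replicate.sh` is just a hypothetical name for the saved script.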