Ternative – C++/CUDA inference engine for ternary LLMs with runtime LoRA

2 points | by michelangeloro 8 hours ago

1 comments

  I built this because llama.cpp crashes with a type-36 error on
  BitNet I2_S weights, and bitnet.cpp has no LoRA support.

  The core problem: LoRA deltas (~1e-5 magnitude) are erased when
   you merge them into ternary weights (~1.2 scale) and
  re-quantize. The fine-tuning silently disappears. The only
  solution is to never merge — keep the adapter separate, apply
  it at full F32 precision at load time, then cast to F16.

  ternative loads a base I2_S GGUF + a LoRA adapter GGUF, merges
  in F32, and serves via OpenAI-compatible HTTP. Runs all 30
  layers of a 2B model on a 4GB GPU at ~6-7 tok/s.

  I used it to train and benchmark Orchid 1.0
  (https://huggingface.co/MicheRomChis/orchid-1.0) — a BitNet
  fine-tune aligned with ORPO. ARC-Challenge: 56.0% (+6.1pp over
  base). Technical paper: https://huggingface.co/MicheRomChis/orc
  hid-1.0/blob/main/orchid-1-0-technical-paper.pdf