Show and Tell: Project Zero - C99 BitNet b1.58 inference, 1.25x-1.83x over bitnet.cpp #579
shifulegend
started this conversation in
Show and tell
Hi all - sharing performance data for Project Zero, a from-scratch C99 inference engine for BitNet b1.58.
What it is:
BitNet b1.58 performance:
AMD 9950X - benchmarked independently by @tommyyliu in #569, who ran Project Zero on his machine via Claude:
Intel Xeon 8259CL (AVX-512BW, no VNNI): T=1 gives 36.25 tok/s vs bitnet.cpp ~20 tok/s = 1.83x. Full Phoronix Test Suite result: https://openbenchmarking.org/result/2606207-SHIF-PROJECT42
K-6 LUT kernel:
The LUT numbers use 5-trit packing (5 ternary weights per byte) + AVX-512BW vpermt2w for register-based lookups - @tommyyliu's approach from lut_mm, integrated into PZ. Kernel-level on the no-VNNI Xeon: 74.6 vs 2.5 Gop/s = 29.3x. End-to-end gain is Amdahl-bounded but measurable: 36.25 tok/s vs an expected 40+ tok/s once the K-6 branch is merged to master.
Honest limitation: Dense Q4_K GGUF runs at ~1.9 tok/s vs llama.cpp's ~14 tok/s. Kernel investment has been on the BitNet ternary path so far.
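For readers unfamiliar with 5-trit packing, here is a minimal C99 sketch of the idea: five ternary weights fit in one byte because 3^5 = 243 <= 256. The function names pack5_trits and unpack5_trits and the digit order (weight 0 in the least significant base-3 digit) are assumptions made for illustration, not necessarily the layout Project Zero or lut_mm uses, and the vpermt2w lookup step is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack 5 ternary weights (each -1, 0, or +1) into one byte.
   Each weight is shifted to a base-3 digit in {0, 1, 2}; the five digits form
   a value in [0, 242], which fits in a uint8_t because 3^5 = 243. */
static uint8_t pack5_trits(const int8_t w[5])
{
    unsigned code = 0;
    /* Horner evaluation from the most significant digit (w[4]) down to w[0]. */
    for (int i = 4; i >= 0; --i) {
        assert(w[i] >= -1 && w[i] <= 1);
        code = code * 3u + (unsigned)(w[i] + 1);
    }
    return (uint8_t)code;
}

/* Hypothetical helper: inverse of pack5_trits, recovering the 5 weights. */
static void unpack5_trits(uint8_t code, int8_t w[5])
{
    for (int i = 0; i < 5; ++i) {
        w[i] = (int8_t)((int)(code % 3u) - 1); /* digit {0,1,2} -> weight {-1,0,+1} */
        code = (uint8_t)(code / 3u);
    }
}
```

With this layout the weights {-1, 0, 1, 1, -1} pack to the byte value 75, and unpacking that byte returns the same five weights. A LUT kernel would typically use the packed byte (or pieces of it) as a table index rather than unpacking per weight; the scalar unpack here is only for checking the encoding.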
Happy to answer questions or run benchmarks on specific hardware.