Show and Tell: Project Zero - C99 BitNet b1.58 inference, 1.25x-1.83x over bitnet.cpp #579

shifulegend · 2026-06-29T00:51:21Z

Hi all - sharing performance data for Project Zero, a from-scratch C99 inference engine for BitNet b1.58.

What it is:

- C99, zero runtime dependencies (builds with GCC + make; no Python, no BLAS)
- Runs BitNet b1.58-2B-4T and standard GGUF models in the same binary
- OpenAI-compatible HTTP API with SSE streaming
- Pre-built x86-64 Linux binary: releases

BitNet b1.58 performance:

AMD 9950X - benchmarked independently by @tommyyliu in #569, who ran Project Zero on his machine via Claude:

Threads	PZ (VNNI path)	PZ + K-6 LUT	Speedup
1	34.15 tok/s	42.66 tok/s	1.25x
2	47.43 tok/s	51.11 tok/s	1.08x
4	48.09 tok/s	51.93 tok/s	1.08x
16	52.96 tok/s	56.66 tok/s	1.07x

Intel Xeon 8259CL (AVX-512BW, no VNNI): at 1 thread, PZ gives 36.25 tok/s vs ~20 tok/s for bitnet.cpp (1.83x). Full Phoronix Test Suite result: https://openbenchmarking.org/result/2606207-SHIF-PROJECT42

K-6 LUT kernel:
The LUT numbers use 5-trit packing (5 ternary weights per byte) plus AVX-512BW vpermt2w for register-based lookups. This is @tommyyliu's approach from lut_mm, integrated into PZ. At the kernel level on the no-VNNI Xeon: 74.6 vs 2.5 Gop/s = 29.3x. The end-to-end gain is Amdahl-bounded but measurable: 36.25 tok/s now vs an expected 40+ tok/s once the K-6 branch is merged to master.
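
For readers unfamiliar with 5-trit packing, here is a minimal scalar sketch of the idea: five weights in {-1, 0, +1} fit in one byte because 3^5 = 243 <= 256. The function names (`pack5_trits`, `unpack5_trits`, `dot_packed`) and the digit order (first weight in the lowest base-3 digit) are assumptions made for illustration; this is not the actual layout used by PZ or lut_mm, and it does not show the vpermt2w lookup itself.

```c
#include <stdint.h>

/* Pack five ternary weights, each in {-1, 0, +1}, into one byte.
 * Each weight is shifted to a digit in {0, 1, 2} and the five digits are
 * read as a base-3 number, so the result lies in 0..242 (3^5 = 243).
 * Digit order is an assumption: w[0] is the lowest digit. */
static uint8_t pack5_trits(const int8_t w[5])
{
    unsigned v = 0;
    for (int i = 4; i >= 0; --i)
        v = v * 3u + (unsigned)(w[i] + 1);
    return (uint8_t)v;
}

/* Inverse of pack5_trits: recover the five weights from one byte. */
static void unpack5_trits(uint8_t b, int8_t w[5])
{
    unsigned v = b;
    for (int i = 0; i < 5; ++i) {
        w[i] = (int8_t)((int)(v % 3u) - 1);
        v /= 3u;
    }
}

/* Scalar reference dot product between packed ternary weights and int8
 * activations. x must hold 5 * n_groups values. A LUT kernel would replace
 * the unpack-and-multiply step with table lookups indexed by the byte. */
static int32_t dot_packed(const uint8_t *packed, const int8_t *x, int n_groups)
{
    int32_t acc = 0;
    for (int g = 0; g < n_groups; ++g) {
        int8_t w[5];
        unpack5_trits(packed[g], w);
        for (int i = 0; i < 5; ++i)
            acc += (int32_t)w[i] * (int32_t)x[5 * g + i];
    }
    return acc;
}
```

Packing this way stores 1.6 bits per weight, compared with 2 bits for a plain 2-bit encoding, and every byte value doubles as an index into a 243-entry table, which is what makes a lookup-based kernel possible.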

Honest limitation: dense Q4_K GGUF models run at ~1.9 tok/s vs ~14 tok/s for llama.cpp. Kernel work so far has gone into the BitNet ternary path.

Happy to answer questions or run benchmarks on specific hardware.
