JANG
Jang Adaptive N-bit Grading
On MLX, 2-bit models can’t even form a sentence. JANG makes them coherent.
Try running any model at 2 or 3 bits on MLX. Ask it “What is 2+2?” and you’ll get “2+2? 2+2? 2+2?” or “1000000000” or just nothing at all. The model is too compressed to think straight — it loops, spits out numbers, or goes silent. This happens with every model on MLX at low bits. It’s not the model’s fault — it’s how MLX compresses it.
JANG compresses the same model to the same size, runs at MLX speed, but keeps it coherent. Ask the same question and you get “The answer is 4.” It works because JANG protects the small part of the model that controls whether output makes sense, while compressing everything else just as aggressively. Same speed. Same size. Actually works.
Attention is 12% of parameters but 100% of coherence
Uniform quantization applies the same bit width everywhere. When bits get low, attention layers break first — scores go flat, positional encoding degrades, and output collapses into repetition loops or number garbage.
JANG protects what matters. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The cost: ~0.4 extra bits on average. The benefit: correct output where uniform produces garbage.
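The ~0.4-bit figure is just a weighted average over parameter fractions. A quick sketch of the arithmetic (the 12% attention share is from above; the exact split varies per model):

```python
# Back-of-envelope: average bit width of a mixed-precision model.
# The 12%/88% split is illustrative, based on the attention share quoted above.

def effective_bits(components):
    """components: list of (param_fraction, bits) pairs."""
    return sum(frac * bits for frac, bits in components)

uniform_3bit = effective_bits([(1.0, 3)])
jang_style   = effective_bits([(0.12, 6),   # attention protected at 6-bit
                               (0.88, 3)])  # MLP and everything else at 3-bit

print(round(uniform_3bit, 2))  # 3.0
print(round(jang_style, 2))    # 3.36: ~0.36 extra bits buys protected attention
```

Raising 12% of the parameters from 3 to 6 bits costs 0.12 × 3 ≈ 0.36 extra bits on average, which is where the ~0.4 number comes from.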
59 wins. 0 losses. Every test shown.
All tests on Apple M4 Max (107 GB unified memory). Same model, same tokenizer, same prompt. Affine quantization, group_size=64. 45 experiments documented.
At 2.5 effective bits, JANG_2S gets 6/6 correct while uniform gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.
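A per-layer bit assignment for this hybrid model could look like the following sketch. The layer indices and the classification rule are hypothetical illustrations, not JANG's actual code:

```python
# Hypothetical per-layer bit assignment in the style of JANG_2S on a hybrid
# model: the 8 full-attention layers are protected at 6-bit, while linear
# attention and all MLP weights drop to 2-bit. The set of full-attention
# layer positions below is an assumed example, not the real model layout.

FULL_ATTN_LAYERS = {3, 7, 11, 15, 19, 23, 27, 31}

def bits_for(layer_idx, kind):
    if kind == "attention" and layer_idx in FULL_ATTN_LAYERS:
        return 6   # protect the layers that keep output coherent
    return 2       # compress everything else aggressively

print(bits_for(3, "attention"))   # 6 (full attention, protected)
print(bits_for(4, "attention"))   # 2 (linear attention)
print(bits_for(3, "mlp"))         # 2 (MLP is always compressed)
```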
Per-model results
[Side-by-side generation demos, one card per model, comparing JANG profiles against uniform quantization at matched effective bit widths: JANG_3M (3.4 bits) and JANG_3L (3.6 bits) vs uniform 3-bit (3.5 bits), JANG_4S (4.1 bits) vs uniform 4-bit (4.5 bits), and JANG_2S (2.5 bits) vs uniform 2-bit (2.5 bits).]
More wins across experiments
[Additional comparison cards: JANG_4S (4.1 bits) and JANG_4L (4.5 bits) vs uniform 4-bit (4.5 bits), JANG_3M (3.4 bits) and JANG_3L (3.6 bits) vs uniform 3-bit (3.5 bits), JANG_2S (2.5 bits) and JANG_2M (2.7 bits) vs uniform 2-bit (2.5 bits), and JANG at 4.12 bits vs uniform 4-bit at 4.0 bits.]
JANG at 3.37 bits beats uniform 4-bit
Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better
Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64
JANG at 3.37 bits (MSE 11.10) beats uniform at 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.
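The metric itself is straightforward to reproduce: run the same prompt through a bf16 reference and through the quantized model, then take the mean squared difference of the output logits. A toy sketch with the model calls omitted; only the metric is shown:

```python
# Logit MSE against a full-precision reference. In practice ref_logits and
# quant_logits are vocabulary-sized tensors from the two model runs; the
# short lists below are placeholders.

def logit_mse(ref_logits, quant_logits):
    assert len(ref_logits) == len(quant_logits)
    return sum((r - q) ** 2 for r, q in zip(ref_logits, quant_logits)) / len(ref_logits)

ref   = [2.0, -1.0, 0.5]
quant = [1.5, -1.2, 0.9]
print(round(logit_mse(ref, quant), 4))  # 0.15
```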
All models tested
| Model | Params | Architecture | Wins | Degradation Point |
|---|---|---|---|---|
| Mistral-7B | 7B | Mistral GQA 4:1, sliding window | 13 | Uniform 3b → number garbage, 4b → loops |
| TinyLlama-1.1B | 1.1B | Llama GQA 8:1 | 11 | Uniform 4b → topic derail |
| SmolLM2-1.7B | 1.7B | Llama MHA | 11 | Uniform 3b → number spam |
| Phi-2 | 2.7B | Phi MHA, GELU MLP | 9 | Uniform 2b → empty output |
| Qwen2.5-7B | 7B | Qwen GQA 4:1 | 9 | Uniform 3b → repetition loop |
| Qwen2.5-3B | 3B | Qwen GQA 8:1 | 6 | Uniform 4b → echo/loop |
| Qwen3.5-4B | 4B | Hybrid: 24 linear + 8 full attn | 6 | Uniform 2b → 0/6 correct |
All tests: Apple M4 Max · 107 GB unified memory · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 45 experiments · 8 models · Qwen3.5-9B downloaded, testing pending
JANG_{bits}{size}
11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).
| Profile | MLP | Attention | Embed | lm_head | Avg Bits |
|---|---|---|---|---|---|
| JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2 |
| JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5 |
| JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7 |
| JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9 |
| JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1 |
| JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4 |
| JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6 |
| JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1 |
| JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2 |
| JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5 |
| JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2 |
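The table above is just data. A hypothetical converter could look up per-component bit widths like this (the dict layout is illustrative, and only a few profiles are shown):

```python
# A few JANG profiles from the table, as (MLP, attention, embed, lm_head)
# bit widths. The data structure is an illustration, not JANG's actual code.

PROFILES = {
    #           MLP  attn  embed  lm_head
    "JANG_2S": (2,   6,    4,     6),
    "JANG_3M": (3,   6,    4,     6),
    "JANG_4L": (4,   8,    4,     8),
    "JANG_6M": (6,   8,    6,     8),
}

def profile_bits(name):
    mlp, attn, embed, head = PROFILES[name]
    return {"mlp": mlp, "attention": attn, "embed": embed, "lm_head": head}

print(profile_bits("JANG_3M"))  # {'mlp': 3, 'attention': 6, 'embed': 4, 'lm_head': 6}
```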
Swift + Metal inference engine
14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.
Dequant + GEMV
Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
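For reference, here is the unfused math the kernel implements, as a pure-Python sketch with a toy group size: each weight is decoded from its group's (scale, bias) pair and the dot product is accumulated in the same pass. The real kernel works on packed 2-8 bit words with group_size=64:

```python
# One output element of a dequant+GEMV: y = sum_j (q[j]*scale[g] + bias[g]) * x[j],
# where g = j // group_size. Toy group_size=2 for readability.

def dequant_gemv_row(q_row, scales, biases, x, group_size):
    acc = 0.0
    for j, q in enumerate(q_row):
        g = j // group_size
        acc += (q * scales[g] + biases[g]) * x[j]
    return acc

q_row  = [0, 1, 2, 3]          # 2-bit codes for one weight row
scales = [0.5, 0.5]            # one (scale, bias) pair per group
biases = [-0.75, -0.75]        # decoded weights: -0.75, -0.25, 0.25, 0.75
x      = [1.0, 1.0, 1.0, 1.0]
print(dequant_gemv_row(q_row, scales, biases, x, group_size=2))  # 0.0
```

Fusing this means the decoded weights never touch memory: each GPU thread unpacks codes, applies scale and bias, and multiplies in registers.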
Dequant + GEMM
Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.
GQA Attention
Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.
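The core of grouped-query attention is that several query heads share one key/value head. For a 4:1 model like the Mistral-7B row above (32 query heads, 8 KV heads), query head h reads from KV head h // 4. An illustrative mapping:

```python
# Which KV head a given query head attends through under GQA.

def kv_head_for(q_head, n_q_heads, n_kv_heads):
    group = n_q_heads // n_kv_heads      # query heads per KV head
    return q_head // group

# 32 query heads sharing 8 KV heads (4:1), first eight heads:
print([kv_head_for(h, 32, 8) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

MHA is the degenerate case (n_q_heads == n_kv_heads, identity mapping), which is why one kernel can serve both.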
RMSNorm + RoPE
Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.
SwiGLU
Fused SiLU activation + element-wise multiply for gated feed-forward networks.
Quantized Embedding
Direct embedding lookup from quantized weights. No full-table dequantization needed.
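The reason no full-table dequantization is needed: only the rows for tokens actually in the sequence get decoded. A sketch with per-row (scale, bias) pairs, which is a simplification of the real per-group layout:

```python
# Embedding lookup straight from quantized storage. Only the requested
# rows are dequantized; the rest of the vocabulary table stays packed.
# Per-row scale/bias is an assumption for this sketch.

def embed_lookup(q_table, scales, biases, token_ids):
    out = []
    for t in token_ids:
        row = [q * scales[t] + biases[t] for q in q_table[t]]
        out.append(row)
    return out

q_table = [[0, 3], [1, 2], [3, 0]]   # 2-bit codes, vocab of 3, dim 2
scales  = [1.0, 0.5, 2.0]
biases  = [0.0, 0.0, -3.0]
print(embed_lookup(q_table, scales, biases, [2]))  # [[3.0, -3.0]]
```

For a real vocabulary of 100K+ rows, decoding one row per token instead of the whole table is what keeps prefill memory flat.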
Convert any model
Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.
Supported architecture families: Llama, Qwen, Gemma, Phi, Mistral, and Mamba/SSM, plus MoE and hybrid models including Qwen 3.5.
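Of the three methods, RTN ("round to nearest") is the simplest to sketch: per group, derive scale and offset from the min/max range, then round each weight to the nearest code. This is a generic illustration of the technique, not JANG's converter code:

```python
# RTN affine quantization of one weight group: map [min, max] onto the
# integer code range [0, 2^bits - 1] and round to nearest.

def rtn_quantize_group(w, bits):
    levels = (1 << bits) - 1                 # e.g. 3 codes above zero for 2-bit
    lo, hi = min(w), max(w)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in w]
    return codes, scale, lo                  # w[i] ~= codes[i] * scale + lo

w = [-0.75, -0.25, 0.3, 0.75]
codes, scale, bias = rtn_quantize_group(w, bits=2)
print(codes)   # [0, 1, 2, 3]
print(scale)   # 0.5
```

MSE-optimal grid search refines (scale, bias) per group to minimize reconstruction error instead of taking min/max; GPTQ additionally uses second-order (Hessian) information to compensate rounding error across columns.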
Run bigger models on less RAM
JANG_3M saves 25% vs uniform 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.
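The memory math behind that claim, ignoring the small per-group scale/bias overhead. The 4.5-bit figure for uniform 4-bit is the effective width quoted in the comparisons above:

```python
# Weight memory ~= params * avg_bits / 8 bytes, expressed directly in GB.
# Per-group scale/bias overhead is ignored in this sketch.

def weight_gb(params_billion, avg_bits):
    return params_billion * avg_bits / 8

print(f"uniform 4-bit (4.5 eff. bits): {weight_gb(7, 4.5):.2f} GB")
print(f"JANG_3M       (3.4 eff. bits): {weight_gb(7, 3.4):.2f} GB")
```

For a 7B model that is roughly 3.9 GB vs 3.0 GB, i.e. (4.5 − 3.4) / 4.5 ≈ 24% less, in line with the ~25% quoted above.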