Open Source

JANG

Jang Adaptive N-bit Grading

On MLX, 2-bit models can’t even form a sentence. JANG makes them coherent.

Try running any model at 2 or 3 bits on MLX. Ask it “What is 2+2?” and you’ll get “2+2? 2+2? 2+2?” or “1000000000” or just nothing at all. The model is too compressed to think straight — it loops, spits out numbers, or goes silent. This happens with every model on MLX at low bits. It’s not the model’s fault — it’s how MLX compresses it.

JANG compresses the same model to the same size, runs at MLX speed, but keeps it coherent. Ask the same question and you get “The answer is 4.” It works because JANG protects the small part of the model that controls whether output makes sense, while compressing everything else just as aggressively. Same speed. Same size. Actually works.

Importance-aware bit allocation · 2-bit to 8-bit mixed precision · 14 custom Metal GPU kernels · Swift + Metal runtime · Per-block variable bit widths · Open source · Apache 2.0
59–0 · Wins vs uniform quantization
6/6 · Correct at 2.5 bits (vs 0/6)
14 · Custom Metal GPU kernels
7 · Models tested, 6 architectures
The Problem

Attention is 12% of parameters but 100% of coherence

Uniform quantization applies the same bit width everywhere. When bits get low, attention layers break first — scores go flat, positional encoding degrades, and output collapses into repetition loops or number garbage.

JANG protects what matters. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The cost: ~0.4 extra bits on average. The benefit: correct output where uniform produces garbage.

Attention: 8-bit (protected)
MLP: 2-bit (compressed)
Embed: 4-bit
lm_head: 6-bit

Result:
JANG_2M → 2.7 avg bits → coherent output
Uniform 3-bit → 3.0 avg bits → repetition loops
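As a sanity check on the arithmetic, the parameter-weighted average bit width can be computed in a few lines. This uses the ~12% attention share quoted above and, as a simplifying assumption, folds the rest of each transformer block into MLP (embed and lm_head are counted separately); `average_bits` is an illustrative helper, not part of the JANG tooling.

```python
# Parameter-weighted average bit width over transformer blocks.
# Assumes attention is ~12% of block parameters (figure quoted above)
# and the remaining ~88% is MLP -- an illustrative simplification.
BITS = {"attention": 8, "mlp": 2}            # JANG_2M-style allocation
FRACTION = {"attention": 0.12, "mlp": 0.88}

def average_bits(bits, fraction):
    return sum(bits[k] * fraction[k] for k in bits)

print(f"{average_bits(BITS, FRACTION):.2f}")  # 2.72 -- close to the ~2.7 above
```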
Empirical Results

59 wins. 0 losses. Every test shown.

All tests on Apple M4 Max (107 GB unified memory). Same model, same tokenizer, same prompt. Affine quantization, group_size=64. 45 experiments documented.

Per-model results

Mistral-7B-v0.3
Mistral · GQA 4:1 · JANG_3M · 3.4 bits · M4 Max
“What is photosynthesis?”
JANG_3M (3.4 bits)
“Photosynthesis is the process by which plants and some other organisms...”
Uniform 3-bit (3.5 bits)
10000000000000000000000000000...
FEWER bits, correct answer vs number garbage. 13 total wins on this model.
Qwen2.5-7B
Qwen · GQA 4:1 · JANG_3L · 3.6 bits · M4 Max
“What is 2+2?”
JANG_3L (3.6 bits)
“The answer is 4.”
Uniform 3-bit (3.5 bits)
Assistant Assistant Assistant Assistant Assistant...
Same size, correct answer vs infinite repetition loop. 9 total wins.
SmolLM2-1.7B
Llama · MHA · JANG_3M · 3.4 bits · M4 Max
“How many legs does a spider have?”
JANG_3M (3.4 bits)
“8. How many arms does a spider have? Answer: 8”
Uniform 3-bit (3.5 bits)
2 1/2 1/2 1/2 1/2 1/2 1/2 1/2
FEWER bits, correct answer (8) vs number spam. 11 total wins.
TinyLlama-1.1B
Llama · GQA 8:1 · JANG_4S · 4.1 bits · M4 Max
“What is the chemical formula for water?”
JANG_4S (4.1 bits)
“What is the chemical formula for water? Answers: 1. H...”
Uniform 4-bit (4.5 bits)
“What is the chemical formula for hydrogen peroxide?...”
9% smaller, stays on topic vs derailing to wrong question. 11 total wins.
Phi-2 (2.7B)
Phi · MHA · JANG_2S · 2.5 bits · M4 Max
“What is photosynthesis?”
JANG_2S (2.5 bits)
“Photosynthesis is the process by which plants use sunlight to con...”
Uniform 2-bit (2.5 bits)
(empty output)
SAME bits, correct answer vs completely empty output. 9 total wins.
Qwen2.5-3B
Qwen · GQA 8:1 · JANG_4S · 4.1 bits · M4 Max
“Translate 'thank you' to Spanish.”
JANG_4S (4.1 bits)
“Thank you in Spanish is 'gracias'.”
Uniform 4-bit (4.5 bits)
“Translate 'thank you' to Spanish.”
9% smaller, correct translation vs echoing the prompt. 6 total wins.

More wins across experiments

Mistral-7B — 4-bit Win
Mistral · GQA 4:1 · JANG_4S · 4.1 bits · M4 Max
“What is 2+2?”
JANG_4S (4.1 bits)
“The answer is 4. But what if...”
Uniform 4-bit (4.5 bits)
4. What is 2+2? 4. What is 2+2? 4...
9% smaller AND better — uniform loops even at 4-bit on Mistral.
Mistral-7B — 2-bit Win
Mistral · GQA 4:1 · JANG_2S · 2.5 bits · M4 Max
“Name three planets in our solar system.”
JANG_2S (2.5 bits)
“1. Jupiter 2. Mars 3. Saturn”
Uniform 2-bit (2.5 bits)
is a new planet, and it is a new planet...
SAME bits, correct list vs nonsense loop. Coherent at 2.5 bits.
Qwen2.5-3B — Creative Task
Qwen · GQA 8:1 · JANG_4L · 4.5 bits · M4 Max
“Write a haiku about the moon.”
JANG_4L (4.5 bits)
“The moon’s glow, a tranquil sight...”
Uniform 4-bit (4.5 bits)
ancient sky, lunar glow, ancient sky, lunar glow...
Same size — JANG writes a real poem, uniform loops the same phrase.
Qwen2.5-7B — Factual
Qwen · GQA 4:1 · JANG_3L · 3.6 bits · M4 Max
“Who wrote Romeo and Juliet?”
JANG_3L (3.6 bits)
“The play Romeo and Juliet was written by William Shakespeare”
Uniform 3-bit (3.5 bits)
Who wrote Romeo and Juliet?
Same size, correct answer vs echoing the prompt back.
Mistral-7B — Knowledge
Mistral · GQA 4:1 · JANG_3M · 3.4 bits · M4 Max
“Name a famous painting by Leonardo da Vinci.”
JANG_3M (3.4 bits)
“The Last Supper, The Mona Lisa, The Vitruvian Man...”
Uniform 3-bit (3.5 bits)
Name a famous painting by Michelangelo. Name a famous painting by Rap...
FEWER bits — correct artist vs hallucinating wrong artists.
Mistral-7B — Factual
Mistral · GQA 4:1 · JANG_3L · 3.6 bits · M4 Max
“How many legs does a spider have?”
JANG_3L (3.6 bits)
“Spiders have eight legs.”
Uniform 3-bit (3.5 bits)
TDM 10000000000000000000000...
Same size, clean answer vs garbage numbers.
Mistral-7B — Ultra-Low Bit
Mistral · GQA 4:1 · JANG_2M · 2.7 bits · M4 Max
“What is the largest ocean on Earth?”
JANG_2M (2.7 bits)
“The Pacific Ocean, The Atlantic Ocean, The Indian...”
Uniform 2-bit (2.5 bits)
## 1000000000000000000000000...
Correct at 2.7 bits on a 7B model — uniform is pure number garbage.
Mistral-7B — Repetition
Mistral · GQA 4:1 · JANG_4L · 4.5 bits · M4 Max
“Who wrote Romeo and Juliet?”
JANG_4L (4.5 bits)
“William Shakespeare. What is the name of the play...”
Uniform 4-bit (4.5 bits)
William Shakespeare 1564-1616 1564-1616 1564-1616...
Same size — JANG answers clean, uniform loops the dates.
Qwen2.5-3B — Tomato Test
Qwen · GQA 8:1 · JANG (MLP=4, A=5) · 4.12 bits · M4 Max
“Is a tomato a fruit or vegetable?”
JANG (4.12 bits)
“A tomato is a fruit. It is botanically classified...”
Uniform 4-bit (4.0 bits)
Is a tomato a fruit or vegetable? Is it a vegetabl...
Direct win — JANG prevents repetition collapse with just 0.12 extra bits. At uniform 3-bit: “xnxx Is a a a a a a...” (complete garbage).
Logit MSE Proof

JANG at 3.37 bits beats uniform 4-bit

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

| Config | Avg Bits | Logit MSE |
|---|---|---|
| MLP=4, attn=8 | 4.49 | 7.13 |
| MLP=4, attn=6 | 4.24 | 8.70 |
| Uniform 4-bit | 4.00 | 11.31 |
| MLP=3, attn=6 ✔ | 3.37 | 11.10 |

JANG at 3.37 bits (MSE 11.10) beats uniform at 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.
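The metric itself is simple: run the same prompt through a bf16 reference and a quantized model, then take the mean squared error over the final-token logits. A minimal sketch, with random arrays standing in for real model logits:

```python
import numpy as np

def logit_mse(logits_ref: np.ndarray, logits_quant: np.ndarray) -> float:
    """Mean squared error of quantized-model logits vs a bf16 reference."""
    return float(np.mean((logits_quant - logits_ref) ** 2))

# Toy demo: random vectors stand in for vocab-sized model logits.
rng = np.random.default_rng(0)
ref = rng.standard_normal(32_000)
quant = ref + 0.1 * rng.standard_normal(32_000)  # simulated quantization noise
print(f"MSE: {logit_mse(ref, quant):.4f}")
```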

Summary

All models tested

| Model | Params | Architecture | Wins | Degradation Point |
|---|---|---|---|---|
| Mistral-7B | 7B | Mistral GQA 4:1, sliding window | 13 | Uniform 3b → number garbage, 4b → loops |
| TinyLlama-1.1B | 1.1B | Llama GQA 8:1 | 11 | Uniform 4b → topic derail |
| SmolLM2-1.7B | 1.7B | Llama MHA | 11 | Uniform 3b → number spam |
| Phi-2 | 2.7B | Phi MHA, GELU MLP | 9 | Uniform 2b → empty output |
| Qwen2.5-7B | 7B | Qwen GQA 4:1 | 9 | Uniform 3b → repetition loop |
| Qwen2.5-3B | 3B | Qwen GQA 8:1 | 6 | Uniform 4b → echo/loop |
| Qwen3.5-4B | 4B | Hybrid: 24 linear + 8 full attn | 6 | Uniform 2b → 0/6 correct |

All tests: Apple M4 Max · 107 GB unified memory · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 45 experiments · 7 models tested · Qwen3.5-9B downloaded, testing pending

Profiles

JANG_{bits}{size}

11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).

| Profile | MLP | Attention | Embed | lm_head | Avg Bits |
|---|---|---|---|---|---|
| JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2 |
| JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5 |
| JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7 |
| JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9 |
| JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1 |
| JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4 |
| JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6 |
| JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1 |
| JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2 |
| JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5 |
| JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2 |
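One practical use of the profile table is picking the richest profile that fits a memory budget. `pick_profile` below is an illustrative helper, not part of jang-tools, and its size estimate (params × avg_bits / 8) ignores per-group metadata overhead.

```python
# Average bits per profile, from the table above.
PROFILE_AVG_BITS = {
    "JANG_1L": 2.2,
    "JANG_2S": 2.5, "JANG_2M": 2.7, "JANG_2L": 2.9,
    "JANG_3S": 3.1, "JANG_3M": 3.4, "JANG_3L": 3.6,
    "JANG_4S": 4.1, "JANG_4M": 4.2, "JANG_4L": 4.5,
    "JANG_6M": 6.2,
}

def pick_profile(params: float, budget_gb: float):
    """Highest-quality profile whose rough weight size fits the budget."""
    fitting = [(bits, name) for name, bits in PROFILE_AVG_BITS.items()
               if params * bits / 8 / 1e9 <= budget_gb]
    return max(fitting)[1] if fitting else None

print(pick_profile(7e9, 4.0))  # JANG_4L: 7e9 * 4.5 / 8 = 3.94 GB fits
```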
Runtime

Swift + Metal inference engine

14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.

jang — Terminal
$ jang run --model Qwen2.5-3B-JANG_4L.jang
# Loading model (zero-copy mmap)...
# Profile: JANG_4L (MLP=4, attn=8, avg=4.5 bits)
# Size: 1.8 GB — loaded in 0.39s
> What is photosynthesis?
Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
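In NumPy terms, the operation is group-wise affine dequantization (group_size=64, as used throughout this page) followed by a matrix-vector product. This is an unfused reference sketch, not the Metal kernel: the kernel dequantizes each group in registers inside the dot-product loop, so the full float weight matrix never touches memory.

```python
import numpy as np

GROUP = 64  # group_size used throughout this page

def dequant_affine(q, scales, biases):
    """q: (rows, cols) integer codes; scales, biases: (rows, cols // GROUP)."""
    rows, cols = q.shape
    g = q.reshape(rows, cols // GROUP, GROUP).astype(np.float32)
    w = g * scales[..., None] + biases[..., None]  # per-group affine dequant
    return w.reshape(rows, cols)

def dequant_gemv(q, scales, biases, x):
    # Reference: materialize the weights, then multiply. The fused kernel
    # interleaves these two steps instead.
    return dequant_affine(q, scales, biases) @ x
```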

Dequant + GEMM

Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.

GQA Attention

Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.
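A minimal reference for the decode step (one new token attending to the cached sequence), assuming the query-head count is a multiple of the KV-head count, e.g. GQA 4:1. RoPE and masking are omitted for brevity; this is a sketch of the math, not the kernel.

```python
import numpy as np

def gqa_decode(q, k_cache, v_cache):
    """One decode step. q: (n_q, d); k_cache, v_cache: (n_kv, t, d)."""
    n_q, d = q.shape
    rep = n_q // k_cache.shape[0]          # query heads per KV head
    k = np.repeat(k_cache, rep, axis=0)    # share each KV head across a group
    v = np.repeat(v_cache, rep, axis=0)
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)   # softmax over cached tokens
    return np.einsum("ht,htd->hd", probs, v)
```

The payoff of GQA is the cache: at 4:1 the KV cache is a quarter the size of full multi-head attention.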

RMSNorm + RoPE

Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.
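Unfused reference versions of the two operations (the kernel applies both in one pass over the activations). The RoPE sketch is the traditional variant, rotating consecutive (even, odd) dimension pairs.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Normalize by root-mean-square, then scale. No mean subtraction."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def rope(x, pos, base=10000.0):
    """Traditional RoPE: rotate (even, odd) pairs by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```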

SwiGLU

Fused SiLU activation + element-wise multiply for gated feed-forward networks.
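The op in reference form: SiLU-gate the first projection, multiply element-wise by the second. The fused kernel computes the activation and multiply together so the gate and up projections never round-trip through memory separately.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up):
    """SiLU(x @ W_gate) gating (x @ W_up), element-wise."""
    return silu(x @ w_gate) * (x @ w_up)
```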

Quantized Embedding

Direct embedding lookup from quantized weights. No full-table dequantization needed.
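The idea in NumPy form: index the quantized table first, then dequantize only the rows for the tokens in the batch, so a vocab-sized float table is never materialized. Group-wise affine codes as elsewhere on this page (group_size=64 assumed).

```python
import numpy as np

GROUP = 64  # group_size, matching the rest of this page

def embed_lookup(token_ids, q, scales, biases):
    """Dequantize only the embedding rows for the requested tokens.
    q: (vocab, dim) codes; scales, biases: (vocab, dim // GROUP)."""
    rows = q[token_ids].reshape(len(token_ids), -1, GROUP).astype(np.float32)
    s = scales[token_ids][..., None]
    b = biases[token_ids][..., None]
    return (rows * s + b).reshape(len(token_ids), -1)
```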

Quantize

Convert any model

Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.

6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.
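For intuition, here is a minimal sketch of the simplest of the three methods, round-to-nearest (RTN) group-wise affine quantization: each group of 64 weights is mapped to integer codes in [0, 2^bits − 1] via a per-group scale and offset. This is an illustration of the idea, not the jang-tools implementation.

```python
import numpy as np

GROUP = 64

def quantize_rtn(w, bits):
    """RTN affine quantization. w: (rows, cols), cols divisible by GROUP."""
    g = w.reshape(w.shape[0], -1, GROUP)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)      # guard all-equal groups
    q = np.clip(np.round((g - lo) / scale), 0, 2**bits - 1)
    return q.reshape(w.shape).astype(np.uint8), scale[..., 0], lo[..., 0]

def dequantize(q, scale, lo):
    g = q.reshape(q.shape[0], -1, GROUP).astype(np.float64)
    return (g * scale[..., None] + lo[..., None]).reshape(q.shape)
```

MSE-optimal grid search and GPTQ improve on RTN by searching over scale/offset grids and by using second-order (Hessian) information, respectively, but produce the same code-plus-metadata format.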

Open source — Apache 2.0 License
jang-tools
$ pip install jang-tools
$ jang convert --model Qwen/Qwen2.5-7B \
    --profile JANG_4L \
    --method gptq \
    --output ./Qwen2.5-7B-JANG_4L/
# Quantizing with GPTQ (Hessian-guided)...
# Attention layers: 8-bit | MLP: 4-bit
# Average bits: 4.5 | Size: 4.1 GB
# Done ✔
Memory

Run bigger models on less RAM

JANG_3M saves 25% vs uniform 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.

~4.1 GB
7B at JANG_4S (vs 4.5 GB uniform)
~8.2 GB
14B at JANG_4S (vs 9 GB uniform)
~41 GB
70B at JANG_4S (vs 45 GB uniform)
25%
Savings at JANG_3M vs uniform 4-bit
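The figures above can be roughly reproduced with a back-of-envelope size model: payload bits per weight plus per-group quantization metadata. The 16-bit scale and 16-bit bias per 64-weight group (about 0.5 extra bits/weight) is an assumption about the on-disk format, and `model_size_gb` is an illustrative helper, not part of the tooling.

```python
def model_size_gb(params: float, avg_bits: float,
                  group_size: int = 64, meta_bits: int = 32) -> float:
    """Estimated weight size: payload bits plus per-group scale/bias."""
    bits_per_weight = avg_bits + meta_bits / group_size  # +0.5 bits at 64
    return params * bits_per_weight / 8 / 1e9

print(f"7B at JANG_4S (4.1 avg bits):   {model_size_gb(7e9, 4.1):.1f} GB")
print(f"7B at uniform 4-bit (4.5 bits): {model_size_gb(7e9, 4.5):.1f} GB")
```

These estimates land near the ~4.1 GB and 4.5 GB figures quoted above; KV cache and activations add to the total at runtime.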