# JANG — The GGUF for MLX

> The GGUF equivalent for MLX on Apple Silicon. Mixed-precision quantization that protects
> attention layers while compressing MLP — same model size, dramatically better quality.
> 84% MMLU at 2 bits where standard gets 46%. Tested from 1B to 230B. Open source, Apache 2.0.
> https://jangq.ai | https://github.com/jjang-ai/jangq | https://pypi.org/project/jang/
> Author: Jinho Jang (eric@jangq.ai) | https://x.com/jangqai

## Large Model Results (March 2026)

JANG runs quantized in GPU memory using native MLX Metal kernels — no float16 expansion. Models stay compressed and dequantize on the fly during inference at full native speed. Tested on an M4 Max (128 GB) and a Mac Studio M4 Ultra (192 GB).

### MMLU Benchmark — Qwen3.5-122B-A10B

50-question MMLU subset (10 subjects × 5 questions), thinking disabled, temperature 0.0, M4 Max 128 GB.

| Method | Avg bits | Disk | GPU mem | MMLU |
|--------|----------|------|---------|------|
| **JANG_4K** | 3.99 | 69 GB | 71 GB | **94%** |
| MLX 4-bit | 4.0 | 64 GB | 64 GB | 90% |
| **JANG_2S** | 2.11 | 38 GB | 44 GB | **84%** |
| JANG_1L | 2.24 | 51 GB | 46 GB | 73% |
| 2-bit | 2.0 | 36 GB | 36 GB | 56% |
| MLX mixed_2_6 | ~2.5 | 44 GB | 45 GB | 46% |

JANG_4K scores 94% MMLU — 4 points above MLX 4-bit (90%) on the 122B model. JANG_2S scores 84% at 38 GB on disk — 6 GB smaller than MLX mixed_2_6 (44 GB) while scoring 38 points higher.

### Qwen3.5-122B-A10B — QA prompt comparison (122B params, 10B active, MoE)

| Method | Avg bits | Disk | GPU mem | Speed | Correct | Partial | Broken |
|--------|----------|------|---------|-------|---------|---------|--------|
| JANG_1L | 2.24 | 51 GB | 46 GB | 48 tok/s | 3 | 3 | 0 |
| MLX mixed_2_6 | ~2.2 | 44 GB | 44.9 GB | 66 tok/s | 1 | 1 | 4 |
| 2-bit | 2.0 | 36 GB | 35.6 GB | 67 tok/s | 1 | 2 | 3 |

| Prompt | JANG_1L (2.24b) | MLX mixed_2_6 (~2.2b) | 2-bit |
|--------|-----------------|-----------------------|-------|
| "What is 2+2?" | "2+2 is 4" ✅ | "2+2=4" then repeats ⚠️ | "2+2=4" then loops ⚠️ |
| "Is a tomato a fruit?" | Uses think ⚠️ | Empty think ❌ | Rephrases ❌ |
| "What is photosynthesis?" | "plants use energy of sun" ✅ | "dummies" degenerate ❌ | "Photos-sense y=y" ❌ |
| "Three planets larger?" | Uses think ⚠️ | Uses think ⚠️ | Misreads ❌ |
| "Who wrote Romeo and Juliet?" | Uses think ⚠️ | Double think ❌ | Uses think ⚠️ |
| "Capital of France?" | "Paris" ✅ | "Paris" ✅ | "Paris" with details ✅ |

Note: Previous results incorrectly claimed "6/6 PERFECT" for JANG_1L on 122B. The corrected scores are 3 correct, 3 partial, 0 broken. The partial responses use reasoning tags instead of answering directly, which counts as partial rather than correct.

### Qwen3.5-122B-A10B — JANG_2L (earlier profile)

- Profile: JANG_2L (2.19 avg bits)
- GPU memory: 45.3 GB
- Speed: 38-49 tok/s on M4 Max 128 GB
- Score: JANG_2L 3/4 correct, 2-bit 1/4 correct

### MiniMax-M2.5 (230B params, 10B active, MoE)

- Profile: JANG_2S (2.06 avg bits)
- GPU memory: 81.6 GB on Mac Studio M4 Ultra 192 GB
- Speed: 50 tok/s
- Score: JANG_2S 3/6 correct
- JANG_2L (~88 GB) converting — results coming soon

### Qwen3.5-35B-A3B — Three-way comparison (35B params, 3B active, MoE)

| Method | Avg bits | Disk | GPU mem | Speed | Correct | Partial | Broken |
|--------|----------|------|---------|-------|---------|---------|--------|
| JANG_2L | 2.28 | 15 GB | 13.3 GB | 100 tok/s | 4 | 1 | 1 |
| MLX mixed_2_6 | ~2.2 | 13 GB | 12.8 GB | 120 tok/s | 0 | 1 | 5 |
| 2-bit | 2.0 | 10 GB | 10.1 GB | 128 tok/s | 0 | 1 | 5 |

| Prompt | JANG_2L (2.28b) | MLX mixed_2_6 (~2.2b) | 2-bit |
|--------|-----------------|-----------------------|-------|
| "What is 2+2?" | "2+2 equals 4. This is a simple addition problem..." | "4" then "2 2 2 2 2 2 2 2..." | "4" then "2 2 2 2 2 2 2 2..." |
| "What is photosynthesis?" | Correct: "process by which plants convert light energy..." | "Photos 6 6 6 6 6" garbage | "Photos 6 6 6 6 6" garbage |
| "Three planets larger?" | "Jupiter, Saturn, and Uranus" with details | "3 of the 3 8 8 8 8 8 8" number spam | "3 of the 3 8 8 8 8 8 8" number spam |
| "Capital of France?" | "Paris. Major hub for culture, finance, and tourism" | "Paris" then "Hé Hé" garbage | "Paris" then "Hé Hé" garbage |
| "Who wrote Romeo and Juliet?" | "William Shakespeare" | "The" then nothing | "The" then nothing |
| "Is a tomato a fruit?" | Loops (A. A. A.) — all three fail | "A . 4" garbage | "A . 4" garbage |

### MMLU Benchmark — Qwen3.5-35B-A3B

| Method | MMLU |
|--------|------|
| MLX 4-bit | 82% |
| JANG_4S | 82% |
| JANG_2L v2 | 56% |
| MLX mixed_2_6 | 34% |

JANG_4S matches MLX 4-bit exactly (82% vs 82%), indicating the JANG quantization pipeline introduces no additional loss at 4-bit on this benchmark.

### HumanEval — Qwen3.5-35B-A3B

20-problem HumanEval subset, temperature 0.0.

| Method | Pass |
|--------|------|
| MLX 4-bit | 19/20 = 95% |
| MLX mixed_2_6 | 0/20 = 0% |

MLX mixed_2_6 fails to produce any working code on this model — 0 of 20 problems pass.

These are basic QA comparisons at temperature 0.0, not comprehensive benchmarks. Perplexity and downstream task evaluations are planned.

Note on MLX mixed_2_6: MLX's mixed 2/6-bit quantization applies higher precision to attention layers (6-bit) and lower precision to MLP layers (2-bit), similar in concept to JANG. However, on hybrid and MoE architectures like Qwen3.5, mixed_2_6 does not account for GatedDeltaNet linear-attention layers or MoE expert routing. It treats all attention as 6-bit regardless of layer type, and does not adjust precision for expert gating or routing tensors. As a result, mixed_2_6 provides little quality improvement over 2-bit on these architectures. JANG's per-tensor sensitivity profiles handle these architectural differences explicitly.

## Background

Standard quantization applies the same bit width to every tensor.
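As a toy contrast (hypothetical layer names and relative parameter sizes, not a real model config), uniform quantization assigns one width everywhere, while per-tensor allocation in the JANG style spends extra bits on attention while compressing the much larger MLP tensors:

```python
# Toy contrast between uniform and per-tensor bit allocation. Layer names and
# relative parameter sizes below are hypothetical, not from a real model.

def uniform_bits(tensors, bits=2):
    """Standard quantization: one bit width for every tensor."""
    return {name: bits for name in tensors}

def mixed_bits(tensors):
    """JANG-style allocation, mirroring the ranges in the Approach section
    (attention 5-8, MLP 2-4, embeddings 4-8, lm_head 6-8); here 6/2/4/6."""
    def width(name):
        if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
            return 6   # protect attention
        if "embed" in name:
            return 4
        if "lm_head" in name:
            return 6
        return 2       # compress MLP/FFN hard
    return {name: width(name) for name in tensors}

def avg_bits(alloc, sizes):
    """Size-weighted average bits per parameter."""
    total = sum(sizes.values())
    return sum(alloc[n] * sizes[n] for n in sizes) / total

# Relative parameter counts: attention is a small share, MLP dominates.
sizes = {
    "layers.0.self_attn.q_proj": 12, "layers.0.self_attn.k_proj": 4,
    "layers.0.self_attn.v_proj": 4,  "layers.0.self_attn.o_proj": 12,
    "layers.0.mlp.gate_proj": 56, "layers.0.mlp.up_proj": 56,
    "layers.0.mlp.down_proj": 56,
}
print(avg_bits(uniform_bits(sizes), sizes))  # 2.0
print(avg_bits(mixed_bits(sizes), sizes))    # 2.64: protecting attention costs little
```

Because attention is a small fraction of the parameters, giving it 6-bit precision raises the average only slightly above the 2-bit baseline.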
At low bit widths (2-3 bits), attention layers degrade first — attention scores flatten, positional encoding loses precision, and output degenerates into repetition loops, number sequences, or empty responses.

Common failure modes at 2-3 bits:

- Repetition loops ("2+2? 2+2? 2+2?")
- Number sequences ("10000000000000000000")
- Prompt echoing (repeats the question instead of answering)
- Empty output

Attention layers make up only ~12% of total parameters but control output coherence. When quantized to 2-3 bits alongside the MLP layers, they lose the precision needed to maintain stable attention patterns.

## Approach

JANG assigns variable bit widths per tensor based on layer type and sensitivity:

- Attention layers (Q, K, V, O projections): 5-8 bits
- MLP/FFN layers: 2-4 bits
- Embeddings: 4-8 bits
- Output head (lm_head): 6-8 bits

The overhead is ~0.3 extra bits on average compared to standard quantization at the same MLP bit width. Models stay quantized in GPU memory using MLX's native `quantized_matmul` — no float16 expansion, no speed penalty.

## Three Components

1. A quantization method — importance-aware bit allocation (more bits to attention, fewer to MLP)
2. A file format — .jang files using safetensors with per-block variable bit widths (2, 3, 4, 5, 6, 8)
3. An inference runtime — a Swift 6.0 + Metal engine with 14 custom GPU kernels for Apple Silicon

## Dense Model Results (1B-7B)

All tests on an Apple M4 Max (107 GB unified memory), affine quantization, group_size=64, temperature 0.0.

### Qwen3.5-4B (Hybrid: 24 linear-attention + 8 full-attention layers)

JANG_2S at 2.5 effective bits: 6/6 correct. 2-bit: 0/6 correct.

| Prompt | JANG_2S (2.5 bits) | 2-bit (2.5 bits) |
|--------|--------------------|-------------------|
| "What is 2+2?" | "The answer is 4." | "2+2? 2+2? 2+2? 2+2?" |
| "Is a tomato a fruit?" | "A tomato is a fruit, not a vegetable." | "1 1 1 1 1 1 1 1" |
| "Who wrote Romeo and Juliet?" | Answers correctly | "10, 10, 10, 10, 10" |
| "What is photosynthesis?" | Correct definition | Garbled text |
| "How many legs does a spider have?" | Answers correctly | "10, 10, 10, 10" |
| "Largest ocean on Earth?" | "The Pacific Ocean." | Infinite loop |

Why: Qwen3.5-4B has 8 critical full-attention layers. JANG protects them at 6-bit while compressing the 24 linear-attention layers and the MLP at 2-bit.

### Mistral-7B-v0.3 (Mistral GQA 4:1, sliding window) — 13 wins

JANG_3M (3.4 bits) vs 3-bit (3.5 bits):

- "What is photosynthesis?" → JANG: correct answer | Standard: "10000000000000000000..."
- FEWER bits, correct answer vs number garbage.

### Qwen2.5-7B (Qwen GQA 4:1) — 9 wins

JANG_3L (3.6 bits) vs 3-bit (3.5 bits):

- "What is 2+2?" → JANG: "The answer is 4." | Standard: "Assistant Assistant Assistant..."
- Same size, correct answer vs an infinite repetition loop.

### SmolLM2-1.7B (Llama MHA) — 11 wins

JANG_3M (3.4 bits) vs 3-bit (3.5 bits):

- "How many legs does a spider have?" → JANG: "8" | Standard: "2 1/2 1/2 1/2 1/2..."
- FEWER bits, correct answer vs number spam.

### TinyLlama-1.1B (Llama GQA 8:1) — 11 wins

JANG_4S (4.1 bits) vs 4-bit (4.5 bits):

- "Chemical formula for water?" → JANG: stays on topic (H...) | Standard: derails to "hydrogen peroxide?"
- 9% smaller, stays on topic vs derailing to the wrong question.

### Phi-2 2.7B (Phi MHA) — 9 wins

JANG_2S (2.5 bits) vs 2-bit (2.5 bits):

- "What is photosynthesis?" → JANG: correct scientific answer | Standard: (empty output)
- SAME bits, correct answer vs completely empty output.

### Qwen2.5-3B (Qwen GQA 8:1) — 6 wins

JANG_4S (4.1 bits) vs 4-bit (4.5 bits):

- "Translate 'thank you' to Spanish" → JANG: "'gracias'" | Standard: echoes the prompt back
- 9% smaller, correct translation vs echoing the prompt.
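The "9% smaller" figures above follow directly from the effective bit widths: quantized weight bytes scale linearly with bits per parameter, so JANG_4S at 4.1 bits vs standard 4-bit at 4.5 effective bits gives 1 − 4.1/4.5 ≈ 9%. A quick sanity check (decimal GB, weight data only, ignoring metadata):

```python
# Back-of-envelope size check: quantized weight bytes scale linearly with
# effective bits per parameter. Decimal GB, weight data only, no metadata.

def size_gb(params_billion: float, eff_bits: float) -> float:
    # params * bits / 8 bits-per-byte; the two factors of 1e9 cancel.
    return params_billion * eff_bits / 8

jang_4s = size_gb(7, 4.1)    # JANG_4S at its 4.1 effective bits
std_4bit = size_gb(7, 4.5)   # standard 4-bit at its 4.5 effective bits
savings = 1 - jang_4s / std_4bit
print(f"JANG_4S vs 4-bit: {savings:.1%} smaller")  # ~8.9%, the quoted "9% smaller"
```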
## Numerical Evidence (Logit MSE)

Qwen2.5-3B, prompt "What is 2+2?", logit MSE vs the bf16 reference:

- MLP=4, attn=8 (4.49 bits): MSE 7.13
- MLP=4, attn=6 (4.24 bits): MSE 8.70
- 4-bit (4.00 bits): MSE 11.31
- MLP=3, attn=6 (3.37 bits): MSE 11.10 ← JANG beats 4-bit with 16% fewer bits

## Profile System

Profiles are named JANG_{bits}{S/M/L}, where bits is the MLP bit width and S/M/L (Small/Medium/Large) sets the attention precision.

| Profile | MLP | Attention | Embed | lm_head | Avg bits |
|---------|-----|-----------|-------|---------|----------|
| JANG_1L | 2 | 8 | 8 | 8 | ~2.2 |
| JANG_2S | 2 | 6 | 4 | 6 | ~2.5 |
| JANG_2M | 2 | 8 | 4 | 8 | ~2.7 |
| JANG_2L | 2 | 8 | 6 | 8 | ~2.9 |
| JANG_3S | 3 | 4 | 4 | 6 | ~3.1 |
| JANG_3M | 3 | 6 | 4 | 6 | ~3.4 |
| JANG_3L | 3 | 8 | 4 | 8 | ~3.6 |
| JANG_4S | 4 | 5 | 4 | 6 | ~4.1 |
| JANG_4M | 4 | 6 | 4 | 6 | ~4.2 |
| JANG_4L | 4 | 8 | 4 | 8 | ~4.5 |
| JANG_6M | 6 | 8 | 6 | 8 | ~6.2 |

## Technical Details

Format: safetensors with 5 companion tensors per weight (qweight, scales, zeros, bit_map, block_offsets). Block size 64. Asymmetric quantization. Dequant formula: `dequantized = (raw_int - zero) * scale`.

Runtime: Swift 6.0, macOS 15+, Apple Silicon. 14 Metal kernels: standalone dequant, fused dequant+GEMV, fused dequant+GEMM, GQA attention decode, causal attention prefill, quantized embedding, RMSNorm, RoPE (traditional + non-traditional), softmax, SiLU, SwiGLU, element-wise add, standard embedding.

Quantization methods: RTN (round-to-nearest), MSE-optimal (grid search), GPTQ (Hessian-guided).

Architectures supported: Llama, Qwen, Qwen3.5 (hybrid), Gemma, Phi, Mistral, Mamba/SSM, MoE, GatedDeltaNet, MLA (DeepSeek), sliding-window attention.

Speed (Qwen2.5-3B, M4 Max): load 0.39 s (4-bit, 1.8 GB), prefill 27.6 tok/s, decode 15.4 tok/s.

Memory: 7B at JANG_4S ≈ 4.1 GB (vs 4.5 GB at 4-bit, 9% savings); at JANG_3M, 25% savings vs 4-bit.
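The dequant formula can be illustrated with a minimal block quantizer sketch (round-to-nearest, asymmetric, block size 64). This is an illustration of the formula only, not the actual .jang packing code:

```python
# Minimal sketch of block-wise asymmetric quantization matching the stated
# dequant formula: dequantized = (raw_int - zero) * scale, block size 64.
# Illustrative only; the real pipeline also packs bits and stores bit maps.

BLOCK = 64

def quantize_block(block, bits):
    """Asymmetric round-to-nearest quantization of one block of floats."""
    lo, hi = min(block), max(block)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = round(-lo / scale)                      # integer zero point
    q = [max(0, min(levels, round(v / scale) + zero)) for v in block]
    return q, scale, zero

def dequantize_block(q, scale, zero):
    """The README's formula: (raw_int - zero) * scale."""
    return [(raw - zero) * scale for raw in q]

weights = [0.01 * i - 0.3 for i in range(BLOCK)]   # toy weight block
q, scale, zero = quantize_block(weights, bits=4)
restored = dequantize_block(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"4-bit max reconstruction error: {max_err:.4f}")  # bounded by ~scale/2
```

With an asymmetric zero point, the full [min, max] range of each block maps onto the integer grid, so round-to-nearest error stays within about half a scale step per weight.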
## Links

- Website: https://jangq.ai
- GitHub: https://github.com/jjang-ai/jangq
- PyPI: https://pypi.org/project/jang/
- Author: https://x.com/jangqai
- Models: https://huggingface.co/JANGQ-AI
- Related: https://vmlx.net | https://mlx.studio