# JANG — The GGUF for MLX

> The GGUF equivalent for MLX on Apple Silicon. Mixed-precision quantization that protects
> attention layers while compressing MLP — same model size, dramatically better quality.
> 84% MMLU at 2 bits where standard gets 46%. Tested from 1B to 230B. Open source, Apache 2.0.
> https://jangq.ai | https://github.com/jjang-ai/jangq | https://pypi.org/project/jang/
> Author: Jinho Jang (eric@jangq.ai) | https://x.com/jangqai

## Large Model Results (March 2026)

JANG runs quantized in GPU memory using native MLX Metal kernels — no float16 expansion. Models stay compressed and dequantize on the fly during inference at full native speed. Tested on an M4 Max (128 GB) and a Mac Studio M4 Ultra (192 GB).

### MMLU Benchmark — Qwen3.5-122B-A10B

50-question MMLU subset (10 subjects × 5 questions), thinking disabled, temperature 0.0, M4 Max 128 GB.

| Method | Avg bits | Disk | GPU mem | MMLU |
|--------|----------|------|---------|------|
| **JANG_4K** | 3.99 | 69 GB | 71 GB | **94%** |
| MLX 4-bit | 4.0 | 64 GB | 64 GB | 90% |
| **JANG_2S** | 2.11 | 38 GB | 44 GB | **84%** |
| JANG_1L | 2.24 | 51 GB | 46 GB | 73% |
| 2-bit | 2.0 | 36 GB | 36 GB | 56% |
| MLX mixed_2_6 | ~2.5 | 44 GB | 45 GB | 46% |

JANG_4K scores 94% MMLU — 4 points above MLX 4-bit (90%) on the 122B model. JANG_2S scores 84% at 38 GB on disk — 6 GB smaller than MLX mixed_2_6 (44 GB) while scoring 38 points higher.

### Qwen3.5-122B-A10B — QA prompt comparison (122B params, 10B active, MoE)

| Method | Avg bits | Disk | GPU mem | Speed | Correct | Partial | Broken |
|--------|----------|------|---------|-------|---------|---------|--------|
| JANG_1L | 2.24 | 51 GB | 46 GB | 48 tok/s | 3 | 3 | 0 |
| MLX mixed_2_6 | ~2.2 | 44 GB | 44.9 GB | 66 tok/s | 1 | 1 | 4 |
| 2-bit | 2.0 | 36 GB | 35.6 GB | 67 tok/s | 1 | 2 | 3 |

| Prompt | JANG_1L (2.24b) | MLX mixed_2_6 (~2.2b) | 2-bit |
|--------|-----------------|-----------------------|-------|
| "What is 2+2?" | "2+2 is 4" ✅ | "2+2=4" then repeats ⚠️ | "2+2=4" then loops ⚠️ |
| "Is a tomato a fruit?" | Uses think ⚠️ | Empty think ❌ | Rephrases ❌ |
| "What is photosynthesis?" | "plants use energy of sun" ✅ | "dummies" degenerate ❌ | "Photos-sense y=y" ❌ |
| "Three planets larger?" | Uses think ⚠️ | Uses think ⚠️ | Misreads ❌ |
| "Who wrote Romeo and Juliet?" | Uses think ⚠️ | Double think ❌ | Uses think ⚠️ |
| "Capital of France?" | "Paris" ✅ | "Paris" ✅ | "Paris" with details ✅ |

Note: Previous results incorrectly claimed "6/6 PERFECT" for JANG_1L on 122B. The corrected scores are 3 correct, 3 partial, 0 broken. The partial responses use reasoning tags instead of answering directly, which counts as partial rather than correct.

### Qwen3.5-122B-A10B — JANG_2L (earlier profile)

- Profile: JANG_2L (2.19 avg bits)
- GPU memory: 45.3 GB
- Speed: 38-49 tok/s on M4 Max 128 GB
- Score: JANG_2L 3/4 correct, 2-bit 1/4 correct

### MiniMax-M2.5 (230B params, 10B active, MoE)

- Profile: JANG_2S (2.06 avg bits)
- GPU memory: 81.6 GB on Mac Studio M4 Ultra 192 GB
- Speed: 50 tok/s
- Score: JANG_2S 3/6 correct
- JANG_2L (~88 GB) converting — results coming soon

### Qwen3.5-35B-A3B — Three-way comparison (35B params, 3B active, MoE)

| Method | Avg bits | Disk | GPU mem | Speed | Correct | Partial | Broken |
|--------|----------|------|---------|-------|---------|---------|--------|
| JANG_2L | 2.28 | 15 GB | 13.3 GB | 100 tok/s | 4 | 1 | 1 |
| MLX mixed_2_6 | ~2.2 | 13 GB | 12.8 GB | 120 tok/s | 0 | 1 | 5 |
| 2-bit | 2.0 | 10 GB | 10.1 GB | 128 tok/s | 0 | 1 | 5 |

| Prompt | JANG_2L (2.28b) | MLX mixed_2_6 (~2.2b) | 2-bit |
|--------|-----------------|-----------------------|-------|
| "What is 2+2?" | "2+2 equals 4. This is a simple addition problem..." | "4" then "2 2 2 2 2 2 2 2..." | "4" then "2 2 2 2 2 2 2 2..." |
| "What is photosynthesis?" | Correct: "process by which plants convert light energy..." | "Photos 6 6 6 6 6" garbage | "Photos 6 6 6 6 6" garbage |
| "Three planets larger?" | "Jupiter, Saturn, and Uranus" with details | "3 of the 3 8 8 8 8 8 8" number spam | "3 of the 3 8 8 8 8 8 8" number spam |
| "Capital of France?" | "Paris. Major hub for culture, finance, and tourism" | "Paris" then "Hé Hé" garbage | "Paris" then "Hé Hé" garbage |
| "Who wrote Romeo and Juliet?" | "William Shakespeare" | "The" then nothing | "The" then nothing |
| "Is a tomato a fruit?" | Loops (A. A. A.) — all three fail | "A . 4" garbage | "A . 4" garbage |

### MMLU Benchmark — Qwen3.5-35B-A3B

| Method | MMLU |
|--------|------|
| MLX 4-bit | 82% |
| JANG_4S | 82% |
| JANG_2L v2 | 56% |
| MLX mixed_2_6 | 34% |

JANG_4S matches MLX 4-bit exactly (82% vs 82%), indicating the JANG quantization pipeline introduces no additional loss at 4-bit on this benchmark.

### HumanEval — Qwen3.5-35B-A3B

20-problem HumanEval subset, temperature 0.0.

| Method | Pass |
|--------|------|
| MLX 4-bit | 19/20 = 95% |
| MLX mixed_2_6 | 0/20 = 0% |

MLX mixed_2_6 fails to produce any working code on this model — 0 of 20 problems pass.

These are basic QA comparisons at temperature 0.0, not comprehensive benchmarks. Perplexity and downstream task evaluations are planned.

Note on MLX mixed_2_6: MLX's mixed 2/6-bit quantization applies higher precision to attention layers (6-bit) and lower precision to MLP layers (2-bit), similar in concept to JANG. However, on hybrid and MoE architectures like Qwen3.5, mixed_2_6 does not account for GatedDeltaNet linear-attention layers or MoE expert routing. It treats all attention as 6-bit regardless of layer type, and does not adjust precision for expert gating or routing tensors. As a result, mixed_2_6 provides little quality improvement over 2-bit on these architectures. JANG's per-tensor sensitivity profiles handle these architectural differences explicitly.

## Background

Standard quantization applies the same bit width to every tensor.
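As a toy contrast (hypothetical layer names and relative parameter sizes, not a real model config), uniform quantization assigns one width everywhere, while per-tensor allocation in the JANG style spends extra bits on attention while compressing the much larger MLP tensors:

```python
# Toy contrast between uniform and per-tensor bit allocation. Layer names and
# relative parameter sizes below are hypothetical, not from a real model.

def uniform_bits(tensors, bits=2):
    """Standard quantization: one bit width for every tensor."""
    return {name: bits for name in tensors}

def mixed_bits(tensors):
    """JANG-style allocation, mirroring the ranges in the Approach section
    (attention 5-8, MLP 2-4, embeddings 4-8, lm_head 6-8); here 6/2/4/6."""
    def width(name):
        if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
            return 6   # protect attention
        if "embed" in name:
            return 4
        if "lm_head" in name:
            return 6
        return 2       # compress MLP/FFN hard
    return {name: width(name) for name in tensors}

def avg_bits(alloc, sizes):
    """Size-weighted average bits per parameter."""
    total = sum(sizes.values())
    return sum(alloc[n] * sizes[n] for n in sizes) / total

# Relative parameter counts: attention is a small share, MLP dominates.
sizes = {
    "layers.0.self_attn.q_proj": 12, "layers.0.self_attn.k_proj": 4,
    "layers.0.self_attn.v_proj": 4,  "layers.0.self_attn.o_proj": 12,
    "layers.0.mlp.gate_proj": 56, "layers.0.mlp.up_proj": 56,
    "layers.0.mlp.down_proj": 56,
}
print(avg_bits(uniform_bits(sizes), sizes))  # 2.0
print(avg_bits(mixed_bits(sizes), sizes))    # 2.64: protecting attention costs little
```

Because attention is a small fraction of the parameters, giving it 6-bit precision raises the average only slightly above the 2-bit baseline.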
At low bit widths (2-3 bits), attention layers degrade first — attention scores flatten, positional encoding loses precision, and output degenerates into repetition loops, number sequences, or empty responses.

Common failure modes at 2-3 bits:

- Repetition loops ("2+2? 2+2? 2+2?")
- Number sequences ("10000000000000000000")
- Prompt echoing (repeats the question instead of answering)
- Empty output

Attention layers make up only ~12% of total parameters but control output coherence. When quantized to 2-3 bits alongside the MLP layers, they lose the precision needed to maintain stable attention patterns.

## Approach

JANG assigns variable bit widths per tensor based on layer type and sensitivity:

- Attention layers (Q, K, V, O projections): 5-8 bits
- MLP/FFN layers: 2-4 bits
- Embeddings: 4-8 bits
- Output head (lm_head): 6-8 bits

The overhead is ~0.3 extra bits on average compared to standard quantization at the same MLP bit width. Models stay quantized in GPU memory using MLX's native `quantized_matmul` — no float16 expansion, no speed penalty.

## Three Components

1. A quantization method — importance-aware bit allocation (more bits to attention, fewer to MLP)
2. A file format — .jang files using safetensors with per-block variable bit widths (2, 3, 4, 5, 6, 8)
3. An inference runtime — a Swift 6.0 + Metal engine with 14 custom GPU kernels for Apple Silicon

## Dense Model Results (1B-7B)

All tests on an Apple M4 Max (107 GB unified memory), affine quantization, group_size=64, temperature 0.0.

### Qwen3.5-4B (Hybrid: 24 linear-attention + 8 full-attention layers)

JANG_2S at 2.5 effective bits: 6/6 correct. 2-bit: 0/6 correct.

| Prompt | JANG_2S (2.5 bits) | 2-bit (2.5 bits) |
|--------|--------------------|-------------------|
| "What is 2+2?" | "The answer is 4." | "2+2? 2+2? 2+2? 2+2?" |
| "Is a tomato a fruit?" | "A tomato is a fruit, not a vegetable." | "1 1 1 1 1 1 1 1" |
| "Who wrote Romeo and Juliet?" | Answers correctly | "10, 10, 10, 10, 10" |
| "What is photosynthesis?" | Correct definition | Garbled text |
| "How many legs does a spider have?" | Answers correctly | "10, 10, 10, 10" |
| "Largest ocean on Earth?" | "The Pacific Ocean." | Infinite loop |

Why: Qwen3.5-4B has 8 critical full-attention layers. JANG protects them at 6-bit while compressing the 24 linear-attention layers and the MLP at 2-bit.

### Mistral-7B-v0.3 (Mistral GQA 4:1, sliding window) — 13 wins

JANG_3M (3.4 bits) vs 3-bit (3.5 bits):

- "What is photosynthesis?" → JANG: correct answer | Standard: "10000000000000000000..."
- FEWER bits, correct answer vs number garbage.

### Qwen2.5-7B (Qwen GQA 4:1) — 9 wins

JANG_3L (3.6 bits) vs 3-bit (3.5 bits):

- "What is 2+2?" → JANG: "The answer is 4." | Standard: "Assistant Assistant Assistant..."
- Same size, correct answer vs an infinite repetition loop.

### SmolLM2-1.7B (Llama MHA) — 11 wins

JANG_3M (3.4 bits) vs 3-bit (3.5 bits):

- "How many legs does a spider have?" → JANG: "8" | Standard: "2 1/2 1/2 1/2 1/2..."
- FEWER bits, correct answer vs number spam.

### TinyLlama-1.1B (Llama GQA 8:1) — 11 wins

JANG_4S (4.1 bits) vs 4-bit (4.5 bits):

- "Chemical formula for water?" → JANG: stays on topic (H...) | Standard: derails to "hydrogen peroxide?"
- 9% smaller, stays on topic vs derailing to the wrong question.

### Phi-2 2.7B (Phi MHA) — 9 wins

JANG_2S (2.5 bits) vs 2-bit (2.5 bits):

- "What is photosynthesis?" → JANG: correct scientific answer | Standard: (empty output)
- SAME bits, correct answer vs completely empty output.

### Qwen2.5-3B (Qwen GQA 8:1) — 6 wins

JANG_4S (4.1 bits) vs 4-bit (4.5 bits):

- "Translate 'thank you' to Spanish" → JANG: "'gracias'" | Standard: echoes the prompt back
- 9% smaller, correct translation vs echoing the prompt.
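The "9% smaller" figures above follow directly from the effective bit widths: quantized weight bytes scale linearly with bits per parameter, so JANG_4S at 4.1 bits vs standard 4-bit at 4.5 effective bits gives 1 − 4.1/4.5 ≈ 9%. A quick sanity check (decimal GB, weight data only, ignoring metadata):

```python
# Back-of-envelope size check: quantized weight bytes scale linearly with
# effective bits per parameter. Decimal GB, weight data only, no metadata.

def size_gb(params_billion: float, eff_bits: float) -> float:
    # params * bits / 8 bits-per-byte; the two factors of 1e9 cancel.
    return params_billion * eff_bits / 8

jang_4s = size_gb(7, 4.1)    # JANG_4S at its 4.1 effective bits
std_4bit = size_gb(7, 4.5)   # standard 4-bit at its 4.5 effective bits
savings = 1 - jang_4s / std_4bit
print(f"JANG_4S vs 4-bit: {savings:.1%} smaller")  # ~8.9%, the quoted "9% smaller"
```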
## Numerical Evidence (Logit MSE)

Qwen2.5-3B, prompt "What is 2+2?", logit MSE vs the bf16 reference:

- MLP=4, attn=8 (4.49 bits): MSE 7.13
- MLP=4, attn=6 (4.24 bits): MSE 8.70
- 4-bit (4.00 bits): MSE 11.31
- MLP=3, attn=6 (3.37 bits): MSE 11.10 ← JANG beats 4-bit with 16% fewer bits

## Profile System

Profiles are named JANG_{bits}{S/M/L}, where bits is the MLP bit width and S/M/L (Small/Medium/Large) sets the attention precision.

| Profile | MLP | Attention | Embed | lm_head | Avg bits |
|---------|-----|-----------|-------|---------|----------|
| JANG_1L | 2 | 8 | 8 | 8 | ~2.2 |
| JANG_2S | 2 | 6 | 4 | 6 | ~2.5 |
| JANG_2M | 2 | 8 | 4 | 8 | ~2.7 |
| JANG_2L | 2 | 8 | 6 | 8 | ~2.9 |
| JANG_3S | 3 | 4 | 4 | 6 | ~3.1 |
| JANG_3M | 3 | 6 | 4 | 6 | ~3.4 |
| JANG_3L | 3 | 8 | 4 | 8 | ~3.6 |
| JANG_4S | 4 | 5 | 4 | 6 | ~4.1 |
| JANG_4M | 4 | 6 | 4 | 6 | ~4.2 |
| JANG_4L | 4 | 8 | 4 | 8 | ~4.5 |
| JANG_6M | 6 | 8 | 6 | 8 | ~6.2 |

## Technical Details

Format: safetensors with 5 companion tensors per weight (qweight, scales, zeros, bit_map, block_offsets). Block size 64. Asymmetric quantization. Dequant formula: `dequantized = (raw_int - zero) * scale`.

Runtime: Swift 6.0, macOS 15+, Apple Silicon. 14 Metal kernels: standalone dequant, fused dequant+GEMV, fused dequant+GEMM, GQA attention decode, causal attention prefill, quantized embedding, RMSNorm, RoPE (traditional + non-traditional), softmax, SiLU, SwiGLU, element-wise add, standard embedding.

Quantization methods: RTN (round-to-nearest), MSE-optimal (grid search), GPTQ (Hessian-guided).

Architectures supported: Llama, Qwen, Qwen3.5 (hybrid), Gemma, Phi, Mistral, Mamba/SSM, MoE, GatedDeltaNet, MLA (DeepSeek), sliding-window attention.

Speed (Qwen2.5-3B, M4 Max): load 0.39 s (4-bit, 1.8 GB), prefill 27.6 tok/s, decode 15.4 tok/s.

Memory: 7B at JANG_4S ≈ 4.1 GB (vs 4.5 GB at 4-bit, 9% savings); at JANG_3M, 25% savings vs 4-bit.
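The dequant formula can be illustrated with a minimal block quantizer sketch (round-to-nearest, asymmetric, block size 64). This is an illustration of the formula only, not the actual .jang packing code:

```python
# Minimal sketch of block-wise asymmetric quantization matching the stated
# dequant formula: dequantized = (raw_int - zero) * scale, block size 64.
# Illustrative only; the real pipeline also packs bits and stores bit maps.

BLOCK = 64

def quantize_block(block, bits):
    """Asymmetric round-to-nearest quantization of one block of floats."""
    lo, hi = min(block), max(block)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = round(-lo / scale)                      # integer zero point
    q = [max(0, min(levels, round(v / scale) + zero)) for v in block]
    return q, scale, zero

def dequantize_block(q, scale, zero):
    """The README's formula: (raw_int - zero) * scale."""
    return [(raw - zero) * scale for raw in q]

weights = [0.01 * i - 0.3 for i in range(BLOCK)]   # toy weight block
q, scale, zero = quantize_block(weights, bits=4)
restored = dequantize_block(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"4-bit max reconstruction error: {max_err:.4f}")  # bounded by ~scale/2
```

With an asymmetric zero point, the full [min, max] range of each block maps onto the integer grid, so round-to-nearest error stays within about half a scale step per weight.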
## Links

- Website: https://jangq.ai
- GitHub: https://github.com/jjang-ai/jangq
- PyPI: https://pypi.org/project/jang/
- Author: https://x.com/jangqai
- Models: https://huggingface.co/JANGQ-AI
- Related: https://vmlx.net | https://mlx.studio