The Key Comparison · 200q MMLU Data

JANG vs MLX Uniform

Same framework, same hardware, same average bits. The only difference is how those bits are distributed across the model. Variable wins.

Key Finding

JANG is an MLX extension, not a competitor. It uses the same framework, Metal backend, and safetensors format. The only change is per-tensor variable bit allocation instead of uniform bits. At 4-bit: 86% vs 85% MMLU (budget-neutral, +1 point). At 2-bit: 79% vs 56.5% MMLU (+22.5 points, 6 GB smaller). The lower the bit count, the more variable allocation matters.

Headline Numbers (Qwen3.5-122B, 200q MMLU)

86% · JANG_4K at 3.99 avg bits (69 GB) · +1 point vs MLX 4-bit (85%)
79% · JANG_2S at 2.11 avg bits (38 GB) · +22.5 points vs MLX mixed_2_6 (56.5%)
86.5% · FP16 baseline (244 GB) · JANG_4K retains 99.4% of FP16 quality
56.5% · MLX mixed_2_6 at ~2.5 avg bits (44 GB) · 6 GB larger than JANG_2S, 22.5 points worse

Visual: MMLU Accuracy by Configuration

FP16 · 86.5%
JANG_4K · 86.0%
MLX 4-bit · 85.0%
JANG_2S · 79.0%
JANG_3M · 77.5%
MLX 3-bit · 75.5%
MLX 2-bit · 65.5%
MLX mixed_2_6 · 56.5%
MLX mixed_1_5 · ~40%

200-question MMLU subset, Qwen3.5-122B.

Full Benchmark Table

Configuration      | Avg Bits | Size    | MMLU (200q) | vs FP16
FP16 (baseline)    | 16.0     | ~244 GB | 86.5%       | 0.0
JANG_4K            | 3.99     | 69 GB   | 86.0%       | -0.5
MLX 4-bit uniform  | 4.0      | 64 GB   | 85.0%       | -1.5
JANG_2S            | 2.11     | 38 GB   | 79.0%       | -7.5
JANG_3M            | 3.11     | ~50 GB  | 77.5%       | -9.0
MLX 3-bit uniform  | 3.0      | ~47 GB  | 75.5%       | -11.0
MLX 2-bit uniform  | 2.0      | ~34 GB  | 65.5%       | -21.0
MLX mixed_2_6      | ~2.5     | 44 GB   | 56.5%       | -30.0
MLX mixed_1_5      | ~1.5     | ~28 GB  | ~40%        | -46.5

Feature Comparison

Feature              | JANG                                     | MLX Uniform
Bit Allocation       | Variable per tensor (2-8 bit)            | Same bits for all tensors
Layer Awareness      | Attention vs MLP differentiation         | No layer differentiation
4-bit MMLU (200q)    | 86%                                      | 85%
2-bit MMLU (200q)    | 79% (JANG_2S)                            | 56.5% (mixed_2_6) / 65.5% (uniform)
Budget Neutrality    | Yes: same avg bits, better distribution  | N/A (fixed)
Predefined Profiles  | 11 (JANG_1L to JANG_6M)                  | Per-bit (2, 3, 4, 8)
Metal GPU Kernels    | 14 custom (mixed-precision aware)        | Standard MLX kernels
File Format          | .jang (safetensors-based)                | .safetensors
Framework            | MLX (extension)                          | MLX (built-in)
Calibration Required | No                                       | No

How Variable Bits Work

JANG — Variable Allocation
  • Attention q/k/v/o projections: 5-8 bits
  • MLP gate/up/down projections: 2-4 bits
  • Embedding and lm_head: higher bits
  • Average matches target (budget-neutral)
  • Protects coherence-critical layers
  • 11 profiles tune the mix
MLX Uniform — Fixed Allocation
  • All tensors get the same bit width
  • 4-bit: every tensor at 4 bits
  • 2-bit: every tensor at 2 bits
  • mixed_2_6: alternating 2 and 6 bit layers
  • No layer-type awareness
  • Simple but suboptimal at low bits
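The allocation rules above can be sketched as a simple policy function. This is an illustrative sketch, not JANG's actual implementation: the tensor-name patterns and the exact bit widths per profile are assumptions chosen to match the description above.

```python
# Illustrative sketch of per-tensor bit allocation (not the real JANG API).
# Name patterns follow common transformer checkpoint conventions.

def allocate_bits(tensor_name: str, profile: str = "JANG_2S") -> int:
    """Return a bit width for a tensor based on its role in the model."""
    attention = ("q_proj", "k_proj", "v_proj", "o_proj")
    mlp = ("gate_proj", "up_proj", "down_proj")

    if any(part in tensor_name for part in ("embed_tokens", "lm_head")):
        return 8  # embeddings and output head get the highest precision
    if any(part in tensor_name for part in attention):
        return 6  # protect attention routing (5-6 bits in the text above)
    if any(part in tensor_name for part in mlp):
        return 2 if profile == "JANG_2S" else 4  # MLP tolerates low bits
    return 4  # default for anything unmatched (assumed)

print(allocate_bits("model.layers.0.self_attn.q_proj.weight"))
print(allocate_bits("model.layers.0.mlp.down_proj.weight"))
```

Uniform quantization, by contrast, collapses this whole function to a constant.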

Why Variable Bits Matter More at Low Bits

At 4 bits, the advantage is modest (+1 point) because 4-bit precision is sufficient for most tensors, including attention. The attention mechanism can still route information effectively at 4 bits.

At 2 bits, the advantage becomes dramatic (+22.5 points) because 2-bit precision destroys the attention mechanism. Attention layers use softmax to create probability distributions over the input sequence — at 2 bits, the quantization noise overwhelms these fine-grained probability weights, causing the model to "look at" the wrong tokens. The output becomes incoherent.
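A toy illustration of this effect (plain Python, not JANG code; the logit values are made up): uniform quantization at 6 bits preserves the ordering of closely spaced attention logits, while at 2 bits the three top scores collapse to a single quantization level, so softmax can no longer prefer the right token.

```python
import math

def quantize(xs, bits):
    """Uniform min-max quantization of a list of floats to 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((x - lo) / scale) * scale + lo for x in xs]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.10, 2.05, 1.90, 0.30, 0.10]  # fine-grained attention logits
for bits in (6, 2):
    q = quantize(scores, bits)
    print(bits, "bits ->", [round(p, 3) for p in softmax(q)])
```

At 6 bits the quantized logits stay distinct and ordered; at 2 bits only four levels exist, the top three scores become identical, and the attention distribution flattens across them.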

JANG_2S solves this by keeping attention at 5-6 bits (sufficient for accurate softmax routing) while compressing MLP to 2 bits. MLP layers act as learned lookup tables and tolerate aggressive quantization much better. The result: 79% MMLU at 2.11 average bits vs 56.5% for mixed_2_6 — and JANG_2S is 6 GB smaller (38 GB vs 44 GB) because it allocates bits more efficiently.
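A back-of-envelope check shows why a ~2.1-bit average is compatible with high-precision attention: MLP parameters dominate the count, so their 2 bits dominate the weighted average. The parameter shares below are illustrative assumptions, not the real Qwen3.5-122B breakdown.

```python
# Weighted average bits = sum(share_i * bits_i) over tensor groups.
# Shares are hypothetical, chosen only to show the mechanism.
allocation = {
    # group: (share of total parameters, bits)
    "attention":  (0.02, 6),  # small share, kept at high precision
    "embed/head": (0.01, 8),
    "mlp":        (0.97, 2),  # bulk of the parameters, compressed hard
}

avg_bits = sum(share * bits for share, bits in allocation.values())
print(f"average bits: {avg_bits:.2f}")
```

With these (made-up) shares the average lands near 2.1 even though every attention tensor sits at 6 bits, which is the budget-neutral trade the section describes.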

Why MLX mixed_2_6 Scores Lower Than Uniform 2-bit

A surprising finding: MLX mixed_2_6 (56.5%) scores lower than uniform 2-bit (65.5%), despite using more bits on average (~2.5 vs 2.0). This happens because mixed_2_6 alternates 2-bit and 6-bit layers without regard to layer type. Some attention layers get 2 bits while some MLP layers get 6 bits — exactly backwards from what the model needs.

Uniform 2-bit is at least consistent: every layer gets the same (bad) precision, and the model can partially compensate. Mixed_2_6's layer-blind allocation creates unpredictable quality drops in critical attention layers, leading to worse overall performance despite a higher bit budget.

This is the strongest evidence for JANG's approach: which layers get which bits matters more than the average number of bits. JANG_2S uses fewer total bits than mixed_2_6 but scores 22.5 points higher because it puts those bits where they matter.

Frequently Asked Questions

How does JANG compare to MLX uniform quantization?
JANG extends MLX quantization with per-tensor variable bit widths. At 4-bit average, JANG_4K scores 86% MMLU vs 85% for uniform (budget-neutral, +1 point). At 2-bit average, JANG_2S scores 79% vs 56.5% for mixed_2_6 (+22.5 points, 6 GB smaller). The key: JANG protects attention layers and compresses MLP, using the same total bits more intelligently.
What is budget-neutral quantization?
Budget-neutral means JANG_4K uses 3.99 average bits per weight, nearly identical to MLX 4-bit uniform at 4.0 bits. The total number of bits (and approximate file size) is the same. The difference is distribution: attention layers get 5-6 bits, MLP layers get 3-4 bits. Same budget, smarter allocation, better quality.
Is JANG a replacement for MLX?
No, JANG is an extension of MLX. It uses the same MLX framework, Metal GPU backend, and safetensors format. The addition is per-tensor variable bit widths with 14 custom Metal kernels that handle mixed-precision tensors. Think of it as MLX quantization with smarter bit allocation.
Why does mixed_2_6 score lower than uniform 2-bit?
MLX mixed_2_6 alternates 2-bit and 6-bit layers without considering layer type. Some attention layers get 2 bits (destroying their function) while some MLP layers get 6 bits (wasting precision). This layer-blind allocation performs worse than uniform 2-bit, which at least gives every layer consistent (if low) precision. JANG_2S solves this by ensuring all attention layers get high precision.

Upgrade from Uniform to Variable

Same MLX framework. Same Apple Silicon. Better bit allocation. Pre-quantized models on HuggingFace.

Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · Drop-in MLX extension