Same framework, same hardware, same average bits. The only difference is how those bits are distributed across the model. Variable wins.
JANG is an MLX extension, not a competitor. It uses the same framework, Metal backend, and safetensors format. The only change is per-tensor variable bit allocation instead of uniform bits. At 4-bit: 86% vs 85% MMLU (budget-neutral, +1 point). At 2-bit: 79% vs 56.5% MMLU (+22.5 points, 6 GB smaller). The lower the bit count, the more variable allocation matters.
200-question MMLU subset, Qwen3.5-122B. Bar width proportional to accuracy. JANG configurations shown in purple; MLX uniform/mixed in gray.
| Configuration | Avg Bits | Size | MMLU (200q) | vs FP16 |
|---|---|---|---|---|
| FP16 (baseline) | 16.0 | ~244 GB | 86.5% | — |
| JANG_4K | 3.99 | 69 GB | 86.0% | -0.5 |
| MLX 4-bit uniform | 4.0 | 64 GB | 85.0% | -1.5 |
| JANG_2S | 2.11 | 38 GB | 79.0% | -7.5 |
| JANG_3M | 3.11 | ~50 GB | 77.5% | -9.0 |
| MLX 3-bit uniform | 3.0 | ~47 GB | 75.5% | -11.0 |
| MLX 2-bit uniform | 2.0 | ~34 GB | 65.5% | -21.0 |
| MLX mixed_2_6 | ~2.5 | 44 GB | 56.5% | -30.0 |
| MLX mixed_1_5 | ~1.5 | ~28 GB | ~40% | -46.5 |

| Feature | JANG | MLX Uniform |
|---|---|---|
| Bit Allocation | Variable per tensor (2-8 bit) | Same bits for all tensors |
| Layer Awareness | Attention vs MLP differentiation | No layer differentiation |
| 4-bit MMLU (200q) | 86% | 85% |
| 2-bit MMLU (200q) | 79% (JANG_2S) | 56.5% (mixed_2_6) / 65.5% (uniform) |
| Budget Neutrality | Yes — same avg bits, better distribution | N/A (fixed) |
| Predefined Profiles | 11 (JANG_1L to JANG_6M) | Per-bit (2, 3, 4, 8) |
| Metal GPU Kernels | 14 custom (mixed-precision aware) | Standard MLX kernels |
| File Format | .jang (safetensors-based) | .safetensors |
| Framework | MLX (extension) | MLX (built-in) |
| Calibration Required | No | No |
At 4 bits, the advantage is modest (+1 point) because 4-bit precision is sufficient for most tensors, including attention. The attention mechanism can still route information effectively at 4 bits.
At 2 bits, the advantage becomes dramatic (+22.5 points) because 2-bit precision destroys the attention mechanism. Attention layers use softmax to create probability distributions over the input sequence — at 2 bits, the quantization noise overwhelms these fine-grained probability weights, causing the model to "look at" the wrong tokens. The output becomes incoherent.
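A toy experiment makes this concrete. The sketch below (illustrative only, not JANG's actual kernels) applies uniform b-bit quantization to a single row of softmax attention weights and measures how far the renormalized distribution drifts. At 2 bits, most of the fine-grained weights collapse to the same level; at 6 bits the distribution survives nearly intact.

```python
def quantize(values, bits):
    """Snap each value to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]

# One softmax row: the model mostly attends to token 2, a little to token 5.
attn = [0.02, 0.05, 0.55, 0.08, 0.03, 0.20, 0.04, 0.03]

errors = {}
for bits in (2, 6):
    q = quantize(attn, bits)
    total = sum(q)
    q = [v / total for v in q]                 # renormalize like softmax output
    errors[bits] = sum(abs(a - b) for a, b in zip(attn, q))
    print(f"{bits}-bit L1 error: {errors[bits]:.3f}")
```

With these made-up weights, the 2-bit row accumulates an order of magnitude more L1 error than the 6-bit row, which is exactly the kind of distortion that sends attention to the wrong tokens.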
JANG_2S solves this by keeping attention at 5-6 bits (sufficient for accurate softmax routing) while compressing MLP to 2 bits. MLP layers act as learned lookup tables and tolerate aggressive quantization much better. The result: 79% MMLU at 2.11 average bits vs 56.5% for mixed_2_6 — and JANG_2S is 6 GB smaller (38 GB vs 44 GB) because it allocates bits more efficiently.
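The budget arithmetic behind such a profile is straightforward. The sketch below uses hypothetical per-tensor sizes (not JANG's actual profile format or Qwen's real shapes) to show how a layer-aware rule in the spirit of JANG_2S yields a low parameter-weighted average even with attention held at 6 bits.

```python
# Hypothetical per-tensor sizes for one transformer block (params in millions).
block = {
    "attn.q_proj": 16, "attn.k_proj": 16, "attn.v_proj": 16, "attn.o_proj": 16,
    "mlp.gate_proj": 88, "mlp.up_proj": 88, "mlp.down_proj": 88,
}

def assign_bits(name):
    # Layer-aware rule: attention stays accurate, MLP absorbs the compression.
    return 6 if name.startswith("attn.") else 2

total_params = sum(block.values())
total_bits = sum(params * assign_bits(name) for name, params in block.items())
avg_bits = total_bits / total_params
print(f"average bits per weight: {avg_bits:.2f}")
```

Because MLP tensors dominate the parameter count, the expensive attention tensors barely move the average; the real JANG_2S profile lands even lower (2.11) by covering the full model, not just one toy block.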
A surprising finding: MLX mixed_2_6 (56.5%) scores lower than uniform 2-bit (65.5%), despite using more bits on average (~2.5 vs 2.0). This happens because mixed_2_6 alternates 2-bit and 6-bit layers without regard to layer type. Some attention layers get 2 bits while some MLP layers get 6 bits — exactly backwards from what the model needs.
Uniform 2-bit is at least consistent: every layer gets the same (bad) precision, and the model can partially compensate. Mixed_2_6's type-blind alternation creates unpredictable quality drops in critical attention layers, leading to worse overall performance despite a higher bit budget.
This is the strongest evidence for JANG's approach: which layers get which bits matters more than the average number of bits. JANG_2S uses fewer total bits than mixed_2_6 but scores 22.5 points higher because it puts those bits where they matter.
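The misallocation is easy to see in miniature. The sketch below builds a toy tensor list (the layer names and the strict 2/6 alternation are assumptions for illustration, not MLX's exact mixed_2_6 recipe) and counts how many attention tensors end up starved at 2 bits under each strategy.

```python
layers = []
for i in range(4):  # four toy transformer blocks, five tensors each
    layers += [f"block{i}.attn.qkv", f"block{i}.attn.out",
               f"block{i}.mlp.gate", f"block{i}.mlp.up", f"block{i}.mlp.down"]

# Type-blind: alternate 2 and 6 bits straight down the tensor list.
alternating = {name: (2 if idx % 2 == 0 else 6) for idx, name in enumerate(layers)}

# Layer-aware: attention high, MLP low, regardless of position.
layer_aware = {name: (6 if ".attn." in name else 2) for name in layers}

starved = sorted(n for n, b in alternating.items() if ".attn." in n and b == 2)
print(f"{len(starved)} of 8 attention tensors starved to 2 bits:", starved)
print("starved under layer-aware:",
      [n for n, b in layer_aware.items() if ".attn." in n and b == 2])
```

Half the attention tensors land on 2 bits purely by list position under alternation, while the layer-aware rule starves none of them, at a comparable overall budget.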
Same MLX framework. Same Apple Silicon. Better bit allocation. Pre-quantized models on HuggingFace.
Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · Drop-in MLX extension