Same framework, same hardware, same average bits. The only difference is how those bits are distributed across the model. Variable wins.
JANG is an MLX extension, not a competitor. It uses the same framework, Metal backend, and safetensors format. The only change is per-tensor variable bit allocation instead of uniform bits. At 4-bit: 86% vs 85% MMLU (budget-neutral, +1 point). At 2-bit: 79% vs 56.5% MMLU (+22.5 points, 6 GB smaller). The lower the bit count, the more variable allocation matters.
200-question MMLU subset, Qwen3.5-122B. Bar width proportional to accuracy. JANG configurations shown in purple; MLX uniform/mixed in gray.
| Configuration | Avg Bits | Size | MMLU (200q) | vs FP16 |
|---|---|---|---|---|
| FP16 (baseline) | 16.0 | ~244 GB | 86.5% | — |
| JANG_4K | 3.99 | 69 GB | 86.0% | -0.5 |
| MLX 4-bit uniform | 4.0 | 64 GB | 85.0% | -1.5 |
| JANG_2S | 2.11 | 38 GB | 79.0% | -7.5 |
| JANG_3M | 3.11 | ~50 GB | 77.5% | -9.0 |
| MLX 3-bit uniform | 3.0 | ~47 GB | 75.5% | -11.0 |
| MLX 2-bit uniform | 2.0 | ~34 GB | 65.5% | -21.0 |
| MLX mixed_2_6 | ~2.5 | 44 GB | 56.5% | -30.0 |
| MLX mixed_1_5 | ~1.5 | ~28 GB | ~40% | -46.5 |

| Feature | JANG | MLX Uniform |
|---|---|---|
| Bit Allocation | Variable per tensor (2-8 bit) | Same bits for all tensors |
| Layer Awareness | Attention vs MLP differentiation | No layer differentiation |
| 4-bit MMLU (200q) | 86% | 85% |
| 2-bit MMLU (200q) | 79% (JANG_2S) | 56.5% (mixed_2_6) / 65.5% (uniform) |
| Budget Neutrality | Yes — same avg bits, better distribution | N/A (fixed) |
| Predefined Profiles | 11 (JANG_1L to JANG_6M) | Per-bit (2, 3, 4, 8) |
| Metal GPU Kernels | 14 custom (mixed-precision aware) | Standard MLX kernels |
| File Format | .jang (safetensors-based) | .safetensors |
| Framework | MLX (extension) | MLX (built-in) |
| Calibration Required | No | No |
At 4 bits, the advantage is modest (+1 point) because 4-bit precision is sufficient for most tensors, including attention. The attention mechanism can still route information effectively at 4 bits.
At 2 bits, the advantage becomes dramatic (+22.5 points) because 2-bit precision destroys the attention mechanism. Attention layers use softmax to create probability distributions over the input sequence — at 2 bits, the quantization noise overwhelms these fine-grained probability weights, causing the model to "look at" the wrong tokens. The output becomes incoherent.
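A toy experiment makes this concrete. The sketch below (illustrative only, not JANG's actual kernels) applies uniform b-bit quantization to a single row of softmax attention weights and measures how far the renormalized distribution drifts. At 2 bits, most of the fine-grained weights collapse to the same level; at 6 bits the distribution survives nearly intact.

```python
def quantize(values, bits):
    """Snap each value to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]

# One softmax row: the model mostly attends to token 2, a little to token 5.
attn = [0.02, 0.05, 0.55, 0.08, 0.03, 0.20, 0.04, 0.03]

errors = {}
for bits in (2, 6):
    q = quantize(attn, bits)
    total = sum(q)
    q = [v / total for v in q]                 # renormalize like softmax output
    errors[bits] = sum(abs(a - b) for a, b in zip(attn, q))
    print(f"{bits}-bit L1 error: {errors[bits]:.3f}")
```

With these made-up weights, the 2-bit row accumulates an order of magnitude more L1 error than the 6-bit row, which is exactly the kind of distortion that sends attention to the wrong tokens.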
JANG_2S solves this by keeping attention at 5-6 bits (sufficient for accurate softmax routing) while compressing MLP to 2 bits. MLP layers act as learned lookup tables and tolerate aggressive quantization much better. The result: 79% MMLU at 2.11 average bits vs 56.5% for mixed_2_6 — and JANG_2S is 6 GB smaller (38 GB vs 44 GB) because it allocates bits more efficiently.
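The budget arithmetic behind such a profile is straightforward. The sketch below uses hypothetical per-tensor sizes (not JANG's actual profile format or Qwen's real shapes) to show how a layer-aware rule in the spirit of JANG_2S yields a low parameter-weighted average even with attention held at 6 bits.

```python
# Hypothetical per-tensor sizes for one transformer block (params in millions).
block = {
    "attn.q_proj": 16, "attn.k_proj": 16, "attn.v_proj": 16, "attn.o_proj": 16,
    "mlp.gate_proj": 88, "mlp.up_proj": 88, "mlp.down_proj": 88,
}

def assign_bits(name):
    # Layer-aware rule: attention stays accurate, MLP absorbs the compression.
    return 6 if name.startswith("attn.") else 2

total_params = sum(block.values())
total_bits = sum(params * assign_bits(name) for name, params in block.items())
avg_bits = total_bits / total_params
print(f"average bits per weight: {avg_bits:.2f}")
```

Because MLP tensors dominate the parameter count, the expensive attention tensors barely move the average; the real JANG_2S profile lands even lower (2.11) by covering the full model, not just one toy block.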
A surprising finding: MLX mixed_2_6 (56.5%) scores lower than uniform 2-bit (65.5%), despite using more bits on average (~2.5 vs 2.0). This happens because mixed_2_6 alternates 2-bit and 6-bit layers without regard to layer type. Some attention layers get 2 bits while some MLP layers get 6 bits — exactly backwards from what the model needs.
Uniform 2-bit is at least consistent: every layer gets the same (bad) precision, and the model can partially compensate. Mixed_2_6's type-blind alternation creates unpredictable quality drops in critical attention layers, leading to worse overall performance despite a higher bit budget.
This is the strongest evidence for JANG's approach: which layers get which bits matters more than the average number of bits. JANG_2S uses fewer total bits than mixed_2_6 but scores 22.5 points higher because it puts those bits where they matter.
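The misallocation is easy to see in miniature. The sketch below builds a toy tensor list (the layer names and the strict 2/6 alternation are assumptions for illustration, not MLX's exact mixed_2_6 recipe) and counts how many attention tensors end up starved at 2 bits under each strategy.

```python
layers = []
for i in range(4):  # four toy transformer blocks, five tensors each
    layers += [f"block{i}.attn.qkv", f"block{i}.attn.out",
               f"block{i}.mlp.gate", f"block{i}.mlp.up", f"block{i}.mlp.down"]

# Type-blind: alternate 2 and 6 bits straight down the tensor list.
alternating = {name: (2 if idx % 2 == 0 else 6) for idx, name in enumerate(layers)}

# Layer-aware: attention high, MLP low, regardless of position.
layer_aware = {name: (6 if ".attn." in name else 2) for name in layers}

starved = sorted(n for n, b in alternating.items() if ".attn." in n and b == 2)
print(f"{len(starved)} of 8 attention tensors starved to 2 bits:", starved)
print("starved under layer-aware:",
      [n for n, b in layer_aware.items() if ".attn." in n and b == 2])
```

Half the attention tensors land on 2 bits purely by list position under alternation, while the layer-aware rule starves none of them, at a comparable overall budget.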
Same MLX framework. Same Apple Silicon. Better bit allocation. Pre-quantized models on HuggingFace.
Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · Drop-in MLX extension