JANG
The GGUF for MLX
397 billion parameters on a 128 GB Mac. 86.5% MMLU with reasoning. MLX can't even run it.
JANG_1L fits a 397B model in 112 GB — a 128 GB MacBook Pro can run it with reasoning at 86.5% MMLU. MLX at 2 or 3 bits produces NaN (not a number). MiniMax 230B? MLX scores 26.5% at every bit level. Nemotron-H 120B? MLX 3-bit is broken. JANG is the only way to run these models quantized on Apple Silicon.
JANG assigns more bits to attention and fewer to MLP, so models stay coherent where standard quantization produces garbage or NaN. Same speed, same Metal kernels — just better output. Open source under Apache 2.0.
Variable bit widths based on layer sensitivity
Standard quantization applies the same bit width to every tensor. Attention layers (~12% of parameters) are more sensitive to precision loss than MLP layers — when quantized too aggressively, attention scores flatten, positional encoding degrades, and output degenerates.
JANG classifies tensors into sensitivity tiers and assigns bit widths accordingly. Attention layers get 5–8 bits while MLP compresses to 2–4 bits. The overhead is ~0.3 extra bits on average.
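As a rough sketch of how tier-based assignment changes the average, here is a minimal Python model of the idea. The tensor names, tier rules, and parameter fractions below are illustrative assumptions, not JANG's actual internals:

```python
# Illustrative tier table: protect attention, compress MLP hard.
TIER_BITS = {"attention": 6, "mlp": 2, "embedding": 4}

def classify(tensor_name: str) -> str:
    """Map a tensor name to a sensitivity tier by substring match."""
    if any(k in tensor_name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "attention"
    if any(k in tensor_name for k in ("embed", "lm_head")):
        return "embedding"
    return "mlp"

def average_bits(tensors: dict) -> float:
    """Parameter-weighted average bit width over {name: num_params}."""
    total = sum(tensors.values())
    return sum(TIER_BITS[classify(n)] * p for n, p in tensors.items()) / total

# Toy model: attention is ~12% of parameters, MLP dominates.
toy = {"layers.0.q_proj": 6, "layers.0.v_proj": 6,
       "layers.0.mlp.up": 80, "embed_tokens": 8}
print(round(average_bits(toy), 2))  # parameter-weighted average: 2.64 bits
```

Because MLP holds most of the parameters, giving attention 6 bits while MLP sits at 2 bits raises the average only modestly above the MLP floor, which is where the "~0.3 extra bits" figure comes from.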
JANG vs MLX — side by side
Each JANG model compared against the closest MLX method by size. 200-question MMLU (20 per subject × 10 subjects), thinking/reasoning enabled where noted, temp 0.0. Apple M4 Max 128 GB / M4 Ultra 256 GB.
Qwen3.5-397B-A17B — 397 billion parameters — JANG vs MLX
397B on a 128 GB Mac — first ever. JANG_1L at 112 GB disk (120 GB GPU peak) fits on a 128 GB MacBook Pro and scores 86.5% MMLU with reasoning. MLX at 2-bit and 3-bit produces NaN — the model is too complex for standard quantization at low bit widths. MLX 4-bit runs at 94% but needs ~280 GB, far beyond any laptop. JANG_2L at 187 GB hits 92% on an M4 Ultra 256 GB.
Nemotron-3-Super-120B-A12B — NVIDIA Hybrid Mamba-2 SSM + Latent MoE + Attention
First working Nemotron-H quantization for Apple Silicon. NVIDIA’s hybrid architecture combines Mamba-2 SSM, Latent MoE, and standard attention — MLX 3-bit is broken on it. JANG_4M at 63 GB scores 93% MMLU with reasoning at 55 tok/s. JANG_2L fits on a 64 GB Mac at 43 GB with 86% MMLU.
MiniMax-M2.5 (230B) — JANG vs MLX
MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.
Qwen3.5-122B-A10B — ~4 bits
Qwen3.5-122B-A10B — ~2 bits
Qwen3.5-35B-A3B — ~4 bits
Qwen3.5-35B-A3B — ~2 bits
Download: All models on HuggingFace — 397B, Nemotron-H 120B, 122B, 35B, MiniMax 230B, and more
Three-way comparison on basic prompts
Side-by-side on 6 factual prompts. All methods use MLX’s native Metal kernels. Temperature 0.0, max 80 tokens. M4 Max 128 GB.
MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.
JANG_2L: 74% MMLU (200q) at 82.5 GB RAM — nearly 3× the score of MLX 4-bit at ~120 GB
On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.
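The difference between a generic heuristic and an explicit tier system can be sketched as an ordered rule table. The patterns below (e.g. `linear_attn`, `router`) are hypothetical names chosen for illustration, not JANG's real rules:

```python
import re

# Illustrative tier rules that explicitly name hybrid-architecture tensors
# which a generic "protect v_proj and down_proj" heuristic would miss.
TIER_RULES = [
    (re.compile(r"linear_attn|gated_delta"), "attention"),  # GatedDeltaNet layers
    (re.compile(r"router|gate\.weight"), "router"),         # MoE expert routing
    (re.compile(r"q_proj|k_proj|v_proj|o_proj"), "attention"),
    (re.compile(r"experts|mlp"), "mlp"),
]

def tier_of(name: str) -> str:
    """First matching rule wins; unmatched tensors default to the
    most compressible tier."""
    for pattern, tier in TIER_RULES:
        if pattern.search(name):
            return tier
    return "mlp"
```

Rule order matters: routing tensors live inside MLP blocks, so the router pattern has to fire before the generic `mlp` pattern does.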
<think> reasoning preserved at 2.19 bits
Size, speed, and scores — JANG vs MLX
| Model | Method | Bits | Size | MMLU |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | JANG_2L | ~2.x | 187 GB | 92% |
| | JANG_1L | ~2.2 | 112 GB | 86.5% |
| | MLX 4-bit | 4.0 | ~280 GB | 94% |
| | MLX 2-bit / 3-bit | 2–3 | — | NaN |
| Nemotron-3-Super-120B | JANG_4M | ~4.2 | 63 GB | 93% |
| | JANG_2L | ~2.x | 43 GB | 86% |
| | MLX 3-bit | 3.0 | — | Broken |
| Qwen3.5-122B-A10B | JANG_2M | 2.14 | 44.7 GB | 79% |
| | JANG_1L | 2.24 | 46.0 GB | 73% |
| | JANG_2L | 2.19 | 45.3 GB | — |
| | MLX mixed_2_6 | ~2.5 | 45 GB | 46% |
| | MLX 2-bit | 2.0 | 36 GB | 56.5% |
| Qwen3.5-35B-A3B | JANG_4K | 3.99 | 20.1 GB | 77.5% |
| | MLX 4-bit | 4.0 | 18.2 GB | 75.5% |
| | JANG_4S | 4.04 | 20.4 GB | 82% |
| | JANG_2S | 2.17 | 12.8 GB | 65.5% |
| | JANG_2L v2 | 2.28 | 13.3 GB | 56% |
| | MLX mixed_2_6 | ~2.5 | 12.8 GB | ~40% |
| MiniMax-M2.5 (230B) | JANG_2S | 2.06 | 81.6 GB | — |
| | JANG_2L | 2.10 | 82.5 GB | 74% |
| | MLX 4-bit | 4.0 | 119.8 GB | 26.5% |
| | MLX 2-bit | 2.0 | 66.6 GB | 25.0% |
Apple M4 Max 128 GB / M4 Ultra 256 GB · MMLU: 200-question (10 subjects × 20), reasoning enabled for 397B and Nemotron, thinking disabled for others · 2026-03
Qwen3.5-397B: JANG_1L at 112 GB (120 GB GPU peak) fits on 128 GB Macs — 86.5% MMLU with reasoning, 36 tok/s. JANG_2L at 187 GB hits 92% on M4 Ultra 256 GB. MLX 2/3-bit: NaN. MLX 4-bit: 94% but ~280 GB.
Nemotron-3-Super-120B: JANG_4M at 63 GB scores 93% MMLU, 55 tok/s. JANG_2L at 43 GB scores 86%, fits 64 GB Macs. MLX 3-bit: broken. First working Nemotron-H quantization for Apple Silicon.
MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU at 82.5 GB vs MLX 4-bit at 26.5% (119.8 GB). MLX broken at ALL bit levels (26.5%, 24.5%, 25%). JANG is the only way to run MiniMax quantized.
Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% = 82%), confirming the quantization pipeline is lossless at matched bit widths.
Dense model comparisons (1B–7B)
Comparisons at the degradation boundary — the bit width where standard quantization starts producing degenerate output. Same prompts, same temperature, same model. All on M4 Max.
At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.
Highlights — 7B models

- JANG_3M (3.4 bits) vs 3-bit (3.5 bits)
- JANG_3L (3.6 bits) vs 3-bit (3.5 bits)
- JANG_4S (4.1 bits) vs 4-bit (4.5 bits)
- JANG_2S (2.5 bits) vs 2-bit (2.5 bits)
More 7B results

- JANG_3L (3.6 bits) vs 3-bit
- JANG_3M (3.4 bits) vs 3-bit
- JANG_3L (3.6 bits) vs 3-bit
- JANG_2M (2.7 bits) vs 2-bit
- JANG_4L (4.5 bits) vs 4-bit
- JANG_2S (2.5 bits) vs 2-bit
Smaller models (1B–3B)

- JANG_3M (3.4 bits) vs 3-bit
- JANG_2S (2.5 bits) vs 2-bit
- JANG_4S (4.1 bits) vs 4-bit
- JANG_4L (4.5 bits) vs 4-bit
- JANG (4.12 bits) vs 4-bit
- JANG_4S (4.1 bits) vs 4-bit
JANG at 3.37 bits beats 4-bit
Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better
Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64
JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.
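The metric itself is simple to reproduce. A minimal sketch, assuming you already have the final-token logits from a bf16 reference run and a quantized run of the same prompt:

```python
import numpy as np

def logit_mse(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean squared error of quantized-model logits vs a bf16 reference.
    Lower is better; 0 means the quantized model reproduces the reference
    exactly on this prompt."""
    return float(np.mean((ref_logits - quant_logits) ** 2))

# Toy vectors standing in for real logits (real vocabularies have
# >100k entries; the computation is identical).
ref = np.array([1.0, 4.0, -2.0, 0.5])
quant = np.array([1.1, 3.9, -2.2, 0.5])
print(logit_mse(ref, quant))
```

Comparing logits rather than generated text avoids sampling noise, which is why a 0.21 MSE gap is meaningful even at temperature 0.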
All models tested
| Model | Params | Architecture | Tests | Failure mode |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | MoE, Hybrid | MMLU | MLX 2/3-bit → NaN |
| Nemotron-3-Super-120B | 120B | Hybrid Mamba-2 SSM + Latent MoE + Attn | MMLU | MLX 3-bit → broken |
| MiniMax-M2.5 | 230B | MoE 256 experts, top-8 | MMLU | MLX all bits → random (25%) |
| Qwen3.5-122B-A10B | 122B | MoE 256 experts, Hybrid | MMLU | 2-bit → 56.5%, mixed_2_6 → 46% |
| Qwen3.5-35B-A3B | 35B | MoE 256 experts, Hybrid GDN+FA | MMLU+QA | 2-bit → degenerate, mixed_2_6 → broken |
| Qwen3.5-4B | 4B | Hybrid: 24 linear + 8 full attn | 6 | 2-bit → 0/6 correct |
| Mistral-7B | 7B | Mistral GQA 4:1, sliding window | 13 | 3-bit → number sequences |
| Qwen2.5-7B | 7B | Qwen GQA 4:1 | 9 | 3-bit → repetition loop |
| Qwen2.5-3B | 3B | Qwen GQA 8:1 | 6 | 4-bit → echo/loop |
| SmolLM2-1.7B | 1.7B | Llama MHA | 11 | 3-bit → number sequences |
| TinyLlama-1.1B | 1.1B | Llama GQA 8:1 | 11 | 4-bit → topic derail |
| Phi-2 | 2.7B | Phi MHA, GELU MLP | 9 | 2-bit → empty output |
Apple M4 Max 128 GB / M4 Ultra 256 GB · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 12 models · 1B to 397B
JANG_{bits}{size}
11 predefined profiles from ultra-compressed to near-lossless. S = Small (most compression), M = Medium (balanced), L = Large (best quality).
| Profile | MLP | Attention | Embed | lm_head | Avg Bits |
|---|---|---|---|---|---|
| JANG_1L | 2-bit | 8-bit | 8-bit | 8-bit | ~2.2 |
| JANG_2S | 2-bit | 6-bit | 4-bit | 6-bit | ~2.5 |
| JANG_2M | 2-bit | 8-bit | 4-bit | 8-bit | ~2.7 |
| JANG_2L | 2-bit | 8-bit | 6-bit | 8-bit | ~2.9 |
| JANG_3S | 3-bit | 4-bit | 4-bit | 6-bit | ~3.1 |
| JANG_3M | 3-bit | 6-bit | 4-bit | 6-bit | ~3.4 |
| JANG_3L | 3-bit | 8-bit | 4-bit | 8-bit | ~3.6 |
| JANG_4S | 4-bit | 5-bit | 4-bit | 6-bit | ~4.1 |
| JANG_4M | 4-bit | 6-bit | 4-bit | 6-bit | ~4.2 |
| JANG_4L | 4-bit | 8-bit | 4-bit | 8-bit | ~4.5 |
| JANG_6M | 6-bit | 8-bit | 6-bit | 8-bit | ~6.2 |
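Transcribed as a lookup, the profile table above might be used like this (the per-tier bit widths are taken directly from the table; the tensor-kind keys are illustrative):

```python
# (mlp, attention, embed, lm_head) bit widths per profile,
# transcribed from the profile table.
PROFILES = {
    "JANG_1L": (2, 8, 8, 8),
    "JANG_2S": (2, 6, 4, 6),
    "JANG_2M": (2, 8, 4, 8),
    "JANG_2L": (2, 8, 6, 8),
    "JANG_3S": (3, 4, 4, 6),
    "JANG_3M": (3, 6, 4, 6),
    "JANG_3L": (3, 8, 4, 8),
    "JANG_4S": (4, 5, 4, 6),
    "JANG_4M": (4, 6, 4, 6),
    "JANG_4L": (4, 8, 4, 8),
    "JANG_6M": (6, 8, 6, 8),
}

def bits_for(profile: str, tensor_kind: str) -> int:
    """Look up the bit width a profile assigns to a tensor kind."""
    mlp, attn, embed, lm_head = PROFILES[profile]
    return {"mlp": mlp, "attention": attn,
            "embed": embed, "lm_head": lm_head}[tensor_kind]
```

Note the average bits land well below the attention bit width because MLP carries the bulk of the parameters.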
Swift + Metal inference engine
14 custom Metal GPU kernels. Zero-copy mmap loading. Fused dequantization for decode and prefill.
Dequant + GEMV
Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.
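For readers unfamiliar with fused dequant-GEMV, here is an unfused NumPy reference of the arithmetic the kernel performs (the Metal kernel fuses the two steps so the full-precision weights never materialize; the group layout below is a simplifying assumption):

```python
import numpy as np

GROUP = 64  # group_size=64, matching the affine scheme used throughout

def quantize_affine(w: np.ndarray, bits: int):
    """Per-group affine quantization: w ≈ q * scale + zero."""
    g = w.reshape(-1, GROUP)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0                      # guard constant groups
    q = np.round((g - lo) / scale).astype(np.int64)
    return q, scale, lo

def dequant_gemv(q, scale, zero, x: np.ndarray, out_shape) -> np.ndarray:
    """Unfused reference: dequantize the whole matrix, then W @ x.
    The fused kernel does this per-group in registers instead."""
    w = (q * scale + zero).reshape(out_shape)
    return w @ x

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
q, s, z = quantize_affine(W, bits=4)
y = dequant_gemv(q, s, z, x, W.shape)  # approximates W @ x
```

The fused version reads packed codes and applies `scale`/`zero` inline during the dot product, which is what makes quantized decode memory-bandwidth-bound rather than compute-bound.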
Dequant + GEMM
Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.
GQA Attention
Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.
RMSNorm + RoPE
Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.
SwiGLU
Fused SiLU activation + element-wise multiply for gated feed-forward networks.
Quantized Embedding
Direct embedding lookup from quantized weights. No full-table dequantization needed.
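The row-wise trick can be sketched in a few lines of NumPy. The single scale/zero per row is a simplification for illustration (real schemes group within rows):

```python
import numpy as np

def embed_lookup(q_table, scales, zeros, token_ids):
    """Dequantize only the gathered rows, never the full table."""
    rows = q_table[token_ids]                    # gather int codes first
    return rows * scales[token_ids] + zeros[token_ids]

# Toy 4-token vocab, 8-dim embeddings, 4-bit codes.
q_table = (np.arange(32).reshape(4, 8) % 16).astype(np.int64)
scales = np.full((4, 1), 0.5)
zeros = np.zeros((4, 1))
vecs = embed_lookup(q_table, scales, zeros, np.array([2, 0]))  # shape (2, 8)
```

For a 150k-token vocabulary this dequantizes a handful of rows per step instead of the whole table, which is why no full-table dequantization is needed.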
Convert any model
Python tooling to convert HuggingFace models to .jang format. Pick a profile, choose your quantization method, and go. Supports RTN, MSE-optimal grid search, and GPTQ (Hessian-guided) quantization.
6+ architecture families: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, and hybrid models including Qwen 3.5.
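The difference between the first two methods can be sketched per group (GPTQ, which also uses activation statistics, is omitted here; this is an illustrative sketch, not JANG's converter code):

```python
import numpy as np

def rtn_scale(g: np.ndarray, bits: int) -> float:
    """RTN: take the scale straight from the group's min/max range."""
    s = (g.max() - g.min()) / (2**bits - 1)
    return float(s) if s > 0 else 1.0

def recon_mse(g: np.ndarray, s: float, bits: int) -> float:
    """Reconstruction error for a candidate scale (affine, zero = min)."""
    lo = g.min()
    q = np.clip(np.round((g - lo) / s), 0, 2**bits - 1)
    return float(np.mean((g - (q * s + lo)) ** 2))

def mse_scale(g: np.ndarray, bits: int, steps: int = 20) -> float:
    """MSE-optimal grid search: try shrunken scales and keep the best.
    Outliers can make the full-range RTN scale suboptimal, so a slightly
    smaller scale (clipping the tails) often reconstructs better."""
    base = rtn_scale(g, bits)
    candidates = [base * f for f in np.linspace(1.0, 0.5, steps)]
    return min(candidates, key=lambda s: recon_mse(g, s, bits))
```

Since the grid includes the RTN scale itself, the searched scale is never worse than RTN on reconstruction MSE; it just costs more conversion time.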
Run bigger models on less RAM
JANG_3M saves 25% vs 4-bit with comparable quality on 7B+ models. Fit models in unified memory that wouldn't fit before.
Pre-quantized models on HuggingFace
Ready to download. Compatible with vMLX Engine / MLX Studio via the JANG loader.
Run JANG models in MLX Studio
MLX Studio has native JANG support with OpenAI-compatible API,
prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching,
and 20+ agentic coding tools. Load any .jang model and serve it locally —
works with Cursor, Continue, Aider, and any OpenAI API client.
Powered by vMLX Engine,
now open source — pip install vmlx.
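Since the server is OpenAI-compatible, any standard client works. A minimal sketch using only the standard library; the port and model id below are hypothetical placeholders, not values from this page:

```python
import json
import urllib.request

# Hypothetical endpoint and model id; substitute your local server's values.
payload = {
    "model": "qwen3.5-35b-a3b-jang_4s",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.0,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request shape is what Cursor, Continue, and Aider emit under the hood, which is why they work unmodified.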