Mixed-precision quantization for MLX (Apple Silicon) vs llama.cpp (cross-platform). Same goal, different ecosystems.
JANG is the GGUF equivalent for MLX — both use mixed-precision quantization to preserve quality at low bit widths. JANG uses adaptive per-tensor bit allocation with layer sensitivity tiers, achieving 86% MMLU at 4 bits and 79% at 2 bits (200q, Qwen3.5-122B). GGUF K-quants use block-level mixed precision for llama.cpp. GGUF has the larger ecosystem and cross-platform support. JANG has native MLX Metal kernels and zero-copy loading optimized for Apple Silicon.
| Feature | JANG | GGUF |
|---|---|---|
| Target Framework | MLX (Apple Silicon) | llama.cpp (cross-platform) |
| Mixed Precision | Per-tensor variable bits (2, 3, 4, 5, 6, 8) | Block-level K-quants (Q2_K through Q6_K) |
| Bit Allocation Strategy | Layer sensitivity tiers (attention high, MLP low) | Block importance within layers |
| Calibration Required | No (architecture-aware tiers) | Optional (importance matrix) |
| File Format | Safetensors-based (.jang) | Custom binary (.gguf) |
| GPU Kernels | 14 custom Metal kernels (fused dequant + GEMV/GEMM) | Metal + CUDA + Vulkan + OpenCL |
| Model Loading | Zero-copy mmap (0.3-0.9 s for 3-7B models) | mmap supported |
| Platform Support | macOS (Apple Silicon) | macOS, Windows, Linux, Android |
| Supported Bit Widths | 2, 3, 4, 5, 6, 8 per tensor | 2, 3, 4, 5, 6, 8 per block type |
| Architecture Support | Llama, Qwen, Gemma, Phi, MoE, Mamba | Llama, Qwen, Gemma, Phi, MoE, many more |
| Predefined Profiles | 11 profiles (JANG_1L to JANG_6M) | ~10 quant types (Q2_K to Q8_0) |
| Ecosystem Size | New (2026) | Very large, mature |
| License | Apache 2.0 | MIT |
Both JANG and GGUF solve the same problem — reducing model size while preserving quality — but they take fundamentally different approaches to mixed-precision bit allocation.
200-question MMLU subset across 8 subjects. All results measured on the same model (Qwen3.5-122B) with identical evaluation conditions. Higher is better.
| Configuration | Avg Bits | Size | MMLU (200q) |
|---|---|---|---|
| FP16 (baseline) | 16.0 | ~244 GB | 86.5% |
| JANG_4K | 3.99 | 69 GB | 86.0% |
| MLX 4-bit uniform | 4.0 | 64 GB | 85.0% |
| JANG_3M | 3.11 | ~50 GB | 77.5% |
| MLX 3-bit uniform | 3.0 | ~47 GB | 75.5% |
| JANG_2S | 2.11 | 38 GB | 79.0% |
| MLX mixed_2_6 | ~2.5 | 44 GB | 56.5% |
| MLX 2-bit uniform | 2.0 | ~34 GB | 65.5% |
| MLX mixed_1_5 | ~1.5 | ~28 GB | ~40% |
200q MMLU on Qwen3.5-122B. JANG_4K uses budget-neutral allocation: same average bits as MLX 4-bit, but distributes them based on layer sensitivity. JANG_2S achieves 79% at 2.11 bits — 22.5 points above MLX mixed_2_6 (56.5%) while being 6 GB smaller.
The fundamental insight behind JANG is that not all layers contribute equally to output quality. Attention layers — which control what the model "looks at" when generating each token — are far more sensitive to quantization than MLP/FFN layers, which act as learned lookup tables.
At 2 bits, this difference becomes dramatic. MLX uniform quantization applies the same 2-bit precision to every tensor, destroying the attention mechanism's ability to route information correctly. JANG_2S protects attention at 5-6 bits while compressing MLP to 2 bits, preserving 79% MMLU compared to uniform's 65.5% (and mixed_2_6's 56.5%).
At 4 bits, the gap is smaller because even uniform 4-bit retains enough precision for attention. JANG_4K still gains +1 point (86% vs 85%) by giving attention layers extra headroom at 5-6 bits and compressing MLP slightly below 4 bits to stay budget-neutral.
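The tiered allocation described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual JANG implementation or API: the tier names, bit assignments, classification heuristics, and toy parameter counts are all assumptions chosen to mirror the "attention high, MLP low" strategy under a roughly 4-bit budget.

```python
# Hypothetical sketch of tier-based, budget-neutral bit allocation.
# TIERS, classify, and average_bits are illustrative names, not JANG's API.

# Per-tier bit widths under an assumed ~4-bit average budget:
# attention projections get headroom, MLP weights are compressed harder.
TIERS = {
    "attention": 6,   # q/k/v/o projections: most sensitive to quantization
    "embedding": 5,   # token embeddings / output head
    "mlp": 3,         # gate/up/down projections: least sensitive
}

def classify(tensor_name: str) -> str:
    """Map a tensor name to a sensitivity tier (illustrative heuristics)."""
    if any(k in tensor_name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return "attention"
    if any(k in tensor_name for k in ("embed", "lm_head")):
        return "embedding"
    return "mlp"

def average_bits(param_counts: dict[str, int]) -> float:
    """Parameter-weighted average bit width across all tensors."""
    total_bits = sum(TIERS[classify(name)] * n for name, n in param_counts.items())
    return total_bits / sum(param_counts.values())

# Toy single-layer model; parameter counts are made up for illustration.
toy = {
    "layers.0.self_attn.q_proj": 16_000_000,
    "layers.0.self_attn.k_proj": 4_000_000,
    "layers.0.self_attn.v_proj": 4_000_000,
    "layers.0.self_attn.o_proj": 16_000_000,
    "layers.0.mlp.gate_proj": 80_000_000,
    "layers.0.mlp.up_proj": 80_000_000,
    "layers.0.mlp.down_proj": 80_000_000,
    "model.embed_tokens": 60_000_000,
}
print(f"average bits: {average_bits(toy):.2f}")
```

Because MLP tensors dominate the parameter count, even a few high-bit attention tensors barely move the average: the toy model lands well under 4 bits overall while attention sits at 6. This is the arithmetic that lets a profile like JANG_4K match a uniform 4-bit budget while protecting the sensitive layers.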
GGUF is a mature, battle-tested format, and it remains the better choice when you need cross-platform deployment, CUDA/Vulkan hardware support, or the breadth of the llama.cpp ecosystem.
Pre-quantized models available on HuggingFace. Apache 2.0 open source.
Browse JANG Models on HuggingFace
Free · Apache 2.0 · Apple Silicon (M1 or later) · MLX native