2026 Comparison

JANG vs GGUF

Mixed-precision quantization for MLX (Apple Silicon) vs llama.cpp (cross-platform). Same goal, different ecosystems.

Summary

JANG is the GGUF equivalent for MLX — both use mixed-precision quantization to preserve quality at low bit widths. JANG uses adaptive per-tensor bit allocation with layer sensitivity tiers, achieving 86% MMLU at 4 bits and 79% at 2 bits (200q, Qwen3.5-122B). GGUF K-quants use block-level mixed precision for llama.cpp. GGUF has the larger ecosystem and cross-platform support. JANG has native MLX Metal kernels and zero-copy loading optimized for Apple Silicon.

Feature Comparison

Feature | JANG | GGUF
Target Framework | MLX (Apple Silicon) | llama.cpp (cross-platform)
Mixed Precision | Per-tensor variable bits (2, 3, 4, 5, 6, 8 bit) | Block-level K-quants (Q2_K through Q6_K)
Bit Allocation Strategy | Layer sensitivity tiers (attention high, MLP low) | Block importance within layers
Calibration Required | No (architecture-aware tiers) | Optional (importance matrix)
File Format | Safetensors-based (.jang) | Custom binary (.gguf)
GPU Kernels | 14 custom Metal kernels (fused dequant + GEMV/GEMM) | Metal + CUDA + Vulkan + OpenCL
Model Loading | Zero-copy mmap (0.3-0.9 s for 3-7B) | mmap supported
Platform Support | macOS (Apple Silicon) | macOS, Windows, Linux, Android
Supported Bit Widths | 2, 3, 4, 5, 6, 8 per tensor | 2, 3, 4, 5, 6, 8 per block type
Architecture Support | Llama, Qwen, Gemma, Phi, MoE, Mamba | Llama, Qwen, Gemma, Phi, MoE, many more
Predefined Profiles | 11 profiles (JANG_1L to JANG_6M) | ~10 quant types (Q2_K to Q8_0)
Ecosystem Size | New (2026) | Very large, mature
License | Apache 2.0 | MIT

How They Work

Both JANG and GGUF solve the same problem — reducing model size while preserving quality — but they take fundamentally different approaches to mixed-precision bit allocation.

JANG — Per-Tensor Sensitivity Tiers
  • Classifies each tensor by layer type and position
  • Attention layers (q_proj, k_proj, v_proj, o_proj) get 5-8 bits
  • MLP/FFN layers (gate, up, down) compress to 2-4 bits
  • Embedding and output head protected at higher bits
  • No calibration data required — architecture-aware rules
  • 11 predefined profiles from 2.2 to 6.2 average bits
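The tier rules above can be sketched as a name-based lookup. This is an illustrative sketch only: the tier boundaries, bit widths, and the `assign_bits` helper are assumptions for explanation, not JANG's actual allocation tables.

```python
# Hypothetical sketch of JANG-style per-tensor bit allocation.
# No calibration data: bits are chosen from the tensor name alone.

ATTENTION = {"q_proj", "k_proj", "v_proj", "o_proj"}
MLP = {"gate_proj", "up_proj", "down_proj"}

def assign_bits(tensor_name: str) -> int:
    """Assign a bit width from architecture-aware rules (illustrative tiers)."""
    # Embeddings and the output head are protected at higher precision.
    if "embed" in tensor_name or "lm_head" in tensor_name:
        return 6
    # Leaf module name, e.g. "model.layers.0.self_attn.q_proj.weight" -> "q_proj"
    leaf = tensor_name.split(".")[-2] if "." in tensor_name else tensor_name
    if leaf in ATTENTION:
        return 5          # attention is quantization-sensitive: keep 5-8 bits
    if leaf in MLP:
        return 2          # MLP/FFN tolerates aggressive 2-4 bit compression
    return 4              # norms and miscellaneous tensors: middle ground
```

A real profile (JANG_1L through JANG_6M) would vary these per-tier widths to hit a target average, but the classification step works the same way.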
GGUF K-Quants — Block-Level Mixed Precision
  • Divides each tensor into blocks (typically 256 elements)
  • K-quant types mix bit widths within the same tensor
  • Q4_K_M: 4-bit with some 6-bit blocks for important values
  • Optional importance matrix for smarter allocation
  • Block-level granularity vs JANG's tensor-level
  • Mature tooling with imatrix quantization support
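The block-level idea can be shown in a few lines of NumPy. This is a simplified illustration of per-block scaling, not the real K-quant layout (actual K-quants pack codes into super-blocks and quantize the scales themselves):

```python
import numpy as np

BLOCK = 256  # GGUF-style block size: one scale per 256 elements

def quantize_blocks(w: np.ndarray, bits: int = 4):
    """Symmetric per-block quantization: each block gets its own scale."""
    w = w.reshape(-1, BLOCK)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(w / scale), -(qmax + 1), qmax)
    return q.astype(np.int8), scale

def dequantize_blocks(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)
```

Because each block carries its own scale, an outlier weight only degrades precision inside its own 256-element block rather than across the whole tensor; imatrix quantization refines this further by weighting blocks with calibration statistics.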

Benchmark: 200-Question MMLU (Qwen3.5-122B)

200-question MMLU subset across 8 subjects. All results measured on the same model (Qwen3.5-122B) with identical evaluation conditions. Higher is better.

Configuration | Avg Bits | Size | MMLU (200q)
FP16 (baseline) | 16.0 | ~244 GB | 86.5%
JANG_4K | 3.99 | 69 GB | 86.0%
MLX 4-bit uniform | 4.0 | 64 GB | 85.0%
JANG_3M | 3.11 | ~50 GB | 77.5%
MLX 3-bit uniform | 3.0 | ~47 GB | 75.5%
JANG_2S | 2.11 | 38 GB | 79.0%
MLX mixed_2_6 | ~2.5 | 44 GB | 56.5%
MLX 2-bit uniform | 2.0 | ~34 GB | 65.5%
MLX mixed_1_5 | ~1.5 | ~28 GB | ~40%

200q MMLU on Qwen3.5-122B. JANG_4K uses budget-neutral allocation: same average bits as MLX 4-bit, but distributes them based on layer sensitivity. JANG_2S achieves 79% at 2.11 bits — 22.5 points above MLX mixed_2_6 (56.5%) while being 6 GB smaller.

Why Layer Sensitivity Matters

The fundamental insight behind JANG is that not all layers contribute equally to output quality. Attention layers — which control what the model "looks at" when generating each token — are far more sensitive to quantization than MLP/FFN layers, which act as learned lookup tables.

At 2 bits, this difference becomes dramatic. MLX uniform quantization applies the same 2-bit precision to every tensor, destroying the attention mechanism's ability to route information correctly. JANG_2S protects attention at 5-6 bits while compressing MLP to 2 bits, preserving 79% MMLU compared to uniform's 65.5% (and mixed_2_6's 56.5%).

At 4 bits, the gap is smaller because even uniform 4-bit retains enough precision for attention. JANG_4K still gains +1 point (86% vs 85%) by giving attention layers extra headroom at 5-6 bits and compressing MLP slightly below 4 bits to stay budget-neutral.
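The budget-neutral trade-off is simple arithmetic. As a back-of-envelope check (the one-third/two-thirds parameter split is an illustrative assumption for a Llama-style layer, not JANG's actual tensor inventory):

```python
# If attention holds ~1/3 of the layer weights at 6 bits, what must the
# MLP bit width be for the overall average to stay at 4.0 bits?
attn_frac, mlp_frac = 1 / 3, 2 / 3
attn_bits = 6.0
target_avg = 4.0

# Solve: attn_frac * attn_bits + mlp_frac * mlp_bits = target_avg
mlp_bits = (target_avg - attn_frac * attn_bits) / mlp_frac
print(mlp_bits)  # MLP lands at 3.0 bits, slightly below 4, keeping the budget neutral
```

The same equation explains why the 2-bit regime is where sensitivity-aware allocation pays off most: protecting attention at 5-6 bits forces only a modest extra squeeze on the (more tolerant) MLP weights.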

When to Choose GGUF

GGUF is a mature, battle-tested format with significant advantages in certain scenarios.

GGUF Advantages

  • Cross-platform — GGUF runs on macOS, Windows, Linux, and Android via llama.cpp. JANG is macOS-only (Apple Silicon).
  • Larger ecosystem — Thousands of pre-quantized GGUF models on HuggingFace, extensive tooling, and broad community support.
  • More model architectures — llama.cpp supports more model families than MLX currently does.
  • CUDA support — GGUF works with NVIDIA GPUs via llama.cpp's CUDA backend. JANG is Metal-only.
  • Importance matrix — GGUF's imatrix quantization can use calibration data for smarter per-block allocation.
  • Established tooling — Mature conversion, quantization, and benchmarking tools (llama-quantize, llama-bench, etc.).

Frequently Asked Questions

What is the difference between JANG and GGUF?
JANG is the GGUF equivalent for MLX on Apple Silicon. Both use mixed-precision quantization, but JANG allocates bits per tensor based on layer sensitivity (attention gets more bits, MLP gets fewer), while GGUF K-quants mix bit widths at the block level within each tensor. JANG targets MLX with 14 custom Metal kernels; GGUF targets llama.cpp with cross-platform GPU support.
Is JANG better than GGUF on Apple Silicon?
For MLX-based inference on Apple Silicon, JANG is purpose-built. It uses native Metal GPU kernels with zero-copy mmap loading (0.3-0.9s for 3-7B models) and achieves 86% MMLU at 4 bits (200q). GGUF through llama.cpp also supports Metal but is designed for cross-platform compatibility rather than Apple Silicon optimization.
Can I convert GGUF models to JANG?
Not directly. JANG quantization starts from the original model weights (FP16/BF16 safetensors) and applies its own sensitivity-aware bit allocation. The recommended workflow is to quantize the original HuggingFace model using JANG's Python tools, which produce .jang files optimized for MLX inference.
How does JANG_4K compare to GGUF Q4_K_M?
Both target approximately 4 bits per weight on average. JANG_4K scores 86% on 200q MMLU (Qwen3.5-122B) with per-tensor variable allocation. GGUF Q4_K_M uses block-level mixed precision within each tensor. Direct comparison requires running both on the same model and benchmark, but the approaches are philosophically similar — both protect important information at higher precision.

Try JANG Models

Pre-quantized models available on HuggingFace. Apache 2.0 open source.

Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · MLX native