2026 Comparison

JANG vs GPTQ

Per-tensor mixed precision (architecture-aware) vs Hessian-based quantization (calibration-based). Two philosophies for low-bit inference.

Summary

JANG assigns variable bit widths per tensor without calibration. GPTQ uses second-order (Hessian) optimization with calibration data to minimize quantization error at a fixed bit width. JANG targets MLX on Apple Silicon with 14 Metal kernels; GPTQ targets CUDA GPUs. JANG supports 2-8 bit mixed precision natively; GPTQ typically applies uniform 2-4 bits across all layers. At 4 bits, JANG achieves 86% MMLU (200q); at 2 bits, 79%.

Feature Comparison

Feature JANG GPTQ
Target Hardware Apple Silicon (Metal GPU) NVIDIA GPUs (CUDA)
Quantization Method Layer sensitivity tiers (architecture-aware, no calibration) Hessian-based optimization (second-order, calibration required)
Calibration Required No Yes (128+ samples)
Mixed Precision Per-tensor variable (2-8 bit) Typically uniform per model (all layers same bits)
Supported Bit Widths 2, 3, 4, 5, 6, 8 2, 3, 4, 8
Quantization Speed Seconds (no forward passes) Minutes to hours (Hessian computation)
Error Optimization Structural (layer-type rules) Per-column Hessian minimization
GPU Kernels 14 custom Metal kernels Custom CUDA kernels (Marlin, ExLlama)
Model Loading Zero-copy mmap (0.3-0.9s) Standard GPU loading
Inference Framework MLX (Apple native) vLLM, TGI, ExLlamaV2, Transformers
Act-Order Support N/A (per-tensor, not per-column) Yes (quantize by activation order)
Group Size Per-tensor granularity Configurable (32, 64, 128)
Predefined Profiles 11 profiles (JANG_1L to JANG_6M) Manual bit-width selection
License Apache 2.0 Apache 2.0

How They Work

JANG and GPTQ represent two fundamentally different approaches to the quantization problem. GPTQ is mathematically rigorous at the per-column level; JANG is architecturally aware at the per-tensor level.

JANG — Sensitivity Tier Allocation
  • Classifies each tensor by architectural role
  • Attention projections (q, k, v, o): 5-8 bits
  • MLP layers (gate, up, down): 2-4 bits
  • Embedding and lm_head: protected at higher bits
  • No calibration — instant quantization
  • Budget-neutral: same average bits, smarter distribution
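The tier rules above can be sketched as a simple lookup. This is an illustrative reconstruction, not JANG's actual API: the function name, default bit widths, and key substrings are assumptions chosen to match the bullets above.

```python
# Hypothetical sketch of JANG-style sensitivity-tier allocation.
# No calibration data, no forward passes: bits follow the tensor's role.

ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_KEYS = ("gate_proj", "up_proj", "down_proj")

def assign_bits(tensor_name: str, attn_bits: int = 6, mlp_bits: int = 3) -> int:
    """Map a tensor name to a bit width by layer type (illustrative rules)."""
    if "embed_tokens" in tensor_name or "lm_head" in tensor_name:
        return 8                  # protect embeddings and the output head
    if any(k in tensor_name for k in ATTN_KEYS):
        return attn_bits          # attention projections: higher precision
    if any(k in tensor_name for k in MLP_KEYS):
        return mlp_bits           # MLP layers tolerate lower bits
    return 4                      # default for everything else

plan = {name: assign_bits(name) for name in [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "lm_head.weight",
]}
```

Because the rules are pure string matching, the whole plan for a 100B+ model is computed in milliseconds; the average bit width is then tuned by choosing `attn_bits`/`mlp_bits` to hit a budget.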
GPTQ — Hessian Optimization
  • Computes Hessian matrix H = 2XᵀX per layer
  • Quantizes columns in order of importance (act-order)
  • Updates remaining weights to compensate for error
  • Minimizes squared error per layer using calibration data
  • Optional group quantization (32-128 element groups)
  • Mathematically optimal for the given calibration set
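The column-by-column loop can be sketched as follows. This is a toy version for intuition only: real GPTQ implementations use a Cholesky factorization of the inverse Hessian, activation ordering, and per-group scales rather than the single symmetric scale and direct inverse used here.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Toy GPTQ sketch: quantize columns of W sequentially, compensating the
    error on the remaining columns via the damped inverse Hessian."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    H = 2.0 * X.T @ X                               # H = 2 XᵀX from calibration data
    H += damp * np.mean(np.diag(H)) * np.eye(d)     # damping keeps H invertible
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                  # one symmetric scale (simplified)
    Q = np.zeros_like(W)
    for j in range(d):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]            # normalized quantization error
        if j + 1 < d:
            # push the error onto not-yet-quantized columns so they can absorb it
            W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q, scale
```

The compensation step is what distinguishes GPTQ from plain rounding, and it is also why the cost scales with the square of the layer dimension.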

The key difference: GPTQ optimizes within each layer to find the best quantized values for a given bit width. JANG optimizes across layers to determine which layers deserve more bits. These approaches are complementary in principle — JANG's Python tools include GPTQ as one of the quantization methods that can be applied after tier allocation.

JANG Benchmark Data: 200-Question MMLU

Qwen3.5-122B on 200-question MMLU subset. JANG's mixed-precision allocation demonstrates the value of per-tensor variable bit widths compared to uniform approaches.

Configuration Avg Bits Size MMLU (200q)
FP16 (baseline) 16.0 ~244 GB 86.5%
JANG_4K (mixed 4-bit) 3.99 69 GB 86.0%
MLX uniform 4-bit 4.0 64 GB 85.0%
JANG_3M 3.11 ~50 GB 77.5%
MLX uniform 3-bit 3.0 ~47 GB 75.5%
JANG_2S (mixed 2-bit) 2.11 38 GB 79.0%
MLX mixed_2_6 ~2.5 44 GB 56.5%
MLX uniform 2-bit 2.0 ~34 GB 65.5%

200q MMLU on Qwen3.5-122B. JANG's per-tensor variable allocation is most impactful at low bits: JANG_2S (79%) vs MLX mixed_2_6 (56.5%) is a 22.5-point advantage. GPTQ at 2-bit on the same model would use Hessian optimization but apply uniform 2-bit to all layers, missing the structural insight that attention needs more precision than MLP.

Quantization Speed

GPTQ's Hessian computation is the main bottleneck. For each layer, GPTQ must compute H = 2XᵀX using calibration data, then process each column sequentially (or in groups), updating remaining weights after each quantization step. This is an O(d²) operation per layer.

For a 70B parameter model, GPTQ calibration and quantization typically takes 1-4 hours on an A100 GPU. For 120B+ models, it can take 6+ hours and requires significant GPU memory to hold both the model and the Hessian matrices.

JANG quantization is near-instant. Since bit allocation is determined by predefined tiers (no calibration, no forward passes, no Hessian computation), quantizing a 120B model takes seconds. The actual weight quantization uses fast round-to-nearest (RTN) or optional MSE-optimal/GPTQ methods per tensor.
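For contrast with the Hessian loop, here is what a data-free RTN pass over one tensor looks like — a minimal sketch, not JANG's actual kernel code. It is a single linear scan with one scale, which is why the whole-model pass completes in seconds.

```python
import numpy as np

def rtn_quantize(w, bits):
    """Round-to-nearest quantization of one tensor: one symmetric scale,
    no calibration data, no forward passes -- an O(n) pass per tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest guarantees each reconstructed weight is within half a quantization step of the original, which is the accuracy floor JANG's tier allocation works around by spending more bits where that step size matters most.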

When to Choose GPTQ

GPTQ Advantages

  • Mathematically optimal — GPTQ minimizes quantization error per layer using second-order information. For a given uniform bit width, it finds provably better weight values than RTN.
  • CUDA ecosystem — Deep integration with vLLM, TGI, ExLlamaV2, and Transformers for server-side inference on NVIDIA hardware.
  • Act-order quantization — Quantizing columns by activation importance order further reduces error, especially at 2-3 bits.
  • Wide model support — Pre-quantized GPTQ models for hundreds of architectures on HuggingFace. TheBloke and other quantizers provide extensive coverage.
  • ExLlamaV2 inference — Fastest GPTQ inference kernel, optimized for single-user interactive use on NVIDIA GPUs.

Frequently Asked Questions

What is the difference between JANG and GPTQ?
JANG assigns variable bit widths per tensor based on layer sensitivity (attention vs MLP) without calibration data. GPTQ applies a uniform bit width across all layers but uses Hessian-based optimization with calibration data to find the best quantized weight values within that constraint. JANG targets MLX/Apple Silicon; GPTQ targets CUDA/NVIDIA GPUs.
Does GPTQ work on Apple Silicon?
GPTQ inference is primarily designed for NVIDIA CUDA GPUs. There is no native Metal inference path for GPTQ models. On Apple Silicon, use JANG for mixed-precision quantized models or MLX's built-in uniform quantization. JANG provides 14 custom Metal kernels with fused dequant+GEMV/GEMM operations.
Can JANG use GPTQ internally?
Yes. JANG's Python quantization tools support RTN (round-to-nearest), MSE-optimal, and GPTQ as the per-tensor quantization method. After JANG determines the bit allocation per tensor (e.g., 6 bits for attention, 3 bits for MLP), each tensor can optionally be quantized using GPTQ's Hessian method for better within-tensor accuracy.
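The "tiers first, method second" split described above can be sketched as a dispatch over an already-decided bit plan. The function names here are illustrative, not JANG's actual tooling; the point is that the per-tensor method is a pluggable slot where a GPTQ-style optimizer could replace plain rounding at the same bit widths.

```python
import numpy as np

def rtn(w, bits):
    """Data-free round-to-nearest (default per-tensor method in this sketch)."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    return np.clip(np.round(w / s), -qmax - 1, qmax) * s

def quantize_with_plan(tensors, bit_plan, per_tensor_method=rtn):
    """Apply a pre-computed bit plan tensor by tensor. A Hessian-based method
    taking (w, bits) could be passed in place of rtn without changing the plan."""
    return {name: per_tensor_method(w, bit_plan[name])
            for name, w in tensors.items()}
```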
Which is faster to quantize?
JANG is dramatically faster. With RTN quantization, JANG completes in seconds for any model size since it requires no forward passes or Hessian computation. GPTQ requires computing H = 2XᵀX per layer using calibration data, then per-column optimization. For a 70B model, GPTQ takes 1-4 hours on an A100; for 120B+, 6+ hours.

Try JANG on Apple Silicon

Per-tensor mixed precision. No calibration. 14 Metal kernels. Pre-quantized models available.

Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · Instant quantization