Per-tensor mixed precision (architecture-aware) vs Hessian-based quantization (calibration-based). Two philosophies for low-bit inference.
JANG assigns variable bit widths per tensor without calibration. GPTQ uses second-order (Hessian) optimization with calibration data to minimize quantization error at a fixed bit width. JANG targets MLX on Apple Silicon with 14 Metal kernels; GPTQ targets CUDA GPUs. JANG supports 2-8 bit mixed precision natively; GPTQ typically applies uniform 2-4 bits across all layers. At 4 bits, JANG achieves 86% MMLU (200q); at 2 bits, 79%.
| Feature | JANG | GPTQ |
|---|---|---|
| Target Hardware | Apple Silicon (Metal GPU) | NVIDIA GPUs (CUDA) |
| Quantization Method | Layer sensitivity tiers (architecture-aware, no calibration) | Hessian-based optimization (second-order, calibration required) |
| Calibration Required | No | Yes (128+ samples) |
| Mixed Precision | Per-tensor variable (2-8 bit) | Typically uniform per model (all layers same bits) |
| Supported Bit Widths | 2, 3, 4, 5, 6, 8 | 2, 3, 4, 8 |
| Quantization Speed | Seconds (no forward passes) | Minutes to hours (Hessian computation) |
| Error Optimization | Structural (layer-type rules) | Per-column Hessian minimization |
| GPU Kernels | 14 custom Metal kernels | Custom CUDA kernels (Marlin, ExLlama) |
| Model Loading | Zero-copy mmap (0.3-0.9s) | Standard GPU loading |
| Inference Framework | MLX (Apple native) | vLLM, TGI, ExLlamaV2, Transformers |
| Act-Order Support | N/A (per-tensor, not per-column) | Yes (quantize by activation order) |
| Group Size | Per-tensor granularity | Configurable (32, 64, 128) |
| Predefined Profiles | 11 profiles (JANG_1L to JANG_6M) | Manual bit-width selection |
| License | Apache 2.0 | Apache 2.0 |
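The zero-copy mmap loading row above can be illustrated with a minimal sketch. This is not JANG's actual loader; the function name, offset handling, and dtype are illustrative assumptions. The idea is that the weight file is mapped into the address space and viewed as an array, so the OS faults pages in on demand instead of copying everything up front.

```python
import mmap

import numpy as np


def load_tensor(path, offset, shape, dtype=np.float16):
    """Hypothetical zero-copy load: mmap the file and view a region as an
    ndarray. No bytes are copied; pages are faulted in lazily by the OS."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    count = int(np.prod(shape))
    # frombuffer creates a read-only view over the mapping, not a copy
    arr = np.frombuffer(mm, dtype=dtype, count=count, offset=offset)
    return arr.reshape(shape)
```

Because the returned array is a view over the mapping, load time is dominated by page-table setup rather than file size, which is consistent with sub-second load times even for multi-gigabyte checkpoints.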
JANG and GPTQ represent two fundamentally different approaches to the quantization problem. GPTQ is mathematically rigorous at the per-column level; JANG is architecturally aware at the per-tensor level.
The key difference: GPTQ optimizes within each layer to find the best quantized values for a given bit width. JANG optimizes across layers to determine which layers deserve more bits. These approaches are complementary in principle — JANG's Python tools include GPTQ as one of the quantization methods that can be applied after tier allocation.
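Cross-layer bit allocation of this kind can be sketched in a few lines. The rules, tensor names, and bit offsets below are hypothetical illustrations of the idea, not JANG's actual tier tables:

```python
def allocate_bits(tensor_name: str, base_bits: int = 4) -> int:
    """Toy architecture-aware allocation: pick a bit width per tensor
    from structural rules on its name (hypothetical rules, capped at 8)."""
    if "embed" in tensor_name or "lm_head" in tensor_name:
        return min(base_bits + 2, 8)   # embeddings: most sensitive
    if "attn" in tensor_name:
        return min(base_bits + 1, 8)   # attention: moderately sensitive
    if "mlp" in tensor_name:
        return base_bits               # MLP: most tolerant of low bits
    return base_bits


plan = {name: allocate_bits(name, base_bits=2)
        for name in ["model.embed_tokens",
                     "layers.0.attn.q_proj",
                     "layers.0.mlp.gate_proj"]}
# -> {'model.embed_tokens': 4, 'layers.0.attn.q_proj': 3,
#     'layers.0.mlp.gate_proj': 2}
```

No forward passes are needed to produce the plan, which is why this style of allocation runs in seconds; the per-tensor quantizer (RTN, MSE-optimal, or GPTQ) is then applied with the assigned bit width.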
Qwen3.5-122B on 200-question MMLU subset. JANG's mixed-precision allocation demonstrates the value of per-tensor variable bit widths compared to uniform approaches.
| Configuration | Avg Bits | Size | MMLU (200q) |
|---|---|---|---|
| FP16 (baseline) | 16.0 | ~244 GB | 86.5% |
| JANG_4K (mixed 4-bit) | 3.99 | 69 GB | 86.0% |
| MLX uniform 4-bit | 4.0 | 64 GB | 85.0% |
| JANG_3M | 3.11 | ~50 GB | 77.5% |
| MLX uniform 3-bit | 3.0 | ~47 GB | 75.5% |
| JANG_2S (mixed 2-bit) | 2.11 | 38 GB | 79.0% |
| MLX mixed_2_6 | ~2.5 | 44 GB | 56.5% |
| MLX uniform 2-bit | 2.0 | ~34 GB | 65.5% |
200q MMLU on Qwen3.5-122B. JANG's per-tensor variable allocation is most impactful at low bits: JANG_2S (79%) vs MLX mixed_2_6 (56.5%) is a 22.5-point advantage. GPTQ at 2-bit on the same model would use Hessian optimization but apply uniform 2-bit to all layers, missing the structural insight that attention needs more precision than MLP.
GPTQ's Hessian computation is the main bottleneck. For each layer, GPTQ must compute H = 2XᵀX from calibration activations, then process each column sequentially (or in groups), updating the remaining weights after each quantization step via the inverse Hessian. The sequential column loop makes the per-layer cost roughly O(d³) in the layer dimension.
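The per-column loop can be illustrated with a stripped-down NumPy sketch. This is not a drop-in GPTQ implementation: it uses a crude per-tensor scale, a dense matrix inverse instead of the Cholesky-based formulation, and no group handling.

```python
import numpy as np


def gptq_quantize(W, X, bits=4, damp=0.01):
    """Simplified GPTQ-style loop. W: (rows, d) weights, X: (n, d)
    calibration activations. Quantize columns in order, propagating the
    quantization error to the remaining columns via the inverse Hessian."""
    d = W.shape[1]
    H = 2.0 * X.T @ X                                   # H = 2 XᵀX
    H += damp * np.mean(np.diag(H)) * np.eye(d)         # dampening
    Hinv = np.linalg.inv(H)
    W = W.copy()
    Q = np.zeros_like(W)
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(W)) / qmax                    # crude global scale
    for j in range(d):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax)
        Q[:, j] = q * scale
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j:] -= np.outer(err, Hinv[j, j:])          # update the rest
    return Q
```

The O(d) column loop, each step touching the O(d) remaining columns across all rows, is what makes GPTQ slow relative to calibration-free rounding.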
For a 70B parameter model, GPTQ calibration and quantization typically takes 1-4 hours on an A100 GPU. For 120B+ models, it can take 6+ hours and requires significant GPU memory to hold both the model and the Hessian matrices.
JANG quantization is near-instant. Since bit allocation is determined by predefined tiers (no calibration, no forward passes, no Hessian computation), quantizing a 120B model takes seconds. The actual weight quantization uses fast round-to-nearest (RTN) or optional MSE-optimal/GPTQ methods per tensor.
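RTN itself is only a few lines; the sketch below assumes a single symmetric per-tensor scale for simplicity, whereas production quantizers typically use per-group scales and zero points.

```python
import numpy as np


def rtn_quantize(w, bits):
    """Round-to-nearest quantization with one symmetric scale per tensor.
    No calibration data, no forward passes: rounding error is at most
    half a quantization step."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(w))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale          # store integers plus one scale per tensor


def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Because each tensor is processed independently with a closed-form rule, the whole model quantizes in one pass over the weights, which is why the method is calibration-free and fast.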
Per-tensor mixed precision. No calibration. 14 Metal kernels. Pre-quantized models available.

Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · Instant quantization