Adaptive quantization for Apple Silicon (MLX) vs NVIDIA GPUs (CUDA). Two approaches to preserving quality at low bits.
JANG and AWQ solve similar problems on different hardware. AWQ uses activation-aware calibration to identify and protect important weights on CUDA GPUs. JANG uses architecture-aware layer sensitivity tiers on Apple Silicon — no calibration data required. JANG supports 2-8 bit mixed precision per tensor (AWQ is primarily 4-bit). At 4 bits, JANG achieves 86% MMLU (200q); at 2 bits, 79% where MLX's mixed_2_6 baseline gets 56.5%.
| Feature | JANG | AWQ |
|---|---|---|
| Target Hardware | Apple Silicon (Metal GPU) | NVIDIA GPUs (CUDA) |
| Quantization Approach | Layer sensitivity tiers<br>Architecture-aware rules | Activation-aware weight scaling<br>Calibration-based |
| Calibration Required | No | Yes (calibration dataset) |
| Supported Bit Widths | 2, 3, 4, 5, 6, 8 per tensor | Primarily 4-bit (INT4)<br>Some 3-bit support |
| Mixed Precision | Per-tensor variable bits<br>11 predefined profiles | Uniform within model<br>All layers same bit width |
| Quantization Granularity | Per-tensor (layer-level) | Per-channel scaling |
| GPU Kernels | 14 custom Metal kernels | Custom CUDA kernels |
| Inference Framework | MLX (Apple native) | vLLM, TGI, TensorRT-LLM |
| Model Loading | Zero-copy mmap (0.3-0.9s) | Standard GPU loading |
| Quantization Speed | Fast (no forward passes) | Slower (requires calibration runs) |
| Quality at 2-bit | 79% MMLU (200q) | Not typically supported |
| Quality at 4-bit | 86% MMLU (200q) | Comparable (calibration-dependent) |
| File Format | Safetensors (.jang) | Safetensors (.safetensors) |
| License | Apache 2.0 | MIT |
AWQ and JANG both recognize that not all weights are equally important, but they identify and protect those weights in fundamentally different ways.
The key philosophical difference: AWQ determines importance empirically (by running data), while JANG determines importance structurally (by layer type). JANG's approach means you never need a calibration dataset, and quantization is nearly instant. AWQ's approach can potentially find data-specific patterns but requires GPU time and a representative dataset.
Benchmark: Qwen3.5-122B on a 200-question MMLU subset. AWQ numbers are not directly comparable (different evaluation setups), but JANG's results against MLX baselines demonstrate the value of mixed-precision allocation.
| Configuration | Avg Bits | Size | MMLU (200q) |
|---|---|---|---|
| FP16 (baseline) | 16.0 | ~244 GB | 86.5% |
| JANG_4K (mixed 4-bit) | 3.99 | 69 GB | 86.0% |
| MLX uniform 4-bit | 4.0 | 64 GB | 85.0% |
| JANG_2S (mixed 2-bit) | 2.11 | 38 GB | 79.0% |
| MLX mixed_2_6 | ~2.5 | 44 GB | 56.5% |
| MLX uniform 2-bit | 2.0 | ~34 GB | 65.5% |
200q MMLU on Qwen3.5-122B. JANG_2S achieves 79% at 2.11 average bits — 22.5 points above MLX mixed_2_6 and 6 GB smaller. AWQ typically operates at 4-bit only, making the 2-bit range a JANG-exclusive capability.
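The "average bits" column above is simply a parameter-weighted mean over each tensor's bit width. A minimal sketch of that arithmetic, with made-up tensor sizes and bit widths (not JANG's actual allocation):

```python
def average_bits(allocation):
    """Parameter-weighted average bit width.

    allocation: list of (param_count, bits) pairs, one per tensor.
    """
    total_params = sum(n for n, _ in allocation)
    total_bits = sum(n * b for n, b in allocation)
    return total_bits / total_params

# Toy model: attention tensors kept at higher precision than MLP tensors.
# Proportions here are hypothetical, purely for illustration.
toy = [
    (1_000_000, 4),  # attention projections at 4-bit
    (3_000_000, 2),  # MLP weights at 2-bit
]
print(average_bits(toy))  # 2.5
```

Skewing more parameters toward low-bit tiers pulls the average down (JANG_2S lands at 2.11) while the high-precision tiers protect the layers that dominate quality.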
One of the most significant practical differences between JANG and AWQ is the calibration requirement.
AWQ requires running a calibration dataset (typically 128-512 samples from a text corpus like C4 or WikiText) through the full model to measure activation magnitudes. This determines which weight channels are most important. The process requires a GPU capable of loading the full model in FP16 and takes minutes to hours depending on model size and calibration set.
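A conceptual sketch of what that calibration measures, in the spirit of AWQ but not its actual implementation: per-channel activation magnitude averaged over calibration samples, which identifies the channels whose weights deserve protection.

```python
import numpy as np

def channel_importance(activations):
    """Estimate channel salience from calibration activations.

    activations: (num_samples, hidden_dim) array of inputs to a layer.
    Returns mean absolute magnitude per channel.
    """
    return np.abs(activations).mean(axis=0)

# Hypothetical calibration batch; channel 3 carries unusually large
# activations, mimicking a salient channel.
rng = np.random.default_rng(0)
calib = rng.normal(size=(256, 8))
calib[:, 3] *= 10.0

importance = channel_importance(calib)
salient = int(importance.argmax())
print(salient)  # 3
```

In AWQ proper, this importance signal drives per-channel weight scaling before quantization; the point here is only that the signal comes from running real data through the model, which is exactly the step JANG avoids.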
JANG skips calibration entirely. Its bit allocation is determined by predefined sensitivity tiers based on the model architecture. Attention layers are structurally more important than MLP layers across virtually all transformer architectures — this insight is baked into JANG's profiles. Quantization completes in seconds with no forward passes, no GPU memory overhead for calibration, and no dependency on a representative dataset.
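The structural approach can be sketched as a pure function of the tensor's name: no data, no forward passes. The tier rules and bit widths below are illustrative, not JANG's actual profiles.

```python
# Hypothetical sensitivity tiers: attention layers get more precision
# than MLP layers, reflecting their structural importance.
TIER_BITS = {
    "attention": 4,
    "mlp": 2,
    "default": 4,
}

def bits_for(tensor_name):
    """Pick a bit width from the tensor's name alone."""
    if "attn" in tensor_name or "attention" in tensor_name:
        return TIER_BITS["attention"]
    if "mlp" in tensor_name or "ffn" in tensor_name:
        return TIER_BITS["mlp"]
    return TIER_BITS["default"]

print(bits_for("model.layers.0.self_attn.q_proj"))  # 4
print(bits_for("model.layers.0.mlp.gate_proj"))     # 2
```

Because the allocation is a lookup rather than a measurement, quantizing a model is just a pass over its tensors, which is why it completes in seconds.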
No calibration required. 2-8 bit mixed precision. Pre-quantized models on HuggingFace.
Browse JANG Models on HuggingFace
Free · Apache 2.0 · Apple Silicon (M1 or later) · No calibration needed