2026 Comparison

JANG vs AWQ

Adaptive quantization for Apple Silicon (MLX) vs NVIDIA GPUs (CUDA). Two approaches to preserving quality at low bits.

Summary

JANG and AWQ solve similar problems on different hardware. AWQ uses activation-aware calibration to identify and protect important weights on CUDA GPUs. JANG uses architecture-aware layer sensitivity tiers on Apple Silicon, with no calibration data required. JANG supports 2-8 bit mixed precision per tensor (AWQ is primarily 4-bit). At roughly 4 average bits, JANG scores 86.0% MMLU (200q); at roughly 2 average bits, it scores 79.0% where MLX's comparable mixed_2_6 scheme scores 56.5%.

Feature Comparison

| Feature | JANG | AWQ |
| Target Hardware | Apple Silicon (Metal GPU) | NVIDIA GPUs (CUDA) |
| Quantization Approach | Layer sensitivity tiers; architecture-aware rules | Activation-aware weight scaling; calibration-based |
| Calibration Required | No | Yes (calibration dataset) |
| Supported Bit Widths | 2, 3, 4, 5, 6, 8 per tensor | Primarily 4-bit (INT4); some 3-bit support |
| Mixed Precision | Per-tensor variable bits; 11 predefined profiles | Uniform within model; all layers same bit width |
| Quantization Granularity | Per-tensor (layer-level) | Per-channel scaling |
| GPU Kernels | 14 custom Metal kernels | Custom CUDA kernels |
| Inference Framework | MLX (Apple native) | vLLM, TGI, TensorRT-LLM |
| Model Loading | Zero-copy mmap (0.3-0.9 s) | Standard GPU loading |
| Quantization Speed | Fast (no forward passes) | Slower (requires calibration runs) |
| Quality at 2-bit | 79% MMLU (200q) | Not typically supported |
| Quality at 4-bit | 86% MMLU (200q) | Comparable (calibration-dependent) |
| File Format | Safetensors (.jang) | Safetensors (.safetensors) |
| License | Apache 2.0 | MIT |

How They Work

AWQ and JANG both recognize that not all weights are equally important, but they identify and protect important weights using fundamentally different methods.

JANG — Layer Sensitivity Tiers
  • Classifies tensors by type: attention vs MLP vs embedding
  • Attention projections (q, k, v, o) receive 5-8 bits
  • MLP/FFN projections (gate, up, down) compress to 2-4 bits
  • No calibration data needed — rules are architecture-aware
  • Quantization is fast: no forward passes required
  • 11 profiles from JANG_1L (2.2 bits) to JANG_6M (6.2 bits)
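The tier-based allocation above can be sketched as a pure name-to-bits lookup. This is a hypothetical illustration, not JANG's actual profile tables: the function `assign_bits`, the tier values, and the profile names used as keys are assumptions. The point it demonstrates is that bit widths follow from the tensor's role alone, so no calibration data or forward passes are needed.

```python
# Illustrative sketch of architecture-aware bit allocation (assumed tiers,
# not JANG's real profiles): bits are chosen from the tensor's name alone.

ATTENTION_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_KEYS = ("gate_proj", "up_proj", "down_proj")

def assign_bits(tensor_name: str, profile: str = "2S") -> int:
    """Map a tensor name to a bit width using layer-sensitivity tiers."""
    # Hypothetical tier table: attention gets high precision, MLP low,
    # everything else (embeddings, norms) stays near full precision.
    tiers = {
        "2S": {"attention": 5, "mlp": 2, "other": 8},
        "4K": {"attention": 6, "mlp": 4, "other": 8},
    }[profile]
    if any(key in tensor_name for key in ATTENTION_KEYS):
        return tiers["attention"]
    if any(key in tensor_name for key in MLP_KEYS):
        return tiers["mlp"]
    return tiers["other"]

print(assign_bits("model.layers.0.self_attn.q_proj.weight"))  # attention tier: 5
print(assign_bits("model.layers.0.mlp.down_proj.weight"))     # MLP tier: 2
```

Because the rule is a dictionary lookup rather than a measurement, quantizing an entire model reduces to iterating over tensor names, which is why no GPU time is needed before quantization starts.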

AWQ — Activation-Aware Scaling
  • Runs calibration data through the model
  • Measures activation magnitudes per channel
  • Identifies "salient" weights based on activation patterns
  • Scales salient weights up before quantization
  • Applies per-channel scaling factors during inference
  • Typically 4-bit (INT4) with group quantization
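The scaling steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not AWQ's actual implementation: the function name `awq_scale_and_quantize`, the fixed `alpha` exponent, and the per-tensor rounding are assumptions (real AWQ searches over scale exponents and uses group quantization).

```python
import numpy as np

def awq_scale_and_quantize(W, act_mean, bits=4, alpha=0.5):
    """Simplified AWQ-style scaling: protect salient input channels.

    W:        weight matrix, shape (out_features, in_features)
    act_mean: mean |activation| per input channel, from calibration data
    """
    # Channels with large activations are "salient": scaling them up shrinks
    # their relative quantization error.
    s = act_mean ** alpha
    s = s / s.mean()                      # keep scales well-conditioned
    W_scaled = W * s                      # per-input-channel scaling
    # Plain symmetric round-to-nearest quantization of the scaled weights
    # (real AWQ quantizes in groups, not per tensor).
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(W_scaled).max() / qmax
    W_q = np.round(W_scaled / step) * step
    # Undo the scaling so the layer computes the same function; in real AWQ
    # the 1/s factor is folded into the previous op so stored weights stay
    # in integer form.
    return W_q / s, s
```

The design point to notice is that `s` depends on measured activations, which is exactly why AWQ needs a calibration pass and JANG does not.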

The key philosophical difference: AWQ determines importance empirically (by running data), while JANG determines importance structurally (by layer type). JANG's approach means you never need a calibration dataset, and quantization is nearly instant. AWQ's approach can potentially find data-specific patterns but requires GPU time and a representative dataset.

JANG Benchmark Data: 200-Question MMLU

Qwen3.5-122B on 200-question MMLU subset. AWQ numbers are not directly comparable (different evaluation setups), but JANG's results against MLX baselines demonstrate the value of mixed-precision allocation.

| Configuration | Avg Bits | Size | MMLU (200q) |
| FP16 (baseline) | 16.0 | ~244 GB | 86.5% |
| JANG_4K (mixed 4-bit) | 3.99 | 69 GB | 86.0% |
| MLX uniform 4-bit | 4.0 | 64 GB | 85.0% |
| JANG_2S (mixed 2-bit) | 2.11 | 38 GB | 79.0% |
| MLX mixed_2_6 | ~2.5 | 44 GB | 56.5% |
| MLX uniform 2-bit | 2.0 | ~34 GB | 65.5% |

200q MMLU on Qwen3.5-122B. JANG_2S achieves 79% at 2.11 average bits — 22.5 points above MLX mixed_2_6 and 6 GB smaller. AWQ typically operates at 4-bit only, making the 2-bit range a JANG-exclusive capability.
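As a rough sanity check on the sizes in the table, quantized file size scales as parameter count times average bits divided by 8, plus format overhead. The sketch below assumes an 8% overhead for per-group scales and higher-precision tensors; that figure is an illustration, not a JANG specification.

```python
def est_size_gb(n_params: float, avg_bits: float, overhead: float = 0.08) -> float:
    """Rough size estimate: params x bits / 8 bytes, plus metadata overhead.

    The 8% overhead default is an assumed figure covering per-group scales
    and tensors kept at higher precision; real formats vary.
    """
    return n_params * avg_bits / 8 / 1e9 * (1 + overhead)

# 122B parameters at 3.99 average bits:
print(round(est_size_gb(122e9, 3.99), 1))  # 65.7 (GB); the table reports 69 GB
```

The estimate lands in the right neighborhood of the table's 69 GB for JANG_4K; the remaining gap comes from format details the simple formula ignores.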

Calibration: Required vs Not Required

One of the most significant practical differences between JANG and AWQ is the calibration requirement.

AWQ Calibration Process

AWQ requires running a calibration dataset (typically 128-512 samples from a text corpus like C4 or WikiText) through the full model to measure activation magnitudes. This determines which weight channels are most important. The process requires a GPU capable of loading the full model in FP16 and takes minutes to hours depending on model size and calibration set.
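The statistics-gathering step described above can be sketched as accumulating per-channel activation magnitudes over calibration batches. This is a hypothetical illustration of what tools like AutoAWQ do with forward hooks; the function `collect_act_stats` and its interface are assumptions, not AutoAWQ's API.

```python
import numpy as np

def collect_act_stats(layer_inputs):
    """Accumulate mean |activation| per input channel over calibration batches.

    layer_inputs: iterable of arrays with shape (tokens, in_features),
    captured at a linear layer's input during calibration forward passes.
    """
    total, count = None, 0
    for x in layer_inputs:
        batch_sum = np.abs(x).sum(axis=0)   # per-channel |activation| sum
        total = batch_sum if total is None else total + batch_sum
        count += x.shape[0]
    return total / count                    # per-channel mean magnitude

# Usage sketch: feed 128-512 calibration samples through the model, capture
# each linear layer's inputs with hooks, then derive per-channel scaling
# factors from these statistics.
```

Because every linear layer's inputs must be observed, the full model has to run in FP16 at least once, which is the source of AWQ's GPU-memory and wall-clock cost at quantization time.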

JANG: No Calibration Needed

JANG skips calibration entirely. Its bit allocation is determined by predefined sensitivity tiers based on the model architecture. Attention layers are structurally more important than MLP layers across virtually all transformer architectures — this insight is baked into JANG's profiles. Quantization completes in seconds with no forward passes, no GPU memory overhead for calibration, and no dependency on a representative dataset.

When to Choose AWQ

AWQ Advantages

  • NVIDIA GPU optimized — AWQ has custom CUDA kernels and integrates with vLLM, TGI, and TensorRT-LLM for high-throughput server inference.
  • Proven at scale — AWQ is widely deployed in production serving environments on NVIDIA hardware.
  • Data-aware — Calibration can capture task-specific importance patterns that architecture-based rules might miss.
  • Broad ecosystem — Pre-quantized AWQ models are widely available on HuggingFace, and AutoAWQ makes quantization straightforward.
  • Server deployment — If you are deploying on NVIDIA A100/H100 clusters, AWQ is the more established choice.

Frequently Asked Questions

What is the difference between JANG and AWQ?
JANG uses architecture-aware layer sensitivity tiers to assign variable bit widths per tensor (2-8 bits) on Apple Silicon via MLX. AWQ uses activation-aware weight scaling with calibration data to protect important weights at a uniform bit width (typically 4-bit) on NVIDIA CUDA GPUs. JANG requires no calibration; AWQ requires running a dataset through the model.
Does JANG require calibration data like AWQ?
No. JANG determines bit allocation from predefined sensitivity tiers based on the model architecture. Attention layers automatically receive higher precision, MLP layers receive lower precision. No calibration dataset, no forward passes, no GPU time for calibration. This makes JANG quantization nearly instant.
Can I use AWQ models on Apple Silicon?
AWQ models are designed for NVIDIA CUDA inference and do not run natively on Apple Silicon. For quantized models on Apple Silicon Macs, use JANG (for mixed precision) or MLX's built-in uniform quantization. JANG brings to Metal GPUs the same kind of intelligent bit allocation that AWQ delivers on CUDA.
Which supports lower bit widths, JANG or AWQ?
JANG supports a much wider range. JANG offers 2, 3, 4, 5, 6, and 8 bits per tensor with 11 predefined profiles from 2.2 to 6.2 average bits. JANG_2S achieves 79% MMLU at 2.11 average bits. AWQ is primarily designed for 4-bit (INT4) quantization, with limited 3-bit support in some implementations. Sub-4-bit quantization is a key JANG differentiator.

Try JANG on Apple Silicon

No calibration required. 2-8 bit mixed precision. Pre-quantized models on HuggingFace.

Browse JANG Models on HuggingFace

Free · Apache 2.0 · Apple Silicon (M1 or later) · No calibration needed