开源

JANG

The GGUF for MLX

Apple Silicon MLX混合精度量化。

MLX的均匀量化对所有层应用相同的位宽。在低位（2-3）时，注意力层首先退化——输出崩溃为重复循环或退化序列。JANG根据灵敏度为每个张量分配可变位宽：注意力获得更多位，MLP获得更少位。

模型在GPU内存中保持量化状态，使用MLX原生 quantized_matmul 内核即时反量化——无float16扩展，无速度损失。格式基于safetensors。量化工具和运行时为Apache 2.0开源。

基于重要性的位分配 2-bit到8-bit混合精度 14 custom Metal GPU kernels Swift + Metal运行时逐块可变位宽开源 · Apache 2.0

在GitHub上查看查看结果

94%

122B MMLU (JANG_4K)

90%

122B 2.14 bits HumanEval

42 GB

磁盘（小于MLX mixed_2_6）

Apache 2.0

开源许可证

工作原理

基于层灵敏度的可变位宽

均匀量化对所有张量应用相同的位宽。注意力层（约占参数的12%）比MLP层对精度损失更敏感——过度量化时注意力分数变平，位置编码退化，输出退化。

JANG将张量分为灵敏度等级并相应分配位宽。注意力层获得5–8 bits，MLP压缩至2–4 bits。开销约为平均0.3 bits。

Attention

8-bit — 受保护

MLP

2-bit — 已压缩

Embed

4-bit

lm_head

6-bit

结果

                JANG_2M
                 → 2.7 avg bits → 
                连贯输出
              

                3-bit
                 → 3.0 avg bits → 
                重复循环
              

MMLU Benchmark

JANG vs MLX — side by side

Each JANG model compared against the closest MLX method by size. 50-question MMLU, thinking disabled, temp 0.0. Apple M4 Max 128 GB.

Qwen3.5-122B-A10B — ~4 bits — NEW

JANG

JANG_4K

69 GB · 3.99 bits · ~40 tok/s

94%

MMLU

+4 points vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

90%

MMLU

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2M

42 GB · 2.14 bits

84%

MMLU

+38 points · 2 GB smaller

MLX

mixed_2_6

44 GB · ~2.5 bits

46%

MMLU

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

16.7 GB · 3.99 bits · ~100 tok/s

84%

MMLU

+2 points · 1.3 GB smaller

MLX

4-bit

18 GB · 4.0 bits · ~110 tok/s

82%

MMLU

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12 GB · 2.17 bits · fits 16 GB RAM

62%

MMLU

+28 points · 1 GB smaller

MLX

mixed_2_6

13 GB · ~2.5 bits

34%

MMLU

Qwen3.5-122B-A10B — JANG_1L — 200-question MMLU

JANG

JANG_1L

51 GB · 2.24 bits · 48 tok/s

73%

MMLU (200 questions) · Wins 9/10 subjects

+27 points

MLX

mixed_2_6

44 GB · ~2.5 bits · 66 tok/s

46%

MMLU (200 questions)

All scores verified from HuggingFace model cards. Download: JANG_4K · JANG_2S · JANG_1L

较小模型（1B–3B）

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

“How many legs does a spider have?”

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits) answers “8”. 3-bit (3.5 bits) outputs number sequences.

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

“What is photosynthesis?”

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

Same average bits. JANG produces correct answer. Standard produces empty output.

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits) translates correctly. 4-bit (4.5 bits) echoes the prompt.

Qwen2.5-3B — Creative

JANG_4L 4.5 bits M4 Max

“Write a haiku about the moon.”

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANG generates coherent text. Standard loops a phrase.

Qwen2.5-3B — Tomato

JANG (4.12 bits) M4 Max

“Is a tomato a fruit or vegetable?”

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANG answers correctly. Standard enters repetition loop.

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

“What is the chemical formula for water?”

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S (4.1 bits) stays on topic. 4-bit (4.5 bits) derails to a different question.

Logit MSE证明

JANG 3.37 bits超越均匀4-bit

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG 3.37 bits（MSE 11.10）超越均匀4.00 bits（MSE 11.31）——少16%的位，更好的质量。

总结

所有已测试模型

Model	Params	Architecture	测试	失败模式
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences, 4b → loops
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct

所有测试：Apple M4 Max · 107 GB统一内存 · MLX仿射量化 · group_size=64 · 相同分词器 · 相同提示模板 · 45个实验 · 8个模型 · Qwen3.5-9B已下载，测试待进行

配置

JANG_{bits}{size}

从超压缩到近无损的11个预定义配置。S = Small（最大压缩），M = Medium（平衡），L = Large（最佳质量）。

Profile	MLP	Attention	Embed	lm_head	Avg Bits
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

运行时

Swift + Metal推理引擎

14个自定义Metal GPU内核。零拷贝mmap加载。解码和预填充的融合反量化。

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# Loading model (zero-copy mmap)...

# Profile: JANG_4L (MLP=4, attn=8, avg=4.5 bits)

# Size: 1.8 GB — loaded in 0.39s

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.

Dequant + GEMM

Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.

GQA Attention

Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.

RMSNorm + RoPE

Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.

SwiGLU

Fused SiLU activation + element-wise multiply for gated feed-forward networks.

Quantized Embedding

Direct embedding lookup from quantized weights. No full-table dequantization needed.

量化

转换任何模型

将HuggingFace模型转换为.jang格式的Python工具。选择配置，选择量化方法，即可开始。支持RTN、MSE最优网格搜索和GPTQ（Hessian引导）量化。

6+架构系列：Llama、Qwen、Gemma、Phi、Mistral、Mamba/SSM、MoE，以及包括Qwen 3.5的混合模型。

开源 — Apache 2.0许可证

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# Quantizing with GPTQ (Hessian-guided)...

# Attention layers: 8-bit | MLP: 4-bit

# Average bits: 4.5 | Size: 4.1 GB

# Done ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

内存

用更少RAM运行更大模型

JANG_3M在7B+模型上比均匀4-bit节省25%，质量相当。将以前放不下的模型装入统一内存。

~4.1 GB

7B at JANG_4S (vs 4.5 GB 4-bit)

~8.2 GB

14B at JANG_4S (vs 9 GB 4-bit)

~41 GB

70B at JANG_4S (vs 45 GB 4-bit)

25%

Savings at JANG_3M vs 4-bit

模型

HuggingFace预量化模型

可直接下载。通过JANG加载器与vMLX Engine / MLX Studio兼容。

Qwen3.5-122B-A10B-JANG_2M

2.14 bits · 42 GB · 84% MMLU ·

Qwen3.5-27B-JANG_1L

JANG_1L profile · Dense model

Qwen3.5-35B-A3B-JANG_2L

2.28 bits · 13.3 GB · 4/6 correct · 106 tok/s

Qwen3.5-122B-A10B-JANG_2L

2.19 bits · 45.3 GB · 3/4 correct · 49 tok/s

HuggingFace全部模型

原生集成

在MLX Studio中运行JANG模型

MLX Studio原生支持JANG，提供OpenAI兼容API、前缀缓存、分页KV缓存、KV量化（q4/q8）、连续批处理和20+智能编码工具。加载任何.jang模型并在本地提供服务——兼容Cursor、Continue、Aider及任何OpenAI API客户端。由vMLX Engine驱动，现已开源——pip install vmlx。

MLX Studio vMLX Engine

JANG

基于层灵敏度的可变位宽

JANG vs MLX — side by side

更多7B结果

JANG_3L (3.6 bits)

3-bit

JANG_3M (3.4 bits)

3-bit

JANG_3L (3.6 bits)

3-bit

JANG_2M (2.7 bits)

2-bit

JANG_4L (4.5 bits)

4-bit

JANG_2S (2.5 bits)

2-bit

较小模型（1B–3B）

JANG_3M (3.4 bits)

3-bit

JANG_2S (2.5 bits)

2-bit

JANG_4S (4.1 bits)

4-bit

JANG_4L (4.5 bits)

4-bit

JANG (4.12 bits)

4-bit

JANG_4S (4.1 bits)

4-bit

JANG 3.37 bits超越均匀4-bit

所有已测试模型

JANG_{bits}{size}

Swift + Metal推理引擎

Dequant + GEMV

Dequant + GEMM

GQA Attention

RMSNorm + RoPE

SwiGLU

Quantized Embedding

转换任何模型

用更少RAM运行更大模型

HuggingFace预量化模型

在MLX Studio中运行JANG模型