开源

JANG

The GGUF for MLX

128 GB Mac上的3970亿参数。92% MMLU。MLX甚至无法加载。

JANG_1L将397B模型压缩到112 GB — 128 GB MacBook Pro可用推理模式达到86.5% MMLU。MLX在2~3 bits下输出NaN。MiniMax 230B？MLX在所有bit水平下只有26.5%。Nemotron-H 120B？MLX 3-bit完全损坏。JANG是在Apple Silicon上量化运行这些模型的唯一方式。

JANG为attention分配更多bits，为MLP分配更少bits，使模型在standard quantization产生垃圾或NaN的地方仍能正常工作。同样速度，同样Metal内核 — 更好的输出。Apache 2.0开源。

基于重要性的比特分配 2-bit到8-bit混合精度 14个自定义Metal GPU内核 Swift + Metal运行时逐块可变位宽开源 · Apache 2.0

在GitHub上查看查看结果

397B

最大模型 — 可在128 GB Mac上运行

92%

397B上的MMLU（JANG_2L）

93%

Nemotron-H 120B上的MMLU（JANG_4M）

Apache 2.0

开源许可证

工作原理

基于层敏感度的可变位宽

Standard quantization对每个张量应用相同的位宽。Attention层（约占参数的12%）比MLP层对精度损失更敏感——过度量化时，attention分数变得平坦，位置编码退化，输出退化。

JANG将张量按敏感度分级并相应分配位宽。Attention层获得5~8 bits，而MLP压缩到2~4 bits。开销为平均约0.3 bits的额外量。

Attention

8-bit — 受保护

MLP

2-bit — 已压缩

Embed

4-bit

lm_head

6-bit

Result

JANG_2M
 → 2.7 avg bits → 
coherent output

3-bit
 → 3.0 avg bits → 
repetition loops

MMLU基准测试

JANG vs MLX——并排对比

每个JANG模型与大小最接近的MLX方法进行比较。200题MMLU（每科20题 × 10科），thinking/reasoning在标注处启用，temp 0.0。Apple M4 Max 128 GB / M4 Ultra 256 GB。

Qwen3.5-397B-A17B — 397 billion parameters — JANG vs MLX

JANG

JANG_1L

112 GB disk · 120 GB GPU peak · 36 tok/s · FITS 128 GB MACS

86.5%

MMLU (200q, reasoning) · 173/200

397B intelligence on a laptop

MLX

2-bit / 3-bit

Cannot run — NaN output

NaN

Model too complex for standard quantization

JANG

JANG_2L

187 GB disk · 197 GB GPU · 36 tok/s · M4 Ultra 256 GB

92%

MMLU (200q, reasoning) · 184/200

Near-FP16 quality at 2.x bits

MLX

4-bit

~280 GB · requires massive machines

94%

MMLU (200q, reasoning)

397B on a 128 GB Mac — first ever. JANG_1L at 112 GB disk (120 GB GPU peak) fits on a 128 GB MacBook Pro and scores 86.5% MMLU with reasoning. MLX at 2-bit and 3-bit produces NaN — the model is too complex for standard quantization at low bit widths. MLX 4-bit runs at 94% but needs ~280 GB, far beyond any laptop. JANG_2L at 187 GB hits 92% on an M4 Ultra 256 GB.

Nemotron-3-Super-120B-A12B — NVIDIA Hybrid Mamba-2 SSM + Latent MoE + Attention

JANG

JANG_4M

63 GB · 55 tok/s

93%

MMLU (200q, reasoning) · 186/200

First Nemotron-H on Apple Silicon

MLX

3-bit

Broken

—

Cannot produce valid output

JANG

JANG_2L

43 GB · 52 tok/s · fits 64 GB Macs

86%

MMLU (200q, reasoning) · 172/200

120B on a 64 GB Mac

First working Nemotron-H quantization for Apple Silicon. NVIDIA’s hybrid architecture combines Mamba-2 SSM, Latent MoE, and standard attention — MLX 3-bit is broken on it. JANG_4M at 63 GB scores 93% MMLU with reasoning at 55 tok/s. JANG_2L fits on a 64 GB Mac at 43 GB with 86% MMLU.

MiniMax-M2.5 (230B) — JANG vs MLX

JANG

JANG_2L

82.5 GB · 2.10 bits · 0.9s per question

74.0%

MMLU (200q) · 148/200

+47.5 points · MLX broken at ALL bit levels

MLX

4-bit

119.8 GB · 4.0 bits · 0.9s per question

26.5%

MMLU (200q) · 53/200

MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.

Per-subject breakdown — MiniMax-M2.5 (230B) — all methods

科目	JANG_2L	MLX 4-bit	MLX 3-bit	MLX 2-bit
Abstract Algebra	10/20	3/20	2/20	5/20
Anatomy	15/20	7/20	5/20	5/20
Astronomy	20/20	7/20	6/20	4/20
College CS	13/20	4/20	5/20	6/20
College Physics	13/20	8/20	6/20	6/20
HS Biology	18/20	4/20	5/20	6/20
HS Chemistry	18/20	4/20	5/20	5/20
HS Mathematics	8/20	6/20	6/20	3/20
Logical Fallacies	18/20	5/20	4/20	5/20
World Religions	15/20	5/20	5/20	5/20
Total	148/200 (74%)	53/200 (26.5%)	49/200 (24.5%)	50/200 (25%)

JANG wins all 10 subjects against all MLX methods. MLX 4-bit, 3-bit, and 2-bit all score near random (25%). Root cause: MLX generates meta-commentary instead of direct answers on this model.

Qwen3.5-122B-A10B — ~4 bits

JANG

JANG_4K

71 GB · 3.99 bits · ~40 tok/s

86%

MMLU (200q) · 172/200

+1 point vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

85%

MMLU (200q) · 170/200

Per-subject breakdown — 122B ~4 bits

科目	JANG_4K	MLX 4-bit
Abstract Algebra	16/20	15/20
Anatomy	19/20	18/20
Astronomy	19/20	19/20
College CS	15/20	15/20
College Physics	14/20	14/20
HS Biology	19/20	19/20
HS Chemistry	18/20	18/20
HS Mathematics	14/20	14/20
Logical Fallacies	19/20	19/20
World Religions	19/20	19/20
Total	172/200 (86%)	170/200 (85%)

JANG wins 2 subjects, ties 8. Neck-and-neck at ~4 bits.

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2S

44 GB · 2.11 bits · ~45 tok/s

79%

MMLU (200q) · 158/200

+22.5 points

MLX

2-bit

36 GB · 2.0 bits · ~52 tok/s

56.5%

MMLU (200q) · 113/200

Per-subject breakdown — 122B ~2 bits

科目	JANG_2S	MLX 2-bit
Abstract Algebra	9/20	9/20
Anatomy	18/20	11/20
Astronomy	20/20	16/20
College CS	14/20	8/20
College Physics	15/20	10/20
HS Biology	19/20	15/20
HS Chemistry	18/20	13/20
HS Mathematics	11/20	4/20
Logical Fallacies	16/20	13/20
World Religions	18/20	14/20
Total	158/200 (79%)	113/200 (56.5%)

JANG wins 9 of 10 subjects, ties 1 (Abstract Algebra).

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

20.1 GB · 3.99 bits · ~100 tok/s

77.5%

MMLU (200q) · 155/200

+2 points

MLX

4-bit

18.2 GB · 4.0 bits · ~110 tok/s

75.5%

MMLU (200q) · 151/200

Per-subject breakdown — 35B ~4 bits

科目	JANG_4K	MLX 4-bit
Abstract Algebra	12/20	10/20
Anatomy	17/20	16/20
Astronomy	18/20	18/20
College CS	14/20	15/20
College Physics	14/20	13/20
HS Biology	18/20	18/20
HS Chemistry	17/20	17/20
HS Mathematics	10/20	8/20
Logical Fallacies	18/20	19/20
World Religions	17/20	17/20
Total	155/200 (77.5%)	151/200 (75.5%)

JANG wins 4 subjects, loses 2 (College CS, Logical Fallacies), ties 4.

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12.8 GB · 2.17 bits · fits 16 GB RAM

65.5%

MMLU (200q) · 131/200

+25 points

MLX

2-bit

12.8 GB · ~2.5 bits

~40%

MMLU (est. from 34% at 50q)

Per-subject breakdown — 35B ~2 bits (JANG only)

科目	JANG_2S	MLX 2-bit
Abstract Algebra	8/20	—
Anatomy	14/20	—
Astronomy	19/20	—
College CS	14/20	—
College Physics	11/20	—
HS Biology	16/20	—
HS Chemistry	14/20	—
HS Mathematics	5/20	—
Logical Fallacies	14/20	—
World Religions	16/20	—
Total	131/200 (65.5%)	~40% (est.)

MLX 2-bit 200q not yet tested. Estimate based on 34% at 50 questions.

Test methodology & conditions

MMLU: 200-question subset (10 subjects × 20 questions each), thinking disabled, temperature 0.0.
Hardware: Apple M4 Max 128 GB unified memory.
Quantization: MLX affine quantization, group_size=64. JANG uses variable bit widths via quant_predicate.
Models: All methods use the same base model weights. JANG stays quantized in GPU memory using MLX’s native quantized_matmul — no float16 expansion.
Reproducibility: All scores verified from HuggingFace model cards. Code at github.com/jjang-ai/jangq.

Download: All models on HuggingFace — 397B, Nemotron-H 120B, 122B, 35B, MiniMax 230B, and more

QA提示词测试

基本提示词的三方比较

6个事实性问题并排比较。所有方法使用MLX的原生Metal内核。Temperature 0.0，最大80 tokens。M4 Max 128 GB。

Qwen3.5-122B-A10B — JANG_1L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, top-8, 10B active, Hybrid JANG_1L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_1L · 2.24 bits

46.0 GB RAM · 48 tok/s

MLX mixed_2_6 · ~2.2 bits

44.9 GB RAM · 66 tok/s

2-bit · 2.0 bits

35.6 GB RAM · 67 tok/s

“What is 2+2?”

✓ “2+2 is 4”

∼ “2+2=4” then repeats

∼ “2+2=4” then loops

“Is a tomato a fruit?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: empty think tag

✗ 2-bit: rephrases, no answer

“What is photosynthesis?”

✓ “plants use energy of sun to make food”

✗ Degenerate output

✗ “Photos-sense y=y”

“Three planets larger?”

∼ JANG: uses <think> (partial)

∼ mixed_2_6: uses <think> (partial)

✗ Misreads question

“Who wrote Romeo and Juliet?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: double think tag

∼ 2-bit: uses <think> (partial)

“Capital of France?”

✓ “Paris”

✓ mixed_2_6: “Paris”

✓ 2-bit: “Paris” with details

JANG_1L: 3 正确，3 部分正确，0 失败 · mixed_2_6: 1 正确，1 部分正确，4 失败 · 2-bit: 1 正确，2 部分正确，3 失败

MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.

MiniMax-M2.5 (230B) — JANG_2S (2.06 bits)

MoE 256 experts, top-8, 10B active JANG_2S · 2.06 bits Mac Studio M4 Ultra 192 GB

JANG_2S · 2.06 bits
81.6 GB GPU · 50 tok/s
JANG_2L · 2.10 bits
82.5 GB RAM · 74% MMLU (200q)

JANG_2S: 2.06 bits下3/6正确 · 230B模型 81.6 GB · 50 tok/s JANG_2L: 82.5 GB RAM下74% MMLU（200题）——比120 GB的MLX 4-bit高3倍

Qwen3.5-35B-A3B — JANG_2L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, Hybrid GDN+FA JANG_2L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_2L · 2.28 bits

13.3 GB RAM · 100 tok/s

MLX mixed_2_6 · ~2.2 bits

12.8 GB RAM · 120 tok/s

2-bit · 2.0 bits

10.1 GB RAM · 128 tok/s

“What is 2+2?”

✓ “2+2 equals 4”

✗ “2+2=4” then loops

✗ Number sequences

“Is a tomato a fruit?”

✗ JANG: loops

∼ mixed_2_6: partial reasoning

✗ 2-bit: degenerate

“What is photosynthesis?”

✓ “convert light energy”

✗ “I cannot respond”

✗ “6 6 6”

“Three planets larger?”

✓ “Jupiter, Saturn, Uranus”

✗ “Antina” loops

✗ Number sequences

“Who wrote Romeo and Juliet?”

∼ JANG: “Shakespeare” (partial)

✗ mixed_2_6: contradicts itself

✗ 2-bit: degenerate

“Capital of France?”

✓ “Paris” with details

✗ Never answers

∼ 2-bit: “Paris” partial

JANG_2L: 4 正确，1 部分正确，1 失败 · mixed_2_6: 0 正确，1 部分正确，5 失败 · 2-bit: 0 正确，1 部分正确，5 失败

On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.

Qwen3.5-122B-A10B — 1220亿参数，正面对比

MoE 256 experts, top-8, 10B active JANG_2L vs 2-bit M4 Max 128 GB

JANG_2L · 2.19 bits
45.3 GB RAM · 38–49 tok/s
2-bit · 2.0 bits
35.6 GB RAM · 52–65 tok/s

“What is photosynthesis?”

“process by which green plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose”

“Photos-sense” then “y = y = y” degenerate

“Three planets larger than Earth?”

Uses <think> reasoning tags, lists Jupiter with details

Misreads as “larger than Earth’s moon”, rambles

“Capital of France?”

“Paris” with government details

“Paris, on the banks of the River Seine” — both correct

“What is 2+2?”

“2+2 is 4.” (then repeats) — PARTIAL

“2+2=4” then “2. 2. 2.” loops

JANG: 3/4 正确  ·  2-bit: 1/4 正确  ·  45.3 vs 35.6 GB GPU  ·  <code style="font-size:0.72rem"><think></code> 推理能力在2.19 bits下保留

所有模型对比

大小、速度和分数 — JANG vs MLX

模型	方法	Bits	大小	MMLU
Qwen3.5-397B-A17B	JANG_2L	~2.x	187 GB	92%
	JANG_1L	~2.2	112 GB	86.5%
	MLX 4-bit	4.0	~280 GB	94%
	MLX 2-bit / 3-bit	2-3	—	NaN

Nemotron-3-Super-120B	JANG_4M	~4.2	63 GB	93%
	JANG_2L	~2.x	43 GB	86%
	MLX 3-bit	3.0	—	Broken

Qwen3.5-122B-A10B	JANG_2M	2.14	44.7 GB	79%
	JANG_1L	2.24	46.0 GB	73%
	JANG_2L	2.19	45.3 GB	—
	MLX mixed_2_6	~2.5	45 GB	46%
	2-bit	2.0	36 GB	56.5%

Qwen3.5-35B-A3B	JANG_4K	3.99	20.1 GB	77.5%
	MLX 4-bit	4.0	18.2 GB	75.5%
	JANG_4S	4.04	20.4 GB	82%
	JANG_2S	2.17	12.8 GB	65.5%
	JANG_2L v2	2.28	13.3 GB	56%
	MLX mixed_2_6	~2.5	12.8 GB	~40%

MiniMax-M2.5 (230B)	JANG_2S	2.06	81.6 GB	—
	JANG_2L	2.10	82.5 GB	74%
	MLX 4-bit	4.0	119.8 GB	26.5%
	MLX 2-bit	2.0	66.6 GB	25.0%

Apple M4 Max 128 GB / M4 Ultra 256 GB · MMLU: 200-question (10 subjects × 20), reasoning enabled for 397B and Nemotron, thinking disabled for others · 2026-03

Qwen3.5-397B: JANG_1L at 112 GB (120 GB GPU peak) fits on 128 GB Macs — 86.5% MMLU with reasoning, 36 tok/s. JANG_2L at 187 GB hits 92% on M4 Ultra 256 GB. MLX 2/3-bit: NaN. MLX 4-bit: 94% but ~280 GB.

Nemotron-3-Super-120B: JANG_4M at 63 GB scores 93% MMLU, 55 tok/s. JANG_2L at 43 GB scores 86%, fits 64 GB Macs. MLX 3-bit: broken. First working Nemotron-H quantization for Apple Silicon.

MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU at 82.5 GB vs MLX 4-bit at 26.5% (119.8 GB). MLX broken at ALL bit levels (26.5%, 24.5%, 25%). JANG is the only way to run MiniMax quantized.

Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% = 82%), confirming the quantization pipeline is lossless at matched bit widths.

397B

已测试的最大模型

已测试的架构系列

tok/s（Nemotron 120B, JANG_4M）

0.3s

加载时间（3B模型，mmap）

早期结果

Dense模型比较（1B–7B）

在质量退化边界进行比较——standard quantization开始产生退化输出的位宽。相同提示词，相同temperature，相同模型。全部在M4 Max上。

Qwen3.5-4B（混合架构）

Hybrid: 24 linear + 8 full attn JANG_2S 2.5 eff. bits M4 Max · 107 GB

At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.

“What is 2+2?”

JANG: “The answer is 4.”

2-bit: “2+2? 2+2? 2+2?”

“Is a tomato a fruit?”

JANG: “A tomato is a fruit, not a vegetable.”

2-bit: “1 1 1 1 1 1 1 1”

“Who wrote Romeo and Juliet?”

JANG: Answers correctly

2-bit: “10, 10, 10, 10”

“What is photosynthesis?”

JANG: Correct definition

2-bit: Garbled text

“How many legs does a spider have?”

JANG: Answers correctly

2-bit: “10, 10, 10”

“Largest ocean on Earth?”

JANG: “The Pacific Ocean.”

2-bit: Infinite loop

亮点 — 7B模型

Mistral-7B-v0.3

Mistral GQA 4:1 JANG_3M 3.4 bits M4 Max

"光合作用是什么？"

JANG_3M (3.4 bits)

“Photosynthesis is the process by which plants and some other organisms...”

3-bit（3.5 bits）

10000000000000000000000000000...

JANG_3M在3.4 bits下产生正确输出。3-bit（3.5 bits）输出数字序列。

Qwen2.5-7B

Qwen GQA 4:1 JANG_3L 3.6 bits M4 Max

"2+2等于几？"

JANG_3L (3.6 bits)

“The answer is 4.”

3-bit（3.5 bits）

Assistant Assistant Assistant Assistant Assistant...

JANG_3L（3.6 bits）正确回答。3-bit（3.5 bits）进入重复循环。

Mistral-7B — 4-bit

Mistral GQA 4:1 JANG_4S 4.1 bits M4 Max

"2+2等于几？"

JANG_4S (4.1 bits)

“The answer is 4. But what if...”

4-bit (4.5 bits)

4. What is 2+2? 4. What is 2+2? 4...

JANG_4S（4.1 bits）正确回答。4-bit（4.5 bits）循环问题。

Mistral-7B — 2-bit

Mistral GQA 4:1 JANG_2S 2.5 bits M4 Max

"说出太阳系的三颗行星。"

JANG_2S (2.5 bits)

“1. Jupiter 2. Mars 3. Saturn”

2-bit（2.5 bits）

is a new planet, and it is a new planet...

JANG_2S（2.5 bits）列出三颗行星。2-bit（2.5 bits）进入重复循环。

小型模型（1B–3B）

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

"蜘蛛有几条腿？"

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits)回答"8"。3-bit (3.5 bits)输出数字序列。

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

"光合作用是什么？"

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

相同的平均bits。JANG产生正确答案。Standard产生空输出。

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

"把'thank you'翻译成西班牙语。"

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S（4.1 bits）正确翻译。4-bit（4.5 bits）重复提示词。

Qwen2.5-3B — 创作

JANG_4L 4.5 bits M4 Max

"写一首关于月亮的俳句。"

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANG生成连贯文本。Standard循环一个短语。

Qwen2.5-3B — 番茄

JANG (4.12 bits) M4 Max

"番茄是水果还是蔬菜？"

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANG正确回答。Standard进入重复循环。

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

"水的化学式是什么？"

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S（4.1 bits）保持主题。4-bit（4.5 bits）偏离到不同的问题。

Logit MSE证明

JANG 3.37 bits超越4-bit

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.

总结

所有已测试模型

模型	参数量	架构	测试	失败模式
Qwen3.5-397B-A17B	397B	MoE, Hybrid	MMLU	MLX 2/3-bit → NaN
Nemotron-3-Super-120B	120B	Hybrid Mamba-2 SSM + Latent MoE + Attn	MMLU	MLX 3-bit → broken
MiniMax-M2.5	230B	MoE 256 experts, top-8	MMLU	MLX all bits → random (25%)
Qwen3.5-122B-A10B	122B	MoE 256 experts, Hybrid	MMLU	2-bit → 56.5%, mixed_2_6 → 46%
Qwen3.5-35B-A3B	35B	MoE 256 experts, Hybrid GDN+FA	MMLU+QA	2-bit → degenerate, mixed_2_6 → broken
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output

Apple M4 Max 128 GB / M4 Ultra 256 GB · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 12 models · 1B to 397B

配置方案

JANG_{bits}{size}

从超压缩到近无损的11个预设配置。S = Small（最大压缩），M = Medium（平衡），L = Large（最高质量）。

配置	MLP	Attention	Embed	lm_head	平均Bits
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

运行时

Swift + Metal推理引擎

14个自定义Metal GPU内核。零拷贝mmap加载。融合反量化用于decode和prefill。

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# 加载模型（零拷贝mmap）...

# 配置：JANG_4L（MLP=4，attn=8，平均=4.5 bits）

# 大小：1.8 GB — 0.39秒加载完成

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

单token decode的融合反量化+矩阵-向量乘法。所有位宽（2、3、4、5、6、8）在一个内核中完成。

Dequant + GEMM

提示词prefill的融合反量化+矩阵-矩阵乘法。针对Apple GPU threadgroup内存进行了分块优化。

GQA Attention

Grouped-query attention decode + causal prefill。支持标准、滑动窗口和混合架构。

RMSNorm + RoPE

融合归一化和旋转位置编码。支持传统和非传统RoPE变体。

SwiGLU

门控前馈网络的融合SiLU激活+逐元素乘法。

量化Embedding

从量化权重直接查找embedding。无需对整表进行反量化。

量化

转换任意模型

将HuggingFace模型转换为.jang格式的Python工具。选择配置，选择量化方法，然后运行。支持RTN、MSE最优网格搜索和GPTQ（Hessian引导）quantization。

支持6+架构系列：Llama、Qwen、Gemma、Phi、Mistral、Mamba/SSM、MoE，以及包括Qwen 3.5在内的混合模型。

开源 — Apache 2.0许可证

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# 使用GPTQ（Hessian引导）量化中...

# Attention层：8-bit | MLP：4-bit

# 平均bits：4.5 | 大小：4.1 GB

# 完成 ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

内存

用更少的RAM运行更大的模型

JANG_3M在7B以上模型中比4-bit节省25%，且质量相当。可以将以前无法装入的模型放入unified memory。

~4.1 GB

JANG_4S下的7B（对比标准 4.5 GB）

~8.2 GB

JANG_4S下的14B（对比4-bit 9 GB）

~41 GB

JANG_4S下的70B（对比4-bit 45 GB）

25%

JANG_3M对比4-bit的节省率

模型

HuggingFace上的预量化模型

可供下载。通过JANG加载器与vMLX Engine / MLX Studio兼容。

Qwen3.5-397B-A17B-JANG_1L

112 GB · 86.5% MMLU · 36 tok/s · Fits 128 GB Mac

Qwen3.5-397B-A17B-JANG_2L

187 GB · 92% MMLU · 36 tok/s · M4 Ultra 256 GB

Nemotron-3-Super-120B-JANG_4M

63 GB · 93% MMLU · 55 tok/s

Nemotron-3-Super-120B-JANG_2L

43 GB · 86% MMLU · 52 tok/s · Fits 64 GB Mac

Qwen3.5-122B-A10B-JANG_4K

3.99 bits · 71 GB · 86% MMLU (200q) · ~40 tok/s

Qwen3.5-122B-A10B-JANG_2S

2.11 bits · 44 GB · 79% MMLU (200q) · ~45 tok/s

Qwen3.5-35B-A3B-JANG_4K

3.99 bits · 20.1 GB · 77.5% MMLU (200q) · ~100 tok/s

Qwen3.5-35B-A3B-JANG_2S

2.17 bits · 12.8 GB · 65.5% MMLU (200q) · Fits 16 GB RAM

HuggingFace上的所有模型

原生集成

在MLX Studio中运行JANG模型

MLX Studio提供原生JANG支持，包含OpenAI兼容API、prefix caching、paged KV cache、KV quantization（q4/q8）、continuous batching以及20多种智能编程工具。加载任何.jang模型并在本地部署——兼容Cursor、Continue、Aider及所有OpenAI API客户端。由vMLX Engine驱动，现已开源——pip install vmlx。

MLX Studio vMLX Engine

JANG

基于层敏感度的可变位宽

JANG vs MLX——并排对比

基本提示词的三方比较

大小、速度和分数 — JANG vs MLX

Dense模型比较（1B–7B）

亮点 — 7B模型

JANG_3M (3.4 bits)

3-bit（3.5 bits）

JANG_3L (3.6 bits)

3-bit（3.5 bits）

JANG_4S (4.1 bits)

4-bit (4.5 bits)

JANG_2S (2.5 bits)

2-bit（2.5 bits）

更多7B结果

JANG_3L (3.6 bits)

3-bit

JANG_3M (3.4 bits)

3-bit

JANG_3L (3.6 bits)

3-bit

JANG_2M (2.7 bits)

2-bit

JANG_4L (4.5 bits)

4-bit

JANG_2S (2.5 bits)

2-bit

小型模型（1B–3B）

JANG_3M (3.4 bits)

3-bit

JANG_2S (2.5 bits)

2-bit

JANG_4S (4.1 bits)

4-bit

JANG_4L (4.5 bits)

4-bit

JANG (4.12 bits)

4-bit

JANG_4S (4.1 bits)

4-bit

JANG 3.37 bits超越4-bit

所有已测试模型

JANG_{bits}{size}

Swift + Metal推理引擎

Dequant + GEMV

Dequant + GEMM

GQA Attention

RMSNorm + RoPE

SwiGLU

量化Embedding

转换任意模型

用更少的RAM运行更大的模型

HuggingFace上的预量化模型

在MLX Studio中运行JANG模型