オープンソース

JANG

The GGUF for MLX

128 GB Macで3970億パラメータ。92% MMLU。MLXはロードすらできません。

JANG_1Lは397Bモデルを112 GBに圧縮 — 128 GB MacBook Proで推論モード86.5% MMLUを達成。MLXは2~3 bitsでNaNを出力。MiniMax 230B? MLXは全bit水準で26.5%。Nemotron-H 120B? MLX 3-bitは完全に壊れています。JANGだけがApple Siliconでこれらのモデルを量子化実行できます。

JANGはattentionにより多くのbitsを、MLPにより少ないbitsを割り当て、standard quantizationがゴミやNaNを出力する場所でもモデルが正常動作。同じ速度、同じMetalカーネル — より良い出力。Apache 2.0オープンソース。

重要度に基づくビット割り当て 2-bit〜8-bit混合精度 14個のカスタムMetal GPUカーネル Swift + Metalランタイムブロックごとの可変ビット幅オープンソース · Apache 2.0

GitHubで見る結果を見る

397B

最大モデル — 128 GB Macで動作

92%

397BでのMMLU（JANG_2L）

93%

Nemotron-H 120BでのMMLU（JANG_4M）

Apache 2.0

オープンソースライセンス

仕組み

レイヤー感度に基づく可変ビット幅

Standard quantizationはすべてのテンソルに同じビット幅を適用します。Attentionレイヤー（パラメータの約12%）はMLPレイヤーよりも精度損失に敏感です — 過度に量子化するとattentionスコアが平坦になり、位置エンコーディングが劣化し、出力が退化します。

JANGはテンソルを感度の階層に分類し、それに応じてビット幅を割り当てます。Attentionレイヤーは5〜8 bitsを割り当てられ、MLPは2〜4 bitsに圧縮されます。オーバーヘッドは平均約0.3 bitsの追加です。

Attention

8-bit — 保護

MLP

2-bit — 圧縮

Embed

4-bit

lm_head

6-bit

Result

JANG_2M
 → 2.7 avg bits → 
coherent output

3-bit
 → 3.0 avg bits → 
repetition loops

MMLU ベンチマーク

JANG vs MLX — 並列比較

各JANGモデルをサイズが最も近いMLX方式と比較。200問MMLU（10科目×各20問）、thinking/reasoningは記載箇所で有効化、temp 0.0。Apple M4 Max 128 GB / M4 Ultra 256 GB。

Qwen3.5-397B-A17B — 397 billion parameters — JANG vs MLX

JANG

JANG_1L

112 GB disk · 120 GB GPU peak · 36 tok/s · FITS 128 GB MACS

86.5%

MMLU (200q, reasoning) · 173/200

397B intelligence on a laptop

MLX

2-bit / 3-bit

Cannot run — NaN output

NaN

Model too complex for standard quantization

JANG

JANG_2L

187 GB disk · 197 GB GPU · 36 tok/s · M4 Ultra 256 GB

92%

MMLU (200q, reasoning) · 184/200

Near-FP16 quality at 2.x bits

MLX

4-bit

~280 GB · requires massive machines

94%

MMLU (200q, reasoning)

397B on a 128 GB Mac — first ever. JANG_1L at 112 GB disk (120 GB GPU peak) fits on a 128 GB MacBook Pro and scores 86.5% MMLU with reasoning. MLX at 2-bit and 3-bit produces NaN — the model is too complex for standard quantization at low bit widths. MLX 4-bit runs at 94% but needs ~280 GB, far beyond any laptop. JANG_2L at 187 GB hits 92% on an M4 Ultra 256 GB.

Nemotron-3-Super-120B-A12B — NVIDIA Hybrid Mamba-2 SSM + Latent MoE + Attention

JANG

JANG_4M

63 GB · 55 tok/s

93%

MMLU (200q, reasoning) · 186/200

First Nemotron-H on Apple Silicon

MLX

3-bit

Broken

—

Cannot produce valid output

JANG

JANG_2L

43 GB · 52 tok/s · fits 64 GB Macs

86%

MMLU (200q, reasoning) · 172/200

120B on a 64 GB Mac

First working Nemotron-H quantization for Apple Silicon. NVIDIA’s hybrid architecture combines Mamba-2 SSM, Latent MoE, and standard attention — MLX 3-bit is broken on it. JANG_4M at 63 GB scores 93% MMLU with reasoning at 55 tok/s. JANG_2L fits on a 64 GB Mac at 43 GB with 86% MMLU.

MiniMax-M2.5 (230B) — JANG vs MLX

JANG

JANG_2L

82.5 GB · 2.10 bits · 0.9s per question

74.0%

MMLU (200q) · 148/200

+47.5 points · MLX broken at ALL bit levels

MLX

4-bit

119.8 GB · 4.0 bits · 0.9s per question

26.5%

MMLU (200q) · 53/200

MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.

Per-subject breakdown — MiniMax-M2.5 (230B) — all methods

科目	JANG_2L	MLX 4-bit	MLX 3-bit	MLX 2-bit
Abstract Algebra	10/20	3/20	2/20	5/20
Anatomy	15/20	7/20	5/20	5/20
Astronomy	20/20	7/20	6/20	4/20
College CS	13/20	4/20	5/20	6/20
College Physics	13/20	8/20	6/20	6/20
HS Biology	18/20	4/20	5/20	6/20
HS Chemistry	18/20	4/20	5/20	5/20
HS Mathematics	8/20	6/20	6/20	3/20
Logical Fallacies	18/20	5/20	4/20	5/20
World Religions	15/20	5/20	5/20	5/20
Total	148/200 (74%)	53/200 (26.5%)	49/200 (24.5%)	50/200 (25%)

JANG wins all 10 subjects against all MLX methods. MLX 4-bit, 3-bit, and 2-bit all score near random (25%). Root cause: MLX generates meta-commentary instead of direct answers on this model.

Qwen3.5-122B-A10B — ~4 bits

JANG

JANG_4K

71 GB · 3.99 bits · ~40 tok/s

86%

MMLU (200q) · 172/200

+1 point vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

85%

MMLU (200q) · 170/200

Per-subject breakdown — 122B ~4 bits

科目	JANG_4K	MLX 4-bit
Abstract Algebra	16/20	15/20
Anatomy	19/20	18/20
Astronomy	19/20	19/20
College CS	15/20	15/20
College Physics	14/20	14/20
HS Biology	19/20	19/20
HS Chemistry	18/20	18/20
HS Mathematics	14/20	14/20
Logical Fallacies	19/20	19/20
World Religions	19/20	19/20
Total	172/200 (86%)	170/200 (85%)

JANG wins 2 subjects, ties 8. Neck-and-neck at ~4 bits.

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2S

44 GB · 2.11 bits · ~45 tok/s

79%

MMLU (200q) · 158/200

+22.5 points

MLX

2-bit

36 GB · 2.0 bits · ~52 tok/s

56.5%

MMLU (200q) · 113/200

Per-subject breakdown — 122B ~2 bits

科目	JANG_2S	MLX 2-bit
Abstract Algebra	9/20	9/20
Anatomy	18/20	11/20
Astronomy	20/20	16/20
College CS	14/20	8/20
College Physics	15/20	10/20
HS Biology	19/20	15/20
HS Chemistry	18/20	13/20
HS Mathematics	11/20	4/20
Logical Fallacies	16/20	13/20
World Religions	18/20	14/20
Total	158/200 (79%)	113/200 (56.5%)

JANG wins 9 of 10 subjects, ties 1 (Abstract Algebra).

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

20.1 GB · 3.99 bits · ~100 tok/s

77.5%

MMLU (200q) · 155/200

+2 points

MLX

4-bit

18.2 GB · 4.0 bits · ~110 tok/s

75.5%

MMLU (200q) · 151/200

Per-subject breakdown — 35B ~4 bits

科目	JANG_4K	MLX 4-bit
Abstract Algebra	12/20	10/20
Anatomy	17/20	16/20
Astronomy	18/20	18/20
College CS	14/20	15/20
College Physics	14/20	13/20
HS Biology	18/20	18/20
HS Chemistry	17/20	17/20
HS Mathematics	10/20	8/20
Logical Fallacies	18/20	19/20
World Religions	17/20	17/20
Total	155/200 (77.5%)	151/200 (75.5%)

JANG wins 4 subjects, loses 2 (College CS, Logical Fallacies), ties 4.

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12.8 GB · 2.17 bits · fits 16 GB RAM

65.5%

MMLU (200q) · 131/200

+25 points

MLX

2-bit

12.8 GB · ~2.5 bits

~40%

MMLU (est. from 34% at 50q)

Per-subject breakdown — 35B ~2 bits (JANG only)

科目	JANG_2S	MLX 2-bit
Abstract Algebra	8/20	—
Anatomy	14/20	—
Astronomy	19/20	—
College CS	14/20	—
College Physics	11/20	—
HS Biology	16/20	—
HS Chemistry	14/20	—
HS Mathematics	5/20	—
Logical Fallacies	14/20	—
World Religions	16/20	—
Total	131/200 (65.5%)	~40% (est.)

MLX 2-bit 200q not yet tested. Estimate based on 34% at 50 questions.

Test methodology & conditions

MMLU: 200-question subset (10 subjects × 20 questions each), thinking disabled, temperature 0.0.
Hardware: Apple M4 Max 128 GB unified memory.
Quantization: MLX affine quantization, group_size=64. JANG uses variable bit widths via quant_predicate.
Models: All methods use the same base model weights. JANG stays quantized in GPU memory using MLX’s native quantized_matmul — no float16 expansion.
Reproducibility: All scores verified from HuggingFace model cards. Code at github.com/jjang-ai/jangq.

Download: All models on HuggingFace — 397B, Nemotron-H 120B, 122B, 35B, MiniMax 230B, and more

QA プロンプトテスト

基本プロンプトでの3者比較

6つの事実問題で並列比較。全方式がMLXのネイティブMetalカーネルを使用。Temperature 0.0、最大80 tokens。M4 Max 128 GB。

Qwen3.5-122B-A10B — JANG_1L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, top-8, 10B active, Hybrid JANG_1L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_1L · 2.24 bits

46.0 GB RAM · 48 tok/s

MLX mixed_2_6 · ~2.2 bits

44.9 GB RAM · 66 tok/s

2-bit · 2.0 bits

35.6 GB RAM · 67 tok/s

“What is 2+2?”

✓ “2+2 is 4”

∼ “2+2=4” then repeats

∼ “2+2=4” then loops

“Is a tomato a fruit?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: empty think tag

✗ 2-bit: rephrases, no answer

“What is photosynthesis?”

✓ “plants use energy of sun to make food”

✗ Degenerate output

✗ “Photos-sense y=y”

“Three planets larger?”

∼ JANG: uses <think> (partial)

∼ mixed_2_6: uses <think> (partial)

✗ Misreads question

“Who wrote Romeo and Juliet?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: double think tag

∼ 2-bit: uses <think> (partial)

“Capital of France?”

✓ “Paris”

✓ mixed_2_6: “Paris”

✓ 2-bit: “Paris” with details

JANG_1L: 3 正解、3 部分正解、0 失敗 · mixed_2_6: 1 正解、1 部分正解、4 失敗 · 2-bit: 1 正解、2 部分正解、3 失敗

MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.

MiniMax-M2.5 (230B) — JANG_2S (2.06 bits)

MoE 256 experts, top-8, 10B active JANG_2S · 2.06 bits Mac Studio M4 Ultra 192 GB

JANG_2S · 2.06 bits
81.6 GB GPU · 50 tok/s
JANG_2L · 2.10 bits
82.5 GB RAM · 74% MMLU (200q)

JANG_2S: 2.06 bitsで3/6 正解 · 230Bモデル 81.6 GB · 50 tok/s JANG_2L: 82.5 GB RAMで74% MMLU（200問） — 120 GBのMLX 4-bitの3倍高い

Qwen3.5-35B-A3B — JANG_2L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, Hybrid GDN+FA JANG_2L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_2L · 2.28 bits

13.3 GB RAM · 100 tok/s

MLX mixed_2_6 · ~2.2 bits

12.8 GB RAM · 120 tok/s

2-bit · 2.0 bits

10.1 GB RAM · 128 tok/s

“What is 2+2?”

✓ “2+2 equals 4”

✗ “2+2=4” then loops

✗ Number sequences

“Is a tomato a fruit?”

✗ JANG: loops

∼ mixed_2_6: partial reasoning

✗ 2-bit: degenerate

“What is photosynthesis?”

✓ “convert light energy”

✗ “I cannot respond”

✗ “6 6 6”

“Three planets larger?”

✓ “Jupiter, Saturn, Uranus”

✗ “Antina” loops

✗ Number sequences

“Who wrote Romeo and Juliet?”

∼ JANG: “Shakespeare” (partial)

✗ mixed_2_6: contradicts itself

✗ 2-bit: degenerate

“Capital of France?”

✓ “Paris” with details

✗ Never answers

∼ 2-bit: “Paris” partial

JANG_2L: 4 正解、1 部分正解、1 失敗 · mixed_2_6: 0 正解、1 部分正解、5 失敗 · 2-bit: 0 正解、1 部分正解、5 失敗

On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.

Qwen3.5-122B-A10B — 1220億パラメータ、直接比較

MoE 256 experts, top-8, 10B active JANG_2L vs 2-bit M4 Max 128 GB

JANG_2L · 2.19 bits
45.3 GB RAM · 38–49 tok/s
2-bit · 2.0 bits
35.6 GB RAM · 52–65 tok/s

“What is photosynthesis?”

“process by which green plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose”

“Photos-sense” then “y = y = y” degenerate

“Three planets larger than Earth?”

Uses <think> reasoning tags, lists Jupiter with details

Misreads as “larger than Earth’s moon”, rambles

“Capital of France?”

“Paris” with government details

“Paris, on the banks of the River Seine” — both correct

“What is 2+2?”

“2+2 is 4.” (then repeats) — PARTIAL

“2+2=4” then “2. 2. 2.” loops

JANG: 3/4 正解  ·  2-bit: 1/4 正解  ·  45.3 vs 35.6 GB GPU  ·  <code style="font-size:0.72rem"><think></code> 推論機能 2.19 bitsで維持

全モデル比較

サイズ、速度、スコア — JANG vs MLX

モデル	方法	Bits	サイズ	MMLU
Qwen3.5-397B-A17B	JANG_2L	~2.x	187 GB	92%
	JANG_1L	~2.2	112 GB	86.5%
	MLX 4-bit	4.0	~280 GB	94%
	MLX 2-bit / 3-bit	2-3	—	NaN

Nemotron-3-Super-120B	JANG_4M	~4.2	63 GB	93%
	JANG_2L	~2.x	43 GB	86%
	MLX 3-bit	3.0	—	Broken

Qwen3.5-122B-A10B	JANG_2M	2.14	44.7 GB	79%
	JANG_1L	2.24	46.0 GB	73%
	JANG_2L	2.19	45.3 GB	—
	MLX mixed_2_6	~2.5	45 GB	46%
	2-bit	2.0	36 GB	56.5%

Qwen3.5-35B-A3B	JANG_4K	3.99	20.1 GB	77.5%
	MLX 4-bit	4.0	18.2 GB	75.5%
	JANG_4S	4.04	20.4 GB	82%
	JANG_2S	2.17	12.8 GB	65.5%
	JANG_2L v2	2.28	13.3 GB	56%
	MLX mixed_2_6	~2.5	12.8 GB	~40%

MiniMax-M2.5 (230B)	JANG_2S	2.06	81.6 GB	—
	JANG_2L	2.10	82.5 GB	74%
	MLX 4-bit	4.0	119.8 GB	26.5%
	MLX 2-bit	2.0	66.6 GB	25.0%

Apple M4 Max 128 GB / M4 Ultra 256 GB · MMLU: 200-question (10 subjects × 20), reasoning enabled for 397B and Nemotron, thinking disabled for others · 2026-03

Qwen3.5-397B: JANG_1L at 112 GB (120 GB GPU peak) fits on 128 GB Macs — 86.5% MMLU with reasoning, 36 tok/s. JANG_2L at 187 GB hits 92% on M4 Ultra 256 GB. MLX 2/3-bit: NaN. MLX 4-bit: 94% but ~280 GB.

Nemotron-3-Super-120B: JANG_4M at 63 GB scores 93% MMLU, 55 tok/s. JANG_2L at 43 GB scores 86%, fits 64 GB Macs. MLX 3-bit: broken. First working Nemotron-H quantization for Apple Silicon.

MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU at 82.5 GB vs MLX 4-bit at 26.5% (119.8 GB). MLX broken at ALL bit levels (26.5%, 24.5%, 25%). JANG is the only way to run MiniMax quantized.

Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% = 82%), confirming the quantization pipeline is lossless at matched bit widths.

397B

テスト済み最大モデル

テスト済みアーキテクチャファミリー

tok/s（Nemotron 120B, JANG_4M）

0.3s

ロード時間（3Bモデル、mmap）

以前の結果

Denseモデル比較（1B〜7B）

品質劣化の境界での比較 — standard quantizationが退化した出力を生成し始めるビット幅。同じプロンプト、同じtemperature、同じモデル。すべてM4 Maxで。

Qwen3.5-4B（ハイブリッドアーキテクチャ）

Hybrid: 24 linear + 8 full attn JANG_2S 2.5 eff. bits M4 Max · 107 GB

At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.

“What is 2+2?”

JANG: “The answer is 4.”

2-bit: “2+2? 2+2? 2+2?”

“Is a tomato a fruit?”

JANG: “A tomato is a fruit, not a vegetable.”

2-bit: “1 1 1 1 1 1 1 1”

“Who wrote Romeo and Juliet?”

JANG: Answers correctly

2-bit: “10, 10, 10, 10”

“What is photosynthesis?”

JANG: Correct definition

2-bit: Garbled text

“How many legs does a spider have?”

JANG: Answers correctly

2-bit: “10, 10, 10”

“Largest ocean on Earth?”

JANG: “The Pacific Ocean.”

2-bit: Infinite loop

ハイライト — 7Bモデル

Mistral-7B-v0.3

Mistral GQA 4:1 JANG_3M 3.4 bits M4 Max

「光合成とは何ですか？」

JANG_3M (3.4 bits)

“Photosynthesis is the process by which plants and some other organisms...”

3-bit（3.5 bits）

10000000000000000000000000000...

JANG_3Mは3.4 bitsで正確な出力を生成します。3-bit（3.5 bits）は数字のシーケンスを出力します。

Qwen2.5-7B

Qwen GQA 4:1 JANG_3L 3.6 bits M4 Max

「2+2は？」

JANG_3L (3.6 bits)

“The answer is 4.”

3-bit（3.5 bits）

Assistant Assistant Assistant Assistant Assistant...

JANG_3L（3.6 bits）は正確に回答します。3-bit（3.5 bits）は繰り返しループに入ります。

Mistral-7B — 4-bit

Mistral GQA 4:1 JANG_4S 4.1 bits M4 Max

「2+2は？」

JANG_4S (4.1 bits)

“The answer is 4. But what if...”

4-bit (4.5 bits)

4. What is 2+2? 4. What is 2+2? 4...

JANG_4S（4.1 bits）は正確に回答します。4-bit（4.5 bits）は質問をループします。

Mistral-7B — 2-bit

Mistral GQA 4:1 JANG_2S 2.5 bits M4 Max

「太陽系の惑星を3つ挙げてください。」

JANG_2S (2.5 bits)

“1. Jupiter 2. Mars 3. Saturn”

2-bit（2.5 bits）

is a new planet, and it is a new planet...

JANG_2S（2.5 bits）は3つの惑星を列挙します。2-bit（2.5 bits）は繰り返しループに入ります。

追加7B結果

Qwen2.5-7B

JANG_3L 3.6 bits M4 Max

「ロミオとジュリエットの作者は誰ですか？」

JANG_3L (3.6 bits)

“The play Romeo and Juliet was written by William Shakespeare”

3-bit

Who wrote Romeo and Juliet?

JANGは正確に回答します。Standardはプロンプトをそのまま繰り返します。

Mistral-7B

JANG_3M 3.4 bits M4 Max

「レオナルド・ダ・ヴィンチの有名な絵画を挙げてください。」

JANG_3M (3.4 bits)

“The Last Supper, The Mona Lisa, The Vitruvian Man...”

3-bit

Name a famous painting by Michelangelo. Name a famous painting by Rap...

JANGは正確な作品を列挙します。Standardは異なるアーティストを幻覚します。

Mistral-7B

JANG_3L 3.6 bits M4 Max

「クモの脚は何本ですか？」

JANG_3L (3.6 bits)

“Spiders have eight legs.”

3-bit

TDM 10000000000000000000000...

JANGは正確に回答します。Standardは数字のシーケンスを出力します。

Mistral-7B — 2.7 bits

JANG_2M 2.7 bits M4 Max

「地球で最も大きい海は何ですか？」

JANG_2M (2.7 bits)

“The Pacific Ocean, The Atlantic Ocean, The Indian...”

2-bit

## 1000000000000000000000000...

JANG_2M（2.7 bits）は正確に回答します。2-bitは数字のシーケンスを出力します。

Mistral-7B — 4-bit

JANG_4L 4.5 bits M4 Max

「ロミオとジュリエットの作者は誰ですか？」

JANG_4L (4.5 bits)

“William Shakespeare. What is the name of the play...”

4-bit

William Shakespeare 1564-1616 1564-1616 1564-1616...

JANGは正確に回答します。Standardは日付を繰り返しループします。

Mistral-7B — 2-bit

JANG_2S 2.5 bits M4 Max

「トマトは果物ですか、野菜ですか？」

JANG_2S (2.5 bits)

“The tomato is a fruit, not a vegetable”

2-bit

The tomato is a fruit or a vegetable?...

同じ平均ビット幅。JANGは正確に回答し、standardはループします。

小型モデル（1B–3B）

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

「クモの脚は何本ですか？」

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits)は「8」と回答。3-bit (3.5 bits)は数列を出力。

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

「光合成とは何ですか？」

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

同じ平均bits。JANGは正確な回答を生成します。Standardは空の出力を生成します。

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

「'thank you'をスペイン語に翻訳してください。」

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S（4.1 bits）は正確に翻訳します。4-bit（4.5 bits）はプロンプトを繰り返します。

Qwen2.5-3B — 創作

JANG_4L 4.5 bits M4 Max

「月についての俳句を書いてください。」

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANGは一貫したテキストを生成します。Standardはフレーズをループします。

Qwen2.5-3B — トマト

JANG (4.12 bits) M4 Max

「トマトは果物ですか、野菜ですか？」

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANGは正確に回答します。Standardは繰り返しループに入ります。

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

「水の化学式は何ですか？」

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S（4.1 bits）はトピックを維持します。4-bit（4.5 bits）は別の質問に脱線します。

Logit MSE証明

JANG 3.37 bitsが4-bitを上回ります

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.

まとめ

テスト済み全モデル

モデル	パラメータ	アーキテクチャ	テスト	失敗モード
Qwen3.5-397B-A17B	397B	MoE, Hybrid	MMLU	MLX 2/3-bit → NaN
Nemotron-3-Super-120B	120B	Hybrid Mamba-2 SSM + Latent MoE + Attn	MMLU	MLX 3-bit → broken
MiniMax-M2.5	230B	MoE 256 experts, top-8	MMLU	MLX all bits → random (25%)
Qwen3.5-122B-A10B	122B	MoE 256 experts, Hybrid	MMLU	2-bit → 56.5%, mixed_2_6 → 46%
Qwen3.5-35B-A3B	35B	MoE 256 experts, Hybrid GDN+FA	MMLU+QA	2-bit → degenerate, mixed_2_6 → broken
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output

Apple M4 Max 128 GB / M4 Ultra 256 GB · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 12 models · 1B to 397B

プロファイル

JANG_{bits}{size}

超圧縮からほぼ無損失まで11の定義済みプロファイル。S = Small（最大圧縮）、M = Medium（バランス）、L = Large（最高品質）。

プロファイル	MLP	Attention	Embed	lm_head	平均Bits
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

ランタイム

Swift + Metal推論エンジン

14個のカスタムMetal GPUカーネル。Zero-copy mmapロード。デコードとプリフィルの融合逆量子化。

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# モデル読み込み（zero-copy mmap）...

# プロファイル：JANG_4L（MLP=4, attn=8, 平均=4.5 bits）

# サイズ：1.8 GB — 0.39秒で読み込み完了

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

単一トークンdecodeのための逆量子化 + 行列-ベクトル乗算を融合。すべてのビット幅（2, 3, 4, 5, 6, 8）を1つのカーネルで処理します。

Dequant + GEMM

プロンプトprefillのための逆量子化 + 行列-行列乗算を融合。Apple GPUのthreadgroupメモリに最適化されたタイリング処理。

GQA Attention

Grouped-query attention decode + causal prefill。標準、sliding window、ハイブリッドアーキテクチャをサポートします。

RMSNorm + RoPE

正規化とrotary position embeddingを融合。従来型および非従来型のRoPEバリアントをサポートします。

SwiGLU

ゲート付きフィードフォワードネットワークのためのSiLU活性化 + 要素ごとの乗算を融合。

量子化Embedding

量子化された重みから直接embeddingを検索します。テーブル全体の逆量子化は不要です。

量子化

あらゆるモデルを変換

HuggingFaceモデルを.jangフォーマットに変換するPythonツールです。プロファイルを選択し、量子化手法を選んで実行するだけです。RTN、MSE最適グリッドサーチ、GPTQ（Hessianベース）quantizationをサポートしています。

6以上のアーキテクチャファミリーをサポート：Llama、Qwen、Gemma、Phi、Mistral、Mamba/SSM、MoE、Qwen 3.5を含むハイブリッドモデル。

オープンソース — Apache 2.0ライセンス

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# GPTQ（Hessianベース）で量子化中...

# Attentionレイヤー：8-bit | MLP：4-bit

# 平均bits：4.5 | サイズ：4.1 GB

# 完了 ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

メモリ

より少ないRAMでより大きなモデルを実行

JANG_3Mは7B以上のモデルで4-bitと同等の品質を維持しながら25%を節約します。以前は収まらなかったモデルをunified memoryに格納できます。

~4.1 GB

JANG_4Sで7B（4-bit 4.5 GB比）

~8.2 GB

JANG_4Sで14B（4-bit 9 GB比）

~41 GB

JANG_4Sで70B（4-bit 45 GB比）

25%

JANG_3Mの4-bit比の節約率

モデル

HuggingFaceの事前量子化モデル

ダウンロード可能。JANGローダーを通じてvMLX Engine / MLX Studioと互換性があります。

Qwen3.5-397B-A17B-JANG_1L

112 GB · 86.5% MMLU · 36 tok/s · Fits 128 GB Mac

Qwen3.5-397B-A17B-JANG_2L

187 GB · 92% MMLU · 36 tok/s · M4 Ultra 256 GB

Nemotron-3-Super-120B-JANG_4M

63 GB · 93% MMLU · 55 tok/s

Nemotron-3-Super-120B-JANG_2L

43 GB · 86% MMLU · 52 tok/s · Fits 64 GB Mac

Qwen3.5-122B-A10B-JANG_4K

3.99 bits · 71 GB · 86% MMLU (200q) · ~40 tok/s

Qwen3.5-122B-A10B-JANG_2S

2.11 bits · 44 GB · 79% MMLU (200q) · ~45 tok/s

Qwen3.5-35B-A3B-JANG_4K

3.99 bits · 20.1 GB · 77.5% MMLU (200q) · ~100 tok/s

Qwen3.5-35B-A3B-JANG_2S

2.17 bits · 12.8 GB · 65.5% MMLU (200q) · Fits 16 GB RAM

HuggingFaceの全モデル

ネイティブ統合

MLX StudioでJANGモデルを実行

MLX StudioはOpenAI互換API、prefix caching、paged KV cache、KV quantization（q4/q8）、continuous batching、20以上のエージェントコーディングツールとともにネイティブJANGサポートを提供します。任意の.jangモデルを読み込んでローカルでサーブできます — Cursor、Continue、Aider、およびすべてのOpenAI APIクライアントに対応。vMLX Engineを搭載、現在オープンソース — pip install vmlx。

MLX Studio vMLX Engine