オープンソース

JANG

The GGUF for MLX

Apple Silicon用 MLX 混合精度量子化。

MLXの均一量子化はすべてのレイヤーに同じビット幅を適用します。低ビット（2-3）ではアテンションレイヤーが最初に劣化します — 出力が繰り返しループや退化シーケンスに崩壊します。JANGは感度に基づいてテンソルごとに可変ビット幅を割り当てます：アテンションにより多くのビット、MLPにより少ないビット。

モデルはGPUメモリ内で量子化されたまま保持され、MLXのネイティブ quantized_matmul カーネルを使用してオンザフライで逆量子化されます — float16展開なし、速度ペナルティなし。フォーマットはsafetensorsベース。量子化ツールとランタイムはApache 2.0オープンソース。

重要度ベースのビット割り当て 2-bit〜8-bit混合精度 14 custom Metal GPU kernels Swift + Metalランタイムブロック別可変ビット幅オープンソース · Apache 2.0

GitHubで見る結果を見る

94%

122B MMLU (JANG_4K)

90%

122Bで2.14 bits HumanEval

42 GB

ディスク（MLX mixed_2_6より小さい）

Apache 2.0

オープンソースライセンス

仕組み

レイヤー感度に基づく可変ビット幅

均一量子化はすべてのテンソルに同じビット幅を適用します。アテンションレイヤー（パラメータの約12%）はMLPレイヤーよりも精度損失に敏感です — 過度に量子化されるとアテンションスコアが平坦化し、位置エンコーディングが劣化し、出力が退化します。

JANGはテンソルを感度階層に分類し、それに応じてビット幅を割り当てます。アテンションレイヤーに5–8 bits、MLPに2–4 bits圧縮。オーバーヘッドは平均約0.3 bits。

Attention

8-bit — 保護

MLP

2-bit — 圧縮

Embed

4-bit

lm_head

6-bit

結果

                JANG_2M
                 → 2.7 avg bits → 
                一貫した出力
              

                3-bit
                 → 3.0 avg bits → 
                繰り返しループ
              

MMLU Benchmark

JANG vs MLX — side by side

Each JANG model compared against the closest MLX method by size. 50-question MMLU, thinking disabled, temp 0.0. Apple M4 Max 128 GB.

Qwen3.5-122B-A10B — ~4 bits — NEW

JANG

JANG_4K

69 GB · 3.99 bits · ~40 tok/s

94%

MMLU

+4 points vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

90%

MMLU

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2M

42 GB · 2.14 bits

84%

MMLU

+38 points · 2 GB smaller

MLX

mixed_2_6

44 GB · ~2.5 bits

46%

MMLU

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

16.7 GB · 3.99 bits · ~100 tok/s

84%

MMLU

+2 points · 1.3 GB smaller

MLX

4-bit

18 GB · 4.0 bits · ~110 tok/s

82%

MMLU

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12 GB · 2.17 bits · fits 16 GB RAM

62%

MMLU

+28 points · 1 GB smaller

MLX

mixed_2_6

13 GB · ~2.5 bits

34%

MMLU

Qwen3.5-122B-A10B — JANG_1L — 200-question MMLU

JANG

JANG_1L

51 GB · 2.24 bits · 48 tok/s

73%

MMLU (200 questions) · Wins 9/10 subjects

+27 points

MLX

mixed_2_6

44 GB · ~2.5 bits · 66 tok/s

46%

MMLU (200 questions)

All scores verified from HuggingFace model cards. Download: JANG_4K · JANG_2S · JANG_1L

追加7B結果

Qwen2.5-7B

JANG_3L 3.6 bits M4 Max

“Who wrote Romeo and Juliet?”

JANG_3L (3.6 bits)

“The play Romeo and Juliet was written by William Shakespeare”

3-bit

Who wrote Romeo and Juliet?

JANG answers correctly. Standard echoes the prompt back.

Mistral-7B

JANG_3M 3.4 bits M4 Max

“Name a famous painting by Leonardo da Vinci.”

JANG_3M (3.4 bits)

“The Last Supper, The Mona Lisa, The Vitruvian Man...”

3-bit

Name a famous painting by Michelangelo. Name a famous painting by Rap...

JANG lists correct works. Standard hallucinates different artists.

Mistral-7B

JANG_3L 3.6 bits M4 Max

“How many legs does a spider have?”

JANG_3L (3.6 bits)

“Spiders have eight legs.”

3-bit

TDM 10000000000000000000000...

JANG answers correctly. Standard outputs number sequences.

Mistral-7B — 2.7 bits

JANG_2M 2.7 bits M4 Max

“What is the largest ocean on Earth?”

JANG_2M (2.7 bits)

“The Pacific Ocean, The Atlantic Ocean, The Indian...”

2-bit

## 1000000000000000000000000...

JANG_2M (2.7 bits) answers correctly. 2-bit outputs number sequences.

Mistral-7B — 4-bit

JANG_4L 4.5 bits M4 Max

“Who wrote Romeo and Juliet?”

JANG_4L (4.5 bits)

“William Shakespeare. What is the name of the play...”

4-bit

William Shakespeare 1564-1616 1564-1616 1564-1616...

JANG answers correctly. Standard loops the dates repeatedly.

Mistral-7B — 2-bit

JANG_2S 2.5 bits M4 Max

“Is a tomato a fruit or vegetable?”

JANG_2S (2.5 bits)

“The tomato is a fruit, not a vegetable”

2-bit

The tomato is a fruit or a vegetable?...

Same average bit width. JANG answers correctly, standard loops.

小型モデル（1B〜3B）

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

“How many legs does a spider have?”

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits) answers “8”. 3-bit (3.5 bits) outputs number sequences.

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

“What is photosynthesis?”

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

Same average bits. JANG produces correct answer. Standard produces empty output.

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits) translates correctly. 4-bit (4.5 bits) echoes the prompt.

Qwen2.5-3B — Creative

JANG_4L 4.5 bits M4 Max

“Write a haiku about the moon.”

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANG generates coherent text. Standard loops a phrase.

Qwen2.5-3B — Tomato

JANG (4.12 bits) M4 Max

“Is a tomato a fruit or vegetable?”

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANG answers correctly. Standard enters repetition loop.

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

“What is the chemical formula for water?”

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S (4.1 bits) stays on topic. 4-bit (4.5 bits) derails to a different question.

Logit MSE 証明

JANG 3.37 bitsが均一4-bitを上回る

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG 3.37 bits（MSE 11.10）が均一4.00 bits（MSE 11.31）を上回る — 16%少ないビットでより高い品質。

要約

テスト済みの全モデル

Model	Params	Architecture	テスト	失敗モード
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences, 4b → loops
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct

すべてのテスト：Apple M4 Max · 107 GB統合メモリ · MLX affine量子化 · group_size=64 · 同一トークナイザー · 同一プロンプトテンプレート · 45実験 · 8モデル · Qwen3.5-9Bダウンロード済み、テスト保留中

プロファイル

JANG_{bits}{size}

超圧縮からほぼ無損失まで11の定義済みプロファイル。S = Small（最大圧縮）、M = Medium（バランス）、L = Large（最高品質）。

Profile	MLP	Attention	Embed	lm_head	Avg Bits
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

ランタイム

Swift + Metal 推論エンジン

14カスタムMetal GPUカーネル。ゼロコピーmmapロード。デコードとプリフィル用融合逆量子化。

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# Loading model (zero-copy mmap)...

# Profile: JANG_4L (MLP=4, attn=8, avg=4.5 bits)

# Size: 1.8 GB — loaded in 0.39s

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.

Dequant + GEMM

Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.

GQA Attention

Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.

RMSNorm + RoPE

Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.

SwiGLU

Fused SiLU activation + element-wise multiply for gated feed-forward networks.

Quantized Embedding

Direct embedding lookup from quantized weights. No full-table dequantization needed.

量子化

あらゆるモデルを変換

HuggingFaceモデルを.jangフォーマットに変換するPythonツール。プロファイルを選択し、量子化方法を選び、実行。RTN、MSE最適グリッドサーチ、GPTQ（ヘッセ行列ベース）量子化をサポート。

6以上のアーキテクチャファミリ：Llama、Qwen、Gemma、Phi、Mistral、Mamba/SSM、MoE、Qwen 3.5を含むハイブリッドモデル。

オープンソース — Apache 2.0ライセンス

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# Quantizing with GPTQ (Hessian-guided)...

# Attention layers: 8-bit | MLP: 4-bit

# Average bits: 4.5 | Size: 4.1 GB

# Done ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

メモリ

より少ないRAMでより大きなモデルを実行

JANG_3Mは7B以上のモデルで均一4-bitと同等の品質で25%節約。以前は収まらなかったモデルを統合メモリに収容。

~4.1 GB

7B at JANG_4S (vs 4.5 GB 4-bit)

~8.2 GB

14B at JANG_4S (vs 9 GB 4-bit)

~41 GB

70B at JANG_4S (vs 45 GB 4-bit)

25%

Savings at JANG_3M vs 4-bit

モデル

HuggingFace事前量子化モデル

ダウンロード可能。JANGローダー経由でvMLX Engine / MLX Studioと互換。

Qwen3.5-122B-A10B-JANG_2M

2.14 bits · 42 GB · 84% MMLU ·

Qwen3.5-27B-JANG_1L

JANG_1L profile · Dense model

Qwen3.5-35B-A3B-JANG_2L

2.28 bits · 13.3 GB · 4/6 correct · 106 tok/s

Qwen3.5-122B-A10B-JANG_2L

2.19 bits · 45.3 GB · 3/4 correct · 49 tok/s

HuggingFace全モデル

ネイティブ統合

MLX StudioでJANGモデルを実行

MLX StudioはOpenAI互換API、プレフィックスキャッシング、ページドKVキャッシュ、KV量子化（q4/q8）、連続バッチング、20以上のエージェントコーディングツールとともにネイティブJANGサポート。.jangモデルをロードしてローカルでサービング — Cursor、Continue、Aider、すべてのOpenAI APIクライアントと互換。vMLX Engineで駆動、オープンソース — pip install vmlx。

MLX Studio vMLX Engine