오픈소스

JANG

The GGUF for MLX

128 GB Mac에서 3970억 파라미터. 92% MMLU. MLX는 로드조차 할 수 없습니다.

JANG_1L은 397B 모델을 112 GB에 압축합니다 — 128 GB MacBook Pro에서 추론 모드로 86.5% MMLU를 달성합니다. MLX는 2~3 bits에서 NaN을 출력합니다. MiniMax 230B? MLX는 모든 bit 수준에서 26.5%입니다. Nemotron-H 120B? MLX 3-bit는 완전히 고장입니다. JANG만이 Apple Silicon에서 이 모델들을 양자화하여 실행할 수 있습니다.

JANG은 attention에 더 많은 bits를, MLP에 더 적은 bits를 할당하여, standard quantization이 쓰레기나 NaN을 생성하는 곳에서도 모델이 정상 작동합니다. 같은 속도, 같은 Metal 커널 — 더 나은 출력. Apache 2.0 오픈소스.

중요도 기반 비트 할당 2-bit ~ 8-bit 혼합 정밀도 14개 커스텀 Metal GPU 커널 Swift + Metal 런타임 블록별 가변 비트 폭 오픈소스 · Apache 2.0

GitHub에서 보기 결과 보기

397B

최대 모델 — 128 GB Mac에서 작동

92%

397B에서 MMLU (JANG_2L)

93%

Nemotron-H 120B에서 MMLU (JANG_4M)

Apache 2.0

오픈소스 라이선스

작동 원리

레이어 민감도에 기반한 가변 비트 폭

Standard quantization은 모든 텐서에 동일한 비트 폭을 적용합니다. Attention 레이어(파라미터의 약 12%)는 MLP 레이어보다 정밀도 손실에 더 민감합니다 — 너무 공격적으로 양자화하면 attention 스코어가 평탄해지고, 위치 인코딩이 저하되며, 출력이 퇴화됩니다.

JANG은 텐서를 민감도 등급으로 분류하고 그에 따라 비트 폭을 할당합니다. Attention 레이어는 5~8 bits를 할당받고 MLP는 2~4 bits로 압축됩니다. 오버헤드는 평균 약 0.3 bits 추가입니다.

Attention

8-bit — 보호됨

MLP

2-bit — 압축됨

Embed

4-bit

lm_head

6-bit

Result

JANG_2M
 → 2.7 avg bits → 
coherent output

3-bit
 → 3.0 avg bits → 
repetition loops

MMLU 벤치마크

JANG vs MLX — 나란히 비교

각 JANG 모델을 크기가 가장 비슷한 MLX 방법과 비교. 200문항 MMLU (과목당 20문항 × 10과목), thinking/reasoning은 표시된 곳에서 활성화, temp 0.0. Apple M4 Max 128 GB / M4 Ultra 256 GB.

Qwen3.5-397B-A17B — 397 billion parameters — JANG vs MLX

JANG

JANG_1L

112 GB disk · 120 GB GPU peak · 36 tok/s · FITS 128 GB MACS

86.5%

MMLU (200q, reasoning) · 173/200

397B intelligence on a laptop

MLX

2-bit / 3-bit

Cannot run — NaN output

NaN

Model too complex for standard quantization

JANG

JANG_2L

187 GB disk · 197 GB GPU · 36 tok/s · M4 Ultra 256 GB

92%

MMLU (200q, reasoning) · 184/200

Near-FP16 quality at 2.x bits

MLX

4-bit

~280 GB · requires massive machines

94%

MMLU (200q, reasoning)

397B on a 128 GB Mac — first ever. JANG_1L at 112 GB disk (120 GB GPU peak) fits on a 128 GB MacBook Pro and scores 86.5% MMLU with reasoning. MLX at 2-bit and 3-bit produces NaN — the model is too complex for standard quantization at low bit widths. MLX 4-bit runs at 94% but needs ~280 GB, far beyond any laptop. JANG_2L at 187 GB hits 92% on an M4 Ultra 256 GB.

Nemotron-3-Super-120B-A12B — NVIDIA Hybrid Mamba-2 SSM + Latent MoE + Attention

JANG

JANG_4M

63 GB · 55 tok/s

93%

MMLU (200q, reasoning) · 186/200

First Nemotron-H on Apple Silicon

MLX

3-bit

Broken

—

Cannot produce valid output

JANG

JANG_2L

43 GB · 52 tok/s · fits 64 GB Macs

86%

MMLU (200q, reasoning) · 172/200

120B on a 64 GB Mac

First working Nemotron-H quantization for Apple Silicon. NVIDIA’s hybrid architecture combines Mamba-2 SSM, Latent MoE, and standard attention — MLX 3-bit is broken on it. JANG_4M at 63 GB scores 93% MMLU with reasoning at 55 tok/s. JANG_2L fits on a 64 GB Mac at 43 GB with 86% MMLU.

MiniMax-M2.5 (230B) — JANG vs MLX

JANG

JANG_2L

82.5 GB · 2.10 bits · 0.9s per question

74.0%

MMLU (200q) · 148/200

+47.5 points · MLX broken at ALL bit levels

MLX

4-bit

119.8 GB · 4.0 bits · 0.9s per question

26.5%

MMLU (200q) · 53/200

MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.

Per-subject breakdown — MiniMax-M2.5 (230B) — all methods

과목	JANG_2L	MLX 4-bit	MLX 3-bit	MLX 2-bit
Abstract Algebra	10/20	3/20	2/20	5/20
Anatomy	15/20	7/20	5/20	5/20
Astronomy	20/20	7/20	6/20	4/20
College CS	13/20	4/20	5/20	6/20
College Physics	13/20	8/20	6/20	6/20
HS Biology	18/20	4/20	5/20	6/20
HS Chemistry	18/20	4/20	5/20	5/20
HS Mathematics	8/20	6/20	6/20	3/20
Logical Fallacies	18/20	5/20	4/20	5/20
World Religions	15/20	5/20	5/20	5/20
Total	148/200 (74%)	53/200 (26.5%)	49/200 (24.5%)	50/200 (25%)

JANG wins all 10 subjects against all MLX methods. MLX 4-bit, 3-bit, and 2-bit all score near random (25%). Root cause: MLX generates meta-commentary instead of direct answers on this model.

Qwen3.5-122B-A10B — ~4 bits

JANG

JANG_4K

71 GB · 3.99 bits · ~40 tok/s

86%

MMLU (200q) · 172/200

+1 point vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

85%

MMLU (200q) · 170/200

Per-subject breakdown — 122B ~4 bits

과목	JANG_4K	MLX 4-bit
Abstract Algebra	16/20	15/20
Anatomy	19/20	18/20
Astronomy	19/20	19/20
College CS	15/20	15/20
College Physics	14/20	14/20
HS Biology	19/20	19/20
HS Chemistry	18/20	18/20
HS Mathematics	14/20	14/20
Logical Fallacies	19/20	19/20
World Religions	19/20	19/20
Total	172/200 (86%)	170/200 (85%)

JANG wins 2 subjects, ties 8. Neck-and-neck at ~4 bits.

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2S

44 GB · 2.11 bits · ~45 tok/s

79%

MMLU (200q) · 158/200

+22.5 points

MLX

2-bit

36 GB · 2.0 bits · ~52 tok/s

56.5%

MMLU (200q) · 113/200

Per-subject breakdown — 122B ~2 bits

과목	JANG_2S	MLX 2-bit
Abstract Algebra	9/20	9/20
Anatomy	18/20	11/20
Astronomy	20/20	16/20
College CS	14/20	8/20
College Physics	15/20	10/20
HS Biology	19/20	15/20
HS Chemistry	18/20	13/20
HS Mathematics	11/20	4/20
Logical Fallacies	16/20	13/20
World Religions	18/20	14/20
Total	158/200 (79%)	113/200 (56.5%)

JANG wins 9 of 10 subjects, ties 1 (Abstract Algebra).

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

20.1 GB · 3.99 bits · ~100 tok/s

77.5%

MMLU (200q) · 155/200

+2 points

MLX

4-bit

18.2 GB · 4.0 bits · ~110 tok/s

75.5%

MMLU (200q) · 151/200

Per-subject breakdown — 35B ~4 bits

과목	JANG_4K	MLX 4-bit
Abstract Algebra	12/20	10/20
Anatomy	17/20	16/20
Astronomy	18/20	18/20
College CS	14/20	15/20
College Physics	14/20	13/20
HS Biology	18/20	18/20
HS Chemistry	17/20	17/20
HS Mathematics	10/20	8/20
Logical Fallacies	18/20	19/20
World Religions	17/20	17/20
Total	155/200 (77.5%)	151/200 (75.5%)

JANG wins 4 subjects, loses 2 (College CS, Logical Fallacies), ties 4.

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12.8 GB · 2.17 bits · fits 16 GB RAM

65.5%

MMLU (200q) · 131/200

+25 points

MLX

2-bit

12.8 GB · ~2.5 bits

~40%

MMLU (est. from 34% at 50q)

Per-subject breakdown — 35B ~2 bits (JANG only)

과목	JANG_2S	MLX 2-bit
Abstract Algebra	8/20	—
Anatomy	14/20	—
Astronomy	19/20	—
College CS	14/20	—
College Physics	11/20	—
HS Biology	16/20	—
HS Chemistry	14/20	—
HS Mathematics	5/20	—
Logical Fallacies	14/20	—
World Religions	16/20	—
Total	131/200 (65.5%)	~40% (est.)

MLX 2-bit 200q not yet tested. Estimate based on 34% at 50 questions.

Test methodology & conditions

MMLU: 200-question subset (10 subjects × 20 questions each), thinking disabled, temperature 0.0.
Hardware: Apple M4 Max 128 GB unified memory.
Quantization: MLX affine quantization, group_size=64. JANG uses variable bit widths via quant_predicate.
Models: All methods use the same base model weights. JANG stays quantized in GPU memory using MLX’s native quantized_matmul — no float16 expansion.
Reproducibility: All scores verified from HuggingFace model cards. Code at github.com/jjang-ai/jangq.

Download: All models on HuggingFace — 397B, Nemotron-H 120B, 122B, 35B, MiniMax 230B, and more

QA 프롬프트 테스트

기본 프롬프트에 대한 3자 비교

6개 사실 질문으로 나란히 비교. 모든 방법이 MLX의 네이티브 Metal 커널을 사용. Temperature 0.0, 최대 80 tokens. M4 Max 128 GB.

Qwen3.5-122B-A10B — JANG_1L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, top-8, 10B active, Hybrid JANG_1L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_1L · 2.24 bits

46.0 GB RAM · 48 tok/s

MLX mixed_2_6 · ~2.2 bits

44.9 GB RAM · 66 tok/s

2-bit · 2.0 bits

35.6 GB RAM · 67 tok/s

“What is 2+2?”

✓ “2+2 is 4”

∼ “2+2=4” then repeats

∼ “2+2=4” then loops

“Is a tomato a fruit?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: empty think tag

✗ 2-bit: rephrases, no answer

“What is photosynthesis?”

✓ “plants use energy of sun to make food”

✗ Degenerate output

✗ “Photos-sense y=y”

“Three planets larger?”

∼ JANG: uses <think> (partial)

∼ mixed_2_6: uses <think> (partial)

✗ Misreads question

“Who wrote Romeo and Juliet?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: double think tag

∼ 2-bit: uses <think> (partial)

“Capital of France?”

✓ “Paris”

✓ mixed_2_6: “Paris”

✓ 2-bit: “Paris” with details

JANG_1L: 3 정답, 3 부분 정답, 0 실패 · mixed_2_6: 1 정답, 1 부분 정답, 4 실패 · 2-bit: 1 정답, 2 부분 정답, 3 실패

MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.

MiniMax-M2.5 (230B) — JANG_2S (2.06 bits)

MoE 256 experts, top-8, 10B active JANG_2S · 2.06 bits Mac Studio M4 Ultra 192 GB

JANG_2S · 2.06 bits
81.6 GB GPU · 50 tok/s
JANG_2L · 2.10 bits
82.5 GB RAM · 74% MMLU (200q)

JANG_2S: 2.06 bits에서 3/6 정답 · 230B 모델 81.6 GB · 50 tok/s JANG_2L: 82.5 GB RAM에서 74% MMLU (200문항) — 120 GB의 MLX 4-bit 대비 3배 높음

Qwen3.5-35B-A3B — JANG_2L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, Hybrid GDN+FA JANG_2L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_2L · 2.28 bits

13.3 GB RAM · 100 tok/s

MLX mixed_2_6 · ~2.2 bits

12.8 GB RAM · 120 tok/s

2-bit · 2.0 bits

10.1 GB RAM · 128 tok/s

“What is 2+2?”

✓ “2+2 equals 4”

✗ “2+2=4” then loops

✗ Number sequences

“Is a tomato a fruit?”

✗ JANG: loops

∼ mixed_2_6: partial reasoning

✗ 2-bit: degenerate

“What is photosynthesis?”

✓ “convert light energy”

✗ “I cannot respond”

✗ “6 6 6”

“Three planets larger?”

✓ “Jupiter, Saturn, Uranus”

✗ “Antina” loops

✗ Number sequences

“Who wrote Romeo and Juliet?”

∼ JANG: “Shakespeare” (partial)

✗ mixed_2_6: contradicts itself

✗ 2-bit: degenerate

“Capital of France?”

✓ “Paris” with details

✗ Never answers

∼ 2-bit: “Paris” partial

JANG_2L: 4 정답, 1 부분 정답, 1 실패 · mixed_2_6: 0 정답, 1 부분 정답, 5 실패 · 2-bit: 0 정답, 1 부분 정답, 5 실패

On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.

Qwen3.5-122B-A10B — 1220억 파라미터, 직접 비교

MoE 256 experts, top-8, 10B active JANG_2L vs 2-bit M4 Max 128 GB

JANG_2L · 2.19 bits
45.3 GB RAM · 38–49 tok/s
2-bit · 2.0 bits
35.6 GB RAM · 52–65 tok/s

“What is photosynthesis?”

“process by which green plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose”

“Photos-sense” then “y = y = y” degenerate

“Three planets larger than Earth?”

Uses <think> reasoning tags, lists Jupiter with details

Misreads as “larger than Earth’s moon”, rambles

“Capital of France?”

“Paris” with government details

“Paris, on the banks of the River Seine” — both correct

“What is 2+2?”

“2+2 is 4.” (then repeats) — PARTIAL

“2+2=4” then “2. 2. 2.” loops

JANG: 3/4 정답  ·  2-bit: 1/4 정답  ·  45.3 vs 35.6 GB GPU  ·  <code style="font-size:0.72rem"><think></code> 추론 기능 2.19 bits에서 유지

모든 모델 비교

크기, 속도, 점수 — JANG vs MLX

모델	방법	Bits	크기	MMLU
Qwen3.5-397B-A17B	JANG_2L	~2.x	187 GB	92%
	JANG_1L	~2.2	112 GB	86.5%
	MLX 4-bit	4.0	~280 GB	94%
	MLX 2-bit / 3-bit	2-3	—	NaN

Nemotron-3-Super-120B	JANG_4M	~4.2	63 GB	93%
	JANG_2L	~2.x	43 GB	86%
	MLX 3-bit	3.0	—	Broken

Qwen3.5-122B-A10B	JANG_2M	2.14	44.7 GB	79%
	JANG_1L	2.24	46.0 GB	73%
	JANG_2L	2.19	45.3 GB	—
	MLX mixed_2_6	~2.5	45 GB	46%
	2-bit	2.0	36 GB	56.5%

Qwen3.5-35B-A3B	JANG_4K	3.99	20.1 GB	77.5%
	MLX 4-bit	4.0	18.2 GB	75.5%
	JANG_4S	4.04	20.4 GB	82%
	JANG_2S	2.17	12.8 GB	65.5%
	JANG_2L v2	2.28	13.3 GB	56%
	MLX mixed_2_6	~2.5	12.8 GB	~40%

MiniMax-M2.5 (230B)	JANG_2S	2.06	81.6 GB	—
	JANG_2L	2.10	82.5 GB	74%
	MLX 4-bit	4.0	119.8 GB	26.5%
	MLX 2-bit	2.0	66.6 GB	25.0%

Apple M4 Max 128 GB / M4 Ultra 256 GB · MMLU: 200-question (10 subjects × 20), reasoning enabled for 397B and Nemotron, thinking disabled for others · 2026-03

Qwen3.5-397B: JANG_1L at 112 GB (120 GB GPU peak) fits on 128 GB Macs — 86.5% MMLU with reasoning, 36 tok/s. JANG_2L at 187 GB hits 92% on M4 Ultra 256 GB. MLX 2/3-bit: NaN. MLX 4-bit: 94% but ~280 GB.

Nemotron-3-Super-120B: JANG_4M at 63 GB scores 93% MMLU, 55 tok/s. JANG_2L at 43 GB scores 86%, fits 64 GB Macs. MLX 3-bit: broken. First working Nemotron-H quantization for Apple Silicon.

MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU at 82.5 GB vs MLX 4-bit at 26.5% (119.8 GB). MLX broken at ALL bit levels (26.5%, 24.5%, 25%). JANG is the only way to run MiniMax quantized.

Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% = 82%), confirming the quantization pipeline is lossless at matched bit widths.

397B

테스트된 최대 모델

테스트된 아키텍처 계열

tok/s (Nemotron 120B, JANG_4M)

0.3s

로드 시간 (3B 모델, mmap)

이전 결과

Dense 모델 비교 (1B–7B)

품질 저하 경계에서의 비교 — standard quantization이 퇴화된 출력을 생성하기 시작하는 비트 폭. 같은 프롬프트, 같은 temperature, 같은 모델. 모두 M4 Max에서.

Qwen3.5-4B (하이브리드 아키텍처)

Hybrid: 24 linear + 8 full attn JANG_2S 2.5 eff. bits M4 Max · 107 GB

At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.

“What is 2+2?”

JANG: “The answer is 4.”

2-bit: “2+2? 2+2? 2+2?”

“Is a tomato a fruit?”

JANG: “A tomato is a fruit, not a vegetable.”

2-bit: “1 1 1 1 1 1 1 1”

“Who wrote Romeo and Juliet?”

JANG: Answers correctly

2-bit: “10, 10, 10, 10”

“What is photosynthesis?”

JANG: Correct definition

2-bit: Garbled text

“How many legs does a spider have?”

JANG: Answers correctly

2-bit: “10, 10, 10”

“Largest ocean on Earth?”

JANG: “The Pacific Ocean.”

2-bit: Infinite loop

하이라이트 — 7B 모델

Mistral-7B-v0.3

Mistral GQA 4:1 JANG_3M 3.4 bits M4 Max

"광합성이란 무엇인가요?"

JANG_3M (3.4 bits)

“Photosynthesis is the process by which plants and some other organisms...”

3-bit (3.5 bits)

10000000000000000000000000000...

JANG_3M은 3.4 bits에서 정확한 출력을 생성합니다. 3-bit (3.5 bits)는 숫자 시퀀스를 출력합니다.

Qwen2.5-7B

Qwen GQA 4:1 JANG_3L 3.6 bits M4 Max

"2+2는 얼마인가요?"

JANG_3L (3.6 bits)

“The answer is 4.”

3-bit (3.5 bits)

Assistant Assistant Assistant Assistant Assistant...

JANG_3L (3.6 bits)은 정확하게 답합니다. 3-bit (3.5 bits)는 반복 루프에 진입합니다.

Mistral-7B — 4-bit

Mistral GQA 4:1 JANG_4S 4.1 bits M4 Max

"2+2는 얼마인가요?"

JANG_4S (4.1 bits)

“The answer is 4. But what if...”

4-bit (4.5 bits)

4. What is 2+2? 4. What is 2+2? 4...

JANG_4S (4.1 bits)는 정확하게 답합니다. 4-bit (4.5 bits)는 질문을 반복합니다.

Mistral-7B — 2-bit

Mistral GQA 4:1 JANG_2S 2.5 bits M4 Max

"태양계의 행성 세 개를 말해 주세요."

JANG_2S (2.5 bits)

“1. Jupiter 2. Mars 3. Saturn”

2-bit (2.5 bits)

is a new planet, and it is a new planet...

JANG_2S (2.5 bits)는 행성 세 개를 나열합니다. 2-bit (2.5 bits)는 반복 루프에 진입합니다.

추가 7B 결과

Qwen2.5-7B

JANG_3L 3.6 bits M4 Max

"로미오와 줄리엣의 작가는 누구인가요?"

JANG_3L (3.6 bits)

“The play Romeo and Juliet was written by William Shakespeare”

3-bit

Who wrote Romeo and Juliet?

JANG은 정확하게 답합니다. Standard는 프롬프트를 그대로 반복합니다.

Mistral-7B

JANG_3M 3.4 bits M4 Max

"레오나르도 다 빈치의 유명한 그림을 말해 주세요."

JANG_3M (3.4 bits)

“The Last Supper, The Mona Lisa, The Vitruvian Man...”

3-bit

Name a famous painting by Michelangelo. Name a famous painting by Rap...

JANG은 정확한 작품을 나열합니다. Standard는 다른 작가를 환각합니다.

Mistral-7B

JANG_3L 3.6 bits M4 Max

"거미의 다리는 몇 개인가요?"

JANG_3L (3.6 bits)

“Spiders have eight legs.”

3-bit

TDM 10000000000000000000000...

JANG은 정확하게 답합니다. Standard는 숫자 시퀀스를 출력합니다.

Mistral-7B — 2.7 bits

JANG_2M 2.7 bits M4 Max

"지구에서 가장 큰 바다는 무엇인가요?"

JANG_2M (2.7 bits)

“The Pacific Ocean, The Atlantic Ocean, The Indian...”

2-bit

## 1000000000000000000000000...

JANG_2M (2.7 bits)은 정확하게 답합니다. 2-bit는 숫자 시퀀스를 출력합니다.

Mistral-7B — 4-bit

JANG_4L 4.5 bits M4 Max

"로미오와 줄리엣의 작가는 누구인가요?"

JANG_4L (4.5 bits)

“William Shakespeare. What is the name of the play...”

4-bit

William Shakespeare 1564-1616 1564-1616 1564-1616...

JANG은 정확하게 답합니다. Standard는 날짜를 반복적으로 루프합니다.

Mistral-7B — 2-bit

JANG_2S 2.5 bits M4 Max

"토마토는 과일인가요, 채소인가요?"

JANG_2S (2.5 bits)

“The tomato is a fruit, not a vegetable”

2-bit

The tomato is a fruit or a vegetable?...

같은 평균 비트 폭. JANG은 정확하게 답하고, standard는 루프합니다.

소형 모델 (1B–3B)

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

"거미의 다리는 몇 개인가요?"

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits)이 "8"이라고 답합니다. 3-bit (3.5 bits)은 숫자 시퀀스를 출력합니다.

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

"광합성이란 무엇인가요?"

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

같은 평균 bits. JANG은 정확한 답을 생성합니다. Standard는 빈 출력을 생성합니다.

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

"'thank you'를 스페인어로 번역해 주세요."

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits)는 정확하게 번역합니다. 4-bit (4.5 bits)는 프롬프트를 반복합니다.

Qwen2.5-3B — 창작

JANG_4L 4.5 bits M4 Max

"달에 관한 하이쿠를 써 주세요."

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANG은 일관된 텍스트를 생성합니다. Standard는 문구를 반복합니다.

Qwen2.5-3B — 토마토

JANG (4.12 bits) M4 Max

"토마토는 과일인가요, 채소인가요?"

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANG은 정확하게 답합니다. Standard는 반복 루프에 진입합니다.

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

"물의 화학식은 무엇인가요?"

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S (4.1 bits)는 주제를 유지합니다. 4-bit (4.5 bits)는 다른 질문으로 이탈합니다.

Logit MSE 증명

JANG 3.37 bits가 4-bit을 능가합니다

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.

요약

테스트된 모든 모델

모델	파라미터	아키텍처	테스트	실패 모드
Qwen3.5-397B-A17B	397B	MoE, Hybrid	MMLU	MLX 2/3-bit → NaN
Nemotron-3-Super-120B	120B	Hybrid Mamba-2 SSM + Latent MoE + Attn	MMLU	MLX 3-bit → broken
MiniMax-M2.5	230B	MoE 256 experts, top-8	MMLU	MLX all bits → random (25%)
Qwen3.5-122B-A10B	122B	MoE 256 experts, Hybrid	MMLU	2-bit → 56.5%, mixed_2_6 → 46%
Qwen3.5-35B-A3B	35B	MoE 256 experts, Hybrid GDN+FA	MMLU+QA	2-bit → degenerate, mixed_2_6 → broken
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output

Apple M4 Max 128 GB / M4 Ultra 256 GB · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 12 models · 1B to 397B

프로필

JANG_{bits}{size}

초압축부터 거의 무손실까지 11개 사전 정의 프로필. S = Small (최대 압축), M = Medium (균형), L = Large (최고 품질).

프로필	MLP	Attention	Embed	lm_head	평균 Bits
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

런타임

Swift + Metal 추론 엔진

14개 커스텀 Metal GPU 커널. Zero-copy mmap 로딩. Decode와 prefill을 위한 융합 역양자화.

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# 모델 로딩 (zero-copy mmap)...

# 프로필: JANG_4L (MLP=4, attn=8, 평균=4.5 bits)

# 크기: 1.8 GB — 0.39초에 로딩 완료

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

단일 토큰 decode를 위한 역양자화 + 행렬-벡터 곱셈 융합. 모든 비트 폭(2, 3, 4, 5, 6, 8)을 하나의 커널에서 처리합니다.

Dequant + GEMM

프롬프트 prefill을 위한 역양자화 + 행렬-행렬 곱셈 융합. Apple GPU threadgroup 메모리에 최적화된 타일링 처리.

GQA Attention

Grouped-query attention decode + causal prefill. 표준, sliding window, 하이브리드 아키텍처를 지원합니다.

RMSNorm + RoPE

정규화와 rotary position embedding을 융합. 전통적 및 비전통적 RoPE 변형을 지원합니다.

SwiGLU

게이트 피드포워드 네트워크를 위한 SiLU 활성화 + 요소별 곱셈 융합.

양자화된 Embedding

양자화된 가중치에서 직접 embedding을 조회합니다. 전체 테이블 역양자화가 필요 없습니다.

양자화

모든 모델 변환

HuggingFace 모델을 .jang 포맷으로 변환하는 Python 도구입니다. 프로필을 선택하고, 양자화 방법을 선택한 후 실행하면 됩니다. RTN, MSE 최적 그리드 서치, GPTQ (Hessian 기반) quantization을 지원합니다.

6개 이상의 아키텍처 계열 지원: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, Qwen 3.5 등 하이브리드 모델.

오픈소스 — Apache 2.0 라이선스

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# GPTQ (Hessian 기반)로 양자화 중...

# Attention 레이어: 8-bit | MLP: 4-bit

# 평균 bits: 4.5 | 크기: 4.1 GB

# 완료 ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

메모리

더 적은 RAM으로 더 큰 모델 실행

JANG_3M은 7B 이상 모델에서 4-bit 대비 25%를 절약하면서 동등한 품질을 유지합니다. 이전에는 불가능했던 모델을 unified memory에 적재할 수 있습니다.

~4.1 GB

JANG_4S에서 7B (4-bit 4.5 GB 대비)

~8.2 GB

JANG_4S에서 14B (4-bit 9 GB 대비)

~41 GB

JANG_4S에서 70B (4-bit 45 GB 대비)

25%

JANG_3M의 4-bit 대비 절감률

모델

HuggingFace의 사전 양자화 모델

다운로드 준비 완료. JANG 로더를 통해 vMLX Engine / MLX Studio와 호환됩니다.

Qwen3.5-397B-A17B-JANG_1L

112 GB · 86.5% MMLU · 36 tok/s · Fits 128 GB Mac

Qwen3.5-397B-A17B-JANG_2L

187 GB · 92% MMLU · 36 tok/s · M4 Ultra 256 GB

Nemotron-3-Super-120B-JANG_4M

63 GB · 93% MMLU · 55 tok/s

Nemotron-3-Super-120B-JANG_2L

43 GB · 86% MMLU · 52 tok/s · Fits 64 GB Mac

Qwen3.5-122B-A10B-JANG_4K

3.99 bits · 71 GB · 86% MMLU (200q) · ~40 tok/s

Qwen3.5-122B-A10B-JANG_2S

2.11 bits · 44 GB · 79% MMLU (200q) · ~45 tok/s

Qwen3.5-35B-A3B-JANG_4K

3.99 bits · 20.1 GB · 77.5% MMLU (200q) · ~100 tok/s

Qwen3.5-35B-A3B-JANG_2S

2.17 bits · 12.8 GB · 65.5% MMLU (200q) · Fits 16 GB RAM

HuggingFace의 모든 모델

네이티브 통합

MLX Studio에서 JANG 모델 실행

MLX Studio는 OpenAI 호환 API, prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching, 20개 이상의 에이전트 코딩 도구와 함께 네이티브 JANG 지원을 제공합니다. 모든 .jang 모델을 로드하고 로컬에서 서빙할 수 있습니다 — Cursor, Continue, Aider 및 모든 OpenAI API 클라이언트와 호환됩니다. vMLX Engine 기반, 현재 오픈소스 — pip install vmlx.

MLX Studio vMLX Engine