오픈소스

JANG

The GGUF for MLX

Apple Silicon용 MLX 혼합 정밀도 양자화.

MLX의 균일 양자화는 모든 레이어에 동일한 비트 폭을 적용합니다. 낮은 비트(2-3)에서 어텐션 레이어가 먼저 열화됩니다 — 출력이 반복 루프나 퇴화 시퀀스로 붕괴합니다. JANG은 감도에 따라 텐서별 가변 비트 폭을 할당합니다: 어텐션에 더 많은 비트, MLP에 더 적은 비트.

모델은 GPU 메모리에 양자화된 상태로 유지되며 MLX의 네이티브 quantized_matmul 커널을 사용하여 즉석 역양자화됩니다 — float16 확장 없이, 속도 패널티 없이. 포맷은 safetensors 기반. 양자화 도구와 런타임은 Apache 2.0 오픈소스.

중요도 기반 비트 할당 2-bit ~ 8-bit 혼합 정밀도 14 custom Metal GPU kernels Swift + Metal 런타임 블록별 가변 비트 폭 오픈소스 · Apache 2.0

GitHub에서 보기 결과 보기

94%

122B MMLU (JANG_4K)

90%

122B에서 2.14 bits HumanEval

42 GB

디스크 (MLX mixed_2_6보다 작음)

Apache 2.0

오픈소스 라이선스

작동 방식

레이어 감도에 따른 가변 비트 폭

균일 양자화는 모든 텐서에 동일한 비트 폭을 적용합니다. 어텐션 레이어(파라미터의 ~12%)는 MLP 레이어보다 정밀도 손실에 더 민감합니다 — 과도하게 양자화되면 어텐션 스코어가 평탄해지고, 위치 인코딩이 열화되며, 출력이 퇴화됩니다.

JANG은 텐서를 감도 등급으로 분류하고 그에 따라 비트 폭을 할당합니다. 어텐션 레이어에 5–8 bits, MLP에 2–4 bits 압축. 오버헤드는 평균 ~0.3 bits.

Attention

8-bit — 보호됨

MLP

2-bit — 압축됨

Embed

4-bit

lm_head

6-bit

결과

                JANG_2M
                 → 2.7 avg bits → 
                일관된 출력
              

                3-bit
                 → 3.0 avg bits → 
                반복 루프
              

MMLU Benchmark

JANG vs MLX — side by side

Each JANG model compared against the closest MLX method by size. 50-question MMLU, thinking disabled, temp 0.0. Apple M4 Max 128 GB.

Qwen3.5-122B-A10B — ~4 bits — NEW

JANG

JANG_4K

69 GB · 3.99 bits · ~40 tok/s

94%

MMLU

+4 points vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

90%

MMLU

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2M

42 GB · 2.14 bits

84%

MMLU

+38 points · 2 GB smaller

MLX

mixed_2_6

44 GB · ~2.5 bits

46%

MMLU

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

16.7 GB · 3.99 bits · ~100 tok/s

84%

MMLU

+2 points · 1.3 GB smaller

MLX

4-bit

18 GB · 4.0 bits · ~110 tok/s

82%

MMLU

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12 GB · 2.17 bits · fits 16 GB RAM

62%

MMLU

+28 points · 1 GB smaller

MLX

mixed_2_6

13 GB · ~2.5 bits

34%

MMLU

Qwen3.5-122B-A10B — JANG_1L — 200-question MMLU

JANG

JANG_1L

51 GB · 2.24 bits · 48 tok/s

73%

MMLU (200 questions) · Wins 9/10 subjects

+27 points

MLX

mixed_2_6

44 GB · ~2.5 bits · 66 tok/s

46%

MMLU (200 questions)

All scores verified from HuggingFace model cards. Download: JANG_4K · JANG_2S · JANG_1L

추가 7B 결과

Qwen2.5-7B

JANG_3L 3.6 bits M4 Max

“Who wrote Romeo and Juliet?”

JANG_3L (3.6 bits)

“The play Romeo and Juliet was written by William Shakespeare”

3-bit

Who wrote Romeo and Juliet?

JANG answers correctly. Standard echoes the prompt back.

Mistral-7B

JANG_3M 3.4 bits M4 Max

“Name a famous painting by Leonardo da Vinci.”

JANG_3M (3.4 bits)

“The Last Supper, The Mona Lisa, The Vitruvian Man...”

3-bit

Name a famous painting by Michelangelo. Name a famous painting by Rap...

JANG lists correct works. Standard hallucinates different artists.

Mistral-7B

JANG_3L 3.6 bits M4 Max

“How many legs does a spider have?”

JANG_3L (3.6 bits)

“Spiders have eight legs.”

3-bit

TDM 10000000000000000000000...

JANG answers correctly. Standard outputs number sequences.

Mistral-7B — 2.7 bits

JANG_2M 2.7 bits M4 Max

“What is the largest ocean on Earth?”

JANG_2M (2.7 bits)

“The Pacific Ocean, The Atlantic Ocean, The Indian...”

2-bit

## 1000000000000000000000000...

JANG_2M (2.7 bits) answers correctly. 2-bit outputs number sequences.

Mistral-7B — 4-bit

JANG_4L 4.5 bits M4 Max

“Who wrote Romeo and Juliet?”

JANG_4L (4.5 bits)

“William Shakespeare. What is the name of the play...”

4-bit

William Shakespeare 1564-1616 1564-1616 1564-1616...

JANG answers correctly. Standard loops the dates repeatedly.

Mistral-7B — 2-bit

JANG_2S 2.5 bits M4 Max

“Is a tomato a fruit or vegetable?”

JANG_2S (2.5 bits)

“The tomato is a fruit, not a vegetable”

2-bit

The tomato is a fruit or a vegetable?...

Same average bit width. JANG answers correctly, standard loops.

소형 모델 (1B–3B)

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

“How many legs does a spider have?”

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits) answers “8”. 3-bit (3.5 bits) outputs number sequences.

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

“What is photosynthesis?”

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

Same average bits. JANG produces correct answer. Standard produces empty output.

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits) translates correctly. 4-bit (4.5 bits) echoes the prompt.

Qwen2.5-3B — Creative

JANG_4L 4.5 bits M4 Max

“Write a haiku about the moon.”

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANG generates coherent text. Standard loops a phrase.

Qwen2.5-3B — Tomato

JANG (4.12 bits) M4 Max

“Is a tomato a fruit or vegetable?”

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANG answers correctly. Standard enters repetition loop.

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

“What is the chemical formula for water?”

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S (4.1 bits) stays on topic. 4-bit (4.5 bits) derails to a different question.

Logit MSE 증명

JANG 3.37 bits가 균일 4-bit을 능가

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG 3.37 bits (MSE 11.10)가 균일 4.00 bits (MSE 11.31)를 능가 — 16% 적은 비트로 더 나은 품질.

요약

테스트된 모든 모델

Model	Params	Architecture	테스트	실패 모드
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences, 4b → loops
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct

모든 테스트: Apple M4 Max · 107 GB 통합 메모리 · MLX affine 양자화 · group_size=64 · 동일 토크나이저 · 동일 프롬프트 템플릿 · 45 실험 · 8 모델 · Qwen3.5-9B 다운로드 완료, 테스트 대기 중

프로파일

JANG_{bits}{size}

초고압축에서 거의 무손실까지 11개 사전 정의 프로파일. S = Small (최대 압축), M = Medium (균형), L = Large (최고 품질).

Profile	MLP	Attention	Embed	lm_head	Avg Bits
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

런타임

Swift + Metal 추론 엔진

14개 커스텀 Metal GPU 커널. 제로카피 mmap 로딩. 디코드 및 프리필용 융합 역양자화.

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# Loading model (zero-copy mmap)...

# Profile: JANG_4L (MLP=4, attn=8, avg=4.5 bits)

# Size: 1.8 GB — loaded in 0.39s

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Fused dequantization + matrix-vector multiply for single-token decode. All bit widths (2, 3, 4, 5, 6, 8) in one kernel.

Dequant + GEMM

Fused dequantization + matrix-matrix multiply for prompt prefill. Tiled for Apple GPU threadgroup memory.

GQA Attention

Grouped-query attention decode + causal prefill. Supports standard, sliding window, and hybrid architectures.

RMSNorm + RoPE

Fused normalization and rotary position embedding. Traditional and non-traditional RoPE variants.

SwiGLU

Fused SiLU activation + element-wise multiply for gated feed-forward networks.

Quantized Embedding

Direct embedding lookup from quantized weights. No full-table dequantization needed.

양자화

모든 모델 변환

HuggingFace 모델을 .jang 포맷으로 변환하는 Python 도구. 프로파일을 선택하고, 양자화 방법을 선택하고, 실행. RTN, MSE 최적 그리드 서치, GPTQ (헤시안 기반) 양자화 지원.

6+ 아키텍처 패밀리: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE, Qwen 3.5 포함 하이브리드 모델.

오픈소스 — Apache 2.0 라이선스

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# Quantizing with GPTQ (Hessian-guided)...

# Attention layers: 8-bit | MLP: 4-bit

# Average bits: 4.5 | Size: 4.1 GB

# Done ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

메모리

더 적은 RAM으로 더 큰 모델 실행

JANG_3M은 7B+ 모델에서 균일 4-bit 대비 25% 절약하면서 동등한 품질. 이전에 맞지 않던 모델을 통합 메모리에 적재.

~4.1 GB

7B at JANG_4S (vs 4.5 GB 4-bit)

~8.2 GB

14B at JANG_4S (vs 9 GB 4-bit)

~41 GB

70B at JANG_4S (vs 45 GB 4-bit)

25%

Savings at JANG_3M vs 4-bit

모델

HuggingFace 사전 양자화 모델

다운로드 가능. JANG 로더를 통해 vMLX Engine / MLX Studio와 호환.

Qwen3.5-122B-A10B-JANG_2M

2.14 bits · 42 GB · 84% MMLU ·

Qwen3.5-27B-JANG_1L

JANG_1L profile · Dense model

Qwen3.5-35B-A3B-JANG_2L

2.28 bits · 13.3 GB · 4/6 correct · 106 tok/s

Qwen3.5-122B-A10B-JANG_2L

2.19 bits · 45.3 GB · 3/4 correct · 49 tok/s

HuggingFace 전체 모델

네이티브 통합

MLX Studio에서 JANG 모델 실행

MLX Studio는 OpenAI 호환 API, 프리픽스 캐싱, 페이지드 KV 캐시, KV 양자화 (q4/q8), 연속 배칭, 20+ 에이전트 코딩 도구와 함께 네이티브 JANG 지원. .jang 모델을 로드하여 로컬에서 서빙 — Cursor, Continue, Aider 및 모든 OpenAI API 클라이언트와 호환. vMLX Engine으로 구동, 오픈소스 — pip install vmlx.

MLX Studio vMLX Engine