Código abierto

JANG

The GGUF for MLX

397 mil millones de parámetros en un Mac de 128 GB. 92% MMLU. MLX ni siquiera puede cargarlo.

JANG_1L comprime el modelo 397B en 112 GB — un MacBook Pro de 128 GB puede ejecutarlo con razonamiento a 86.5% MMLU. MLX a 2-3 bits produce NaN. ¿MiniMax 230B? MLX obtiene 26.5% en todos los niveles de bits. ¿Nemotron-H 120B? MLX 3-bit está roto. JANG es la única forma de ejecutar estos modelos cuantizados en Apple Silicon.

JANG asigna más bits a attention y menos a MLP, manteniendo los modelos coherentes donde standard quantization produce basura o NaN. Misma velocidad, mismos kernels Metal — solo mejor salida. Código abierto bajo Apache 2.0.

Asignación de bits según importancia Precisión mixta de 2-bit a 8-bit 14 kernels GPU Metal personalizados Motor Swift + Metal Ancho de bits variable por bloque Código abierto · Apache 2.0

Ver en GitHub Ver resultados

397B

Modelo más grande — cabe en Mac de 128 GB

92%

MMLU en 397B (JANG_2L)

93%

MMLU en Nemotron-H 120B (JANG_4M)

Apache 2.0

Licencia de código abierto

Cómo funciona

Anchos de bits variables según la sensibilidad de las capas

La standard quantization aplica el mismo ancho de bits a cada tensor. Las capas de attention (~12% de los parámetros) son más sensibles a la pérdida de precisión que las capas MLP — cuando se cuantiza de forma demasiado agresiva, las puntuaciones de attention se aplanan, la codificación posicional se degrada y la salida degenera.

JANG clasifica los tensores en niveles de sensibilidad y asigna anchos de bits según corresponda. Las capas de attention reciben 5–8 bits mientras que MLP se comprime a 2–4 bits. El costo adicional es de ~0.3 bits extra en promedio.

Attention

8-bit — protegido

MLP

2-bit — comprimido

Embed

4-bit

lm_head

6-bit

Result

JANG_2M
 → 2.7 avg bits → 
coherent output

3-bit
 → 3.0 avg bits → 
repetition loops

MMLU Benchmark

JANG vs MLX — comparación directa

Cada modelo JANG comparado con el método MLX más cercano por tamaño. MMLU de 200 preguntas (20 por materia × 10 materias), thinking/reasoning activado donde se indica, temp 0.0. Apple M4 Max 128 GB / M4 Ultra 256 GB.

Qwen3.5-397B-A17B — 397 billion parameters — JANG vs MLX

JANG

JANG_1L

112 GB disk · 120 GB GPU peak · 36 tok/s · FITS 128 GB MACS

86.5%

MMLU (200q, reasoning) · 173/200

397B intelligence on a laptop

MLX

2-bit / 3-bit

Cannot run — NaN output

NaN

Model too complex for standard quantization

JANG

JANG_2L

187 GB disk · 197 GB GPU · 36 tok/s · M4 Ultra 256 GB

92%

MMLU (200q, reasoning) · 184/200

Near-FP16 quality at 2.x bits

MLX

4-bit

~280 GB · requires massive machines

94%

MMLU (200q, reasoning)

397B on a 128 GB Mac — first ever. JANG_1L at 112 GB disk (120 GB GPU peak) fits on a 128 GB MacBook Pro and scores 86.5% MMLU with reasoning. MLX at 2-bit and 3-bit produces NaN — the model is too complex for standard quantization at low bit widths. MLX 4-bit runs at 94% but needs ~280 GB, far beyond any laptop. JANG_2L at 187 GB hits 92% on an M4 Ultra 256 GB.

Nemotron-3-Super-120B-A12B — NVIDIA Hybrid Mamba-2 SSM + Latent MoE + Attention

JANG

JANG_4M

63 GB · 55 tok/s

93%

MMLU (200q, reasoning) · 186/200

First Nemotron-H on Apple Silicon

MLX

3-bit

Broken

—

Cannot produce valid output

JANG

JANG_2L

43 GB · 52 tok/s · fits 64 GB Macs

86%

MMLU (200q, reasoning) · 172/200

120B on a 64 GB Mac

First working Nemotron-H quantization for Apple Silicon. NVIDIA’s hybrid architecture combines Mamba-2 SSM, Latent MoE, and standard attention — MLX 3-bit is broken on it. JANG_4M at 63 GB scores 93% MMLU with reasoning at 55 tok/s. JANG_2L fits on a 64 GB Mac at 43 GB with 86% MMLU.

MiniMax-M2.5 (230B) — JANG vs MLX

JANG

JANG_2L

82.5 GB · 2.10 bits · 0.9s per question

74.0%

MMLU (200q) · 148/200

+47.5 points · MLX broken at ALL bit levels

MLX

4-bit

119.8 GB · 4.0 bits · 0.9s per question

26.5%

MMLU (200q) · 53/200

MLX is completely broken on MiniMax at every bit level — 4-bit (26.5%), 3-bit (24.5%), and 2-bit (25%) all score near random. JANG_2L at just 2.10 bits is the only way to run MiniMax quantized on Apple Silicon.

Per-subject breakdown — MiniMax-M2.5 (230B) — all methods

Materia	JANG_2L	MLX 4-bit	MLX 3-bit	MLX 2-bit
Abstract Algebra	10/20	3/20	2/20	5/20
Anatomy	15/20	7/20	5/20	5/20
Astronomy	20/20	7/20	6/20	4/20
College CS	13/20	4/20	5/20	6/20
College Physics	13/20	8/20	6/20	6/20
HS Biology	18/20	4/20	5/20	6/20
HS Chemistry	18/20	4/20	5/20	5/20
HS Mathematics	8/20	6/20	6/20	3/20
Logical Fallacies	18/20	5/20	4/20	5/20
World Religions	15/20	5/20	5/20	5/20
Total	148/200 (74%)	53/200 (26.5%)	49/200 (24.5%)	50/200 (25%)

JANG wins all 10 subjects against all MLX methods. MLX 4-bit, 3-bit, and 2-bit all score near random (25%). Root cause: MLX generates meta-commentary instead of direct answers on this model.

Qwen3.5-122B-A10B — ~4 bits

JANG

JANG_4K

71 GB · 3.99 bits · ~40 tok/s

86%

MMLU (200q) · 172/200

+1 point vs MLX 4-bit

MLX

4-bit

64 GB · 4.0 bits · ~50 tok/s

85%

MMLU (200q) · 170/200

Per-subject breakdown — 122B ~4 bits

Materia	JANG_4K	MLX 4-bit
Abstract Algebra	16/20	15/20
Anatomy	19/20	18/20
Astronomy	19/20	19/20
College CS	15/20	15/20
College Physics	14/20	14/20
HS Biology	19/20	19/20
HS Chemistry	18/20	18/20
HS Mathematics	14/20	14/20
Logical Fallacies	19/20	19/20
World Religions	19/20	19/20
Total	172/200 (86%)	170/200 (85%)

JANG wins 2 subjects, ties 8. Neck-and-neck at ~4 bits.

Qwen3.5-122B-A10B — ~2 bits

JANG

JANG_2S

44 GB · 2.11 bits · ~45 tok/s

79%

MMLU (200q) · 158/200

+22.5 points

MLX

2-bit

36 GB · 2.0 bits · ~52 tok/s

56.5%

MMLU (200q) · 113/200

Per-subject breakdown — 122B ~2 bits

Materia	JANG_2S	MLX 2-bit
Abstract Algebra	9/20	9/20
Anatomy	18/20	11/20
Astronomy	20/20	16/20
College CS	14/20	8/20
College Physics	15/20	10/20
HS Biology	19/20	15/20
HS Chemistry	18/20	13/20
HS Mathematics	11/20	4/20
Logical Fallacies	16/20	13/20
World Religions	18/20	14/20
Total	158/200 (79%)	113/200 (56.5%)

JANG wins 9 of 10 subjects, ties 1 (Abstract Algebra).

Qwen3.5-35B-A3B — ~4 bits

JANG

JANG_4K

20.1 GB · 3.99 bits · ~100 tok/s

77.5%

MMLU (200q) · 155/200

+2 points

MLX

4-bit

18.2 GB · 4.0 bits · ~110 tok/s

75.5%

MMLU (200q) · 151/200

Per-subject breakdown — 35B ~4 bits

Materia	JANG_4K	MLX 4-bit
Abstract Algebra	12/20	10/20
Anatomy	17/20	16/20
Astronomy	18/20	18/20
College CS	14/20	15/20
College Physics	14/20	13/20
HS Biology	18/20	18/20
HS Chemistry	17/20	17/20
HS Mathematics	10/20	8/20
Logical Fallacies	18/20	19/20
World Religions	17/20	17/20
Total	155/200 (77.5%)	151/200 (75.5%)

JANG wins 4 subjects, loses 2 (College CS, Logical Fallacies), ties 4.

Qwen3.5-35B-A3B — ~2 bits

JANG

JANG_2S

12.8 GB · 2.17 bits · fits 16 GB RAM

65.5%

MMLU (200q) · 131/200

+25 points

MLX

2-bit

12.8 GB · ~2.5 bits

~40%

MMLU (est. from 34% at 50q)

Per-subject breakdown — 35B ~2 bits (JANG only)

Materia	JANG_2S	MLX 2-bit
Abstract Algebra	8/20	—
Anatomy	14/20	—
Astronomy	19/20	—
College CS	14/20	—
College Physics	11/20	—
HS Biology	16/20	—
HS Chemistry	14/20	—
HS Mathematics	5/20	—
Logical Fallacies	14/20	—
World Religions	16/20	—
Total	131/200 (65.5%)	~40% (est.)

MLX 2-bit 200q not yet tested. Estimate based on 34% at 50 questions.

Test methodology & conditions

MMLU: 200-question subset (10 subjects × 20 questions each), thinking disabled, temperature 0.0.
Hardware: Apple M4 Max 128 GB unified memory.
Quantization: MLX affine quantization, group_size=64. JANG uses variable bit widths via quant_predicate.
Models: All methods use the same base model weights. JANG stays quantized in GPU memory using MLX’s native quantized_matmul — no float16 expansion.
Reproducibility: All scores verified from HuggingFace model cards. Code at github.com/jjang-ai/jangq.

Download: All models on HuggingFace — 397B, Nemotron-H 120B, 122B, 35B, MiniMax 230B, and more

Pruebas QA Prompt

Comparación triple en prompts básicos

Comparación en 6 preguntas factuales. Todos los métodos usan los kernels Metal nativos de MLX. Temperature 0.0, máximo 80 tokens. M4 Max 128 GB.

Qwen3.5-122B-A10B — JANG_1L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, top-8, 10B active, Hybrid JANG_1L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_1L · 2.24 bits

46.0 GB RAM · 48 tok/s

MLX mixed_2_6 · ~2.2 bits

44.9 GB RAM · 66 tok/s

2-bit · 2.0 bits

35.6 GB RAM · 67 tok/s

“What is 2+2?”

✓ “2+2 is 4”

∼ “2+2=4” then repeats

∼ “2+2=4” then loops

“Is a tomato a fruit?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: empty think tag

✗ 2-bit: rephrases, no answer

“What is photosynthesis?”

✓ “plants use energy of sun to make food”

✗ Degenerate output

✗ “Photos-sense y=y”

“Three planets larger?”

∼ JANG: uses <think> (partial)

∼ mixed_2_6: uses <think> (partial)

✗ Misreads question

“Who wrote Romeo and Juliet?”

∼ JANG: uses <think> (partial)

✗ mixed_2_6: double think tag

∼ 2-bit: uses <think> (partial)

“Capital of France?”

✓ “Paris”

✓ mixed_2_6: “Paris”

✓ 2-bit: “Paris” with details

JANG_1L: 3 correctas, 3 parciales, 0 fallidas · mixed_2_6: 1 correcta, 1 parcial, 4 fallidas · 2-bit: 1 correcta, 2 parciales, 3 fallidas

MLX’s mixed_2_6 mode protects select v_proj and down_proj layers at 6-bit, but does not account for GatedDeltaNet linear attention layers, MoE expert routing tensors, or hybrid architecture components. JANG’s tier system classifies these architecture-specific tensors explicitly.

MiniMax-M2.5 (230B) — JANG_2S (2.06 bits)

MoE 256 experts, top-8, 10B active JANG_2S · 2.06 bits Mac Studio M4 Ultra 192 GB

JANG_2S · 2.06 bits
81.6 GB GPU · 50 tok/s
JANG_2L · 2.10 bits
82.5 GB RAM · 74% MMLU (200q)

JANG_2S: 3/6 correctas a 2.06 bits · modelo 230B en 81.6 GB · 50 tok/s JANG_2L: 74% MMLU (200 preguntas) a 82.5 GB RAM — 3 veces superior a MLX 4-bit a 120 GB

Qwen3.5-35B-A3B — JANG_2L vs MLX mixed_2_6 vs 2-bit

MoE 256 experts, Hybrid GDN+FA JANG_2L vs mixed_2_6 vs 2-bit M4 Max 128 GB

JANG_2L · 2.28 bits

13.3 GB RAM · 100 tok/s

MLX mixed_2_6 · ~2.2 bits

12.8 GB RAM · 120 tok/s

2-bit · 2.0 bits

10.1 GB RAM · 128 tok/s

“What is 2+2?”

✓ “2+2 equals 4”

✗ “2+2=4” then loops

✗ Number sequences

“Is a tomato a fruit?”

✗ JANG: loops

∼ mixed_2_6: partial reasoning

✗ 2-bit: degenerate

“What is photosynthesis?”

✓ “convert light energy”

✗ “I cannot respond”

✗ “6 6 6”

“Three planets larger?”

✓ “Jupiter, Saturn, Uranus”

✗ “Antina” loops

✗ Number sequences

“Who wrote Romeo and Juliet?”

∼ JANG: “Shakespeare” (partial)

✗ mixed_2_6: contradicts itself

✗ 2-bit: degenerate

“Capital of France?”

✓ “Paris” with details

✗ Never answers

∼ 2-bit: “Paris” partial

JANG_2L: 4 correctas, 1 parcial, 1 fallida · mixed_2_6: 0 correctas, 1 parcial, 5 fallidas · 2-bit: 0 correctas, 1 parcial, 5 fallidas

On this hybrid MoE model, MLX mixed_2_6 does not improve over 2-bit. The mixed_2_6 heuristic targets v_proj and down_proj in standard transformer layers but misses GatedDeltaNet attention and MoE routing tensors that are critical for this architecture.

Qwen3.5-122B-A10B — 122 mil millones de parámetros, comparación directa

MoE 256 experts, top-8, 10B active JANG_2L vs 2-bit M4 Max 128 GB

JANG_2L · 2.19 bits
45.3 GB RAM · 38–49 tok/s
2-bit · 2.0 bits
35.6 GB RAM · 52–65 tok/s

“What is photosynthesis?”

“process by which green plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose”

“Photos-sense” then “y = y = y” degenerate

“Three planets larger than Earth?”

Uses <think> reasoning tags, lists Jupiter with details

Misreads as “larger than Earth’s moon”, rambles

“Capital of France?”

“Paris” with government details

“Paris, on the banks of the River Seine” — both correct

“What is 2+2?”

“2+2 is 4.” (then repeats) — PARTIAL

“2+2=4” then “2. 2. 2.” loops

JANG: 3/4 correctas  ·  2-bit: 1/4 correctas  ·  45.3 vs 35.6 GB GPU  ·  <code style="font-size:0.72rem"><think></code> razonamiento preservado a 2.19 bits

Todos los modelos comparados

Tamaño, velocidad y puntuaciones — JANG vs MLX

Modelo	Método	Bits	Tamaño	MMLU
Qwen3.5-397B-A17B	JANG_2L	~2.x	187 GB	92%
	JANG_1L	~2.2	112 GB	86.5%
	MLX 4-bit	4.0	~280 GB	94%
	MLX 2-bit / 3-bit	2-3	—	NaN

Nemotron-3-Super-120B	JANG_4M	~4.2	63 GB	93%
	JANG_2L	~2.x	43 GB	86%
	MLX 3-bit	3.0	—	Broken

Qwen3.5-122B-A10B	JANG_2M	2.14	44.7 GB	79%
	JANG_1L	2.24	46.0 GB	73%
	JANG_2L	2.19	45.3 GB	—
	MLX mixed_2_6	~2.5	45 GB	46%
	2-bit	2.0	36 GB	56.5%

Qwen3.5-35B-A3B	JANG_4K	3.99	20.1 GB	77.5%
	MLX 4-bit	4.0	18.2 GB	75.5%
	JANG_4S	4.04	20.4 GB	82%
	JANG_2S	2.17	12.8 GB	65.5%
	JANG_2L v2	2.28	13.3 GB	56%
	MLX mixed_2_6	~2.5	12.8 GB	~40%

MiniMax-M2.5 (230B)	JANG_2S	2.06	81.6 GB	—
	JANG_2L	2.10	82.5 GB	74%
	MLX 4-bit	4.0	119.8 GB	26.5%
	MLX 2-bit	2.0	66.6 GB	25.0%

Apple M4 Max 128 GB / M4 Ultra 256 GB · MMLU: 200-question (10 subjects × 20), reasoning enabled for 397B and Nemotron, thinking disabled for others · 2026-03

Qwen3.5-397B: JANG_1L at 112 GB (120 GB GPU peak) fits on 128 GB Macs — 86.5% MMLU with reasoning, 36 tok/s. JANG_2L at 187 GB hits 92% on M4 Ultra 256 GB. MLX 2/3-bit: NaN. MLX 4-bit: 94% but ~280 GB.

Nemotron-3-Super-120B: JANG_4M at 63 GB scores 93% MMLU, 55 tok/s. JANG_2L at 43 GB scores 86%, fits 64 GB Macs. MLX 3-bit: broken. First working Nemotron-H quantization for Apple Silicon.

MiniMax-M2.5 (230B): JANG_2L scores 74% MMLU at 82.5 GB vs MLX 4-bit at 26.5% (119.8 GB). MLX broken at ALL bit levels (26.5%, 24.5%, 25%). JANG is the only way to run MiniMax quantized.

Pipeline verification: JANG_4S matches MLX 4-bit exactly on 35B MMLU (82% = 82%), confirming the quantization pipeline is lossless at matched bit widths.

397B

Modelo más grande probado

Familias de arquitecturas probadas

tok/s (Nemotron 120B, JANG_4M)

0.3s

Tiempo de carga (modelo 3B, mmap)

Resultados anteriores

Comparaciones de modelos densos (1B–7B)

Comparaciones en el límite de degradación — el ancho de bits donde la standard quantization comienza a producir salida degenerada. Mismos prompts, misma temperature, mismo modelo. Todo en M4 Max.

Qwen3.5-4B (Arquitectura híbrida)

Hybrid: 24 linear + 8 full attn JANG_2S 2.5 eff. bits M4 Max · 107 GB

At 2.5 effective bits, JANG_2S gets 6/6 correct while 2-bit gets 0/6. JANG protects the 8 critical full-attention layers at 6-bit while compressing the 24 linear-attention layers and all MLP at 2-bit.

“What is 2+2?”

JANG: “The answer is 4.”

2-bit: “2+2? 2+2? 2+2?”

“Is a tomato a fruit?”

JANG: “A tomato is a fruit, not a vegetable.”

2-bit: “1 1 1 1 1 1 1 1”

“Who wrote Romeo and Juliet?”

JANG: Answers correctly

2-bit: “10, 10, 10, 10”

“What is photosynthesis?”

JANG: Correct definition

2-bit: Garbled text

“How many legs does a spider have?”

JANG: Answers correctly

2-bit: “10, 10, 10”

“Largest ocean on Earth?”

JANG: “The Pacific Ocean.”

2-bit: Infinite loop

Destacados — modelos 7B

Mistral-7B-v0.3

Mistral GQA 4:1 JANG_3M 3.4 bits M4 Max

"¿Qué es la fotosíntesis?"

JANG_3M (3.4 bits)

“Photosynthesis is the process by which plants and some other organisms...”

3-bit (3.5 bits)

10000000000000000000000000000...

JANG_3M a 3.4 bits produce salida correcta. 3-bit (3.5 bits) produce secuencias de números.

Qwen2.5-7B

Qwen GQA 4:1 JANG_3L 3.6 bits M4 Max

"¿Cuánto es 2+2?"

JANG_3L (3.6 bits)

“The answer is 4.”

3-bit (3.5 bits)

Assistant Assistant Assistant Assistant Assistant...

JANG_3L (3.6 bits) responde correctamente. 3-bit (3.5 bits) entra en bucle de repetición.

Mistral-7B — 4-bit

Mistral GQA 4:1 JANG_4S 4.1 bits M4 Max

"¿Cuánto es 2+2?"

JANG_4S (4.1 bits)

“The answer is 4. But what if...”

4-bit (4.5 bits)

4. What is 2+2? 4. What is 2+2? 4...

JANG_4S (4.1 bits) responde correctamente. 4-bit (4.5 bits) repite la pregunta en bucle.

Mistral-7B — 2-bit

Mistral GQA 4:1 JANG_2S 2.5 bits M4 Max

"Nombre tres planetas del sistema solar."

JANG_2S (2.5 bits)

“1. Jupiter 2. Mars 3. Saturn”

2-bit (2.5 bits)

is a new planet, and it is a new planet...

JANG_2S (2.5 bits) enumera tres planetas. 2-bit (2.5 bits) entra en bucle de repetición.

Más resultados 7B

Qwen2.5-7B

JANG_3L 3.6 bits M4 Max

"¿Quién escribió Romeo y Julieta?"

JANG_3L (3.6 bits)

“The play Romeo and Juliet was written by William Shakespeare”

3-bit

Who wrote Romeo and Juliet?

JANG responde correctamente. Standard repite el prompt.

Mistral-7B

JANG_3M 3.4 bits M4 Max

"Nombre una pintura famosa de Leonardo da Vinci."

JANG_3M (3.4 bits)

“The Last Supper, The Mona Lisa, The Vitruvian Man...”

3-bit

Name a famous painting by Michelangelo. Name a famous painting by Rap...

JANG enumera las obras correctas. Standard alucina con artistas diferentes.

Mistral-7B

JANG_3L 3.6 bits M4 Max

"¿Cuántas patas tiene una araña?"

JANG_3L (3.6 bits)

“Spiders have eight legs.”

3-bit

TDM 10000000000000000000000...

JANG responde correctamente. Standard produce secuencias de números.

Mistral-7B — 2.7 bits

JANG_2M 2.7 bits M4 Max

"¿Cuál es el océano más grande de la Tierra?"

JANG_2M (2.7 bits)

“The Pacific Ocean, The Atlantic Ocean, The Indian...”

2-bit

## 1000000000000000000000000...

JANG_2M (2.7 bits) responde correctamente. 2-bit produce secuencias de números.

Mistral-7B — 4-bit

JANG_4L 4.5 bits M4 Max

"¿Quién escribió Romeo y Julieta?"

JANG_4L (4.5 bits)

“William Shakespeare. What is the name of the play...”

4-bit

William Shakespeare 1564-1616 1564-1616 1564-1616...

JANG responde correctamente. Standard repite las fechas en bucle.

Mistral-7B — 2-bit

JANG_2S 2.5 bits M4 Max

"¿El tomate es una fruta o una verdura?"

JANG_2S (2.5 bits)

“The tomato is a fruit, not a vegetable”

2-bit

The tomato is a fruit or a vegetable?...

Mismo ancho de bits promedio. JANG responde correctamente, standard entra en bucle.

Modelos más pequeños (1B–3B)

SmolLM2-1.7B

Llama MHA JANG_3M 3.4 bits M4 Max

"¿Cuántas patas tiene una araña?"

JANG_3M (3.4 bits)

“8. How many arms does a spider have? Answer: 8”

3-bit

2 1/2 1/2 1/2 1/2 1/2 1/2 1/2

JANG_3M (3.4 bits) responde "8". 3-bit (3.5 bits) produce secuencias de números.

Phi-2 (2.7B)

Phi MHA JANG_2S 2.5 bits M4 Max

"¿Qué es la fotosíntesis?"

JANG_2S (2.5 bits)

“Photosynthesis is the process by which plants use sunlight to con...”

2-bit

(empty output)

Mismos bits promedio. JANG produce la respuesta correcta. Standard produce salida vacía.

Qwen2.5-3B

Qwen GQA 8:1 JANG_4S 4.1 bits M4 Max

"Traduce 'thank you' al español."

JANG_4S (4.1 bits)

“Thank you in Spanish is 'gracias'.”

4-bit

“Translate 'thank you' to Spanish.”

JANG_4S (4.1 bits) traduce correctamente. 4-bit (4.5 bits) repite el prompt.

Qwen2.5-3B — Creativo

JANG_4L 4.5 bits M4 Max

"Escribe un haiku sobre la luna."

JANG_4L (4.5 bits)

“The moon’s glow, a tranquil sight...”

4-bit

ancient sky, lunar glow, ancient sky, lunar glow...

JANG genera texto coherente. Standard repite una frase en bucle.

Qwen2.5-3B — Tomate

JANG (4.12 bits) M4 Max

"¿El tomate es una fruta o una verdura?"

JANG (4.12 bits)

“A tomato is a fruit. It is botanically classified...”

4-bit

Is a tomato a fruit or vegetable? Is it a vegetabl...

JANG responde correctamente. Standard entra en bucle de repetición.

TinyLlama-1.1B

Llama GQA 8:1 JANG_4S 4.1 bits M4 Max

"¿Cuál es la fórmula química del agua?"

JANG_4S (4.1 bits)

“What is the chemical formula for water? Answers: 1. H...”

4-bit

“What is the chemical formula for hydrogen peroxide?...”

JANG_4S (4.1 bits) se mantiene en el tema. 4-bit (4.5 bits) se desvía a una pregunta diferente.

Prueba Logit MSE

JANG a 3.37 bits supera a 4-bit

Qwen2.5-3B · “What is 2+2?” · Logit MSE vs bf16 reference · Lower is better

Apple M4 Max · 107 GB unified memory · affine quantization · group_size=64

MLP=4, attn=8

7.13 MSE — 4.49 bits

MLP=4, attn=6

8.70 MSE — 4.24 bits

4-bit

11.31 MSE — 4.00 bits

MLP=3, attn=6

11.10 MSE — 3.37 bits ✔

JANG at 3.37 bits (MSE 11.10) beats 4.00 bits (MSE 11.31) — 16% fewer bits with better quality.

Resumen

Todos los modelos probados

Modelo	Parámetros	Arquitectura	Pruebas	Modo de fallo
Qwen3.5-397B-A17B	397B	MoE, Hybrid	MMLU	MLX 2/3-bit → NaN
Nemotron-3-Super-120B	120B	Hybrid Mamba-2 SSM + Latent MoE + Attn	MMLU	MLX 3-bit → broken
MiniMax-M2.5	230B	MoE 256 experts, top-8	MMLU	MLX all bits → random (25%)
Qwen3.5-122B-A10B	122B	MoE 256 experts, Hybrid	MMLU	2-bit → 56.5%, mixed_2_6 → 46%
Qwen3.5-35B-A3B	35B	MoE 256 experts, Hybrid GDN+FA	MMLU+QA	2-bit → degenerate, mixed_2_6 → broken
Qwen3.5-4B	4B	Hybrid: 24 linear + 8 full attn	6	2-bit → 0/6 correct
Mistral-7B	7B	Mistral GQA 4:1, sliding window	13	3-bit → number sequences
Qwen2.5-7B	7B	Qwen GQA 4:1	9	3-bit → repetition loop
Qwen2.5-3B	3B	Qwen GQA 8:1	6	4-bit → echo/loop
SmolLM2-1.7B	1.7B	Llama MHA	11	3-bit → number sequences
TinyLlama-1.1B	1.1B	Llama GQA 8:1	11	4-bit → topic derail
Phi-2	2.7B	Phi MHA, GELU MLP	9	2-bit → empty output

Apple M4 Max 128 GB / M4 Ultra 256 GB · MLX affine quantization · group_size=64 · same tokenizer · same prompt template · 12 models · 1B to 397B

Perfiles

JANG_{bits}{size}

11 perfiles predefinidos desde ultra-comprimido hasta casi sin pérdida. S = Small (máxima compresión), M = Medium (equilibrado), L = Large (mejor calidad).

Perfil	MLP	Attention	Embed	lm_head	Bits promedio
JANG_1L	2-bit	8-bit	8-bit	8-bit	~2.2
JANG_2S	2-bit	6-bit	4-bit	6-bit	~2.5
JANG_2M	2-bit	8-bit	4-bit	8-bit	~2.7
JANG_2L	2-bit	8-bit	6-bit	8-bit	~2.9
JANG_3S	3-bit	4-bit	4-bit	6-bit	~3.1
JANG_3M	3-bit	6-bit	4-bit	6-bit	~3.4
JANG_3L	3-bit	8-bit	4-bit	8-bit	~3.6
JANG_4S	4-bit	5-bit	4-bit	6-bit	~4.1
JANG_4M	4-bit	6-bit	4-bit	6-bit	~4.2
JANG_4L	4-bit	8-bit	4-bit	8-bit	~4.5
JANG_6M	6-bit	8-bit	6-bit	8-bit	~6.2

Motor

Motor de inferencia Swift + Metal

14 kernels GPU Metal personalizados. Carga mmap sin copia. Decuantización fusionada para decode y prefill.

jang — Terminal

$ jang run --model Qwen2.5-3B-JANG_4L.jang

# Cargando modelo (zero-copy mmap)...

# Perfil: JANG_4L (MLP=4, attn=8, prom=4.5 bits)

# Tamaño: 1.8 GB — cargado en 0.39s

> What is photosynthesis?

Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water. It generally involves the green pigment chlorophyll and generates oxygen as a byproduct.

Dequant + GEMV

Decuantización fusionada + multiplicación matriz-vector para decode de un solo token. Todos los anchos de bits (2, 3, 4, 5, 6, 8) en un solo kernel.

Dequant + GEMM

Decuantización fusionada + multiplicación matriz-matriz para prefill de prompt. Optimizado con tiles para memoria threadgroup de GPU Apple.

GQA Attention

Decode de grouped-query attention + prefill causal. Soporta arquitecturas estándar, sliding window e híbridas.

RMSNorm + RoPE

Normalización fusionada con rotary position embedding. Variantes de RoPE tradicionales y no tradicionales.

SwiGLU

Activación SiLU fusionada + multiplicación por elemento para redes feed-forward con compuertas.

Embedding cuantizado

Búsqueda directa de embedding desde pesos cuantizados. No requiere decuantización de tabla completa.

Cuantizar

Convierte cualquier modelo

Herramientas Python para convertir modelos de HuggingFace al formato .jang. Seleccione un perfil, elija su método de cuantización y ejecute. Soporta RTN, búsqueda de cuadrícula MSE-óptima y cuantización GPTQ (guiada por Hessian).

6+ familias de arquitecturas: Llama, Qwen, Gemma, Phi, Mistral, Mamba/SSM, MoE y modelos híbridos incluyendo Qwen 3.5.

Código abierto — Licencia Apache 2.0

jang-tools

$ pip install jang-tools

$ jang convert --model Qwen/Qwen2.5-7B \

--profile JANG_4L \

--method gptq \

--output ./Qwen2.5-7B-JANG_4L/

# Cuantizando con GPTQ (guiado por Hessian)...

# Capas Attention: 8-bit | MLP: 4-bit

# Bits promedio: 4.5 | Tamaño: 4.1 GB

# Listo ✔

MLX Studio — JANG Converter

JANG Model Converter showing all quantization profiles

Memoria

Ejecuta modelos más grandes con menos RAM

JANG_3M ahorra un 25% frente a 4-bit con calidad comparable en modelos 7B+. Ejecute modelos en unified memory que antes no cabían.

~4.1 GB

7B en JANG_4S (vs 4.5 GB 4-bit)

~8.2 GB

14B en JANG_4S (vs 9 GB 4-bit)

~41 GB

70B en JANG_4S (vs 45 GB 4-bit)

25%

Ahorro en JANG_3M vs 4-bit

Modelos

Modelos precuantizados en HuggingFace

Listos para descargar. Compatible con vMLX Engine / MLX Studio a través del cargador JANG.

Qwen3.5-397B-A17B-JANG_1L

112 GB · 86.5% MMLU · 36 tok/s · Fits 128 GB Mac

Qwen3.5-397B-A17B-JANG_2L

187 GB · 92% MMLU · 36 tok/s · M4 Ultra 256 GB

Nemotron-3-Super-120B-JANG_4M

63 GB · 93% MMLU · 55 tok/s

Nemotron-3-Super-120B-JANG_2L

43 GB · 86% MMLU · 52 tok/s · Fits 64 GB Mac

Qwen3.5-122B-A10B-JANG_4K

3.99 bits · 71 GB · 86% MMLU (200q) · ~40 tok/s

Qwen3.5-122B-A10B-JANG_2S

2.11 bits · 44 GB · 79% MMLU (200q) · ~45 tok/s

Qwen3.5-35B-A3B-JANG_4K

3.99 bits · 20.1 GB · 77.5% MMLU (200q) · ~100 tok/s

Qwen3.5-35B-A3B-JANG_2S

2.17 bits · 12.8 GB · 65.5% MMLU (200q) · Fits 16 GB RAM

Todos los modelos en HuggingFace

Integración nativa

Ejecute modelos JANG en MLX Studio

MLX Studio cuenta con soporte nativo de JANG con API compatible con OpenAI, prefix caching, paged KV cache, KV quantization (q4/q8), continuous batching y más de 20 herramientas de codificación agénticas. Cargue cualquier modelo .jang y sírvalo localmente — funciona con Cursor, Continue, Aider y cualquier cliente de API OpenAI. Impulsado por vMLX Engine, ahora código abierto — pip install vmlx.

MLX Studio vMLX Engine