
The blazing-fast, pure-Swift LLM/VLM server for Apple Silicon.
No Python. No cloud. No limits.
Native macOS menu bar app. Full-featured chat UI. Admin dashboard. Everything you need, nothing you don't.




Every layer of the stack is optimized for Apple Silicon — from GPU kernels to memory management.
No Python runtime, no GIL, no FFI bridge. Direct Metal GPU access via MLX — compiled to native code, zero overhead.
Multiple sequences share a single GPU forward pass per decode step. Up to 8 concurrent requests with priority-aware scheduling.
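Conceptually, continuous batching works like the sketch below: each decode step runs one forward pass over the newest token of every scheduled sequence. The types, the scheduler, and the cap of 8 mirror the description above but are illustrative, not NovaMLX internals.

```swift
// Conceptual sketch of continuous batching; not NovaMLX's actual code.
struct ActiveRequest { var tokens: [Int32]; var priority: Int }

/// One decode step: a single batched forward pass advances every
/// scheduled sequence by one token.
func decodeStep(_ active: inout [ActiveRequest],
                forward: ([Int32]) -> [Int32]) {
    // Priority-aware scheduling, capped at 8 sequences per batch.
    let batch = active.indices
        .sorted { active[$0].priority > active[$1].priority }
        .prefix(8)
    let lastTokens = batch.map { active[$0].tokens.last! }
    let nextTokens = forward(lastTokens)  // one GPU pass for the whole batch
    for (i, tok) in zip(batch, nextTokens) {
        active[i].tokens.append(tok)
    }
}
```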
Block-level paged KV cache with SHA-1 chain hashing. Cross-session reuse cuts prefill by ~90% on repeated prompts.
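A minimal sketch of the chain-hashing idea, with hypothetical names and block size: each fixed-size block's key hashes its tokens together with the parent block's key, so equal keys imply an identical full prefix and the cached blocks can be safely reused.

```swift
import CryptoKit
import Foundation

// Illustrative sketch: chained SHA-1 keys for fixed-size KV-cache blocks.
// The block size and names are hypothetical, not NovaMLX's real API.
let blockSize = 64

/// Key for one block = SHA-1(parent key || token IDs in this block).
/// Chaining makes each key depend on the entire prefix before it.
func blockKeys(for tokens: [Int32]) -> [Insecure.SHA1Digest] {
    var keys: [Insecure.SHA1Digest] = []
    var parent = Data()
    for start in stride(from: 0, to: tokens.count, by: blockSize) {
        let block = tokens[start..<min(start + blockSize, tokens.count)]
        var hasher = Insecure.SHA1()
        hasher.update(data: parent)
        block.withUnsafeBufferPointer {
            hasher.update(bufferPointer: UnsafeRawBufferPointer($0))
        }
        let digest = hasher.finalize()
        keys.append(digest)
        parent = Data(digest)
    }
    return keys
}

// A new prompt reuses every leading block whose key is already cached,
// so only the unmatched tail needs prefill.
```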
Per-model KV cache quantization: 2/3/4/6/8-bit with auto-recommendation. 4-bit gives 4× memory savings with minimal quality loss.
N-gram pattern matching drafts up to 5 tokens ahead, with no secondary draft model. Drafts are verified in a single forward pass.
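This is the prompt-lookup flavor of speculative decoding. A hedged sketch of the drafting step, with illustrative sizes:

```swift
// Illustrative sketch of n-gram drafting (prompt-lookup style).
// Parameters are hypothetical, not NovaMLX's actual values.
let ngram = 3        // match the last 3 tokens of context
let maxDraft = 5     // propose up to 5 tokens ahead

/// Find an earlier occurrence of the context's trailing n-gram and
/// return the tokens that followed it as a zero-cost draft.
func draftTokens(context: [Int32]) -> [Int32] {
    guard context.count > ngram else { return [] }
    let suffix = Array(context.suffix(ngram))
    // Scan from the end so the most recent repetition wins.
    var start = context.count - ngram - 1
    while start >= 0 {
        if Array(context[start..<start + ngram]) == suffix {
            let from = start + ngram
            let to = min(from + maxDraft, context.count)
            return Array(context[from..<to])
        }
        start -= 1
    }
    return []
}

// The drafted tokens are then checked against the model's own
// predictions in one batched forward pass; mismatches are discarded.
```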
Temperature, top-p, top-k, and min-p samplers are JIT-compiled via MLX `compile()`: not interpreted, not implemented in Python.
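For reference, here is the top-p (nucleus) filtering math those compiled kernels express, written as plain Swift for readability rather than as the actual MLX graph:

```swift
// Plain-Swift sketch of top-p (nucleus) filtering, for clarity only;
// in NovaMLX this math runs as a compiled MLX graph on the GPU.
func topPFilter(probs: [Float], p: Float) -> [Float] {
    // Sort token indices by descending probability.
    let order = probs.indices.sorted { probs[$0] > probs[$1] }
    var kept = [Float](repeating: 0, count: probs.count)
    var cumulative: Float = 0
    for i in order {
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p { break }  // smallest set whose mass reaches p
    }
    // Renormalize the surviving mass before sampling.
    let total = kept.reduce(0, +)
    return kept.map { $0 / total }
}
```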
Configurable per-model KV cache quantization — auto-recommended based on model size and available memory.
| Quantization | Compression (vs. 16-bit) | Use Case |
|---|---|---|
| 2-bit | 8.0× | Extreme memory pressure |
| 3-bit | 5.33× | High memory pressure |
| 4-bit | 4.0× | Balanced (recommended) |
| 6-bit | 2.67× | Quality-sensitive |
| 8-bit | 2.0× | Minimal quality loss |
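The compression column is simply 16 bits divided by the quantized width (ignoring small per-group scale overhead). A back-of-envelope sizing sketch, where every shape is a hypothetical example rather than a real model config:

```swift
// Back-of-envelope KV cache sizing. Every shape here is a hypothetical
// example, not a specific NovaMLX model configuration.
let layers = 32, kvHeads = 8, headDim = 128, seqLen = 8192

/// Total bytes for keys + values across all layers at a given bit width.
func kvBytes(bits: Double) -> Double {
    2 * Double(layers * kvHeads * headDim * seqLen) * (bits / 8)
}

print(kvBytes(bits: 16) / 1e9)  // ≈ 1.07 GB at 16-bit
print(kvBytes(bits: 4) / 1e9)   // ≈ 0.27 GB at 4-bit: the 4.0× row
```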
Run 50+ model families — Llama, Qwen, Gemma, DeepSeek, Mistral — natively on your Mac.
Pure Swift on Apple Silicon. No Python overhead. Native Metal GPU acceleration.
Llama 3, Qwen 2/2.5/3, Gemma 2/3, Phi 4, Mistral, DeepSeek, and more.
Drop-in OpenAI-compatible API. Point any tool at localhost and it just works.
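For example, a raw chat request from Swift. The `/v1/chat/completions` route is the OpenAI convention drop-in tools expect; the port and model name below are placeholders, not NovaMLX defaults:

```swift
import Foundation

// Hypothetical endpoint: OpenAI-style route on a placeholder port.
// Substitute the host/port and model name your NovaMLX instance uses.
func chat() async throws -> String {
    let url = URL(string: "http://localhost:8080/v1/chat/completions")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = """
    {
      "model": "llama-3",
      "messages": [{"role": "user", "content": "Hello from Swift!"}]
    }
    """.data(using: .utf8)
    let (data, _) = try await URLSession.shared.data(for: request)
    return String(decoding: data, as: UTF8.self)
}
```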
Send images with messages. Supports Qwen2-VL, Gemma3, LLaVA, and others.
Force JSON, regex, or GBNF grammar. Schema validation built in.
Automatic tool-call detection across 7 format families. No fine-tuning needed.
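Structured output and tool calls both ride the regular request body on OpenAI-compatible servers. A hedged sketch, assuming NovaMLX accepts the conventional `response_format` and `tools` fields (the exact supported field names are an assumption here):

```swift
// Request-body sketch. The response_format / tools field names follow
// the OpenAI convention; whether NovaMLX expects these exact shapes is
// an assumption, so verify against the dashboard or docs.
let body = """
{
  "model": "qwen-2.5",
  "messages": [{"role": "user", "content": "Weather in Tokyo, as JSON."}],
  "response_format": {"type": "json_object"},
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
"""
```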
Download. Install. Pick a model. Start chatting.
1. Grab the latest `.dmg` from GitHub Releases.
2. Open the `.dmg` → drag NovaMLX to Applications.
3. Launch NovaMLX → browse models → pick one and download.
4. Select your model → start chatting with local inference.
Once your model is running, connect it to your favorite tools — OpenClaw, Cherry Studio, Claude Code, Open Code, and more.
NovaMLX isn't just open source; it's developed on demand. Need a new model architecture, a custom quantization strategy, or an integration with your workflow? Open an issue, and we ship it.
1. Describe the feature or integration you need in a GitHub issue.
2. We evaluate, prioritize, and implement, often within days.
3. New release, new capability. Update NovaMLX and it's there.