Production-grade, pure-Swift LLM/VLM inference server for Apple Silicon. OpenAI- and Anthropic-compatible APIs. Native macOS menu bar app. Built on MLX.
NovaMLX runs LLM and VLM inference directly on the Apple Silicon GPU via MLX, with no dependency on Python or remote services.
| Feature | Details |
|---|---|
| Backends | MLX (Apple Silicon GPU), lazy evaluation, unified memory |
| Model formats | SafeTensors (4-bit, 8-bit, FP16 quantization) |
| 50+ architectures | Llama 3/3.1, Mistral/Mixtral, Qwen 2/2.5/3, Gemma 2/3, Phi 3.5/4, StarCoder2, and more |
| Sampling | Temperature, Top-P, Top-K, Min-P, Frequency/Presence/Repetition penalty, Seed |
| Streaming | SSE token-by-token streaming for all generation endpoints |
| Max tokens | Configurable per request or per model |
| Fused batch decode | Concurrent multi-sequence decode on shared KV cache |
| Speculative decoding | N-gram pattern-based and draft-model token prediction for faster generation |
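For example, the sampling parameters above ride along on a standard chat completion request (model name and key are placeholders):

```bash
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "seed": 42,
    "max_tokens": 128,
    "stream": true
  }'
```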
The inference worker runs as an independent subprocess, communicating with the main API server via JSON messages over stdin/stdout.
| Feature | Details |
|---|---|
| Crash isolation | Worker crash doesn't bring down the API server — auto-restart by supervisor |
| Memory isolation | Worker gets its own process memory budget, independent of main app |
| Communication | Bidirectional JSON message protocol over stdin/stdout |
| Health reporting | Periodic memory stats pushed to parent every 5 seconds |
| Code signing | Ad-hoc signed for macOS process integrity compliance |
| Supervisor | Monitors worker health, restarts on unexpected exit, handles backpressure |
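Purely as an illustration of the IPC idea, a request/response exchange over stdin/stdout might look like the newline-delimited JSON below; the field names here are invented for illustration and are not the real protocol:

```json
{"type": "generate", "id": "req-42", "prompt": "Hello", "maxTokens": 128}
{"type": "token", "id": "req-42", "text": "Hi"}
{"type": "health", "rssBytes": 8123456789, "gpuBytes": 6442450944}
```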
Full vision-language model (VLM) pipeline for image understanding.
| Feature | Details |
|---|---|
| Image inputs | Base64 data URIs, HTTP URLs, local file paths |
| VLM architectures | Qwen2-VL, Qwen2.5-VL, Qwen3-VL, LLaVA, Gemma3, Phi-3-Vision, Pixtral, Molmo, Idefics3, InternVL, PaliGemma, DeepSeek-VL2, and more |
| Vision feature cache | In-memory LRU (20 entries) + optional SSD persistence, SHA-256 hashing, per-model isolation |
| API | Standard OpenAI image_url content parts in chat messages |
```bash
# Describe an image via base64 data URI
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-4-e4b-it-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAA...=="}}
      ]
    }],
    "max_tokens": 100
  }'
```

```bash
# Describe an image via HTTP URL
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-4-e4b-it-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'
```
The model names used above are examples. Browse and download more models through the app's HuggingFace browser or the admin API at `/admin/models/download`.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/v1/models` | List available models |
| POST | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| POST | `/v1/completions` | Legacy text completions (streaming + non-streaming) |
| POST | `/v1/embeddings` | Text embeddings |
| POST | `/v1/rerank` | Document reranking |
| POST | `/v1/responses` | OpenAI Responses API |
| GET | `/v1/responses/{id}` | Retrieve stored response |
| DELETE | `/v1/responses/{id}` | Delete stored response |
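For example, an embeddings request in the OpenAI shape (model name is a placeholder):

```bash
curl http://localhost:6590/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": ["The quick brown fox", "jumps over the lazy dog"]
  }'
```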
| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/messages` | Anthropic Messages API (streaming + non-streaming) |
| POST | `/v1/messages/count_tokens` | Token counting |
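A minimal `/v1/messages` call in the Anthropic Messages format; the auth header shown is the server's documented Bearer scheme (whether the Anthropic-style `x-api-key` header is also accepted is not stated here):

```bash
curl http://localhost:6590/v1/messages \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```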
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check (status, GPU memory, loaded models, MCP) |
| GET | `/v1/stats` | Session and all-time inference metrics |
| POST | `/v1/mcp/execute` | Execute MCP tool call |
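Quick checks (assuming, as is typical, that `/health` is reachable without a token; `/v1/stats` is shown with the API key):

```bash
curl http://localhost:6590/health
curl -H "Authorization: Bearer $API_KEY" http://localhost:6590/v1/stats
```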
Character-level finite-state machine constraining output to valid JSON. Tracks 12 internal states for objects, arrays, strings, numbers, and literals.
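If the client requests JSON output using the OpenAI `response_format` convention (an assumption here; this section does not spell out the request shape), the FSM constrains decoding to valid JSON:

```bash
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "List three colors as JSON."}],
    "response_format": {"type": "json_object"}
  }'
```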
Full JSON Schema parsing into a type tree supporting:
- `object` (with required field tracking)
- `array`, `string`, `integer`, `number`, `boolean`, `null`
- `enum` (string value restriction)
- `anyOf`, `oneOf`, `allOf` composition

Automatic detection and parsing of tool call output across 7 format families:
| Format | Pattern |
|---|---|
| XML | `<tool>{"name": "...", "arguments": {...}}</tool>` or `<function=name>...` |
| Bracket | `[TOOL_CALLS] [{"name": ...}]` |
| Marker | `<\|tool_call\|>...<\|/tool_call\|>` |
| Namespaced XML | `<ns:tool_call><invoke name="...">...` |
| GLM | `☐function_name☐key■value■` delimiter pairs |
| Gemma | `call:functionName {key: value}` |
| Thinking fallback | Extracts tool calls from within `<think...>` blocks |
A stream filter suppresses tool markup during streaming so end users see clean content.
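On the request side, tools are declared with the standard OpenAI `tools` parameter (an assumption consistent with the OpenAI-compatible API; this section only documents output parsing), and the parser above normalizes whatever markup the model emits back into structured tool calls:

```bash
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```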
Real-time control token detection and filtering to keep generation output clean and prevent runaway loops.
Detects turn separator patterns in generated text and forces EOS token emission. Critical for models (e.g., Qwen3.6) that use semantic turn markers but rarely emit configured EOS tokens.
| Feature | Details |
|---|---|
| Pattern matching | Configurable stop patterns (e.g., turn separators like `<\|im_end\|>`) |
| State machine | Active → StopDetected → Done lifecycle |
| EOS forcing | Masks all non-EOS logits to -inf on pattern detection |
| Accumulation | Tracks full decoded sequence, not just individual tokens |
Semantic thinking tags pass through to end users while protocol-level control tokens are stripped:
| Behavior | Example |
|---|---|
| Pass through | `<think>Reasoning here</think>` — user sees thinking content |
| Filtered | Pipe-delimited control tokens (e.g., `<\|im_end\|>`) never reach the user |
| Implicit open | ThinkingParser handles responses that close `</think>` without an opening tag |
| Partial buffering | Buffers incomplete control tokens in SSE streams to prevent leaked fragments |
Block-level paged KV cache inspired by vLLM, enabling cross-session prefix reuse.
| Feature | Details |
|---|---|
| Block size | 64 tokens (configurable) |
| Hash algorithm | SHA-1 chain hashing (parent hash + tokens + model name) |
| Block pool | Doubly-linked free list, reference counting, copy-on-write forking |
| SSD persistence | SafeTensors format, 16-bucket sharded directories, async GCD write queue |
| SSD capacity | 100 GB default with LRU eviction |
| Cache types | KVCacheSimple, RotatingKVCache, QuantizedKVCache, ChunkedKVCache, MambaCache, ArraysCache, CacheList |
| Stats | Hits, misses, tokens saved, evictions, shared blocks, SSD block count/size |
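Conceptually, each block's identity hashes its parent's hash together with its own tokens and the model name, so two sessions sharing a prefix resolve to the same block chain. A rough shell sketch of the idea (the real byte layout and separators are internal to NovaMLX):

```bash
MODEL="your-model"
PARENT=""                      # empty for the first block in the chain
TOKENS="1 15043 29871 ..."     # the 64 token IDs in this block
# SHA-1 over (parent hash + tokens + model name), as the table describes
HASH=$(printf '%s|%s|%s' "$PARENT" "$TOKENS" "$MODEL" | shasum | cut -d' ' -f1)
echo "$HASH"                   # becomes PARENT for the next block
```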
Configurable per-model KV cache quantization for memory-constrained scenarios.
| Bits | Compression Ratio | Use Case |
|---|---|---|
| 2-bit | 8.0× | Extreme memory pressure |
| 3-bit | 5.33× | High memory pressure |
| 4-bit | 4.0× | Balanced (recommended default) |
| 6-bit | 2.67× | Quality-sensitive |
| 8-bit | 2.0× | Minimal quality loss |
- `GET /admin/api/turboquant` — view active configurations
- Uses QuantizedKVCache with optimized Metal kernels

Priority-aware request batching for high-throughput concurrent inference.
| Feature | Details |
|---|---|
| Max batch size | 8 (configurable) |
| Priority levels | Low, Normal, High |
| Preemption | Lower-priority requests preempted for higher-priority |
| Metrics | Active requests, queue depth, total queued/completed/preempted, peak active, average wait time |
| Abortion | Per-request cancellation support |
Two complementary approaches accelerate generation by predicting and verifying multiple tokens in parallel.
Pattern-based draft token generation built into FusedBatchScheduler:
| Feature | Details |
|---|---|
| Method | N-gram pattern matching from recent context window |
| Draft length | Configurable number of candidate tokens per step |
| Verification | Target model verifies all draft tokens in parallel |
| Overhead | Zero secondary model — purely algorithmic |
Dedicated smaller draft model via SpeculativeTokenIterator:
| Feature | Details |
|---|---|
| Draft model | Smaller, faster model generates candidate tokens |
| Target model | Full-size model verifies candidates in a single forward pass |
| Throughput gain | Significant speedup for memory-bandwidth-bound generation |
| Use case | Large models where each token generation is memory-limited |
Persistent conversational sessions with KV cache reuse across turns.
| Feature | Details |
|---|---|
| Max sessions | 64 concurrent |
| Session TTL | 1800 seconds (auto-eviction) |
| Persistence | Save/restore KV cache to SafeTensors files |
| Forking | Deep-copy a session's KV cache into a new independent session |
| Admin API | List, delete, save, fork sessions |
- `POST /v1/embeddings` (OpenAI-compatible)
- `POST /v1/rerank` (Cohere/Jina-compatible)

Multi-server MCP client with tool discovery and execution.
| Feature | Details |
|---|---|
| Transports | stdio (subprocess), sse (HTTP SSE), streamable-http (HTTP POST) |
| Protocol version | 2024-11-05 |
| Tool discovery | Automatic tools/list on connection |
| Namespacing | Tools namespaced as {server}__{tool} to avoid collisions |
| Timeouts | Per-server configurable |
| Admin API | Configure, enable/disable, list tools and server status |
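A call to the documented `/v1/mcp/execute` endpoint might look like this; the payload field names (`name`, `arguments`) are assumptions, but the `{server}__{tool}` namespacing is as described above:

```bash
curl http://localhost:6590/v1/mcp/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "filesystem__read_file",
    "arguments": {"path": "/tmp/notes.txt"}
  }'
```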
Built-in support for external AI coding agents via auto-generated configuration. Agents connect through NovaMLX's OpenAI-compatible API as their LLM backend.
| Agent | Description | Config Format |
|---|---|---|
| OpenClaw | Open-source AI agent framework with plugin system and tool use | JSON (~/.openclaw/openclaw.json) |
| Hermes Agent | Autonomous AI agent with multi-step reasoning and tool execution | YAML (~/.hermes/config.yaml) |
| OpenCode | Terminal-based AI programming assistant with code editing | Environment variable |
| Feature | Details |
|---|---|
| Auto-detection | Scans $PATH and common install locations for agent binaries |
| Config generation | One-click config with correct port, API key, and model settings |
| Copy to clipboard | Quick copy for pasting into agent config files |
| Install link | Direct link to each agent's GitHub repository |
| GUI | Dedicated Agents tab in the menu bar app |
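The app generates these configs for you; purely as an illustration of what pointing an agent at NovaMLX involves (the keys below are hypothetical, not the real generated schema), a JSON config would carry the local endpoint, key, and model:

```json
{
  "baseUrl": "http://127.0.0.1:6590/v1",
  "apiKey": "sk-your-key",
  "model": "your-model"
}
```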
| Method | Endpoint | Description |
|---|---|---|
| GET | `/admin/models` | List all models (downloaded, loaded status) |
| POST | `/admin/models/download` | Download from HuggingFace |
| POST | `/admin/models/load` | Load model into GPU memory |
| POST | `/admin/models/unload` | Evict from GPU memory |
| DELETE | `/admin/models/{id}` | Delete model files |
| POST | `/admin/models/discover` | Scan filesystem for new models |
| GET | `/admin/models/{id}/settings` | Get per-model settings |
| PUT | `/admin/models/{id}/settings` | Update per-model settings |
| GET | `/admin/api/memory` | Memory state and pressure monitoring |
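A typical download-then-load sequence against the admin port; the endpoints and `X-Admin-Key` header are documented, while the exact body field is an assumption:

```bash
# Download from HuggingFace, then load into GPU memory
curl -X POST http://localhost:6591/admin/models/download \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/your-model-4bit"}'

curl -X POST http://localhost:6591/admin/models/load \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/your-model-4bit"}'
```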
System-level memory pressure monitoring with automatic model eviction via ProcessMemoryEnforcer.
Actor-based memory pressure manager with configurable policies:
| Feature | Details |
|---|---|
| Polling | 1-second memory pressure monitoring loop |
| Soft limit | Triggers warning log when exceeded |
| Hard limit | Forces immediate eviction of unpinned models |
| Auto mode | Reserves max(4GB, min(8GB, physMem/5)) for system, rest available to NovaMLX |
| Percent mode | User-defined percentage of physical RAM (e.g., 80%) |
| Fixed mode | Absolute byte limit (e.g., 24GB, 4096MB) |
| OS integration | Responds to macOS memory pressure notifications |
| GUI config | Mode picker in Settings with conditional inputs for percent/fixed values |
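For example, on a 32 GB machine, auto mode reserves max(4 GB, min(8 GB, 32/5 = 6.4 GB)) = 6.4 GB for the system, leaving roughly 25.6 GB for NovaMLX; on a 128 GB machine the min() clamp caps the reservation at 8 GB.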
Per-sequence KV cache budget with admission control:
| Feature | Details |
|---|---|
| Reservation | Each sequence reserves KV cache budget before admission |
| Release | Budget freed on sequence completion |
| Admission control | Rejects requests when available budget insufficient |
| TurboQuant boost | KV quantization effectively doubles admission capacity |
Every inference parameter can be overridden per model, persisted to `model_settings.json`:

| Setting | Description |
|---|---|
| `max_context_window` | Override context length |
| `max_tokens` | Max generation tokens |
| `temperature` | Sampling temperature |
| `top_p` / `top_k` / `min_p` | Sampling strategy |
| `frequency_penalty` / `presence_penalty` / `repetition_penalty` | Repetition control |
| `seed` | Deterministic generation |
| `ttl_seconds` | Auto-unload after idle |
| `model_alias` | Short alias for model ID |
| `is_pinned` | Exempt from eviction |
| `is_default` | Default model for API requests |
| `kv_bits` / `kv_group_size` | TurboQuant configuration |
| `thinking_budget` | Max thinking/reasoning tokens |
| `display_name` / `description` | Human-readable metadata |
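Updating a model's settings through the documented admin endpoint; the body uses the setting names above, though a flat JSON object is an assumption about the request shape:

```bash
curl -X PUT http://localhost:6591/admin/models/your-model/settings \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0.6,
    "max_tokens": 1024,
    "ttl_seconds": 600,
    "is_pinned": true,
    "kv_bits": 4
  }'
```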
Browse, search, and download models directly from HuggingFace.
| Feature | Details |
|---|---|
| Search | Query HF Hub with optional MLX-only filter, sortable by trending/downloads |
| Model info | Architecture, tags, file listing, license |
| Downloads | Async file-by-file with progress tracking and cancellation |
| Dashboard UI | Search box, results table, download progress with cancel buttons |
| Auto-register | Downloaded models automatically discovered and registered |
Remote inference proxy that makes cloud-hosted models appear alongside local models.
| Feature | Details |
|---|---|
| Protocol | OpenAI-compatible API passthrough |
| Streaming | Full SSE streaming passthrough with token-by-token relay |
| Model discovery | Cloud models listed in /v1/models alongside local models |
| Configuration | Remote endpoint URL and API key per cloud model |
| Use case | Access models too large for local GPU, or add remote fallback capacity |
- Benchmarking: `POST /admin/api/bench/start`, `GET /admin/api/bench/status`, `POST /admin/api/bench/cancel`
- Perplexity: `POST /admin/api/ppl/start`, `GET /admin/api/ppl/status`, `POST /admin/api/ppl/cancel`

Full-featured web chat application — no admin auth required.
Operational dashboard — requires admin auth.
Native SwiftUI menu bar extra with a brain icon.
| Feature | Details |
|---|---|
| Status tab | Running indicator, server address, loaded models, GPU memory, active requests, uptime, tok/s |
| Models tab | Loaded model list with indicators, downloaded count, disk usage |
| Agents tab | External agent detection (OpenClaw, Hermes, OpenCode), config generation, install links |
| Settings tab | View/Edit config toggle, memory limit configuration, model settings, server management |
| Polling | 2-second stats refresh |
| Window | 280×200pt floating panel |
Full i18n system supporting 9 languages across all GUI views and web interfaces.
| Language | Code |
|---|---|
| English | en |
| 简体中文 | zh-Hans |
| 繁體中文 (香港) | zh-Hant-HK |
| 繁體中文 (台灣) | zh-Hant-TW |
| 日本語 | ja |
| 한국어 | ko |
| Français | fr |
| Deutsch | de |
| Русский | ru |
| Feature | Details |
|---|---|
| Pattern | L10n.tr("key.path") with compile-time key validation |
| Coverage | Settings, Agents, Chat, Menu Bar, Dashboard — all user-facing strings |
| Fallback | English fallback for missing translations |
| Web UI | Chat and Dashboard also fully internationalized |
| Layer | Implementation |
|---|---|
| API auth | Bearer token on port 6590 (APIKeyAuthMiddleware) |
| Admin auth | Bearer token or X-Admin-Key header on port 6591 (AdminAuthMiddleware) |
| CORS | Access-Control-Allow-Origin: * with full method/header support |
| Request ID | x-request-id header pass-through or auto-generation |
| Error handling | OpenAI-compatible JSON error bodies with proper HTTP status codes |
| Admin isolation | Admin API disabled entirely when no API keys configured |
| Worker signing | Ad-hoc code signing for worker subprocess to satisfy macOS process integrity |
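In practice the two ports use different credentials:

```bash
# Inference API (port 6590): Bearer token
curl -H "Authorization: Bearer $API_KEY" http://localhost:6590/v1/models

# Admin API (port 6591): Bearer token or X-Admin-Key
curl -H "X-Admin-Key: $ADMIN_KEY" http://localhost:6591/admin/models
```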
Install NovaMLX via Homebrew with automatic updates.
```bash
brew tap cnshsliu/nova
brew install --cask novamlx
```
| Feature | Details |
|---|---|
| Tap | cnshsliu/nova Homebrew tap |
| Cask | novamlx — installs .app bundle to /Applications |
| Updates | brew upgrade --cask novamlx fetches latest GitHub Release |
| Version check | Built-in update checker queries GitHub Releases API |
| Parameter | Default | Description |
|---|---|---|
| `host` | `127.0.0.1` | Bind address |
| `port` | `6590` | Inference API port |
| `adminPort` | `6591` | Admin API port |
| `apiKeys` | `[]` | API keys (empty = auth disabled) |
| `maxConcurrentRequests` | `16` | Max concurrent requests |
| `requestTimeout` | `300s` | Request timeout |
| `contextScalingTarget` | `nil` | Scale reported token counts for billing compatibility |
| `maxProcessMemory` | `auto` | Process memory limit: `auto`, percent (`80%`), or absolute (`24GB`, `4096MB`) |
| `maxRequestSizeMB` | `100` | Maximum request body size in megabytes |
| Parameter | Default | Description |
|---|---|---|
| `kvCacheCapacity` | `1024` | KV cache block capacity |
| `maxMemoryMB` | `2048` | Max GPU memory for models |
| `maxConcurrent` | `8` | Max concurrent inference tasks |
| Parameter | Default | Description |
|---|---|---|
| `blockSize` | `64` | Tokens per cache block |
| `maxBlocks` | `4096` | Maximum cache blocks |
| `ssdMaxSizeBytes` | `100 GB` | Max SSD cache size |
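Assembled into one file, a configuration using these parameters might look like the sketch below; the file name, location, and nesting are assumptions, while the parameter names and defaults are as documented above:

```json
{
  "host": "127.0.0.1",
  "port": 6590,
  "adminPort": 6591,
  "apiKeys": ["sk-your-key"],
  "maxConcurrentRequests": 16,
  "requestTimeout": 300,
  "maxProcessMemory": "80%",
  "engine": {
    "kvCacheCapacity": 1024,
    "maxMemoryMB": 2048,
    "maxConcurrent": 8
  },
  "prefixCache": {
    "blockSize": 64,
    "maxBlocks": 4096,
    "ssdMaxSizeBytes": 107374182400
  }
}
```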
```
┌──────────────────────────────────────────────────────────┐
│ macOS Menu Bar App │
│ Status • Models • Agents • Settings • Dashboard │
├──────────────────────────────────────────────────────────┤
│ NovaMLX API Server │
│ ┌─────────────────┐ ┌──────────────────────────────┐ │
│ │ Inference API │ │ Admin API │ │
│ │ (port 6590) │ │ (port 6591) │ │
│ │ • OpenAI │ │ • Model Management │ │
│ │ • Anthropic │ │ • Per-Model Settings │ │
│ │ • Embeddings │ │ • Benchmarking │ │
│ │ • Reranking │ │ • HuggingFace Browser │ │
│ │ • MCP Tools │ │ • Memory Monitoring │ │
│ │ • Cloud Proxy │ │ • Dashboard UI │ │
│ └────────┬─────────┘ └──────────────┬───────────────┘ │
│ └──────────┬────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Inference Service │ │
│ │ • Continuous Batcher • Session Manager │ │
│ │ • Speculative Decode • Engine Pool │ │
│ │ • Model Settings • MCP Client Manager │ │
│ └────────────────────────┬─────────────────────────┘ │
│ ▼ │
│ ┌──────────────────────┬──────────────────────────┐ │
│ │ MLX Engine Process │ Worker Subprocess │ │
│ │ • LLM/VLM Generation│ • stdin/stdout JSON IPC │ │
│ │ • Structured Output │ • ProcessMemoryEnforcer │ │
│ │ • TurboQuant │ • WorkerSupervisor │ │
│ │ • Prefix/SSD Cache │ • GPU-Isolated Memory │ │
│ │ • Vision Feature │ │ │
│ │ • TurnStopProcessor │ │ │
│ │ • ControlTokenFilter │ │ │
│ └────────────────────┬─┴──────────────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ MLX / Apple Silicon GPU │ │
│ │ • Lazy Evaluation • Unified Memory • Metal │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
```bash
# Build
swift build -c release

# Run with API key
./.build/release/NovaMLX --api-key sk-your-key
```

```bash
# Chat completion
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```