NovaMLX — Feature Reference

Production-grade, pure-Swift LLM/VLM inference server for Apple Silicon. OpenAI- and Anthropic-compatible APIs. Native macOS menu bar app. Built on MLX.


Table of Contents

  • Inference Engine
  • Worker Subprocess Isolation
  • Multi-Modal Support
  • API Compatibility
  • Structured Output
  • Tool Calling
  • Control Token & Thinking Filtering
  • KV Cache & Prefix Sharing
  • TurboQuant KV Compression
  • Continuous Batching
  • Speculative Decoding
  • Session Management
  • Embeddings & Reranking
  • MCP — Model Context Protocol
  • Agent Integration
  • Model Management
  • Memory Management
  • Per-Model Settings
  • HuggingFace Integration
  • Cloud Model Proxy
  • Observability & Benchmarking
  • Web Interfaces
  • macOS Menu Bar App
  • Internationalization
  • Security & Middleware
  • Homebrew Distribution
  • Configuration
  • Architecture Overview
  • Quick Start

Inference Engine

NovaMLX runs LLM and VLM inference directly on Apple Silicon GPU via MLX, with zero external dependencies on Python or remote services.

Feature Details
Backends MLX (Apple Silicon GPU), lazy evaluation, unified memory
Model formats SafeTensors (4-bit and 8-bit quantized, or unquantized FP16)
50+ architectures Llama 3/3.1, Mistral/Mixtral, Qwen 2/2.5/3, Gemma 2/3, Phi 3.5/4, StarCoder2, and more
Sampling Temperature, Top-P, Top-K, Min-P, Frequency/Presence/Repetition penalty, Seed
Streaming SSE token-by-token streaming for all generation endpoints
Max tokens Configurable per request or per model
Fused batch decode Concurrent multi-sequence decode on shared KV cache
Speculative decoding N-gram pattern-based and draft-model token prediction for faster generation

Worker Subprocess Isolation

The inference worker runs as an independent subprocess, communicating with the main API server via JSON messages over stdin/stdout.

Feature Details
Crash isolation Worker crash doesn't bring down the API server — auto-restart by supervisor
Memory isolation Worker gets its own process memory budget, independent of main app
Communication Bidirectional JSON message protocol over stdin/stdout
Health reporting Periodic memory stats pushed to parent every 5 seconds
Code signing Ad-hoc signed for macOS process integrity compliance
Supervisor Monitors worker health, restarts on unexpected exit, handles backpressure

Multi-Modal Support

Full vision-language model (VLM) pipeline for image understanding.

Feature Details
Image inputs Base64 data URIs, HTTP URLs, local file paths
VLM architectures Qwen2-VL, Qwen2.5-VL, Qwen3-VL, LLaVA, Gemma3, Phi-3-Vision, Pixtral, Molmo, Idefics3, InternVL, PaliGemma, DeepSeek-VL2, and more
Vision feature cache In-memory LRU (20 entries) + optional SSD persistence, SHA-256 hashing, per-model isolation
API Standard OpenAI image_url content parts in chat messages

Vision Usage Example

# Describe an image via base64 data URI
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-4-e4b-it-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAA...=="}}
      ]
    }],
    "max_tokens": 100
  }'

# Describe an image via HTTP URL
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-4-e4b-it-4bit",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'

The model names used above are examples. Browse and download more models through the app's HuggingFace browser or the admin API at /admin/models/download.


API Compatibility

OpenAI-Compatible Endpoints (Port 6590)

Method Endpoint Description
GET /v1/models List available models
POST /v1/chat/completions Chat completions (streaming + non-streaming)
POST /v1/completions Legacy text completions (streaming + non-streaming)
POST /v1/embeddings Text embeddings
POST /v1/rerank Document reranking
POST /v1/responses OpenAI Responses API
GET /v1/responses/{id} Retrieve stored response
DELETE /v1/responses/{id} Delete stored response

Anthropic-Compatible Endpoints

Method Endpoint Description
POST /v1/messages Anthropic Messages API (streaming + non-streaming)
POST /v1/messages/count_tokens Token counting
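
Anthropic Messages Example

A minimal Messages request. The body follows the Anthropic Messages schema (which requires max_tokens); authentication uses NovaMLX's Bearer token:

# Anthropic-style chat request
curl http://localhost:6590/v1/messages \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'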

Utility Endpoints

Method Endpoint Description
GET /health Health check (status, GPU memory, loaded models, MCP)
GET /v1/stats Session and all-time inference metrics
POST /v1/mcp/execute Execute MCP tool call
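
Quick checks against the utility endpoints (whether /health enforces the Authorization header may depend on your auth configuration):

# Health: status, GPU memory, loaded models, MCP
curl http://localhost:6590/health

# Session and all-time inference metrics
curl http://localhost:6590/v1/stats \
  -H "Authorization: Bearer $API_KEY"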

Structured Output

JSON Mode (`response_format: json_object`)

Character-level finite-state machine constraining output to valid JSON. Tracks 12 internal states for objects, arrays, strings, numbers, and literals.

Schema-Guided Mode (`response_format: json_schema`)

Full JSON Schema parsing into a type tree supporting:

  • object (with required field tracking)
  • array, string, integer, number, boolean, null
  • enum (string value restriction)
  • anyOf, oneOf, allOf composition
  • Schema state tracked alongside JSON state for field-level enforcement
  • Token-level logit masking based on allowed characters per state
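
Schema-Guided Output Example

A sketch of a schema-constrained request. The response_format wrapper follows the OpenAI json_schema convention; the example schema fields are arbitrary:

# Constrain generation to match a JSON Schema
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Extract city and temperature from: It is 18C in Oslo."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "weather_report",
        "schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "temperature": {"type": "number"}
          },
          "required": ["city", "temperature"]
        }
      }
    }
  }'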

Tool Calling

Automatic detection and parsing of tool call output across 7 format families:

Format Pattern
XML <tool>{"name": "...", "arguments": {...}}</tool> or <function=name>...
Bracket [TOOL_CALLS] [{"name": ...}]
Marker <|tool_call|>...<|/tool_call|>
Namespaced XML <ns:tool_call><invoke name="...">...
GLM ☐function_name☐key■value■ delimiter pairs
Gemma call:functionName {key: value}
Thinking fallback Extracts tool calls from within <think...> blocks

A stream filter suppresses tool markup during streaming so end users see clean content.
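
Tool Calling Example

Tools are declared with the standard OpenAI tools array; whichever of the formats above the model emits, parsed calls are surfaced as OpenAI-style tool_calls in the response:

# Offer the model a function it can call
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'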


Control Token & Thinking Filtering

Real-time control token detection and filtering to keep generation output clean and prevent runaway loops.

TurnStopProcessor

Detects turn separator patterns in generated text and forces EOS token emission. Critical for models (e.g., Qwen3.6) that use semantic turn markers but rarely emit configured EOS tokens.

Feature Details
Pattern matching Configurable stop patterns (e.g., turn separator tokens such as `<|im_end|>`)
State machine Active → StopDetected → Done lifecycle
EOS forcing Masks all non-EOS logits to -inf on pattern detection
Accumulation Tracks full decoded sequence, not just individual tokens

Thinking Tag Architecture

Semantic thinking tags pass through to end users while protocol-level control tokens are stripped:

Behavior Example
Pass through <think>Reasoning here</think> — user sees thinking content
Filtered Protocol control tokens (e.g., `<|im_start|>`) are stripped from user output
Implicit open ThinkingParser handles responses that close </think> without opening tag
Partial buffering Buffers incomplete control tokens in SSE streams to prevent leaked fragments

Control Token Filtering Pipeline

  • Generic streaming filter scrubs control tokens across all agent output formats
  • Multi-step agent conversation protection — no control token leakage between turns
  • Regex-only token matching option for custom token patterns

KV Cache & Prefix Sharing

Block-level paged KV cache inspired by vLLM, enabling cross-session prefix reuse.

Feature Details
Block size 64 tokens (configurable)
Hash algorithm SHA-1 chain hashing (parent hash + tokens + model name)
Block pool Doubly-linked free list, reference counting, copy-on-write forking
SSD persistence SafeTensors format, 16-bucket sharded directories, async GCD write queue
SSD capacity 100 GB default with LRU eviction
Cache types KVCacheSimple, RotatingKVCache, QuantizedKVCache, ChunkedKVCache, MambaCache, ArraysCache, CacheList
Stats Hits, misses, tokens saved, evictions, shared blocks, SSD block count/size

TurboQuant KV Compression

Configurable per-model KV cache quantization for memory-constrained scenarios.

Bits Compression Ratio Use Case
2-bit 8.0× Extreme memory pressure
3-bit 5.33× High memory pressure
4-bit 4.0× Balanced (recommended default)
6-bit 2.67× Quality-sensitive
8-bit 2.0× Minimal quality loss

  • Group size: 32, 64 (default), 128
  • Auto-recommendation: Based on model size / available memory ratio and context length
  • Admin endpoint: GET /admin/api/turboquant — view active configurations
  • Leverages: mlx-swift-lm's QuantizedKVCache with optimized Metal kernels
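
To inspect the active quantization settings (admin auth accepts a Bearer token or the X-Admin-Key header on port 6591):

# View active TurboQuant configurations
curl http://localhost:6591/admin/api/turboquant \
  -H "X-Admin-Key: $ADMIN_KEY"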

Continuous Batching

Priority-aware request batching for high-throughput concurrent inference.

Feature Details
Max batch size 8 (configurable)
Priority levels Low, Normal, High
Preemption Lower-priority requests preempted for higher-priority
Metrics Active requests, queue depth, total queued/completed/preempted, peak active, average wait time
Abortion Per-request cancellation support

Speculative Decoding

Two complementary approaches accelerate generation by predicting and verifying multiple tokens in parallel.

N-gram Speculative Decoding

Pattern-based draft token generation built into FusedBatchScheduler:

Feature Details
Method N-gram pattern matching from recent context window
Draft length Configurable number of candidate tokens per step
Verification Target model verifies all draft tokens in parallel
Overhead Zero secondary model — purely algorithmic

Draft-Model Speculative Decoding

Dedicated smaller draft model via SpeculativeTokenIterator:

Feature Details
Draft model Smaller, faster model generates candidate tokens
Target model Full-size model verifies candidates in a single forward pass
Throughput gain Significant speedup for memory-bandwidth-bound generation
Use case Large models where each token generation is memory-limited

Session Management

Persistent conversational sessions with KV cache reuse across turns.

Feature Details
Max sessions 64 concurrent
Session TTL 1800 seconds (auto-eviction)
Persistence Save/restore KV cache to SafeTensors files
Forking Deep-copy a session's KV cache into a new independent session
Admin API List, delete, save, fork sessions

Embeddings & Reranking

Embedding Service

  • Models: BERT, XLM-RoBERTa, ModernBERT, SigLIP, ColQwen2.5, NomicBERT, Jina v3
  • Batching: Pad-and-stack with attention masking
  • Output: Normalized pooled embeddings
  • Endpoint: POST /v1/embeddings (OpenAI-compatible)
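
A minimal embeddings request (per the OpenAI convention, input accepts a single string or an array of strings):

# Embed two texts in one batched request
curl http://localhost:6590/v1/embeddings \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": ["What is MLX?", "Apple Silicon unified memory"]
  }'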

Reranker Service

  • Method: Cross-encoder pair scoring (query </s></s> document)
  • Normalization: Min-max score normalization across documents
  • Top-N filtering: Return only top N results
  • Endpoint: POST /v1/rerank (Cohere/Jina-compatible)
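
A rerank request using the Cohere/Jina field convention (model, query, documents, top_n):

# Rerank three documents against a query, keep the top 2
curl http://localhost:6590/v1/rerank \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-reranker-model",
    "query": "How does unified memory work?",
    "documents": [
      "Apple Silicon shares memory between CPU and GPU.",
      "Paris is the capital of France.",
      "MLX arrays are lazily evaluated."
    ],
    "top_n": 2
  }'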

MCP — Model Context Protocol

Multi-server MCP client with tool discovery and execution.

Feature Details
Transports stdio (subprocess), sse (HTTP SSE), streamable-http (HTTP POST)
Protocol version 2024-11-05
Tool discovery Automatic tools/list on connection
Namespacing Tools namespaced as {server}__{tool} to avoid collisions
Timeouts Per-server configurable
Admin API Configure, enable/disable, list tools and server status
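
A sketch of a direct tool invocation via the utility endpoint. The name/arguments body shape is an assumption; tool names use the {server}__{tool} namespacing described above:

# Execute an MCP tool directly (body shape is illustrative)
curl http://localhost:6590/v1/mcp/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "filesystem__read_file",
    "arguments": {"path": "/tmp/example.txt"}
  }'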

Agent Integration

Built-in support for external AI coding agents via auto-generated configuration. Agents connect through NovaMLX's OpenAI-compatible API as their LLM backend.

Agent Description Config Format
OpenClaw Open-source AI agent framework with plugin system and tool use JSON (~/.openclaw/openclaw.json)
Hermes Agent Autonomous AI agent with multi-step reasoning and tool execution YAML (~/.hermes/config.yaml)
OpenCode Terminal-based AI programming assistant with code editing Environment variable

Feature Details
Auto-detection Scans $PATH and common install locations for agent binaries
Config generation One-click config with correct port, API key, and model settings
Copy to clipboard Quick copy for pasting into agent config files
Install link Direct link to each agent's GitHub repository
GUI Dedicated Agents tab in the menu bar app

Model Management

Admin Endpoints (Port 6591, Bearer auth)

Method Endpoint Description
GET /admin/models List all models (downloaded, loaded status)
POST /admin/models/download Download from HuggingFace
POST /admin/models/load Load model into GPU memory
POST /admin/models/unload Evict from GPU memory
DELETE /admin/models/{id} Delete model files
POST /admin/models/discover Scan filesystem for new models
GET /admin/models/{id}/settings Get per-model settings
PUT /admin/models/{id}/settings Update per-model settings
GET /admin/api/memory Memory state and pressure monitoring
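
Model Download & Load Example

A typical download-then-load flow. The model_id body field is an assumption; X-Admin-Key is one of the two documented admin auth options:

# Download a model from HuggingFace
curl -X POST http://localhost:6591/admin/models/download \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "mlx-community/your-model-4bit"}'

# Load it into GPU memory
curl -X POST http://localhost:6591/admin/models/load \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "mlx-community/your-model-4bit"}'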

Model Eviction

  • LRU eviction of unpinned models when memory budget exceeded
  • TTL-based auto-unload (configurable per model)
  • OS memory pressure handler: Critical → evict all unpinned models; Warning → evict down to one unpinned model
  • Periodic monitoring (60s): GPU memory >80% triggers eviction

Memory Management

System-level memory pressure monitoring with automatic model eviction via ProcessMemoryEnforcer.

ProcessMemoryEnforcer

Actor-based memory pressure manager with configurable policies:

Feature Details
Polling 1-second memory pressure monitoring loop
Soft limit Triggers warning log when exceeded
Hard limit Forces immediate eviction of unpinned models
Auto mode Reserves max(4GB, min(8GB, physMem/5)) for system, rest available to NovaMLX
Percent mode User-defined percentage of physical RAM (e.g., 80%)
Fixed mode Absolute byte limit (e.g., 24GB, 4096MB)
OS integration Responds to macOS memory pressure notifications
GUI config Mode picker in Settings with conditional inputs for percent/fixed values

Memory Budget Tracking

Per-sequence KV cache budget with admission control:

Feature Details
Reservation Each sequence reserves KV cache budget before admission
Release Budget freed on sequence completion
Admission control Rejects requests when available budget insufficient
TurboQuant boost KV quantization effectively doubles admission capacity

Per-Model Settings

Every inference parameter can be overridden per model, persisted to model_settings.json:

Setting Description
max_context_window Override context length
max_tokens Max generation tokens
temperature Sampling temperature
top_p / top_k / min_p Sampling strategy
frequency_penalty / presence_penalty / repetition_penalty Repetition control
seed Deterministic generation
ttl_seconds Auto-unload after idle
model_alias Short alias for model ID
is_pinned Exempt from eviction
is_default Default model for API requests
kv_bits / kv_group_size TurboQuant configuration
thinking_budget Max thinking/reasoning tokens
display_name / description Human-readable metadata
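
A sketch of a settings update. Setting names come from the table above; the exact request shape is an assumption:

# Pin a model, cap its context, and enable 4-bit KV quantization
curl -X PUT http://localhost:6591/admin/models/your-model/settings \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "is_pinned": true,
    "max_context_window": 32768,
    "kv_bits": 4,
    "kv_group_size": 64
  }'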

HuggingFace Integration

Browse, search, and download models directly from HuggingFace.

Feature Details
Search Query HF Hub with optional MLX-only filter, sortable by trending/downloads
Model info Architecture, tags, file listing, license
Downloads Async file-by-file with progress tracking and cancellation
Dashboard UI Search box, results table, download progress with cancel buttons
Auto-register Downloaded models automatically discovered and registered

Cloud Model Proxy

Remote inference proxy that makes cloud-hosted models appear alongside local models.

Feature Details
Protocol OpenAI-compatible API passthrough
Streaming Full SSE streaming passthrough with token-by-token relay
Model discovery Cloud models listed in /v1/models alongside local models
Configuration Remote endpoint URL and API key per cloud model
Use case Access models too large for local GPU, or add remote fallback capacity

Observability & Benchmarking

Performance Benchmark

  • Prompt lengths: Configurable (default: 1024, 4096)
  • Metrics: TTFT (ms), generation tok/s, prefill tok/s, peak memory (GB), end-to-end latency (s)
  • Endpoints: POST /admin/api/bench/start, GET /admin/api/bench/status, POST /admin/api/bench/cancel
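
Benchmarks run asynchronously; start one, then poll for results (the prompt_lengths body field is an assumption based on the configurable defaults above):

# Start a benchmark run
curl -X POST http://localhost:6591/admin/api/bench/start \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt_lengths": [1024, 4096]}'

# Poll progress and results
curl http://localhost:6591/admin/api/bench/status \
  -H "X-Admin-Key: $ADMIN_KEY"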

Perplexity Evaluation

  • Metrics: Per-text perplexity, token count, average perplexity
  • Endpoints: POST /admin/api/ppl/start, GET /admin/api/ppl/status, POST /admin/api/ppl/cancel

Inference Metrics

  • Session + persistent all-time counters
  • Total requests, tokens generated, average tok/s
  • Per-model request counts, cache hit rate
  • GPU memory tracking

System Monitoring

  • Device info: chip, variant, memory, GPU/CPU cores, OS version
  • Update checker: GitHub Releases API
  • Stats clearing: Session and all-time

Web Interfaces

Chat UI (`/chat`)

Full-featured web chat application — no admin auth required.

  • Dark theme with purple accent
  • SSE streaming responses with stop button
  • Model selector
  • System prompt configuration modal
  • Conversation history (localStorage, max 50)
  • Markdown rendering with code syntax highlighting and copy buttons
  • Image upload (drag-drop, paste, file picker) for VLM
  • Thinking/reasoning bubble display
  • Image lightbox modal
  • API key input with persistent storage
  • Mobile-responsive sidebar

Admin Dashboard (`/admin/dashboard`)

Operational dashboard — requires admin auth.

  • Status cards (health, GPU memory)
  • Device info table
  • MCP servers status
  • Benchmark runner
  • Perplexity runner
  • HuggingFace model browser (search, download, progress)
  • Download task management with cancel
  • Stats management (clear session / all-time)
  • Auto-refresh every 5 seconds
  • Responsive layout

macOS Menu Bar App

Native SwiftUI menu bar extra with brain icon.

Feature Details
Status tab Running indicator, server address, loaded models, GPU memory, active requests, uptime, tok/s
Models tab Loaded model list with indicators, downloaded count, disk usage
Agents tab External agent detection (OpenClaw, Hermes, OpenCode), config generation, install links
Settings tab View/Edit config toggle, memory limit configuration, model settings, server management
Polling 2-second stats refresh
Window 280×200pt floating panel

Internationalization

Full i18n system supporting 9 languages across all GUI views and web interfaces.

Language Code
English en
简体中文 zh-Hans
繁體中文 (香港) zh-Hant-HK
繁體中文 (台灣) zh-Hant-TW
日本語 ja
한국어 ko
Français fr
Deutsch de
Русский ru

Feature Details
Pattern L10n.tr("key.path") with compile-time key validation
Coverage Settings, Agents, Chat, Menu Bar, Dashboard — all user-facing strings
Fallback English fallback for missing translations
Web UI Chat and Dashboard also fully internationalized

Security & Middleware

Layer Implementation
API auth Bearer token on port 6590 (APIKeyAuthMiddleware)
Admin auth Bearer token or X-Admin-Key header on port 6591 (AdminAuthMiddleware)
CORS Access-Control-Allow-Origin: * with full method/header support
Request ID x-request-id header pass-through or auto-generation
Error handling OpenAI-compatible JSON error bodies with proper HTTP status codes
Admin isolation Admin API disabled entirely when no API keys configured
Worker signing Ad-hoc code signing for worker subprocess to satisfy macOS process integrity

Homebrew Distribution

Install NovaMLX via Homebrew with automatic updates.

brew tap cnshsliu/nova
brew install --cask novamlx

Feature Details
Tap cnshsliu/nova Homebrew tap
Cask novamlx — installs .app bundle to /Applications
Updates brew upgrade --cask novamlx fetches latest GitHub Release
Version check Built-in update checker queries GitHub Releases API

Configuration

Server Configuration (`ServerConfig`)

Parameter Default Description
host 127.0.0.1 Bind address
port 6590 Inference API port
adminPort 6591 Admin API port
apiKeys [] API keys (empty = auth disabled)
maxConcurrentRequests 16 Max concurrent requests
requestTimeout 300s Request timeout
contextScalingTarget nil Scale reported token counts for billing compatibility
maxProcessMemory auto Process memory limit: auto, percent (80%), or absolute (24GB, 4096MB)
maxRequestSizeMB 100 Maximum request body size in megabytes

Engine Configuration

Parameter Default Description
kvCacheCapacity 1024 KV cache block capacity
maxMemoryMB 2048 Max GPU memory for models
maxConcurrent 8 Max concurrent inference tasks

Prefix Cache Configuration

Parameter Default Description
blockSize 64 Tokens per cache block
maxBlocks 4096 Maximum cache blocks
ssdMaxSizeBytes 100 GB Max SSD cache size

Architecture Overview

┌──────────────────────────────────────────────────────────┐
│                    macOS Menu Bar App                     │
│     Status • Models • Agents • Settings • Dashboard      │
├──────────────────────────────────────────────────────────┤
│                   NovaMLX API Server                      │
│  ┌─────────────────┐  ┌──────────────────────────────┐  │
│  │  Inference API   │  │        Admin API              │  │
│  │  (port 6590)     │  │        (port 6591)            │  │
│  │  • OpenAI        │  │  • Model Management           │  │
│  │  • Anthropic     │  │  • Per-Model Settings         │  │
│  │  • Embeddings    │  │  • Benchmarking               │  │
│  │  • Reranking     │  │  • HuggingFace Browser        │  │
│  │  • MCP Tools     │  │  • Memory Monitoring          │  │
│  │  • Cloud Proxy   │  │  • Dashboard UI               │  │
│  └────────┬─────────┘  └──────────────┬───────────────┘  │
│           └──────────┬────────────────┘                   │
│                      ▼                                    │
│  ┌──────────────────────────────────────────────────┐    │
│  │            Inference Service                      │    │
│  │  • Continuous Batcher  • Session Manager          │    │
│  │  • Speculative Decode  • Engine Pool              │    │
│  │  • Model Settings      • MCP Client Manager       │    │
│  └────────────────────────┬─────────────────────────┘    │
│                           ▼                               │
│  ┌──────────────────────┬──────────────────────────┐     │
│  │   MLX Engine Process │  Worker Subprocess        │     │
│  │  • LLM/VLM Generation│  • stdin/stdout JSON IPC  │     │
│  │  • Structured Output  │  • ProcessMemoryEnforcer  │     │
│  │  • TurboQuant         │  • WorkerSupervisor       │     │
│  │  • Prefix/SSD Cache   │  • GPU-Isolated Memory    │     │
│  │  • Vision Feature     │                           │     │
│  │  • TurnStopProcessor  │                           │     │
│  │  • ControlTokenFilter │                           │     │
│  └────────────────────┬─┴──────────────────────────┘     │
│                       ▼                                   │
│  ┌──────────────────────────────────────────────────┐    │
│  │          MLX / Apple Silicon GPU                  │    │
│  │  • Lazy Evaluation  • Unified Memory  • Metal    │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘

Quick Start

# Build
swift build -c release

# Run with API key
./.build/release/NovaMLX --api-key sk-your-key

# Chat completion
curl http://localhost:6590/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'