An LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a ~2.5 GB image that runs 3x faster than the status quo.
```shell
docker pull avarok/atlas-gb10:alpha-2.8
```

vLLM ships 20+ GB of Python, PyTorch, and 200+ dependencies. Atlas ships a single ~2.5 GB binary. That simplicity is the speed.
Compiled from HTTP to kernel dispatch. No interpreter, no GIL, no JIT warm-up.
Hand-tuned attention, MoE, GDN, and Mamba-2 kernels for Blackwell SM120/121. NVFP4 and FP8 with native tensor cores.
Multi-Token Prediction generates multiple tokens per forward pass. Up to 3x throughput over single-token decoding.
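To see why this multiplies throughput, here is a toy sketch of MTP-style decoding in Python. It is an illustration of the general verify-and-accept idea, not Atlas's actual algorithm: `target_next` and `draft_k` are made-up stand-ins for the main model and the MTP heads, using a deterministic toy rule so the flow is easy to follow.

```python
def target_next(context):
    """Stand-in for the main model: deterministic toy next-token rule."""
    return (context[-1] + 1) % 10

def draft_k(context, k):
    """Stand-in for MTP heads: propose k tokens ahead of the context."""
    out, ctx = [], list(context)
    for _ in range(k):
        tok = (ctx[-1] + 1) % 10  # toy drafts happen to agree with the target
        out.append(tok)
        ctx.append(tok)
    return out

def mtp_step(context, k=3):
    """One decode step: verify k draft tokens, accept the matching prefix."""
    drafts = draft_k(context, k)
    accepted, ctx = [], list(context)
    for tok in drafts:
        if tok == target_next(ctx):  # in practice, verified in one forward pass
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # On a rejection, the main model's own token is still produced,
    # so a step never yields fewer tokens than plain decoding.
    if len(accepted) < k:
        accepted.append(target_next(ctx))
    return accepted

print(mtp_step([1, 2, 3]))  # → [4, 5, 6]: three tokens from one step
```

When the drafts are usually right, each forward pass emits several tokens instead of one, which is where the up-to-3x figure comes from; when they are wrong, you fall back to ordinary single-token decoding.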
Every model gets hand-tuned CUDA kernels. We expand based on what the community runs. All models ship with OpenAI-compatible tool calling.
| Model | Parameters | Quantization | Architecture | Throughput |
|---|---|---|---|---|
| Qwen3.5-35B-A3B MTP FP8 | 35B (3B active) | NVFP4 / FP8 | GDN + Attention + MoE | ~130 tok/s |
| Qwen3.5-122B-A10B MTP EP=2 | 122B (10B active) | NVFP4 | GDN + Attention + MoE | ~50-54 tok/s |
| Qwen3.5-27B | 27B (dense) | NVFP4 | GDN + Attention (Dense) | ~15 tok/s |
| Qwen3-Next-80B-A3B | 80B (3B active) | NVFP4 | SSM + Attention + MoE | ~82 tok/s |
| Qwen3-Coder-Next FP8 | 80B (3B active) | FP8 | SSM + Attention + MoE | ~58 tok/s |
| Qwen3-VL-30B | 30B (3B active) | NVFP4 | Attention + MoE (Vision) | ~100 tok/s |
| Gemma 4 31B | 31B (dense) | NVFP4 | Dense Transformer | ~12 tok/s |
| Gemma 4 26B | 26B (3.8B active) | NVFP4 | MoE (128 experts, top-8) | ~35 tok/s |
| Nemotron-3 Super 120B FP8 | 120B (12B active) | NVFP4 / FP8 | Mamba-2 + MoE | ~24 tok/s |
| Nemotron-3 Nano 30B FP8 | 30B (3.5B active) | NVFP4 / FP8 | Mamba-2 + MoE | ~100 tok/s |
| Mistral Small 4 119B | 119B (6.5B active) | NVFP4 | MLA + MoE | ~26 tok/s |
Don't take our word for it. Pull the image. Run it on your DGX Spark. See the difference.
```shell
docker pull avarok/atlas-gb10:alpha-2.8

docker run -d --gpus all --ipc=host -p 8888:8888 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:alpha-2.8 serve \
  Sehyo/Qwen3.5-35B-A3B-NVFP4 \
  --speculative --scheduling-policy slai \
  --max-seq-len 131072 --max-batch-size 1 \
  --max-prefill-tokens 0
```
OpenAI compatible at http://localhost:8888/v1. Works with Claude Code, Cline, OpenCode, Open WebUI, and any OpenAI-compatible client.
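As a quick sanity check, here is a minimal chat-completions request against the local endpoint using only the Python standard library. It assumes the `docker run` command above is running and that the model name matches what you served; the prompt itself is just an example.

```python
import json
import urllib.request

payload = {
    "model": "Sehyo/Qwen3.5-35B-A3B-NVFP4",  # must match the served model
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8888/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI SDK works the same way: point its `base_url` at `http://localhost:8888/v1` and use the served model name.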
Real feedback from DGX Spark owners running Atlas. Check out our first post on r/LocalLLaMA that started it all.
> 103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then I tried the 122B: 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid on the same hardware, with a 2-minute startup.

> I've been testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. The Spark is actually awesome with Atlas.

> I'd grown tired of vLLM and had been hoping for an alternative. I was really surprised and impressed. I'm so glad I bought the Spark, because that's how I came across this.

> 115 tok/s on Spark is actually nuts. This speed is insane, amazing work.
We optimize for your use case. Reach out with model requests, hardware setups, or partnership ideas.
Free and open source. Coming soon.
We don't chase every architecture at once. We do each one properly, with kernels that hit the hardware ceiling rather than emulate around it.
Optimized for DGX Spark today. ASUS Ascent GX10 compatibility confirmed by the community. Strix Halo port in exploration. RTX 6000 Pro Blackwell on the horizon. Same kernel philosophy, adapted per chip.
Every model gets its own hand-tuned CUDA kernels. No generic fallbacks. We profile, optimize, and validate at the register level. If a model matters to you, it matters to us.
MiniMax 2.7 is next. Model support is driven entirely by what the community asks for. We're in Discord every day listening. Tell us what you're running and we'll optimize for your use case.
Free and open source release coming soon. We want to make sure what we release is something people can actually build on, not just a dump.
Vision support live for Qwen3-VL. Audio and additional modalities on the roadmap. The goal is proper kernel-level support for each modality.
OpenAI + Anthropic API compatibility on the same port. Tool calling, structured output, multi-turn. Works with Claude Code, Cline, OpenCode, and Open WebUI out of the box.
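For tool calling, the request body follows the standard OpenAI chat-completions `tools` format. The sketch below builds such a body; the `get_weather` function is a made-up example for illustration, not something Atlas ships.

```python
import json

# One tool definition in the OpenAI "tools" schema: a function with a
# JSON-Schema description of its parameters.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

body = {
    "model": "Sehyo/Qwen3.5-35B-A3B-NVFP4",
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

print(json.dumps(body, indent=2))
```

When the model decides to call the tool, the response carries a `tool_calls` entry with the function name and JSON-encoded arguments, which your client executes and feeds back as a `tool` role message in the next turn.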