Pure CUDA + Rust. Zero Python dependencies, zero complex recipes.

Inference at unimaginable speeds

An LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a ~2.5 GB image that runs 3x faster than the status quo.

$ docker pull avarok/atlas-gb10:alpha-2.8
~2.5 GB image. Run command below.
130 tok/s peak (Qwen3.5-35B)
~2.5 GB total image size
<2 min cold start time
3.1x faster than vLLM
Faster by Design

Clean architecture beats bloat

vLLM ships 20+ GB of Python, PyTorch, and 200+ dependencies. Atlas ships a single ~2.5 GB binary. That simplicity is the speed.

|              | Atlas       | vLLM             |
|--------------|-------------|------------------|
| Image size   | ~2.5 GB     | 20+ GB           |
| Cold start   | <2 min      | ~10 min          |
| Runtime      | Rust + CUDA | Python + PyTorch |
| Dependencies | None        | 200+ packages    |

Pure Rust + CUDA

A single compiled path from HTTP request to kernel dispatch. No interpreter, no GIL, no JIT warm-up.

🔧 Custom CUDA Kernels

Hand-tuned attention, MoE, GDN, and Mamba-2 kernels for Blackwell SM120/121. NVFP4 and FP8 with native tensor cores.

🔮 MTP Speculative Decoding

Multi-Token Prediction generates multiple tokens per forward pass. Up to 3x throughput over single-token decoding.

Qwen3.5-35B (NVFP4) on DGX Spark
Single GPU, batch=1. Atlas with MTP K=2.

Average (diverse workloads): Atlas 111.4 tok/s vs vLLM 37.5 tok/s (3.0x)
Peak (short context): Atlas 130 tok/s vs vLLM ~38 tok/s (3.3x)

Qwen3.5-122B (NVFP4) on DGX Spark
122B-parameter model; ~50 tok/s on a single node, ~54 tok/s with EP=2 across two nodes.

Decode throughput: Atlas ~50 tok/s vs vLLM ~15 tok/s (3.3x)
Supported Models

Model matrix

Every model gets hand-tuned CUDA kernels. We expand based on what the community runs. All models ship with OpenAI-compatible tool calling.

| Model | Parameters | Quantization | Architecture | Throughput |
|---|---|---|---|---|
| Qwen3.5-35B-A3B MTP FP8 | 35B (3B active) | NVFP4 / FP8 | GDN + Attention + MoE | ~130 tok/s |
| Qwen3.5-122B-A10B MTP EP=2 | 122B (10B active) | NVFP4 | GDN + Attention + MoE | ~50-54 tok/s |
| Qwen3.5-27B | 27B (dense) | NVFP4 | GDN + Attention (Dense) | ~15 tok/s |
| Qwen3-Next-80B-A3B | 80B (3B active) | NVFP4 | SSM + Attention + MoE | ~82 tok/s |
| Qwen3-Coder-Next FP8 | 80B (3B active) | FP8 | SSM + Attention + MoE | ~58 tok/s |
| Qwen3-VL-30B | 30B (3B active) | NVFP4 | Attention + MoE (Vision) | ~100 tok/s |
| Gemma 4 31B | 31B (dense) | NVFP4 | Dense Transformer | ~12 tok/s |
| Gemma 4 26B | 26B (3.8B active) | NVFP4 | MoE (128 experts, top-8) | ~35 tok/s |
| Nemotron-3 Super 120B FP8 | 120B (12B active) | NVFP4 / FP8 | Mamba-2 + MoE | ~24 tok/s |
| Nemotron-3 Nano 30B FP8 | 30B (3.5B active) | NVFP4 / FP8 | Mamba-2 + MoE | ~100 tok/s |
| Mistral Small 4 119B | 119B (6.5B active) | NVFP4 | MLA + MoE | ~26 tok/s |
All benchmarks on single DGX Spark (GB10) unless noted. EP=2 = Expert Parallelism across two nodes. MTP = Multi-Token Prediction speculative decoding.
Try It Yourself

Up and running in two commands

Don't take our word for it. Pull the image. Run it on your DGX Spark. See the difference.

Qwen3.5-35B, 130 tok/s on a single Spark
docker pull avarok/atlas-gb10:alpha-2.8

docker run -d --gpus all --ipc=host -p 8888:8888 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:alpha-2.8 serve \
  Sehyo/Qwen3.5-35B-A3B-NVFP4 \
  --speculative --scheduling-policy slai \
  --max-seq-len 131072 --max-batch-size 1 \
  --max-prefill-tokens 0

OpenAI compatible at http://localhost:8888/v1. Works with Claude Code, Cline, OpenCode, Open WebUI, and any OpenAI-compatible client.
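Once the container is up, you can smoke-test the endpoint with plain curl. A minimal sketch, assuming the server has finished loading the model and that the model name matches the Hugging Face ID passed to `serve` above; the response follows the standard OpenAI chat-completions schema.

```shell
# Smoke-test the OpenAI-compatible endpoint (assumes the docker run
# command above is active and the model has finished loading).
curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Sehyo/Qwen3.5-35B-A3B-NVFP4",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```

The same endpoint works unchanged from any OpenAI SDK by pointing `base_url` at `http://localhost:8888/v1`.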

Community

What people are saying

Real feedback from DGX Spark owners running Atlas. Check out our first post on r/LocalLLaMA that started it all.

103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then tried the 122B, 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid, same hardware, 2-minute startup.
ronald_15496, #general
Testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. Spark is actually awesome with Atlas.
PersonWhoThinks, r/LocalLLaMA
I've grown tired of vLLM and have been hoping for something. I was really surprised and impressed. I'm so glad I bought Spark because I came across this.
tetsuro59, #general
115 tok/s on Spark is actually nuts. This speed is insane, amazing work.
ikkiho, Waste_Ad9929, r/LocalLLaMA
Contact

Get in touch

We optimize for your use case. Reach out with model requests, hardware setups, or partnership ideas.

Discord

Fastest way to reach us.

discord.gg/DwF3brBMpw

Email

Partnerships and enterprise.

Open Source

Free and open source. Coming soon.

Roadmap

Built for the community

We don't chase every architecture at once. We do each one properly, with kernels that hit the hardware ceiling rather than emulate around it.

🌐 Hardware Expansion

Optimized for DGX Spark today. ASUS Ascent GX10 compatibility confirmed by the community. Strix Halo port in exploration. RTX 6000 Pro Blackwell on the horizon. Same kernel philosophy, adapted per chip.

💡 Kernel Philosophy

Every model gets its own hand-tuned CUDA kernels. No generic fallbacks. We profile, optimize, and validate at the register level. If a model matters to you, it matters to us.

📢 Community-Driven

MiniMax 2.7 is next. Model support is driven entirely by what the community asks for. We're in Discord every day listening. Tell us what you're running and we'll optimize for your use case.

🛠 Open Source

Free and open source release coming soon. We want to make sure what we release is something people can actually build on, not just a dump.

🎨 Multimodal

Vision support live for Qwen3-VL. Audio and additional modalities on the roadmap. The goal is proper kernel-level support for each modality.

🎯 Agentic-Ready

OpenAI + Anthropic API compatibility on the same port. Tool calling, structured output, multi-turn. Works with Claude Code, Cline, OpenCode, and Open WebUI out of the box.
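As a sketch of the tool-calling flow: since the server advertises OpenAI-compatible tool calling, a request can attach tools using the standard `tools` schema. The `get_weather` tool and its parameters here are hypothetical, and the model name is assumed to match the run command shown earlier.

```shell
# Hypothetical tool definition using the standard OpenAI function-calling
# schema; the model should respond with a tool_calls entry for get_weather.
curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Sehyo/Qwen3.5-35B-A3B-NVFP4",
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'
```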