Pure CUDA + Rust. Zero python dependencies, zero complex recipes.

Inference at
unimaginable speeds

An LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a ~2.5 GB image that runs 3x faster than the status quo.

$ uvx sparkrun setup install
sparkrun pulls & runs the ~2.5 GB Atlas image for you. Run command below.
130
tok/s peak (Qwen3.5-35B)
~2.5GB
total image size
<2min
cold start time
3.1x
faster than vLLM
Faster by Design

Clean architecture beats bloat

vLLM ships 20+ GB of Python, PyTorch, and 200+ dependencies. Atlas ships a single ~2.5 GB binary. That simplicity is the speed.

Atlas

Image size ~2.5 GB
Cold start <2 min
Runtime Rust + CUDA
Dependencies None

vLLM

Image size 20+ GB
Cold start ~10 min
Runtime Python + PyTorch
Dependencies 200+ packages
โšก

Pure Rust + CUDA

Compiled from HTTP to kernel dispatch. No interpreter, no GIL, no JIT warm-up.

๐Ÿ”ง

Custom CUDA Kernels

Hand-tuned attention, MoE, GDN, and Mamba-2 kernels for Blackwell SM120/121. NVFP4 and FP8 with native tensor cores.

๐Ÿ”ฎ

MTP Speculative Decoding

Multi-Token Prediction generates multiple tokens per forward pass. Up to 3x throughput over single-token decoding.

Qwen3.5-35B (NVFP4) on DGX Spark
Single GPU, batch=1. Atlas with MTP K=2.
Atlas
vLLM
Average (diverse workloads)
Atlas
111.4 tok/s
3.0x
vLLM
37.5 tok/s
Peak (short context)
Atlas
130 tok/s
3.3x
vLLM
~38 tok/s
Qwen3.5-122B (NVFP4) on a single DGX Spark
122B parameter model, single node. ~54 tok/s with EP=2.
Atlas
vLLM
Decode throughput
Atlas
~50 tok/s
3.3x
vLLM
~15 tok/s
Supported Models

Model matrix

Every model gets hand-tuned CUDA kernels. Pick a vendor, then a model family; every recipe maps to a single sparkrun recipe you can copy and run as-is.

All recipes are the single source of truth in atlas-recipes. Run any of them with sparkrun. EP=2 = Expert Parallelism across two GB10 nodes.
Try It Yourself

Up and running in one command

Don't take our word for it. One command on your DGX Spark and you're serving. The quickstart script installs sparkrun only if it isn't already present, then runs the recipe. sparkrun pulls & runs the Atlas image for you using your existing Docker/Podman + NVIDIA container runtime.

Run with sparkrun โ€” Qwen3.6-35B-A3B FP8 + MTP on a single Spark
$ curl -fsSL https://atlasinference.io/quickstart.sh | sh

OpenAI compatible at http://localhost:8888/v1. Works with Claude Code, Cline, OpenCode, Open WebUI, and any OpenAI-compatible client.

Community

What people are saying

Real feedback from DGX Spark owners running Atlas. Check out our first post on r/LocalLLaMA that started it all.

103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then tried the 122B, 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid, same hardware, 2-minute startup.
ronald_15496, #general
Testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. Spark is actually awesome with Atlas.
PersonWhoThinks, r/LocalLLaMA
I've grown tired of vLLM and have been hoping for something. I was really surprised and impressed. I'm so glad I bought Spark because I came across this.
tetsuro59, #general
115 tok/s on Spark is actually nuts. This speed is insane, amazing work.
ikkiho, Waste_Ad9929, r/LocalLLaMA
Contact

Get in touch

We optimize for your use case. Reach out with model requests, hardware setups, or partnership ideas.

Discord

Fastest way to reach us.

discord.gg/DwF3brBMpw

Email

Partnerships and enterprise.

Open Source

AGPL-3.0. Fork it, ship it.

github.com/Avarok-Cybersecurity/atlas
Roadmap

Built for the community

We don't chase every architecture at once. We do each one properly, with kernels that hit the hardware ceiling rather than emulate around it.

๐ŸŒ

Hardware Expansion

Optimized for DGX Spark today. ASUS Ascent GX10 compatibility confirmed by the community. Strix Halo port in exploration. RTX 6000 Pro Blackwell on the horizon. Same kernel philosophy, adapted per chip.

๐Ÿ’ก

Kernel Philosophy

Every model gets its own hand-tuned CUDA kernels. No generic fallbacks. We profile, optimize, and validate at the register level. If a model matters to you, it matters to us.

๐Ÿ“ข

Community-Driven

MiniMax M2.7 just landed. Model support is driven entirely by what the community asks for. We're in Discord every day listening. Tell us what you're running and we'll optimize for your use case.

๐Ÿ› 

Open Source

Free and open source release coming soon. We want to make sure what we release is something people can actually build on, not just a dump.

๐ŸŽจ

Multimodal

Vision support live for Qwen3-VL. Audio and additional modalities on the roadmap. The goal is proper kernel-level support for each modality.

๐ŸŽฏ

Agentic-Ready

OpenAI + Anthropic API compatibility on the same port. Tool calling, structured output, multi-turn. Works with Claude Code, Cline, OpenCode, and Open WebUI out of the box.