Token Factory: inference API for open models

OpenAI-compatible API to start fast.
Dedicated GPUs, post-training, and workload optimization when you scale.
Start in minutes. Engineer it for production when it matters.
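Because the API is OpenAI-compatible, a standard Chat Completions request works against it. The sketch below builds such a request with the Python standard library; the base URL, API key, and model name are placeholders, not official values, so substitute your own.

```python
# Minimal sketch of a request to an OpenAI-compatible Chat Completions
# endpoint. BASE_URL, API_KEY, and the model name are placeholders
# (assumptions), not official Token Factory values.
import json
import urllib.request

BASE_URL = "https://example.com/v1"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"             # placeholder credential


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble a standard /chat/completions POST request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("some-open-model", "Hello!")
# urllib.request.urlopen(req) would send it; omitted here since the
# endpoint above is a placeholder.
```

Pointing an existing OpenAI SDK client at a different `base_url` follows the same pattern: only the host and credentials change, not the request shape.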

Build. Customize. Optimize. Deploy.

Architecture and build walkthroughs

Official Token Factory playlist on YouTube

Learn about Nebius Token Factory from the makers.

Build with Token Factory playlist on YouTube

Discover AI projects, tools, and success stories created with Token Factory.

In-depth technical resources

Production inference is more than serving a model. These guides break down the architecture behind real-world workloads.

Why large MoE models break latency budgets, and what speculative decoding changes in production systems

The invisible architecture behind great chat apps

Routing in LLM inference is the difference between scaling and stalling

Ship faster with the community

Get help and connect with other builders.

Subscribe to our newsletter

Get builder updates: new releases, cookbooks, events, and more.