About Models and Inference
The NeMo Platform provides APIs for registering external model providers and routing inference requests through a unified gateway.
Model Registry
Models service manages model entities and model providers.
Core Objects
Model — A registered model within the platform, referencing a specific model like nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. Models are made available by a hosted provider (NVIDIA Build, OpenAI, and so on) and are served via a ModelProvider.
ModelProvider — A routable inference host registered for an external API (NVIDIA Build, OpenAI, and so on). All inference requests route through a ModelProvider which serves one or more Models.
Model Providers
Model providers connect the platform to external inference APIs such as NVIDIA Build or OpenAI. The workflow is:
- Store the API key as a secret in the platform
- Create a model provider pointing to the external API with the secret reference
- Route inference through the gateway using provider or model entity routing
nemo setup pre-configures a default/nvidia-build model provider during local startup.
This provider routes inference requests to models hosted on build.nvidia.com using the API base URL https://integrate.api.nvidia.com
and the NGC API key with Public API Endpoints permissions provided during deployment.
You can verify this provider exists by running nemo inference providers list --workspace default.
The tutorials in these docs use this provider for inference, but you can alternatively create your own and use it instead.
Add External Providers
Register external inference APIs like NVIDIA Build or OpenAI.
NVIDIA Build
By default, the platform pre-configures an external provider for NVIDIA Build named nvidia-build in the system workspace.
The example below demonstrates how to recreate it in your own workspace.
For disambiguation purposes, this example names the manually-created version my-nvidia-build.
CLI
Python SDK
OpenAI
CLI
Python SDK
Anthropic
Anthropic’s /v1/messages API expects the API key in an X-Api-Key: header (not Authorization: Bearer) and requires an anthropic-version header on every request. Use --auth-header-format (Jinja2 template, must contain exactly one {{ auth_secret }} variable) to override the default Authorization: Bearer {{ auth_secret }} and pass the API-version pin via --default-extra-headers. Without these, Anthropic rejects every request with 401.
CLI
Python SDK
{{ auth_secret }} is substituted with the resolved secret value at request time.
Inference Gateway
Inference Gateway is a Layer 7 reverse proxy providing unified access to all inference endpoints. It supports three routing patterns:
Routing Patterns
All patterns use /-/ as a separator. Everything after /-/ is forwarded to the backend unchanged.
Path Examples
Use nemo inference get-url to print the correct base URL for your workspace
without hand-assembling the path. Add --provider <name> or
--virtual-model <name> to get the URL for the corresponding proxy route.
SDK Helper Methods
Set up the CLI or Python SDK first:
CLI
Python SDK
The SDK provides convenience methods for OpenAI compatibility:
Verifying Gateway Reachability
The platform exposes two health endpoints at the root of the platform router. The inference gateway is reachable through the same platform host, so these are the canonical checks for “is the gateway up?”:
/v1/health/ready and /v1/health/live are NIM endpoints, not gateway
endpoints. If a probe at /v1/health/... against the gateway returns 404 it’s
working as designed — use /health/ready instead.
API Reference
For complete API details, refer to the Inference Gateway API Reference and SDK Reference.