Running Inference with Guardrails
NeMo Guardrails applies safety checks to inference requests through VirtualModels. When your application sends a request to a VirtualModel with guardrails middleware, the plugin runs input and output rails around the model call automatically. You use the standard IGW OpenAI-compatible endpoint — no separate guardrails endpoint is needed.
Platform-managed agents are the canonical consumer: pointing an agent’s llms block at a guarded VirtualModel entity is how you secure the agent’s model path. See Secure Agents for the agent-side wiring. The rest of this page applies to any client calling a guarded VirtualModel, agent or not.
Prerequisites
- A guardrail configuration exists in your workspace. See Manage Configurations to create one.
- A VirtualModel is configured with
nemo-guardrailsmiddleware. See Architecture for wiring details.
If you do not have access to GPUs, you can use NIMs hosted on build.nvidia.com. Refer to Setup for instructions on configuring a ModelProvider that routes requests to an external endpoint.
For the examples below, create a guardrail configuration and VirtualModel:
Create a VirtualModel that applies the guardrail configuration:
CLI
Python SDK
Inference Endpoint
Inference requests go to the standard IGW OpenAI-compatible endpoint:
Set the model field to your VirtualModel’s entity reference (workspace/name format). IGW resolves the VirtualModel, runs the guardrails middleware pipeline, and proxies to the backend model.
Chat Completions
CLI
curl
Python SDK
Model Routing
The model field in your request must reference a VirtualModel entity (workspace/name format). IGW resolves the VirtualModel, applies its middleware pipeline, and proxies to the backend model specified by the VirtualModel’s default_model_entity.
Task models in your guardrail configuration (content safety, topic control, etc.) must reference Model Entities using the same workspace/name format. The plugin resolves their endpoints through IGW’s route table.
For VirtualModel wiring (request_middleware/response_middleware, entity-backed vs inline configs), refer to Architecture. For an overview of how Model Entities and Model Providers fit together, refer to About Models and Inference.
Inline Configuration
Instead of referencing a stored config entity via config_id, you can embed the guardrail configuration directly in the VirtualModel’s middleware entry using config:
CLI
Python SDK
The inline config shares the same LLMRails pool when its content hash matches an existing entry. The plugin warms inline configs on VirtualModel upsert too — it walks the VM’s middleware entries, stabilizes each source, and dedups by content hash — so the first request after upsert finds a hot pool just like entity-backed configs do. See Architecture for the cache details.
The optional name field sets the diagnostic label in logs (appears as <inline:my-inline-config>).
Streaming Output
Streaming reduces time-to-first-token (TTFT) by returning chunks as they are generated. When output rails are configured, the plugin applies safety checks to chunks of tokens as they stream from the model.
Configuration
Enable streaming in your guardrail config’s output rails. The streaming property supports the following fields:
If the request sets stream: true but the guardrail config has output flows with streaming.enabled: false, IGW returns HTTP 400 with a message instructing you to set rails.output.streaming.enabled=true. Either enable streaming on the rails config, or send non-streaming requests to that VirtualModel.
Streaming Chat Completions
curl
CLI
Blocked Content Detection
When content is blocked during streaming, the stream includes an error chunk:
Guardrails Request Options
You can include a guardrails field in the request body to control logging and response format. This field is optional and does not affect which rails are applied — rail selection is determined by the guardrail configuration on the VirtualModel.
Log Options
The guardrails.options.log object controls what diagnostic information is included in the response:
When debugging an unexpected block or pass-through, start with
activated_rails to confirm which rails ran. Add llm_calls when you need to
inspect the raw model output that a rail parser consumed. Add
internal_events when you need the lower-level execution trace to understand
which actions ran before the final allow or block decision.
llm_calls can include raw prompts and completions, including user data or other sensitive content. Consider enabling it for scoped debugging
and disabling it or, if needed, redacting captured data before storing or using
logs in production environments.
Return as Choice
For clients that do not handle extra response fields, configure the request to return guardrail data as a choice in the choices list with the role guardrails_data:
Custom HTTP Headers
The plugin uses two different header-forwarding paths for main models and task models. Understanding the split matters when you want a custom header (for tenancy, observability, etc.) to reach a specific upstream.
Main Model Calls
The main model is the one IGW resolves from default_model_entity and that handles generation. When the plugin builds the per-request main LLM, it forwards a curated allowlist from the inbound request:
- All headers starting with
x-orX-— NeMo Platform principal headers,x-otel-*, and any customX-Fooheaders your client sets. - W3C Trace Context headers:
traceparent,tracestate,baggage.
Authorization and other non-allowlisted headers are intentionally dropped — IGW handles auth.
You can also pin static defaults on a type: "main" entry in the guardrail configuration via parameters.default_headers. Request-time headers override configuration defaults for the same header name (case-insensitive).
Task Model Calls
Task models (content safety, topic control, embeddings, etc.) follow a different rule. The cached LangChain client is shared across requests, so arbitrary inbound headers are not forwarded to task models. Instead, each task-model call merges in specific service headers derived from the current request context:
traceparent/tracestatefor distributed tracing.X-NMP-Principal-Id/X-NMP-Principal-On-Behalf-Offor service-principal authorization.
If you need a static header on every call to a specific task model — for example a tenant tag or an upstream API key — declare it in that model’s parameters.default_headers: