The Basics | NVIDIA NeMo Platform

This tutorial demonstrates the fundamentals of Data Designer by generating a product review dataset.

For more detail about column behavior, see the open-source library’s version of this tutorial.

Prerequisites

Ensure you have completed the tutorial prerequisites. This tutorial uses an Inference Gateway provider, so both local CLI runs and NeMo Services execution need access to the Inference Gateway API in a running NeMo Services cluster.

Part 1: Build the Configuration

Use the data_designer.config package to define your dataset schema. This configuration code is the same across the plugin execution modes.

Build the configuration once, then choose whether to execute with CLI run, CLI submit, or the SDK.

Define Models

Start by defining the models you want to use:

import data_designer.config as dd

MODEL_ALIAS = "text"

model_configs = [
    dd.ModelConfig(
        provider="default/nvidia-build",
        model="nvidia/nemotron-3-nano-30b-a3b",  # Use the `served_model_name` from the provider
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs)

Add Columns

Define the columns for your dataset. The library documentation explains these column types in detail.

# Product category sampler
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

# Product subcategory sampler (conditional on category)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_subcategory",
        sampler_type=dd.SamplerType.SUBCATEGORY,
        params=dd.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

# Target age range
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="target_age_range",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["18-25", "25-35", "35-50", "50-65", "65+"]
        ),
    )
)

# Customer details using Faker
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(age_range=[18, 70], locale="en_US"),
    )
)

# Star rating
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="number_of_stars",
        sampler_type=dd.SamplerType.UNIFORM,
        params=dd.UniformSamplerParams(low=1, high=5),
        convert_to="int",  # Convert the sampled float to an integer
    )
)

# Review style
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="review_style",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

# LLM-generated product name
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="product_name",
        prompt=(
            "You are a helpful assistant that generates product names. DO NOT add quotes around the product name.\n\n"
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)

# LLM-generated customer review
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
            "Respond with only the review, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)
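To build intuition for the sampler columns above, the following is a minimal standard-library sketch of weighted category sampling and of a subcategory drawn conditionally on its category. It is not how Data Designer implements its samplers: the trimmed value lists, the sample_record helper, and the assumption that weights are relative (normalized before sampling) are all illustrative.

```python
import random

# Hypothetical, trimmed-down stand-ins for the sampler parameters in this tutorial.
CATEGORY_VALUES = ["Electronics", "Books"]
SUBCATEGORY_VALUES = {
    "Electronics": ["Smartphones", "Laptops"],
    "Books": ["Fiction", "Classics"],
}
STYLE_VALUES = ["rambling", "brief", "detailed", "structured with bullet points"]
STYLE_WEIGHTS = [1, 2, 2, 1]  # assumed to be relative weights: 1/6, 2/6, 2/6, 1/6


def sample_record(rng: random.Random) -> dict:
    """Draw one record: the category first, then a subcategory conditioned on it."""
    category = rng.choice(CATEGORY_VALUES)
    # The subcategory is only ever drawn from the list keyed by the sampled category.
    subcategory = rng.choice(SUBCATEGORY_VALUES[category])
    style = rng.choices(STYLE_VALUES, weights=STYLE_WEIGHTS, k=1)[0]
    return {
        "product_category": category,
        "product_subcategory": subcategory,
        "review_style": style,
    }


rng = random.Random(0)  # fixed seed so repeated runs draw the same records
records = [sample_record(rng) for _ in range(1000)]
print(records[0])
```

Under these assumptions, "brief" and "detailed" each appear about twice as often as "rambling" across many records, and every subcategory is valid for its category.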

Part 2: Execute

Now execute your configuration. You can run locally through the CLI, submit to NeMo Services, or call the Data Designer API from the SDK.

Local CLI Execution

Save the configuration in a Python file such as product_reviews.py and expose a load_config_builder() function that returns the config_builder.

def load_config_builder() -> dd.DataDesignerConfigBuilder:
    return config_builder
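As a rough illustration of why the file needs to expose load_config_builder(), the sketch below shows one way a tool could import a Python file by path and call that function. This is an assumption about the general pattern, not the nemo CLI's actual implementation; load_builder_from_file is a hypothetical helper, and the stand-in file returns a plain dict instead of a real dd.DataDesignerConfigBuilder.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_builder_from_file(path: str):
    """Import a Python file by path and return the result of its load_config_builder()."""
    spec = importlib.util.spec_from_file_location("user_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module.load_config_builder()


# Demo with a stand-in config file. A real product_reviews.py would build and
# return a dd.DataDesignerConfigBuilder instead of this placeholder dict.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "product_reviews.py"
    config_path.write_text(
        "config_builder = {'columns': ['product_category']}\n"
        "\n"
        "def load_config_builder():\n"
        "    return config_builder\n"
    )
    builder = load_builder_from_file(str(config_path))

print(builder)
```

Because the file's top-level code runs on import, everything the builder needs (model configs, columns) should be defined at module level or inside load_config_builder() itself.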

Preview locally:

$ nemo data-designer preview run product_reviews.py --num-records 5

Generate a larger dataset locally:

$ nemo data-designer create run product_reviews.py --num-records 30

This workload runs in the local CLI process, but because the configuration references default/nvidia-build, it still communicates with the Inference Gateway API.

NeMo Services CLI Execution

Submit the same configuration to NeMo Services when you want service-managed execution:

$ nemo data-designer preview submit product_reviews.py --workspace default --num-records 5
$ nemo data-designer create submit product_reviews.py --workspace default --profile default --num-records 30

SDK Data Designer API Execution

The DataDesignerResource is your SDK interface for Data Designer API execution. You can access it from an existing SDK instance:

import os
from nemo_platform import NeMoPlatform

base_url = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
client = NeMoPlatform(base_url=base_url, workspace="default")

data_designer = client.data_designer

Previewing the Dataset

Use the preview method for API-backed rapid iteration. Generate a small sample, inspect the results, adjust your configuration, and repeat:

preview = data_designer.preview(config_builder)

# Display a random sample record
preview.display_sample_record()

# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())

# View statistical analysis
preview.analysis.to_report()

More about preview results

The PreviewResults object returned by client.data_designer.preview holds all of its fields in memory; nothing is persisted to disk by default. To keep preview data, save it yourself with standard Python methods. For example, the dataset is a regular pandas DataFrame, so you can write it to disk with methods like to_csv or to_parquet.

Iterate: Adjust column configurations, prompts, or parameters in your config_builder, then run preview again until you’re satisfied with the results.

Scaling Up with Jobs

When you’re happy with the preview, create a larger service-managed generation job:

# Defaulting to 30 for demo speed purposes. Happy with the output? Scale it up!
job = data_designer.create(config_builder, num_records=30)

# Block until the job completes
job.wait_until_done()

# Download the generated artifacts
results = job.download_artifacts()

# Load the dataset as a pandas DataFrame
dataset = results.load_dataset()
print(dataset.head())

# Load the full analysis report
analysis = results.load_analysis()
analysis.to_report()

More about job results

The Data Designer library writes several artifacts to disk when running a full generation job, including the final dataset in Parquet format. When a Data Designer job runs through NeMo Services, the entire working directory of artifacts produced by the library is saved as a job result. The download_artifacts method downloads this artifacts directory (stored as a .tar.gz archive), extracts it, and returns a DataDesignerJobResults object that you can use to load results into memory as DataFrames or other objects for programmatic inspection.

By default, download_artifacts saves the artifacts to a relative local directory named after the job. An alternative path can be passed to download_artifacts.
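The download-and-extract step described above can be pictured with the standard-library sketch below. It only mimics the general shape of unpacking a .tar.gz archive into a directory named after a job; it is not the SDK's implementation. The extract_archive helper, the archive name, the "my-job" directory, and the file name inside the archive are all hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path: Path, dest_dir: Path) -> list[str]:
    """Unpack a .tar.gz archive and return the relative paths of the extracted files."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)  # acceptable here because we built the archive ourselves
    return sorted(
        p.relative_to(dest_dir).as_posix() for p in dest_dir.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a stand-in archive with one placeholder file (hypothetical layout).
    archive_path = tmp_path / "job-artifacts.tar.gz"
    payload = b"placeholder bytes, not a real Parquet file"
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo(name="artifacts/dataset.parquet")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    # Extract into a directory named after a hypothetical job.
    extracted = extract_archive(archive_path, tmp_path / "my-job")

print(extracted)
```

In the real workflow you do not need to do any of this by hand; download_artifacts handles the archive and the returned results object loads the data for you.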

What Happens Under the Hood

When you use CLI run:

Local Execution: The Data Designer workload runs in the CLI process.
Resource Resolution: The workload can use local resources, NeMo resources, or both.
Generation: Data Designer resolves dependencies and generates records in the local environment.

When you use CLI submit or the SDK today:

Configuration Validation: The service validates your configuration and resolves column dependencies.
NeMo Services Execution: Preview runs through the Data Designer API; create runs as a service-managed job.
Inference Routing: LLM calls are routed through Inference Gateway to your configured model providers.
Artifact Storage: Job datasets and analysis reports are stored in job artifact storage.
Job Completion: You can monitor job status and load results when complete.
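The "resolves column dependencies" step can be approximated with the sketch below, which reads the {{ ... }} placeholders in each prompt to find the columns it references and then orders the columns so that each one is generated after its inputs. This is a simplified illustration using a regular expression and the standard-library graphlib module, not the service's actual resolver; the shortened prompts and the referenced_columns helper are assumptions.

```python
import re
from graphlib import TopologicalSorter

# Matches the column name at the start of a placeholder, so
# "{{ customer.first_name }}" yields "customer".
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def referenced_columns(prompt: str) -> set[str]:
    """Return the set of column names referenced by a prompt template."""
    return set(PLACEHOLDER.findall(prompt))


# Shortened versions of the tutorial's prompts (illustrative only).
prompts = {
    "product_name": (
        "A product in '{{ product_category }}' related to "
        "'{{ product_subcategory }}' for ages {{ target_age_range }}."
    ),
    "customer_review": (
        "You are {{ customer.first_name }}, {{ customer.age }} years old. "
        "Review {{ product_name }} with {{ number_of_stars }} stars "
        "in a '{{ review_style }}' style."
    ),
}
sampler_columns = [
    "product_category",
    "product_subcategory",
    "target_age_range",
    "customer",
    "number_of_stars",
    "review_style",
]

# Map each column to the set of columns it depends on.
graph = {name: set() for name in sampler_columns}
graph["product_subcategory"] = {"product_category"}  # conditional sampler
for name, prompt in prompts.items():
    graph[name] = referenced_columns(prompt)

order = list(TopologicalSorter(graph).static_order())
print(order)
```

Any valid order places product_category before product_subcategory and product_name before customer_review, which is why the review prompt can safely reference the generated product name.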

Next Steps

Seed data: Learn how to use external datasets in the seeding tutorial
Execution modes: Learn more about local and NeMo Services execution in Execution Modes
Column types: Explore all available column types in the library documentation
Advanced features: Learn about processors and validation