The Basics

This tutorial demonstrates the fundamentals of Data Designer by generating a product review dataset.

For more detail about column behavior, see the open-source library’s version of this tutorial.

Prerequisites

Ensure you have completed the tutorial prerequisites. This tutorial uses an Inference Gateway provider, so both local CLI runs and NeMo Services execution need access to the Inference Gateway API in a running NeMo Services cluster.

Part 1: Build the Configuration

Use the data_designer.config package to define your dataset schema. This configuration code is the same across the plugin execution modes.

Build the configuration once, then choose whether to execute with CLI run, CLI submit, or the SDK.

Define Models

Start by defining the models you want to use:

import data_designer.config as dd

MODEL_ALIAS = "text"

model_configs = [
    dd.ModelConfig(
        provider="default/nvidia-build",
        model="nvidia/nemotron-3-nano-30b-a3b",  # Use the `served_model_name` from the provider
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs)

Add Columns

Define the columns for your dataset. The library documentation explains these column types in detail.

# Product category sampler
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

# Product subcategory sampler (conditional on category)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_subcategory",
        sampler_type=dd.SamplerType.SUBCATEGORY,
        params=dd.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

# Target age range
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="target_age_range",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["18-25", "25-35", "35-50", "50-65", "65+"]
        ),
    )
)

# Customer details using Faker
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(age_range=[18, 70], locale="en_US"),
    )
)

# Star rating
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="number_of_stars",
        sampler_type=dd.SamplerType.UNIFORM,
        params=dd.UniformSamplerParams(low=1, high=5),
        convert_to="int",  # Convert the sampled float to an integer
    )
)

# Review style
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="review_style",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

# LLM-generated product name
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="product_name",
        prompt=(
            "You are a helpful assistant that generates product names. DO NOT add quotes around the product name.\n\n"
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)

# LLM-generated customer review
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
            "Respond with only the review, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)
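To build intuition for the sampler columns above, here is a minimal standard-library sketch of weighted category sampling and category-conditioned subcategory sampling. It is not Data Designer's implementation: it assumes the weights are relative (so [1, 2, 2, 1] normalizes to 1/6, 2/6, 2/6, 1/6), it uses a trimmed copy of the subcategory mapping, and sample_record is a hypothetical helper.

```python
import random

# Values and weights copied from the review_style column in this tutorial.
styles = ["rambling", "brief", "detailed", "structured with bullet points"]
weights = [1, 2, 2, 1]

# Assumption: weights are relative, so they normalize to probabilities.
total = sum(weights)
probabilities = [w / total for w in weights]

# Trimmed copy of the subcategory mapping, keyed by parent category.
subcategories = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Books": ["Fiction", "Non-Fiction", "Classics"],
}


def sample_record(rng: random.Random) -> dict:
    """Draw one record: category first, then a subcategory conditioned on it."""
    category = rng.choice(list(subcategories))
    return {
        "product_category": category,
        "product_subcategory": rng.choice(subcategories[category]),
        "review_style": rng.choices(styles, weights=weights, k=1)[0],
    }


rng = random.Random(0)  # fixed seed so repeated runs draw the same records
records = [sample_record(rng) for _ in range(5)]
for record in records:
    print(record)
```

Because the subcategory is drawn only from the list that belongs to the sampled category, a record can never pair, say, "Books" with "Laptops".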

Part 2: Execute

Now execute your configuration. You can run locally through the CLI, submit to NeMo Services, or call the Data Designer API from the SDK.

Local CLI Execution

Save the configuration in a Python file such as product_reviews.py and expose a load_config_builder() function that returns the config_builder.

def load_config_builder() -> dd.DataDesignerConfigBuilder:
    return config_builder
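Before invoking the CLI, it can help to confirm that the file actually exposes the expected entry point. The sketch below imports a Python file by path and looks up load_config_builder with the standard library. It is only a sanity check and makes no claim about how the nemo CLI loads the file internally; load_entry_point is a hypothetical helper, and the demo file is a stand-in so the example runs without Data Designer installed.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_entry_point(path: Path, func_name: str = "load_config_builder"):
    """Import a Python file by path and return the named callable from it."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    func = getattr(module, func_name, None)
    if not callable(func):
        raise AttributeError(f"{path.name} does not define {func_name}()")
    return func


# Stand-in file for the demo; a real product_reviews.py would return
# a DataDesignerConfigBuilder instead of a string.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "product_reviews.py"
    demo.write_text(
        "def load_config_builder():\n    return 'config-builder-stand-in'\n"
    )
    result = load_entry_point(demo)()
    print(result)
```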

Preview locally:

$ nemo data-designer preview run product_reviews.py --num-records 5

Generate a larger dataset locally:

$ nemo data-designer create run product_reviews.py --num-records 30

This workload runs in the local CLI process, but because the configuration references default/nvidia-build, it still communicates with the Inference Gateway API.

NeMo Services CLI Execution

Submit the same configuration to NeMo Services when you want service-managed execution:

$ nemo data-designer preview submit product_reviews.py --workspace default --num-records 5
$ nemo data-designer create submit product_reviews.py --workspace default --profile default --num-records 30

SDK Data Designer API Execution

The DataDesignerResource is your SDK interface for Data Designer API execution. You can access it from an existing SDK instance:

import os
from nemo_platform import NeMoPlatform

base_url = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
client = NeMoPlatform(base_url=base_url, workspace="default")

data_designer = client.data_designer

Previewing the Dataset

Use the preview method for API-backed rapid iteration. Generate a small sample, inspect the results, adjust your configuration, and repeat:

preview = data_designer.preview(config_builder)

# Display a random sample record
preview.display_sample_record()

# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())

# View statistical analysis
preview.analysis.to_report()

The PreviewResults object returned by client.data_designer.preview holds all of its fields in memory; nothing is persisted to disk by default. To keep preview data, save it with standard Python methods. For example, the dataset is a regular pandas DataFrame and can be written to disk with methods such as to_csv or to_parquet.

Iterate: Adjust column configurations, prompts, or parameters in your config_builder, then run preview again until you’re satisfied with the results.

Scaling Up with Jobs

When you’re happy with the preview, create a larger service-managed generation job:

# Defaulting to 30 for demo speed purposes. Happy with the output? Scale it up!
job = data_designer.create(config_builder, num_records=30)

# Block until the job completes
job.wait_until_done()

# Download the generated artifacts
results = job.download_artifacts()

# Load the dataset as a pandas DataFrame
dataset = results.load_dataset()
print(dataset.head())

# Load the full analysis report
analysis = results.load_analysis()
analysis.to_report()

The Data Designer library writes several artifacts to disk when running a full generation job, including the final dataset in Parquet format. When a Data Designer job runs through NeMo Services, the entire working directory of artifacts produced by the library is saved as a job result. The download_artifacts method downloads this artifacts directory (stored as a .tar.gz archive), unarchives it, and returns a DataDesignerJobResults object that can be used to load results into memory as DataFrames or other objects for programmatic inspection.

By default, download_artifacts saves the artifacts to a relative local directory named after the job. An alternative path can be passed to download_artifacts.
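The unarchive step described above can be pictured with the standard-library tarfile module. This is an illustrative sketch, not the SDK's code: unpack_artifacts is a hypothetical helper, and the archive contents and file names are invented placeholders rather than the real artifact layout.

```python
import tarfile
import tempfile
from pathlib import Path


def unpack_artifacts(archive: Path, target_dir: Path) -> list[str]:
    """Extract a .tar.gz archive into target_dir and list the extracted files."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return sorted(
        p.relative_to(target_dir).as_posix()
        for p in target_dir.rglob("*")
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Build a tiny stand-in archive (hypothetical names, placeholder bytes).
    source = root / "job-artifacts"
    source.mkdir()
    (source / "dataset.parquet").write_bytes(b"placeholder")
    archive = root / "artifacts.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="job-artifacts")

    # Extract into a directory named after a made-up job.
    extracted = unpack_artifacts(archive, root / "my-job")
    print(extracted)
```

Only extract archives from sources you trust; recent Python versions also accept a filter argument to extractall for stricter handling.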

What Happens Under the Hood

When you use CLI run:

  1. Local Execution: The Data Designer workload runs in the CLI process.
  2. Resource Resolution: The workload can use local resources, NeMo resources, or both.
  3. Generation: Data Designer resolves dependencies and generates records in the local environment.

When you use CLI submit or the SDK today:

  1. Configuration Validation: The service validates your configuration and resolves column dependencies.
  2. NeMo Services Execution: Preview runs through the Data Designer API; create runs as a service-managed job.
  3. Inference Routing: LLM calls are routed through Inference Gateway to your configured model providers.
  4. Artifact Storage: Job datasets and analysis reports are stored in job artifact storage.
  5. Job Completion: You can monitor job status and load results when complete.
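To make "resolves column dependencies" concrete, the sketch below derives a generation order from the {{ ... }} references in abbreviated versions of this tutorial's prompts. It is an assumption about the general idea (a topological sort over column references), not a description of the service's actual algorithm; the regular expression and the dependencies mapping are illustrative.

```python
import re
from graphlib import TopologicalSorter

# Captures the top-level name in each reference,
# e.g. "customer" in {{ customer.age }}.
REFERENCE = re.compile(r"\{\{\s*([A-Za-z_]\w*)")

# Abbreviated prompts from this tutorial; sampler columns have no prompt.
prompts = {
    "product_category": "",
    "product_subcategory": "",
    "customer": "",
    "number_of_stars": "",
    "product_name": (
        "A product in '{{ product_category }}' "
        "related to '{{ product_subcategory }}'."
    ),
    "customer_review": (
        "You are {{ customer.first_name }}, {{ customer.age }} years old. "
        "Review {{ product_name }} with {{ number_of_stars }} stars."
    ),
}

# Map each column to the set of columns its prompt references.
dependencies = {name: set(REFERENCE.findall(text)) for name, text in prompts.items()}

# The subcategory sampler depends on its parent via the `category` parameter.
dependencies["product_subcategory"].add("product_category")

# static_order() yields each column only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any order that places product_category before product_subcategory, and both before product_name and customer_review, is valid; independent sampler columns may appear in any relative position.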

Next Steps