The Basics
This tutorial demonstrates the fundamentals of Data Designer by generating a product review dataset.
For more detail about column behavior, see the open-source library’s version of this tutorial.
Prerequisites
Ensure you have completed the tutorial prerequisites. This tutorial uses an Inference Gateway provider, so both local CLI runs and NeMo Services execution need access to the Inference Gateway API in a running NeMo Services cluster.
Part 1: Build the Configuration
Use the data_designer.config package to define your dataset schema. This configuration code is the same across the plugin execution modes.
Build the configuration once, then choose whether to execute with CLI run, CLI submit, or the SDK.
Define Models
Start by defining the models you want to use:
Add Columns
Define the columns for your dataset. The library documentation explains these column types in detail.
Part 2: Execute
Now execute your configuration. You can run locally through the CLI, submit to NeMo Services, or call the Data Designer API from the SDK.
Local CLI Execution
Save the configuration in a Python file such as product_reviews.py and expose a load_config_builder() function that returns the config_builder.
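To make the file contract concrete, here is a minimal sketch of how a tool could import a configuration file by path and call its load_config_builder() function. This is not the CLI's actual loader: the helper name load_builder_from_file and the placeholder dictionary are hypothetical, and a real product_reviews.py would return the config_builder built in Part 1.

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for product_reviews.py. A real file would build and return the
# Data Designer config_builder from Part 1; a plain dict is used here so the
# sketch runs without the library installed.
EXAMPLE_SOURCE = '''
def load_config_builder():
    config_builder = {"columns": ["product_name", "review"]}  # placeholder object
    return config_builder
'''


def load_builder_from_file(path):
    """Import a Python file by path and return the result of load_config_builder()."""
    spec = importlib.util.spec_from_file_location("user_config", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, "load_config_builder"):
        raise AttributeError(f"{path} does not define load_config_builder()")
    return module.load_config_builder()


def demo():
    """Write the stand-in file to a temporary directory and load it."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "product_reviews.py"
        config_path.write_text(EXAMPLE_SOURCE)
        return load_builder_from_file(config_path)
```

The point of the convention is that the file exposes a function rather than a module-level variable, so whatever loads it decides when the builder is constructed.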
Preview locally:
Generate a larger dataset locally:
This workload runs in the local CLI process, but because the configuration references default/nvidia-build, it still communicates with the Inference Gateway API.
NeMo Services CLI Execution
Submit the same configuration to NeMo Services when you want service-managed execution:
SDK Data Designer API Execution
The DataDesignerResource is your SDK interface for Data Designer API execution. You can access it from an existing SDK instance:
Previewing the Dataset
Use the preview method for API-backed rapid iteration. Generate a small sample, inspect the results, adjust your configuration, and repeat:
More about preview results
The PreviewResults object returned by client.data_designer.preview stores all its fields in memory; nothing is persisted to disk by default.
To keep preview data, save it yourself with standard Python methods.
For example, the dataset is a regular pandas DataFrame, so you can write it to disk with methods such as to_csv or to_parquet.
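As a rough illustration of saving a preview, the sketch below writes a DataFrame to CSV and, when a parquet engine is installed, to parquet as well. The preview_df DataFrame, the save_preview helper, and the file names are made up for this example; how you obtain the DataFrame from PreviewResults depends on the SDK and is not shown.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the DataFrame held by a PreviewResults object; the column
# names and values are invented for illustration.
preview_df = pd.DataFrame(
    {
        "product_name": ["Trail Mug", "Desk Lamp"],
        "rating": [5, 3],
    }
)


def save_preview(df, out_dir):
    """Write the DataFrame as CSV, and as parquet when an engine is installed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []

    csv_path = out_dir / "preview.csv"
    df.to_csv(csv_path, index=False)
    written.append(csv_path)

    try:
        parquet_path = out_dir / "preview.parquet"
        df.to_parquet(parquet_path, index=False)  # needs pyarrow or fastparquet
        written.append(parquet_path)
    except ImportError:
        pass  # no parquet engine available; the CSV copy is still written

    return written
```

Parquet preserves column types and is usually the better choice for later analysis; CSV is convenient for a quick look in a spreadsheet or text editor.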
Iterate: Adjust column configurations, prompts, or parameters in your config_builder, then run preview again until you’re satisfied with the results.
Scaling Up with Jobs
When you’re happy with the preview, create a larger service-managed generation job:
More about job results
The Data Designer library writes several artifacts to disk when running a full generation job, including the final dataset as parquet.
When a Data Designer job runs through NeMo Services, the entire working directory of artifacts produced by the library is saved as a job result.
The download_artifacts method downloads this artifacts directory (stored as a .tar.gz archive),
unarchives it, and returns a DataDesignerJobResults object that can be used to load results into memory as DataFrames or other objects for programmatic inspection.
By default, download_artifacts saves the artifacts to a relative local directory named after the job.
To save them elsewhere, pass an alternative path to download_artifacts.
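For intuition about the unarchive step, the following sketch unpacks a .tar.gz archive into a directory named after a job using only the Python standard library. It does not call the SDK and is not how download_artifacts is implemented; the job name job-123, the archive contents, and both helper functions are hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def make_example_archive(archive_path):
    """Create a small .tar.gz with made-up file names standing in for job artifacts."""
    files = {
        "artifacts/dataset.parquet": b"placeholder bytes, not real parquet",
        "artifacts/report.json": b"{}",
    }
    with tarfile.open(archive_path, "w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def extract_artifacts(archive_path, dest_dir):
    """Unpack the archive into dest_dir and return the extracted file paths."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(
        p.relative_to(dest_dir).as_posix() for p in dest_dir.rglob("*") if p.is_file()
    )


def demo():
    """Build the example archive and extract it into a directory named after the job."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "job-123.tar.gz"
        make_example_archive(archive)
        return extract_artifacts(archive, Path(tmp) / "job-123")
```

In practice you would let download_artifacts handle this and work with the returned DataDesignerJobResults object; the sketch only shows what ends up on disk conceptually.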
What Happens Under the Hood
When you use CLI run:
- Local Execution: The Data Designer workload runs in the CLI process.
- Resource Resolution: The workload can use local resources, NeMo resources, or both.
- Generation: Data Designer resolves dependencies and generates records in the local environment.
When you use CLI submit or the SDK today:
- Configuration Validation: The service validates your configuration and resolves column dependencies.
- NeMo Services Execution: Preview runs through the Data Designer API; create runs as a service-managed job.
- Inference Routing: LLM calls are routed through Inference Gateway to your configured model providers.
- Artifact Storage: Job datasets and analysis reports are stored in job artifact storage.
- Job Completion: You can monitor job status and load results when complete.
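Both modes mention resolving column dependencies. One common way to do that is a topological sort: a column that references other columns is generated only after them. The sketch below illustrates the idea with Python's graphlib; it is not Data Designer's implementation, and the column names, the {{ name }} reference syntax, and the generation_order helper are assumptions made for the example.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical columns: each maps to a prompt template, or None for columns
# that reference nothing. The {{ name }} syntax is assumed for illustration.
columns = {
    "category": None,
    "product_name": "Name a product in the {{ category }} category.",
    "rating": None,
    "review": "Write a {{ rating }}-star review of {{ product_name }}.",
}

REFERENCE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(column_templates):
    """Return column names ordered so every column follows the ones it references."""
    graph = {
        name: set(REFERENCE.findall(template or ""))
        for name, template in column_templates.items()
    }
    referenced = set().union(*graph.values())
    unknown = referenced - set(graph)
    if unknown:
        raise ValueError(f"Unknown column references: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if columns reference each other in a loop.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)
```

Here review can only be generated after rating and product_name, and product_name only after category. Validating references up front, as the service is described as doing, surfaces a misspelled column name before any LLM calls are made.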
Next Steps
- Seed data: Learn how to use external datasets in the seeding tutorial
- Execution modes: Learn more about local and NeMo Services execution in Execution Modes
- Column types: Explore all available column types in the library documentation
- Advanced features: Learn about processors and validation