Seeding with External Datasets

This tutorial demonstrates how to use external datasets as seed data for synthetic data generation in Data Designer.

For more detail about seed dataset behavior, see the open-source library’s version of this tutorial.

Seed Sources by Execution Mode

Seed source support depends on where the workload executes:

Seed source	CLI `run`	CLI `submit` / SDK today	Use case
Local files or DataFrames	Supported	Not supported	Fast local iteration with files available to the CLI process.
HuggingFace	Supported	Supported	Publicly available datasets or private HuggingFace datasets.
Files API Filesets	Supported when NeMo Services access is configured	Supported	Shared seed data stored through the Files API.

run versus submit controls where the workload executes. A local run can still read Files API Filesets if the configuration references them and NeMo Services access is configured.

HuggingFace Datasets

Use HuggingFaceSeedSource to load data from HuggingFace:

import data_designer.config as dd

# Public dataset
dd.HuggingFaceSeedSource(path="datasets/username/dataset/data/*.parquet")

# Private dataset (requires token)
dd.HuggingFaceSeedSource(
    path="datasets/username/dataset/data/*.parquet",
    token="default/hf-token",  # Reference to a Secrets API secret
)

Files API Filesets

Use FilesetFileSeedSource to load data through the Files API. This works in CLI run, CLI submit, and SDK execution when NeMo Services access is configured:

from data_designer_nemo.fileset_file_seed_source import FilesetFileSeedSource

FilesetFileSeedSource(
    path="default/my-fileset#data.parquet"  # Format: workspace/fileset#file-path
)

Path format:

Fully qualified: workspace/fileset-name#file-path (recommended)
Implicit workspace: fileset-name#file-path (uses client’s workspace)
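
The two path forms can be told apart mechanically. The sketch below is only an illustration of the convention described above: `parse_fileset_path` and `FilesetPath` are hypothetical names, not part of Data Designer or the NeMo SDK, and the code assumes the workspace is everything before the last `/` that precedes the `#`.

```python
from typing import NamedTuple, Optional


class FilesetPath(NamedTuple):
    workspace: Optional[str]  # None means "fall back to the client's workspace"
    fileset: str
    file_path: str


def parse_fileset_path(path: str) -> FilesetPath:
    """Split '[workspace/]fileset-name#file-path' into its parts (hypothetical helper)."""
    location, sep, file_path = path.partition("#")
    if not sep or not location or not file_path:
        raise ValueError(f"expected '[workspace/]fileset-name#file-path', got {path!r}")
    workspace, slash, fileset = location.rpartition("/")
    if not fileset or (slash and not workspace):
        raise ValueError(f"empty workspace or fileset name in {path!r}")
    return FilesetPath(workspace or None, fileset, file_path)


qualified = parse_fileset_path("default/my-fileset#data.parquet")
implicit = parse_fileset_path("my-fileset#data.parquet")
```

Here `qualified.workspace` is `"default"`, while `implicit.workspace` is `None`, which is the case where the client's workspace would apply.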

Prerequisites

Ensure you have completed the tutorial prerequisites. This tutorial uses an Inference Gateway provider, so local CLI run and NeMo Services execution both need access to the Inference Gateway API in a running NeMo Services cluster.

Example: Medical Notes from Symptom Data

This example generates realistic patient medical notes by seeding with publicly available symptom-to-diagnosis data. It uploads the seed data to a Files API Fileset so the same configuration can run locally through CLI run or through NeMo Services execution.

Step 1: Upload Seed Data

Upload the symptom-to-diagnosis dataset to a Files API Fileset:

import os
import tempfile
import urllib.request
from nemo_platform import NeMoPlatform

WORKSPACE = "default"
FILESET_NAME = "seed-data"
FILE_PATH = "symptom_to_diagnosis.csv"

base_url = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
client = NeMoPlatform(base_url=base_url, workspace=WORKSPACE)

# Create fileset
client.files.filesets.create(name=FILESET_NAME)

# Download and upload seed data
with tempfile.NamedTemporaryFile(suffix=".csv") as tmpfile:
    url = "https://raw.githubusercontent.com/NVIDIA/GenerativeAIExamples/refs/heads/main/nemo/NeMo-Data-Designer/data/gretelai_symptom_to_diagnosis.csv"
    urllib.request.urlretrieve(url, tmpfile.name)

    client.files.upload(
        fileset=FILESET_NAME,
        local_path=tmpfile.name,
        remote_path=FILE_PATH,
    )
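
Later steps reference the seed columns diagnosis and patient_summary by name, so it can help to confirm the CSV header before uploading. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the sample row is made up to show the expected shape rather than copied from the real dataset.

```python
import csv
import io

REQUIRED_COLUMNS = {"diagnosis", "patient_summary"}


def missing_columns(csv_text: str, required: set) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return set(required) - {name.strip() for name in header}


# Made-up sample in the expected shape, not rows from the real dataset.
good_csv = 'diagnosis,patient_summary\ncommon cold,"Runny nose, sore throat."\n'
bad_csv = "label,text\ncommon cold,Runny nose\n"

ok = missing_columns(good_csv, REQUIRED_COLUMNS)      # empty set: safe to upload
problem = missing_columns(bad_csv, REQUIRED_COLUMNS)  # both required names missing
```

In the upload script, the same check could be applied to the contents of the downloaded temporary file before the upload call.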

Step 2: Build Configuration

Define your models and create a config builder:

import data_designer.config as dd

MODEL_ALIAS = "text"

model_configs = [
    dd.ModelConfig(
        provider="default/nvidia-build",
        model="nvidia/nemotron-3-nano-30b-a3b",  # Use the `served_model_name` from the provider
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs)

Step 3: Configure Seed Dataset

Add the seed data to your configuration:

from data_designer_nemo.fileset_file_seed_source import FilesetFileSeedSource

config_builder.with_seed_dataset(
    FilesetFileSeedSource(path=f"{WORKSPACE}/{FILESET_NAME}#{FILE_PATH}")
)

What this does: The seed dataset’s columns (diagnosis, patient_summary, etc.) are automatically added to your dataset and available for use in other columns.

Step 4: Add Synthetic Columns

Add columns that reference and extend the seed data:

# Patient details
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_sampler",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

# Doctor details
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="doctor_sampler",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

# Patient ID
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_id",
        sampler_type=dd.SamplerType.UUID,
        params=dd.UUIDSamplerParams(
            prefix="PT-",
            short_form=True,
            uppercase=True,
        ),
    )
)

# Extract patient name from sampler
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="first_name",
        expr="{{ patient_sampler.first_name }}",
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="last_name",
        expr="{{ patient_sampler.last_name }}",
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="dob",
        expr="{{ patient_sampler.birth_date }}",
    )
)

# Symptom onset date
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="symptom_onset_date",
        sampler_type=dd.SamplerType.DATETIME,
        params=dd.DatetimeSamplerParams(start="2024-01-01", end="2024-12-31"),
    )
)

# Visit date (1-30 days after symptom onset)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="date_of_visit",
        sampler_type=dd.SamplerType.TIMEDELTA,
        params=dd.TimeDeltaSamplerParams(
            dt_min=1,
            dt_max=30,
            reference_column_name="symptom_onset_date",
        ),
    )
)

# Physician name
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="physician",
        expr="Dr. {{ doctor_sampler.last_name }}",
    )
)

# LLM-generated physician notes
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},
who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.
The date of today's visit is {{ date_of_visit }}.

{{ patient_summary }}

Write careful notes about your visit with {{ first_name }},
as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.

Format the notes as a busy doctor might.
Respond with only the notes, no other text.
""",
        model_alias=MODEL_ALIAS,
    )
)
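
The relationship between symptom_onset_date and date_of_visit can be pictured with the standard library alone. This is a rough sketch, not the sampler's implementation: `sample_dates` is a hypothetical helper, the day unit follows the comment in the configuration above, and treating both bounds as inclusive is an assumption.

```python
import random
from datetime import date, timedelta

START, END = date(2024, 1, 1), date(2024, 12, 31)
DT_MIN, DT_MAX = 1, 30  # days after onset; inclusive bounds are an assumption


def sample_dates(rng: random.Random):
    """Draw an onset date in [START, END], then a visit date offset from it."""
    onset = START + timedelta(days=rng.randint(0, (END - START).days))
    visit = onset + timedelta(days=rng.randint(DT_MIN, DT_MAX))
    return onset, visit


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
pairs = [sample_dates(rng) for _ in range(5)]
```

Because the visit date is derived from the onset date, it always falls after it, which is the dependency that reference_column_name expresses in the configuration.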

Note: The diagnosis and patient_summary variables come from the seed dataset columns.
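
To make the variable resolution concrete, the sketch below fills `{{ ... }}` placeholders from a single record using a toy, regex-based substitute. It is an illustration only: `render` is a hypothetical helper, not Data Designer's template engine, and the record values are made up.

```python
import re
from typing import Any, Mapping

# Matches "{{ name }}" and dotted references such as "{{ doctor_sampler.last_name }}".
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, record: Mapping[str, Any]) -> str:
    """Replace each placeholder with the matching value from the record."""

    def lookup(match):
        value: Any = record
        for key in match.group(1).split("."):
            value = value[key]  # KeyError signals an undefined column or field
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


record = {
    "diagnosis": "common cold",                 # comes from the seed dataset
    "first_name": "Ana",                        # comes from an expression column
    "doctor_sampler": {"last_name": "Okafor"},  # comes from a sampler column
}
text = render(
    "{{ first_name }} has {{ diagnosis }}; seen by Dr. {{ doctor_sampler.last_name }}.",
    record,
)
```

From the template's point of view, seed columns and synthetic columns are referenced in exactly the same way, which is why diagnosis can appear in the prompt without any extra declaration.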

Step 5: Execute

Because this example uses a Files API Fileset and an Inference Gateway provider, even local CLI execution communicates with NeMo Services APIs.

For CLI execution, save the configuration in medical_notes.py and expose a load_config_builder() function that returns the config_builder.

def load_config_builder() -> dd.DataDesignerConfigBuilder:
    return config_builder

Preview locally:

$ nemo data-designer preview run medical_notes.py --num-records 5

Generate a larger dataset locally:

$ nemo data-designer create run medical_notes.py --num-records 30

Submit to NeMo Services:

$ nemo data-designer preview submit medical_notes.py --workspace default --num-records 5
$ nemo data-designer create submit medical_notes.py --workspace default --profile default --num-records 30

You can also execute through the SDK service path.

Create a client:

# Using the client instance from Step 1
data_designer = client.data_designer

Previewing the Dataset

Use the preview method for rapid iteration:

preview = data_designer.preview(config_builder)

# Display a random sample record
preview.display_sample_record()

# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())

# View statistical analysis
preview.analysis.to_report()

More about preview results

The PreviewResults object returned by client.data_designer.preview stores all its fields in memory; nothing is persisted to disk by default. Use standard Python methods to save any preview data you want to keep. For example, the dataset is a regular pandas DataFrame and can be saved to disk with methods such as to_csv or to_parquet.
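
As a minimal sketch of saving a preview, the example below writes a DataFrame to CSV and reads it back. The DataFrame is a stand-in for preview.dataset with made-up values, and the output location is a temporary directory chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for `preview.dataset`; the values are made up for illustration.
df = pd.DataFrame(
    {
        "patient_id": ["PT-1A2B3C4D", "PT-5E6F7A8B"],
        "diagnosis": ["common cold", "migraine"],
    }
)

out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "preview.csv"
df.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file

# df.to_parquet(out_dir / "preview.parquet")  # requires a parquet engine such as pyarrow

restored = pd.read_csv(csv_path)
```

Parquet preserves column types more faithfully than CSV, so it is usually the better choice when a parquet engine is installed.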

Generating the Full Dataset

When you’re satisfied with the preview, submit a larger generation job:

# Defaulting to 30 for demo speed purposes. Happy with the output? Scale it up!
job = data_designer.create(config_builder, num_records=30)

# Block until the job completes
job.wait_until_done()

# Download the generated artifacts
results = job.download_artifacts()

# Load the dataset as a pandas DataFrame
dataset = results.load_dataset()
print(dataset.head())

# Load the full analysis report
analysis = results.load_analysis()
analysis.to_report()

More about job results

The Data Designer library writes several artifacts to disk when running a full generation job, including the final dataset as parquet. When a Data Designer job runs through NeMo Services, the entire working directory of artifacts produced by the library is saved as a job result. The download_artifacts method downloads this artifacts directory (stored as a .tar.gz archive), unarchives it, and returns a DataDesignerJobResults object that can be used to load results into memory as DataFrames or other objects for programmatic inspection.

By default, download_artifacts saves the artifacts to a relative local directory named after the job. An alternative path can be passed to download_artifacts.
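
The unpack step can be pictured with the standard library's tarfile module. The sketch below builds a small stand-in archive locally and extracts it into a directory named after a job. It is not the SDK's implementation, and the file names, directory layout, and job name are all hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path

work = Path(tempfile.mkdtemp())

# Build a stand-in artifacts directory; layout and file names are hypothetical.
artifacts = work / "artifacts"
artifacts.mkdir()
(artifacts / "dataset.parquet").write_bytes(b"placeholder bytes, not real parquet")
(artifacts / "analysis.json").write_text("{}")

archive = work / "artifacts.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(artifacts, arcname="artifacts")

# Extract into a directory named after the job (the job name is made up).
target = work / "my-job"
target.mkdir()
# Use the safer "data" extraction filter where this Python version supports it.
extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(target, **extract_kwargs)

extracted = sorted(
    p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
)
```

After extraction, `extracted` lists the files relative to the job directory, which is the kind of tree the results object would then read from.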

How Seeding Works

When you configure a seed dataset:

Automatic Column Addition: All columns from the seed data are automatically added to your dataset schema
Dependency Resolution: Data Designer resolves dependencies between seed columns and synthetic columns
Execution Order: Seed data is loaded first, then synthetic columns are generated row-by-row
Row Alignment: Each generated row corresponds to one row from the seed dataset

Example: If your seed data has 100 rows with columns diagnosis and patient_summary, and you request 100 records, each generated record will include the seed columns plus any synthetic columns you defined.
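
The row alignment described above can be sketched in plain Python. This is an illustration, not the library's implementation: `generate` is a hypothetical function, the seed rows are made up, the synthetic columns are simple stand-ins for samplers and LLM columns, and the sketch only covers the case where the requested record count does not exceed the number of seed rows.

```python
import uuid

# Made-up seed rows with the two columns used in this tutorial.
seed_rows = [
    {"diagnosis": "common cold", "patient_summary": "Runny nose and sore throat."},
    {"diagnosis": "migraine", "patient_summary": "Throbbing headache with nausea."},
]


def generate(seed_rows, num_records):
    """Produce one record per seed row: seed columns first, then synthetic ones."""
    if num_records > len(seed_rows):
        raise ValueError("this sketch only covers num_records <= number of seed rows")
    records = []
    for seed in seed_rows[:num_records]:
        record = dict(seed)  # seed columns are carried into the record unchanged
        # Synthetic columns are added to the same row and may read seed columns.
        record["patient_id"] = "PT-" + uuid.uuid4().hex[:8].upper()
        record["note_stub"] = f"Visit for {record['diagnosis']}"
        records.append(record)
    return records


records = generate(seed_rows, num_records=2)
```

Each resulting record contains diagnosis and patient_summary from its seed row alongside the two stand-in synthetic columns.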

Next Steps

Execution modes: Learn more about local and NeMo Services execution in Execution Modes
Column types: Explore all available column types in the library documentation
Processors: Transform your data with processors in the library documentation