Manage Files

NeMo Platform provides a file storage interface through the Files service. The Files service supports multiple storage backends and can be used to store datasets for training, evaluation results, model artifacts, and other files.

Concepts

  • Fileset: A named container that holds files. Filesets are uniquely identified by a name within a given workspace.

  • Storage Backend: Each fileset is backed by a storage backend where the files are actually persisted. Supported backends include:
    • local: Local filesystem storage (default, read/write)
    • s3: Amazon S3 or S3-compatible storage such as MinIO (read/write)
    • ngc: NVIDIA GPU Cloud storage (read-only)
    • huggingface: HuggingFace Hub repositories (read-only)

Read-only backends allow you to create a fileset that acts as a handle to external resources. This provides a unified interface to access files from different sources using the same SDK methods, and allows other platform services to reference external data through a fileset.

  • Purpose: A fileset field that indicates the intended use. Each purpose enables specific metadata fields under the corresponding key.

    Use purpose="generic" (default) for files that don’t fit the dataset or model categories. The generic purpose has no purpose-specific metadata fields.

    For model filesets, the purpose-specific metadata fields are merged into the model entity spec by the model-spec background task.

  • Custom Fields: Arbitrary key-value data attached to a fileset via custom_fields for user-defined metadata.
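The access modes of the storage backends listed above can be captured in a small lookup table, which is handy for checking whether a fileset accepts uploads before you attempt one. This is only a local sketch built from the list on this page; the helper name is_writable is hypothetical and is not part of the SDK.

```python
# Access modes copied from the backend list on this page.
BACKEND_ACCESS = {
    "local": "read/write",
    "s3": "read/write",
    "ngc": "read-only",
    "huggingface": "read-only",
}


def is_writable(storage_type: str) -> bool:
    """Return True if the backend is listed as read/write (hypothetical helper)."""
    try:
        return BACKEND_ACCESS[storage_type] == "read/write"
    except KeyError:
        raise ValueError(f"unknown storage backend: {storage_type!r}") from None


for name in BACKEND_ACCESS:
    print(f"{name}: {'writable' if is_writable(name) else 'read-only'}")
```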


Managing Filesets

Fileset management operations (create, retrieve, list, delete) are available through the CLI (nemo files filesets) or the SDK (client.files.filesets).

CLI commands use the workspace from your current context by default. Use --workspace to specify a different workspace:

$ nemo files filesets list --workspace my-workspace

Creating Filesets

To create a fileset, specify a name. The fileset is created in the workspace from your current context unless you pass --workspace. You can optionally provide a description, purpose, and custom storage configuration.

$ nemo files filesets create my-files \
>   --description "Training data for model fine-tuning"
{
  "id": "fileset-TeufFfapeKBrMtpBb42zdv",
  "created_at": "2026-01-20T03:00:00",
  "custom_fields": {},
  "description": "Training data for model fine-tuning",
  "metadata": {
    "dataset": null
  },
  "name": "my-files",
  "project": "",
  "purpose": "generic",
  "storage": {
    "path": "/var/mnt/filesets/default/my-files",
    "read_chunk_size": 16777216,
    "type": "local",
    "write_buffer_size": 16777216
  },
  "updated_at": "2026-01-20T03:00:00",
  "workspace": "default"
}
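If you capture the JSON printed by the create command, you can pull out the fields you need with the standard library. The sketch below parses a trimmed copy of the response shown above. It assumes that read_chunk_size and write_buffer_size are byte counts, which this page does not state explicitly.

```python
import json

# Trimmed copy of the create response shown above.
response_text = """
{
  "id": "fileset-TeufFfapeKBrMtpBb42zdv",
  "name": "my-files",
  "purpose": "generic",
  "storage": {
    "path": "/var/mnt/filesets/default/my-files",
    "read_chunk_size": 16777216,
    "type": "local",
    "write_buffer_size": 16777216
  },
  "workspace": "default"
}
"""

fileset = json.loads(response_text)
storage = fileset["storage"]

MIB = 1024 * 1024  # assumption: the size fields are expressed in bytes
summary = {
    "name": fileset["name"],
    "workspace": fileset["workspace"],
    "storage_type": storage["type"],
    "read_chunk_mib": storage["read_chunk_size"] // MIB,
    "write_buffer_mib": storage["write_buffer_size"] // MIB,
}
print(summary)
```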

Listing Filesets

List all filesets in a given workspace:

$ nemo files filesets list
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name     ┃ workspace ┃ created_at                 ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ my-files │ default   │ 2026-01-20T03:00:00        │
└──────────┴───────────┴────────────────────────────┘

Filter filesets by purpose or storage type:

# List only dataset filesets
$ nemo files filesets list --filter.purpose dataset

# List filesets using local storage
$ nemo files filesets list --filter.storage-type local

Use pagination for large result sets:

# The "-" prefix sorts in descending order (newest first)
$ nemo files filesets list --page 1 --page-size 10 --sort "-created_at"
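To make the sort and paging semantics concrete, the following sketch reproduces them locally on a small in-memory list. It is an illustration only, not the service's implementation: it assumes pages are numbered from 1 (as the --page 1 example suggests), and the fileset names other than my-files are made up for the example.

```python
def sort_and_page(items, sort, page, page_size):
    """Sort by a field name ("-" prefix means descending), then return one page."""
    descending = sort.startswith("-")
    field = sort[1:] if descending else sort
    ordered = sorted(items, key=lambda item: item[field], reverse=descending)
    start = (page - 1) * page_size  # assumption: pages are numbered from 1
    return ordered[start:start + page_size]


# Hypothetical filesets; ISO 8601 timestamps in the same format sort chronologically as strings.
filesets = [
    {"name": "eval-results", "created_at": "2026-01-18T09:30:00"},
    {"name": "my-files", "created_at": "2026-01-20T03:00:00"},
    {"name": "base-model", "created_at": "2026-01-19T12:00:00"},
]

first_page = sort_and_page(filesets, "-created_at", page=1, page_size=2)
print([f["name"] for f in first_page])
```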

Deleting Filesets

Delete an entire fileset:

$ nemo files filesets delete my-files
✓ Deleted successfully

Deleting a fileset is permanent and cannot be undone. For local and s3 storage backends, this also deletes all underlying files.


Managing Files Within Filesets

High-level file operations are available through the CLI (nemo files) or the SDK (client.files), which provide convenient methods for uploading, downloading, and listing files.

For advanced use cases, an fsspec-compatible filesystem is available at client.files.fsspec. Refer to the fsspec documentation for additional methods.

Uploading Files

Upload files to a fileset:

# Upload a single file
$ nemo files upload ./data.jsonl my-files --remote-path training/data.jsonl

# Upload an entire directory
$ nemo files upload ./training_data/ my-files --remote-path training/
Uploading ━━━━━━━━━━━━━━━━ 100% • 3/3 files
Completed upload to my-files#training/

Upload without specifying a fileset to auto-create one:

# Auto-creates a new fileset with a generated name (fileset-<8 hex chars>)
$ nemo files upload ./data.jsonl
Uploading ━━━━━━━━━━━━━━━━ 100% • 1/1 files
Completed upload to fileset-a1b2c3d4

If fileset is omitted, a new fileset is automatically created with a unique name following the pattern fileset-<8-hex> (e.g., fileset-a1b2c3d4). The generated name is returned so you can reference it in subsequent operations.
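When scripting around auto-created filesets, it can help to recognize generated names, for example to find them later for cleanup. The sketch below checks a name against the fileset-<8-hex> pattern described above. It assumes the hex digits are lowercase, as in the example name; the helper is_auto_generated is hypothetical.

```python
import re

# Pattern from this page: "fileset-" followed by 8 hex characters.
# Assumption: hex digits are lowercase, as in the example "fileset-a1b2c3d4".
AUTO_NAME = re.compile(r"fileset-[0-9a-f]{8}")


def is_auto_generated(name: str) -> bool:
    """Return True if the whole name matches the auto-generated pattern."""
    return AUTO_NAME.fullmatch(name) is not None


for candidate in ("fileset-a1b2c3d4", "my-files", "fileset-a1b2"):
    print(candidate, is_auto_generated(candidate))
```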

Listing Files

List all files in a fileset:

$ nemo files list my-files
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ PATH                      ┃ SIZE ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ training/data.jsonl       │ 1024 │
│ training/validation.jsonl │  512 │
└───────────────────────────┴──────┘

List files under a specific directory:

$ nemo files list my-files --remote-path training/

Downloading Files

Download files to a local path:

# Download a single file
$ nemo files download my-files --remote-path training/data.jsonl -o ./data.jsonl

# Download an entire directory
$ nemo files download my-files --remote-path training/ -o ./training_data/
Downloading ━━━━━━━━━━━━━━━━ 100% • 2/2 files
Downloaded my-files#training/ to './training_data/'

Read file content into memory (SDK only):

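# `client` is an already-initialized SDK client.
content = client.files.download_content(
    fileset="my-files",
    remote_path="config.json",
)
# download_content returns bytes; decode before printing.
print(content.decode("utf-8"))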
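Because the content arrives as bytes, structured files need to be decoded and parsed before use. As a minimal sketch, the helper below turns JSONL bytes into a list of records. The parse_jsonl name and the sample records are hypothetical, and a literal bytes value stands in for the result of client.files.download_content so the example runs without a platform connection.

```python
import json


def parse_jsonl(content: bytes) -> list:
    """Decode UTF-8 bytes and parse one JSON value per non-empty line."""
    records = []
    for line_number, line in enumerate(content.decode("utf-8").splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}: {exc.msg}") from exc
    return records


# Stand-in for the bytes returned by:
#   client.files.download_content(fileset="my-files", remote_path="training/data.jsonl")
content = b'{"prompt": "2+2", "completion": "4"}\n\n{"prompt": "3+3", "completion": "6"}\n'

records = parse_jsonl(content)
print(len(records), records[0]["prompt"])
```

This reads the whole file into memory, so it suits small configuration or sample files; for large datasets, downloading to a local path is likely the better fit.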

Deleting Files

Delete files from a fileset:

$ nemo files delete my-files --remote-path training/old-data.jsonl
Deleted my-files#training/old-data.jsonl

Using Progress Callbacks

The CLI displays progress bars automatically during uploads and downloads. This section covers custom progress handling in the SDK.

Track progress during large file transfers using the RichProgressCallback context manager:

from nemo_platform.filesets import RichProgressCallback

# Upload a directory with progress bar
with RichProgressCallback(description="Uploading dataset") as callback:
    client.files.upload(
        fileset="my-files",
        local_path="./large_dataset/",
        remote_path="",
        callback=callback,
    )

# Download all files from a fileset with progress bar
with RichProgressCallback(description="Downloading dataset") as callback:
    client.files.download(
        fileset="my-files",
        remote_path="",
        local_path="./downloaded_data/",
        callback=callback,
    )

Use Cases

Using External Storage Backends

Connect to files stored in NVIDIA GPU Cloud (NGC):

# Create a secret to store your NGC API key
$ echo "$NGC_API_KEY" | nemo secrets create my-ngc-api-key --from-file -

# Create a fileset pointing to NGC storage
$ nemo files filesets create my-nemotron-personas-dataset-en_us \
>   --description "Nemotron Personas USA" \
>   --storage '{
>     "type": "ngc",
>     "org": "nvidia",
>     "team": "nemotron-personas",
>     "resource": "nemotron-personas-dataset-en_us",
>     "version": "0.0.2",
>     "api_key_secret": "my-ngc-api-key"
>   }'

Connect to a HuggingFace repository:

# Create a secret to store your HuggingFace token (needed for gated and private repos)
$ echo "$HF_TOKEN" | nemo secrets create hf_token --from-file -

# Create a fileset pointing to a HuggingFace repo
$ nemo files filesets create hf-dataset \
>   --description "Dataset from HuggingFace" \
>   --storage '{
>     "type": "huggingface",
>     "repo_id": "nvidia/Nemotron-Personas-Japan",
>     "repo_type": "dataset",
>     "token_secret": "hf_token"
>   }'

Connect to an S3 bucket or S3-compatible storage (e.g., MinIO, Ceph):

# Create a fileset backed by S3 storage using the SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
# The "prefix" field is optional; use it to scope the fileset to a folder within the bucket
$ nemo files filesets create s3-training-data \
>   --description "Training data stored in S3" \
>   --storage '{
>     "type": "s3",
>     "bucket": "my-ml-bucket",
>     "prefix": "datasets/training",
>     "region": "us-east-1",
>     "use_sdk_auth": true
>   }'

# Upload data to S3
$ nemo files upload ./training_data/ s3-training-data

# Download data from S3
$ nemo files download s3-training-data -o ./downloaded_data/
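Hand-writing the --storage JSON inside shell quotes is easy to get wrong. One option, sketched below, is to build the storage configuration as a Python dictionary and let json.dumps and shlex.join handle serialization and quoting. The helper build_create_command is hypothetical; it only assembles the command string from the flags shown on this page and does not run anything.

```python
import json
import shlex


def build_create_command(name: str, description: str, storage: dict) -> str:
    """Assemble a shell-safe `nemo files filesets create` command line."""
    args = [
        "nemo", "files", "filesets", "create", name,
        "--description", description,
        "--storage", json.dumps(storage),  # serialize the config as one JSON argument
    ]
    return shlex.join(args)  # quotes each argument for a POSIX shell


# Same values as the S3 example above.
storage = {
    "type": "s3",
    "bucket": "my-ml-bucket",
    "prefix": "datasets/training",
    "region": "us-east-1",
    "use_sdk_auth": True,
}

command = build_create_command("s3-training-data", "Training data stored in S3", storage)
print(command)
```

The same approach works for the other storage types: swap in the corresponding dictionary of fields and keep secrets as references to secret names rather than raw values.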

For S3-compatible storage like MinIO, use explicit credentials and a custom endpoint:

# Create secrets to store your S3 credentials
$ echo "$S3_ACCESS_KEY" | nemo secrets create s3_access_key --from-file -
$ echo "$S3_SECRET_KEY" | nemo secrets create s3_secret_key --from-file -

$ nemo files filesets create minio-fileset \
>   --description "Data stored in MinIO" \
>   --storage '{
>     "type": "s3",
>     "bucket": "my-bucket",
>     "endpoint_url": "http://minio.example.com:9000",
>     "region": "us-east-1",
>     "use_sdk_auth": false,
>     "access_key_id_secret": "s3_access_key",
>     "secret_access_key_secret": "s3_secret_key"
>   }'