Zero Copy Data Plane

Why Zero Copy Data Plane?

AI workloads waste time and CPU/GPU cycles repeatedly downloading the same data from S3, GCS, or Azure Blob storage. Zero Copy Data Plane (ZCDP) solves this by caching datasets directly on node-local storage, then exposing them to pods as simple read-only bind mounts: no FUSE, no custom filesystems, no application changes. Subsequent jobs or runs that need the same dataset are placed onto nodes that already hold it, so runs start without waiting on a download and read data at full local disk speed.

🚀 Instant Startup

Jobs start faster by avoiding repeated remote downloads.

🎯 Higher CPU/GPU Utilization

Keeps compute fed with local data instead of leaving it idle on remote downloads.

🔌 Zero Application Changes

No SDKs or new APIs — data appears as a normal directory.

🧩 Kubernetes Native

StorageBackend, Dataset, DatasetClaim, and NodeDataset Custom Resource Definitions, plus controller and node agent services.

🔒 Immutable Snapshots

Follows best practice by pairing immutable datasets with the ZCDP content cache.

⚙️ Simple to Operate

No distributed filesystem, no metadata cluster, no heavy caching layer.

How It Works

ZCDP uses a simple but powerful architecture consisting of a controller, node agents, and a few Kubernetes Custom Resource Definitions (CRDs).

Controller: reconciles CRDs, sends notifications to agents, and exposes reports and metrics.
Node Agent: syncs remote data to local storage, evicts old datasets, and reports status.
Admission Webhook: injects bind mounts into pods automatically.
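
To make the webhook step concrete, here is a minimal sketch of the kind of JSON Patch a mutating admission webhook could return to add a read-only mount to a pod. It assumes the cached data is exposed through a Kubernetes hostPath volume. The function names, volume name, and host path are hypothetical and are not taken from ZCDP itself.

```python
import base64
import json


def build_mount_patch(pod: dict, volume_name: str, host_path: str, mount_path: str) -> list:
    """Build JSON Patch ops that add a read-only hostPath mount to every container."""
    spec = pod.get("spec", {})
    volume = {"name": volume_name, "hostPath": {"path": host_path, "type": "Directory"}}
    mount = {"name": volume_name, "mountPath": mount_path, "readOnly": True}
    ops = []

    # Append to an existing array, or create the array if the pod has none yet.
    if spec.get("volumes"):
        ops.append({"op": "add", "path": "/spec/volumes/-", "value": volume})
    else:
        ops.append({"op": "add", "path": "/spec/volumes", "value": [volume]})

    for i, container in enumerate(spec.get("containers", [])):
        base = f"/spec/containers/{i}/volumeMounts"
        if container.get("volumeMounts"):
            ops.append({"op": "add", "path": base + "/-", "value": mount})
        else:
            ops.append({"op": "add", "path": base, "value": [mount]})
    return ops


def admission_response(uid: str, ops: list) -> dict:
    """Wrap the patch in an admission.k8s.io/v1 AdmissionReview response."""
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            # The API server expects the patch as base64-encoded JSON.
            "patch": base64.b64encode(json.dumps(ops).encode()).decode(),
        },
    }
```

Because the mount is added at admission time, the pod spec written by the user stays free of any storage details.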

StorageBackend CRD: defines storage endpoints and credentials.
Dataset CRD: references a StorageBackend and defines the dataset path.
DatasetClaim CRD: defines which pods require a specific dataset.
NodeDataset CRD: tracks local cache status on every node.
RefreshDataset CRD: provides a simple mechanism to resync a named dataset.

ZCDP can then:

Automatically sync datasets from object storage to node-local NVMe
Expose datasets as read-only, zero-copy bind mounts inside pods
Provide immutable, versioned snapshots for reproducible runs
Work with "just a bunch of files" under an object store prefix, with no special format required
Require zero changes to application code: the app just reads from a directory like /data/dataset

Use Cases

AI training & fine-tuning
Batch inference pipelines
RAG + feature extraction workflows
Large-scale data transformation
ML workloads

The Problem ZCDP Solves

Modern AI and data-intensive workloads on Kubernetes spend much of their time waiting for data rather than computing. Typically, each job re-downloads the same large datasets and model files from S3, GCS, or Azure Blob storage. Pods start slowly, GPUs sit idle, and object storage becomes a bottleneck.

Existing options like distributed filesystems and complex caching layers are often heavyweight, expensive, and require application changes or sidecars. Many teams roll their own partial solution and end up maintaining brittle, bespoke data loaders.

Zero Copy Data Plane exists to remove that bottleneck entirely.

Why Teams Choose ZCDP

GPU efficiency: Higher utilization on expensive hardware.
Developer velocity: Faster iteration cycles without repeated data downloads.
Simplicity: Minimal components, no new filesystem to operate.
Compatibility: Works with any S3-compatible storage and any Kubernetes flavor.
Predictability: Immutable snapshots and deterministic runs.
Fast adoption: No developer time spent building and maintaining a homegrown data cache.
Cost efficiency: Lower egress and API request costs because data is downloaded once and reused.

When to use ZCDP

ZCDP is designed for workloads that reuse datasets across runs or nodes. For small datasets that are read in a single run, the benefit may be limited.

Architecture Overview

CRDs

Dataset — defines the source URI (e.g., S3 prefix) and snapshot version.
DatasetClaim — declares which pods should get a dataset mounted at what path.
NodeDataset — represents per-node state: whether a given snapshot is present, its size, and which pods are using it.
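
One way to picture how these three resources relate is as plain records. The sketch below is an illustrative model only. The field names and the label-based pod selector are assumptions made for this example, not the actual CRD schema.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    source_uri: str  # e.g. an S3 prefix
    snapshot: str    # snapshot version


@dataclass
class DatasetClaim:
    dataset: str        # name of the Dataset to mount
    pod_selector: dict  # hypothetical: labels a pod must carry to match
    mount_path: str     # where the dataset appears inside the pod


@dataclass
class NodeDataset:
    node: str
    dataset: str
    snapshot: str
    present: bool = False  # is the snapshot fully cached on this node?
    size_bytes: int = 0
    pods_in_use: list = field(default_factory=list)


def claims_for_pod(pod_labels: dict, claims: list) -> list:
    """Return the claims whose selector labels are all present on the pod."""
    return [
        claim
        for claim in claims
        if all(pod_labels.get(key) == value for key, value in claim.pod_selector.items())
    ]
```

A Dataset describes what to fetch, a DatasetClaim describes who gets it and where, and a NodeDataset records what each node actually holds.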

Controller

The controller watches these CRDs and:

Plans which nodes should host which snapshots
Consumes NodeDataset status reported by agents
Tracks pod usage via annotations and maintains per-node reference counts
Coordinates sync and eviction operations through the node agents
Exposes status and metrics for observability
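
The reference-counting step can be sketched in a few lines. This is a simplified illustration, assuming each pod is reduced to a dict with its node name and annotations. The annotation key and the comma-separated value format are hypothetical.

```python
from collections import Counter

# Hypothetical annotation key listing the snapshots a pod has mounted.
ANNOTATION = "zcdp.example/snapshots"


def reference_counts(pods: list) -> dict:
    """Count, per node, how many scheduled pods reference each snapshot."""
    counts = {}
    for pod in pods:
        node = pod.get("node")
        if not node:
            continue  # unscheduled pods hold no node-local references
        raw = pod.get("annotations", {}).get(ANNOTATION, "")
        for snapshot in (part.strip() for part in raw.split(",")):
            if snapshot:
                counts.setdefault(node, Counter())[snapshot] += 1
    return counts


def unreferenced(cached_snapshots: list, node_counts: Counter) -> list:
    """Snapshots cached on a node that no pod references, i.e. eviction candidates."""
    return sorted(s for s in cached_snapshots if node_counts.get(s, 0) == 0)
```

Only snapshots with a zero count are candidates for eviction, so data that a running pod has mounted is never removed from under it.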

Node Agent

A lightweight agent runs as a DaemonSet on each node, responsible for:

Syncing data from object storage (e.g., S3) into node-local NVMe
Validating snapshots and promoting them atomically into a ready state
Maintaining a local metadata index (size, last-access, status)
Recording NodeDataset status after each ensure (sync) and eviction operation
Exposing stable, read-only directories to be bind-mounted into pods
Evicting unused snapshots when space is needed
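
Atomic promotion is commonly done by syncing into a staging directory and renaming it into place once validation passes. The sketch below shows that general pattern, not ZCDP's actual implementation. The directory layout and the file-list validation are assumptions for this example.

```python
import os
from pathlib import Path


def promote(staging: Path, ready_root: Path, snapshot_id: str, expected_files: set) -> Path:
    """Validate a staged snapshot, then move it into the ready area in one step."""
    found = {p.relative_to(staging).as_posix() for p in staging.rglob("*") if p.is_file()}
    if found != expected_files:
        raise ValueError(f"snapshot {snapshot_id} is incomplete or has unexpected files")

    target = ready_root / snapshot_id
    if target.exists():
        # Snapshots are immutable, so a ready snapshot is never overwritten.
        raise FileExistsError(f"snapshot {snapshot_id} is already promoted")

    ready_root.mkdir(parents=True, exist_ok=True)
    # A rename within one filesystem is atomic on POSIX, so readers see either
    # no snapshot or the complete one, never a half-synced directory.
    os.rename(staging, target)
    return target
```

The staging and ready directories need to live on the same filesystem for the rename to be atomic.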

Snapshots — Simple but Powerful

A snapshot in ZCDP is simply all the objects under a given prefix in your object store, for example: s3://my-bucket/imagenet/v3/

You do not need a special manifest or format. ZCDP treats the entire prefix as a single immutable snapshot. To create a new snapshot, simply upload a new set of files under a new prefix (e.g., v4/). ZCDP has built-in content-addressable storage (CAS) capabilities that deduplicate files shared between snapshots, minimizing egress costs and local storage usage.
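
As an illustration of "everything under a prefix is the snapshot", the following sketch splits a snapshot URI into bucket and prefix and picks the member objects out of a list of keys. The helper names are hypothetical. A real agent would obtain the keys from the object store's list API rather than from an in-memory list.

```python
from urllib.parse import urlparse


def parse_snapshot_uri(uri: str) -> tuple:
    """Split an s3://bucket/prefix/ URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    prefix = parsed.path.lstrip("/")
    # Force a trailing slash so that v3/ does not also match v30/.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix


def snapshot_members(keys: list, prefix: str) -> list:
    """Return the paths, relative to the prefix, of every object in the snapshot."""
    return sorted(k[len(prefix):] for k in keys if k.startswith(prefix) and k != prefix)
```

For the example above, `parse_snapshot_uri("s3://my-bucket/imagenet/v3/")` yields the bucket `my-bucket` and the prefix `imagenet/v3/`.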

Content-Addressable Storage (CAS)

The node agent stores dataset files in a content-addressable cache keyed by checksums. This allows ZCDP to deduplicate shared assets across snapshots, resume interrupted syncs without re-downloading, and verify integrity before promoting a snapshot. The result is faster warm-up for common model weights, lower egress costs, and fewer storage writes on local NVMe.
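
The core idea of a content-addressable cache fits in a short sketch. This one assumes SHA-256 as the checksum and hard links as the way files are placed into snapshot directories. Both are illustrative choices, and the class and method names are hypothetical.

```python
import hashlib
import os
from pathlib import Path


class ContentStore:
    """Minimal content-addressable store: each blob is named by its SHA-256 digest."""

    def __init__(self, root: Path):
        self.blobs = root / "blobs"
        self.blobs.mkdir(parents=True, exist_ok=True)

    def has(self, digest: str) -> bool:
        """True if the blob is already cached, so a sync can skip the download."""
        return (self.blobs / digest).exists()

    def put(self, data: bytes) -> str:
        """Store data once and return its digest; identical data is not rewritten."""
        digest = hashlib.sha256(data).hexdigest()
        blob = self.blobs / digest
        if not blob.exists():
            tmp = self.blobs / (digest + ".tmp")
            tmp.write_bytes(data)
            os.replace(tmp, blob)  # atomic rename, so partial writes are never visible
        return digest

    def link_into(self, digest: str, snapshot_dir: Path, rel_path: str) -> Path:
        """Expose a cached blob inside a snapshot directory via a hard link."""
        dest = snapshot_dir / rel_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        os.link(self.blobs / digest, dest)
        return dest
```

A file shared by v3 and v4 is stored on disk once and linked into both snapshot trees. Because `has` can be checked before each download, an interrupted sync picks up where it left off.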

Zero Copy Explained

"Zero copy" here means:

No copying data into containers on startup
No streaming from network-attached filesystems
No FUSE or user-space filesystems
No custom dataset loader inside your code

The agent writes data once to node-local NVMe, and ZCDP mounts that directory directly into pods using standard kernel bind mounts. Your application just sees a normal directory tree.
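
From the application's side, nothing ZCDP-specific is needed. A data loader can be as plain as the sketch below, which uses only the Python standard library. The function name is hypothetical, and `/data/dataset` is the example mount path used earlier on this page.

```python
from pathlib import Path


def iter_samples(root):
    """Yield (relative_path, bytes) for every file under a mounted dataset directory."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.read_bytes()
```

Inside a pod this would be called as `iter_samples("/data/dataset")`. The same code runs unchanged against any local directory during development.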

Get Started!

Like what you see? Get started with ZCDP today!

Check out the Quick Start Guide to deploy ZCDP in your Kubernetes cluster and run your first workload.