Configuration reference for platform engineers. Use this page as the source of truth for controller, node agent, and backend-specific parameters, cache behavior, and CAS guarantees.
Installation scripts and all CRDs are available in the deploy directory of the https://github.com/dataplatformsolutions/zero-copy-data-plane repository.
Either clone the repository or download one of the release bundles at https://github.com/dataplatformsolutions/zero-copy-data-plane/releases.
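For the clone route, for example:
git clone https://github.com/dataplatformsolutions/zero-copy-data-plane.git
cd zero-copy-data-plane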
Zero Copy Data Plane ships an installer script that applies the namespace, CRDs, controller, and webhook in the right order and waits for readiness (including cert-manager).
./deploy/install-all.sh
As an alternative, you can deploy with Helm from deploy/helm/zcdp. This chart assumes cert-manager is already installed in the cluster and keeps configuration intentionally minimal.
helm upgrade --install zcdp ./deploy/helm/zcdp \
--set imageVersion=0.1.0 \
--set managedNamespaces='{team-a,team-b}'
- imageVersion: Docker image tag used for both the controller and agent.
- managedNamespaces: list of namespaces where the mutating webhook should operate.

Webhook scope: the webhook configuration (deploy/manifests/webhook.yaml) uses a namespaceSelector that limits mutation to the test-workload namespace so data-heavy workloads can be rolled out safely. Cluster admins can change the scope at any time: edit the selector before applying the manifest, or run kubectl edit mutatingwebhookconfiguration zcdp-webhook after installation.
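For reference, the selector stanza to look for has roughly this shape; the webhook entry name and the use of the kubernetes.io/metadata.name label here are assumptions, so check deploy/manifests/webhook.yaml in your release:
webhooks:
  - name: zcdp-webhook.zcdp.io        # hypothetical webhook entry name
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test-workload   # edit to widen the scope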
- Give each node the zcdp.io/enable-agent=true label so the DaemonSet in deploy/manifests/agent.yaml is allowed to run there: kubectl label node <name> zcdp.io/enable-agent=true. Repeat per node or use selectors.
- Run the kubectl label commands against your cluster: label a single node (kubectl label node NODE zcdp.io/enable-agent=true), label a node pool (kubectl label nodes -l nodepool=worker zcdp.io/enable-agent=true), or script against cloud provider tags. Verify with kubectl get nodes --show-labels.
- Set zcdp.io/node-type=gpu on GPU-capable nodes; anything else is treated as CPU by default. Example:
kubectl label node gpu-node-1 zcdp.io/enable-agent=true zcdp.io/node-type=gpu
kubectl label node cpu-node-2 zcdp.io/enable-agent=true zcdp.io/node-type=cpu
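To confirm both labels landed, one option is to print the node type as a column (standard kubectl flags):
kubectl get nodes -l zcdp.io/enable-agent=true -L zcdp.io/node-type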
- The Developer Edition caps how many nodes may carry zcdp.io/enable-agent=true (GPU nodes count toward the same limit). Remove the label from extra nodes to stay under the cap (kubectl label node NODE zcdp.io/enable-agent-); see the count command below.
- deploy/manifests/agent.yaml includes a nodeSelector of zcdp.io/enable-agent=true, so only labeled nodes are eligible.
- Combine zcdp.io/enable-agent=true with zcdp.io/node-type=gpu on the nodes you care about; unlabeled nodes will not schedule the agent.

Verify the pods are healthy with kubectl get pods -n zcdp-system. Tweak images, resource requests, and environment variables directly in the YAMLs to match your registry and cluster defaults.
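To check how many nodes currently count toward the cap:
# count agent-enabled nodes (GPU and CPU both count toward the license limit)
kubectl get nodes -l zcdp.io/enable-agent=true --no-headers | wc -l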
Licenses are required for production clusters. Visit the Pricing page to choose a tier, then apply one of the supported delivery methods below.
- Download license.json from the ZCDP portal.
- Create a ConfigMap named zcdp-license in the controller namespace (defaults to zcdp-system) so the controller can read it via the built-in loader:
kubectl -n zcdp-system create configmap zcdp-license --from-file=license.json
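Since the controller also checks a Secret of the same name (see below), the license can presumably be delivered as a Secret instead; a sketch:
kubectl -n zcdp-system create secret generic zcdp-license --from-file=license.json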
The controller automatically prefers the ConfigMap key license.json over the Secret of the same name when
both are present; if neither exists, it falls back to the Developer Edition limits. The controller periodically
reloads the ConfigMap/Secret on its license interval (default: 10m), so updates no longer require a restart.
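To rotate an existing license in place, re-render and apply the ConfigMap; the next reload picks it up:
kubectl -n zcdp-system create configmap zcdp-license --from-file=license.json \
  --dry-run=client -o yaml | kubectl apply -f -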
To have the controller download and refresh licenses directly from ZCDP HQ, add the license flags to the controller
deployment (env CONTROLLER_ARGS or explicit args in the manifest):
--license-remote \
--license-server-url=https://api.zcdp.io/licenses \
--license-customer-id=<your-customer-id> \
--license-order-id=<your-order-id>
The controller writes the downloaded payload to /var/run/zcdp/license.json (override with
--license-file if needed) and continues to reconcile active licenses on the interval defined by
--license-recalculate-interval (default: 10m).
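A sketch of wiring the flags in as explicit args on the controller Deployment (the container name here is hypothetical; match it to deploy/manifests/controller.yaml, or pass the same flags via the CONTROLLER_ARGS env instead):
spec:
  template:
    spec:
      containers:
        - name: controller          # hypothetical container name
          args:
            - --license-remote
            - --license-server-url=https://api.zcdp.io/licenses
            - --license-customer-id=<your-customer-id>
            - --license-order-id=<your-order-id>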
When observed node counts exceed the licensed limits, the controller logs the overage locally and reports the violation back to ZCDP HQ for auditability and support follow-up. Reduce labeled nodes or upgrade your plan to clear the alarm.
Define provider settings once in a StorageBackend CRD. Datasets refer to the backend by name
via spec.source.storageBackendRef and provide a path that includes the bucket/container
and dataset prefix, keeping credentials and endpoints centralized.
- Endpoint: https://s3.amazonaws.com or a custom MinIO URL; dual-stack and VPC endpoints are supported.
- Credentials: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and an optional session token; an IAM role via IRSA is recommended.
- Region and path style: set region; toggle forcePathStyle for MinIO/Ceph.
- Extras: sse for SSE-S3/KMS, skipTLSVerification for lab clusters, caBundle for private CAs.

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: s3-backend
spec:
  type: s3
  # For production, prefer auth.mode: iam with IRSA instead of static access keys.
  auth:
    mode: accessKey
    accessKeyIdSecretRef:
      name: s3-credentials
      key: access_key_id
    secretAccessKeySecretRef:
      name: s3-credentials
      key: secret_access_key
  s3:
    region: us-west-2
    endpoint: https://s3.amazonaws.com
    forcePathStyle: false
    skipTLSVerification: false
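The manifest references a Secret named s3-credentials with the two keys above. A minimal way to create it (the namespace is an assumption; use whichever namespace ZCDP reads backend secrets from in your install). The Spaces and Azure examples below follow the same pattern; for GCS, use --from-file with the service-account JSON:
kubectl -n zcdp-system create secret generic s3-credentials \
  --from-literal=access_key_id=<access-key-id> \
  --from-literal=secret_access_key=<secret-access-key>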
- Endpoint defaults to <region>.digitaloceanspaces.com; override with spec.digitalOceanSpaces.endpoint for custom domains.
- Credentials map to digitalocean_spaces_access_key resources; only auth.mode: accessKey is supported.
- Dataset paths use the s3://bucket/path form.

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: do-spaces-backend
spec:
  type: digitalOceanSpaces
  auth:
    mode: accessKey
    accessKeyIdSecretRef:
      name: do-spaces-credentials
      key: access_key_id
    secretAccessKeySecretRef:
      name: do-spaces-credentials
      key: secret_access_key
  digitalOceanSpaces:
    region: nyc3
    endpoint: https://nyc3.digitaloceanspaces.com
- Workload Identity is supported, including service account impersonation via impersonateServiceAccount.
- Tune chunkSize and maxIdleConns when running inside GKE.
- Customer-managed encryption keys via kmsKeyName.

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: gcs-backend
spec:
  type: gcs
  # For production, prefer auth.mode: workloadIdentity to avoid static keys.
  auth:
    mode: serviceAccountKey
    serviceAccountKeySecretRef:
      name: gcs-service-account
      key: service-account.json
  gcs:
    projectId: my-gcp-project
- Set accountName, containerName, and an optional endpoint for sovereign clouds.
- Managed identity is supported on AKS (set clientId for user-assigned identities).
- Override endpoint for private clouds and per-tenant routing.

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: azure-blob-backend
spec:
  type: azureBlob
  # For production, prefer auth.mode: managedIdentity on AKS instead of client secrets.
  auth:
    mode: servicePrincipal
    clientIdSecretRef:
      name: azure-sp
      key: client_id
    clientSecretSecretRef:
      name: azure-sp
      key: client_secret
    tenantIdSecretRef:
      name: azure-sp
      key: tenant_id
  azureBlob:
    accountName: myaccount
    containerName: datasets
    endpoint: https://myaccount.blob.core.windows.net
- Set a timeToLive that expires cold snapshots after a duration.
- pin=true prevents eviction for golden datasets or base models.
- Nodes marked Unschedulable continue serving cached data but stop admitting new downloads.

The controller exposes a small operational surface for dashboards and quick troubleshooting.
- /status — HTML status page with license usage, node counts, dataset cache reach, and an embedded metrics snapshot.
- /api/status — JSON snapshot of the same data for lightweight UIs or scripts.
- /metrics — Prometheus exposition of controller metrics (licensing, reconciler durations, dataset syncs/evictions).

Configure the listener with ZCDP_CONTROLLER_HTTP_ADDR or --http-addr. The default is :8080.
# port-forward the controller for local inspection
kubectl -n zcdp-system port-forward deploy/zerocopy-controller 8080
# open the HTML page in a browser
open http://localhost:8080/status
# retrieve raw JSON or Prometheus text
curl -s http://localhost:8080/api/status | jq
curl -s http://localhost:8080/metrics | head
Each node agent serves Prometheus metrics on /metrics. Configure the bind address with ZCDP_AGENT_HTTP_ADDR or the --listen-http flag (default :9090). Scrape them directly or through a DaemonSet ServiceMonitor.
# example ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zcdp-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: zcdp-agent
  namespaceSelector:
    matchNames: ["zcdp-system"]
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
Minimal field matrix for the core CRDs. All structs follow Kubernetes conventions: metadata drives identity/labels, and spec captures desired state.
| Field | Type | Description |
|---|---|---|
| spec.type | enum | Provider type: s3, gcs, azureBlob, or digitalOceanSpaces. |
| spec.auth.mode | enum | Credential strategy (e.g., iam, accessKey, workloadIdentity, serviceAccountKey, managedIdentity, servicePrincipal). |
| spec.auth.*SecretRef | SecretKeyRef | Secret + key used when the auth mode requires static credentials (access keys, service account JSON, or service principal). |
| spec.s3.region | string | Required for type: s3; standard AWS region for the bucket. |
| spec.s3.endpoint | string | Optional override for S3-compatible endpoints; defaults to the AWS S3 endpoint for the configured region. |
| spec.s3.forcePathStyle | boolean | Forces path-style requests when using custom S3-compatible endpoints. |
| spec.s3.skipTLSVerification | boolean | Disables TLS certificate verification for custom endpoints with self-signed certs. |
| spec.gcs.projectId | string | Optional project hint for type: gcs. |
| spec.azureBlob.accountName | string | Account name for type: azureBlob. |
| spec.azureBlob.containerName | string | Container name for type: azureBlob. |
| spec.azureBlob.endpoint | string | Optional override for the Azure Blob endpoint. |
| spec.digitalOceanSpaces.region | string | Required region for type: digitalOceanSpaces; sets the default Spaces endpoint. |
| spec.digitalOceanSpaces.endpoint | string | Optional override when using a custom CDN or Spaces domain. |
apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: s3-ml-artifacts
spec:
  type: s3
  auth:
    mode: iam
  s3:
    region: us-west-2
    endpoint: https://s3.us-west-2.amazonaws.com
| Field | Type | Description |
|---|---|---|
| spec.source.storageBackendRef | string | Name of the StorageBackend that supplies endpoints and credentials. |
| spec.source.path | string | Dataset path including bucket/container and optional prefix. |
| spec.cache | DatasetCacheSpec | Cache behavior for this dataset; see DatasetCacheSpec below for per-node limits. |
| spec.prefetchPolicy | enum | When to prefetch: OnCreate or OnFirstUse. |
| spec.evictionPriority | enum | Eviction priority: Low, Medium, High, or Pinned. |
apiVersion: zcdp.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  source:
    storageBackendRef: s3-ml-artifacts
    path: s3://ml-datasets/imagenet
  cache:
    maxSize: 2Ti
  prefetchPolicy: OnFirstUse
  evictionPriority: Medium
| Field | Type | Description |
|---|---|---|
| maxSize | string | Soft per-node cap in Kubernetes quantity format, e.g. 500Mi, 1Gi, 2Ti. Supported suffixes include binary (Ki, Mi, Gi, Ti) and decimal (K, M, G, T). |
cache:
  maxSize: 500Gi
| Field | Type | Description |
|---|---|---|
| spec.datasetRef | string | Name of the Dataset to mount. |
| spec.mountPath | string | Absolute path inside the container where the dataset should appear. |
| spec.podSelector | LabelSelector | Selector for pods that should get this dataset mounted. |
| spec.serviceAccountName | string | Optional service account name to scope the claim to pods running under that service account. |
apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: imagenet-readers
  namespace: ml
spec:
  datasetRef: imagenet
  mountPath: /datasets/imagenet
  podSelector:
    matchLabels:
      app: imagenet-consumer
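For illustration, a minimal consumer pod whose labels match the claim's podSelector (the image and command are placeholders); the webhook mutates matching pods so the dataset appears at the claim's mountPath:
apiVersion: v1
kind: Pod
metadata:
  name: imagenet-consumer
  namespace: ml
  labels:
    app: imagenet-consumer        # matches the claim's podSelector
spec:
  containers:
    - name: reader
      image: busybox:1.36         # placeholder image
      command: ["ls", "/datasets/imagenet"]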
ZCDP controls dataset access using Kubernetes ServiceAccounts, not individual users. Users authenticate to Kubernetes using SSO, cloud identity, or certificates, and Kubernetes maps them to groups and permissions via RBAC. Workloads themselves do not run as users — they run as ServiceAccounts, which are the stable, enforced identity for pods.
A DatasetClaim can optionally restrict access to a specific ServiceAccount. When serviceAccountName is set, only pods running as that ServiceAccount can access the dataset. When it is omitted, the dataset is available to all workloads in the namespace, which keeps the default experience simple and unchanged. An empty pod service account is treated as default.
This approach allows ZCDP to support least-privilege access when needed, without adding complexity for basic use cases. Platform teams control which ServiceAccounts users may deploy workloads as, while ZCDP grants datasets to those ServiceAccounts. This provides clear separation between user authentication, workload identity, and dataset access.
apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: imagenet-trainers
  namespace: ml
spec:
  datasetRef: imagenet
  mountPath: /datasets/imagenet
  podSelector:
    matchLabels:
      app: resnet-trainer
  serviceAccountName: trainer
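To check which ServiceAccount a running pod actually uses (and hence whether it can bind this claim):
kubectl -n ml get pod <pod-name> -o jsonpath='{.spec.serviceAccountName}'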
| Field | Type | Description |
|---|---|---|
| spec.datasetRef | string | Dataset being materialized on this node. |
| spec.nodeName | string | Target node; the node agent populates this after caching or eviction. |
| spec.pathKey | string | Identifies the dataset source used to populate the cache entry. |
| spec.desiredSnapshot | string | Snapshot ID desired on this node. |
| spec.activeConsumers | int32 | Number of pods currently using this dataset on the node. |
| status.phase | enum | Lifecycle: Pending → Syncing → Ready → Evicting → Error → InsufficientCapacity. |
| status.message | string | Human-readable status details. |
| status.localPath | string | Local path where the snapshot is cached (cleared on eviction). |
| status.sizeOnDisk | int64 | Bytes consumed on the node for this dataset; useful for debugging eviction. |
| status.lastAccessed | timestamp | Time of last access on the node. |
| status.activeConsumers | int32 | Number of pods currently using this dataset on the node. |
apiVersion: zcdp.io/v1alpha1
kind: NodeDataset
metadata:
  name: imagenet-node-a
spec:
  datasetRef: imagenet
  nodeName: ip-10-0-12-34
Trigger a best-effort resync of a Dataset across every node currently caching it. The controller fans out refresh requests to node agents, which reconcile their local manifests against the backend: stale files are removed, missing objects are redownloaded, and changed content is updated.
| Field | Type | Description |
|---|---|---|
| spec.datasetRef | string | Name of the Dataset to refresh on all nodes with cached copies. |
| status.phase | enum | One of Refreshing, Completed, Partial, or Error, summarizing the rollout. |
| status.message | string | Human-readable status details. |
| status.refreshedNodes | int32 | How many nodes successfully processed the refresh. |
| status.errors | int32 | How many nodes failed to refresh (the controller will requeue to retry). |
| status.lastRefreshTime | timestamp | Time of the last reconciliation; useful for auditing when data last synced. |
| status.observedGeneration | int64 | Tracks which RefreshDataset generation has been processed, so refreshes do not repeat unless you change the object again. |
Refreshes are opt-in: the controller only triggers resyncs when a RefreshDataset resource is created or its
spec is updated to bump the generation. Routine dataset changes do not automatically resync existing caches.
apiVersion: zcdp.io/v1alpha1
kind: RefreshDataset
metadata:
  name: refresh-sample-model
spec:
  datasetRef: sample-model
Apply the manifest to trigger a refresh:
kubectl apply -f refresh-dataset.yaml
# watch progress
kubectl get refreshdataset refresh-sample-model
kubectl describe refreshdataset refresh-sample-model
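Optionally block until the rollout finishes; this uses kubectl's jsonpath-based wait (available in kubectl 1.23+) against the status.phase values from the table above:
kubectl wait --for=jsonpath='{.status.phase}'=Completed \
  refreshdataset/refresh-sample-model --timeout=10m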
Use annotations to steer small-footprint datasets without touching shared deployment defaults. These are read by the controller and node agent to adjust behavior per object.
| Annotation Key | Applies To | Effect |
|---|---|---|
| zcdp.io/prefetch | Dataset | Boolean; when true the node agent pre-downloads chunks before claims attach. |
| zcdp.io/hot-tier | Dataset | One of memory, nvme, or disk; pins the dataset to a specific cache class. |
| zcdp.io/ttl | Dataset | Duration (e.g., 6h); evicts after inactivity to keep the footprint small. |
| zcdp.io/claim-scope | DatasetClaim | namespace or cluster; limits which namespaces can bind the dataset. |
| zcdp.io/read-only | DatasetClaim | Boolean; hardens mounts by forcing read-only even if the Dataset allows writes. |
apiVersion: zcdp.io/v1alpha1
kind: Dataset
metadata:
  name: tiny-model
  annotations:
    zcdp.io/prefetch: "true"
    zcdp.io/ttl: "4h"
spec:
  snapshot: v1
  source:
    storageBackendRef: s3-ml-artifacts
    path: s3://ml-artifacts/tiny-model/
---
apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: tiny-model-claim
  annotations:
    zcdp.io/claim-scope: namespace
spec:
  datasetRef: tiny-model
  mountPath: /datasets/tiny-model