Documentation

Configuration reference for platform engineers. Use this page as the source of truth for controller, node agent, and backend-specific parameters, cache behavior, and CAS guarantees.

Installation

Installation scripts and all CRDs are available in the deploy directory of the https://github.com/dataplatformsolutions/zero-copy-data-plane repository.

Either clone the repository or download one of the release bundles from https://github.com/dataplatformsolutions/zero-copy-data-plane/releases.

Zero Copy Data Plane ships an installer script that applies the namespace, CRDs, controller, and webhook in the right order and waits for readiness (including cert-manager).

./deploy/install-all.sh

As an alternative, you can deploy with Helm from deploy/helm/zcdp. This chart assumes cert-manager is already installed in the cluster and keeps configuration intentionally minimal.

helm upgrade --install zcdp ./deploy/helm/zcdp \
  --set imageVersion=0.1.0 \
  --set 'managedNamespaces={team-a,team-b}'

Webhook scope: The webhook configuration (deploy/manifests/webhook.yaml) uses a namespaceSelector that limits mutation to the test-workload namespace so data-heavy workloads can be rolled out safely. Cluster admins can change this at any time by editing the selector before applying it or by running kubectl edit mutatingwebhookconfiguration zcdp-webhook after installation.
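For orientation, the selector has roughly this shape; the exact field layout in deploy/manifests/webhook.yaml and the webhook name here are illustrative, but kubernetes.io/metadata.name is the standard label Kubernetes sets on every namespace:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: zcdp-webhook
webhooks:
- name: mutate.zcdp.io                     # illustrative webhook name
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name     # auto-populated namespace-name label
      operator: In
      values: ["test-workload"]            # widen this list to roll out further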

How to make a node ZCDP-enabled

  • Give a node the zcdp.io/enable-agent=true label so the DaemonSet in deploy/manifests/agent.yaml is allowed to run there: kubectl label node <name> zcdp.io/enable-agent=true. Repeat per node or use selectors.
  • Where to set labels: Run normal kubectl label commands against your cluster. Label a single node (kubectl label node NODE zcdp.io/enable-agent=true), label a node pool (kubectl label nodes -l nodepool=worker zcdp.io/enable-agent=true), or script against cloud provider tags. Verify with kubectl get nodes --show-labels.
  • GPU vs. CPU tagging: Add zcdp.io/node-type=gpu on GPU-capable nodes; anything else is treated as CPU by default. Example:
    kubectl label node gpu-node-1 zcdp.io/enable-agent=true zcdp.io/node-type=gpu
    kubectl label node cpu-node-2 zcdp.io/enable-agent=true zcdp.io/node-type=cpu
  • Developer Edition limits: Only 5 nodes may carry zcdp.io/enable-agent=true under the Developer Edition (GPU nodes count toward the same limit). Remove the label on extras to stay under the cap (kubectl label node NODE zcdp.io/enable-agent-).
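To see how many nodes currently count toward the Developer Edition cap, count the labeled nodes directly:

# count nodes labeled for the agent (Developer Edition allows at most 5)
kubectl get nodes -l zcdp.io/enable-agent=true --no-headers | wc -l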

How Pods are Scheduled

  • The DaemonSet in deploy/manifests/agent.yaml includes a nodeSelector of zcdp.io/enable-agent=true, so only labeled nodes are eligible.
  • If you want to restrict the agent further (for example, only to GPU nodes), combine zcdp.io/enable-agent=true with zcdp.io/node-type=gpu on the nodes you care about; unlabeled nodes will not schedule the agent.
  • Replica counts come from the DaemonSet itself: one pod per matching node. Keep the labels limited to the nodes you intend to license and operate.

Verify the pods are healthy with kubectl get pods -n zcdp-system. Tweak images, resource requests, and environment variables directly in the YAMLs to match your registry and cluster defaults.

License Configuration

Licenses are required for production clusters. Visit the Pricing page to choose a tier, then apply one of the supported delivery methods below.

Add a downloaded license (ConfigMap)

  1. Download the issued license.json from the ZCDP portal.
  2. Create a ConfigMap named zcdp-license in the controller namespace (defaults to zcdp-system) so the controller can read it via the built-in loader:
    kubectl -n zcdp-system create configmap zcdp-license --from-file=license.json
    The controller automatically prefers the ConfigMap key license.json over the Secret of the same name when both are present; if neither exists, it falls back to the Developer Edition limits. The controller periodically reloads the ConfigMap/Secret on its license interval (default: 10m), so updates no longer require a restart.
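To rotate a license in place later, regenerate the ConfigMap from the new file; because the controller reloads on its license interval, no restart is needed:

kubectl -n zcdp-system create configmap zcdp-license \
  --from-file=license.json --dry-run=client -o yaml | kubectl apply -f -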

Enable remote license refresh

To have the controller download and refresh licenses directly from ZCDP HQ, add the license flags to the controller deployment (env CONTROLLER_ARGS or explicit args in the manifest):

--license-remote \
--license-server-url=https://api.zcdp.io/licenses \
--license-customer-id=<your-customer-id> \
--license-order-id=<your-order-id>

The controller writes the downloaded payload to /var/run/zcdp/license.json (override with --license-file if needed) and continues to reconcile active licenses on the interval defined by --license-recalculate-interval (default: 10m).
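As a sketch, the flags can be wired into the controller Deployment's container args; the container name and excerpt layout below are illustrative, so adjust them to your controller manifest:

# controller Deployment (excerpt; adjust to your manifest)
spec:
  template:
    spec:
      containers:
      - name: controller                   # illustrative container name
        args:
        - --license-remote
        - --license-server-url=https://api.zcdp.io/licenses
        - --license-customer-id=<your-customer-id>
        - --license-order-id=<your-order-id>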

License violation reporting

When observed node counts exceed the licensed limits, the controller logs the overage locally and reports the violation back to ZCDP HQ for auditability and support follow-up. Reduce labeled nodes or upgrade your plan to clear the alarm.

Storage Backend Reference

Define provider settings once in a StorageBackend CRD. Datasets refer to the backend by name via spec.source.storageBackendRef and provide a path that includes the bucket/container and dataset prefix, keeping credentials and endpoints centralized.

S3 / MinIO

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: s3-backend
spec:
  type: s3
  # For production, prefer auth.mode: iam with IRSA instead of static access keys.
  auth:
    mode: accessKey
    accessKeyIdSecretRef:
      name: s3-credentials
      key: access_key_id
    secretAccessKeySecretRef:
      name: s3-credentials
      key: secret_access_key
  s3:
    region: us-west-2
    endpoint: https://s3.amazonaws.com
    forcePathStyle: false
    skipTLSVerification: false
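The manifest above expects a Secret named s3-credentials carrying the referenced keys. A minimal way to create it, assuming the controller reads Secrets from its own namespace (zcdp-system):

kubectl -n zcdp-system create secret generic s3-credentials \
  --from-literal=access_key_id=<your-access-key-id> \
  --from-literal=secret_access_key=<your-secret-access-key>

The DigitalOcean Spaces backend below follows the same pattern with its do-spaces-credentials Secret.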

DigitalOcean Spaces

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: do-spaces-backend
spec:
  type: digitalOceanSpaces
  auth:
    mode: accessKey
    accessKeyIdSecretRef:
      name: do-spaces-credentials
      key: access_key_id
    secretAccessKeySecretRef:
      name: do-spaces-credentials
      key: secret_access_key
  digitalOceanSpaces:
    region: nyc3
    endpoint: https://nyc3.digitaloceanspaces.com

Google Cloud Storage

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: gcs-backend
spec:
  type: gcs
  # For production, prefer auth.mode: workloadIdentity to avoid static keys.
  auth:
    mode: serviceAccountKey
    serviceAccountKeySecretRef:
      name: gcs-service-account
      key: service-account.json
  gcs:
    projectId: my-gcp-project
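Likewise, the gcs-service-account Secret can be created from a downloaded service account key file; the local filename here is illustrative:

kubectl -n zcdp-system create secret generic gcs-service-account \
  --from-file=service-account.json=./my-gcp-key.json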

Azure Blob Storage

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: azure-blob-backend
spec:
  type: azureBlob
  # For production, prefer auth.mode: managedIdentity on AKS instead of client secrets.
  auth:
    mode: servicePrincipal
    clientIdSecretRef:
      name: azure-sp
      key: client_id
    clientSecretSecretRef:
      name: azure-sp
      key: client_secret
    tenantIdSecretRef:
      name: azure-sp
      key: tenant_id
  azureBlob:
    accountName: myaccount
    containerName: datasets
    endpoint: https://myaccount.blob.core.windows.net
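And the azure-sp Secret with the three service principal values referenced above:

kubectl -n zcdp-system create secret generic azure-sp \
  --from-literal=client_id=<client-id> \
  --from-literal=client_secret=<client-secret> \
  --from-literal=tenant_id=<tenant-id>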

Cache Eviction Strategies

Content-Addressable Storage (CAS)

Observability: status page and Prometheus

The controller exposes a small operational surface for dashboards and quick troubleshooting.

Controller endpoints

Configure the listener with ZCDP_CONTROLLER_HTTP_ADDR or --http-addr. The default is :8080.

# port-forward the controller for local inspection
kubectl -n zcdp-system port-forward deploy/zerocopy-controller 8080

# open the HTML page in a browser
open http://localhost:8080/status

# retrieve raw JSON or Prometheus text
curl -s http://localhost:8080/api/status | jq
curl -s http://localhost:8080/metrics | head

Agent metrics

Each node agent serves Prometheus metrics on /metrics. Configure the bind address with ZCDP_AGENT_HTTP_ADDR or the --listen-http flag (default :9090). Scrape them directly or through a DaemonSet ServiceMonitor.

# example ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zcdp-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: zcdp-agent
  namespaceSelector:
    matchNames: ["zcdp-system"]
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
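A ServiceMonitor needs a Service to select. If your install does not already create one, a headless Service along these lines works; the app: zcdp-agent labels are an assumption and must match the agent DaemonSet's pod labels:

apiVersion: v1
kind: Service
metadata:
  name: zcdp-agent
  namespace: zcdp-system
  labels:
    app: zcdp-agent              # label the ServiceMonitor selects on
spec:
  clusterIP: None                # headless; one endpoint per agent pod
  selector:
    app: zcdp-agent              # assumed DaemonSet pod label
  ports:
  - name: http                   # matches the ServiceMonitor's port name
    port: 9090
    targetPort: 9090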

CRD Field Reference

Minimal field matrix for the core CRDs. All structs follow Kubernetes conventions: metadata drives identity/labels, and spec captures desired state.

StorageBackend

Field Type Description
spec.type enum Provider type: s3, gcs, azureBlob, or digitalOceanSpaces.
spec.auth.mode enum Credential strategy (e.g., iam, accessKey, workloadIdentity, serviceAccountKey, managedIdentity, servicePrincipal).
spec.auth.*SecretRef SecretKeyRef Secret + key used when the auth mode requires static credentials (access keys, service account JSON, or service principal).
spec.s3.region string Required for type: s3; standard AWS region for the bucket.
spec.s3.endpoint string Optional override for S3-compatible endpoints; defaults to the AWS S3 endpoint for the configured region.
spec.s3.forcePathStyle boolean Forces path-style requests when using custom S3-compatible endpoints.
spec.s3.skipTLSVerification boolean Disable TLS certificate verification for custom endpoints with self-signed certs.
spec.gcs.projectId string Optional project hint for type: gcs.
spec.azureBlob.accountName string Account name for type: azureBlob.
spec.azureBlob.containerName string Container name for type: azureBlob.
spec.azureBlob.endpoint string Optional override for the Azure Blob endpoint.
spec.digitalOceanSpaces.region string Required region for type: digitalOceanSpaces; sets the default Spaces endpoint.
spec.digitalOceanSpaces.endpoint string Optional override when using a custom CDN or Spaces domain.

StorageBackend example

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: s3-ml-artifacts
spec:
  type: s3
  auth:
    mode: iam
  s3:
    region: us-west-2
    endpoint: https://s3.us-west-2.amazonaws.com

Dataset

Field Type Description
spec.source.storageBackendRef string Name of the StorageBackend that supplies endpoints and credentials.
spec.source.path string Dataset path including bucket/container and optional prefix.
spec.cache DatasetCacheSpec Cache behavior for this dataset; see DatasetCacheSpec below for per-node limits.
spec.prefetchPolicy enum When to prefetch: OnCreate or OnFirstUse.
spec.evictionPriority enum Eviction priority: Low, Medium, High, or Pinned.

Dataset example

apiVersion: zcdp.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  source:
    storageBackendRef: s3-ml-artifacts
    path: s3://ml-datasets/imagenet
  cache:
    maxSize: 2Ti
  prefetchPolicy: OnFirstUse
  evictionPriority: Medium

DatasetCacheSpec

Field Type Description
maxSize string Soft per-node cap in Kubernetes quantity format, e.g. 500Mi, 1Gi, 2Ti. Supported suffixes include binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T).

DatasetCacheSpec example

cache:
  maxSize: 500Gi

DatasetClaim

Field Type Description
spec.datasetRef string Name of the Dataset to mount.
spec.mountPath string Absolute path inside the container where the dataset should appear.
spec.podSelector LabelSelector Selector for pods that should get this dataset mounted.
spec.serviceAccountName string Optional service account name to scope the claim to pods running under that service account.

DatasetClaim example

apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: imagenet-readers
  namespace: ml
spec:
  datasetRef: imagenet
  mountPath: /datasets/imagenet
  podSelector:
    matchLabels:
      app: imagenet-consumer

Service account scoping

ZCDP controls dataset access using Kubernetes ServiceAccounts, not individual users. Users authenticate to Kubernetes using SSO, cloud identity, or certificates, and Kubernetes maps them to groups and permissions via RBAC. Workloads themselves do not run as users — they run as ServiceAccounts, which are the stable, enforced identity for pods.

A DatasetClaim can optionally restrict access to a specific ServiceAccount. When serviceAccountName is set, only pods running as that ServiceAccount can access the dataset. When it is omitted, the dataset is available to all workloads in the namespace, which keeps the default experience simple and unchanged. An empty pod service account is treated as default.

This approach allows ZCDP to support least-privilege access when needed, without adding complexity for basic use cases. Platform teams control which ServiceAccounts users may deploy workloads as, while ZCDP grants datasets to those ServiceAccounts. This provides clear separation between user authentication, workload identity, and dataset access.

apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: imagenet-trainers
  namespace: ml
spec:
  datasetRef: imagenet
  mountPath: /datasets/imagenet
  podSelector:
    matchLabels:
      app: resnet-trainer
  serviceAccountName: trainer

NodeDataset

Field Type Description
spec.datasetRef string Dataset being materialized on this node.
spec.nodeName string Target node for this cache entry; the node agent updates status after caching or eviction.
spec.pathKey string Identifies the dataset source used to populate the cache entry.
spec.desiredSnapshot string Snapshot ID desired on this node.
spec.activeConsumers int32 Number of pods currently using this dataset on the node.
status.phase enum Lifecycle: Pending, Syncing, Ready, Evicting, Error, or InsufficientCapacity.
status.message string Human-readable status details.
status.localPath string Local path where the snapshot is cached (cleared on eviction).
status.sizeOnDisk int64 Bytes consumed on the node for this dataset; useful for debugging eviction.
status.lastAccessed timestamp Time of last access on the node.
status.activeConsumers int32 Number of pods currently using this dataset on the node.

NodeDataset example

apiVersion: zcdp.io/v1alpha1
kind: NodeDataset
metadata:
  name: imagenet-node-a
spec:
  datasetRef: imagenet
  nodeName: ip-10-0-12-34

RefreshDataset

Trigger a best-effort resync of a Dataset across every node currently caching it. The controller fans out refresh requests to the node agents, which reconcile their local manifests against the backend, removing stale files, redownloading missing objects, and updating any content that has changed.

Field Type Description
spec.datasetRef string Name of the Dataset to refresh on all nodes with cached copies.
status.phase enum One of Refreshing, Completed, Partial, or Error summarizing the rollout.
status.message string Human-readable status details.
status.refreshedNodes int32 How many nodes successfully processed the refresh.
status.errors int32 How many nodes failed to refresh (the controller will requeue to retry).
status.lastRefreshTime timestamp Time of the last reconciliation, useful for auditing when data last synced.
status.observedGeneration int64 Tracks which RefreshDataset generation has been processed to avoid repeated refreshes unless you change the object again.

Refreshes are opt-in: the controller only triggers resyncs when a RefreshDataset resource is created or its spec is updated to bump the generation. Routine dataset changes do not automatically resync existing caches.

RefreshDataset example

apiVersion: zcdp.io/v1alpha1
kind: RefreshDataset
metadata:
  name: refresh-sample-model
spec:
  datasetRef: sample-model

Apply the manifest to trigger a refresh:

kubectl apply -f refresh-dataset.yaml

# watch progress
kubectl get refreshdataset refresh-sample-model
kubectl describe refreshdataset refresh-sample-model
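Because refreshes are opt-in and keyed to the object's generation, one simple way to re-run a refresh later is to delete and re-apply the manifest; creating a fresh object triggers a new rollout:

kubectl delete refreshdataset refresh-sample-model
kubectl apply -f refresh-dataset.yaml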

Lightweight Dataset Annotations

Use annotations to steer small-footprint datasets without touching shared deployment defaults. These are read by the controller and node agent to adjust behavior per object.

Annotation Key Applies To Effect
zcdp.io/prefetch Dataset Boolean; when true the node agent pre-downloads chunks before claims attach.
zcdp.io/hot-tier Dataset Enum memory|nvme|disk; pins the dataset to a specific cache class.
zcdp.io/ttl Dataset Duration (e.g., 6h); evicts after inactivity to keep the footprint small.
zcdp.io/claim-scope DatasetClaim namespace or cluster; limits which namespaces can bind the dataset.
zcdp.io/read-only DatasetClaim Boolean; hardens mounts by forcing read-only even if the Dataset allows writes.

Example: Minimal Prefetch Dataset

apiVersion: zcdp.io/v1alpha1
kind: Dataset
metadata:
  name: tiny-model
  annotations:
    zcdp.io/prefetch: "true"
    zcdp.io/ttl: "4h"
spec:
  snapshot: v1
  source:
    storageBackendRef: s3-ml-artifacts
    path: s3://ml-artifacts/tiny-model/
---
apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: tiny-model-claim
  annotations:
    zcdp.io/claim-scope: namespace
spec:
  datasetRef: tiny-model
  mountPath: /datasets/tiny-model