Documentation

Configuration reference for platform engineers. Use this page as the source of truth for controller, node agent, and backend-specific parameters, cache behavior, and CAS guarantees.

Installation

Installation scripts and all CRDs are available in the deploy directory of the https://github.com/dataplatformsolutions/zero-copy-data-plane repository.

Either clone the repository or download one of the release bundles from https://github.com/dataplatformsolutions/zero-copy-data-plane/releases.

Zero Copy Data Plane ships an installer script that applies the namespace, CRDs, controller, and webhook in the right order and waits for readiness (including cert-manager).

./deploy/install-all.sh

As an alternative, you can deploy with Helm from deploy/helm/zcdp. This chart assumes cert-manager is already installed in the cluster and keeps configuration intentionally minimal.

helm upgrade --install zcdp ./deploy/helm/zcdp \
  --set imageVersion=0.1.0 \
  --set 'managedNamespaces={team-a,team-b}'

Webhook scope: The webhook configuration (deploy/manifests/webhook.yaml) uses a namespaceSelector that limits mutation to the test-workload namespace so data-heavy workloads can be rolled out safely. Cluster admins can change this at any time by editing the selector before applying it or by running kubectl edit mutatingwebhookconfiguration zcdp-webhook after installation.
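For orientation, the selector has roughly this shape; the exact field layout in deploy/manifests/webhook.yaml and the webhook name here are illustrative, but kubernetes.io/metadata.name is the standard label Kubernetes sets on every namespace:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: zcdp-webhook
webhooks:
- name: mutate.zcdp.io                     # illustrative webhook name
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name     # auto-populated namespace-name label
      operator: In
      values: ["test-workload"]            # widen this list to roll out further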

How to make a node ZCDP-enabled

  • Give a node the zcdp.io/enable-agent=true label so the DaemonSet in deploy/manifests/agent.yaml is allowed to run there: kubectl label node <name> zcdp.io/enable-agent=true. Repeat per node or use selectors.
  • Where to set labels: Run normal kubectl label commands against your cluster. Label a single node (kubectl label node NODE zcdp.io/enable-agent=true), label a node pool (kubectl label nodes -l nodepool=worker zcdp.io/enable-agent=true), or script against cloud provider tags. Verify with kubectl get nodes --show-labels.
  • GPU vs. CPU tagging: Add zcdp.io/node-type=gpu on GPU-capable nodes; anything else is treated as CPU by default. Example:
    kubectl label node gpu-node-1 zcdp.io/enable-agent=true zcdp.io/node-type=gpu
    kubectl label node cpu-node-2 zcdp.io/enable-agent=true zcdp.io/node-type=cpu
  • Developer Edition limits: Only 5 nodes may carry zcdp.io/enable-agent=true under the Developer Edition (GPU nodes count toward the same limit). Remove the label on extras to stay under the cap (kubectl label node NODE zcdp.io/enable-agent-).
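To see how many nodes currently count toward the Developer Edition cap, count the labeled nodes directly:

# count nodes labeled for the agent (Developer Edition allows at most 5)
kubectl get nodes -l zcdp.io/enable-agent=true --no-headers | wc -l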

How Pods are Scheduled

  • The DaemonSet in deploy/manifests/agent.yaml includes a nodeSelector of zcdp.io/enable-agent=true, so only labeled nodes are eligible.
  • If you want to restrict the agent further (for example, only to GPU nodes), combine zcdp.io/enable-agent=true with zcdp.io/node-type=gpu on the nodes you care about; unlabeled nodes will not schedule the agent.
  • Replica counts come from the DaemonSet itself: one pod per matching node. Keep the labels limited to the nodes you intend to license and operate.

Verify the pods are healthy with kubectl get pods -n zcdp-system. Tweak images, resource requests, and environment variables directly in the YAMLs to match your registry and cluster defaults.

License Configuration

Licenses are required for production clusters. Visit the Pricing page to choose a tier, then apply one of the supported delivery methods below.

Add a downloaded license (ConfigMap)

  1. Download the issued license.json from the ZCDP portal.
  2. Create a ConfigMap named zcdp-license in the controller namespace (defaults to zcdp-system) so the controller can read it via the built-in loader:
    kubectl -n zcdp-system create configmap zcdp-license --from-file=license.json
    The controller automatically prefers the ConfigMap key license.json over the Secret of the same name when both are present; if neither exists, it falls back to the Developer Edition limits. The controller periodically reloads the ConfigMap/Secret on its license interval (default: 10m), so updates no longer require a restart.
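To rotate a license in place later, regenerate the ConfigMap from the new file; because the controller reloads on its license interval, no restart is needed:

kubectl -n zcdp-system create configmap zcdp-license \
  --from-file=license.json --dry-run=client -o yaml | kubectl apply -f -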

Enable remote license refresh

To have the controller download and refresh licenses directly from ZCDP HQ, add the license flags to the controller deployment (env CONTROLLER_ARGS or explicit args in the manifest):

--license-remote \
--license-server-url=https://api.zcdp.io/licenses \
--license-customer-id=<your-customer-id> \
--license-order-id=<your-order-id>

The controller writes the downloaded payload to /var/run/zcdp/license.json (override with --license-file if needed) and continues to reconcile active licenses on the interval defined by --license-recalculate-interval (default: 10m).
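As a sketch, the flags can be wired into the controller Deployment's container args; the container name and excerpt layout below are illustrative, so adjust them to your controller manifest:

# controller Deployment (excerpt; adjust to your manifest)
spec:
  template:
    spec:
      containers:
      - name: controller                   # illustrative container name
        args:
        - --license-remote
        - --license-server-url=https://api.zcdp.io/licenses
        - --license-customer-id=<your-customer-id>
        - --license-order-id=<your-order-id>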

License violation reporting

When observed node counts exceed the licensed limits, the controller logs the overage locally and reports the violation back to ZCDP HQ for auditability and support follow-up. Reduce labeled nodes or upgrade your plan to clear the alarm.

Storage Backend Reference

Define provider settings once in a StorageBackend CRD. Datasets refer to the backend by name via spec.source.storageBackendRef and provide a path that includes the bucket/container and dataset prefix, keeping credentials and endpoints centralized.

S3 / MinIO

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: s3-backend
spec:
  type: s3
  # For production, prefer auth.mode: iam with IRSA instead of static access keys.
  auth:
    mode: accessKey
    accessKeyIdSecretRef:
      name: s3-credentials
      key: access_key_id
    secretAccessKeySecretRef:
      name: s3-credentials
      key: secret_access_key
  s3:
    region: us-west-2
    endpoint: https://s3.amazonaws.com
    forcePathStyle: false
    skipTLSVerification: false
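The manifest above expects a Secret named s3-credentials carrying the referenced keys. A minimal way to create it, assuming the controller reads Secrets from its own namespace (zcdp-system):

kubectl -n zcdp-system create secret generic s3-credentials \
  --from-literal=access_key_id=<your-access-key-id> \
  --from-literal=secret_access_key=<your-secret-access-key>

The DigitalOcean Spaces backend below follows the same pattern with its do-spaces-credentials Secret.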

DigitalOcean Spaces

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: do-spaces-backend
spec:
  type: digitalOceanSpaces
  auth:
    mode: accessKey
    accessKeyIdSecretRef:
      name: do-spaces-credentials
      key: access_key_id
    secretAccessKeySecretRef:
      name: do-spaces-credentials
      key: secret_access_key
  digitalOceanSpaces:
    region: nyc3
    endpoint: https://nyc3.digitaloceanspaces.com

Google Cloud Storage

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: gcs-backend
spec:
  type: gcs
  # For production, prefer auth.mode: workloadIdentity to avoid static keys.
  auth:
    mode: serviceAccountKey
    serviceAccountKeySecretRef:
      name: gcs-service-account
      key: service-account.json
  gcs:
    projectId: my-gcp-project
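Likewise, the gcs-service-account Secret can be created from a downloaded service account key file; the local filename here is illustrative:

kubectl -n zcdp-system create secret generic gcs-service-account \
  --from-file=service-account.json=./my-gcp-key.json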

Azure Blob Storage

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: azure-blob-backend
spec:
  type: azureBlob
  # For production, prefer auth.mode: managedIdentity on AKS instead of client secrets.
  auth:
    mode: servicePrincipal
    clientIdSecretRef:
      name: azure-sp
      key: client_id
    clientSecretSecretRef:
      name: azure-sp
      key: client_secret
    tenantIdSecretRef:
      name: azure-sp
      key: tenant_id
  azureBlob:
    accountName: myaccount
    containerName: datasets
    endpoint: https://myaccount.blob.core.windows.net
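And the azure-sp Secret with the three service principal values referenced above:

kubectl -n zcdp-system create secret generic azure-sp \
  --from-literal=client_id=<client-id> \
  --from-literal=client_secret=<client-secret> \
  --from-literal=tenant_id=<tenant-id>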

Cache Eviction Strategies

Content-Addressable Storage (CAS)

Observability: status page and Prometheus

The controller exposes a small operational surface for dashboards and quick troubleshooting.

Controller endpoints

Configure the listener with ZCDP_CONTROLLER_HTTP_ADDR or --http-addr. The default is :8080.

# port-forward the controller for local inspection
kubectl -n zcdp-system port-forward deploy/zerocopy-controller 8080

# open the HTML page in a browser
open http://localhost:8080/status

# retrieve raw JSON or Prometheus text
curl -s http://localhost:8080/api/status | jq
curl -s http://localhost:8080/metrics | head

Agent metrics

Each node agent serves Prometheus metrics on /metrics. Configure the bind address with ZCDP_AGENT_HTTP_ADDR or the --listen-http flag (default :9090). Scrape them directly or through a DaemonSet ServiceMonitor.

# example ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zcdp-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: zcdp-agent
  namespaceSelector:
    matchNames: ["zcdp-system"]
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
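A ServiceMonitor needs a Service to select. If your install does not already create one, a headless Service along these lines works; the app: zcdp-agent labels are an assumption and must match the agent DaemonSet's pod labels:

apiVersion: v1
kind: Service
metadata:
  name: zcdp-agent
  namespace: zcdp-system
  labels:
    app: zcdp-agent              # label the ServiceMonitor selects on
spec:
  clusterIP: None                # headless; one endpoint per agent pod
  selector:
    app: zcdp-agent              # assumed DaemonSet pod label
  ports:
  - name: http                   # matches the ServiceMonitor's port name
    port: 9090
    targetPort: 9090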

CRD Field Reference

Minimal field matrix for the core CRDs. All structs follow Kubernetes conventions: metadata drives identity/labels, and spec captures desired state.

StorageBackend

Field Type Description
spec.type enum Provider type: s3, gcs, azureBlob, or digitalOceanSpaces.
spec.auth.mode enum Credential strategy (e.g., iam, accessKey, workloadIdentity, serviceAccountKey, managedIdentity, servicePrincipal).
spec.auth.*SecretRef SecretKeyRef Secret + key used when the auth mode requires static credentials (access keys, service account JSON, or service principal).
spec.s3.region string Required for type: s3; standard AWS region for the bucket.
spec.s3.endpoint string Optional override for S3-compatible endpoints; defaults to the AWS S3 endpoint for the configured region.
spec.s3.forcePathStyle boolean Forces path-style requests when using custom S3-compatible endpoints.
spec.s3.skipTLSVerification boolean Disable TLS certificate verification for custom endpoints with self-signed certs.
spec.gcs.projectId string Optional project hint for type: gcs.
spec.azureBlob.accountName string Account name for type: azureBlob.
spec.azureBlob.containerName string Container name for type: azureBlob.
spec.azureBlob.endpoint string Optional override for the Azure Blob endpoint.
spec.digitalOceanSpaces.region string Required region for type: digitalOceanSpaces; sets the default Spaces endpoint.
spec.digitalOceanSpaces.endpoint string Optional override when using a custom CDN or Spaces domain.

StorageBackend example

apiVersion: zcdp.io/v1alpha1
kind: StorageBackend
metadata:
  name: s3-ml-artifacts
spec:
  type: s3
  auth:
    mode: iam
  s3:
    region: us-west-2
    endpoint: https://s3.us-west-2.amazonaws.com

Dataset

Field Type Description
spec.source.storageBackendRef string Name of the StorageBackend that supplies endpoints and credentials.
spec.source.path string Dataset path including bucket/container and optional prefix.
spec.cache DatasetCacheSpec Cache behavior for this dataset; see DatasetCacheSpec below for per-node limits.
spec.prefetchPolicy enum When to prefetch: OnCreate or OnFirstUse.
spec.evictionPriority enum Eviction priority: Low, Medium, High, or Pinned.

Dataset example

apiVersion: zcdp.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  source:
    storageBackendRef: s3-ml-artifacts
    path: s3://ml-datasets/imagenet
  cache:
    maxSize: 2Ti
  prefetchPolicy: OnFirstUse
  evictionPriority: Medium

DatasetCacheSpec

Field Type Description
maxSize string Soft per-node cap in Kubernetes quantity format, e.g. 500Mi, 1Gi, 2Ti. Supported suffixes include binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T).

DatasetCacheSpec example

cache:
  maxSize: 500Gi

DatasetClaim

Field Type Description
spec.datasetRef string Name of the Dataset to mount.
spec.mountPath string Absolute path inside the container where the dataset should appear.
spec.podSelector LabelSelector Selector for pods that should get this dataset mounted.
spec.serviceAccountName string Optional service account name to scope the claim to pods running under that service account.

DatasetClaim example

apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: imagenet-readers
  namespace: ml
spec:
  datasetRef: imagenet
  mountPath: /datasets/imagenet
  podSelector:
    matchLabels:
      app: imagenet-consumer

Service account scoping

ZCDP controls dataset access using Kubernetes ServiceAccounts, not individual users. Users authenticate to Kubernetes using SSO, cloud identity, or certificates, and Kubernetes maps them to groups and permissions via RBAC. Workloads themselves do not run as users — they run as ServiceAccounts, which are the stable, enforced identity for pods.

A DatasetClaim can optionally restrict access to a specific ServiceAccount. When serviceAccountName is set, only pods running as that ServiceAccount can access the dataset. When it is omitted, the dataset is available to all workloads in the namespace, which keeps the default experience simple and unchanged. An empty pod service account is treated as default.

This approach allows ZCDP to support least-privilege access when needed, without adding complexity for basic use cases. Platform teams control which ServiceAccounts users may deploy workloads as, while ZCDP grants datasets to those ServiceAccounts. This provides clear separation between user authentication, workload identity, and dataset access.

apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: imagenet-trainers
  namespace: ml
spec:
  datasetRef: imagenet
  mountPath: /datasets/imagenet
  podSelector:
    matchLabels:
      app: resnet-trainer
  serviceAccountName: trainer

NodeDataset

Field Type Description
spec.datasetRef string Dataset being materialized on this node.
spec.nodeName string Target node for this cache entry; the node agent updates status after caching or eviction.
spec.pathKey string Identifies the dataset source used to populate the cache entry.
spec.desiredSnapshot string Snapshot ID desired on this node.
spec.activeConsumers int32 Number of pods currently using this dataset on the node.
status.phase enum Lifecycle: Pending, Syncing, Ready, Evicting, Error, or InsufficientCapacity.
status.message string Human-readable status details.
status.localPath string Local path where the snapshot is cached (cleared on eviction).
status.sizeOnDisk int64 Bytes consumed on the node for this dataset; useful for debugging eviction.
status.lastAccessed timestamp Time of last access on the node.
status.activeConsumers int32 Number of pods currently using this dataset on the node.

NodeDataset example

apiVersion: zcdp.io/v1alpha1
kind: NodeDataset
metadata:
  name: imagenet-node-a
spec:
  datasetRef: imagenet
  nodeName: ip-10-0-12-34

RefreshDataset

Trigger a best-effort resync of a Dataset across every node currently caching it. The controller fans out refresh requests to the node agents, which reconcile their local manifests against the backend, removing stale files, redownloading missing objects, and updating any content that has changed.

Field Type Description
spec.datasetRef string Name of the Dataset to refresh on all nodes with cached copies.
status.phase enum One of Refreshing, Completed, Partial, or Error summarizing the rollout.
status.message string Human-readable status details.
status.refreshedNodes int32 How many nodes successfully processed the refresh.
status.errors int32 How many nodes failed to refresh (the controller will requeue to retry).
status.lastRefreshTime timestamp Time of the last reconciliation, useful for auditing when data last synced.
status.observedGeneration int64 Tracks which RefreshDataset generation has been processed to avoid repeated refreshes unless you change the object again.

Refreshes are opt-in: the controller only triggers resyncs when a RefreshDataset resource is created or its spec is updated to bump the generation. Routine dataset changes do not automatically resync existing caches.

RefreshDataset example

apiVersion: zcdp.io/v1alpha1
kind: RefreshDataset
metadata:
  name: refresh-sample-model
spec:
  datasetRef: sample-model

Apply the manifest to trigger a refresh:

kubectl apply -f refresh-dataset.yaml

# watch progress
kubectl get refreshdataset refresh-sample-model
kubectl describe refreshdataset refresh-sample-model
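Because refreshes are opt-in and keyed to the object's generation, one simple way to re-run a refresh later is to delete and re-apply the manifest; creating a fresh object triggers a new rollout:

kubectl delete refreshdataset refresh-sample-model
kubectl apply -f refresh-dataset.yaml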

Lightweight Dataset Annotations

Use annotations to steer small-footprint datasets without touching shared deployment defaults. These are read by the controller and node agent to adjust behavior per object.

Annotation Key Applies To Effect
zcdp.io/prefetch Dataset Boolean; when true the node agent pre-downloads chunks before claims attach.
zcdp.io/hot-tier Dataset Enum memory|nvme|disk; pins the dataset to a specific cache class.
zcdp.io/ttl Dataset Duration (e.g., 6h); evicts after inactivity to keep the footprint small.
zcdp.io/claim-scope DatasetClaim namespace or cluster; limits which namespaces can bind the dataset.
zcdp.io/read-only DatasetClaim Boolean; hardens mounts by forcing read-only even if the Dataset allows writes.

Example: Minimal Prefetch Dataset

apiVersion: zcdp.io/v1alpha1
kind: Dataset
metadata:
  name: tiny-model
  annotations:
    zcdp.io/prefetch: "true"
    zcdp.io/ttl: "4h"
spec:
  snapshot: v1
  source:
    storageBackendRef: s3-ml-artifacts
    path: s3://ml-artifacts/tiny-model/
---
apiVersion: zcdp.io/v1alpha1
kind: DatasetClaim
metadata:
  name: tiny-model-claim
  annotations:
    zcdp.io/claim-scope: namespace
spec:
  datasetRef: tiny-model
  mountPath: /datasets/tiny-model