ian/adding_k8_backend#36

Draft
ianhodge wants to merge 3 commits into main from 03-19-ian_adding_k8_backend

Conversation


@ianhodge ianhodge commented Mar 19, 2026

Summary

This PR adds a Kubernetes execution backend to oz-agent-worker and includes the deployment and hardening work needed to make that backend practical to run in a customer Kubernetes environment.

At a high level, the worker can now execute tasks by creating Kubernetes Jobs instead of running them via Docker or the direct backend. The PR also adds a namespace-scoped Helm chart, updates the docs for customer deployment, and tightens the production path with CI coverage, safer chart defaults, and runtime/container hardening.

What changed

Kubernetes backend

  • added a new Kubernetes backend implementation in internal/worker/kubernetes.go
  • added config parsing / merge support for backend.kubernetes.* in internal/config/config.go and main.go
  • execute each task as a Kubernetes Job / Pod in a target namespace
  • support setup and teardown hooks for task execution
  • propagate configured environment variables into task jobs
  • support Kubernetes-specific execution settings including:
    • namespace / kubeconfig selection
    • image pull secret / image pull policy
    • task job service account
    • node selectors / tolerations / resource requests and limits
    • extra labels / annotations
    • active deadline / termination grace period / workspace size limit
    • configurable unschedulable timeout
    • configurable startup preflight_image
  • added a startup dry-run Job preflight so policy / RBAC / admission issues surface before task execution begins
  • removed the need for a cluster-scoped namespace read at startup; validation stays namespaced
  • added stable hash-based labels and job naming to avoid selector collisions after sanitization
  • updated worker shutdown cleanup to use a fresh context for backend cleanup

Helm chart

  • added a namespace-scoped chart at charts/oz-agent-worker
  • chart deploys:
    • long-lived worker Deployment
    • ServiceAccount
    • namespaced Role / RoleBinding
    • worker config ConfigMap
    • optional API key Secret
  • chart is designed for in-cluster auth by default
  • chart distinguishes between:
    • the worker Deployment service account
    • the optional task Job service account configured via backend.kubernetes.service_account
  • chart now requires an explicit image.tag so installs pin a worker image rather than defaulting to latest
  • chart defaults the long-lived worker Deployment to a non-root security context with conservative resource requests
  • added kubernetesBackend.preflightImage so restricted clusters can override the startup preflight image
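
Putting the required values together, an install override might look roughly like this (worker.workerId, image.tag, and kubernetesBackend.preflightImage appear in the chart as described above; treat the registry and versions as placeholders):

```yaml
# values.override.yaml -- illustrative; consult the chart's values.yaml for the full key set
worker:
  workerId: my-worker            # one worker identity per release
image:
  tag: v1.2.3                    # required: the chart refuses to default to :latest
kubernetesBackend:
  preflightImage: registry.example.com/allowlisted/busybox:1.36  # only needed on restricted clusters
```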

CI / packaging / docs

  • updated CI to:
    • use the Go version from go.mod
    • run go test ./...
    • lint and render the Helm chart in CI
  • fixed .gitignore so the top-level binary is ignored without accidentally ignoring charts/oz-agent-worker/**
  • hardened the runtime Dockerfile to run the worker as a non-root user on a pinned Alpine base image
  • expanded README.md with:
    • Kubernetes backend configuration and caveats
    • Helm installation flow
    • production notes for explicit image pinning
    • non-root worker defaults
    • preflight image override guidance
    • the distinction between worker and task-job service accounts

Operational notes

  • this backend does not require CRDs
  • this backend does not create cluster-scoped RBAC resources
  • the worker Deployment is intended to run long-term in-cluster
  • each task is executed as a Kubernetes Job
  • the worker Deployment defaults to non-root, but the task namespace must still allow creating Jobs with a root init container because sidecar materialization currently depends on that pattern
  • keep replicaCount=1 for a given worker.workerId; scale by creating multiple releases with distinct worker IDs instead of scaling a single release horizontally
  • if cluster policy restricts allowed registries/images, set preflight_image / kubernetesBackend.preflightImage to an allowlisted image

Validation

  • gofmt -w on modified Go files
  • go test ./...
  • go build ./...
  • helm lint charts/oz-agent-worker --set worker.workerId=my-worker --set image.tag=v1.2.3
  • helm template oz-agent-worker charts/oz-agent-worker --namespace agents --set worker.workerId=my-worker --set image.tag=v1.2.3
  • helm lint + helm template again with richer override values to exercise optional chart branches including secret creation, annotations, node selectors, tolerations, resources, setup/teardown hooks, environment entries, and kubernetesBackend.preflightImage
  • docker build to verify the hardened runtime image still builds successfully

Reviewer notes

The highest-risk / highest-value areas to review are:

  • internal/worker/kubernetes.go for job lifecycle, startup preflight behavior, and failure detection
  • main.go + internal/config/config.go for config merge / validation behavior
  • charts/oz-agent-worker/* for install ergonomics and namespaced deployment assumptions
  • README.md for customer-facing deployment guidance and caveats

Artifacts

Co-Authored-By: Oz <oz-agent@warp.dev>

@ianhodge force-pushed the 03-18-ian_adding_customizable_idle_on_complete branch from 43b8290 to 8b0459f on March 19, 2026 20:28
@ianhodge force-pushed the 03-19-ian_adding_k8_backend branch 2 times, most recently from 5a67f57 to 1c8df93 on March 19, 2026 20:37
@ianhodge force-pushed the 03-18-ian_adding_customizable_idle_on_complete branch 2 times, most recently from 7cfae36 to 5b80342 on March 19, 2026 20:41
@ianhodge force-pushed the 03-19-ian_adding_k8_backend branch 2 times, most recently from 4a8394b to 1964392 on March 19, 2026 20:45
Base automatically changed from 03-18-ian_adding_customizable_idle_on_complete to main on March 19, 2026 20:51
@ianhodge force-pushed the 03-19-ian_adding_k8_backend branch from 1964392 to cbff910 on March 19, 2026 21:28
…robe, Helm fixes

- Replace 2s poll loop with Kubernetes Watch for Job and Pod status,
  with 30s safety-net fallback poll for watch disconnects
- Bound container log reads to 1 MiB (LimitBytes + io.LimitReader)
- Sort env vars for deterministic Pod specs
- Gate Events API calls behind pod failure signals (Pending/Failed only)
- Add exec liveness probe to Helm Deployment (kill -0 1)
- Fix ConfigMap and ServiceAccount template whitespace (use {{- trimming)
- Add watch verb to RBAC for jobs and pods
- Add tests for handleJobState, watch lifecycle, and pod watch events

Co-Authored-By: Oz <oz-agent@warp.dev>
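
The deterministic env handling from this commit can be sketched with simplified types (the real code operates on corev1.EnvVar; the merge semantics shown here are an assumption based on the bullet above):

```go
package main

import (
	"fmt"
	"sort"
)

// EnvVar is a simplified stand-in for corev1.EnvVar.
type EnvVar struct {
	Name, Value string
}

// mergeEnvVars overlays overrides onto base (last writer wins) and returns
// the result sorted by name, so the generated Pod spec is deterministic
// across worker restarts and spurious diffs are avoided.
func mergeEnvVars(base, overrides []EnvVar) []EnvVar {
	byName := map[string]string{}
	for _, e := range append(append([]EnvVar{}, base...), overrides...) {
		byName[e.Name] = e.Value
	}
	merged := make([]EnvVar, 0, len(byName))
	for name, value := range byName {
		merged = append(merged, EnvVar{Name: name, Value: value})
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].Name < merged[j].Name })
	return merged
}

func main() {
	base := []EnvVar{{"B", "1"}, {"A", "1"}}
	over := []EnvVar{{"A", "2"}}
	fmt.Println(mergeEnvVars(base, over)) // sorted by name, override wins
}
```
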
# Install ca-certificates for HTTPS connections
RUN apk --no-cache add ca-certificates
# Install ca-certificates for HTTPS connections and create a non-root runtime user
RUN apk --no-cache add ca-certificates \
Member Author

These changes are needed for the helm chart

@@ -0,0 +1,90 @@
# Keep this at 1 for a given worker.workerId. To run multiple workers, use distinct releases
Collaborator

If this can't really be changed, should we remove it as an option?

type KubernetesConfig struct {
	Namespace       string `yaml:"namespace"`
	Kubeconfig      string `yaml:"kubeconfig"`
	ImagePullSecret string `yaml:"image_pull_secret" validate:"omitempty,no_whitespace"`
Collaborator

Kubernetes has a PodTemplate construct that several of their built-in resource types use: https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates

We should try to be generally compatible with that - might even be possible to reuse the type definition here: https://pkg.go.dev/k8s.io/api/core/v1#PodTemplate

log.Debugf(ctx, "Using Kubernetes task image: %s", params.DockerImage)

baseEnv := envSliceFromMap(b.config.Env)
mainEnv := mergeEnvVars(params.EnvVars, append(baseEnv,
Collaborator

We should consider writing all the environment variables as key-value pairs in a secret, and having the pod reference those.

I don't think Kubernetes treats environment variables specified in the pod definition as sensitive, but we're including API keys and GitHub tokens in these.
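
A sketch of that suggestion: write the task env as a Secret, then have the Job's Pod template pull it in with envFrom rather than inlining values (names and the token value are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: task-env            # illustrative name
stringData:
  GITHUB_TOKEN: "<redacted>"
  OZ_API_KEY: "<redacted>"
---
# The task Pod template then references the Secret instead of inlining values:
apiVersion: v1
kind: Pod
metadata:
  name: task
spec:
  containers:
  - name: task
    image: registry.example.com/task:v1
    envFrom:
    - secretRef:
        name: task-env
```

Secrets are still only base64-encoded at rest unless encryption at rest is enabled, but this keeps tokens out of the Pod spec itself (and out of `kubectl describe pod` output).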


var initContainers []corev1.Container

for i, sidecar := range params.Sidecars {
Collaborator

It looks like the latest Kubernetes release has beta support for images as volumes: https://kubernetes.io/docs/tasks/configure-pod-container/image-volumes/

We probably can't use that just yet, but would be good to track - I imagine it's faster than us manually materializing each sidecar.

Member Author

Good call — tracking this. K8s image volumes (beta in 1.33+, KEP-4639) would let us mount sidecar images directly as read-only volumes without the tar+emptyDir dance. That'd be faster and eliminate the root init container requirement. Adding a TODO comment in the sidecar materialization code. We can switch to it once the feature goes GA and our minimum supported K8s version includes it.
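
For reference, the image-volume shape (beta; requires the ImageVolume feature gate and a supporting runtime) looks roughly like this, per the Kubernetes docs linked above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: task
spec:
  containers:
  - name: task
    image: registry.example.com/task:v1
    volumeMounts:
    - name: sidecar
      mountPath: /sidecar
      readOnly: true
  volumes:
  - name: sidecar
    image:                      # OCI image mounted directly as a read-only volume
      reference: registry.example.com/sidecar:v1
      pullPolicy: IfNotPresent
```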

if _, err := b.clientset.BatchV1().Jobs(b.config.Namespace).Create(ctx, job, metav1.CreateOptions{
	DryRun: []string{metav1.DryRunAll},
}); err != nil {
	return fmt.Errorf("kubernetes startup preflight failed: the kubernetes backend requires creating task Jobs with a root init container for sidecar materialization; verify service account/RBAC and Pod Security or admission policy for namespace %q: %w", b.config.Namespace, err)
Collaborator

OOC why does the sidecar init container need to run as root? So that it can copy files into the root directory of the container?

Member Author

Yes — the sidecar init container runs as root because it tars the entire sidecar image filesystem from / into an emptyDir volume. Files inside the sidecar image may be owned by root (or have restrictive permissions), so a non-root tar would fail with permission errors on those files. The --no-same-owner --no-same-permissions flags on the extract side ensure the materialized files in the emptyDir are accessible to the non-root task container.

"fi",
"/bin/sh /agent/entrypoint.sh \"$@\"",
"status=$?",
"if [ -n \"$OZ_TEARDOWN_COMMAND\" ]; then",
Collaborator

Do we need this if it's also running as a lifecycle hook?

- Add pod_template field accepting raw corev1.PodSpec YAML in config,
  enabling standard K8s syntax for scheduling, resources, and env vars
  with valueFrom.secretKeyRef support for Kubernetes Secrets
- Worker merges its required containers/volumes/env into user-provided
  PodSpec template; finds or creates 'task' container
- Validate that pod_template and legacy fields are mutually exclusive
- Remove redundant teardown from wrapper script (keep only preStop hook)
- Hardcode replicas: 1 in Helm Deployment (cannot safely be >1)
- Add TODO tracking K8s image volumes (KEP-4639) as future replacement
  for tar-based sidecar materialization
- Replied to Bnavetta's comments on root init container and image volumes
- Update README with pod_template examples and Secret references docs

Co-Authored-By: Oz <oz-agent@warp.dev>
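
The "finds or creates 'task' container" merge step in this commit can be sketched with simplified types (the real code operates on corev1.PodSpec; the overlay rule shown is an assumption based on the bullet above):

```go
package main

import "fmt"

// Container is a simplified stand-in for corev1.Container.
type Container struct {
	Name  string
	Image string
}

// ensureTaskContainer finds the container named "task" in a user-provided
// pod template, creating it if absent, and overlays the worker-required
// image. It returns the updated slice and the task container's index.
func ensureTaskContainer(containers []Container, image string) ([]Container, int) {
	for i := range containers {
		if containers[i].Name == "task" {
			containers[i].Image = image // worker-required fields win over the template
			return containers, i
		}
	}
	return append(containers, Container{Name: "task", Image: image}), len(containers)
}

func main() {
	// User template already declares a task container: it is reused in place.
	tpl := []Container{{Name: "task", Image: "user/image"}}
	tpl, i := ensureTaskContainer(tpl, "worker/required:v1")
	fmt.Println(i, tpl[i].Image)

	// No task container in the template: one is appended.
	tpl2, j := ensureTaskContainer(nil, "worker/required:v1")
	fmt.Println(j, len(tpl2))
}
```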