Conversation
…robe, Helm fixes
- Replace 2s poll loop with Kubernetes Watch for Job and Pod status,
with 30s safety-net fallback poll for watch disconnects
- Bound container log reads to 1 MiB (LimitBytes + io.LimitReader)
- Sort env vars for deterministic Pod specs
- Gate Events API calls behind pod failure signals (Pending/Failed only)
- Add exec liveness probe to Helm Deployment (kill -0 1)
- Fix ConfigMap and ServiceAccount template whitespace (use {{- trimming)
- Add watch verb to RBAC for jobs and pods
- Add tests for handleJobState, watch lifecycle, and pod watch events
Co-Authored-By: Oz <oz-agent@warp.dev>
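The deterministic env-var ordering mentioned above can be sketched as a self-contained Go helper. This is illustrative only: the local `EnvVar` struct stands in for `corev1.EnvVar`, and `envSliceFromMap` is the helper name that appears later in the diff, reconstructed here from context:

```go
package main

import (
	"fmt"
	"sort"
)

// EnvVar is a stand-in for corev1.EnvVar in this sketch.
type EnvVar struct {
	Name, Value string
}

// envSliceFromMap converts a config env map into a name-sorted slice,
// so repeated renders of the same config yield byte-identical Pod specs
// (Go map iteration order is randomized, so sorting is required).
func envSliceFromMap(m map[string]string) []EnvVar {
	out := make([]EnvVar, 0, len(m))
	for k, v := range m {
		out = append(out, EnvVar{Name: k, Value: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	for _, e := range envSliceFromMap(map[string]string{"B_VAR": "2", "A_VAR": "1"}) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Without the sort, two renders of the same config could produce Pod specs that differ only in env order, which shows up as spurious diffs when comparing generated Jobs.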
```diff
-# Install ca-certificates for HTTPS connections
-RUN apk --no-cache add ca-certificates
+# Install ca-certificates for HTTPS connections and create a non-root runtime user
+RUN apk --no-cache add ca-certificates \
```
These changes are needed for the helm chart
charts/oz-agent-worker/values.yaml
```diff
@@ -0,0 +1,90 @@
+# Keep this at 1 for a given worker.workerId. To run multiple workers, use distinct releases
```
If this can't really be changed, should we remove it as an option?
```go
type KubernetesConfig struct {
	Namespace       string `yaml:"namespace"`
	Kubeconfig      string `yaml:"kubeconfig"`
	ImagePullSecret string `yaml:"image_pull_secret" validate:"omitempty,no_whitespace"`
```
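For orientation, a minimal backend config exercising these fields might look like the following. Values are illustrative, and the fallback-to-in-cluster behavior when `kubeconfig` is omitted is an assumption, not confirmed by the diff:

```yaml
backend:
  kubernetes:
    namespace: agents
    kubeconfig: /etc/oz/kubeconfig   # assumed optional; presumably falls back to in-cluster config
    image_pull_secret: regcred
```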
Kubernetes has a PodTemplate construct that several of their built-in resource types use: https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
We should try to be generally compatible with that - might even be possible to reuse the type definition here: https://pkg.go.dev/k8s.io/api/core/v1#PodTemplate
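One possible shape for that, sketched here as a hypothetical `pod_template` config field that embeds a `corev1.PodSpec`-style structure (field names and values are illustrative, not a settled design):

```yaml
backend:
  kubernetes:
    namespace: agents
    pod_template:                  # parsed as a corev1.PodSpec
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
        - name: task               # the worker would fill in image/command here
          resources:
            limits:
              memory: 2Gi
```

Reusing the upstream type means users can paste standard Kubernetes YAML (tolerations, affinity, securityContext, etc.) without the worker having to mirror each field in its own config schema.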
```go
log.Debugf(ctx, "Using Kubernetes task image: %s", params.DockerImage)

baseEnv := envSliceFromMap(b.config.Env)
mainEnv := mergeEnvVars(params.EnvVars, append(baseEnv,
```
We should consider writing all the environment variables as key-value pairs in a secret, and having the pod reference those.
I don't think Kubernetes treats environment variables specified in the pod definition as sensitive, but we're including API keys and GitHub tokens in these.
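One way to do that: write the merged env map into a short-lived Secret per task Job and have the pod reference it via `envFrom`. A sketch, with illustrative names:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: task-abc123-env       # illustrative; one Secret per task Job
stringData:
  GITHUB_TOKEN: "<redacted>"
  OZ_API_KEY: "<redacted>"
---
# Container fragment in the Job's pod template:
containers:
  - name: task
    envFrom:
      - secretRef:
          name: task-abc123-env
```

This keeps the sensitive values out of the Job/Pod spec itself (which is visible to anyone with `get pods`), moving them behind Secret RBAC instead; the worker would also need to garbage-collect the Secret with the Job, e.g. via an ownerReference.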
```go
var initContainers []corev1.Container

for i, sidecar := range params.Sidecars {
```
It looks like the latest Kubernetes release has beta support for images as volumes: https://kubernetes.io/docs/tasks/configure-pod-container/image-volumes/
We probably can't use that just yet, but would be good to track - I imagine it's faster than us manually materializing each sidecar.
Good call — tracking this. K8s image volumes (beta in 1.33+, KEP-4639) would let us mount sidecar images directly as read-only volumes without the tar+emptyDir dance. That'd be faster and eliminate the root init container requirement. Adding a TODO comment in the sidecar materialization code. We can switch to it once the feature goes GA and our minimum supported K8s version includes it.
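For reference, the beta image-volume shape (KEP-4639, gated behind the `ImageVolume` feature gate) looks roughly like this; image names are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: task-with-sidecar
spec:
  containers:
    - name: task
      image: registry.example.com/task:v1
      volumeMounts:
        - name: sidecar-tools
          mountPath: /sidecars/tools    # image volumes are mounted read-only
  volumes:
    - name: sidecar-tools
      image:
        reference: registry.example.com/sidecar:v1
        pullPolicy: IfNotPresent
```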
```go
if _, err := b.clientset.BatchV1().Jobs(b.config.Namespace).Create(ctx, job, metav1.CreateOptions{
	DryRun: []string{metav1.DryRunAll},
}); err != nil {
	return fmt.Errorf("kubernetes startup preflight failed: the kubernetes backend requires creating task Jobs with a root init container for sidecar materialization; verify service account/RBAC and Pod Security or admission policy for namespace %q: %w", b.config.Namespace, err)
```
OOC why does the sidecar init container need to run as root? So that it can copy files into the root directory of the container?
Yes — the sidecar init container runs as root because it tars the entire sidecar image filesystem from / into an emptyDir volume. Files inside the sidecar image may be owned by root (or have restrictive permissions), so a non-root tar would fail with permission errors on those files. The --no-same-owner --no-same-permissions flags on the extract side ensure the materialized files in the emptyDir are accessible to the non-root task container.
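Concretely, the materialization step is roughly this shape; this is a sketch of the generated init container, not the exact spec from the diff (the container/volume names and tar excludes are illustrative):

```yaml
initContainers:
  - name: materialize-sidecar-0
    image: registry.example.com/sidecar:v1   # the sidecar image itself
    securityContext:
      runAsUser: 0                           # root: sidecar files may be root-owned
    command: ["sh", "-c"]
    args:
      # Stream the sidecar image's filesystem into the shared emptyDir,
      # dropping ownership/permissions on extract so the non-root task
      # container can read the materialized files.
      - tar -cf - -C / --exclude=./sidecar --exclude=./proc --exclude=./sys .
        | tar -xf - --no-same-owner --no-same-permissions -C /sidecar
    volumeMounts:
      - name: sidecar-0
        mountPath: /sidecar
```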
internal/worker/kubernetes.go
```go
"fi",
"/bin/sh /agent/entrypoint.sh \"$@\"",
"status=$?",
"if [ -n \"$OZ_TEARDOWN_COMMAND\" ]; then",
```
Do we need this if it's also running as a lifecycle hook?
- Add pod_template field accepting raw corev1.PodSpec YAML in config, enabling standard K8s syntax for scheduling, resources, and env vars with valueFrom.secretKeyRef support for Kubernetes Secrets
- Worker merges its required containers/volumes/env into user-provided PodSpec template; finds or creates 'task' container
- Validate that pod_template and legacy fields are mutually exclusive
- Remove redundant teardown from wrapper script (keep only preStop hook)
- Hardcode replicas: 1 in Helm Deployment (cannot safely be >1)
- Add TODO tracking K8s image volumes (KEP-4639) as future replacement for tar-based sidecar materialization
- Replied to Bnavetta's comments on root init container and image volumes
- Update README with pod_template examples and Secret references docs

Co-Authored-By: Oz <oz-agent@warp.dev>
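The preStop-only teardown described in that commit maps onto a container lifecycle hook, roughly as follows. This is a sketch: the `OZ_TEARDOWN_COMMAND` variable name comes from the wrapper script quoted earlier, but the exact hook wiring is assumed:

```yaml
containers:
  - name: task
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - '[ -z "$OZ_TEARDOWN_COMMAND" ] || /bin/sh -c "$OZ_TEARDOWN_COMMAND"'
```

Keeping teardown only in the hook avoids running it twice (once in the wrapper, once in preStop) when the pod is deleted, at the cost of relying on `terminationGracePeriodSeconds` being long enough for the hook to finish.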

Summary
This PR adds a Kubernetes execution backend to `oz-agent-worker` and includes the deployment and hardening work needed to make that backend practical to run in a customer Kubernetes environment.

At a high level, the worker can now execute tasks by creating Kubernetes Jobs instead of running them via Docker or the direct backend. The PR also adds a namespace-scoped Helm chart, updates the docs for customer deployment, and tightens the production path with CI coverage, safer chart defaults, and runtime/container hardening.

What changed
Kubernetes backend
- `internal/worker/kubernetes.go`
- `backend.kubernetes.*` in `internal/config/config.go` and `main.go`
- Job / Pod in a target namespace
- `preflight_image`

Helm chart
- `charts/oz-agent-worker`
- `Deployment`, `ServiceAccount`, `Role`/`RoleBinding`, `ConfigMap`, `Secret`
- `backend.kubernetes.service_account`
- `image.tag` so installs pin a worker image rather than defaulting to `latest`
- `kubernetesBackend.preflightImage` so restricted clusters can override the startup preflight image

CI / packaging / docs
- `go.mod`
- `go test ./...`
- `.gitignore` so the top-level binary is ignored without accidentally ignoring `charts/oz-agent-worker/**`
- `Dockerfile` to run the worker as a non-root user on a pinned Alpine base image
- `README.md` with:

Operational notes
- `Job`
- `replicaCount=1` for a given `worker.workerId`; scale by creating multiple releases with distinct worker IDs instead of scaling a single release horizontally
- `preflight_image` / `kubernetesBackend.preflightImage` to an allowlisted image

Validation
- `gofmt -w` on modified Go files
- `go test ./...`
- `go build ./...`
- `helm lint charts/oz-agent-worker --set worker.workerId=my-worker --set image.tag=v1.2.3`
- `helm template oz-agent-worker charts/oz-agent-worker --namespace agents --set worker.workerId=my-worker --set image.tag=v1.2.3`
- `helm lint` + `helm template` again with richer override values to exercise optional chart branches including secret creation, annotations, node selectors, tolerations, resources, setup/teardown hooks, environment entries, and `kubernetesBackend.preflightImage`
- `docker build` to verify the hardened runtime image still builds successfully

Reviewer notes
The highest-risk / highest-value areas to review are:

- `internal/worker/kubernetes.go` for job lifecycle, startup preflight behavior, and failure detection
- `main.go` + `internal/config/config.go` for config merge / validation behavior
- `charts/oz-agent-worker/*` for install ergonomics and namespaced deployment assumptions
- `README.md` for customer-facing deployment guidance and caveats

Artifacts
Co-Authored-By: Oz <oz-agent@warp.dev>