CP-37207: Implement configurable enforce flag for validator diagnostics by evan-cz · Pull Request #646 · Cloudzero/cloudzero-agent

evan-cz · 2026-01-28T16:29:07Z

The validator's pre-start diagnostic stage had a hardcoded enforce: true setting that wasn't actually wired up to affect behavior. This made it impossible for users to control whether diagnostic failures should block pod startup.

Additionally, when the diagnostic runner encountered errors (distinct from check failures), it would cause the validator to fail even when enforcement was disabled, effectively making enforce: false meaningless in error scenarios.

Functional Change:

Before: The enforce setting in validator config was vestigial - diagnostic failures always logged warnings but never blocked pod startup. Runner errors would crash the validator regardless of enforce setting.

After: When enforce: true, failing pre-start checks cause the validator to exit with error code 1, blocking pod startup via the lifecycle hook. When enforce: false (now the default), failures are logged and reported via telemetry but the pod starts normally. Runner errors are handled gracefully when enforcement is disabled.

Solution:

Added components.validator.enforce to values.yaml with default false
- Includes documentation explaining behavior for both settings
- Updated JSON schema with enforce boolean property
Updated helm/templates/validator-cm.yaml to use the configurable value
- Pre-start stage now uses {{ .Values.components.validator.enforce }}
- Other stages remain hardcoded to enforce: false
Enhanced diagnostic runner (app/domain/diagnostic/runner/runner.go):
- Added enforce and hasFailures fields to runner struct
- NewRunner() captures enforce setting from stage config
- Added ShouldFail() method: returns true only when enforce=true AND checks failed
- Added IsEnforced() method: exposes enforce state for error handling
Modified command.go to implement enforcement behavior:
- After Run(), checks engine.ShouldFail() to determine exit behavior
- When enforce=true and checks fail: logs failures and returns error
- When enforce=false and runner errors: warns and continues (instead of Fatal)

Validation:

Added 5 new test functions to runner_test.go (coverage: 69.5% -> 86.7%):
- TestRunner_ShouldFail: verifies enforce+hasFailures logic
- TestRunner_EnforceSetFromStageConfig: verifies config parsing
- TestRunner_HasFailuresTracking: verifies failure detection
- TestRunner_ShouldFailIntegration: end-to-end config->behavior test
- TestRunner_IsEnforced: verifies IsEnforced() method
Added helm/tests/validator_enforce_test.yaml with 6 test cases:
- Default value (false), explicit true/false, other stages unaffected
Added 3 schema validation test files:
- components.validator.enforce.true.pass.yaml
- components.validator.enforce.false.pass.yaml
- components.validator.enforce.invalid.fail.yaml
Deployed to Brahms cluster and verified:
- enforce=true + invalid API key: pod enters Init:CrashLoopBackOff (expected)
- enforce=false + invalid API key: pod starts normally with warnings (expected)

Customer reported pods entering CrashLoopBackOff with FailedPostStartHook events when deploying the CloudZero agent. Investigation revealed the validator's postStart hook was attempting to reach a webhook service that didn't exist, causing DNS lookup failures and ~70 second delays due to retry logic. Functional Change: Before: The validator ConfigMap referenced `cloudzero-agent-cz-webhook-svc` for the webhook service, but the actual service was named `cloudzero-agent-cz-webhook` (no `-svc` suffix). This caused the webhook_server_reachable check to fail on every deployment, blocking startup for ~70 seconds while retries exhausted. After: The validator ConfigMap correctly references the webhook service using the same helper function as the service definition, ensuring names always match. Root Cause: The validator-cm.yaml template (line 45) was introduced in commit 90e1bce (April 2025) with a hardcoded `-svc` suffix that never matched the actual service name: ```yaml insights_service: {{ include "cloudzero-agent.insightsController.server.webhookFullname" . }}-svc ``` The webhook service in webhook-service.yaml uses: ```yaml name: {{ include "cloudzero-agent.serviceName" . }} ``` Both helpers resolve to the same base name (`release-cz-webhook`), but the validator template erroneously appended `-svc`, causing DNS lookup failures. The bug went unnoticed because: 1. The enforce flag for post-start stage is `false`, so failures don't crash pods 2. The check eventually times out after ~70 seconds and returns nil 3. Federated mode deployments skip the webhook check entirely 4. Warning-level logs were easily missed Solution: 1. Changed validator-cm.yaml line 45 to use the correct helper without suffix: `insights_service: {{ include "cloudzero-agent.serviceName" . }}` 2. Added regression test (helm/tests/validator_insights_service_test.yaml) with 5 test cases verifying: - insights_service matches expected pattern with default release name - webhook service name matches the same pattern - insights_service matches with custom release names - insights_service does NOT contain `-svc` suffix (regression guard) Validation: - All tests pass, including new ones. - Manual verification: `helm template test-release ./helm --set apiKey=test-key` shows `insights_service: test-release-cz-webhook` (no `-svc` suffix) - No new test failures introduced (pre-existing failures unrelated to this change)

The validator's pre-start diagnostic stage had a hardcoded `enforce: true` setting that wasn't actually wired up to affect behavior. This made it impossible for users to control whether diagnostic failures should block pod startup. Additionally, when the diagnostic runner encountered errors (distinct from check failures), it would cause the validator to fail even when enforcement was disabled, effectively making `enforce: false` meaningless in error scenarios. Functional Change: Before: The `enforce` setting in validator config was vestigial - diagnostic failures always logged warnings but never blocked pod startup. Runner errors would crash the validator regardless of enforce setting. After: When `enforce: true`, failing pre-start checks cause the validator to exit with error code 1, blocking pod startup via the lifecycle hook. When `enforce: false` (now the default), failures are logged and reported via telemetry but the pod starts normally. Runner errors are handled gracefully when enforcement is disabled. Solution: 1. Added `components.validator.enforce` to values.yaml with default `false` - Includes documentation explaining behavior for both settings - Updated JSON schema with enforce boolean property 2. Updated helm/templates/validator-cm.yaml to use the configurable value - Pre-start stage now uses `{{ .Values.components.validator.enforce }}` - Other stages remain hardcoded to `enforce: false` 3. Enhanced diagnostic runner (app/domain/diagnostic/runner/runner.go): - Added `enforce` and `hasFailures` fields to runner struct - `NewRunner()` captures enforce setting from stage config - Added `ShouldFail()` method: returns true only when enforce=true AND checks failed - Added `IsEnforced()` method: exposes enforce state for error handling 4. Modified command.go to implement enforcement behavior: - After `Run()`, checks `engine.ShouldFail()` to determine exit behavior - When enforce=true and checks fail: logs failures and returns error - When enforce=false and runner errors: warns and continues Validation: - Added 5 new test functions to runner_test.go (coverage: 69.5% -> 86.7%): - TestRunner_ShouldFail: verifies enforce+hasFailures logic - TestRunner_EnforceSetFromStageConfig: verifies config parsing - TestRunner_HasFailuresTracking: verifies failure detection - TestRunner_ShouldFailIntegration: end-to-end config->behavior test - TestRunner_IsEnforced: verifies IsEnforced() method - Added helm/tests/validator_enforce_test.yaml with 6 test cases: - Default value (false), explicit true/false, other stages unaffected - Added 3 schema validation test files: - components.validator.enforce.true.pass.yaml - components.validator.enforce.false.pass.yaml - components.validator.enforce.invalid.fail.yaml - Deployed to Brahms cluster and verified: - enforce=true + invalid API key: pod enters Init:CrashLoopBackOff (expected) - enforce=false + invalid API key: pod starts normally with warnings (expected)

evan-cz · 2026-02-04T16:36:33Z

Closing in favor of #652

evan-cz requested a review from a team as a code owner January 28, 2026 16:29

evan-cz force-pushed the CP-37207 branch from 5ed79b7 to 74508e8 Compare January 28, 2026 16:38

evan-cz marked this pull request as draft January 28, 2026 16:41

evan-cz force-pushed the CP-37207 branch from 74508e8 to 283933e Compare January 28, 2026 16:47

evan-cz marked this pull request as ready for review February 2, 2026 15:13

josephbarnett approved these changes Feb 2, 2026

View reviewed changes

Base automatically changed from CP-37198 to develop February 3, 2026 18:56

evan-cz closed this Feb 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CP-37207: Implement configurable enforce flag for validator diagnostics#646

CP-37207: Implement configurable enforce flag for validator diagnostics#646
evan-cz wants to merge 2 commits intodevelopfrom
CP-37207

evan-cz commented Jan 28, 2026

Uh oh!

evan-cz commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

evan-cz commented Jan 28, 2026

Uh oh!

evan-cz commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants