OCPBUGS-76334: Add TLS 1.3 (Modern profile) support to TestTLSDefaults#30746
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
Scheduling required tests: |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-tls-13 |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
@wangke19: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/44c09590-01e0-11f1-99ae-f7f91100ef91-0 |
|
CI job https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30746-nightly-4.22-e2e-aws-ovn-tls-13/2019074402091536384: |
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
@wangke19: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ad6ed650-01fa-11f1-880d-126cb2da3d50-0 |
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
@wangke19: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3fb34a50-02a5-11f1-8110-7108862631ab-0 |
|
/retitle [WIP]OCPBUGS-76334: Add TLS 1.3 (Modern profile) support to TestTLSDefaults |
|
@wangke19: This pull request references Jira Issue OCPBUGS-76334, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Test Failure Analysis:
|
| Run ID | Ungraceful Terminations | TestTLSDefaults Execution | Time Gap |
|---|---|---|---|
| 2019122669349244928 | 19:56:19, 20:04:07 | 21:38:50 ✅ PASSED (5.3s) | 1h 34m - 1h 42m |
| 2019429245096300544 | 16:24:22 | 17:47:18 ✅ PASSED (2.3s) | 1h 23m |
| 2019074402091536384 | 16:48:52, 16:52:45, 16:57:14 | 18:11:20 ✅ PASSED (2.7s) | 1h 14m - 1h 23m |
Pattern: All ungraceful terminations occur during cluster bootstrap (16:00-20:00 range), well before test execution begins (18:11-21:38 range).
Detailed Timeline (Run 1 Example)
Timeline for run 2019122669349244928:
19:56:19 🔴 Pod kube-apiserver-ip-10-0-81-175 started (ungraceful termination detected)
20:04:07 🔴 Pod kube-apiserver-ip-10-0-76-239 started (ungraceful termination detected)
⏱️ [~1h 34m gap - cluster stabilization phase]
21:38:50 ✅ TestTLSDefaults starts (THIS PR)
21:38:55 ✅ TestTLSDefaults ends - PASSED
21:39:48 ✅ TestTLSMinimumVersions starts (pre-existing, also tests TLS 1.3)
21:39:59 ✅ TestTLSMinimumVersions ends - PASSED
22:03:01 ❌ Graceful termination test runs
22:03:04 ❌ Test FAILS - detects ungraceful terminations from 19:56 and 20:04
From the test failure output:
fail [github.com/openshift/origin/test/extended/apiserver/graceful_termination.go:89]:
The following API Servers weren't gracefully terminated:
kube-apiserver on node ip-10-0-76-239.us-west-2.compute.internal wasn't gracefully terminated,
reason: Previous pod kube-apiserver-ip-10-0-76-239.us-west-2.compute.internal started at
2026-02-04 20:04:07.028765438 +0000 UTC did not terminate gracefully
Key Findings
- ✅ TestTLSDefaults always passes in all 3 runs (5.3s, 2.3s, 2.7s)
- ✅ Temporal impossibility - failures occur 1-1.5 hours BEFORE the test runs
- ✅ TestTLSMinimumVersions pre-existed - TLS 1.3 connection testing was already happening (11.3s runtime, more intensive than TestTLSDefaults)
- ✅ Consistent pattern - all 3 runs show identical behavior: bootstrap failures detected later
- ❌ Real infrastructure issue - kube-apiserver pods are terminated ungracefully during cluster initialization
Root Cause: TLS 1.3 Cluster Bootstrap Issue
The nightly-4.22-e2e-aws-ovn-tls-13 CI job has a pre-existing cluster initialization problem where kube-apiserver pods don't terminate gracefully during bootstrap. This happens during:
- Initial cluster provisioning (16:00-20:00 time range)
- Before any e2e tests execute (tests start ~18:00-21:00)
- Likely causes: kubelet/CRI-O not respecting graceful termination during bootstrap, TLS configuration triggering rapid restarts, or resource pressure
Why this appears on this PR: The PR triggers the TLS 1.3 CI job via /payload-job command, which exposes the existing infrastructure issue. This job may not run frequently enough on main branch to detect the problem consistently.
Why This PR is Not the Cause
- Code changes are test-only - only modifies
TestTLSDefaultsto support TLS 1.3 instead of skipping - Test is read-only - no cluster modifications, only verification via port-forward and TLS connections
- Pre-existing TLS testing -
TestTLSMinimumVersionsalready performed similar (more intensive) TLS connection testing - Test passes successfully - if the test code was problematic, it would fail or cause issues during its execution, not 1.5 hours earlier
Recommendation
This PR can proceed to merge. The test failure is detecting a real cluster infrastructure problem that exists independently of this PR's changes.
Separate issue tracking: I recommend filing a bug against the TLS 1.3 CI environment for the cluster bootstrap ungraceful termination issue. The failure is legitimate - it's just not caused by this PR.
Related Tests
For reference, both TLS tests run successfully:
- TestTLSDefaults (this PR): Tests TLS versions and cipher suites via port-forward, ~5s runtime
- TestTLSMinimumVersions (pre-existing): Tests TLS versions across 7 components, ~11s runtime
Both tests work correctly and pass. The graceful termination test is designed to detect any ungraceful terminations that occurred at any point during cluster lifetime, including initialization phase.
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
@wangke19: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
Scheduling required tests: |
|
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
@wangke19: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/pj-rehearse periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-tls-13 |
|
@wangke19: trigger 14 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8e1bcf80-1e45-11f1-9673-c3ae2c377049-0 |
|
/retest |
Payload Test Results:
|
| Job | Result |
|---|---|
periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node |
✅ PASSED |
periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips |
✅ PASSED |
periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade |
✅ PASSED |
periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade |
✅ PASSED |
periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance |
✅ PASSED |
periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview |
✅ PASSED |
periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips-no-nat-instance |
✅ PASSED |
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4 |
✅ PASSED |
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6 |
✅ PASSED |
⚪ Jobs That Did NOT Run This Test (5) — Expected
The 5 serial-suite jobs (nightly-4.22-e2e-aws-ovn-serial-1of2, -2of2, ci-4.22-e2e-aws-ovn-techpreview-serial-1of3, -2of3, -3of3) do not include [Suite:openshift/conformance/parallel] tests by design.
All 9 applicable jobs passed ✅
|
@wangke19: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
permafailing e2e-aws-ovn-microshift is a known issue: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1773225998780929 |
|
/verfied by CI jobs |
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: petr-muller, sanchezl, wangke19 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by CI test |
|
@wangke19: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
5c29d3a
into
openshift:main
|
@wangke19: Jira Issue Verification Checks: Jira Issue OCPBUGS-76334 Jira Issue OCPBUGS-76334 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
The replace directive pointing to github.com/wangke19/filepath-securejoin-1 was added in PR openshift#30746 as a workaround for opencontainers/runc v1.2.5 calling the removed MkdirAllHandle function from filepath-securejoin v0.6.0. However, go mod why shows that opencontainers/runc is not actually needed by the origin module: # github.com/opencontainers/runc (main module does not need package github.com/opencontainers/runc) Since runc is only a transitive dependency (not directly used), and the compatibility shim in the fork is not needed, we can safely remove the personal fork and use the official filepath-securejoin v0.6.0. This eliminates the supply chain risk of depending on a personal fork and follows Go module best practices. Note: PR openshift#30866 (K8s 1.35 rebase) will also remove this when it merges, as the rebase removes opencontainers/runc entirely from the dependency graph. This PR provides immediate cleanup for the current main branch.
Personal Fork Status & Migration PlanThis PR currently uses a personal fork to work around a dependency issue: github.com/cyphar/filepath-securejoin => github.com/wangke19/filepath-securejoin-1 v0.6.2-0.20260309113322-d8b3e820b3d6Why the Personal Fork is NeededThe library-go bump in this PR transitively upgraded
Proof: PR #30889 attempted to remove the personal fork based on misleading Follow-up: Organization Fork MigrationThe personal fork replace directive is functionally necessary but creates supply chain risks. Action item: Created issue #30890 to track migrating to an organization fork (
Potential ObsolescenceNote: PR #30866 is rebasing to K8s 1.35. If that rebase removes Long-term: This should be removed once |
This commit removes the dependency on github.com/openshift/library-go/pkg/crypto from the TLS test by creating a local tlsutil package that provides the same functionality without requiring the library-go bump. The test only used simple utility functions from library-go: - DefaultTLSVersion() - Returns tls.VersionTLS12 - ValidTLSVersions() - Returns list of TLS version names - TLSVersionOrDie(name) - Converts version name to uint16 - DefaultCiphers() - Returns list of default cipher suites - ValidCipherSuites() - Returns list of cipher suite names - CipherSuite(name) - Converts cipher suite name to uint16 All of these are simple lookup/conversion functions with no transitive dependencies beyond crypto/tls standard library. By extracting them into test/extended/apiserver/tlsutil/, we can avoid the library-go bump that was introduced in the original PR openshift#30746. This approach: - Keeps the valuable TLS 1.3 test improvements - Avoids unnecessary dependency updates - Maintains test stability - Removes the coupling between test logic and library-go versions The personal fork of filepath-securejoin is kept as it's required for runc compatibility and will be addressed separately in issue #30890.
This commit reverts the library-go version from v0.0.0-20260223145824-7b234b47a906 back to v0.0.0-20251015151611-6fc7a74b67c5, removing the unnecessary dependency update that was bundled with the TLS 1.3 test fixes. The library-go bump was not required for the test functionality, as the test only used simple static lookup functions. By using local crypto helpers instead, we can avoid this dependency update and its associated supply chain implications. Updated: - go.mod: Reverted library-go to v0.0.0-20251015151611-6fc7a74b67c5 - go.sum: Updated checksums - vendor/: Synced with go.mod changes - Updated TODO comment in replace directive to reference issue #30890 The personal fork replace directive for filepath-securejoin is retained as it's required for runc v1.2.5 compatibility with filepath-securejoin v0.6.0, which is pulled in transitively by opencontainers/selinux@v1.13.0.
This commit removes the dependency on github.com/openshift/library-go/pkg/crypto from the TLS test by creating a local tlsutil package that provides the same functionality without requiring the library-go bump. The test only used simple utility functions from library-go: - DefaultTLSVersion() - Returns tls.VersionTLS12 - ValidTLSVersions() - Returns list of TLS version names - TLSVersionOrDie(name) - Converts version name to uint16 - DefaultCiphers() - Returns list of default cipher suites - ValidCipherSuites() - Returns list of cipher suite names - CipherSuite(name) - Converts cipher suite name to uint16 All of these are simple lookup/conversion functions with no transitive dependencies beyond crypto/tls standard library. By extracting them into test/extended/apiserver/tlsutil/, we can avoid the library-go bump that was introduced in the original PR openshift#30746. This approach: - Keeps the valuable TLS 1.3 test improvements - Avoids unnecessary dependency updates - Maintains test stability - Removes the coupling between test logic and library-go versions The personal fork of filepath-securejoin is kept as it's required for runc compatibility and will be addressed separately in issue #30890.
This commit reverts the library-go version from v0.0.0-20260223145824-7b234b47a906 back to v0.0.0-20251015151611-6fc7a74b67c5, removing the unnecessary dependency update that was bundled with the TLS 1.3 test fixes. The library-go bump was not required for the test functionality, as the test only used simple static lookup functions. By using local crypto helpers instead, we can avoid this dependency update and its associated supply chain implications. Updated: - go.mod: Reverted library-go to v0.0.0-20251015151611-6fc7a74b67c5 - go.sum: Updated checksums - vendor/: Synced with go.mod changes - Updated TODO comment in replace directive to reference issue #30890 The personal fork replace directive for filepath-securejoin is retained as it's required for runc v1.2.5 compatibility with filepath-securejoin v0.6.0, which is pulled in transitively by opencontainers/selinux@v1.13.0.
…faults-support" This reverts commit 5c29d3af01fbee9cc0c6f748a8d2f1aac5c80a8e. Reason for revert: PR openshift#30746 is causing test instability in 4.22 payloads. Multiple test failures have been observed in periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade jobs as reported in #forum-testplatform Slack channel. Example failures: - https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-upgrade-4.21-to-4.22-ci-openshift-release-analysis-aggregator/2033742335904321536 - https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade/2033742330044878848 - https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade/2033742330229428224 - https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade/2033742330355257344 The PR bundled test logic improvements with an unnecessary library-go dependency bump, which has introduced instability. A follow-up PR (openshift#30894) will reintroduce the test improvements without the dependency changes. This revert is needed to: 1. Restore test stability in 4.22 CI 2. Separate test logic from dependency management concerns 3. Allow proper evaluation of the test improvements in isolation Related: - Original PR: openshift#30746 - Replacement PR: openshift#30894 - JIRA: OCPBUGS-76334 - Slack: #forum-testplatform discussion on 2026-03-17
CNTRLPLANE-2999:Revert PR #30746: TLS 1.3 test causing instability in 4.22
…-go dependency This commit re-adds the TLS 1.3 (Modern profile) test improvements from PR openshift#30746 but uses a local tlsutil package instead of depending on library-go/pkg/crypto. Changes from original PR openshift#30746: - Add support for testing Modern TLS profile (TLS 1.3 only) - Add FIPS mode detection for ChaCha20-Poly1305 cipher handling - Add ECDSA cipher suite handling (expect failure with RSA keys) - Improve test logging and error messages Changes from original implementation: - Created test/extended/apiserver/tlsutil/ package with crypto helpers - Replaced library-go/pkg/crypto dependency with local tlsutil - No dependency on library-go version bump The test now properly validates both Intermediate (TLS 1.2+) and Modern (TLS 1.3 only) TLS profiles, with appropriate handling for FIPS mode restrictions and RSA vs ECDSA cipher compatibility.
Summary
This PR adds TLS 1.3 (Modern profile) support to
TestTLSDefaultsand fixes critical test reliability issues.Changes
1. Add TLS 1.3 (Modern profile) support
minTLSVersionvariable based on the profile2. Fix cipher suite testing for Intermediate profile
isTLS13Cipher()helper function to identify TLS 1.3-specific cipher suitesCipherSuitesMaxVersion: tls.VersionTLS12always fails with "handshake failure"3. Skip ECDSA cipher suites (RSA-only server keys)
library-go'sNewKeyPair()->newRSAKeyPair()->rsa.GenerateKey())TLS_ECDHE_ECDSA_WITH_*) require an ECDSA server key to negotiate -- they always fail the TLS handshake against an RSA-keyed server4. Skip ChaCha20-Poly1305 ciphers in FIPS mode
exutil.IsFIPS()(readscluster-config-v1ConfigMap fromkube-system)TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305andTLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305(ECDSA variant already handled by Added etcd folder into .gitignore #3)5. Add missing dial failure assertion
Problem
TestTLSDefaults Issue
CI jobs running with TLS 1.3 clusters (e.g.,
e2e-aws-ovn-tls-13) were seeing this test skip instead of validating TLS behavior.Test Failures on Intermediate Profile
After the library-go update (commit
4e6c9ca5fin cluster-kube-apiserver-operator, Dec 17 2025) that added TLS 1.3 ciphers toDefaultCiphers(), the test began failing with:This occurred because the test was trying to verify TLS 1.3 ciphers using
MaxVersion: tls.VersionTLS12.ECDSA Cipher Handshake Failures
After bumping library-go to remove CBC ciphers, the test still failed with:
DefaultCiphers()includes both ECDSA and RSA cipher suites (per Mozilla Intermediate profile). ECDSA ciphers require an ECDSA server key, but kube-apiserver uses RSA certificates.ChaCha20-Poly1305 Failures in FIPS Mode
After fixing ECDSA handling, FIPS CI jobs failed with:
ChaCha20-Poly1305 is not FIPS 140-2/140-3 approved. Go's FIPS crypto implementation disables it entirely.
Root Cause
Three distinct issues in the test's cipher validation logic:
TLS 1.3 ciphers tested at TLS 1.2: The Intermediate profile now includes TLS 1.3 ciphers, but they cannot be negotiated when
MaxVersionis set to TLS 1.2.ECDSA ciphers vs RSA server keys:
DefaultCiphers()includes ECDSA ciphers per Mozilla guidelines, but kube-apiserver generates RSA keys vialibrary-go'sNewKeyPair(). Go's TLS server silently skips ciphers incompatible with the server's key type.ChaCha20-Poly1305 in FIPS mode:
DefaultCiphers()includes ChaCha20-Poly1305 ciphers, but Go's FIPS crypto implementation disables them. The kube-apiserver in FIPS mode cannot negotiate these ciphers.Solution
Verification
Code Quality
go build ./test/extended/apiserver/...passesgo vet ./test/extended/apiserver/...passesTesting Approach
MaxVersion: tls.VersionTLS12Related Issues
4e6c9ca5f(Dec 17, 2025)