ROX-33305 prevent gc #166

Draft
davdhacs wants to merge 2 commits into master from rox-33305-prevent-gc

@davdhacs davdhacs commented Mar 4, 2026

On gke-latest nightly runs, we've started to see images deleted from the registry before all tests using them have run.
Attempt: pin images to prevent GC.

Result: pinning doesn't seem to be strictly honored; images still get deleted.
See the stackrox PR using this branch's image: stackrox/stackrox#19218
and the gke-latest test run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/stackrox_stackrox/19218/pull-ci-stackrox-stackrox-master-gke-latest-qa-e2e-tests/2028915923859017728

davdhacs and others added 2 commits March 2, 2026 22:39
After each successful CRI PullImage, use the containerd native client
API to set io.cri-containerd.pinned=pinned on the image. This tells
kubelet's image GC to skip the image.

The CRI API doesn't support setting image labels, so we connect to
containerd directly (same socket) using the containerd client library
in the k8s.io namespace. The pinning happens immediately after each
successful pull, before GC has a chance to evict the image.

This is a proof-of-concept to test if pinning at pull time (via the
containerd API) works better than post-hoc pinning via ctr CLI, which
has known bugs (containerd#9328, #10270).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
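
The pull-then-pin ordering described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `imageStore`, `fakeStore`, and `pullAndPin` are hypothetical names, and the in-memory store only stands in for CRI PullImage and the containerd label update. The label key and value are the ones quoted in the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// Label key and value as quoted in the commit message.
const (
	pinnedLabelKey   = "io.cri-containerd.pinned"
	pinnedLabelValue = "pinned"
)

// imageStore is a hypothetical stand-in for the two APIs involved:
// Pull models CRI PullImage, SetLabel models a containerd label update.
// It is NOT the real CRI or containerd client interface.
type imageStore interface {
	Pull(ref string) error
	SetLabel(ref, key, value string) error
}

// fakeStore is an in-memory implementation used only for illustration.
type fakeStore struct {
	labels map[string]map[string]string // image ref -> labels
}

func newFakeStore() *fakeStore {
	return &fakeStore{labels: map[string]map[string]string{}}
}

// Pull records the image as present; an empty reference is rejected.
func (s *fakeStore) Pull(ref string) error {
	if ref == "" {
		return errors.New("empty image reference")
	}
	if _, ok := s.labels[ref]; !ok {
		s.labels[ref] = map[string]string{}
	}
	return nil
}

// SetLabel sets a label on an image that has already been pulled.
func (s *fakeStore) SetLabel(ref, key, value string) error {
	l, ok := s.labels[ref]
	if !ok {
		return fmt.Errorf("image %q not found", ref)
	}
	l[key] = value
	return nil
}

// pullAndPin pulls ref and, only if the pull succeeded, pins it right away
// so that no other step runs between the pull and the pin.
func pullAndPin(s imageStore, ref string) error {
	if err := s.Pull(ref); err != nil {
		return fmt.Errorf("pull %s: %w", ref, err)
	}
	if err := s.SetLabel(ref, pinnedLabelKey, pinnedLabelValue); err != nil {
		return fmt.Errorf("pin %s: %w", ref, err)
	}
	return nil
}

func main() {
	s := newFakeStore()
	ref := "quay.io/rhacs-eng/qa-multi-arch:nginx-1.12"
	if err := pullAndPin(s, ref); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.labels[ref][pinnedLabelKey])
}
```

Keeping the pin inside the same function as the pull leaves no window for another step to run in between, which is the property the commit message relies on.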
@davdhacs davdhacs requested review from porridge and stehessel March 4, 2026 05:30
Author

davdhacs commented Mar 4, 2026

@porridge could you take a look at this problem? Example nightlies qa-e2e on gke-latest: https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/branch-ci-stackrox-stackrox-nightlies-gke-latest-qa-e2e-tests

@davdhacs davdhacs removed the request for review from stehessel March 4, 2026 05:35
Author

davdhacs commented Mar 4, 2026

Since pinning still doesn't work around the problem, I'm wondering if we need to do something else to keep refreshing the images (periodic run of prefetcher?) or just change the tests to load the images if they're not present.
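
The second option (load images only when they are missing) might look roughly like the sketch below. It assumes a hypothetical `imageRuntime` interface with `Present` and `Pull` methods; neither is a real CRI or containerd call, and `fakeRuntime` and the `example.test/...` reference exist only to make the example runnable. The same function could also be called periodically to cover the first option.

```go
package main

import "fmt"

// imageRuntime is a hypothetical interface, not a real CRI or containerd API.
type imageRuntime interface {
	Present(ref string) bool
	Pull(ref string) error
}

// fakeRuntime is an in-memory runtime used only for illustration.
type fakeRuntime struct {
	images map[string]bool
	pulls  int
}

func (r *fakeRuntime) Present(ref string) bool { return r.images[ref] }

func (r *fakeRuntime) Pull(ref string) error {
	r.pulls++
	r.images[ref] = true
	return nil
}

// ensureImages pulls every ref that is not currently present and returns
// the refs it had to pull. Refs that are still present are left alone.
func ensureImages(rt imageRuntime, refs []string) ([]string, error) {
	var repulled []string
	for _, ref := range refs {
		if rt.Present(ref) {
			continue
		}
		if err := rt.Pull(ref); err != nil {
			return repulled, fmt.Errorf("pull %s: %w", ref, err)
		}
		repulled = append(repulled, ref)
	}
	return repulled, nil
}

func main() {
	rt := &fakeRuntime{images: map[string]bool{
		"quay.io/rhacs-eng/qa-multi-arch:nginx-1.12": true,
	}}
	refs := []string{
		"quay.io/rhacs-eng/qa-multi-arch:nginx-1.12",
		"example.test/evicted-image:1", // hypothetical image that was deleted
	}
	repulled, err := ensureImages(rt, refs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(repulled)
}
```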

Collaborator

porridge commented Mar 4, 2026

@davdhacs I'm not sure it's about garbage collection. It seems this image is being pulled using a different pull spec than the one with which it's being used. Namely this query only shows quay.io/rhacs-eng/qa-multi-arch:nginx-1.12, not quay.io/rhacs-eng/qa-multi-arch:nginx-1.12@sha256:72daaf46f11cc753c4eab981cbf869919bd1fee3d2170a2adeac12400f494728 that the failing test complains about.

SELECT *
FROM `acs-san-stackroxci.ci_metrics.stackrox_image_prefetches`
WHERE build_id = '2023195859587436544'
  AND image LIKE 'quay.io/rhacs-eng/qa-multi-arch:nginx-1.12%' -- '@sha256:72daaf46f11cc753c4eab981cbf869919bd1fee3d2170a2adeac12400f494728'
LIMIT 10
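
The mismatch can be illustrated with a small check that compares the prefetch list against the pull specs the tests use. This is only a sketch: `splitDigest` and `missingFromPrefetch` are hypothetical helpers, and the exact-string comparison is an assumption that may be stricter than how the container runtime actually resolves references.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDigest splits a pull spec at the last "@", returning the name
// (including any tag) and the digest ("" when the spec has no digest).
func splitDigest(spec string) (name, digest string) {
	if i := strings.LastIndex(spec, "@"); i >= 0 {
		return spec[:i], spec[i+1:]
	}
	return spec, ""
}

// missingFromPrefetch returns the wanted pull specs that have no exact
// string match in the prefetch list (assumption: exact match is required).
func missingFromPrefetch(prefetched, wanted []string) []string {
	seen := make(map[string]bool, len(prefetched))
	for _, p := range prefetched {
		seen[p] = true
	}
	var missing []string
	for _, w := range wanted {
		if !seen[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// What the query shows as prefetched.
	prefetched := []string{"quay.io/rhacs-eng/qa-multi-arch:nginx-1.12"}
	// What the failing test complains about.
	wanted := []string{
		"quay.io/rhacs-eng/qa-multi-arch:nginx-1.12@sha256:72daaf46f11cc753c4eab981cbf869919bd1fee3d2170a2adeac12400f494728",
	}
	for _, w := range missingFromPrefetch(prefetched, wanted) {
		name, digest := splitDigest(w)
		fmt.Printf("not prefetched: name=%s digest=%s\n", name, digest)
	}
}
```

Here the tag-only entry and the digest-qualified spec share the same name part but differ as strings, so the digest-qualified spec is reported as not prefetched.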

Collaborator

porridge commented Mar 4, 2026

I've added the digest-qualified pull spec in stackrox/stackrox#19287, let's see if it helps.

Collaborator

porridge commented Mar 4, 2026

> I've added the digest-qualified pull spec in stackrox/stackrox#19287, let's see if it helps.

Hm, it didn't. ProcessVisualizationTest still failed at ~10:06:58, complaining about the exact same pull spec that is now in the prefetch list and was fetched at ~08:55 according to BigQuery 🤔

Let me try again together with the disk bump piece.
