
Add PVC volume expansion #712

Open
HoustonPutman wants to merge 11 commits into apache:main from HoustonPutman:pvc-expanding

Conversation

@HoustonPutman
Contributor

@HoustonPutman HoustonPutman commented Jun 14, 2024

Resolves #709

  • Implement
  • Changelog
  • Tests
  • Documentation

@HoustonPutman
Contributor Author

Unfortunately, KinD does not support volume expansion by default, so we will need to find a new PVC provisioner for our integration tests.
kubernetes-sigs/kind#3734
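One workaround is to mark KinD's default StorageClass as expandable. A minimal sketch (KinD's default class is named standard; note this only flips the API flag, the underlying provisioner must still actually support expansion):

```shell
# Allow PVCs bound to KinD's default StorageClass to be expanded.
kubectl patch storageclass standard \
  -p '{"allowVolumeExpansion": true}'
```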

@janhoy
Contributor

janhoy commented Mar 26, 2025

I just handled a volume scale for a customer's customer today. Disks were full, and we did the dance of updating the sts, deleting it in orphan mode, adjusting each PVC manually, waiting for the resize, and rolling-restarting the sts. A true pain :)
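That manual dance sketches out roughly like this (all resource names and sizes are illustrative, not from any real deployment):

```shell
# Delete the StatefulSet without deleting its pods ("orphan" mode),
# since volumeClaimTemplates on a live StatefulSet are immutable:
kubectl delete statefulset example-solrcloud --cascade=orphan

# Grow each PVC individually:
for i in 0 1 2; do
  kubectl patch pvc "data-example-solrcloud-$i" \
    -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
done

# After the resize completes, recreate the StatefulSet with the larger
# volumeClaimTemplate and roll the pods so filesystems are expanded:
kubectl apply -f statefulset-with-new-size.yaml
kubectl rollout restart statefulset example-solrcloud
```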

So, is there any news on this? Alternatively, could we test this feature only at the unit level, or with mocks, until we can use a true provisioner?

@HoustonPutman
Contributor Author

Ok, I removed part of the test that relied on the volumes actually changing. The tests pass and, as expected, the operator successfully updates the size requested.

This should be close to good-to-go. If you want to review, we can merge and hopefully get the next release out soon.

Co-authored-by: Steffen Moldenhauer <54577793+smoldenhauer-ish@users.noreply.github.com>
@janhoy janhoy requested a review from Copilot March 30, 2026 15:09
@janhoy
Contributor

janhoy commented Mar 30, 2026

Great. I trust Copilot’s review more than mine, so triggered one 😉


Copilot AI left a comment


Pull request overview

This PR adds first-class support in the Solr Operator for expanding (resizing) persistent data PVCs, working around the current Kubernetes StatefulSet limitation by patching PVCs directly and triggering a rolling restart.

Changes:

  • Introduces a new cluster operation (PVCExpansion) and related annotations to coordinate PVC resizing + restart.
  • Expands RBAC to allow patch/update on PersistentVolumeClaims.
  • Adds/updates E2E test coverage and supporting test harness/log collection, plus documentation/changelog updates.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Summary per file:

  • controllers/solrcloud_controller.go: Adds PVC expansion reconciliation logic and hooks it into the reconcile loop.
  • controllers/solr_cluster_ops_util.go: Adds the new PVCExpansion cluster-op type and handler.
  • controllers/util/solr_util.go: Adds storageMinimumSize annotation tracking on the StatefulSet and pod template.
  • config/rbac/role.yaml: Grants patch/update permissions on PVCs for the manager role.
  • helm/solr-operator/templates/role.yaml: Grants patch/update permissions on PVCs for the Helm-installed operator role.
  • tests/e2e/solrcloud_storage_test.go: Adds an E2E test validating PVC request-size updates during the expansion flow.
  • tests/e2e/suite_test.go: Enhances failure artifact collection (PVC info) and tweaks operator log filtering.
  • tests/scripts/manage_e2e_tests.sh: Patches KinD's default StorageClass to allow expansion (workaround for the kind issue).
  • docs/solr-cloud/solr-cloud-crd.md: Documents that the PVC request size can be updated for expansion; other fields remain immutable.
  • docs/upgrade-notes.md: Fixes a small typo.
  • helm/solr/Chart.yaml: Updates the ArtifactHub changelog entry for the feature.


Comment on lines +431 to +433
statusFile, err := os.Create(baseFilename + ".status.json")
defer statusFile.Close()
Expect(err).ToNot(HaveOccurred(), "Could not open file to save PVC status: %s", baseFilename+".status.json")

Copilot AI Mar 30, 2026


In writeAllPvcInfoToFiles, statusFile/eventsFile are deferred closed before checking the os.Create error. If Create fails, the deferred Close will panic on a nil file handle. Move the defer until after verifying err == nil (and consider handling Close errors consistently).

Comment on lines 533 to 547
	if filterLinesWithString != "" {
		filteredWriter := bytes.NewBufferString("")
		scanner := bufio.NewScanner(podLogs)
		for scanner.Scan() {
			line := scanner.Text()
			if strings.Contains(line, filterLinesWithString) {
				io.WriteString(filteredWriter, line)
				io.WriteString(filteredWriter, "\n")
				io.WriteString(logFile, line)
				io.WriteString(logFile, "\n")
			}
		}
		logReader = filteredWriter
	} else {
		_, err = io.Copy(logFile, podLogs)
	}

	_, err = io.Copy(logFile, logReader)
	Expect(err).ToNot(HaveOccurred(), "Could not write podLogs to file: %s", filename)
}

Copilot AI Mar 30, 2026


When filterLinesWithString != "", the function never assigns to err and does not check scanner.Err() / io.WriteString errors. As a result, read/write failures (including bufio.Scanner token-too-long) can be silently ignored and the final Expect(err) will still pass. Capture and assert scanner.Err(), and propagate write errors so the test output reflects failures.

HoustonPutman and others added 2 commits March 30, 2026 12:16
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@HoustonPutman
Contributor Author

Good suggestions by Copilot. Implemented, so this should probably be done?

@HoustonPutman
Contributor Author

The integration tests are just noisier than they used to be. I'm not sure why that would be the case (after fixing the bug in the Copilot code), but I'm going to hold off merging until I can nail it down.

@HoustonPutman
Contributor Author

Ok, good and bad news here:

Good news: I got a local expandable-PVC setup working using the RawFile PV project from OpenEBS.

Bad news: This PR doesn't work because of the way that PVC expansion works, at least for this storage provider.

The way we do it in the operator, we increase the request for all PVCs manually, then wait for them to be updated with the new capacity. Then we issue a rolling restart so that the pods get the new storage that has been allocated to them. However, this storage provider, and possibly all of them, requires a pod to be restarted before the PVC's storage status gets updated to the new value. So we stall out: the Solr Operator is waiting on the PVCs' statuses to be updated, and the storage provisioner is waiting on the pods to be restarted.

We need to investigate whether all storage classes operate this way, and if so, change the logic to do a rolling restart when the PVCs are in the FileSystemResizePending condition. Then, at the end, make sure all the PVCs are the right size. This is definitely more complex than the current logic.
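To see whether a PVC is stuck in that state, its conditions can be inspected directly (the PVC name is illustrative):

```shell
# List the condition types on a PVC; a stalled offline expansion
# typically shows FileSystemResizePending until the pod restarts.
kubectl get pvc data-example-solrcloud-0 \
  -o jsonpath='{.status.conditions[*].type}'
```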

Also in newer versions of Kubernetes, there are better ways of monitoring the PVC to see what is happening: https://kubernetes.io/blog/2025/09/19/kubernetes-v1-34-recover-expansion-failure/
These new PVC fields are beta in 1.32 and GA in 1.34, and off by default in all versions before 1.32. Since the Solr Operator supports many more versions than that, I'm hesitant to bump up the required version by so much just to make use of these. But it would make the code better IMO and more resilient when used across many different storage providers.

Lots to think about, and certainly not decisions I want to make myself.

@janhoy , would you be able to test in your setup to ensure that your storage provider behaves the same way?

@janhoy
Contributor

janhoy commented Apr 2, 2026

Insight from Claude on the deadlock, not sure if that helps:

The operator waits on status.capacity reaching the new size before issuing the rolling restart, but for drivers that don't implement ExpandInUsePersistentVolumes, the kubelet only updates status.capacity after expanding the filesystem on pod mount — which requires a pod restart first. Deadlock.

Fix

Pivot the wait condition from status.capacity to status.conditions. After patching PVCs, wait until each PVC satisfies one of:

  • status.capacity >= requested → online expansion complete, no restart needed
  • status.conditions contains FileSystemResizePending=True → block device resized, pod restart needed to expand the filesystem

Then issue the rolling restart, and verify status.capacity as a post-condition rather than a gate.

Which providers require this

  • AWS EBS CSI, GCE PD CSI, Azure Disk CSI: support ExpandInUsePersistentVolumes, no pod restart needed, won't hit this path
  • OpenEBS LocalPV, local-path-provisioner, Longhorn (some modes), NFS-backed: require pod restart, set FileSystemResizePending
  • Rook/Ceph RBD: online. CephFS: offline.

Rough rule: cloud-managed block storage → online. Local/file/distributed → offline. The operator needs to handle both.
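The two-branch wait condition described above can be sketched with a minimal model (the struct stands in for the relevant fields of core/v1 PersistentVolumeClaimStatus; all names are illustrative, not operator code):

```go
package main

import "fmt"

// pvcStatus models just the fields the wait condition needs.
type pvcStatus struct {
	capacityGi  int      // current status.capacity, in Gi
	requestedGi int      // spec.resources.requests.storage, in Gi
	conditions  []string // condition types with status=True
}

// nextStep decides what the operator should do for one PVC:
// proceed when capacity has caught up (online expansion), roll the
// pod when FileSystemResizePending is set (offline expansion), and
// otherwise keep waiting on the storage side.
func nextStep(s pvcStatus) string {
	if s.capacityGi >= s.requestedGi {
		return "done" // online expansion complete, no restart needed
	}
	for _, c := range s.conditions {
		if c == "FileSystemResizePending" {
			return "restart" // block device grown; restart expands the filesystem
		}
	}
	return "wait" // expansion still in progress
}

func main() {
	fmt.Println(nextStep(pvcStatus{capacityGi: 20, requestedGi: 20}))
	fmt.Println(nextStep(pvcStatus{capacityGi: 10, requestedGi: 20,
		conditions: []string{"FileSystemResizePending"}}))
	fmt.Println(nextStep(pvcStatus{capacityGi: 10, requestedGi: 20}))
}
```

After the rolling restart, status.capacity would then be verified as a post-condition, as suggested above.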

Development

Successfully merging this pull request may close these issues.

Allow resizing (expanding) of persistent data PVCs
