Description
When a new disk is added to an existing JBOD storage policy, the operator incorrectly treats the missing PVC for the new volume as data loss and executes SYSTEM DROP REPLICA, which removes the replica's ZooKeeper state (including log_ptr).
Because we set None for both replica and shard in schemaPolicy, the operator does not attempt any recovery afterwards.
Root Cause
It seems the root cause is that the PVC reconciliation flow misclassifies a new volume as data loss:
// stsReconcileOpts, migrateTableOpts = w.hostPVCsDataVolumeMissedDetectedOptions(host)
stsReconcileOpts, migrateTableOpts = w.hostPVCsDataLossDetectedOptions(host)
Is there a reason hostPVCsDataVolumeMissedDetectedOptions is not used here? Are there edge cases it does not handle?
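To make the distinction in question concrete, here is a minimal Go sketch of how a missing PVC for a newly added volume could be told apart from a missing PVC for a volume that already existed. This is not the operator's code: the type pvcStatus, the function classifyPVCs, and the idea of comparing the previous and desired volume lists are all assumptions made for illustration.

```go
package main

import "fmt"

// pvcStatus is a hypothetical classification, not a type from clickhouse-operator.
type pvcStatus int

const (
	pvcOK        pvcStatus = iota // every desired volume has a PVC
	pvcNewVolume                  // a PVC is missing only for volumes newly added to the spec
	pvcDataLoss                   // a PVC is missing for a volume that existed before
)

// classifyPVCs compares the volumes of the previous spec with the desired
// spec and the PVCs that currently exist. A missing PVC counts as data loss
// only if the volume was already part of the previous spec (an assumption
// of this sketch).
func classifyPVCs(previous, desired []string, existing map[string]bool) pvcStatus {
	known := make(map[string]bool, len(previous))
	for _, v := range previous {
		known[v] = true
	}

	status := pvcOK
	for _, v := range desired {
		if existing[v] {
			continue // PVC is present, nothing to classify
		}
		if known[v] {
			return pvcDataLoss // volume existed before, but its PVC is gone
		}
		status = pvcNewVolume // volume is new, so a missing PVC is expected
	}
	return status
}

func main() {
	previous := []string{"disk1"}
	desired := []string{"disk1", "disk2"} // disk2 was just added to the JBOD volume
	existing := map[string]bool{"disk1": true}

	fmt.Println(classifyPVCs(previous, desired, existing) == pvcNewVolume) // prints "true"
}
```

Under this assumption, only the pvcDataLoss case would justify dropping the replica; the pvcNewVolume case would only require creating the new PVC.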
Steps to Reproduce
- Deploy a ClickHouseInstallation with a JBOD storage policy containing one or more disks
- Add a new disk to the JBOD volume in the CHI spec
- Observe operator logs showing SYSTEM DROP REPLICA being executed
- Verify ZooKeeper state (log_ptr, etc.) is removed for the affected replicas
See clickhouse-operator/pkg/controller/chi/worker-reconciler-chi.go, line 822 in d342fca, for the snippet quoted under Root Cause.