More details on restore design #2151

Open
sseago wants to merge 1 commit into openshift:oadp-dev from sseago:restore-design

Conversation


@sseago sseago commented Apr 2, 2026

Why the changes were made

This PR updates the restore design for kubevirt datamover with more details.

How to test the changes made


coderabbitai bot commented Apr 2, 2026

Walkthrough

This pull request updates the KubeVirt datamover design document to expand on velero integration details, including new VM/PVC coordination annotations, restore plugin requirements, controller restore workflow enhancements, datamover pod operation sequencing, and backup metadata format extensions with per-checkpoint pvcSizes entries.

Changes

Cohort / File(s): Design Documentation (docs/design/kubevirt-datamover.md)
Summary: Extended velero integration section with new annotations (velerov1api.PVCNamespaceNameLabel, kubevirt-datamover-vm), updated restore plugin requirements for VirtualMachine and PVC RIA plugins, refined DataDownload reconciler workflow with explicit BSL metadata and temporary PVC/PV creation logic, detailed datamover pod sequential PVC processing with qcow2 handling and block device writes, and augmented backup metadata format with pvcSizes per checkpoint.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.


openshift-ci bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sseago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/kubevirt-datamover.md`:
- Around line 101-102: The doc's PVC sizing logic currently hardcodes a 10%
buffer when computing temp PVC size from qcow2 sizes and PVC sizes; update the
implementation (the PVC sizing calculation / any function named like
calculate_buffered_size or tempPVCSize computation in the datamover/controller)
to make the buffer percentage configurable and increase the default to a safer
value (e.g., 15–20%), add validation tests or notes documenting empirical
bounds, and implement runtime handling for insufficient-space errors by
detecting qemu-img or filesystem ENOSPC failures and retrying the operation
after increasing the PVC size (or failing with a clear actionable error). Ensure
the change is reflected in docs by replacing the hardcoded "10% buffer" text
with the new configurable/default value and a note about dynamic retry behavior.
- Line 118: The FIXME in docs/design/kubevirt-datamover.md about using the
unsafe `-u` flag for qemu-img rebase must be resolved: decide whether to use
`-u` (and under what conditions) by reviewing current KubeVirt incremental
backup docs and consulting KubeVirt SME/PRs, then replace the FIXME with a
definitive guidance sentence that states the chosen approach, the exact qemu-img
invocation pattern, and the safety rationale (when `-u` is allowed vs forbidden)
so implementers of the datamover know whether to include `-u` in the rebase step
and why.
- Line 126: Clarify cleanup by explicitly naming the temporary resources and
their lifecycle: state that the "qcow2 download PVC" (the temp PVC created for
the qcow2 download/staging) and its backing PV (referred to as the "qcow2
staging PV") are deleted after the datamover pod finishes and the qcow2 image is
converted; state that each "per-VM temp PVC" (the temp PVCs created for each VM
PVC) are also deleted after their corresponding VM disk has been copied to the
target PVC and the PV bind is completed; specify which PVs are retained (the
final restored PVs bound to the target PVCs) versus which PVs are removed (the
qcow2 staging PV and any PVs exclusively backing the temp PVCs), and add a
concise ordered lifecycle sequence (create qcow2 download PVC -> create per-VM
temp PVCs -> datamover copies/creates final PVCs -> datamover exits -> delete
per-VM temp PVCs and qcow2 download PVC -> leave restored PVs bound to final
PVCs).
- Around line 62-63: The document references a non-existent constant
velerov1api.PVCNamespaceNameLabel and mixes “label” vs “annotation”; update the
design to either (a) import/add a constant in Velero upstream only if agreed
with maintainers, or (b) define it locally in pkg/common/common.go (e.g.,
PVCNamespaceNameAnnotation) and use that symbol in the doc; rename the
identifier to remove “Label” if it will be an annotation, update the
VirtualMachine annotation description to state its exact key/name, expected
value format (e.g., "<namespace>/<name>" or just "<namespace>"), and how it is
used to compute the ConfigMap label selector during restore so the
implementation (OperationID/async progress/DU status) can rely on a concrete
annotation key and value shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36d9a07b-999a-41a6-8690-0d2f27272d0e

📥 Commits

Reviewing files that changed from the base of the PR and between 3cceb1e and 5ed54cf.

📒 Files selected for processing (1)
  • docs/design/kubevirt-datamover.md

Comment on lines +62 to 63
- Add `velerov1api.PVCNamespaceNameLabel` annotation to VirtualMachine (doesn't need to be a label, since we're just using it to figure out what label selector to use for the ConfigMap on restore).
- OperationID will be created and returned similar to what's done with the CSI PVC plugin, and the async operation Progress method will report on progress based on the DU status (similar to CSI PVC plugin)
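The annotation-to-selector flow in the bullets above can be sketched as follows. This is a hypothetical illustration only: the annotation key and the selector label keys are assumptions (the design has not fixed them, and the existence of `velerov1api.PVCNamespaceNameLabel` is questioned in the review comments on this page); the value is assumed to be `"<namespace>/<name>"`.

```python
# Assumed annotation key; not an existing Velero or OADP constant.
PVC_NAMESPACE_NAME_ANNOTATION = "oadp.openshift.io/pvc-namespace-name"

def configmap_label_selector(vm_annotations: dict[str, str]) -> str:
    """Derive a ConfigMap label selector from the VM annotation on restore.

    Assumes the annotation value has the shape "<namespace>/<name>"; the
    selector label keys below are illustrative, not fixed by the design.
    """
    value = vm_annotations[PVC_NAMESPACE_NAME_ANNOTATION]
    namespace, name = value.split("/", 1)
    return f"pvc-namespace={namespace},pvc-name={name}"
```

A restore plugin would read this annotation off the VirtualMachine and use the resulting selector to locate the per-VM ConfigMap, which is why the review below asks for the key name and value shape to be pinned down.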

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Search for PVCNamespaceNameLabel definition in Velero API references

# Search for PVCNamespaceNameLabel in go files
rg -n "PVCNamespaceNameLabel" --type=go

# Also check if it's imported from velero
rg -n "velerov1api\." --type=go -A2 -B2 | rg -i "pvc.*label"

Repository: openshift/oadp-operator

Length of output: 49


🏁 Script executed:

# Search for similar label/annotation constants in the codebase
rg "Label\s*=" --type=go | head -20

# Search for velerov1api imports to understand what's available
rg "velerov1api\." --type=go -B5 | head -40

# Check pkg/common/common.go for existing constants
fd "common.go" --type=f | head -5

Repository: openshift/oadp-operator

Length of output: 1220


🏁 Script executed:

# Read pkg/common/common.go to see existing constants
cat -n pkg/common/common.go

Repository: openshift/oadp-operator

Length of output: 15794


🏁 Script executed:

# Search for how other Velero annotations are referenced and used
rg "velerov1\." --type=go | grep -i "annotation\|label" | head -20

Repository: openshift/oadp-operator

Length of output: 586


Clarify and verify the PVCNamespaceNameLabel annotation.

The constant velerov1api.PVCNamespaceNameLabel is referenced in the design document but does not exist anywhere in the codebase or Velero v1 API references found here. Additionally, the annotation name includes "Label" while the comment states it "doesn't need to be a label"—this contradiction is confusing.

Before implementation, clarify:

  1. Whether PVCNamespaceNameLabel needs to be added to Velero upstream or defined locally in pkg/common/common.go
  2. Why the name includes "Label" if it's an annotation, not a label
  3. The precise purpose and structure of this annotation for ConfigMap label selector resolution

Comment on lines +101 to +102
- PVC size based on the size of the qcow2 files in BSL needed for restore as well as the PVC sizes
- For each PVC, calculate the sum of all qcow2 files added to the PVC size, and then add 10% as a buffer. If there are multiple PVCs, take the max value, as we can process one PVC at a time, so we don't need to hold files for all PVCs on the temp disk at the same time.
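The sizing rule quoted above (per PVC, sum the qcow2 file sizes plus the PVC size, add a 10% buffer, then take the max across PVCs since they are processed one at a time) can be sketched as follows. The function name and parameters are illustrative, not from the design.

```python
def temp_pvc_size(pvc_qcow2_sizes: dict[str, list[int]],
                  pvc_sizes: dict[str, int],
                  buffer_pct: float = 0.10) -> int:
    """Compute the staging PVC size in bytes per the rule described above.

    pvc_qcow2_sizes maps each PVC name to the byte sizes of the qcow2
    files needed to restore it; pvc_sizes maps PVC name to PVC size.
    """
    candidates = []
    for pvc, qcow2_sizes in pvc_qcow2_sizes.items():
        # Space needed while restoring this one PVC, plus the buffer.
        needed = sum(qcow2_sizes) + pvc_sizes[pvc]
        candidates.append(int(needed * (1 + buffer_pct)))
    # PVCs are processed sequentially, so the max single-PVC need suffices.
    return max(candidates)
```

The review comment below questions whether the 10% buffer is sufficient; making `buffer_pct` configurable, as sketched here, is one way to address that.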

⚠️ Potential issue | 🟠 Major

Validate the 10% buffer sizing calculation.

The sizing calculation adds a 10% buffer to the sum of qcow2 files and PVC sizes. This buffer percentage appears arbitrary and may be insufficient in edge cases, such as:

  • Highly compressed qcow2 files that expand significantly during conversion
  • Temporary intermediate files created during the qemu-img operations
  • Filesystem overhead on the temporary PVC

Consider whether:

  1. The 10% buffer has been empirically validated with real-world VM backup scenarios
  2. A larger or dynamic buffer (e.g., 15-20%) would be safer
  3. The controller should detect and handle "insufficient space" errors by automatically retrying with a larger PVC

Would you like me to research typical qcow2 expansion ratios or generate logic to handle PVC resize scenarios?

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 101-101: Inconsistent indentation for list items at the same level
Expected: 4; Actual: 1

(MD005, list-indent)


- Process one PVC at a time:
- Download all required qcow2 files for this PVC.
- Call `qemu-img rebase -b fullbackup.qcow2 -f qcow2 -u incrementalN.qcow2` for each incremental backup, in order
- FIXME: Should we use the unsafe `-u` flag? kubevirt docs suggested this, but we should confirm that this is still appropriate.
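The per-incremental rebase step above can be sketched as a command builder. The filenames and the unsafe `-u` flag mirror the draft text exactly; since the FIXME about `-u` is still open, treat this as a sketch of the draft's current approach, not settled guidance.

```python
def rebase_commands(full_backup: str, incrementals: list[str],
                    unsafe: bool = True) -> list[list[str]]:
    """Build the qemu-img rebase invocations described in the step above.

    One command per incremental, applied in backup order. With unsafe=True
    this emits `-u`, which only rewrites the backing-file pointer without
    checking or copying backing data (the open FIXME in the design).
    """
    cmds = []
    for inc in incrementals:  # must be applied in backup order
        cmd = ["qemu-img", "rebase", "-b", full_backup, "-f", "qcow2"]
        if unsafe:
            cmd.append("-u")
        cmd.append(inc)
        cmds.append(cmd)
    return cmds
```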

⚠️ Potential issue | 🟠 Major

Resolve the FIXME about the unsafe -u flag before implementation.

The FIXME comment indicates uncertainty about whether to use the unsafe -u flag with qemu-img rebase. This is a critical decision point that affects data integrity during restore operations.

The unsafe flag bypasses backing file format consistency checks. Using it incorrectly could result in:

  • Corrupted restored VM disks
  • Silent data loss if the backing chain is inconsistent
  • Difficult-to-debug restore failures

This design decision must be resolved before implementation proceeds. Consult with KubeVirt subject matter experts or review the latest KubeVirt incremental backup documentation to confirm the recommended approach.

Would you like me to search the KubeVirt enhancement proposal and related issues to find guidance on this flag usage?


- Delete the raw file
- Note that the various `qemu-img` actions might eventually be combined into a single kubevirt API call, but for the moment this would need to be done manually.
- Once datamover pod has restored the VM disks, it will exit and the VirtualMachine can launch with these disks (following the velero datamover model where the temporary PVC is deleted, retaining the PV, which then binds to the VM's PVCs may work here). The temporary PVC (containing the qcow2 images, not the restored VM disk image) should be completely removed at this point, including PV content.
- Once datamover pod has restored the VM disks, it will exit, the temporary PVCs will be deleted, leaving the restored PVs but deleting the qcow2 staging PV.
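The cleanup rule the bullets above describe can be sketched as a simple partition: after the datamover pod exits, the qcow2 staging resources (temp PVC and its backing PV) are deleted, while PVs holding restored VM disks are retained to bind to the VM's PVCs. The role names and fields here are illustrative, not an actual API; the review below asks for the full lifecycle to be spelled out.

```python
from dataclasses import dataclass

@dataclass
class Volume:
    name: str
    role: str  # assumed roles: "qcow2-staging" or "restored-disk"

def partition_for_cleanup(volumes: list[Volume]) -> tuple[list[str], list[str]]:
    """Split post-restore volumes into (delete, retain) per the rule above."""
    delete = [v.name for v in volumes if v.role == "qcow2-staging"]
    retain = [v.name for v in volumes if v.role == "restored-disk"]
    return delete, retain
```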

⚠️ Potential issue | 🟡 Minor

Clarify the cleanup logic for temporary resources.

The statement "temporary PVCs will be deleted, leaving the restored PVs but deleting the qcow2 staging PV" is unclear about the distinction between:

  1. Temporary PVCs used for qcow2 download (created at line 99)
  2. Temporary PVCs for each VM PVC (created at line 103)
  3. The "qcow2 staging PV" mentioned here

Please clarify:

  • Which specific PVCs/PVs are deleted vs. retained
  • Whether "qcow2 staging PV" refers to the PV backing the temp PVC from line 99
  • The lifecycle of all temporary resources created during restore

This ambiguity could lead to resource leaks or incorrect cleanup during implementation.



openshift-ci bot commented Apr 2, 2026

@sseago: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
