OCPBUGS-77015: wait for patch port to apply drop garp flows #2910
bpickard22 wants to merge 1 commit into openshift:master from
Conversation
Walkthrough: Updated OVN-Kubernetes script lib.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bpickard22. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bindata/network/ovn-kubernetes/common/008-script-lib.yaml (1)
244-252: ⚠️ Potential issue | 🟠 Major — Validate `of_port` readiness and `ovs-ofctl` success before returning.

The function returns 0 immediately after calling `ovs-ofctl add-flow`, without checking if `of_port` is `-1` (port not ready) or if the command failed. This causes the retry wrapper to exit prematurely, even when the patch port isn't ready or the flow didn't apply, defeating the race condition fix. Before returning success, validate that `of_port` is not `-1` or empty, and check the `ovs-ofctl` exit status.

Suggested guard & status check:
```diff
 of_port=$(ovs-vsctl get interface "$port_name" ofport)
+if [[ -z "${of_port}" || "${of_port}" == "-1" ]]; then
+  return 1
+fi
-ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null
-echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
-return 0
+if ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null; then
+  echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
+  return 0
+fi
+return 1
```
/retest
CI is failing because the wait-and-add-drop-flows function I added is blocking. It currently has a retry window of 60 seconds, but in a CI cluster it takes longer than that for the patch port to come up, so the container errors out, causing the failures. I think there are two ways we could go about this fix:
I think we can avoid the race between the ovn-controller and the ovnkube script more easily with this solution, but it will be a bigger change.
I think for now I am going to go with a mix: I will make the check run in the background and fail gracefully, which should provide some relief. In the meantime I will work on a fix in ovn-k; when that merges, we can remove the bash workaround here for the drop flows.
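The "run in the background and fail gracefully" idea described above could look roughly like the sketch below. This is purely illustrative: `patch_port_ready` is a stub standing in for the real ovs-vsctl/ovs-ofctl checks (here it pretends the port appears on the third poll), and the retry count and sleep interval are placeholders, not the values in the actual script.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: a retry loop that runs in the background and never
# fails the container. patch_port_ready is a hypothetical stand-in for the
# real OVS checks; names and timings are assumptions, not the real script.

poll_count=0
patch_port_ready() {
  poll_count=$((poll_count + 1))
  [[ "$poll_count" -ge 3 ]]   # stub: port "appears" on the third poll
}

wait_and_add_garp_drop_flow() {
  local retries=0 max_retries=60
  while [[ "$retries" -lt "$max_retries" ]]; do
    if patch_port_ready; then
      echo "flow added"       # the real script would run ovs-ofctl add-flow here
      return 0
    fi
    retries=$((retries + 1))
    sleep 0.01                # the real script sleeps 2s per retry
  done
  echo "gave up after $max_retries retries" >&2
  return 0                    # fail gracefully: never kill the container
}

# Run in the background so container startup is not blocked on the port.
wait_and_add_garp_drop_flow &
wait "$!"
echo "startup continues"
```

Because the worker returns 0 even after exhausting its retries, the container entrypoint proceeds regardless, which is the trade-off bpickard22 describes: relief in most cases, with the real fix landing in ovn-kubernetes itself.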
Force-pushed dbbd6ad to 0e93809 (Compare)
⚠️ Outside diff range comments (1)
bindata/network/ovn-kubernetes/common/008-script-lib.yaml (1)
245-257: ⚠️ Potential issue | 🟠 Major — Return success only after the flow is actually installed.

With the new wait loop, returning 0 immediately after a port name match can stop retries even if `of_port` is still `-1` or `ovs-ofctl add-flow` fails. That would leave the drop flow missing and reintroduce the race. Please gate the success path on a valid `of_port` and a successful `ovs-ofctl` call.

Proposed fix:
```diff
 if [[ "$port_name" == *to-br-int ]]; then
   local of_port
   of_port=$(ovs-vsctl get interface "$port_name" ofport)
-  ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null
-  echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
-  return 0
+  if [[ "${of_port}" -gt 0 ]] && ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null; then
+    echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
+    return 0
+  fi
 fi
 done
```
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
bindata/network/ovn-kubernetes/common/008-script-lib.yaml
/retest-required
None of the failing jobs have passed for some time, so I think it should be safe to override, right @jluhrsen?
@bpickard22: This pull request references Jira Issue OCPBUGS-77015, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
```shell
# Wait for patch port to exist and add GARP drop flow.
wait-and-add-garp-drop-flow() {
  local bridge="$1"
  local max_retries=90
```
Did you check comment #2801 (comment) on the PR which added this GARP drop flow?
This comment basically says that this workaround is for hard reboots or container restarts and should not require a retry, as during a graceful node reboot we add GARP drop flows while stopping the nodeControllerManager.
That's the whole problem, I believe. We have to ensure (or in this case, best-effort attempt) that we get the drop flow onto the patch port.
There is a race between this script (where we write the flow) and the ovn-controller creating the patch port: if the port is not up when we try to write the flow, we will just continue, and then the ovn-controller will create the patch port without the flow.
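As a toy illustration of that ordering problem and the wait-before-write fix, the sketch below uses a temp file to stand in for the OVS patch port, with a background `touch` playing the role of ovn-controller creating the port slightly later. Everything here is hypothetical scaffolding, not the actual ovnkube-lib code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "wait for the port to exist before writing the flow".
# A temp file stands in for the OVS patch port; the background job plays the
# ovn-controller, which creates the port a little later than the script runs.

port_file=$(mktemp -u)                  # path that does not exist yet
( sleep 0.05; touch "$port_file" ) &    # "ovn-controller" creates the port later

retries=0
until [[ -e "$port_file" ]]; do
  retries=$((retries + 1))
  if [[ "$retries" -gt 200 ]]; then
    echo "port never appeared" >&2
    break
  fi
  sleep 0.01
done

if [[ -e "$port_file" ]]; then
  echo "port exists, adding drop flow"  # real script would run ovs-ofctl add-flow here
fi
wait
rm -f "$port_file"
```

Without the polling loop, the check would run once, find nothing, and fall through, which is exactly the race described above.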
/retest-required
Also, to address the CodeRabbit comment: initially I was trying to gate the script to ensure that adding the drop flows returns successfully, but in CI clusters it takes too long for the patch port to come up, so it would always time out, causing the container to fail and CI to error out. This workaround should provide relief via more success in adding the drop flows to the port, but there are scenarios in which it will not. The fix in ovn-kubernetes/ovn-kubernetes#6011 will ensure that it always works, and we can remove the workaround here when that merges; in the meantime I believe this is our best option.
```shell
(( retries += 1 ))
if [[ "${retries}" -lt "${max_retries}" ]]; then
  echo "$(date -Iseconds) - WARN - Patch port not found on ${bridge}, retry ${retries}/${max_retries}"
  sleep 2
```
Sleeping for 2 seconds seems aggressive to me. Can we increase the number of retries instead?
Please be mindful about info level logs if we are setting max_retries to a very high number.
Yeah, good catch. I will reduce the level to a warning and reduce the amount of logs we send.
I still think the 2-second sleep is the right number to give us the best shot at the patch port being up, but I reduced the number of retries from 90 to 60, so in the worst case we only sit here for 2 minutes.
There is currently a race in the ovnkube-lib script where we try to add some drop flows to the patch port to prevent duplicate GARPs, but if the patch port isn't ready, the ovn-controller can connect to the db and create the patch port without the flow. We need to wait for the patch port to exist before we try to add the GARP drop flow. Also added some logs for better visibility.

Assisted-By: Claude Sonnet 4.5
Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
Force-pushed 0e93809 to 49ab7b2 (Compare)
@bpickard22: This pull request references Jira Issue OCPBUGS-77015, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 1
```diff
 ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null
-break
+echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
+return 0
 fi
```
Repository: openshift/cluster-network-operator
Gate success on ovs-ofctl add-flow exit status.
add_garp_drop_flow() reports success even when flow insertion fails, which stops retries early. The function must check the ovs-ofctl add-flow exit status before returning success.
Proposed fix
```diff
-ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null
-echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
-return 0
+if ovs-ofctl add-flow "$bridge" "cookie=$cookie,table=0,priority=$priority,in_port=$of_port,arp,arp_op=1,actions=drop" > /dev/null; then
+  echo "$(date -Iseconds) - INFO - Added GARP drop flow on $bridge port $port_name (ofport=$of_port)"
+  return 0
+fi
+echo "$(date -Iseconds) - WARN - Failed to add GARP drop flow on $bridge port $port_name (ofport=$of_port), will retry"
+return 1
```
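The gating pattern this review asks for — only report success when the flow-install command itself exits zero — can be seen in isolation in the toy sketch below. Here `install_flow` is a hypothetical stand-in for `ovs-ofctl add-flow`, rigged to fail on its first call so the retry path is exercised; none of these names come from the real script.

```shell
#!/usr/bin/env bash
# Toy illustration of gating success on the installer command's exit status.
# install_flow is a stand-in for ovs-ofctl add-flow; it fails on the first
# call and succeeds on the second, so the caller retries exactly once.

attempt=0
install_flow() {
  attempt=$((attempt + 1))
  [[ "$attempt" -ge 2 ]]     # non-zero exit on the first call only
}

add_garp_drop_flow() {
  if install_flow; then
    echo "INFO - added flow"
    return 0
  fi
  echo "WARN - add-flow failed, will retry" >&2
  return 1
}

# The retry wrapper keeps calling until the function reports real success.
until add_garp_drop_flow; do :; done
```

The point of the `if` around the installer call is that the redirect to `/dev/null` only discards output, not the exit status; without the check, a failed install would still print the INFO line and stop the retries early.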
@bpickard22: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/test e2e-gcp-ovn

this working yet?
There is currently a race in the ovnkube-lib script where we try to add some drop flows to the patch port to prevent duplicate GARPs, but if the patch port isn't ready, the ovn-controller can connect to the db and create the patch port without the flow.
We need to wait for the patch port to exist before we try to add the GARP drop flow.
Also added some logs for better visibility.
Assisted-By: Claude Sonnet 4.5
cc @pperiyasamy