
fix(kubeovn): derive MASTER_NODES from server group for multi-master support#12

Merged
lexfrei merged 10 commits into main from fix/multi-master-nodes on Mar 18, 2026

Conversation


@lexfrei lexfrei commented Mar 18, 2026

Summary

kube-ovn MASTER_NODES was hardcoded to cozystack_api_server_host — always a single IP. On multi-master clusters this breaks OVN RAFT consensus because ovn-central only gets one node in NODE_IPS.

The role now auto-detects all control-plane node IPs from the server inventory group, with an explicit cozystack_master_nodes override for non-standard inventories.

Ref: cozystack/cozystack#2242

Changes

  • Add cozystack_master_nodes variable (default: auto-detect from server group host keys)
  • Compute _cozystack_master_nodes via set_fact with validation
  • Update template to use computed value instead of single cozystack_api_server_host
  • Set explicit cozystack_master_nodes in CI inventory (localhost host key is not an IP)
  • Expand Ubuntu example to 3 control-plane nodes
  • Document cozystack_master_nodes and multi-master setup in README
  • Add CI job testing template rendering with both 3-node and single-node inventories
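
A minimal sketch of the derivation, using the variable names from the description above (the actual role tasks may differ in detail):

```yaml
# roles/cozystack/defaults/main.yml (sketch)
# Empty string means "auto-detect from the 'server' inventory group".
cozystack_master_nodes: ""

# tasks (sketch): explicit override wins, otherwise join the group's host keys
- name: Compute master node IPs
  ansible.builtin.set_fact:
    _cozystack_master_nodes: >-
      {{ cozystack_master_nodes
         if cozystack_master_nodes | length > 0
         else groups['server'] | default([]) | join(',') }}
```

The template then renders `MASTER_NODES` from `_cozystack_master_nodes` instead of the single `cozystack_api_server_host`.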

Test plan

  • ansible-lint passes
  • ansible-test sanity passes (CI)
  • Tested on a live cluster: 3-node Colima VMs with bridged networking, k3s multi-master. The Platform Package CR rendered correctly with MASTER_NODES containing all three control-plane IPs, and ovn-central started without "host ip not in NODE_IPS" errors
  • Idempotency verified (pipeline EXIT: 0 on re-run)

Summary by CodeRabbit

  • New Features

    • Multi‑master support with automatic master-node detection, explicit override option, and validation that host entries are IPs (hostname host keys rejected)
  • Documentation

    • Added multi‑master setup docs and examples showing detection and explicit configuration
  • Tests

    • New end‑to‑end tests covering auto‑detection, explicit override, IP validation, and manifest rendering
    • New CI job to run multi‑master validation in workflows

lexfrei added 5 commits March 18, 2026 14:11
…r kube-ovn

MASTER_NODES was set to cozystack_api_server_host (always a single IP),
breaking OVN RAFT consensus on multi-master clusters. Now auto-detected
from the 'server' inventory group host keys, with an explicit override
via cozystack_master_nodes variable.

Ref: cozystack/cozystack#2242

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
CI inventory uses 'localhost' as host key (not an IP), so auto-detection
would produce an incorrect value. Set the IP explicitly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Expand server group to three control-plane nodes to demonstrate
multi-master setup. Add commented cozystack_master_nodes override.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Add variable to the optional variables table and a new section
explaining auto-detection from the server inventory group.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…tion

Verify that MASTER_NODES is correctly derived from the server inventory
group with both 3-node and single-node inventories, and that explicit
cozystack_master_nodes override works.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

coderabbitai bot commented Mar 18, 2026

Warning

Rate limit exceeded

@lexfrei has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 41 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0cf6e0e0-c0bb-41eb-a66c-f427962dfa6f

📥 Commits

Reviewing files that changed from the base of the PR and between cb1edb1 and 68dd433.

📒 Files selected for processing (2)
  • .github/workflows/test.yml
  • galaxy.yml
🚥 Pre-merge checks (3 passed)
  • Title check — ✅ Passed: The title clearly and specifically summarizes the main change: deriving MASTER_NODES from the server group to support multi-master Kubernetes clusters.
  • Description check — ✅ Passed: The description follows the template with all required sections (Summary, Changes, Test plan) completed. Each section provides concrete, comprehensive details about the fix and validation.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical limitation in kube-ovn's MASTER_NODES configuration, which previously only supported a single IP, hindering proper OVN RAFT consensus in multi-master Kubernetes clusters. The changes introduce a flexible mechanism to either automatically detect control-plane node IPs from the Ansible inventory's server group or allow users to explicitly define them, thereby ensuring robust multi-master support for kube-ovn deployments.

Highlights

  • Multi-Master Support for Kube-OVN: Introduced a new cozystack_master_nodes variable, allowing explicit specification of control-plane node IPs for kube-ovn RAFT consensus.
  • Automatic IP Detection: Implemented logic to automatically detect MASTER_NODES from the Ansible server inventory group if cozystack_master_nodes is not explicitly set.
  • Template Update: Modified the kube-ovn Platform Package template to utilize the dynamically derived or explicitly provided MASTER_NODES value.
  • Expanded Example and Documentation: Updated the Ubuntu example inventory to a three-node control-plane setup and added comprehensive documentation for multi-master configuration in the README.md.
  • New CI Validation: Integrated a new CI job to validate the correct rendering of MASTER_NODES in the template across both single-node and multi-node inventories.


Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/test.yml

@lexfrei lexfrei marked this pull request as ready for review March 18, 2026 13:49

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the limitation of hardcoded MASTER_NODES by introducing a flexible mechanism to derive these nodes from the server inventory group or allow explicit override. The changes are well-implemented across the Ansible roles, including defaults, tasks, and templates. Comprehensive documentation updates in README.md and example inventory files (examples/ubuntu/inventory.yml) clearly explain the new functionality and usage. Furthermore, the addition of dedicated test playbooks (tests/test-master-nodes.yml and tests/test-multi-master-inventory.yml) ensures the correctness and robustness of the multi-master support. The update to tests/ci-inventory.yml correctly handles the CI environment's specific needs. Overall, this is a solid improvement that enhances the flexibility and reliability of kube-ovn deployments in multi-master clusters.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/test-master-nodes.yml (1)

63-72: Consider a more robust assertion for MASTER_NODES verification.

The assertion at line 67 uses to_json which adds JSON-style quoting. This works if the rendered YAML contains the value as a quoted string, but could be fragile if the template format changes. A more explicit check could parse the YAML and verify the actual value.

That said, for the current use case this approach is acceptable and functional.

♻️ Optional: More robust YAML parsing approach
     - name: Assert rendered MASTER_NODES contains all server IPs
       ansible.builtin.assert:
         that:
           - "'MASTER_NODES' in _rendered_content"
-          - "(_cozystack_master_nodes | to_json) in _rendered_content"
+          - "_cozystack_master_nodes in _rendered_content"
         fail_msg: >-
           Rendered template does not contain expected MASTER_NODES value.
-          Expected {{ _cozystack_master_nodes | to_json }} in output.
+          Expected '{{ _cozystack_master_nodes }}' in output.
       vars:
         _rendered_content: "{{ _rendered_manifest.content | b64decode }}"

Alternatively, if stricter validation is needed, consider using from_yaml filter to parse and extract the actual value.
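
Such a parsed-value check might look roughly like this (the `spec.values.MASTER_NODES` path is hypothetical and would need adjusting to the actual manifest layout; `_rendered_manifest` and `_cozystack_master_nodes` are the test's existing variables):

```yaml
- name: Assert parsed MASTER_NODES matches computed value
  ansible.builtin.assert:
    that:
      # Compare structured values rather than a JSON-quoted substring
      - _parsed.spec.values.MASTER_NODES == _cozystack_master_nodes
    fail_msg: >-
      Expected MASTER_NODES {{ _cozystack_master_nodes }},
      got {{ _parsed.spec.values.MASTER_NODES | default('missing') }}.
  vars:
    _parsed: "{{ _rendered_manifest.content | b64decode | from_yaml }}"
```

This stays correct even if the template switches quoting styles, since the comparison happens after YAML parsing.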

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test-master-nodes.yml` around lines 63 - 72, The current assert task
for MASTER_NODES is fragile because it checks the rendered string with
(_cozystack_master_nodes | to_json) inside _rendered_content; instead parse the
rendered manifest and compare structured values: b64decode
_rendered_manifest.content into _rendered_content, apply from_yaml to parse it,
then assert that the parsed data's MASTER_NODES key equals or contains the
_cozystack_master_nodes list; update the ansible.builtin.assert task to
reference the parsed variable (using from_yaml and b64decode) rather than a
quoted JSON string so the check is robust to formatting changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dcabcb0d-a4f9-4758-8598-96af4c9c9393

📥 Commits

Reviewing files that changed from the base of the PR and between 6eb8e37 and cfc72aa.

📒 Files selected for processing (9)
  • .github/workflows/test.yml
  • README.md
  • examples/ubuntu/inventory.yml
  • roles/cozystack/defaults/main.yml
  • roles/cozystack/tasks/main.yml
  • roles/cozystack/templates/platform-package.yml.j2
  • tests/ci-inventory.yml
  • tests/test-master-nodes.yml
  • tests/test-multi-master-inventory.yml

Move master node computation and validation to a dedicated task file
(compute-master-nodes.yml) so both the role and tests use the same
code path. Add IP address validation using ansible.utils.ipaddr to
catch hostname-based inventory keys at Ansible time instead of
runtime kube-ovn failures.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
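
That validation step could be sketched roughly as follows (error wording illustrative; assumes `_cozystack_master_nodes` is the computed comma-separated string):

```yaml
- name: Validate master node IPs are valid addresses
  ansible.builtin.assert:
    that:
      # ansible.utils.ipaddr returns a falsy value for non-IP strings
      - item | ansible.utils.ipaddr
    fail_msg: >-
      '{{ item }}' is not a valid IP address in MASTER_NODES.
      Set cozystack_master_nodes explicitly for hostname-based inventories.
  loop: "{{ _cozystack_master_nodes.split(',') | map('trim') | list }}"
```

The ipaddr filter requires the ansible.utils collection and the netaddr Python package, which the follow-up commits in this PR add to CI and requirements.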
@lexfrei lexfrei self-assigned this Mar 18, 2026
lexfrei added 2 commits March 18, 2026 22:46
…master auto-detection

The ipaddr filter requires ansible.utils collection and netaddr Python
package. Add both to CI and declare ansible.utils in requirements.yml.

Add a single-master test inventory with IP host key (127.0.0.1) to
exercise the auto-detection path that was previously untested.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Verify that IP validation correctly rejects hostname-based host keys
in the server group with a clear error message directing users to set
cozystack_master_nodes explicitly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@roles/cozystack/tasks/compute-master-nodes.yml`:
- Around line 7-10: The current lookup uses groups['server'] which will
hard-fail when the 'server' group is absent; change the compute expression to
safely default to an empty list before joining (e.g. use groups.get('server',
[]) or groups['server'] | default([])) so in the template block that references
cozystack_master_nodes and groups['server'] you replace the failing join with a
guarded join (e.g. groups.get('server', []) | join(',') or groups['server'] |
default([]) | join(',')), preserving the existing cozystack_master_nodes check
and assert flow.
- Around line 21-24: Add the missing ansible.utils collection to test
dependencies instead of changing filter syntax: update tests/requirements.yml
(and optionally examples/*/requirements.yml for consistency) to include
ansible.utils, keep the explicit ansible.utils.ipaddr usage in the task named
"Validate master node IPs are valid addresses" in
roles/cozystack/tasks/compute-master-nodes.yml, and adjust the loop handling
there to use whitespace-safe looping (e.g., proper loop/loop_control or
with_items handling) so the assert runs correctly for each item.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29293594-45eb-414c-bd88-3e1851fbf096

📥 Commits

Reviewing files that changed from the base of the PR and between cfc72aa and fb3131a.

📒 Files selected for processing (3)
  • roles/cozystack/tasks/compute-master-nodes.yml
  • roles/cozystack/tasks/main.yml
  • tests/test-master-nodes.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test-master-nodes.yml

Comment on lines +7 to +10
      {%- if cozystack_master_nodes | length > 0 -%}
        {{ cozystack_master_nodes }}
      {%- else -%}
        {{ groups['server'] | join(',') }}

⚠️ Potential issue | 🟡 Minor

Guard inventory lookups to avoid pre-assert crashes.

At Line 10, groups['server'] hard-fails if the group is absent, which bypasses your clearer assert message. Use safe defaults in the compute step.

Proposed hardening
     _cozystack_master_nodes: >-
-      {%- if cozystack_master_nodes | length > 0 -%}
+      {%- if (cozystack_master_nodes | default('')) | length > 0 -%}
         {{ cozystack_master_nodes }}
       {%- else -%}
-        {{ groups['server'] | join(',') }}
+        {{ groups.get('server', []) | join(',') }}
       {%- endif -%}

The ansible.utils.ipaddr filter requires the netaddr Python package
on the target node. Add it to the E2E job pip install.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/test.yml (1)

82-84: Prefer installing collections from requirements.yml to avoid CI drift.

At Line 83, installing only ansible.utils duplicates dependency declarations and can diverge from the repo’s canonical list.

💡 Proposed refactor
-      - name: Install required collections
-        run: ansible-galaxy collection install ansible.utils
+      - name: Install required collections
+        run: ansible-galaxy collection install --requirements-file requirements.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 82 - 84, The CI step named "Install
required collections" is hardcoding a single collection (ansible.utils) which
risks drift; change the step so it installs collections from the repository's
canonical requirements file instead of the single package—replace the
ansible-galaxy command that runs "ansible-galaxy collection install
ansible.utils" with a command that uses the requirements file (e.g.,
"ansible-galaxy collection install -r requirements.yml") and ensure the workflow
references the correct requirements.yml path so all declared collections are
installed consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 301c06a9-0d24-48c1-ab06-4c441883b9b3

📥 Commits

Reviewing files that changed from the base of the PR and between fb3131a and cb1edb1.

📒 Files selected for processing (4)
  • .github/workflows/test.yml
  • requirements.yml
  • tests/test-hostname-inventory.yml
  • tests/test-single-master-inventory.yml
✅ Files skipped from review due to trivial changes (1)
  • tests/test-single-master-inventory.yml

Comment on lines +100 to +108
      - name: Test hostname host keys are rejected
        run: |
          if ansible-playbook tests/test-master-nodes.yml \
            --inventory tests/test-hostname-inventory.yml 2>&1; then
            echo "ERROR: Expected failure for hostname host keys, but playbook succeeded"
            exit 1
          else
            echo "OK: Hostname host keys correctly rejected"
          fi

⚠️ Potential issue | 🟠 Major

Tighten negative test assertion to prevent false positives.

At Lines 102-108, any playbook failure is considered success. This can pass even when failure is unrelated to hostname-key validation.

✅ Proposed fix (assert failure reason)
       - name: Test hostname host keys are rejected
         run: |
-          if ansible-playbook tests/test-master-nodes.yml \
-            --inventory tests/test-hostname-inventory.yml 2>&1; then
+          set +e
+          output="$(ansible-playbook tests/test-master-nodes.yml \
+            --inventory tests/test-hostname-inventory.yml 2>&1)"
+          status=$?
+          set -e
+
+          if [ "$status" -eq 0 ]; then
             echo "ERROR: Expected failure for hostname host keys, but playbook succeeded"
             exit 1
-          else
-            echo "OK: Hostname host keys correctly rejected"
           fi
+
+          if ! grep -q "not a valid IP address in MASTER_NODES" <<<"$output"; then
+            echo "ERROR: Playbook failed, but not due to hostname/IP validation"
+            echo "$output"
+            exit 1
+          fi
+
+          echo "OK: Hostname host keys correctly rejected"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 100 - 108, The current negative test
treats any ansible-playbook failure as success; modify the step named "Test
hostname host keys are rejected" to capture the playbook output (run
ansible-playbook tests/test-master-nodes.yml --inventory
tests/test-hostname-inventory.yml) into a variable and assert the failure
contains the specific hostname-key validation message (e.g. grep/check for a
known substring like "host key validation" or the exact error emitted when
hostname keys are rejected) before declaring success; if the playbook fails but
the output does not contain that expected message, fail the job to avoid false
positives.

The role uses ansible.utils.ipaddr in compute-master-nodes.yml.
Without this dependency in galaxy.yml, users installing the collection
via ansible-galaxy would get a missing filter error.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei lexfrei merged commit 8a19589 into main Mar 18, 2026
6 checks passed