[DPE-8772] Stereo mode: add watcher role to the PostgreSQL charm #1516
marceloneppel wants to merge 78 commits into 16/edge
Conversation
Add a lightweight witness/voter charm that participates in Raft consensus to provide quorum in 2-node PostgreSQL clusters without storing any PostgreSQL data. Key components: - Watcher charm with Raft controller integration - Health checking for PostgreSQL endpoints - Relation interface (postgresql_watcher) for PostgreSQL operator - Topology and health check actions Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… pysyncobj Raft service Add standalone raft_service.py that implements KVStoreTTL-compatible Raft node managed as a systemd service, eliminating the dependency on the charmed-postgresql snap. Remove automatic health checks in favor of on-demand checks via action, since the watcher lacks PostgreSQL credentials. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…tereo mode tests Replace cut_network_from_unit_without_ip_change with cut_network_from_unit in stereo mode integration tests. The iptables-based approach with REJECT was still causing timeouts; removing the interface entirely triggers faster TCP connection failures. Added use_ip_from_inside=True for check_writes since restored units get new IPs. Also adds spread task for stereo mode tests. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add Raft member proactively during IP change to prevent race conditions where member restarts Patroni before being added to cluster. Implement watcher removal from Raft on relation departure to maintain correct quorum calculations. Add idempotency check before adding watcher to Raft. Use fresh peer IPs for Raft member addition instead of cached values. Update stereo mode tests with iptables-based network isolation and Raft health verification. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…o tests Build the watcher charm automatically if not found and deploy charms sequentially instead of concurrently to improve reliability. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Add idempotency check to skip deployment if already in expected state
- Clean up unexpected state before redeploying to avoid test pollution
- Add wait_for_idle after replica shutdown to allow cluster stabilization
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…fy_raft_cluster_health call
- Add use_ip_from_inside=True to test_watcher_network_isolation to handle stale IPs
- Fix verify_raft_cluster_health call in test_health_check_action to pass required arguments
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add __expire_keys and _onTick methods to WatcherKVStoreTTL to match Patroni's KVStoreTTL behavior. When the watcher becomes the Raft leader (e.g., when PostgreSQL primary is network-isolated), it must expire stale leader keys so that a replica can acquire leadership. Without this fix, the watcher would become Raft leader but wouldn't process TTL expirations, causing the old Patroni leader key to remain valid and preventing failover. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
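The TTL-expiration behavior described above can be sketched roughly as follows. This is a minimal illustration, not Patroni's actual `KVStoreTTL` or pysyncobj API; the class and method names here are made up for the example.

```python
# Hedged sketch of TTL-based key expiration. On the real Raft leader,
# expiring a stale /leader key is what lets a replica acquire leadership.
import time


class WatcherKVStore:
    def __init__(self):
        # key -> (value, absolute expiration timestamp, or None for no TTL)
        self._data = {}

    def set(self, key, value, ttl=None):
        expire_at = time.time() + ttl if ttl else None
        self._data[key] = (value, expire_at)

    def get(self, key):
        entry = self._data.get(key)
        return entry[0] if entry is not None else None

    def _expire_keys(self, now=None):
        # Drop every key whose TTL has elapsed; keys without a TTL are kept.
        now = now or time.time()
        expired = [k for k, (_, exp) in self._data.items() if exp and exp <= now]
        for key in expired:
            del self._data[key]

    def _on_tick(self):
        # Called periodically; only the Raft leader should expire keys.
        self._expire_keys()
```

Without a periodic expiration hook like `_on_tick`, the old leader key simply never disappears, which matches the failover hang the commit describes.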
Juju action results require hyphenated keys (e.g., 'healthy-count') rather than underscored keys. Fixed the health check action to use proper key format and updated test expectations. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
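A tiny illustrative helper for the key-format fix (the function name is hypothetical; the real change edits the action handler directly):

```python
# Juju validates action result keys; underscored keys like 'healthy_count'
# are rejected, so convert them to hyphenated form before set_results().
def to_action_results(results: dict) -> dict:
    return {key.replace("_", "-"): value for key, value in results.items()}
```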
…sues
- Add watcher PostgreSQL user for health check authentication:
- Create 'watcher' user with password via relation secret
- Add pg_hba.conf entry for watcher IP in patroni.yml template
- Pass password from relation secret to health checker
- Fix lint issues:
- Extract S3 initialization to _handle_s3_initialization() to reduce
_on_peer_relation_changed complexity from 11 to 10
- Use absolute paths for subprocess commands (/usr/bin/systemctl, etc.)
- Update type hints to use modern syntax (X | None vs Optional[X])
- Fix line length formatting issues
- Fix unit test failures:
- Add missing mocks in test_update_member_ip for endpoint methods
- Add _units_ips mock in test_update_relation_data_leader
- Fix integration test:
- Add check_watcher_ip parameter to verify_raft_cluster_health()
to handle watcher IP changes after network isolation tests
- Update watcher charm to handle IP changes:
- Add _update_unit_address_if_changed() for IP change detection
- Call from config-changed and update-status events
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
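The pg_hba.conf entry mentioned above would look roughly like the fragment below. This is illustrative only: the `<watcher-ip>` placeholder, the `all` database field, and the `scram-sha-256` auth method are assumptions about what the patroni.yml template renders.

```
# Illustrative pg_hba.conf entry: allow the 'watcher' user to connect
# from the watcher unit's IP only.
host  all  watcher  <watcher-ip>/32  scram-sha-256
```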
Remove outdated constraint about deploy order being critical for stereo mode with Raft DCS. Testing confirmed that 2 PostgreSQL units can now be deployed simultaneously without causing split-brain. Also update deprecated relate() calls to integrate(). Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
* add new-tab-link extension and increase linkcheck timeout
* replace mentions of old Juju password actions with Juju secrets
* update links to 16 repo and remove mention of 14 bundle
* update instructions for secrets retrieval
Signed-off-by: andreia <andreia.velasco@canonical.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* refactor home page
* fix missing refs
* add new stable releases to releases.md
* invert order (newest to oldest)
* Update release in refresh docs
* correct architecture for 990, 989
* correct arch for 952, 951
Co-authored-by: Carl Csaposs <carl.csaposs@canonical.com>
Integrate the watcher charm as a mode within the main PostgreSQL charm, following the MongoDB pattern of using a config `role` option to alternate between "postgresql" (default) and "watcher" modes.
Key changes:
- Add `role` config option (postgresql|watcher), immutable after deploy
- Rename provides relation `watcher` to `watcher-offer` for PostgreSQL mode
- Add requires relation `watcher` for watcher mode
- Branch charm __init__ based on role: watcher mode skips snap install, Patroni, backups, TLS, etc. and only runs Raft + health checker
- Move watcher source files (raft_controller, raft_service, watcher_health) into main src/
- Create WatcherRequirerHandler for watcher-mode event handling
- Persist role in peer databag and block on role change attempts
- Update integration tests for unified charm deployment
Deploy example:
juju deploy postgresql pg
juju deploy postgresql pg-watcher --config role=watcher
juju relate pg:watcher-offer pg-watcher:watcher
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
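The role-branching idea can be sketched as below. The function name and the exact component lists are illustrative; the real charm branches inside `__init__` using the ops framework and reads the role from `self.config`.

```python
# Minimal sketch: which subsystems each role initializes.
VALID_ROLES = ("postgresql", "watcher")


def components_for_role(role: str) -> list[str]:
    if role not in VALID_ROLES:
        raise ValueError(f"invalid role: {role}")
    if role == "watcher":
        # Watcher mode skips the PostgreSQL workload entirely and runs
        # only the Raft controller plus the health checker.
        return ["raft_controller", "health_checker"]
    # Full PostgreSQL mode: snap install, Patroni, backups, TLS, Raft.
    return ["snap", "patroni", "backups", "tls", "raft_controller"]
```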
The @trace_charm decorator expects tracing_endpoint attribute to exist after __init__. In watcher mode we return early, so set it to None. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
* Limit bucket listing to find the timelines
* Add ceph pitr test
* Switch back to recurse
* Refactor tests
* Fix imports
* Fix tests
* Reduce boto logs
* Typo
* Cleanup config code
* Merge update sync config in the bulk patch call
* Add storage-hot-standby-feedback and durability-maximum-lag-on-failover
* Fix default
* Remove extra patch
* Update to spec
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Move TLS transfer to single kernel
* Switch to released lib
* add instructions for custom usernames to integration guide
* Update docs/how-to/integrate-with-another-application.md
Signed-off-by: Andreia <andreia.velasco@canonical.com>
Co-authored-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…ailable) (#1318)
* DPE-8980 Support Juju 4: use the 'ip' databag field (overwrites 'private-address')
Juju 4 has removed support for the databag fields `private-address`, `ingress-address`, and others; the field to use is now `ip`. The PG16 charm still has to support Juju 3.6 LTS, so this adds support for the `ip` field with backward compatibility. Users can deploy it on Juju 4 using:
> juju deploy postgresql --channel 16/edge --force
* Address comments in PR
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Signed-off-by: Andreia <andreia.velasco@canonical.com>
* Test app channel and base/series
* Switch from base to series
* Switch bases to series
- Handle existing relations gracefully in test_build_and_deploy_stereo_mode - Update charm base from Ubuntu 22.04 to 24.04 Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…nces Enable watcher charm to connect to multiple PostgreSQL clusters with dynamic port allocation, isolated data directories, and AZ-aware deployment blocking to prevent split-brain scenarios. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
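The dynamic port allocation and data-directory isolation could look roughly like the sketch below. The base port, directory layout, and function name are all assumptions for illustration, not the charm's actual values.

```python
# Hedged sketch: give each related PostgreSQL cluster its own Raft port
# and an isolated data directory, deterministically by cluster id.
BASE_PORT = 2222  # hypothetical base port; the real charm may differ


def allocate(cluster_ids: list[str], base_port: int = BASE_PORT) -> dict[str, dict]:
    slots = {}
    for index, cluster_id in enumerate(sorted(cluster_ids)):
        slots[cluster_id] = {
            "port": base_port + index,
            "data_dir": f"/var/lib/watcher/{cluster_id}",
        }
    return slots
```

Sorting the cluster ids keeps the allocation stable across hook invocations, so a cluster keeps its port as long as the set of relations is unchanged.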
    content = secret.get_content(refresh=True)
    return content.get("raft-password")
except SecretNotFoundError:
    logger.warning(f"Secret {secret_id} not found")

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
    content = secret.get_content(refresh=True)
    return content.get("watcher-password")
except SecretNotFoundError:
    logger.warning(f"Secret {secret_id} not found")

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
…d-charm Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Codecov Report
❌ Your project check has failed because the head coverage (64.79%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## 16/edge #1516 +/- ##
===========================================
- Coverage 70.43% 64.79% -5.64%
===========================================
Files 15 19 +4
Lines 4282 5545 +1263
Branches 694 889 +195
===========================================
+ Hits 3016 3593 +577
- Misses 1057 1675 +618
- Partials 209 277 +68
…d-charm Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Use charmed-postgresql snap's patroni_raft_controller instead of custom pysyncobj implementation for wire compatibility with Patroni. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Replace local snap bundling with snap store installation using snap charm library. Removes watcher-snap build part and subprocess calls in favor of cleaner snap.SnapCache API. Install from channel 16/edge/neppel. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…and Juju spaces Add integration tests verifying: - Single watcher serving two PostgreSQL clusters with async replication - Stereo mode deployment across separate Juju spaces for network isolation - Cross-space Raft consensus and failover with space-bound Patroni API Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…d-charm Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Remove duplicate postgresql-watcher/ standalone charm; consolidate all watcher logic into the main charm's role=watcher mode
- Fix Raft config change detection so partner/password changes trigger restarts
- Clean up systemd service on relation-broken (stop → disable → remove)
- Add early role-immutability validation in __init__ before mode init
- Propagate per-unit AZ and IP data from all units, not just leader
- Serialize action results as JSON to avoid Juju key validation errors
- Infer cluster role (primary/standby/unknown) from health check data
- Wire standby cluster linking into cluster-set status output
- Return False from _install_service() when daemon-reload fails
- Guard against None unit_ip producing "None:port" in topology
- Fix SNAP_CHANNEL from dev-specific "16/edge/neppel" to "16/edge"
- Rename show-topology action to get-cluster-status with new params
- Add unit tests for Raft controller, watcher requirer, and role validation
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
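The JSON serialization of action results can be sketched like this (the `topology_result` helper is illustrative; the real change wraps the action handler's output):

```python
# Juju validates action result keys, so nested/dynamic structures such as
# per-member topology are safest packed into a single JSON string value.
import json


def topology_result(topology: dict) -> dict:
    return {"topology": json.dumps(topology)}
```

The caller then parses the `topology` value with `json.loads` instead of relying on Juju to accept arbitrary nested keys.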
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use configured self_addr first, then 127.0.0.1:<port> when querying watcher Raft status. This avoids a false waiting state when local administration is reachable only via loopback. Add unit test coverage for fallback behavior after self_addr probe failure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The snap app profile for charmed-postgresql.patroni-raft-controller has no network permissions, so the process starts but never binds a socket. Use the patroni app profile to launch patroni_raft_controller with network-bind access. Add a unit test to ensure the generated service uses the patroni profile ExecStart command. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d-charm Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…data, and block PG-only actions
- Exclude watcher from Raft voting when the PostgreSQL unit count is odd; adding it in that case would produce an even total Raft membership, degrading partition tolerance. The watcher is dynamically added/removed as the PG unit count changes between even and odd.
- Publish watcher-voting, timeline, per-member lag, and tls-enabled in relation data so the requirer can report accurate cluster status without querying Patroni separately.
- Register handlers for PG-specific actions (create-backup, restore, etc.) on watcher units so they fail with a clear, human-readable message instead of a generic Juju "action not found" error.
- Rename get-cluster-status action param `cluster-set` → `standby-clusters`.
- Add integration tests for odd-count Raft exclusion and action blocking.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
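The parity rule above reduces to a one-line predicate (the function name is illustrative): with an even number of PostgreSQL units the watcher's vote makes total Raft membership odd, which is what you want; with an odd count its vote would make membership even.

```python
# Watcher votes only when the PostgreSQL unit count is even, so total
# Raft membership (PG units + watcher) stays odd.
def watcher_should_vote(pg_unit_count: int) -> bool:
    return pg_unit_count % 2 == 0
```

For example, 2 PG units + voting watcher = 3 voters (good); 3 PG units already form an odd quorum on their own, so the watcher abstains.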
Extract watcher voting, member-lag parsing, topology building, TLS check, and timeline parsing into dedicated methods. Also remove unused RetryError import and fix minor whitespace issues in integration tests. Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Summary
This PR turns the earlier stereo-mode watcher proof of concept into a deployment mode of the main `postgresql` charm. The same charm can now be deployed either as a full PostgreSQL unit or as a lightweight watcher/witness by setting `role=watcher` at deploy time. This keeps stereo mode inside a single charm codebase while also expanding watcher support for multi-cluster and async-replication scenarios.
The implementation includes:
- `role` config with `postgresql` and `watcher` modes
- relations `watcher-offer` and `watcher`
- `raft_controller` services with dynamic port allocation
- `get-cluster-status` and `trigger-health-check` actions

How It Works
In PostgreSQL mode, the charm provides the `watcher-offer` relation. `role=watcher` skips the PostgreSQL workload and runs only the watcher logic.

Deployment
For a local or single-host LXD demo, deploy the watcher with `profile=testing` so the AZ co-location safety check does not block the watcher when all units land in the same availability zone.

For production deployments, keep `profile=production` and place the watcher in a different availability zone from the PostgreSQL units. If the watcher is co-located in the same AZ, it will enter `blocked` with an `AZ co-location` status by design.

A single watcher deployment can also relate to more than one PostgreSQL cluster, which is covered by the async-replication stereo-mode tests added in this branch.
Test Coverage
Checklist
Supersedes #1401.