
[DPE-8772] Stereo mode: add watcher role to the PostgreSQL charm #1516

Draft
marceloneppel wants to merge 78 commits into 16/edge from stereo-mode-unified-charm

Conversation

@marceloneppel
Member

@marceloneppel marceloneppel commented Mar 10, 2026

Summary

This PR turns the earlier stereo-mode watcher proof of concept into a deployment mode of the main postgresql charm. The same charm can now be deployed either as a full PostgreSQL unit or as a lightweight watcher/witness by setting role=watcher at deploy time.

This keeps stereo mode inside a single charm codebase while also expanding watcher support for multi-cluster and async-replication scenarios.

The implementation includes:

  • An immutable deploy-time role config with postgresql and watcher modes
  • Unified watcher relation support in the main charm via watcher-offer and watcher
  • Patroni-compatible per-relation raft_controller services with dynamic port allocation
  • Watcher-side get-cluster-status and trigger-health-check actions
  • Richer relation data for timeline, per-member lag, TLS status, standby clusters, and watcher voting
  • Clear failure messages for PostgreSQL-only actions run against watcher units
  • New unit, integration, and spread coverage for stereo mode, async replication, and Juju spaces

How It Works

┌──────────────────────────┐      watcher-offer / watcher      ┌──────────────────────────┐
│ PostgreSQL cluster       │◄─────────────────────────────────►│ postgresql               │
│ role=postgresql          │                                   │ role=watcher             │
│ - Patroni + PostgreSQL   │                                   │ - Raft witness only      │
│ - Raft voter             │                                   │ - No PostgreSQL server   │
└──────────────────────────┘                                   └──────────────────────────┘
  • The default deployment keeps the normal PostgreSQL workload and exposes watcher-offer.
  • A deployment with role=watcher skips the PostgreSQL workload and runs only the watcher logic.
  • Each watcher relation gets its own Patroni-compatible Raft controller instance, data directory, and port.
  • The watcher is dynamically excluded from voting when the PostgreSQL unit count is odd, avoiding an even-sized Raft membership that would reduce partition tolerance.
  • The watcher publishes and consumes enough relation data to report cluster topology, health, and async-replication context without requiring separate manual Patroni queries.
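The odd/even voting rule above can be sketched as follows (the function name and exact mechanism are illustrative, not the charm's actual API):

```python
def watcher_should_vote(pg_unit_count: int) -> bool:
    """Return True when adding the watcher keeps total Raft membership odd.

    With an odd number of PostgreSQL units, adding the watcher as a voter
    would make the total membership even, which reduces partition tolerance;
    in that case the watcher abstains. With an even count, the watcher's
    vote restores an odd-sized quorum.
    """
    return pg_unit_count % 2 == 0

# 2 PG units + watcher -> 3 voters: the watcher votes.
assert watcher_should_vote(2) is True
# 3 PG units already form an odd quorum: the watcher abstains.
assert watcher_should_vote(3) is False
```

The charm applies this dynamically: as PostgreSQL units are added or removed, the watcher is added to or removed from the voting membership accordingly.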

Deployment

For a local or single-host LXD demo, deploy the watcher with profile=testing so the AZ co-location safety check does not block the watcher when all units land in the same availability zone.

# Deploy a 2-node PostgreSQL cluster
juju deploy postgresql pg --channel 16/edge --base ubuntu@24.04 --num-units 2

# Deploy the same charm as a watcher
# Use profile=testing for local demos where all units may share the same AZ.
juju deploy postgresql pg-watcher --channel 16/edge --base ubuntu@24.04 \
  --config role=watcher \
  --config profile=testing

# Relate PostgreSQL to the watcher
juju integrate pg:watcher-offer pg-watcher:watcher

# Inspect watcher-reported status
juju run pg-watcher/0 get-cluster-status
juju run pg-watcher/0 trigger-health-check

For production deployments, keep profile=production and place the watcher in a different availability zone from the PostgreSQL units. If the watcher is co-located in the same AZ, it enters the blocked state with an AZ co-location message, by design.

A single watcher deployment can also relate to more than one PostgreSQL cluster, which is covered by the async-replication stereo-mode tests added in this branch.
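Relating a second cluster to the same watcher follows the same pattern as above (the application name pg2 is illustrative):

```shell
# Deploy a second PostgreSQL cluster (name "pg2" is illustrative)
juju deploy postgresql pg2 --channel 16/edge --base ubuntu@24.04 --num-units 2

# The watcher endpoint accepts multiple relations; each relation gets its
# own Raft controller instance, data directory, and port on the watcher.
juju integrate pg2:watcher-offer pg-watcher:watcher
```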

Test Coverage

  • Unit tests for role validation, watcher relation handling, watcher requirer behaviour, and Raft controller service generation
  • Integration tests for stereo mode deployment, failover, watcher actions, odd/even voting behaviour, async replication, and cross-space networking
  • Spread coverage for stereo mode

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

Supersedes #1401.

marceloneppel and others added 30 commits January 27, 2026 08:54
Add a lightweight witness/voter charm that participates in Raft
consensus to provide quorum in 2-node PostgreSQL clusters without
storing any PostgreSQL data.

Key components:
- Watcher charm with Raft controller integration
- Health checking for PostgreSQL endpoints
- Relation interface (postgresql_watcher) for PostgreSQL operator
- Topology and health check actions

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
… pysyncobj Raft service

Add standalone raft_service.py that implements KVStoreTTL-compatible
Raft node managed as a systemd service, eliminating the dependency on
the charmed-postgresql snap. Remove automatic health checks in favor of
on-demand checks via action, since the watcher lacks PostgreSQL credentials.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…tereo mode tests

Replace cut_network_from_unit_without_ip_change with cut_network_from_unit
in stereo mode integration tests. The iptables-based approach with REJECT
was still causing timeouts; removing the interface entirely triggers faster
TCP connection failures. Added use_ip_from_inside=True for check_writes
since restored units get new IPs. Also adds spread task for stereo mode tests.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add Raft member proactively during IP change to prevent race conditions
where member restarts Patroni before being added to cluster. Implement
watcher removal from Raft on relation departure to maintain correct
quorum calculations. Add idempotency check before adding watcher to Raft.
Use fresh peer IPs for Raft member addition instead of cached values.
Update stereo mode tests with iptables-based network isolation and Raft
health verification.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…o tests

Build the watcher charm automatically if not found and deploy charms
sequentially instead of concurrently to improve reliability.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Add idempotency check to skip deployment if already in expected state
- Clean up unexpected state before redeploying to avoid test pollution
- Add wait_for_idle after replica shutdown to allow cluster stabilization

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…fy_raft_cluster_health call

- Add use_ip_from_inside=True to test_watcher_network_isolation to handle stale IPs
- Fix verify_raft_cluster_health call in test_health_check_action to pass required arguments

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add __expire_keys and _onTick methods to WatcherKVStoreTTL to match
Patroni's KVStoreTTL behavior. When the watcher becomes the Raft leader
(e.g., when PostgreSQL primary is network-isolated), it must expire
stale leader keys so that a replica can acquire leadership.

Without this fix, the watcher would become Raft leader but wouldn't
process TTL expirations, causing the old Patroni leader key to remain
valid and preventing failover.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Juju action results require hyphenated keys (e.g., 'healthy-count')
rather than underscored keys. Fixed the health check action to use
proper key format and updated test expectations.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
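A small helper in this spirit (the function name is hypothetical; the key-naming constraint on Juju action results is the point):

```python
def juju_safe_keys(results: dict) -> dict:
    """Convert underscored result keys to the hyphenated form Juju action
    results require (e.g. 'healthy_count' -> 'healthy-count')."""
    return {key.replace("_", "-"): value for key, value in results.items()}

assert juju_safe_keys({"healthy_count": 2, "total_count": 3}) == {
    "healthy-count": 2,
    "total-count": 3,
}
```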
…sues

- Add watcher PostgreSQL user for health check authentication:
  - Create 'watcher' user with password via relation secret
  - Add pg_hba.conf entry for watcher IP in patroni.yml template
  - Pass password from relation secret to health checker

- Fix lint issues:
  - Extract S3 initialization to _handle_s3_initialization() to reduce
    _on_peer_relation_changed complexity from 11 to 10
  - Use absolute paths for subprocess commands (/usr/bin/systemctl, etc.)
  - Update type hints to use modern syntax (X | None vs Optional[X])
  - Fix line length formatting issues

- Fix unit test failures:
  - Add missing mocks in test_update_member_ip for endpoint methods
  - Add _units_ips mock in test_update_relation_data_leader

- Fix integration test:
  - Add check_watcher_ip parameter to verify_raft_cluster_health()
    to handle watcher IP changes after network isolation tests

- Update watcher charm to handle IP changes:
  - Add _update_unit_address_if_changed() for IP change detection
  - Call from config-changed and update-status events

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Remove outdated constraint about deploy order being critical for
stereo mode with Raft DCS. Testing confirmed that 2 PostgreSQL
units can now be deployed simultaneously without causing split-brain.

Also update deprecated relate() calls to integrate().

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
* add new-tab-link extension and increase linkcheck timeout

Signed-off-by: andreia <andreia.velasco@canonical.com>

* replace mentions of old Juju password actions with Juju secrets

Signed-off-by: andreia <andreia.velasco@canonical.com>

* update links to 16 repo and remove mention of 14 bundle

Signed-off-by: andreia <andreia.velasco@canonical.com>

* update instructions for secrets retrieval

---------

Signed-off-by: andreia <andreia.velasco@canonical.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* refactor home page

* fix missing refs
* add new stable releases to releases.md

* invert order (newest to oldest)

* Update release in refresh docs

* correct architecture for 990, 989

* correct arch for 952, 951

---------

Co-authored-by: Carl Csaposs <carl.csaposs@canonical.com>
Integrate the watcher charm as a mode within the main PostgreSQL charm,
following the MongoDB pattern of using a config `role` option to alternate
between "postgresql" (default) and "watcher" modes.

Key changes:
- Add `role` config option (postgresql|watcher), immutable after deploy
- Rename provides relation `watcher` to `watcher-offer` for PostgreSQL mode
- Add requires relation `watcher` for watcher mode
- Branch charm __init__ based on role: watcher mode skips snap install,
  Patroni, backups, TLS, etc. and only runs Raft + health checker
- Move watcher source files (raft_controller, raft_service, watcher_health)
  into main src/
- Create WatcherRequirerHandler for watcher-mode event handling
- Persist role in peer databag and block on role change attempts
- Update integration tests for unified charm deployment

Deploy example:
  juju deploy postgresql pg
  juju deploy postgresql pg-watcher --config role=watcher
  juju relate pg:watcher-offer pg-watcher:watcher

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
The @trace_charm decorator expects tracing_endpoint attribute to exist
after __init__. In watcher mode we return early, so set it to None.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
* Limit bucket listing to find the timelines

* Add ceph pitr test

* Switch back to recurse

* Refactor tests

* Fix imports

* Fix tests

* Reduce boto logs

* Typo
* Cleanup config code

* Merge update sync config in the bulk patch call

* Add storage-hot-standby-feedback and durability-maximum-lag-on-failover

* Fix default

* Remove extra patch

* Update to spec
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Move TLS transfer to single kernel

* Switch to released lib
* add instructions for custom usernames to integration guide

* Update docs/how-to/integrate-with-another-application.md

Co-authored-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Andreia <andreia.velasco@canonical.com>

---------

Signed-off-by: Andreia <andreia.velasco@canonical.com>
Co-authored-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…ailable) (#1318)

* DPE-8980 Support Juju 4: use 'ip' databag field (overwrites 'private-address')

Juju 4 has removed support for the databag fields `private-address`, `ingress-address`, and others; the field to use is now `ip`. The PG16 charm still has to support Juju 3.6 LTS, so support for the `ip` field is added with backward compatibility.

Users can deploy it on Juju 4 using:
> juju deploy postgresql --channel 16/edge --force

* Address comments in PR
renovate bot and others added 8 commits March 10, 2026 16:11
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Signed-off-by: Andreia <andreia.velasco@canonical.com>
* Test app channel and base/series

* Switch from base to series

* Switch bases to series
- Handle existing relations gracefully in test_build_and_deploy_stereo_mode
- Update charm base from Ubuntu 22.04 to 24.04

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…nces

Enable watcher charm to connect to multiple PostgreSQL clusters with dynamic
port allocation, isolated data directories, and AZ-aware deployment blocking
to prevent split-brain scenarios.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
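The per-relation dynamic port allocation could be sketched like this (the base port and function name are illustrative assumptions, not the charm's actual scheme):

```python
BASE_PORT = 2222  # illustrative base; the charm's actual scheme may differ

def allocate_port(existing_ports: set[int], base: int = BASE_PORT) -> int:
    """Pick the lowest free port at or above the base for a new watcher
    relation's Raft controller instance, skipping ports already assigned
    to other relations."""
    port = base
    while port in existing_ports:
        port += 1
    return port

assert allocate_port(set()) == 2222
assert allocate_port({2222, 2223}) == 2224
```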
CodeQL code scanning reported two High "Clear-text logging of sensitive information" failures, one in each secret-retrieval helper ("raft-password" and "watcher-password"); both follow the same pattern:

    content = secret.get_content(refresh=True)
    return content.get("raft-password")  # or "watcher-password"
    except SecretNotFoundError:
        logger.warning(f"Secret {secret_id} not found")

Check failure — Code scanning / CodeQL: this expression logs sensitive data (secret) as clear text.
…d-charm

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@github-actions github-actions bot added the Libraries: OK The charm libs used are OK and in-sync label Mar 10, 2026
@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 45.48311% with 694 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.79%. Comparing base (95a5784) to head (2a88644).

Files with missing lines Patch % Lines
src/relations/watcher_requirer.py 42.79% 238 Missing and 16 partials ⚠️
src/relations/watcher.py 54.39% 141 Missing and 20 partials ⚠️
src/raft_controller.py 50.00% 107 Missing and 15 partials ⚠️
src/watcher_health.py 28.20% 55 Missing and 1 partial ⚠️
src/charm.py 45.16% 37 Missing and 14 partials ⚠️
src/cluster.py 6.00% 47 Missing ⚠️
src/relations/async_replication.py 40.00% 1 Missing and 2 partials ⚠️

❌ Your project check has failed because the head coverage (64.79%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (95a5784) and HEAD (2a88644). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (95a5784) HEAD (2a88644)
2 1
Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1516      +/-   ##
===========================================
- Coverage    70.43%   64.79%   -5.64%     
===========================================
  Files           15       19       +4     
  Lines         4282     5545    +1263     
  Branches       694      889     +195     
===========================================
+ Hits          3016     3593     +577     
- Misses        1057     1675     +618     
- Partials       209      277      +68     

☔ View full report in Codecov by Sentry.

marceloneppel and others added 9 commits March 16, 2026 09:02
…d-charm

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Use charmed-postgresql snap's patroni_raft_controller instead of
custom pysyncobj implementation for wire compatibility with Patroni.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Replace local snap bundling with snap store installation using
snap charm library. Removes watcher-snap build part and subprocess
calls in favor of cleaner snap.SnapCache API. Install from
channel 16/edge/neppel.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…and Juju spaces

Add integration tests verifying:
- Single watcher serving two PostgreSQL clusters with async replication
- Stereo mode deployment across separate Juju spaces for network isolation
- Cross-space Raft consensus and failover with space-bound Patroni API

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…d-charm

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Remove duplicate postgresql-watcher/ standalone charm; consolidate
  all watcher logic into the main charm's role=watcher mode
- Fix Raft config change detection so partner/password changes trigger
  restarts
- Clean up systemd service on relation-broken (stop → disable → remove)
- Add early role-immutability validation in __init__ before mode init
- Propagate per-unit AZ and IP data from all units, not just leader
- Serialize action results as JSON to avoid Juju key validation errors
- Infer cluster role (primary/standby/unknown) from health check data
- Wire standby cluster linking into cluster-set status output
- Return False from _install_service() when daemon-reload fails
- Guard against None unit_ip producing "None:port" in topology
- Fix SNAP_CHANNEL from dev-specific "16/edge/neppel" to "16/edge"
- Rename show-topology action to get-cluster-status with new params
- Add unit tests for Raft controller, watcher requirer, and role
  validation

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@marceloneppel marceloneppel changed the title [WIP] Stereo mode unified charm [DPE-8772] [WIP] Stereo mode unified charm Apr 8, 2026
@marceloneppel marceloneppel added the enhancement New feature, UI change, or workload upgrade label Apr 8, 2026
marceloneppel and others added 4 commits April 8, 2026 10:45
Use configured self_addr first, then 127.0.0.1:<port> when querying watcher Raft status. This avoids a false waiting state when local administration is reachable only via loopback.

Add unit test coverage for fallback behavior after self_addr probe failure.
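The fallback order could be sketched as follows (the bare TCP probe and function name are illustrative; the charm actually queries the Raft controller's status endpoint):

```python
import socket

def probe_raft_endpoint(self_addr, port: int) -> str:
    """Try the configured self_addr first, then fall back to loopback.

    Returns the first address that accepts a connection; this mirrors the
    commit's intent of avoiding a false waiting state when the controller
    is reachable only via 127.0.0.1.
    """
    candidates = [addr for addr in (self_addr, "127.0.0.1") if addr]
    for host in candidates:
        try:
            with socket.create_connection((host, port), timeout=1):
                return host
        except OSError:
            continue
    raise ConnectionError("Raft controller unreachable on all candidate addresses")
```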

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The snap app profile for charmed-postgresql.patroni-raft-controller has no network permissions, so the process starts but never binds a socket. Use the patroni app profile to launch patroni_raft_controller with network-bind access.

Add a unit test to ensure the generated service uses the patroni profile ExecStart command.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d-charm

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…data, and block PG-only actions

- Exclude watcher from Raft voting when the PostgreSQL unit count is odd;
  adding it in that case would produce an even total Raft membership,
  degrading partition tolerance. The watcher is dynamically added/removed
  as the PG unit count changes between even and odd.
- Publish watcher-voting, timeline, per-member lag, and tls-enabled in
  relation data so the requirer can report accurate cluster status without
  querying Patroni separately.
- Register handlers for PG-specific actions (create-backup, restore, etc.)
  on watcher units so they fail with a clear, human-readable message
  instead of a generic Juju "action not found" error.
- Rename get-cluster-status action param `cluster-set` → `standby-clusters`.
- Add integration tests for odd-count Raft exclusion and action blocking.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel marceloneppel changed the title [DPE-8772] [WIP] Stereo mode unified charm [DPE-8772] Add watcher role to the PostgreSQL charm for stereo mode Apr 14, 2026
@marceloneppel marceloneppel changed the title [DPE-8772] Add watcher role to the PostgreSQL charm for stereo mode [DPE-8772] Stereo mode: add watcher role to the PostgreSQL charm Apr 14, 2026
Extract watcher voting, member-lag parsing, topology building, TLS check,
and timeline parsing into dedicated methods. Also remove unused RetryError
import and fix minor whitespace issues in integration tests.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Labels

enhancement New feature, UI change, or workload upgrade Libraries: OK The charm libs used are OK and in-sync
