Conversation
dc5b671 to
44f666d
Compare
44f666d to
df7c132
Compare
Introduce a state-machine-based approach to managing LSPS2 JIT channel sessions. The FSM tracks payment collection from initial channel open through HTLC forwarding to completion, replacing the previous ad-hoc state tracking. Also adds Sum trait for Msat and PartialEq for protocol types needed by the FSM. Changelog-Experimental: LSPS2 session state machine for JIT channels
Introduce a session actor that runs the FSM in an async task and communicates side effects through an ActionExecutor trait. This separates state machine logic from I/O concerns like RPC calls and datastore writes.
Add SessionManager that routes incoming HTLCs to the correct session actor by payment hash, replacing the previous handler-based approach. Reworks the policy plugin API and integrates the CLN RPC executor, unifies HTLC handling into the session FSM, and removes the now deprecated handler.rs.
Add integration tests covering the full session lifecycle: channel opening, HTLC forwarding, payment collection, and session completion.
Implement crash recovery for LSPS2 sessions so that in-progress JIT channel sessions survive plugin restarts. Adds recovery traits and datastore methods, a RecoveryProvider implementation for ClnApiRpc, forward monitoring for recovered sessions, and integration tests for recovery scenarios. Makes broadcast_tx and abandon_session idempotent to handle replayed actions safely.
Reduce DatastoreProvider from many methods to 5, with the actor owning the DatastoreEntry and driving all writes through the actor loop. This makes the datastore boundary simpler and testable.
Add an EventSink trait that decouples session event reporting from the transport layer. Includes a composite sink and a channel-based implementation. Wires EventSink through SessionActor and SessionManager, and persists payment_hash in DatastoreEntry.
Replace CLN-specific types (cln_rpc PublicKey, ShortChannelId alias) with standalone alternatives, feature-gate CLN dependencies behind a "cln" feature flag, split ClnApiRpc into focused adapter structs, and refactor Lsps2ServiceHandler generics for cleaner trait boundaries. This makes the lsps2 core reusable outside of CLN.
Merge BlockheightProvider into Lsps2PolicyProvider, extract check_cltv_timeout helper in the session FSM, flatten recovery branching in SessionManager, simplify the actor loop with convert_input and tokio::select!, and remove the unused CollectTimeout ActorInput variant.
We actually only use this in tests Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>
After restart, recovered session actors were stored in a separate recovery_handles Vec, unreachable by the forward_event notification path that routes via the sessions HashMap. This caused intermittent CI failures where on_payment_settled could not find the session and the internal forward-monitoring loop failed to detect settlement. Register recovered sessions in the sessions HashMap keyed by payment_hash so forward_event notifications reach them directly. For already-settled forwards, recover into Broadcasting state so the actor self-drives to completion without needing forward_event re-delivery. Remove the now-redundant internal polling loop (get_forward_activity + wait_for_forward_resolution).
73b18ed to
da925c6
Compare
|
Cache cleared, and restarting the CI while I review the code 👍 |
|
Nice PR, maybe a bit on the long side, and a bit of duplication, but the architecture is nice. ACK |
| #[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)] | ||
| pub enum Error { | ||
| #[error("variable amount payments are not supported")] | ||
| UnimplementedVarAmount, |
There was a problem hiding this comment.
Interesting, I didn't know that LSPS2 has a varamount mode. How is the signalling handled? I.e., how does the LSP learn the total amount, and the fee amount it is allowed to retain?
| )] | ||
| InsufficientDeductibleCapacity { | ||
| opening_fee_msat: u64, | ||
| deductible_capacity_msat: u128, |
|
|
||
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| pub struct PaymentPart { | ||
| pub htlc_id: u64, |
There was a problem hiding this comment.
Just a quick doubt I had: this refers to which ID? As far as I remember the numbering of HTLCs was using a composite (channel_id, htlc_id) key because the protocol insists on numbering HTLCs, making either an alias necessary or the composite key to make them unique.
This is the DB HTLC ID that counts up monotonically, not the protocol ID which is per-channel, correct?
| } | ||
|
|
||
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| pub enum SessionEvent { |
There was a problem hiding this comment.
That's quite the extensive list of potential errors and outcomes, nice listing 👍
| } | ||
| } | ||
|
|
||
| pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> { |
There was a problem hiding this comment.
You'll probably want to break this up into a dispatch method (containing the match (field_a, field_b, input)) and several handler methods that take the involved parts and pieces of information as explicit parameters. That'll help get a quick overview, and allow diving deep on specific operations if interested.
| scid: ShortChannelId, | ||
| datastore: D, | ||
| ) -> ActorInboxHandle { | ||
| let (tx, inbox) = mpsc::channel(128); // Should we use max_htlcs? |
There was a problem hiding this comment.
The backlog is mostly intended for bursty behavior, and should be set to the maximum number of events in flight. If there is no more room, we will drop block the sending side, no messages should be lost. If the sender does not need to make progress, and the expectation is that we can process messages in the order they arrive (no interleaving) it should be safe to set the backlog to 1. Don't think about elements queued up, rather consider if you need to make progress on the sendign side at all, while elements are being processed.
| } | ||
| } | ||
|
|
||
| fn execute_action(&mut self, action: SessionAction) { |
There was a problem hiding this comment.
Love the match () {} matrix, less so the deep nesting, as it pulls the matrix apart and makes reasoning about its relations and transitions harder :-)
| return Ok(serde_json::json!({ | ||
| "result": "continue", | ||
| "mindepth": 0, | ||
| "reserve": 0, |
There was a problem hiding this comment.
Hm, this will break megalithic LSP? They do not have a way to set no reserve.
| }; | ||
|
|
||
| // Main loop: process inbox events | ||
| loop { |
There was a problem hiding this comment.
Is this not duplicating the entire FSM logic, just because we enter the system through recovery, rather than kicking off a new session? We could just call into the dispatch of events here, and everything else would be the same, or am I missing something?
| ) -> Result<(String, String)> { | ||
| (**self) | ||
| .fund_channel(peer_id, channel_capacity_msat, opening_fee_params) | ||
| .fund_channel(peer_id, channel_capacity_msat, opening_fee_params, scid) |
There was a problem hiding this comment.
Are there "let's define ALL the operations on a variant of the original behavior" useful? I am failing to see how defining the operations on Arc<T> and then just having them forward to T could be useful 🤔
|
The CI failures appear to NOT be flaky tests, as I am seeing a lot of the new tests failing on us. |
Important
26.04 FREEZE March 11th: Non-bugfix PRs not ready by this date will wait for 26.06.
RC1 is scheduled on March 23rd
The final release is scheduled for April 15th.
Checklist
Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:
tools/lightning-downgradeIntroduces a state-machine-based approach to managing LSPS2 JIT channel sessions, replacing the previous ad-hoc state tracking with a structured FSM that tracks payment collection from initial channel open through HTLC forwarding to completion.