
LSP-Plugin: Add MPP support#8948

Open
nepet wants to merge 11 commits into ElementsProject:master from nepet:plugins/lsps2/mpp-fsm

Conversation


@nepet (Member) commented Mar 18, 2026

Important

26.04 FREEZE March 11th: Non-bugfix PRs not ready by this date will wait for 26.06.

RC1 is scheduled on March 23rd

The final release is scheduled for April 15th.

Checklist

Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:

  • The changelog has been updated in the relevant commit(s) according to the guidelines.
  • Tests have been added or modified to reflect the changes.
  • Documentation has been reviewed and updated as needed.
  • Related issues have been listed and linked, including any that this PR closes.
  • Important: All PRs must consider how to reverse any persistent changes for tools/lightning-downgrade

Introduces a state-machine-based approach to managing LSPS2 JIT channel sessions, replacing the previous ad-hoc state tracking with a structured FSM that tracks payment collection from initial channel open through HTLC forwarding to completion.
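A state machine like the one described might be sketched as follows. This is a minimal illustration only: the state names, input names, and the fixed `expected_msat` value are assumptions, not the PR's actual types.

```rust
// Hypothetical FSM sketch for a JIT channel session. The real states
// and inputs in the PR may differ; expected_msat is fixed here purely
// for illustration.
#[derive(Debug, Clone, PartialEq)]
enum SessionState {
    AwaitingFirstHtlc,
    Collecting { received_msat: u64, expected_msat: u64 },
    Broadcasting,
    Forwarding,
    Done,
}

#[derive(Debug)]
enum SessionInput {
    HtlcAdded { amount_msat: u64 },
    ChannelBroadcast,
    HtlcForwarded,
}

fn apply(state: SessionState, input: SessionInput) -> SessionState {
    match (state, input) {
        // First HTLC of a multi-part payment starts collection.
        (SessionState::AwaitingFirstHtlc, SessionInput::HtlcAdded { amount_msat }) => {
            SessionState::Collecting { received_msat: amount_msat, expected_msat: 100_000 }
        }
        // Further parts accumulate until the expected total is reached.
        (SessionState::Collecting { received_msat, expected_msat }, SessionInput::HtlcAdded { amount_msat }) => {
            let total = received_msat + amount_msat;
            if total >= expected_msat {
                SessionState::Broadcasting
            } else {
                SessionState::Collecting { received_msat: total, expected_msat }
            }
        }
        // Once the funding tx is broadcast, HTLCs can be forwarded.
        (SessionState::Broadcasting, SessionInput::ChannelBroadcast) => SessionState::Forwarding,
        (SessionState::Forwarding, SessionInput::HtlcForwarded) => SessionState::Done,
        // Inputs that don't apply in the current state are ignored.
        (s, _) => s,
    }
}
```

Keeping transitions as a pure function over `(state, input)` is what makes the FSM testable without any I/O.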

@nepet force-pushed the plugins/lsps2/mpp-fsm branch from dc5b671 to 44f666d on March 18, 2026 at 23:33
@nepet requested a review from cdecker on March 18, 2026 at 23:36
@nepet force-pushed the plugins/lsps2/mpp-fsm branch from 44f666d to df7c132 on March 19, 2026 at 10:33
@madelinevibes added this to the v26.04 milestone on Mar 20, 2026
nepet added 11 commits March 22, 2026 10:47
Introduce a state-machine-based approach to managing LSPS2 JIT channel
sessions. The FSM tracks payment collection from initial channel open
through HTLC forwarding to completion, replacing the previous ad-hoc
state tracking.

Also adds Sum trait for Msat and PartialEq for protocol types needed
by the FSM.

Changelog-Experimental: LSPS2 session state machine for JIT channels

Introduce a session actor that runs the FSM in an async task and
communicates side effects through an ActionExecutor trait. This
separates state machine logic from I/O concerns like RPC calls and
datastore writes.

Add SessionManager that routes incoming HTLCs to the correct session
actor by payment hash, replacing the previous handler-based approach.

Reworks the policy plugin API and integrates the CLN RPC executor,
unifies HTLC handling into the session FSM, and removes the now
deprecated handler.rs.

Add integration tests covering the full session lifecycle: channel
opening, HTLC forwarding, payment collection, and session completion.

Implement crash recovery for LSPS2 sessions so that in-progress JIT
channel sessions survive plugin restarts.

Adds recovery traits and datastore methods, a RecoveryProvider
implementation for ClnApiRpc, forward monitoring for recovered
sessions, and integration tests for recovery scenarios. Makes
broadcast_tx and abandon_session idempotent to handle replayed
actions safely.

Reduce DatastoreProvider from many methods to 5, with the actor
owning the DatastoreEntry and driving all writes through the actor
loop. This makes the datastore boundary simpler and more testable.

Add an EventSink trait that decouples session event reporting from the
transport layer. Includes a composite sink and a channel-based
implementation. Wires EventSink through SessionActor and
SessionManager, and persists payment_hash in DatastoreEntry.

Replace CLN-specific types (cln_rpc PublicKey, ShortChannelId alias)
with standalone alternatives, feature-gate CLN dependencies behind a
"cln" feature flag, split ClnApiRpc into focused adapter structs, and
refactor Lsps2ServiceHandler generics for cleaner trait boundaries.

This makes the lsps2 core reusable outside of CLN.

Merge BlockheightProvider into Lsps2PolicyProvider, extract
check_cltv_timeout helper in the session FSM, flatten recovery
branching in SessionManager, simplify the actor loop with
convert_input and tokio::select!, and remove the unused
CollectTimeout ActorInput variant.

We actually only use this in tests.

Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>

After restart, recovered session actors were stored in a separate
recovery_handles Vec, unreachable by the forward_event notification
path that routes via the sessions HashMap. This caused intermittent
CI failures where on_payment_settled could not find the session and
the internal forward-monitoring loop failed to detect settlement.

Register recovered sessions in the sessions HashMap keyed by
payment_hash so forward_event notifications reach them directly.
For already-settled forwards, recover into Broadcasting state so
the actor self-drives to completion without needing forward_event
re-delivery. Remove the now-redundant internal polling loop
(get_forward_activity + wait_for_forward_resolution).
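The fix in that last commit — keying recovered sessions in the same map the notification path consults — can be sketched like this. `SessionManager`, `PaymentHash`, and the inbox type are illustrative stand-ins, not the PR's actual API:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the PR's actual types.
type PaymentHash = [u8; 32];
type ActorInbox = Vec<String>; // stand-in for a session actor's inbox

#[derive(Default)]
struct SessionManager {
    // One map for both fresh and recovered sessions, so forward_event
    // notifications can always find the right actor.
    sessions: HashMap<PaymentHash, ActorInbox>,
}

impl SessionManager {
    fn register(&mut self, payment_hash: PaymentHash) {
        self.sessions.entry(payment_hash).or_default();
    }

    // Recovered sessions go into the SAME map, rather than a separate
    // recovery_handles Vec the notification path never checks.
    fn recover(&mut self, payment_hash: PaymentHash) {
        self.register(payment_hash);
    }

    fn on_forward_event(&mut self, payment_hash: PaymentHash, event: &str) -> bool {
        match self.sessions.get_mut(&payment_hash) {
            Some(inbox) => {
                inbox.push(event.to_string());
                true
            }
            None => false, // unknown payment: not ours to handle
        }
    }
}
```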

cdecker (Member) commented Mar 23, 2026

Cache cleared, and restarting the CI while I review the code 👍


cdecker (Member) commented Mar 23, 2026

Nice PR, maybe a bit on the long side, and a bit of duplication, but the architecture is nice.

ACK

#[derive(Debug, Clone, PartialEq, Eq, thiserror::Error)]
pub enum Error {
    #[error("variable amount payments are not supported")]
    UnimplementedVarAmount,

Interesting, I didn't know that LSPS2 has a varamount mode. How is the signalling handled? I.e., how does the LSP learn the total amount, and the fee amount it is allowed to retain?

)]
InsufficientDeductibleCapacity {
    opening_fee_msat: u64,
    deductible_capacity_msat: u128,

Why the different types here?


#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PaymentPart {
    pub htlc_id: u64,

Just a quick doubt I had: this refers to which ID? As far as I remember the numbering of HTLCs was using a composite (channel_id, htlc_id) key because the protocol insists on numbering HTLCs, making either an alias necessary or the composite key to make them unique.

This is the DB HTLC ID that counts up monotonically, not the protocol ID which is per-channel, correct?

}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SessionEvent {

That's quite the extensive list of potential errors and outcomes, nice listing 👍

}
}

pub fn apply(&mut self, input: SessionInput) -> Result<ApplyResult> {

You'll probably want to break this up into a dispatch method (containing the match (field_a, field_b, input)) and several handler methods that take the involved parts and pieces of information as explicit parameters. That'll help get a quick overview, and allow diving deep on specific operations if interested.
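The suggested split could look roughly like this toy example (the states, inputs, and threshold are made up for illustration):

```rust
// Toy states and inputs, purely illustrative.
#[derive(Debug, PartialEq)]
enum State {
    Collecting(u64),
    Done,
}

enum Input {
    Htlc(u64),
    Timeout,
}

// Dispatch: only the (state, input) routing lives here, giving a
// quick overview of the whole transition matrix.
fn apply(state: State, input: Input) -> State {
    match (state, input) {
        (State::Collecting(got), Input::Htlc(amt)) => on_htlc(got, amt),
        (State::Collecting(_), Input::Timeout) => on_timeout(),
        (s, _) => s,
    }
}

// Handlers: the actual logic, each small enough to read in isolation.
fn on_htlc(got: u64, amt: u64) -> State {
    if got + amt >= 100 {
        State::Done
    } else {
        State::Collecting(got + amt)
    }
}

fn on_timeout() -> State {
    State::Done // a real FSM would fail the pending parts here
}
```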

    scid: ShortChannelId,
    datastore: D,
) -> ActorInboxHandle {
    let (tx, inbox) = mpsc::channel(128); // Should we use max_htlcs?

The backlog is mostly intended for bursty behavior, and should be set to the maximum number of events in flight. If there is no more room, we will block the sending side; no messages will be lost. If the sender does not need to make progress, and the expectation is that we can process messages in the order they arrive (no interleaving), it should be safe to set the backlog to 1. Don't think about elements queued up; rather, consider whether you need to make progress on the sending side at all while elements are being processed.
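The backpressure point can be demonstrated with std's bounded channel (the PR uses tokio's `mpsc::channel`, whose bounded semantics are analogous): with a small backlog the sender blocks while the buffer is full, but nothing is ever dropped.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// A bounded channel with backlog 1: the sender blocks whenever the
// buffer is full, so the receiver paces the sender and no message
// is lost.
fn demo() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(1);
    let sender = thread::spawn(move || {
        for i in 0..4 {
            tx.send(i).unwrap(); // blocks until the receiver makes room
        }
        // tx is dropped here, which ends the receive loop below
    });
    let mut received = Vec::new();
    while let Ok(v) = rx.recv() {
        received.push(v);
    }
    sender.join().unwrap();
    received
}
```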

}
}

fn execute_action(&mut self, action: SessionAction) {

Love the match () {} matrix, less so the deep nesting, as it pulls the matrix apart and makes reasoning about its relations and transitions harder :-)

return Ok(serde_json::json!({
    "result": "continue",
    "mindepth": 0,
    "reserve": 0,

Hm, this will break megalithic LSP? They do not have a way to set no reserve.

};

// Main loop: process inbox events
loop {

Is this not duplicating the entire FSM logic, just because we enter the system through recovery, rather than kicking off a new session? We could just call into the dispatch of events here, and everything else would be the same, or am I missing something?

) -> Result<(String, String)> {
(**self)
.fund_channel(peer_id, channel_capacity_msat, opening_fee_params)
.fund_channel(peer_id, channel_capacity_msat, opening_fee_params, scid)

Are these "let's define ALL the operations on a variant of the original behavior" impls useful? I am failing to see how defining the operations on Arc<T> and then just having them forward to T could be useful 🤔

@cdecker
Copy link
Member

cdecker (Member) commented Mar 23, 2026

The CI failures appear to NOT be flaky tests, as I am seeing a lot of the new tests failing on us.
