Skip to content

Fix stale RxDone race condition in sx12xx driver#2

Open
ringlej wants to merge 1 commit intogridpoint-4.3.0from
bug/sc-220449/memory-allocation-failures-for-lora-packets
Open

Fix stale RxDone race condition in sx12xx driver#2
ringlej wants to merge 1 commit intogridpoint-4.3.0from
bug/sc-220449/memory-allocation-failures-for-lora-packets

Conversation

@ringlej
Copy link
Copy Markdown
Collaborator

@ringlej ringlej commented Mar 25, 2026

Summary

  • Clear async_rx_cb when cancelling async reception in sx12xx_lora_recv_async(), preventing stale DIO1 work items from dispatching the callback after RX→TX transition
  • Add LOG_WRN detection for stale RxDone events to validate the fix in the field

Root Cause

sx12xx_lora_recv_async(dev, NULL) calls modem_release() but does not clear
dev_data.async_rx_cb. When a stale DIO1 work item fires after the radio has been
reconfigured for TX, sx12xx_ev_rx_done() sees the stale callback, calls Radio.Rx(0)
during an active TX, and fires the callback — leaking packet buffers permanently.

Race Condition Sequence

sequenceDiagram
    participant HW as Radio HW
    participant ISR as DIO1 ISR
    participant WQ as sysworkq
    participant TX as TX thread
    participant RX as RX thread

    Note over HW: Radio in async RX<br/>on network channel

    HW->>ISR: Packet received (RxDone)<br/>IRQ_RX_DONE latched in HW register
    ISR->>ISR: irq_disable()
    ISR->>WQ: k_work_submit(dio1_work)<br/>queued, not yet run

    Note over TX: TX thread needs to send<br/>join request on join channel

    TX->>TX: take radio lock

    TX->>HW: lora_recv_async(NULL)<br/>modem_release() → Radio.Sleep()
    Note over HW: IRQ_RX_DONE NOT cleared<br/>async_rx_cb NOT cleared

    TX->>HW: lora_config(TX, join_freq)
    TX->>HW: Radio.Send(data, len)
    Note over HW: TX active on join channel

    TX-->>TX: k_poll() — YIELDS waiting for TX_DONE

    Note over WQ: sysworkq scheduled<br/>(higher priority)

    WQ->>WQ: RadioOnDioIrq()<br/>IrqFired = true
    WQ->>HW: RadioIrqProcess()<br/>SX126xGetIrqStatus()
    HW-->>WQ: IRQ_RX_DONE (stale!)
    WQ->>HW: SX126xClearIrqStatus()

    WQ->>WQ: sx12xx_ev_rx_done()
    Note over WQ: async_rx_cb still set!

    WQ->>HW: Radio.Rx(0)
    Note over HW: RX started during active TX<br/>RADIO STATE CORRUPTED

    WQ->>WQ: async_rx_cb() fires
    WQ->>WQ: packet_buf_alloc() → buf[0]
    WQ->>RX: k_fifo_put() → buf[0] in FIFO

    WQ->>HW: irq_enable()
    Note over HW: Radio now receiving.<br/>Picks up noise/packets.

    loop Every ~1 second
        HW->>ISR: DIO1 interrupt (new packet/noise)
        ISR->>WQ: k_work_submit()
        WQ->>WQ: sx12xx_ev_rx_done() → async_rx_cb()
        WQ->>WQ: packet_buf_alloc() → buf[N] leaked
    end

    TX-->>TX: k_poll returns (timeout or TX_DONE)
    TX->>TX: release radio lock

    Note over TX,RX: 4 buffers permanently leaked<br/>FIFO delta = 3<br/>Gateway OOM — no recovery
Loading

Fix

Clear async_rx_cb before modem_release():

if (cb == NULL) {
    dev_data.async_rx_cb = NULL;  // ← prevents stale callback dispatch
    if (!modem_release(&dev_data)) {
        return -EINVAL;
    }
    return 0;
}

Evidence from logs

  • FIFO delta = 3: 3 buffers put into RX FIFO, never gotten
  • Buffer hold times ~1s apart: consistent with Radio.Rx(0) picking up successive packets/noise
  • Alloc timestamps correlate with join request TX window: race occurs during RX→TX channel switch

[sc-220449]

@ringlej ringlej self-assigned this Mar 25, 2026
@ringlej ringlej requested a review from abelino March 25, 2026 20:51
When lora_recv_async() is called with cb=NULL to cancel ongoing
reception, modem_release() is called but async_rx_cb is left pointing
to the previous callback. If a stale DIO1 work item fires after the
radio has been reconfigured for TX, sx12xx_ev_rx_done() checks
async_rx_cb (still set), calls Radio.Rx(0) during an active TX, and
fires the stale callback. This corrupts the radio state and leaks
packet buffers allocated by the callback.

Clear async_rx_cb before calling modem_release() so that any stale
DIO1 work item that fires after cancellation takes the synchronous
path in sx12xx_ev_rx_done(), which properly checks modem_usage and
bails out.

Signed-off-by: Jon Ringle <jringle@gridpoint.com>
@ringlej ringlej force-pushed the bug/sc-220449/memory-allocation-failures-for-lora-packets branch from 0bb1103 to a422e9e Compare March 26, 2026 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants