
fix(security): harden hbbs/hbbr against unauthenticated abuse and state exhaustion#642

Open
ssh4net wants to merge 3 commits into rustdesk:master from ssh4net:Security

Conversation


@ssh4net ssh4net commented Apr 13, 2026

Summary

This PR hardens rustdesk-server against malicious clients and generic internet abuse.

The main goal is to treat clients as a potential attack vector in their own right, rather than treating the server purely as a trust root for clients. The changes focus on:

  • enforcing the configured deployment key on more rendezvous control-plane paths
  • reducing relay and rendezvous abuse as generic internet-facing services
  • bounding in-memory and persistent state growth
  • improving operator visibility into protection behavior
  • adding process-level regression tests for the hardened paths

This PR also includes the required shared protocol update in hbb_common so client and server use the same keyed request fields.

Why

Before this change, several server paths were still too permissive:

  • keyed hbbs deployments enforced the server key for punch-hole requests, but not for all registration or presence paths
  • hbbr could be abused to accumulate large numbers of half-open relay requests
  • websocket listeners trusted forwarded IP headers from arbitrary clients
  • punch request tracking and peer state could grow without strong bounds
  • persistent peer growth had no hard server-side cap
  • the server had limited operator-facing visibility into rate-limit and pruning events

That made the server vulnerable to:

  • unauthorized peer registration on keyed deployments
  • unauthenticated presence enumeration
  • relay/state exhaustion by malicious clients
  • weaker IP-based anti-abuse controls due to spoofed proxy headers
  • long-term database growth from registration churn

What changed

1. Keyed auth on hbbs control-plane requests

Vulnerability:

  • on keyed/private deployments, some hbbs requests still worked without the configured server key

Fix:

  • require the configured server key for:
    • RegisterPeer
    • RegisterPk
    • OnlineRequest
    • TestNatRequest
  • keep existing key enforcement for punch-hole / relay-related flows
  • return explicit LICENSE_MISMATCH for RegisterPk where the client benefits from a clear failure reason

Implementation notes:

  • added licence_key to the relevant rendezvous protocol messages
  • updated both client and server handling to use the same shared protocol fields
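
The enforcement rule can be sketched as follows. This is a hypothetical illustration of the gate described above; `key_matches` and `handle_register_pk` are illustrative names, not the actual rustdesk-server symbols.

```rust
// Illustrative sketch of the keyed-auth gate; function names are
// hypothetical, not the exact rustdesk-server code.
fn key_matches(configured_key: &str, request_key: &str) -> bool {
    // An empty configured key means the deployment runs unkeyed.
    configured_key.is_empty() || configured_key == request_key
}

fn handle_register_pk(configured_key: &str, licence_key: &str) -> &'static str {
    if !key_matches(configured_key, licence_key) {
        // Explicit failure reason so the client can surface it to the user.
        return "LICENSE_MISMATCH";
    }
    "OK"
}

fn main() {
    assert_eq!(handle_register_pk("secret", "secret"), "OK");
    assert_eq!(handle_register_pk("secret", "wrong"), "LICENSE_MISMATCH");
    assert_eq!(handle_register_pk("", "anything"), "OK"); // unkeyed deployment
}
```

The same check gates RegisterPeer, OnlineRequest, and TestNatRequest; only RegisterPk returns the explicit LICENSE_MISMATCH reason on the wire.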

2. Relay half-open abuse caps in hbbr

Vulnerability:

  • attackers could create large numbers of pending relay UUIDs and hold server state for 30 seconds at a time

Fix:

  • cap total pending relays
  • cap pending relays per source IP
  • prune expired pending relay entries before admitting new ones
  • refuse overload explicitly instead of allowing silent hangs
  • return an explicit Key mismatch refusal on relay auth failure
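
A minimal model of the admission logic, assuming illustrative limits and names (the PR's actual constants and struct layout may differ):

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};
use std::time::{Duration, Instant};

// Hypothetical model of pending-relay admission: prune expired entries,
// then enforce a global cap and a per-source-IP cap.
struct PendingRelays {
    entries: HashMap<String, (IpAddr, Instant)>,
    max_total: usize,
    max_per_ip: usize,
    hold: Duration, // how long a half-open relay may be held (~30 s in hbbr)
}

impl PendingRelays {
    fn try_admit(&mut self, uuid: &str, ip: IpAddr, now: Instant) -> Result<(), &'static str> {
        // Prune expired pending entries before admitting new ones.
        self.entries
            .retain(|_, (_, t)| now.duration_since(*t) < self.hold);
        if self.entries.len() >= self.max_total {
            return Err("pending relay limit reached");
        }
        let per_ip = self.entries.values().filter(|(e, _)| *e == ip).count();
        if per_ip >= self.max_per_ip {
            return Err("per-IP pending relay limit reached");
        }
        self.entries.insert(uuid.to_string(), (ip, now));
        Ok(())
    }
}

fn main() {
    let ip = IpAddr::V4(Ipv4Addr::new(203, 0, 113, 1));
    let mut p = PendingRelays {
        entries: HashMap::new(),
        max_total: 100,
        max_per_ip: 2,
        hold: Duration::from_secs(30),
    };
    let now = Instant::now();
    assert!(p.try_admit("a", ip, now).is_ok());
    assert!(p.try_admit("b", ip, now).is_ok());
    // A third half-open relay from the same IP is refused explicitly.
    assert!(p.try_admit("c", ip, now).is_err());
    // Once the hold window elapses, expired entries are pruned away.
    assert!(p.try_admit("c", ip, now + Duration::from_secs(31)).is_ok());
}
```

Refusing with an explicit reason, instead of silently holding the socket, lets a well-behaved client fail fast while bounding what an attacker can pin.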

3. Do not trust forwarded IP headers by default

Vulnerability:

  • websocket listeners accepted X-Real-IP / X-Forwarded-For from arbitrary clients, weakening rate limits, logging, and abuse controls

Fix:

  • only honor forwarded proxy headers when TRUST_PROXY_HEADERS is explicitly enabled
  • otherwise use the real socket peer address
  • normalize X-Forwarded-For handling to the first address only
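
The trust rule can be sketched like this; here `trust_proxy_headers` stands in for the TRUST_PROXY_HEADERS environment check and is passed as a parameter for clarity, so this is a simplified model rather than the PR's exact signature:

```rust
use std::net::{IpAddr, SocketAddr};

// Sketch of the header-trust rule: forwarded headers are ignored unless the
// operator explicitly trusts the fronting proxy.
fn effective_addr(socket_addr: SocketAddr, xff: Option<&str>, trust_proxy_headers: bool) -> SocketAddr {
    if !trust_proxy_headers {
        // Default: ignore client-supplied headers entirely.
        return socket_addr;
    }
    // Only the first X-Forwarded-For entry is honored.
    let forwarded = xff
        .and_then(|v| v.split(',').next())
        .and_then(|s| s.trim().parse::<IpAddr>().ok());
    match forwarded {
        Some(ip) => SocketAddr::new(ip, socket_addr.port()),
        None => socket_addr,
    }
}

fn main() {
    let peer: SocketAddr = "10.0.0.5:4455".parse().unwrap();
    let hdr = Some("198.51.100.7, 10.0.0.5");
    // Headers ignored by default; honored (first hop only) when trusted.
    assert_eq!(effective_addr(peer, hdr, false), peer);
    assert_eq!(
        effective_addr(peer, hdr, true),
        "198.51.100.7:4455".parse().unwrap()
    );
}
```

Keeping the original socket port as a per-connection discriminator (rather than zeroing it) avoids collapsing all clients behind one proxy IP into a single synthetic address.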

4. Bound punch-hole request tracking

Vulnerability:

  • punch request tracking could grow indefinitely in memory

Fix:

  • prune old entries by retention window
  • cap the total retained punch request entries
  • apply pruning both on insert and inspection
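
A pure-logic sketch of the bounded tracker, with illustrative retention and cap values (not the PR's actual defaults):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Illustrative bounded tracker: prune by retention window, then enforce a
// hard cap on retained entries. Oldest entries sit at the front.
struct PunchTracker {
    entries: VecDeque<(String, Instant)>,
    retention: Duration,
    max_entries: usize,
}

impl PunchTracker {
    fn prune(&mut self, now: Instant) {
        // Drop entries older than the retention window.
        while let Some((_, t)) = self.entries.front() {
            if now.duration_since(*t) > self.retention {
                self.entries.pop_front();
            } else {
                break;
            }
        }
        // Enforce the hard cap even if nothing has expired yet.
        while self.entries.len() > self.max_entries {
            self.entries.pop_front();
        }
    }

    fn record(&mut self, id: &str, now: Instant) {
        self.entries.push_back((id.to_string(), now));
        self.prune(now); // prune on insert as well as on inspection
    }
}

fn main() {
    let mut t = PunchTracker {
        entries: VecDeque::new(),
        retention: Duration::from_secs(60),
        max_entries: 3,
    };
    let now = Instant::now();
    for i in 0..10 {
        t.record(&format!("peer{}", i), now);
    }
    assert_eq!(t.entries.len(), 3); // capped regardless of age
}
```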

5. Registration abuse reduction

Vulnerability:

  • repeated registration churn could consume server resources cheaply

Fix:

  • validate peer registration IDs before touching peer state
  • apply rate limiting to RegisterPeer in addition to existing RegisterPk protections
  • reject obviously invalid IDs early
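
The early-rejection check amounts to a cheap syntactic filter before any peer state is touched. The exact character set and length bounds used by the PR may differ from this hypothetical sketch:

```rust
// Hypothetical early-rejection check for registration IDs; the real bounds
// and allowed characters in the PR may differ.
fn is_plausible_peer_id(id: &str) -> bool {
    let len_ok = (6..=60).contains(&id.len());
    let chars_ok = id
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    len_ok && chars_ok
}

fn main() {
    assert!(is_plausible_peer_id("workstation_01"));
    assert!(!is_plausible_peer_id("")); // rejected before touching peer state
    assert!(!is_plausible_peer_id("bad id\n")); // whitespace/control characters
}
```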

6. In-memory peer cache growth controls

Vulnerability:

  • pending and cached peer state could grow without strong bounds

Fix:

  • add a global in-memory peer cache cap
  • add per-IP pending registration limits
  • evict cached peers with preference for inactive / older entries
  • keep registration placeholder growth bounded
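
The eviction preference (inactive first, then oldest) can be modeled as a sort over candidate entries. The struct below mirrors the idea of the PR's PeerEvictionEntry but not its actual fields:

```rust
use std::time::{Duration, Instant};

// Sketch of the eviction preference: inactive peers first, then oldest.
// Field names are illustrative, not the PR's PeerEvictionEntry layout.
struct EvictionCandidate {
    id: String,
    last_seen: Instant,
    inactive: bool,
}

fn select_ids_to_evict(mut cands: Vec<EvictionCandidate>, excess: usize) -> Vec<String> {
    // Inactive entries sort before active ones; within each group, older first.
    cands.sort_by_key(|c| (!c.inactive, c.last_seen));
    cands.into_iter().take(excess).map(|c| c.id).collect()
}

fn main() {
    let base = Instant::now();
    let now = base + Duration::from_secs(1000);
    let cands = vec![
        EvictionCandidate { id: "active_old".into(), last_seen: now - Duration::from_secs(300), inactive: false },
        EvictionCandidate { id: "idle_new".into(), last_seen: now - Duration::from_secs(10), inactive: true },
        EvictionCandidate { id: "idle_old".into(), last_seen: now - Duration::from_secs(600), inactive: true },
    ];
    // With one slot to free, the oldest inactive peer goes first.
    assert_eq!(select_ids_to_evict(cands, 1), vec!["idle_old".to_string()]);
}
```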

7. Persistent peer record cap and retention pruning

Vulnerability:

  • database-backed peer records could grow indefinitely over time

Fix:

  • add MAX_TOTAL_PEER_RECORDS
  • refuse creation of new peer records when the cap is hit
  • before refusing, prune stale records using PEER_RECORD_RETENTION_DAYS
  • continue allowing updates to existing records

Additional fix:

  • RegisterPk now returns the real backend result instead of always reporting OK
  • this was important for surfacing PEER_LIMIT_REACHED correctly on the wire
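
The prune-before-reject flow reduces to the following pure-logic model; the HashMap stands in for the SQLite peer table and day-granularity timestamps keep the arithmetic simple, so this is a sketch of the policy rather than the real schema:

```rust
use std::collections::HashMap;

// Pure-logic model of the persistent cap: updates always allowed; at the cap,
// prune stale records first, then re-check before refusing.
#[derive(Debug, PartialEq)]
enum InsertPeerResult {
    Inserted,
    PeerLimitReached,
}

fn insert_peer(
    store: &mut HashMap<String, u64>, // id -> last_seen (day number)
    id: &str,
    today: u64,
    max_total: usize,      // models MAX_TOTAL_PEER_RECORDS
    retention_days: u64,   // models PEER_RECORD_RETENTION_DAYS
) -> InsertPeerResult {
    if store.contains_key(id) {
        // Updates to existing records are always allowed.
        store.insert(id.to_string(), today);
        return InsertPeerResult::Inserted;
    }
    if store.len() >= max_total && retention_days > 0 {
        // Prune stale records before refusing a new one.
        store.retain(|_, last| today.saturating_sub(*last) <= retention_days);
    }
    if store.len() >= max_total {
        return InsertPeerResult::PeerLimitReached;
    }
    store.insert(id.to_string(), today);
    InsertPeerResult::Inserted
}

fn main() {
    let mut store = HashMap::new();
    store.insert("stale".to_string(), 0);
    store.insert("fresh".to_string(), 99);
    // Cap of 2 is hit, but pruning the 100-day-old record frees a slot.
    assert_eq!(insert_peer(&mut store, "new", 100, 2, 30), InsertPeerResult::Inserted);
    // Cap hit again with only fresh records left: refuse explicitly.
    assert_eq!(insert_peer(&mut store, "new2", 100, 2, 30), InsertPeerResult::PeerLimitReached);
}
```

In the real database this sequence must run inside one transaction to stay atomic under concurrent registrations, which the review below also flags.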

8. Listener-level rate limiting

Vulnerability:

  • abusive clients could repeatedly hit TCP and UDP listeners before deeper protocol checks

Fix:

  • add per-IP fixed-window limits for:
    • hbbs main TCP listener
    • hbbs NAT/test listener
    • hbbs websocket listener
    • hbbs UDP receive loop
    • hbbr TCP listener
    • hbbr websocket listener
  • keep loopback exempt for local admin usage
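
A minimal fixed-window per-IP limiter with the loopback exemption looks roughly like this; window and limit values are illustrative, not the configured defaults:

```rust
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};
use std::time::{Duration, Instant};

// Minimal fixed-window per-IP limiter with a loopback exemption, mirroring
// the listener-level gate described above; parameters are illustrative.
struct FixedWindowLimiter {
    window: Duration,
    limit: u32,
    buckets: HashMap<IpAddr, (Instant, u32)>, // ip -> (window start, count)
}

impl FixedWindowLimiter {
    fn allow(&mut self, ip: IpAddr, now: Instant) -> bool {
        if ip.is_loopback() {
            return true; // local admin traffic is never throttled
        }
        let bucket = self.buckets.entry(ip).or_insert((now, 0));
        if now.duration_since(bucket.0) >= self.window {
            *bucket = (now, 0); // start a new window
        }
        bucket.1 += 1;
        bucket.1 <= self.limit
    }
}

fn main() {
    let mut l = FixedWindowLimiter {
        window: Duration::from_secs(10),
        limit: 3,
        buckets: HashMap::new(),
    };
    let remote = IpAddr::V4(Ipv4Addr::new(203, 0, 113, 9));
    let local = IpAddr::V4(Ipv4Addr::LOCALHOST);
    let now = Instant::now();
    assert!((0..3).all(|_| l.allow(remote, now)));
    assert!(!l.allow(remote, now)); // fourth hit in the window is refused
    assert!((0..10).all(|_| l.allow(local, now))); // loopback exempt
    assert!(l.allow(remote, now + Duration::from_secs(10))); // new window
}
```

One such map per listener keeps the TCP, UDP, and websocket budgets independent of each other.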

9. Operator-facing protection stats

Problem:

  • protections existed, but operator visibility was poor

Fix:

  • add protection-stats / ps admin output
  • report:
    • protection limit configuration
    • connection rate-limit hits
    • UDP rate-limit hits
    • punch-request pruning
    • relay pending rejections
    • peer pruning / peer limit events
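
The stats mechanism amounts to named monotonic counters that the admin command dumps. A self-contained sketch of that idea (the function names echo the description above but are not the PR's exact API):

```rust
use std::collections::BTreeMap;
use std::sync::{Mutex, OnceLock};

// Sketch of the operator-stats idea: named monotonic counters that the
// protection-stats / ps admin command can report. Names are illustrative.
static PROTECTION_EVENTS: OnceLock<Mutex<BTreeMap<&'static str, u64>>> = OnceLock::new();

fn record_protection_event(name: &'static str) {
    let map = PROTECTION_EVENTS.get_or_init(|| Mutex::new(BTreeMap::new()));
    *map.lock().unwrap().entry(name).or_insert(0) += 1;
}

fn protection_stats_report() -> String {
    let map = PROTECTION_EVENTS.get_or_init(|| Mutex::new(BTreeMap::new()));
    map.lock()
        .unwrap()
        .iter()
        .map(|(k, v)| format!("{}: {}", k, v))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    record_protection_event("conn_rate_limited");
    record_protection_event("conn_rate_limited");
    record_protection_event("relay_pending_rejected");
    let report = protection_stats_report();
    assert!(report.contains("conn_rate_limited: 2"));
    assert!(report.contains("relay_pending_rejected: 1"));
}
```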

Tests

Added process-level regression coverage in tests/server_protection_process.rs for:

  • hbbs admin ps
  • hbbr admin ps
  • hbbs OnlineRequest keyed auth enforcement
  • hbbs RegisterPk keyed auth enforcement
  • hbbr relay key mismatch refusal
  • hbbs persistent peer cap behavior
  • hbbs stale peer retention pruning

Validated with:

cargo test --test server_protection_process -- --test-threads=1

Security impact

This PR does not try to solve every operational abuse scenario, but it materially improves the default security posture for self-hosted RustDesk deployments:

  • private/keyed deployments now behave more consistently as private services
  • the server is harder to abuse as a generic tunnel or reconnaissance target
  • in-memory and persistent resource growth are bounded
  • anti-abuse controls are less vulnerable to trivial source-IP spoofing
  • failure modes are more explicit and test-covered

Notes

This PR depends on the matching shared hbb_common protocol update so client and server use the same keyed rendezvous request fields.

Summary by CodeRabbit

  • New Features

    • Configurable rate limiting for connections and UDP with per‑IP tracking, loopback exemption, and reporting
    • Peer capacity, retention and cache eviction controls to enforce caps and prune stale entries
    • Pending‑relay admission limits, per‑IP pending registration limits, and timed hold/reject behavior with clearer refusal reasons
    • Trusted‑proxy header support and stronger license/key validation across registration and relay flows
    • CLI/admin command to view protection limits and stats
  • Tests

    • Extensive unit and integration tests covering protection, limits, pruning, eviction and auth behavior

Update Cargo.lock with multiple dependency bumps and additions, update the hbb_common submodule, and adjust source code across src/common.rs, src/database.rs, src/peer.rs, src/relay_server.rs and src/rendezvous_server.rs to match the updates. Commit also updates the local SQLite DB (db_v2.sqlite3) and adds a new integration test (tests/server_protection_process.rs) to exercise server protection behavior. These changes combine dependency maintenance with corresponding code and test updates to ensure compatibility and improved coverage.

Signed-off-by: Vlad <shaamaan@gmail.com>

coderabbitai bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f7ea3fa-2970-4e79-8234-9d96b4d923cb

📥 Commits

Reviewing files that changed from the base of the PR and between 8aa43d3 and a7ad238.

📒 Files selected for processing (1)
  • libs/hbb_common
✅ Files skipped from review due to trivial changes (1)
  • libs/hbb_common

📝 Walkthrough

Walkthrough

Adds per-IP connection and UDP rate limiting, trusted-proxy header handling, peer/relay capacity controls with eviction/pruning, license/key validation across rendezvous and relay flows, database peer retention controls, protection stats API, and extensive unit and integration tests.

Changes

  • Rate Limiting & Protection Infrastructure (src/common.rs): Added in-memory per-IP rate limiting for connections and UDP packets, loopback exemption, env-configurable windows/limits, protection event counters and reporting helpers, trusted-proxy header parsing (trust_proxy_headers, apply_trusted_proxy_addr), and unit tests.
  • Database Peer Capacity (src/database.rs): Added config for max/retention (MAX_TOTAL_PEER_RECORDS, PEER_RECORD_RETENTION_DAYS), peer_count and prune_old_peer_records, changed insert_peer to return InsertPeerResult (Inserted/PeerLimitReached), and added prune-before-reject logic with tests.
  • Peer Cache & Registration Controls (src/peer.rs): Added peer-cache size and per-IP pending-registration limits, eviction selection (PeerEvictionEntry, select_peer_ids_to_evict), an inactivity timeout, replaced get_or with get_or_for_registration(id, ip) -> Option<LockPeer> enforcing per-IP pending caps, and updated DB error mapping and tests.
  • Relay Server: Pending Relays & Admission (src/relay_server.rs): Pending-relay entries now store stream, IP, and timestamp; added global/per-IP pending-relay caps and a hold window; admission checks via allow_connection_from_ip, trusted-proxy address handling via apply_trusted_proxy_addr, explicit relay refusals on key mismatch, pruning helpers, an admin protection-stats command, and tests.
  • Rendezvous Server: Validation & Rate Limits (src/rendezvous_server.rs): Added registration-id and server-key validation, integrated get_or_for_registration usage with TOO_FREQUENT mapping, UDP/TCP/WS rate-limit gating, punch-request retention and caps with pruning, CLI protection-stats support, and tests.
  • Integration Tests & Test Helpers (tests/server_protection_process.rs): New integration test module that spawns hbbs/hbbr processes, verifies admin ps output and license-key behaviors, peer-limit/pruning semantics with temporary DBs, and relay refusal reasons, and includes process/port/binary helpers and async framing utilities.
  • Dependency Update (libs/hbb_common): Submodule pointer updated from 83419b65... to 36bf31de....

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant ConnHandler as Connection<br/>Handler
    participant RateLimiter as Rate Limiter<br/>(common.rs)
    participant Cache as Peer Cache
    participant DB as Database

    Client->>ConnHandler: Incoming TCP/UDP request
    ConnHandler->>RateLimiter: allow_connection_from_ip(addr)<br/>or allow_udp_packet_from_ip(addr)
    
    alt Rate limit exceeded
        RateLimiter-->>ConnHandler: false (blocked)
        ConnHandler-->>Client: drop/refuse
    else Rate limit OK
        ConnHandler->>Cache: get_or_for_registration(id, ip)
        
        alt Pending registrations exceed per-IP limit
            Cache-->>ConnHandler: None (TOO_FREQUENT)
            ConnHandler-->>Client: error response
        else Within limits
            Cache->>Cache: prune_cache_for_insert()
            Cache->>DB: insert_peer(id, ...)
            
            alt Peer count at cap and retention enabled
                DB->>DB: prune_old_peer_records()
                DB->>DB: recheck peer count
            end
            
            alt Still at peer capacity
                DB-->>Cache: PeerLimitReached
                ConnHandler-->>Client: PEER_LIMIT_REACHED
            else Capacity available
                DB-->>Cache: Inserted(guid)
                ConnHandler-->>Client: success response
            end
        end
    end
sequenceDiagram
    actor Client
    participant ReqHandler as Relay<br/>RequestHandler
    participant RateLimiter as Rate Limiter<br/>(common.rs)
    participant RelayQueue as Pending Relay Queue
    participant DB as Database

    Client->>ReqHandler: relay_request(stream, key, ip)
    ReqHandler->>RateLimiter: validate license/server key
    
    alt License key mismatch
        ReqHandler->>ReqHandler: send_relay_refuse("Key mismatch")
        ReqHandler-->>Client: refusal
    else License key valid
        ReqHandler->>RelayQueue: prune_expired_pending_relays()
        RelayQueue->>RelayQueue: compute pending_count(global)<br/>and pending_count(per_ip)
        
        alt Global or per-IP cap exceeded
            RelayQueue-->>ReqHandler: limit reason
            ReqHandler->>ReqHandler: send_relay_refuse(reason)
            ReqHandler-->>Client: refusal
        else Capacity available
            RelayQueue->>RelayQueue: insert PendingRelay{stream, ip, created_at}
            ReqHandler->>DB: pair with waiting peer stream
            ReqHandler-->>Client: relay forwarding
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I nibble logs and prune old peers with care,
I count the hops and packets in the air,
I peek at proxies, forward with a wink,
Evict the stale, let lively peers link,
A hop, a guard, a thump — the servers fare!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 45.13%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title 'fix(security): harden hbbs/hbbr against unauthenticated abuse and state exhaustion' accurately and concisely describes the main purpose of the changeset: security hardening against abuse and state-exhaustion attacks.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/database.rs (1)

153-178: ⚠️ Potential issue | 🟠 Major

Make the peer-cap check atomic with the insert.

This is count -> maybe prune -> count -> insert across separate statements, and peer_count() grabs a fresh pooled connection each time. Concurrent registrations, or another server process writing the same SQLite file, can pass both checks and still insert past MAX_TOTAL_PEER_RECORDS, so the new cap is bypassable under load.

Suggested direction
-        if peer_limit_reached(self.peer_count().await?, self.max_total_peers) {
-            if peer_retention_prune_enabled(self.peer_record_retention_days) {
-                let deleted = self.prune_old_peer_records().await?;
-                if deleted > 0 {
-                    crate::common::record_protection_event("peer_records_pruned");
-                    log::info!("pruned {} old peer records before inserting {}", deleted, id);
-                }
-            }
-        }
-        if peer_limit_reached(self.peer_count().await?, self.max_total_peers) {
+        let mut conn = self.pool.get().await?;
+        sqlx::query("BEGIN IMMEDIATE")
+            .execute(conn.deref_mut())
+            .await?;
+
+        let total = sqlx::query!("select count(*) as count from peer")
+            .fetch_one(conn.deref_mut())
+            .await?
+            .count as i64;
+
+        if peer_limit_reached(total, self.max_total_peers) {
+            if peer_retention_prune_enabled(self.peer_record_retention_days) {
+                let cutoff = peer_retention_cutoff_arg(self.peer_record_retention_days);
+                let deleted = sqlx::query("delete from peer where created_at < datetime('now', ?)")
+                    .bind(cutoff)
+                    .execute(conn.deref_mut())
+                    .await?
+                    .rows_affected();
+                if deleted > 0 {
+                    crate::common::record_protection_event("peer_records_pruned");
+                    log::info!("pruned {} old peer records before inserting {}", deleted, id);
+                }
+            }
+        }
+
+        let total = sqlx::query!("select count(*) as count from peer")
+            .fetch_one(conn.deref_mut())
+            .await?
+            .count as i64;
+        if peer_limit_reached(total, self.max_total_peers) {
             crate::common::record_protection_event("peer_limit_reached");
             log::warn!("peer record limit reached, rejecting new peer {}", id);
+            sqlx::query("ROLLBACK").execute(conn.deref_mut()).await?;
             return Ok(InsertPeerResult::PeerLimitReached);
         }
         let guid = uuid::Uuid::new_v4().as_bytes().to_vec();
         sqlx::query!(
             "insert into peer(guid, id, uuid, pk, info) values(?, ?, ?, ?, ?)",
@@
-        .execute(self.pool.get().await?.deref_mut())
+        .execute(conn.deref_mut())
         .await?;
+        sqlx::query("COMMIT").execute(conn.deref_mut()).await?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database.rs` around lines 153 - 178, Wrap the count->prune->count->insert
sequence in a single DB transaction using the same pooled connection so the
peer-cap check is atomic: acquire a connection once (let conn =
self.pool.get().await?), begin a transaction with an immediate/serializable lock
(use sqlx transaction on that conn), then call peer_count (or replace it with an
inline SELECT COUNT(*)) and prune_old_peer_records using that same conn (modify
prune_old_peer_records to accept &mut conn or run its SQL inline inside the
transaction), re-check peer_limit_reached against self.max_total_peers, and only
then perform the insert into peer and commit the transaction; if limit reached,
rollback and return InsertPeerResult::PeerLimitReached. Ensure all queries use
the transaction/conn rather than grabbing fresh connections.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/common.rs`:
- Around line 188-205: The map currently allocates an entry for every new source
IP (via CONNECTION_RATE_LIMITS and the key variable), which allows memory
exhaustion; before inserting a new ConnectionRateEntry, enforce a global cap
(e.g., MAX_CONNECTION_RATE_ENTRIES) by checking lock.len() and pruning stale
entries (reuse prune_connection_rate_limits) and if still at capacity evict the
oldest entry (compare last_seen_at) or simply treat the unseen IP as
rate-limited and return false instead of inserting; update the logic around
lock.entry(key).or_insert(...) to only create the entry when under the cap and
to perform eviction via last_seen_at on CONNECTION_RATE_LIMITS when needed.
- Around line 272-281: apply_trusted_proxy_addr currently rewrites trusted-proxy
peer addresses to SocketAddr::new(ip, 0), collapsing all clients behind the same
public IP into one synthetic address and causing collisions where tcp_punch
(keyed by SocketAddr in src/rendezvous_server.rs) overwrites concurrent
connections. Change apply_trusted_proxy_addr so that when a forwarded_ip is
present it preserves the original connection's port (use addr.port()) instead of
0 (i.e., construct the new SocketAddr from the parsed ip and the original addr's
port), keeping the per-connection discriminator while still using
parse_forwarded_ip and trust_proxy_headers().

In `@src/peer.rs`:
- Around line 214-231: The current flow in get_or_for_registration lets multiple
concurrent callers bypass MAX_PENDING_REGISTRATIONS_PER_IP because
can_cache_pending_registration() inspects state before prune_cache_for_insert()
and before acquiring the write lock; fix by performing the pending-registration
count and pruning while holding the same write lock used to insert the
placeholder: acquire the write lock (the same map.write() used later), call
prune_cache_for_insert() or inline its logic under that lock, recompute the
per-IP pending count, and only insert the new Arc<RwLock<Peer>> if the count is
below MAX_PENDING_REGISTRATIONS_PER_IP; if the cap is reached return None.
Ensure you still short-circuit and return an existing peer when w.get(id) is
present after acquiring the write lock.

In `@src/relay_server.rs`:
- Around line 396-404: The WS listener currently calls allow_connection_from_ip
for "hbbr-ws" using the raw peer addr returned by listener2.accept(), which
rate-limits the reverse proxy rather than the real client when
TRUST_PROXY_HEADERS is enabled; change the logic so that handle_connection (or
the WebSocket handshake in make_pair) performs the allow_connection_from_ip
check using the reconstructed client IP when proxy headers are trusted, rather
than using the listener peer addr. Concretely, remove or bypass the
allow_connection_from_ip call in the listener2.accept() branch and ensure
make_pair or the handshake path invokes allow_connection_from_ip("hbbr-ws",
client_ip) after extracting and validating proxy headers (respecting
TRUST_PROXY_HEADERS) so quotas are applied per real client IP.

In `@src/rendezvous_server.rs`:
- Around line 304-312: The rate-limit check currently performed right after
listener3.accept() using allow_connection_from_ip("hbbs-ws", addr) applies the
proxy IP rather than the real client IP and should be moved or adjusted so it
uses the post-proxy/forwarded client address; update the logic to perform the
allow_connection_from_ip call inside handle_listener/handle_listener_inner after
apply_trusted_proxy_addr() has resolved the real client IP (or, if
TRUST_PROXY_HEADERS is enabled, extract the forwarded address first and pass
that IP to allow_connection_from_ip) so per-client buckets are enforced
correctly instead of rate-limiting the proxy.

In `@tests/server_protection_process.rs`:
- Around line 52-80: reserve_hbbs_port and reserve_hbbr_port race because they
bind sockets, drop them, and return only port numbers; another process can claim
the ports before the child is spawned. Fix by changing these helpers to keep the
actual bound sockets alive until the child process is started: in
reserve_hbbs_port return the bound TcpListener(s) and UdpSocket (or a struct
holding them) instead of u16, and in reserve_hbbr_port return the bound
TcpListener(s); then pass those open descriptors (or rely on inheritable fds)
into the child spawn so the ports remain reserved until the child binds them.
Ensure functions named reserve_hbbs_port and reserve_hbbr_port are updated
accordingly and all call sites are adjusted to accept and hold the returned
listeners/sockets until after child startup.

---

Outside diff comments:
In `@src/database.rs`:
- Around line 153-178: Wrap the count->prune->count->insert sequence in a single
DB transaction using the same pooled connection so the peer-cap check is atomic:
acquire a connection once (let conn = self.pool.get().await?), begin a
transaction with an immediate/serializable lock (use sqlx transaction on that
conn), then call peer_count (or replace it with an inline SELECT COUNT(*)) and
prune_old_peer_records using that same conn (modify prune_old_peer_records to
accept &mut conn or run its SQL inline inside the transaction), re-check
peer_limit_reached against self.max_total_peers, and only then perform the
insert into peer and commit the transaction; if limit reached, rollback and
return InsertPeerResult::PeerLimitReached. Ensure all queries use the
transaction/conn rather than grabbing fresh connections.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aee2f984-90d0-4037-9ee6-eebb7a48e720

📥 Commits

Reviewing files that changed from the base of the PR and between 815c728 and 8da0090.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • db_v2.sqlite3
  • libs/hbb_common
  • src/common.rs
  • src/database.rs
  • src/peer.rs
  • src/relay_server.rs
  • src/rendezvous_server.rs
  • tests/server_protection_process.rs

Introduce MAX_CONNECTION_RATE_ENTRIES with a default and enforce a cap on connection-rate entries: evict the oldest entry when the cap is reached, record eviction/rejection protection events, and expose the cap in protection summaries. Fix forwarded IP port handling and add websocket-forwarded rate-limit checks in relay and rendezvous servers.

Refactor peer caching insertion: compute eviction candidates, remove excess peers before inserting pending registrations, and enforce per-IP pending registration limits without the previous helper. Update select_peer_ids_to_evict usage and minor formatting/await handling.

Revamp test process spawn helpers: replace single-port reservation with retrying pick_*_port_candidate functions, make spawn_hbbs/spawn_hbbr return the chosen port and child, preserve temp dirs on certain errors, and add/adjust tests (including a new test to assert connection-rate entry bounding). Miscellaneous whitespace, logging, and small API-call formatting improvements across files.

Signed-off-by: Vlad <shaamaan@gmail.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
src/peer.rs (3)

271-297: prune_cache_for_insert has a TOCTOU gap between snapshot and removal.

This method takes a read lock to build the snapshot, releases it, then acquires a write lock to remove entries. Between these operations, the map state could change (entries added/removed by concurrent calls). While this is less critical here because:

  1. It's called from get() which is a read path
  2. Worst case is slightly more or fewer evictions than intended

The get_or_for_registration method correctly holds the write lock throughout, which is the critical path for pending registration limits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/peer.rs` around lines 271 - 297, prune_cache_for_insert has a TOCTOU
window: it builds a snapshot under a read lock then releases it and later takes
a write lock to remove peers, allowing concurrent map changes between snapshot
and removal; fix by performing eviction decision while holding the write lock
(i.e., acquire self.map.write().await before computing entries and calling
select_peer_ids_to_evict) or, if you must keep the read snapshot,
re-check/compute entries again after acquiring the write lock and only remove
IDs that still exist and meet eviction criteria; reference
prune_cache_for_insert, select_peer_ids_to_evict, and self.map (and mirror the
write-lock approach used in get_or_for_registration).

227-238: Potential performance concern with async read locks inside write lock scope.

Within the write lock scope (let mut w = self.map.write().await), the code iterates over all peers and calls peer.read().await on each LockPeer. This holds the map write lock while waiting on potentially many individual peer read locks. Under high load with many cached peers, this could cause contention.

Consider whether extracting only the needed fields (ip, guid, last_reg_time) into a snapshot structure before eviction analysis could reduce lock hold time. However, given the security-focused nature of this PR and the existing pattern in prune_cache_for_insert, this may be acceptable for now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/peer.rs` around lines 227 - 238, The current loop holds
self.map.write().await while awaiting peer.read().await for each LockPeer,
causing contention; change to first capture a snapshot of the peer handles
(clone the peer_id and Arc/handle to LockPeer from self.map) while holding the
write lock, then immediately drop the write lock and iterate over that snapshot
performing peer.read().await to extract ip, guid and last_reg_time and build
PeerEvictionEntry and pending_ids_for_ip; this mirrors the behavior in
prune_cache_for_insert and ensures the write lock is not held during per-peer
async reads.

79-89: Consider accepting max_pending=0 as "unlimited" or documenting the behavior.

When max_pending is 0, pending_registration_limit_exceeded always returns false (no limit), which seems intentional. However, env_usize_or filters out 0 values via .filter(|value| *value > 0), so an operator cannot explicitly set MAX_PENDING_REGISTRATIONS_PER_IP=0 to disable the limit. This is fine if 0 means "use default", but may be surprising if an operator wants to disable the feature entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/peer.rs` around lines 79 - 89, The current behavior treats
MAX_PENDING_REGISTRATIONS_PER_IP=0 as "use default" because env_usize_or filters
out zero, while pending_registration_limit_exceeded already treats 0 as
"unlimited"; either make zero explicitly settable or document it. To fix, update
env_usize_or (or add env_usize_or_allow_zero) to stop filtering out zero so a
parsed 0 is returned, and keep pending_registration_limit_exceeded as-is (it
will then correctly treat 0 as unlimited), or alternatively add a comment on
pending_registration_limit_exceeded explaining that max_pending==0 means
unlimited and leave env_usize_or unchanged if you prefer default-on-missing
semantics. Ensure changes reference env_usize_or and
pending_registration_limit_exceeded so reviewers can find the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/peer.rs`:
- Around line 271-297: prune_cache_for_insert has a TOCTOU window: it builds a
snapshot under a read lock then releases it and later takes a write lock to
remove peers, allowing concurrent map changes between snapshot and removal; fix
by performing eviction decision while holding the write lock (i.e., acquire
self.map.write().await before computing entries and calling
select_peer_ids_to_evict) or, if you must keep the read snapshot,
re-check/compute entries again after acquiring the write lock and only remove
IDs that still exist and meet eviction criteria; reference
prune_cache_for_insert, select_peer_ids_to_evict, and self.map (and mirror the
write-lock approach used in get_or_for_registration).
- Around line 227-238: The current loop holds self.map.write().await while
awaiting peer.read().await for each LockPeer, causing contention; change to
first capture a snapshot of the peer handles (clone the peer_id and Arc/handle
to LockPeer from self.map) while holding the write lock, then immediately drop
the write lock and iterate over that snapshot performing peer.read().await to
extract ip, guid and last_reg_time and build PeerEvictionEntry and
pending_ids_for_ip; this mirrors the behavior in prune_cache_for_insert and
ensures the write lock is not held during per-peer async reads.
- Around line 79-89: The current behavior treats
MAX_PENDING_REGISTRATIONS_PER_IP=0 as "use default" because env_usize_or filters
out zero, while pending_registration_limit_exceeded already treats 0 as
"unlimited"; either make zero explicitly settable or document it. To fix, update
env_usize_or (or add env_usize_or_allow_zero) to stop filtering out zero so a
parsed 0 is returned, and keep pending_registration_limit_exceeded as-is (it
will then correctly treat 0 as unlimited), or alternatively add a comment on
pending_registration_limit_exceeded explaining that max_pending==0 means
unlimited and leave env_usize_or unchanged if you prefer default-on-missing
semantics. Ensure changes reference env_usize_or and
pending_registration_limit_exceeded so reviewers can find the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ec488ab-b976-4661-8692-62aaf30bdb07

📥 Commits

Reviewing files that changed from the base of the PR and between 8da0090 and 8aa43d3.

📒 Files selected for processing (5)
  • src/common.rs
  • src/peer.rs
  • src/relay_server.rs
  • src/rendezvous_server.rs
  • tests/server_protection_process.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/common.rs
  • tests/server_protection_process.rs

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
