feat: isolate and track forester worker concurrency #2343
sergeytimoshin wants to merge 8 commits into main from
Conversation
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (61)
📒 Files selected for processing (63)
📝 Walkthrough
This PR introduces comprehensive error-handling improvements, async concurrency refactoring, and type-system enhancements across the forester, prover, and SDK libraries. Key changes include converting panic-prone unwraps to Result-based error propagation, refactoring EpochManager to use Arc for shared task ownership, introducing const generic HEIGHT parameters for proof handling, adding new worker orchestration, and stabilizing proof serialization APIs.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant EpochManager
participant TaskPool as FuturesUnordered<Tasks>
participant TreeWorker as NewTreeWorker
participant ProcessQueue
Client->>EpochManager: process_queue(Arc<Self>, epoch_info, ...)
EpochManager->>EpochManager: Create FuturesUnordered for concurrent tasks
loop For each epoch task
EpochManager->>TaskPool: Spawn async task with Arc clone
TaskPool->>ProcessQueue: process_light_slot(Arc<Self>, ...)
ProcessQueue->>ProcessQueue: Execute processing
ProcessQueue-->>TaskPool: Return Result
end
par Task Monitoring
EpochManager->>TaskPool: Poll next completed task
TaskPool-->>EpochManager: Task result or panic
and Tree Worker Management
EpochManager->>TreeWorker: spawn NewTreeWorker for background processing
TreeWorker->>TreeWorker: Run independently
EpochManager->>TreeWorker: shutdown with timeout on cleanup
end
EpochManager-->>Client: Return updated state (ForesterEpochPda, TreeForesterSchedule)
sequenceDiagram
participant ProofClient
participant Serializer as Proof Serializer
participant Compressor
participant ProverBackend
ProofClient->>Serializer: to_json(proof_input)?
alt Serialization Success
Serializer->>Serializer: create_json_from_struct returns Result
Serializer-->>ProofClient: Ok(json_string)
else Serialization Failure
Serializer-->>ProofClient: Err(ProverClientError::JsonSerialization)
ProofClient->>ProofClient: Propagate error with ?
end
ProofClient->>ProverBackend: submit_and_poll_proof(json)
alt Proof Generation Success
ProverBackend-->>ProofClient: Ok(proof_json)
ProofClient->>Compressor: compress_proof(proof_json)?
Compressor-->>ProofClient: Ok(CompressedProofParts)
else Error in Any Step
ProofClient->>ProofClient: Return ProofJobResult with error
ProofClient-->>ProofClient: Log error and terminate
end
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
The review involves substantial logic changes across multiple systems. Key areas demanding careful attention:
Possibly related PRs
Suggested reviewers
Force-pushed dc3b8b4 to fe08bcd (Compare)
Force-pushed fe08bcd to ef9d7fe (Compare)
Force-pushed aaaa67c to fbf48e5 (Compare)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
prover/client/src/helpers.rs (1)
38-48: ⚠️ Potential issue | 🟠 Major
Reject negative values in `bigint_to_u8_32`.
`to_bytes_be()` returns `(sign, magnitude)`. Right now `-1` serializes as `0x…01` instead of failing, which can turn invalid proof inputs into a different root/hash without any error.
Suggested fix

 pub fn bigint_to_u8_32(n: &BigInt) -> Result<[u8; 32], ProverClientError> {
-    let (_, bytes_be) = n.to_bytes_be();
+    let (sign, bytes_be) = n.to_bytes_be();
+    if sign == num_bigint::Sign::Minus {
+        return Err(ProverClientError::InvalidProofData(
+            "Negative values are not valid 32-byte field elements".to_string(),
+        ));
+    }
     if bytes_be.len() > 32 {
         return Err(ProverClientError::InvalidProofData(
             "Number too large to fit in [u8; 32]".to_string(),
         ));
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prover/client/src/helpers.rs` around lines 38 - 48, The function bigint_to_u8_32 currently ignores the sign from BigInt::to_bytes_be and therefore accepts negative values; update bigint_to_u8_32 to inspect the sign returned by to_bytes_be and return Err(ProverClientError::InvalidProofData(...)) when the sign is negative (i.e., not Sign::Plus), leaving the rest of the logic unchanged for non-negative values; reference the bigint_to_u8_32 function and the ProverClientError::InvalidProofData variant when implementing this check.

forester/tests/legacy/batched_state_async_indexer_test.rs (1)
79-90: ⚠️ Potential issue | 🟠 Major
Remove the second prover startup.
`init()` already starts the local validator with `enable_prover: true`. Spawning another prover here can race for port 3001 and fail the test before the worker-concurrency path is exercised.
Based on learnings: the light test validator must spawn three background processes: solana test validator on port 8899, prover server on port 3001, and photon indexer on port 8784.
Simplest fix

-    spawn_prover().await.unwrap();

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forester/tests/legacy/batched_state_async_indexer_test.rs` around lines 79 - 90, The test starts a prover twice causing a port race: remove the explicit spawn_prover().await.unwrap() call so the prover started by init(Some(LightValidatorConfig { enable_prover: true, ... })) is the only prover; ensure init() is called with enable_indexer: true and the LightValidatorConfig remains as shown so the test validator still spawns the Solana validator, prover (port 3001) and photon indexer (port 8784) in the background.

prover/client/src/proof_client.rs (1)
671-677: ⚠️ Potential issue | 🔴 Critical
Serialize `low_element_next_indices` in the address-append payload.
The `BatchAddressAppendInputs` struct carries `low_element_next_indices`, but the JSON serializer at prover/client/src/proof_types/batch_address_append/json.rs omits this field entirely. The prover circuit needs both the values and indices of boundary elements to validate indexed Merkle tree insertions (verifying that new addresses fit between `low_value` and `low_next_value` at the correct tree positions). Dropping the indices breaks this validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prover/client/src/proof_client.rs` around lines 671 - 677, The JSON serializer for BatchAddressAppendInputs is omitting the low_element_next_indices field, breaking prover validation; update the batch_address_append JSON serialization (the module that produces the payload consumed by generate_batch_address_append_proof) to include low_element_next_indices alongside low_element_values/low_next_values, serialize it with the exact key/name the prover circuit expects, and ensure the values are converted to JSON (e.g., a numeric array of indices) so to_json(&inputs) used in generate_batch_address_append_proof contains both the indices and values required by the circuit.

forester/src/forester_status.rs (1)
674-685: ⚠️ Potential issue | 🟠 Major
Don't silently mask queue parse failures as "empty queue."
At Line 674 and Line 729, deserialization errors are collapsed into `(0, 0)`. That makes malformed/unreadable queue data indistinguishable from truly empty queues and can produce misleading status/metrics output.
Please either propagate the parse error (preferred) or at least emit a warning with tree/queue identifiers before falling back.
Also applies to: 729-740
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forester/src/forester_status.rs` around lines 674 - 685, The code currently swallows deserialization errors by calling parse_hash_set_from_bytes::<QueueAccount>(&acc.data).ok().map(...).unwrap_or((0,0)), which makes malformed queue data indistinguishable from an empty queue; replace these silent fallbacks (the two occurrences using parse_hash_set_from_bytes::<QueueAccount>) with proper error handling: ideally change the enclosing function to return Result and propagate the parse error (use parse_hash_set_from_bytes::<QueueAccount>(&acc.data)? and compute (len, cap) on Ok), otherwise at minimum catch Err(e) and emit a warning including identifying info (e.g., the tree/queue identifiers available in scope such as acc.pubkey or tree_id) before returning the fallback (0,0); apply the same fix at both locations (the parse_hash_set_from_bytes::<QueueAccount> usages around the shown block and the 729–740 block).

prover/client/src/proof_types/batch_append/proof_inputs.rs (1)
115-130: ⚠️ Potential issue | 🟠 Major
Reject length mismatches before the zipped loop.
`zip()` stops at the shortest input. If `old_leaves`, `leaves`, or `merkle_proofs` is shorter than `batch_size`, this still returns `batch_size` unchanged but computes the root from only a partial batch. That will generate inconsistent public inputs and very opaque prover failures.
Suggested guard

 ) -> Result<(BatchAppendsCircuitInputs, Vec<ChangelogEntry<HEIGHT>>), ProverClientError> {
+    let expected_len = batch_size as usize;
+    if leaves.len() != expected_len
+        || old_leaves.len() != expected_len
+        || merkle_proofs.len() != expected_len
+    {
+        return Err(ProverClientError::GenericError(format!(
+            "batch append input length mismatch: leaves={}, old_leaves={}, merkle_proofs={}, expected={}",
+            leaves.len(),
+            old_leaves.len(),
+            merkle_proofs.len(),
+            expected_len,
+        )));
+    }
+
     let mut new_root = [0u8; 32];

Also applies to: 135-200
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@prover/client/src/proof_types/batch_append/proof_inputs.rs` around lines 115 - 130, The function get_batch_append_inputs currently zips old_leaves, leaves, and merkle_proofs and can silently process a partial batch if any of those vectors are shorter than batch_size; add explicit length checks before the zipped loop: verify old_leaves.len() == batch_size as usize, leaves.len() == batch_size as usize, and merkle_proofs.len() == batch_size as usize (or otherwise reject mismatches), and return an appropriate ProverClientError when any mismatch is detected so the function fails fast and never computes roots/inputs from a partial batch.

sdk-libs/client/src/interface/initialize_config.rs (1)
20-27: ⚠️ Potential issue | 🔴 Critical
Consolidate to the shared `InitializeLightConfigParams` payload and use `anchor_lang::Discriminator`.
The `InitializeRentFreeConfig::build()` method will produce an incompatible instruction payload. It serializes `InitializeCompressionConfigAnchorData` (which lacks `config_bump`), but the sibling `initialize_config()` function in instructions.rs correctly uses `InitializeLightConfigParams` with `config_bump` included. Both target the same on-chain instruction, so they must produce identical wire formats, but presently they do not.
Additionally, line 122 hardcodes the discriminator bytes. Fix both issues by:
- Reusing `InitializeLightConfigParams` here instead of the local struct
- Using the `anchor_lang::Discriminator` trait to derive the discriminator (as required by coding guidelines for sdk-libs/**/src/**/*.rs) rather than hardcoding bytes
Verify each finding against the current code and only fix it if needed. In `@sdk-libs/client/src/interface/initialize_config.rs` around lines 20 - 27, The build path is serializing the wrong payload and hardcodes the discriminator: replace the local struct InitializeCompressionConfigAnchorData with the shared InitializeLightConfigParams so InitializeRentFreeConfig::build() emits the same wire format as the existing initialize_config() function, and remove the manual discriminator bytes in favor of deriving and using anchor_lang::Discriminator (call the trait implementation to obtain the discriminator) rather than hardcoding values; update any serialization calls in InitializeRentFreeConfig::build() to use InitializeLightConfigParams and the Discriminator API so both instruction creators produce identical payloads.

sdk-libs/program-test/src/indexer/test_indexer.rs (1)
2502-2535: ⚠️ Potential issue | 🟠 Major
Handle prover HTTP failures explicitly instead of spinning or panicking.
This loop only decrements `retries` on transport errors. A prover that keeps returning 4xx/5xx will spin forever, and a 200 with an unreadable body still hits `unwrap()` on `text()` / `deserialize_gnark_proof_json()`. Please convert both HTTP-status failures and parse failures into `IndexerError` and consume a retry in those branches.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk-libs/program-test/src/indexer/test_indexer.rs` around lines 2502 - 2535, The HTTP loop currently only decrements retries on transport errors and uses unwrap() on response.text() and deserialize_gnark_proof_json(), so server 4xx/5xx and parse failures can spin or panic; update the client.post handling (the block around client.post(...).send().await and the success branch that calls response.text(), deserialize_gnark_proof_json, proof_from_json_struct, and compress_proof) to: treat non-success status codes as IndexerError::CustomError (include status and body if available), convert any failures from response.text(), deserialize_gnark_proof_json, proof_from_json_struct, or compress_proof into IndexerError::CustomError rather than unwrap/panic, and ensure each of those error branches decrements retries and follows the same retry/sleep logic so the function eventually returns Err(IndexerError::CustomError("Failed to get proof from server")) after exhausting retries; reference functions/values: client.post, response.text(), deserialize_gnark_proof_json, proof_from_json_struct, compress_proof, IndexerError, and ValidityProofWithContext.

forester/src/epoch_manager.rs (2)
4775-4785: ⚠️ Potential issue | 🟠 Major
Shutdown path can bypass tracked new-tree worker teardown.
When shutdown wins the `tokio::select!` in Line 4775, `epoch_manager.run()` is dropped before reaching Line 680 cleanup, so tracked `std::thread` workers may outlive service shutdown and continue processing.
💡 Proposed fix

-    let result = tokio::select! {
-        result = epoch_manager.run() => result,
+    let epoch_manager_for_run = epoch_manager.clone();
+    let result = tokio::select! {
+        result = epoch_manager_for_run.run() => result,
         _ = shutdown => {
             info!(
                 event = "shutdown_received",
                 run_id = %run_id_for_logs,
                 phase = "service_run",
                 "Received shutdown signal. Stopping the service."
             );
+            epoch_manager
+                .shutdown_new_tree_workers(NEW_TREE_WORKER_SHUTDOWN_TIMEOUT)
+                .await;
             Ok(())
         }
     };

Also applies to: 680-681
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forester/src/epoch_manager.rs` around lines 4775 - 4785, The shutdown branch in the tokio::select! currently just returns Ok(()) which drops epoch_manager.run() and skips the cleanup that joins tracked std::thread workers; instead, on shutdown you must trigger a graceful stop on the EpochManager and await its run future to complete so its cleanup code (the tracked-worker teardown at/around epoch_manager.run() and the cleanup near lines ~680) runs: send the same shutdown signal or call the epoch_manager.shutdown()/stop() method (or set a shutdown flag) and then await epoch_manager.run() (or call a provided join/await method) before returning Ok(()) so all tracked std::thread workers are torn down and joined. Ensure you reference and use the epoch_manager.run() future and any explicit join/teardown methods on EpochManager to perform the worker joins instead of dropping the future.
2181-2192: ⚠️ Potential issue | 🔴 Critical
`RegistrationTracker` claim can get stuck on early error paths.
After `try_claim_refinalize()` succeeds, Line 2188 and Line 2191 use `?`. If either fails, `complete_refinalize()` is never called, leaving `refinalize_in_progress = true` and potentially blocking other workers forever in `wait_for_refinalize()`.
💡 Proposed fix

 if registration_tracker.try_claim_refinalize() {
-    // This task sends the finalize_registration tx
-    let ix = create_finalize_registration_instruction(
-        &self.config.payer_keypair.pubkey(),
-        &self.config.derivation_pubkey,
-        epoch_info.epoch,
-    );
-    let priority_fee = self
-        .resolve_epoch_priority_fee(&*rpc, epoch_info.epoch)
-        .await?;
-    let current_slot = rpc.get_slot().await?;
-    let Some(confirmation_deadline) = scheduled_confirmation_deadline(
-        epoch_info.phases.active.end.saturating_sub(current_slot),
-    ) else {
-        info!(...);
-        registration_tracker.complete_refinalize(cached_weight);
-        return Ok((forester_epoch_pda, tree_schedule));
-    };
-    ...
-    match send_smart_transaction(...).await.map_err(RpcError::from) {
-        Ok(_) => { ... registration_tracker.complete_refinalize(post_finalize_weight); }
-        Err(e) => {
-            registration_tracker.complete_refinalize(cached_weight);
-            return Err(e.into());
-        }
-    }
+    let mut completion_weight = cached_weight;
+    let claim_result: Result<()> = async {
+        let ix = create_finalize_registration_instruction(
+            &self.config.payer_keypair.pubkey(),
+            &self.config.derivation_pubkey,
+            epoch_info.epoch,
+        );
+        let priority_fee = self
+            .resolve_epoch_priority_fee(&*rpc, epoch_info.epoch)
+            .await?;
+        let current_slot = rpc.get_slot().await?;
+        let Some(confirmation_deadline) = scheduled_confirmation_deadline(
+            epoch_info.phases.active.end.saturating_sub(current_slot),
+        ) else {
+            info!(...);
+            return Ok(());
+        };
+        match send_smart_transaction(...).await.map_err(RpcError::from) {
+            Ok(_) => {
+                completion_weight = match rpc.get_anchor_account::<EpochPda>(&epoch_pda_address).await {
+                    Ok(Some(pda)) => pda.registered_weight,
+                    _ => on_chain_weight,
+                };
+                Ok(())
+            }
+            Err(e) => Err(e.into()),
+        }
+    }.await;
+
+    registration_tracker.complete_refinalize(completion_weight);
+    claim_result?;
 } else {
     registration_tracker.wait_for_refinalize().await;
 }

Also applies to: 2246-2250
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@forester/src/epoch_manager.rs` around lines 2181 - 2192, After a successful RegistrationTracker::try_claim_refinalize() return ensure RegistrationTracker::complete_refinalize() is always called even if subsequent calls like resolve_epoch_priority_fee(), rpc.get_slot(), or scheduled_confirmation_deadline(...) fail; wrap the post-claim work in a scope that guarantees cleanup (e.g., a scope guard or an explicit match/let guard) so complete_refinalize() runs on all early-return/error paths, and apply the same pattern around the other similar block (lines ~2246-2250) to avoid leaving refinalize_in_progress = true.
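The scope-guard pattern named in the prompt can be sketched with a `Drop` impl. This is a minimal, self-contained model only: `RefinalizeGuard`, `refinalize`, and the boolean completion flag stand in for `RegistrationTracker`, the post-claim work, and `complete_refinalize`; the real forester types and weights are not reproduced here.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Guard whose Drop runs on every exit path: Ok returns, `?` early returns,
// and panic unwinds alike. Stand-in for calling complete_refinalize().
struct RefinalizeGuard {
    completed: Rc<Cell<bool>>,
}

impl Drop for RefinalizeGuard {
    fn drop(&mut self) {
        self.completed.set(true);
    }
}

// Models the post-claim work after try_claim_refinalize() succeeds.
fn refinalize(fail_fee_lookup: bool, completed: Rc<Cell<bool>>) -> Result<(), String> {
    let _guard = RefinalizeGuard { completed };
    if fail_fee_lookup {
        // Models resolve_epoch_priority_fee(...)? failing mid-way.
        return Err("priority fee lookup failed".into());
    }
    Ok(())
}

fn main() {
    let done = Rc::new(Cell::new(false));
    assert!(refinalize(true, done.clone()).is_err());
    // Even though we errored out early, the claim was released.
    assert!(done.get());
    println!("ok");
}
```

A `Drop` guard is the usual way to guarantee "always release the claim" without repeating the cleanup call in every error branch.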
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@forester/src/config.rs`:
- Around line 341-347: With the current use of partition(), the Err branch has
already returned by this point, so the map_err fallback on
valid.into_iter().collect(...) is unreachable and gives a misleading generic
error; change the logic in the tree
ID parsing flow (the variables/blocks using partition(), valid, invalid and the
ConfigError::InvalidArguments with field "tree_ids") to either directly
unwrap/collect the already-validated `valid` Vec<Pubkey> (since `invalid` was
handled) or perform a single-pass parse that collects invalid inputs and returns
ConfigError::InvalidArguments with the actual invalid_values; update the code
around the partition() usage to return the real invalid strings instead of the
generic "failed to parse tree_ids".
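A single-pass parse of the kind this prompt describes might look like the sketch below. It is illustrative only: `u64` parsing stands in for `Pubkey::from_str`, and the error type is simplified to `String` rather than `ConfigError::InvalidArguments`.

```rust
// Collect parsed values and invalid inputs in one walk, and surface the
// actual offending strings in the error instead of a generic message.
fn parse_tree_ids(inputs: &[&str]) -> Result<Vec<u64>, String> {
    let mut valid = Vec::new();
    let mut invalid = Vec::new();
    for s in inputs {
        match s.parse::<u64>() {
            Ok(id) => valid.push(id),
            Err(_) => invalid.push((*s).to_string()),
        }
    }
    if invalid.is_empty() {
        Ok(valid)
    } else {
        // The caller sees exactly which inputs failed to parse.
        Err(format!("invalid tree_ids: {invalid:?}"))
    }
}

fn main() {
    assert_eq!(parse_tree_ids(&["1", "2"]), Ok(vec![1, 2]));
    let err = parse_tree_ids(&["1", "oops"]).unwrap_err();
    assert!(err.contains("oops"));
    println!("ok");
}
```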
In `@forester/src/processor/v2/helpers.rs`:
- Around line 495-524: The function currently treats truncated batches as valid
by using actual_end = end.min(available) and proceeding; change the guard so
that if wait_for_batch(end) returns available < end the function returns
Ok(None) (i.e., do not proceed with partial batches). Specifically, in
get_batch_snapshot replace the current if start >= available { return Ok(None);
} / actual_end logic with a check like if available < end { return Ok(None); }
(still keep the start >= available check as needed), so code paths that call
lock_recover, read data.addresses[start..actual_end], or derive
leaves_hash_chain (data.leaves_hash_chains, create_hash_chain_from_slice) never
run for undersized batches.
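The "no partial batches" guard described above can be sketched as follows. This is a simplified model: the slice length stands in for what `wait_for_batch(end)` reports as available, and the address payload is reduced to `u32`s.

```rust
// Only take a snapshot when the full window [start, end) is available;
// partial availability yields Ok(None) rather than a truncated slice.
fn batch_snapshot(
    addresses: &[u32],
    start: usize,
    end: usize,
) -> Result<Option<Vec<u32>>, String> {
    let available = addresses.len(); // stands in for wait_for_batch(end)
    if start >= available || available < end {
        // Undersized batch: never hand out a partial window.
        return Ok(None);
    }
    Ok(Some(addresses[start..end].to_vec()))
}

fn main() {
    let data = vec![10, 20, 30];
    assert_eq!(batch_snapshot(&data, 0, 2).unwrap(), Some(vec![10, 20]));
    // Only 3 items available but 4 requested: no partial batch.
    assert_eq!(batch_snapshot(&data, 0, 4).unwrap(), None);
    println!("ok");
}
```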
In `@forester/src/processor/v2/proof_worker.rs`:
- Around line 197-214: The serialization-failure path uses two different
timestamps (submitted_at = Instant::now() here and round_trip_start elsewhere)
and silently ignores closed result_tx; consolidate this by adding a helper
(e.g., send_error_result or report_serialization_failure) used by both the
job.inputs.to_json error branch and the other serialization-before-submit branch
that: captures one Instant::now() into a single variable (round_trip_start or
submitted_at) at the failure site, logs the error with error! including the
error details, constructs the ProofJobResult with that single timestamp and
consistent round_trip_ms/proof_duration_ms values, and attempts to send via
job.result_tx while handling a closed channel by logging the send failure;
update both locations (the error branch around job.inputs.to_json and the other
branch that currently reuses round_trip_start) to call this helper so latency
metrics and logging are consistent.
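A consolidated error-result helper along the lines the prompt suggests could look like this. `ProofJobResult` is reduced to the two fields relevant here, and a std `mpsc` channel stands in for the worker's `result_tx`.

```rust
use std::sync::mpsc;
use std::time::Instant;

// Simplified stand-in for the forester's ProofJobResult.
struct ProofJobResult {
    round_trip_ms: u128,
    error: Option<String>,
}

// One timestamp captured at the failure site, one logging + send path,
// shared by every serialization-failure branch.
fn send_error_result(tx: &mpsc::Sender<ProofJobResult>, started: Instant, err: String) {
    eprintln!("proof job failed: {err}");
    let result = ProofJobResult {
        round_trip_ms: started.elapsed().as_millis(),
        error: Some(err),
    };
    // A closed channel is logged rather than silently ignored.
    if tx.send(result).is_err() {
        eprintln!("result channel closed; dropping error result");
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let started = Instant::now();
    send_error_result(&tx, started, "serialization failed".into());
    let got = rx.recv().unwrap();
    assert_eq!(got.error.as_deref(), Some("serialization failed"));
    let _ = got.round_trip_ms;
    drop(rx);
    // Receiver gone: the helper logs instead of panicking.
    send_error_result(&tx, started, "second failure".into());
    println!("ok");
}
```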
In `@forester/tests/test_batch_append_spent.rs`:
- Around line 332-346: The spawned closure passed to tokio::task::spawn_blocking
(bound to service_handle) can fail at runtime construction
(tokio::runtime::Builder::build), during
runtime.block_on(run_pipeline::<LightClient>(...)), and when joining the spawned
task; currently all results are ignored—update the test to explicitly handle
each error: unwrap or propagate the runtime build Result (ensure the build error
is surfaced), propagate/unwrap the Result returned by
run_pipeline::<LightClient> so pipeline failures fail the test, and await/join
the JoinHandle and check its JoinError (or unwrap) instead of discarding with
let _ = ..., failing the test on panic or join failure; reference
service_handle, runtime.block_on, run_pipeline::<LightClient>,
shutdown_receiver, shutdown_compressible_receiver, and work_report_sender when
making these changes.
In `@forester/tests/test_compressible_pda.rs`:
- Around line 312-313: Replace the bare .unwrap() calls on builder results with
.expect(...) that includes a descriptive message; specifically change the
.build().unwrap() invocations that call InitializeRentFreeConfig::build() (and
the two other similar setup builder calls in this test) to
.build().expect("InitializeRentFreeConfig::build() failed") or an equivalent
message that names the failing builder so failures in setup are clear. Ensure
each replaced call uses a distinct message if helpful to identify which builder
failed.
In `@prover/client/src/proof_types/batch_address_append/proof_inputs.rs`:
- Around line 190-195: The code currently slices
new_element_values[..zkp_batch_size] and then indexes parallel inputs
(low_element_values, low_element_next_values, low_element_indices,
low_element_next_indices, low_element_proofs) by that length, which can panic if
any of those slices are shorter; add explicit guards at the start of the
function that compute required_len = zkp_batch_size and verify each parallel
input's len() >= required_len (for low_element_proofs check len() >=
required_len and each inner-proof has HEIGHT elements) and return a
ProverClientError on failure instead of allowing a panic; reference the
parameter names low_element_values, low_element_next_values,
low_element_indices, low_element_next_indices, low_element_proofs,
new_element_values and zkp_batch_size when adding these checks.
In `@prover/client/src/proof_types/batch_append/json.rs`:
- Around line 76-79: Rename the inherent fallible method `to_string(&self) ->
Result<String, ProverClientError>` to a clearly-fallible name like
`try_to_json_string(&self) -> Result<String, ProverClientError>` (or
`to_json_string` if you prefer), update all call sites that invoke `to_string()`
to call the new name, keep the same return type and body
(`create_json_from_struct(&self)`), and remove the
`#[allow(clippy::inherent_to_string)]` attribute since the method no longer
shadows the conventional `to_string`. Ensure unit tests and any trait impls or
uses are updated to the new identifier.
In `@prover/client/src/proof_types/batch_update/proof_inputs.rs`:
- Around line 115-118: The public_inputs method currently indexes self.0[0],
which will panic on an empty batch and only returns the first entry; instead
iterate over the inner slice (self.0), call public_inputs_arr() on each item,
collect the results into the Vec<[u8; 32]>, and propagate any errors as
ProverClientError; update public_inputs to map/try_collect (or equivalent) over
self.0 so all batch entries are included and empty slices are handled without
panicking (or explicitly return an error if an empty batch is considered
invalid).
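The map-and-collect shape this prompt asks for can be sketched minimally. `Entry` and its `public_inputs_arr` are trivial stand-ins for the real batch-update types; the point is collecting `Result`s over every entry instead of indexing `[0]`.

```rust
// Stand-in for one batch entry with a fallible per-item conversion.
struct Entry(u8);

impl Entry {
    fn public_inputs_arr(&self) -> Result<[u8; 32], String> {
        Ok([self.0; 32])
    }
}

// Map every entry and collect into Result: all entries are included,
// and an empty batch yields an empty Vec instead of a panic.
fn public_inputs(entries: &[Entry]) -> Result<Vec<[u8; 32]>, String> {
    entries.iter().map(Entry::public_inputs_arr).collect()
}

fn main() {
    let batch = vec![Entry(1), Entry(2)];
    let out = public_inputs(&batch).unwrap();
    assert_eq!(out.len(), 2);
    assert_eq!(out[1][0], 2);
    // Empty batches no longer panic.
    assert_eq!(public_inputs(&[]).unwrap().len(), 0);
    println!("ok");
}
```

Collecting an iterator of `Result<T, E>` into `Result<Vec<T>, E>` short-circuits on the first error, which gives the desired propagation for free.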
In `@prover/client/src/proof_types/non_inclusion/v2/json.rs`:
- Around line 54-62: The JSON serialization currently forces path_index to a u32
by converting input.index_hashed_indexed_element_leaf via to_u32(), which will
fail for trees deeper than 32 bits; update the contract to allow larger indices
by switching path_index to a wider representation (for example: use to_u64() and
a u64 field, or serialize the index as a string/BigInt) and change the
conversion call from to_u32() to the chosen conversion (e.g., to_u64() or
to_string()) and update any consumers of path_index accordingly so indices >
u32::MAX are supported; ensure error handling still returns
ProverClientError::IntegerConversion with an appropriate message if conversion
to the new representation fails.
In `@prover/client/src/proof.rs`:
- Around line 167-170: The negate_g1 function currently deserializes the
incoming G1 point with Validate::No allowing malformed/off-curve points through;
change the call to G1::deserialize_with_mode(g1_le.as_slice(), Compress::No,
Validate::Yes) so the point is validated at deserialization and any error still
maps to ProverClientError::InvalidProofData; this uses the existing
convert_endianness and error mapping, so no further changes to error handling
are needed.
In `@sdk-libs/client/src/indexer/types/proof.rs`:
- Around line 214-223: The loop that computes output tree indices (using
account.tree_info.next_tree_info and account.tree_info.pack_output_tree_index on
each Account) currently keeps the first index in output_tree_index and ignores
mismatches; update PackedStateTreeInfos construction to validate that every call
to pack_output_tree_index returns the same index and return an error if any
subsequent index differs. Specifically, in the code paths that set
output_tree_index (the blocks referencing output_tree_index,
account.tree_info.next_tree_info, and pack_output_tree_index), compare the newly
computed index to the previously stored Some(index) and propagate a descriptive
Err when they diverge instead of silently ignoring the difference; apply the
same validation to the other occurrence noted (around lines 243-248).
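The consistency check described above can be modeled with a small fold. The packed index is reduced to a plain `u8` per account; in the real code it would come from `pack_output_tree_index`.

```rust
// Keep the first packed output-tree index and error when any subsequent
// account packs a different one, instead of silently ignoring the mismatch.
fn resolve_output_tree_index(indices: &[u8]) -> Result<Option<u8>, String> {
    let mut resolved: Option<u8> = None;
    for &idx in indices {
        match resolved {
            None => resolved = Some(idx),
            Some(prev) if prev != idx => {
                return Err(format!(
                    "conflicting output tree indices: {prev} vs {idx}"
                ));
            }
            Some(_) => {}
        }
    }
    Ok(resolved)
}

fn main() {
    assert_eq!(resolve_output_tree_index(&[3, 3, 3]), Ok(Some(3)));
    assert_eq!(resolve_output_tree_index(&[]), Ok(None));
    assert!(resolve_output_tree_index(&[3, 4]).is_err());
    println!("ok");
}
```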
In `@sdk-libs/client/src/interface/load_accounts.rs`:
- Around line 120-124: The missing-PDA error loses the original position because
collect_pda_hashes restarts enumeration for each group; update the flow so the
original flat index from cold_pdas is preserved: modify
group_pda_specs/collect_pda_hashes usage so each group carries a base_index (or
have collect_pda_hashes return (hash, original_index) pairs) and use that
original index when constructing MissingPdaCompressed; apply the same change to
the other occurrence referenced (lines ~164-175) so every MissingPdaCompressed
contains the flattened index from cold_pdas rather than the per-group 0-based
index.
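The base-index bookkeeping this prompt calls for can be sketched as follows. Groups are modeled as `Vec<bool>` (present/missing per PDA); the real types from `cold_pdas` and `MissingPdaCompressed` are not reproduced.

```rust
// Each group carries the offset of its first element in the original flat
// list, so a missing-PDA report names the flat index, not a per-group one.
fn missing_flat_indices(groups: &[Vec<bool>]) -> Vec<usize> {
    let mut missing = Vec::new();
    let mut base_index = 0;
    for group in groups {
        for (i, present) in group.iter().enumerate() {
            if !present {
                missing.push(base_index + i);
            }
        }
        base_index += group.len();
    }
    missing
}

fn main() {
    // Flat list of 5 PDAs split into groups of 2 and 3; items 1 and 4 are missing.
    let groups = vec![vec![true, false], vec![true, true, false]];
    assert_eq!(missing_flat_indices(&groups), vec![1, 4]);
    println!("ok");
}
```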
In `@sdk-libs/client/src/local_test_validator.rs`:
- Around line 127-136: The non-surfpool branch currently spawns the validator
with Command::new(...).spawn() and immediately drops the Child, so failures
during the sleep are ignored; change this to keep the Child (e.g., let mut child
= Command::new(...).spawn()?), await the
tokio::time::sleep(Duration::from_secs(config.wait_time)).await while retaining
the child, then call child.try_wait() (or child.wait() if blocking is
acceptable) and inspect the ExitStatus; if the child has exited with a
non-success status return an io::Error (or map the ExitStatus to Err) so callers
receive an error instead of Ok(()). Ensure you mirror the surfpool branch
pattern (spawn -> keep handle -> wait/try_wait -> check status) and propagate
any non-zero exit as an io::Error.
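The spawn -> keep handle -> try_wait -> check-status pattern can be sketched with std only. `sh -c "exit 3"` stands in for the validator binary, and a blocking `sleep` stands in for the tokio sleep in the real async code.

```rust
use std::io;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Spawn, keep the Child, wait through the warm-up window, then inspect the
// exit status instead of dropping the handle and assuming success.
fn spawn_and_check(cmd: &mut Command, wait: Duration) -> io::Result<()> {
    let mut child = cmd.spawn()?;
    sleep(wait); // real code: tokio::time::sleep(...).await
    match child.try_wait()? {
        // Exited during the window with a failure: surface it to the caller.
        Some(status) if !status.success() => Err(io::Error::new(
            io::ErrorKind::Other,
            format!("process exited early: {status}"),
        )),
        // Still running (None) or exited cleanly: treat as started.
        _ => Ok(()),
    }
}

fn main() {
    let mut failing = Command::new("sh");
    failing.arg("-c").arg("exit 3");
    assert!(spawn_and_check(&mut failing, Duration::from_millis(200)).is_err());

    let mut ok = Command::new("sh");
    ok.arg("-c").arg("exit 0");
    assert!(spawn_and_check(&mut ok, Duration::from_millis(200)).is_ok());
    println!("ok");
}
```

`try_wait` is non-blocking, so a still-running child is reported as `None` rather than stalling the caller.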
- Around line 69-115: The current code concatenates command pieces into a single
shell string and spawns it via Command::new("sh").arg("-c").arg(path), which
breaks on spaces/metacharacters (see path construction and uses of
config.enable_indexer, config.limit_ledger_size, config.sbf_programs,
config.upgradeable_programs, config.validator_args and config.use_surfpool).
Change to build a Vec<String> (or Vec<&str>) of argv: start with the executable
built from project_root + "/cli/test_bin/run", then push "test-validator" and
each flag/value as separate .arg(...) entries (one per sbf_program pair and per
upgradeable_program fields, and each validator_arg as its own arg) and call
Command::new(executable).args(&args).spawn() instead of using "sh -c". Also make
the non-surfpool branch behave like the surfpool branch: keep the child handle,
call child.wait(), check the exit Status and propagate or log failures instead
of dropping the child and sleeping unconditionally.
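The argv construction this comment asks for can be sketched as below. The config struct and flag names (`--indexer`, `--limit-ledger-size`) are illustrative stand-ins, not the CLI's actual options; the point is one argv entry per flag and value, never a re-joined shell string.

```rust
use std::process::Command;

// Simplified stand-in for the validator config described above.
struct ValidatorConfig {
    enable_indexer: bool,
    limit_ledger_size: Option<u64>,
    validator_args: Vec<String>,
}

// Each flag and value is its own argument, so paths with spaces survive.
fn build_command(project_root: &str, config: &ValidatorConfig) -> Command {
    let mut cmd = Command::new(format!("{project_root}/cli/test_bin/run"));
    cmd.arg("test-validator");
    if config.enable_indexer {
        cmd.arg("--indexer");
    }
    if let Some(limit) = config.limit_ledger_size {
        cmd.arg("--limit-ledger-size").arg(limit.to_string());
    }
    for extra in &config.validator_args {
        cmd.arg(extra); // one argv entry each, never interpreted by a shell
    }
    cmd
}

fn main() {
    let config = ValidatorConfig {
        enable_indexer: true,
        limit_ledger_size: Some(5000),
        validator_args: vec![],
    };
    // A project root containing a space stays a single program path.
    let cmd = build_command("/tmp/my project", &config);
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    assert_eq!(args[0], "test-validator");
    assert!(args.contains(&"--limit-ledger-size".to_string()));
    println!("ok");
}
```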
---
Outside diff comments:
In `@forester/src/epoch_manager.rs`:
- Around line 4775-4785: The shutdown branch in the tokio::select! currently
just returns Ok(()) which drops epoch_manager.run() and skips the cleanup that
joins tracked std::thread workers; instead, on shutdown you must trigger a
graceful stop on the EpochManager and await its run future to complete so its
cleanup code (the tracked-worker teardown at/around epoch_manager.run() and the
cleanup near lines ~680) runs: send the same shutdown signal or call the
epoch_manager.shutdown()/stop() method (or set a shutdown flag) and then await
epoch_manager.run() (or call a provided join/await method) before returning
Ok(()) so all tracked std::thread workers are torn down and joined. Ensure you
reference and use the epoch_manager.run() future and any explicit join/teardown
methods on EpochManager to perform the worker joins instead of dropping the
future.
- Around line 2181-2192: After a successful
RegistrationTracker::try_claim_refinalize() return ensure
RegistrationTracker::complete_refinalize() is always called even if subsequent
calls like resolve_epoch_priority_fee(), rpc.get_slot(), or
scheduled_confirmation_deadline(...) fail; wrap the post-claim work in a scope
that guarantees cleanup (e.g., a scope guard or an explicit match/let guard) so
complete_refinalize() runs on all early-return/error paths, and apply the same
pattern around the other similar block (lines ~2246-2250) to avoid leaving
refinalize_in_progress = true.
In `@forester/src/forester_status.rs`:
- Around line 674-685: The code currently swallows deserialization errors by
calling
parse_hash_set_from_bytes::<QueueAccount>(&acc.data).ok().map(...).unwrap_or((0,0)),
which makes malformed queue data indistinguishable from an empty queue; replace
these silent fallbacks (the two occurrences using
parse_hash_set_from_bytes::<QueueAccount>) with proper error handling: ideally
change the enclosing function to return Result and propagate the parse error
(use parse_hash_set_from_bytes::<QueueAccount>(&acc.data)? and compute (len,
cap) on Ok), otherwise at minimum catch Err(e) and emit a warning including
identifying info (e.g., the tree/queue identifiers available in scope such as
acc.pubkey or tree_id) before returning the fallback (0,0); apply the same fix
at both locations (the parse_hash_set_from_bytes::<QueueAccount> usages around
the shown block and the 729–740 block).
In `@forester/tests/legacy/batched_state_async_indexer_test.rs`:
- Around line 79-90: The test starts a prover twice causing a port race: remove
the explicit spawn_prover().await.unwrap() call so the prover started by
init(Some(LightValidatorConfig { enable_prover: true, ... })) is the only
prover; ensure init() is called with enable_indexer: true and the
LightValidatorConfig remains as shown so the test validator still spawns the
Solana validator, prover (port 3001) and photon indexer (port 8784) in the
background.
In `@prover/client/src/helpers.rs`:
- Around line 38-48: The function bigint_to_u8_32 currently ignores the sign
from BigInt::to_bytes_be and therefore accepts negative values; update
bigint_to_u8_32 to inspect the sign returned by to_bytes_be and return
Err(ProverClientError::InvalidProofData(...)) when the sign is negative (i.e.,
not Sign::Plus), leaving the rest of the logic unchanged for non-negative
values; reference the bigint_to_u8_32 function and the
ProverClientError::InvalidProofData variant when implementing this check.
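A self-contained sketch of the sign check, assuming the `(Sign, Vec<u8>)` shape that `BigInt::to_bytes_be` returns; the local `Sign` enum and `String` error stand in for `num_bigint::Sign` and `ProverClientError::InvalidProofData`:

```rust
/// Stand-in for `num_bigint::Sign`.
#[derive(Debug, PartialEq)]
enum Sign {
    Minus,
    NoSign,
    Plus,
}

/// Left-pads big-endian magnitude bytes into [u8; 32], rejecting negative
/// values instead of silently dropping the sign (the bug flagged above).
fn bigint_to_u8_32(sign: Sign, bytes: &[u8]) -> Result<[u8; 32], String> {
    if sign == Sign::Minus {
        return Err("InvalidProofData: negative value".to_string());
    }
    if bytes.len() > 32 {
        return Err("InvalidProofData: value exceeds 32 bytes".to_string());
    }
    let mut out = [0u8; 32];
    out[32 - bytes.len()..].copy_from_slice(bytes);
    Ok(out)
}
```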
In `@prover/client/src/proof_client.rs`:
- Around line 671-677: The JSON serializer for BatchAddressAppendInputs is
omitting the low_element_next_indices field, breaking prover validation; update
the batch_address_append JSON serialization (the module that produces the
payload consumed by generate_batch_address_append_proof) to include
low_element_next_indices alongside low_element_values/low_next_values, serialize
it with the exact key/name the prover circuit expects, and ensure the values are
converted to JSON (e.g., a numeric array of indices) so to_json(&inputs) used in
generate_batch_address_append_proof contains both the indices and values
required by the circuit.
In `@prover/client/src/proof_types/batch_append/proof_inputs.rs`:
- Around line 115-130: The function get_batch_append_inputs currently zips
old_leaves, leaves, and merkle_proofs and can silently process a partial batch
if any of those vectors are shorter than batch_size; add explicit length checks
before the zipped loop: verify old_leaves.len() == batch_size as usize,
leaves.len() == batch_size as usize, and merkle_proofs.len() == batch_size as
usize (or otherwise reject mismatches), and return an appropriate
ProverClientError when any mismatch is detected so the function fails fast and
never computes roots/inputs from a partial batch.
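The fail-fast length validation can be sketched as a standalone helper; the `String` error stands in for `ProverClientError`, and the exact field types are illustrative:

```rust
/// Fails fast when any input vector is shorter (or longer) than batch_size,
/// instead of letting `zip` silently truncate to the shortest vector.
fn check_batch_lengths(
    batch_size: usize,
    old_leaves: &[[u8; 32]],
    leaves: &[[u8; 32]],
    merkle_proofs: &[Vec<[u8; 32]>],
) -> Result<(), String> {
    for (name, len) in [
        ("old_leaves", old_leaves.len()),
        ("leaves", leaves.len()),
        ("merkle_proofs", merkle_proofs.len()),
    ] {
        if len != batch_size {
            return Err(format!(
                "batch input length mismatch: {} has {} entries, expected {}",
                name, len, batch_size
            ));
        }
    }
    Ok(())
}
```

Called at the top of `get_batch_append_inputs`, this guarantees the zipped loop below it always sees a full batch.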
In `@sdk-libs/client/src/interface/initialize_config.rs`:
- Around line 20-27: The build path is serializing the wrong payload and
hardcodes the discriminator: replace the local struct
InitializeCompressionConfigAnchorData with the shared
InitializeLightConfigParams so InitializeRentFreeConfig::build() emits the same
wire format as the existing initialize_config() function, and remove the manual
discriminator bytes in favor of deriving and using anchor_lang::Discriminator
(call the trait implementation to obtain the discriminator) rather than
hardcoding values; update any serialization calls in
InitializeRentFreeConfig::build() to use InitializeLightConfigParams and the
Discriminator API so both instruction creators produce identical payloads.
In `@sdk-libs/program-test/src/indexer/test_indexer.rs`:
- Around line 2502-2535: The HTTP loop currently only decrements retries on
transport errors and uses unwrap() on response.text() and
deserialize_gnark_proof_json(), so server 4xx/5xx and parse failures can spin or
panic; update the client.post handling (the block around
client.post(...).send().await and the success branch that calls response.text(),
deserialize_gnark_proof_json, proof_from_json_struct, and compress_proof) to:
treat non-success status codes as IndexerError::CustomError (include status and
body if available), convert any failures from response.text(),
deserialize_gnark_proof_json, proof_from_json_struct, or compress_proof into
IndexerError::CustomError rather than unwrap/panic, and ensure each of those
error branches decrements retries and follows the same retry/sleep logic so the
function eventually returns Err(IndexerError::CustomError("Failed to get proof
from server")) after exhausting retries; reference functions/values:
client.post, response.text(), deserialize_gnark_proof_json,
proof_from_json_struct, compress_proof, IndexerError, and
ValidityProofWithContext.
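Stripped of reqwest and the indexer types, the retry discipline the prompt describes reduces to a loop where every failure branch, transport-level or parse-level, consumes one retry; the closure and `String` error below are stand-ins for the real HTTP call and `IndexerError::CustomError`:

```rust
/// Generic retry loop: every failure consumes one retry, so the loop always
/// terminates with a final error instead of spinning or panicking.
fn retry_request<T>(
    mut retries: u32,
    mut attempt: impl FnMut() -> Result<T, String>,
) -> Result<T, String> {
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            Err(err) => {
                if retries == 0 {
                    return Err(format!("Failed to get proof from server: {}", err));
                }
                retries -= 1;
                // In the real client this is where the async sleep/backoff goes,
                // and `err` carries the status code and response body.
            }
        }
    }
}
```

In the real code, the `attempt` closure would fold non-success status codes, `response.text()` failures, and `deserialize_gnark_proof_json` failures into the same `Err` arm.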
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b8aeca30-1184-4e0f-84c8-f3cb3237bc1f
⛔ Files ignored due to path filters (63)
Each of the following is excluded by none and included by none:
forester-utils/src/account_zero_copy.rs, forester-utils/src/address_merkle_tree_config.rs, forester-utils/src/address_staging_tree.rs, forester-utils/src/forester_epoch.rs, forester-utils/src/instructions/compress_and_close_mint.rs, forester-utils/src/rate_limiter.rs, forester-utils/src/registry.rs, program-tests/account-compression-test/tests/address_merkle_tree_tests.rs, program-tests/account-compression-test/tests/merkle_tree_tests.rs, program-tests/batched-merkle-tree-test/tests/merkle_tree.rs, program-tests/compressed-token-test/tests/freeze/functional.rs, program-tests/compressed-token-test/tests/v1.rs, program-tests/system-cpi-v2-test/tests/event.rs, program-tests/system-cpi-v2-test/tests/invoke_cpi_with_read_only.rs, program-tests/utils/src/account_zero_copy.rs, program-tests/utils/src/actions/legacy/instructions/transfer2.rs, program-tests/utils/src/address_tree_rollover.rs, program-tests/utils/src/assert_compressed_tx.rs, program-tests/utils/src/assert_merkle_tree.rs, program-tests/utils/src/assert_queue.rs, program-tests/utils/src/batched_address_tree.rs, program-tests/utils/src/e2e_test_env.rs, program-tests/utils/src/lib.rs, program-tests/utils/src/mock_batched_forester.rs, program-tests/utils/src/state_tree_rollover.rs, program-tests/utils/src/test_batch_forester.rs, program-tests/utils/src/test_forester.rs, sdk-tests/anchor-manual-test/tests/shared.rs, sdk-tests/anchor-semi-manual-test/tests/shared/mod.rs, sdk-tests/anchor-semi-manual-test/tests/stress_test.rs, sdk-tests/client-test/tests/light_client.rs, sdk-tests/csdk-anchor-full-derived-test/tests/amm_stress_test.rs, sdk-tests/csdk-anchor-full-derived-test/tests/amm_test.rs, sdk-tests/csdk-anchor-full-derived-test/tests/basic_test.rs, sdk-tests/csdk-anchor-full-derived-test/tests/d10_ata_idempotent_test.rs, sdk-tests/csdk-anchor-full-derived-test/tests/d10_token_accounts_test.rs, sdk-tests/csdk-anchor-full-derived-test/tests/d11_zero_copy_test.rs, sdk-tests/csdk-anchor-full-derived-test/tests/failing_tests.rs, sdk-tests/csdk-anchor-full-derived-test/tests/integration_tests.rs, sdk-tests/csdk-anchor-full-derived-test/tests/shared.rs, sdk-tests/pinocchio-light-program-test/tests/shared/mod.rs, sdk-tests/pinocchio-light-program-test/tests/stress_test.rs, sdk-tests/pinocchio-manual-test/tests/shared.rs, sdk-tests/sdk-anchor-test/programs/sdk-anchor-test/tests/read_only.rs, sdk-tests/sdk-anchor-test/programs/sdk-anchor-test/tests/test.rs, sdk-tests/sdk-native-test/tests/test.rs, sdk-tests/sdk-pinocchio-v1-test/tests/test.rs, sdk-tests/sdk-pinocchio-v2-test/tests/test.rs, sdk-tests/sdk-token-test/tests/ctoken_pda.rs, sdk-tests/sdk-token-test/tests/decompress_full_cpi.rs, sdk-tests/sdk-token-test/tests/pda_ctoken.rs, sdk-tests/sdk-token-test/tests/test.rs, sdk-tests/sdk-token-test/tests/test_4_invocations.rs, sdk-tests/sdk-token-test/tests/test_4_transfer2.rs, sdk-tests/sdk-token-test/tests/test_deposit.rs, sdk-tests/sdk-v1-native-test/tests/test.rs, sdk-tests/single-account-loader-test/tests/test.rs, sdk-tests/single-ata-test/tests/test.rs, sdk-tests/single-mint-test/tests/test.rs, sdk-tests/single-pda-test/tests/test.rs, sdk-tests/single-token-test/tests/test.rs, sparse-merkle-tree/src/indexed_changelog.rs, sparse-merkle-tree/tests/indexed_changelog.rs
📒 Files selected for processing (66)
forester/src/config.rs, forester/src/epoch_manager.rs, forester/src/forester_status.rs, forester/src/main.rs, forester/src/metrics.rs, forester/src/priority_fee.rs, forester/src/processor/v2/helpers.rs, forester/src/processor/v2/proof_worker.rs, forester/src/processor/v2/strategy/address.rs, forester/src/queue_helpers.rs, forester/tests/e2e_test.rs, forester/tests/legacy/batched_state_async_indexer_test.rs, forester/tests/legacy/test_utils.rs, forester/tests/test_batch_append_spent.rs, forester/tests/test_compressible_pda.rs, forester/tests/test_indexer_interface.rs, forester/tests/test_utils.rs, prover/client/src/errors.rs, prover/client/src/helpers.rs, prover/client/src/proof.rs, prover/client/src/proof_client.rs, prover/client/src/proof_types/batch_address_append/json.rs, prover/client/src/proof_types/batch_address_append/proof_inputs.rs, prover/client/src/proof_types/batch_append/json.rs, prover/client/src/proof_types/batch_append/proof_inputs.rs, prover/client/src/proof_types/batch_update/json.rs, prover/client/src/proof_types/batch_update/proof_inputs.rs, prover/client/src/proof_types/combined/v1/json.rs, prover/client/src/proof_types/combined/v2/json.rs, prover/client/src/proof_types/combined/v2/proof_inputs.rs, prover/client/src/proof_types/inclusion/v1/json.rs, prover/client/src/proof_types/inclusion/v1/proof_inputs.rs, prover/client/src/proof_types/inclusion/v2/json.rs, prover/client/src/proof_types/inclusion/v2/proof_inputs.rs, prover/client/src/proof_types/non_inclusion/v1/json.rs, prover/client/src/proof_types/non_inclusion/v1/proof_inputs.rs, prover/client/src/proof_types/non_inclusion/v2/json.rs, prover/client/src/proof_types/non_inclusion/v2/proof_inputs.rs, prover/client/src/prover.rs, prover/client/tests/batch_address_append.rs, prover/client/tests/batch_append.rs, prover/client/tests/batch_update.rs, prover/client/tests/combined.rs, prover/client/tests/inclusion.rs, prover/client/tests/init_merkle_tree.rs, prover/client/tests/non_inclusion.rs, sdk-libs/client/src/fee.rs, sdk-libs/client/src/indexer/error.rs, sdk-libs/client/src/indexer/options.rs, sdk-libs/client/src/indexer/photon_indexer.rs, sdk-libs/client/src/indexer/types/proof.rs, sdk-libs/client/src/indexer/types/queue.rs, sdk-libs/client/src/indexer/types/token.rs, sdk-libs/client/src/interface/initialize_config.rs, sdk-libs/client/src/interface/instructions.rs, sdk-libs/client/src/interface/load_accounts.rs, sdk-libs/client/src/interface/mod.rs, sdk-libs/client/src/interface/pack.rs, sdk-libs/client/src/interface/serialize.rs, sdk-libs/client/src/local_test_validator.rs, sdk-libs/client/src/rpc/client.rs, sdk-libs/client/src/utils.rs, sdk-libs/macros/src/light_pdas/seeds/extract.rs, sdk-libs/program-test/src/indexer/test_indexer.rs, sdk-libs/program-test/src/program_test/compressible_setup.rs, sdk-libs/program-test/src/program_test/light_program_test.rs
```rust
valid
    .into_iter()
    .collect::<std::result::Result<Vec<_>, _>>()
    .map_err(|_| ConfigError::InvalidArguments {
        field: "tree_ids",
        invalid_values: vec!["failed to parse tree_ids".to_string()],
    })?
```
🧹 Nitpick | 🔵 Trivial
This fallback parse error is unreachable after the partition() above.
valid can only contain Ok(Pubkey) values once the invalid branch has returned, so the generic "failed to parse tree_ids" path never executes. I'd either unwrap the partitioned values directly or collapse the parse into a single pass so the error surface stays aligned with the real invalid inputs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forester/src/config.rs` around lines 341 - 347, The current use of
partition() makes the Err branch already returned, so the map_err fallback on
valid.into_iter().collect(...) is unreachable and gives a misleading generic
error; change the logic in the tree ID parsing flow (the variables/blocks using
partition(), valid, invalid and the ConfigError::InvalidArguments with field
"tree_ids") to either directly unwrap/collect the already-validated `valid`
Vec<Pubkey> (since `invalid` was handled) or perform a single-pass parse that
collects invalid inputs and returns ConfigError::InvalidArguments with the
actual invalid_values; update the code around the partition() usage to return
the real invalid strings instead of the generic "failed to parse tree_ids".
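A single-pass version of the parse, sketched with `u64` standing in for `Pubkey` and a plain `Vec<String>` standing in for `ConfigError::InvalidArguments`, so the error carries the actual invalid inputs and no unreachable fallback remains:

```rust
/// Single-pass parse: collects every invalid input for the error message,
/// so no second `collect` with an unreachable map_err is needed.
fn parse_tree_ids(inputs: &[&str]) -> Result<Vec<u64>, Vec<String>> {
    let mut valid = Vec::new();
    let mut invalid = Vec::new();
    for raw in inputs {
        match raw.parse::<u64>() {
            Ok(id) => valid.push(id),
            Err(_) => invalid.push((*raw).to_string()),
        }
    }
    if invalid.is_empty() {
        Ok(valid)
    } else {
        // Maps onto ConfigError::InvalidArguments { field: "tree_ids", invalid_values }
        Err(invalid)
    }
}
```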
```diff
     let available = self.wait_for_batch(end);
     if start >= available {
-        return None;
+        return Ok(None);
     }
     let actual_end = end.min(available);
     let data = lock_recover(&self.data, "streaming_address_queue.data");
-    Some(BatchDataSlice {
-        addresses: data.addresses[start..actual_end].to_vec(),
+    let addresses = data.addresses[start..actual_end].to_vec();
+    if addresses.is_empty() {
+        return Err(anyhow!("Empty batch at start={}", start));
+    }
+
+    let leaves_hashchain = match data.leaves_hash_chains.get(hashchain_idx).copied() {
+        Some(hashchain) => hashchain,
+        None => {
+            tracing::debug!(
+                "Missing leaves_hash_chain for batch {} (available: {}), deriving from addresses",
+                hashchain_idx,
+                data.leaves_hash_chains.len()
+            );
+            create_hash_chain_from_slice(&addresses).map_err(|error| {
+                anyhow!(
+                    "Failed to derive leaves_hash_chain for batch {} from {} addresses: {}",
+                    hashchain_idx,
+                    addresses.len(),
+                    error
+                )
+            })?
+        }
+    };
```
get_batch_snapshot should not return a partial batch as a successful snapshot.
When wait_for_batch(end) returns with available < end, this code truncates to actual_end and still returns Some(...). Downstream callers treat that as a real ZKP batch and can derive a hashchain/proof inputs for an undersized batch instead of waiting or cleanly skipping it.
Suggested guard

```diff
-    if start >= available {
+    if available < end || start >= available {
         return Ok(None);
     }
-    let actual_end = end.min(available);
+    let actual_end = end;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forester/src/processor/v2/helpers.rs` around lines 495 - 524, The function
currently treats truncated batches as valid by using actual_end =
end.min(available) and proceeding; change the guard so that if
wait_for_batch(end) returns available < end the function returns Ok(None) (i.e.,
do not proceed with partial batches). Specifically, in get_batch_snapshot
replace the current if start >= available { return Ok(None); } / actual_end
logic with a check like if available < end { return Ok(None); } (still keep the
start >= available check as needed), so code paths that call lock_recover, read
data.addresses[start..actual_end], or derive leaves_hash_chain
(data.leaves_hash_chains, create_hash_chain_from_slice) never run for undersized
batches.
```rust
let inputs_json = match job.inputs.to_json(&job.tree_id, job.seq) {
    Ok(inputs_json) => inputs_json,
    Err(error) => {
        let _ = job
            .result_tx
            .send(ProofJobResult {
                seq: job.seq,
                result: Err(format!("Failed to serialize proof inputs: {}", error)),
                old_root: [0u8; 32],
                new_root: [0u8; 32],
                proof_duration_ms: 0,
                round_trip_ms: 0,
                submitted_at: std::time::Instant::now(),
            })
            .await;
        return;
    }
};
```
Use one timestamp/logging path for serialization failures.
Line 209 stamps submitted_at with Instant::now(), while Line 306 reuses round_trip_start for the same “serialization failed before submit” outcome. That makes latency metrics path-dependent, and these branches also go silent if result_tx is closed. Factor this into one helper with a single timestamp convention and an error! before the send.
Suggested direction

```diff
+async fn send_serialization_failure(
+    result_tx: &mpsc::Sender<ProofJobResult>,
+    seq: u64,
+    error: &ProverClientError,
+    submitted_at: std::time::Instant,
+    round_trip_ms: u64,
+) {
+    error!("Failed to serialize proof inputs for seq={}: {}", seq, error);
+    let _ = result_tx
+        .send(ProofJobResult {
+            seq,
+            result: Err(format!("Failed to serialize proof inputs: {}", error)),
+            old_root: [0u8; 32],
+            new_root: [0u8; 32],
+            proof_duration_ms: 0,
+            round_trip_ms,
+            submitted_at,
+        })
+        .await;
+}
```

Also applies to: 296-311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forester/src/processor/v2/proof_worker.rs` around lines 197 - 214, The
serialization-failure path uses two different timestamps (submitted_at =
Instant::now() here and round_trip_start elsewhere) and silently ignores closed
result_tx; consolidate this by adding a helper (e.g., send_error_result or
report_serialization_failure) used by both the job.inputs.to_json error branch
and the other serialization-before-submit branch that: captures one
Instant::now() into a single variable (round_trip_start or submitted_at) at the
failure site, logs the error with error! including the error details, constructs
the ProofJobResult with that single timestamp and consistent
round_trip_ms/proof_duration_ms values, and attempts to send via job.result_tx
while handling a closed channel by logging the send failure; update both
locations (the error branch around job.inputs.to_json and the other branch that
currently reuses round_trip_start) to call this helper so latency metrics and
logging are consistent.
```rust
let service_handle = tokio::task::spawn_blocking(move || {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()?;
    runtime.block_on(run_pipeline::<LightClient>(
        config,
        None,
        None,
        shutdown_receiver,
        Some(shutdown_compressible_receiver),
        None, // shutdown_bootstrap
        work_report_sender,
    ))
});
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify the task result is currently ignored and where it happens.
rg -n "spawn_blocking|timeout\\(Duration::from_secs\\(5\\), service_handle\\)\\.await" forester/tests/test_batch_append_spent.rs
```

Repository: Lightprotocol/light-protocol
Length of output: 209

🏁 Script executed:

```bash
sed -n '330,355p' forester/tests/test_batch_append_spent.rs | cat -n
```

Repository: Lightprotocol/light-protocol
Length of output: 1079

🏁 Script executed:

```bash
rg -B2 "fn run_pipeline" forester/src/lib.rs | head -20
```

Repository: Lightprotocol/light-protocol
Length of output: 217

🏁 Script executed:

```bash
rg -A8 "pub async fn run_pipeline<R: Rpc" forester/src/lib.rs | head -30
```

Repository: Lightprotocol/light-protocol
Length of output: 480
Surface pipeline task failures instead of dropping them.
This closure can fail at three levels—runtime construction (line 334-337), run_pipeline execution, and task joining—but the caller discards the join result with let _ = ... at line 352. This silently hides startup and runtime failures, producing false-positive test passes.
The timeout wrapping a JoinHandle<Result<()>> creates nested error layers that all need explicit handling. Currently all are ignored.
💡 Suggested fix

```diff
-    let _ = timeout(Duration::from_secs(5), service_handle).await;
+    match timeout(Duration::from_secs(5), service_handle).await {
+        Ok(Ok(Ok(()))) => {}
+        Ok(Ok(Err(e))) => panic!("forester pipeline returned error: {e:#}"),
+        Ok(Err(e)) => panic!("forester task join error: {e}"),
+        Err(_) => panic!("timed out waiting for forester shutdown"),
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forester/tests/test_batch_append_spent.rs` around lines 332 - 346, The
spawned closure passed to tokio::task::spawn_blocking (bound to service_handle)
can fail at runtime construction (tokio::runtime::Builder::build), during
runtime.block_on(run_pipeline::<LightClient>(...)), and when joining the spawned
task; currently all results are ignored—update the test to explicitly handle
each error: unwrap or propagate the runtime build Result (ensure the build error
is surfaced), propagate/unwrap the Result returned by
run_pipeline::<LightClient> so pipeline failures fail the test, and await/join
the JoinHandle and check its JoinError (or unwrap) instead of discarding with
let _ = ..., failing the test on panic or join failure; reference
service_handle, runtime.block_on, run_pipeline::<LightClient>,
shutdown_receiver, shutdown_compressible_receiver, and work_report_sender when
making these changes.
```rust
        .build()
        .unwrap();
```
🧹 Nitpick | 🔵 Trivial
Prefer expect(...) over bare unwrap() in these setup builders.
These are high-churn setup steps, and the rest of the file already uses descriptive expects. Keeping the message here makes it much faster to tell whether InitializeRentFreeConfig::build() failed or the later transaction path broke.
Minimal tweak

```diff
-    .build()
-    .unwrap();
+    .build()
+    .expect("InitializeRentFreeConfig::build should succeed");
```

Also applies to: 508-509, 750-751
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@forester/tests/test_compressible_pda.rs` around lines 312 - 313, Replace the
bare .unwrap() calls on builder results with .expect(...) that includes a
descriptive message; specifically change the .build().unwrap() invocations that
call InitializeRentFreeConfig::build() (and the two other similar setup builder
calls in this test) to .build().expect("InitializeRentFreeConfig::build()
failed") or an equivalent message that names the failing builder so failures in
setup are clear. Ensure each replaced call uses a distinct message if helpful to
identify which builder failed.
```rust
if health_check(10, 1).await {
    return Ok(());
}

let _ = command.wait_with_output();
let loading_guard = IS_LOADING
    .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
    .is_ok();

let health_result = health_check(120, 1).await;
if health_result {
    info!("Prover started successfully");
} else {
    panic!("Failed to start prover, health check failed.");
}
if !loading_guard {
    return if health_check(120, 1).await {
        Ok(())
    } else {
        Err(ProverClientError::HealthCheckFailed)
    };
}

let spawn_result = async {
    let command = Command::new(&prover_path)
        .arg("start-prover")
        .spawn()
        .map_err(|error| ProverClientError::ProcessStart(error.to_string()))?;

    command
        .wait_with_output()
        .map_err(|error| ProverClientError::ProcessWait(error.to_string()))?;

    if health_check(120, 1).await {
        info!("Prover started successfully");
        Ok(())
    } else {
        Err(ProverClientError::HealthCheckFailed)
    }
} else {
    panic!("Failed to find project root.");
};
}
.await;

IS_LOADING.store(false, Ordering::Release);
```
There was a problem hiding this comment.
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== spawn_prover / health_check blocking calls =="
rg -n --glob '*.rs' 'spawn_prover\s*\(|async fn health_check|wait_with_output|thread::sleep|std::thread::sleep'
echo
echo "== start-prover command implementation =="
rg -n --glob '*.rs' 'start-prover|start_prover'
```

Repository: Lightprotocol/light-protocol
Length of output: 4314

🏁 Script executed:

```bash
cat -n prover/client/src/prover.rs | head -100
```

Repository: Lightprotocol/light-protocol
Length of output: 3115

🏁 Script executed:

```bash
find . -name "*.rs" -type f | xargs grep -l "start-prover" | grep -v test | grep -v target
```

Repository: Lightprotocol/light-protocol
Length of output: 101

🏁 Script executed:

```bash
find . -type f -name "*.rs" | xargs grep -l "start_prover\|start-prover" | head -20
```

Repository: Lightprotocol/light-protocol
Length of output: 151

🏁 Script executed:

```bash
find . -name "cli" -type d | head -5
```

Repository: Lightprotocol/light-protocol
Length of output: 77

🏁 Script executed:

```bash
find ./cli -type f -name "*.rs" | head -20
```

Repository: Lightprotocol/light-protocol
Length of output: 54

🏁 Script executed:

```bash
ls -la ./cli/
```

Repository: Lightprotocol/light-protocol
Length of output: 1226

🏁 Script executed:

```bash
ls -la ./cli/test_bin/
```

Repository: Lightprotocol/light-protocol
Length of output: 284

🏁 Script executed:

```bash
cat ./cli/test_bin/run
```

Repository: Lightprotocol/light-protocol
Length of output: 177

🏁 Script executed:

```bash
find ./cli/src -type f \( -name "*.ts" -o -name "*.js" \) | xargs grep -l "start-prover\|start_prover"
```

Repository: Lightprotocol/light-protocol
Length of output: 56

🏁 Script executed:

```bash
find ./cli/src -type f -name "*.ts" | head -20
```

Repository: Lightprotocol/light-protocol
Length of output: 841

🏁 Script executed:

```bash
cat ./cli/src/commands/start-prover/index.ts
```

Repository: Lightprotocol/light-protocol
Length of output: 1362

🏁 Script executed:

```bash
cat ./cli/src/utils/processProverServer.ts
```

Repository: Lightprotocol/light-protocol
Length of output: 4457
The async startup path blocks the Tokio executor with synchronous I/O; this will deadlock if the prover daemonizes.
The health_check() function blocks on std::thread::sleep() (line 89) within an async context, and more critically, wait_with_output() (line 58) blocks the entire async block waiting for the spawned process to exit. If start-prover is designed to daemonize (as the TypeScript CLI suggests), this call never returns, IS_LOADING never clears, and subsequent calls retry for 120 seconds each while blocking a Tokio worker thread.
Three specific problems:
- Blocking sleep in async: each retry loop in `health_check()` sleeps synchronously, wasting a worker thread.
- Child process wait blocks executor: `wait_with_output()` blocks indefinitely if the prover detaches.
- State machine wedge: if the child blocks indefinitely, line 70 never executes, leaving `IS_LOADING = true` permanently; any concurrent caller then runs `health_check(120, 1)`, sleeping up to 120 seconds in the executor.
Replace std::thread::sleep with tokio::time::sleep, and either spawn the prover as a detached child (if it daemonizes) or refactor to poll readiness without blocking the executor. Note: the code passes start-prover as the subcommand, but the TypeScript CLI expects start—verify which is correct.
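The state-machine wedge can be closed independently of the async rework with a drop guard that owns the flag; this std-only sketch (the `LoadingGuard` type is hypothetical) clears `IS_LOADING` on every exit path, even if the spawn logic errors or panics:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static IS_LOADING: AtomicBool = AtomicBool::new(false);

/// Clears IS_LOADING when dropped, so an early return or panic in the
/// spawn path can never leave the flag wedged at `true`.
struct LoadingGuard;

impl LoadingGuard {
    /// Returns None if another caller already holds the loading flag.
    fn try_acquire() -> Option<Self> {
        IS_LOADING
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
            .then_some(LoadingGuard)
    }
}

impl Drop for LoadingGuard {
    fn drop(&mut self) {
        IS_LOADING.store(false, Ordering::Release);
    }
}

/// Models the spawn path: the guard is held across the fallible work.
fn spawn_with_guard(fail: bool) -> Result<(), String> {
    let _guard = LoadingGuard::try_acquire().ok_or("already loading")?;
    if fail {
        // Guard still clears the flag on this early return.
        return Err("prover health check failed".to_string());
    }
    Ok(())
}
```

The blocking-sleep and child-wait problems still need `tokio::time::sleep` and a non-blocking wait, which this sketch does not cover.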
```diff
 if let Some(next) = account.tree_info.next_tree_info {
-    // SAFETY: account will always have a state Merkle tree context.
-    // pack_output_tree_index only panics on an address Merkle tree context.
-    let index = next.pack_output_tree_index(packed_accounts).unwrap();
+    let index = next.pack_output_tree_index(packed_accounts)?;
     if output_tree_index.is_none() {
         output_tree_index = Some(index);
     }
 } else {
-    // SAFETY: account will always have a state Merkle tree context.
-    // pack_output_tree_index only panics on an address Merkle tree context.
-    let index = account
-        .tree_info
-        .pack_output_tree_index(packed_accounts)
-        .unwrap();
+    let index = account.tree_info.pack_output_tree_index(packed_accounts)?;
     if output_tree_index.is_none() {
         output_tree_index = Some(index);
     }
```
Validate that every state input resolves to the same output tree.
PackedStateTreeInfos only carries one output_tree_index, but this loop keeps the first computed index and silently ignores different indices from later accounts. A proof that mixes accounts whose current/next output trees diverge will therefore pack the wrong output tree without error.
Also applies to: 243-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk-libs/client/src/indexer/types/proof.rs` around lines 214 - 223, The loop
that computes output tree indices (using account.tree_info.next_tree_info and
account.tree_info.pack_output_tree_index on each Account) currently keeps the
first index in output_tree_index and ignores mismatches; update
PackedStateTreeInfos construction to validate that every call to
pack_output_tree_index returns the same index and return an error if any
subsequent index differs. Specifically, in the code paths that set
output_tree_index (the blocks referencing output_tree_index,
account.tree_info.next_tree_info, and pack_output_tree_index), compare the newly
computed index to the previously stored Some(index) and propagate a descriptive
Err when they diverge instead of silently ignoring the difference; apply the
same validation to the other occurrence noted (around lines 243-248).
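The divergence check reduces to a fold over the per-account indices; this sketch uses bare `u8` indices and a `String` error rather than the real packed-account types:

```rust
/// Folds per-account output-tree indices into one, erroring on divergence
/// instead of silently keeping the first index (the bug flagged above).
fn resolve_output_tree_index(indices: &[u8]) -> Result<Option<u8>, String> {
    let mut resolved: Option<u8> = None;
    for (i, &index) in indices.iter().enumerate() {
        match resolved {
            None => resolved = Some(index),
            Some(existing) if existing != index => {
                return Err(format!(
                    "account {} resolves to output tree index {}, but earlier accounts resolved to {}",
                    i, index, existing
                ));
            }
            Some(_) => {}
        }
    }
    Ok(resolved)
}
```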
```rust
let pda_groups = group_pda_specs(&cold_pdas, MAX_PDAS_PER_IX);
let pda_hashes = pda_groups
    .iter()
    .map(|group| collect_pda_hashes(group))
    .collect::<Result<Vec<_>, _>>()?;
```
Preserve the original PDA index in MissingPdaCompressed.
After group_pda_specs, each collect_pda_hashes(group) starts enumerating from zero again, so the 9th/17th/etc. cold PDA now reports index 0 instead of its position in cold_pdas. Carry the flattened index through grouping so this error stays actionable once there is more than one PDA batch.
Also applies to: 164-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk-libs/client/src/interface/load_accounts.rs` around lines 120 - 124, The
missing-PDA error loses the original position because collect_pda_hashes
restarts enumeration for each group; update the flow so the original flat index
from cold_pdas is preserved: modify group_pda_specs/collect_pda_hashes usage so
each group carries a base_index (or have collect_pda_hashes return (hash,
original_index) pairs) and use that original index when constructing
MissingPdaCompressed; apply the same change to the other occurrence referenced
(lines ~164-175) so every MissingPdaCompressed contains the flattened index from
cold_pdas rather than the per-group 0-based index.
let path = "cli/test_bin/run test-validator";
let mut path = format!("{}/{}", project_root.trim(), path);
if !config.enable_indexer {
    path.push_str(" --skip-indexer");
}

for sbf_program in config.sbf_programs.iter() {
    path.push_str(&format!(
        " --sbf-program {} {}",
        sbf_program.0, sbf_program.1
    ));
}
if let Some(limit_ledger_size) = config.limit_ledger_size {
    path.push_str(&format!(" --limit-ledger-size {}", limit_ledger_size));
}

for upgradeable_program in config.upgradeable_programs.iter() {
    path.push_str(&format!(
        " --upgradeable-program {} {} {}",
        upgradeable_program.program_id,
        upgradeable_program.program_path,
        upgradeable_program.upgrade_authority
    ));
}

if !config.enable_prover {
    path.push_str(" --skip-prover");
}

if config.use_surfpool {
    path.push_str(" --use-surfpool");
}

for arg in config.validator_args.iter() {
    path.push_str(&format!(" {}", arg));
}

println!("Starting validator with command: {}", path);

if config.use_surfpool {
    // The CLI starts surfpool, prover, and photon, then exits once all
    // services are ready. Wait for it to finish so we know everything
    // is up before the test proceeds.
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(path)
        .stdin(Stdio::null())
        .stdout(Stdio::inherit())
        .stderr(Stdio::inherit())
        .spawn()
        .expect("Failed to start server process");
    let status = child.wait().expect("Failed to wait for CLI process");
    assert!(status.success(), "CLI exited with error: {}", status);
} else {
    let child = Command::new("sh")
        .arg("-c")
        .arg(path)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .expect("Failed to start server process");
    std::mem::drop(child);
    tokio::time::sleep(tokio::time::Duration::from_secs(config.wait_time)).await;
}
🧩 Analysis chain
🏁 Script executed: `cat -n sdk-libs/client/src/local_test_validator.rs | head -150` (Repository: Lightprotocol/light-protocol, length of output: 5342)
🏁 Script executed: `rg "validator_args" sdk-libs/client/src/ -A 2 -B 2` (Repository: Lightprotocol/light-protocol, length of output: 1418)
🏁 Script executed: `rg "LightValidatorConfig" sdk-libs/client/src/ -A 5 -B 2` (Repository: Lightprotocol/light-protocol, length of output: 2832)
Build and pass validator command as structured argv, not a concatenated shell string.
Lines 69–105 concatenate project_root, program paths, and validator_args into a single shell string. This breaks when paths or arguments contain spaces or shell metacharacters—for example, a project root like /home/user/light protocol/... or a validator arg like --account ADDRESS /path/with spaces/file.json. The documented example for validator_args (line 42) explicitly shows file paths, confirming this is a real use case.
Instead of passing the full command to sh -c, invoke cli/test_bin/run directly as the executable and build the argument list as separate items: Command::new("cli/test_bin/run").arg("test-validator").arg("--skip-indexer")... This avoids shell tokenization entirely.
Additionally, the non-surfpool branch (lines 127–136) drops the child process and sleeps unconditionally, so startup failures are hidden. The surfpool branch (lines 109–126) correctly waits for the process and checks the exit status; apply the same pattern to the non-surfpool path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk-libs/client/src/local_test_validator.rs` around lines 69 - 115, The
current code concatenates command pieces into a single shell string and spawns
it via Command::new("sh").arg("-c").arg(path), which breaks on
spaces/metacharacters (see path construction and uses of config.enable_indexer,
config.limit_ledger_size, config.sbf_programs, config.upgradeable_programs,
config.validator_args and config.use_surfpool). Change to build a Vec<String>
(or Vec<&str>) of argv: start with the executable built from project_root +
"/cli/test_bin/run", then push "test-validator" and each flag/value as separate
.arg(...) entries (one per sbf_program pair and per upgradeable_program fields,
and each validator_arg as its own arg) and call
Command::new(executable).args(&args).spawn() instead of using "sh -c". Also make
the non-surfpool branch behave like the surfpool branch: keep the child handle,
call child.wait(), check the exit Status and propagate or log failures instead
of dropping the child and sleeping unconditionally.
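The argv-based construction the prompt asks for can be sketched like this. It is illustrative only: `build_argv` is a hypothetical helper, and the parameter names mirror `LightValidatorConfig` fields mentioned in the review rather than the struct's real definition.

```rust
// Sketch: build the validator invocation as (executable, argv) so paths and
// arguments with spaces survive intact, instead of one `sh -c` string.
fn build_argv(
    project_root: &str,
    skip_indexer: bool,
    validator_args: &[String],
) -> (String, Vec<String>) {
    let executable = format!("{}/cli/test_bin/run", project_root.trim());
    let mut args = vec!["test-validator".to_string()];
    if skip_indexer {
        args.push("--skip-indexer".to_string());
    }
    // Each extra argument stays its own argv entry; no shell tokenization.
    args.extend(validator_args.iter().cloned());
    (executable, args)
}
```

The caller would then spawn with `Command::new(executable).args(&args)`, so a project root like `/home/user/light protocol` needs no quoting.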
} else {
    let child = Command::new("sh")
        .arg("-c")
        .arg(path)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    std::mem::drop(child);
    tokio::time::sleep(tokio::time::Duration::from_secs(config.wait_time)).await;
🧩 Analysis chain
🏁 Script executed: `cd sdk-libs/client && head -200 src/local_test_validator.rs | tail -100` (Repository: Lightprotocol/light-protocol, length of output: 3074)
🏁 Script executed: `cd sdk-libs/client && cat -n src/local_test_validator.rs | sed -n '100,160p'` (Repository: Lightprotocol/light-protocol, length of output: 1593)
🏁 Script executed: `cd sdk-libs/client && cat -n src/local_test_validator.rs | sed -n '1,50p'` (Repository: Lightprotocol/light-protocol, length of output: 2023)
Keep the child handle through the delay and check for process failures in the non-surfpool branch.
The non-surfpool branch spawns the validator process but immediately drops the handle without checking if it succeeds. If the process dies during wait_time due to a bad flag, missing binary, occupied port, or other issue, callers still receive Ok(()). The surfpool branch (lines 109-126) shows the correct pattern: hold the child handle, call wait(), and check the exit status. Apply the same approach here: keep the child handle through the sleep, then call try_wait() afterward and surface any non-zero exit status as an io::Error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk-libs/client/src/local_test_validator.rs` around lines 127 - 136, The
non-surfpool branch currently spawns the validator with
Command::new(...).spawn() and immediately drops the Child, so failures during
the sleep are ignored; change this to keep the Child (e.g., let mut child =
Command::new(...).spawn()?), await the
tokio::time::sleep(Duration::from_secs(config.wait_time)).await while retaining
the child, then call child.try_wait() (or child.wait() if blocking is
acceptable) and inspect the ExitStatus; if the child has exited with a
non-success status return an io::Error (or map the ExitStatus to Err) so callers
receive an error instead of Ok(()). Ensure you mirror the surfpool branch
pattern (spawn -> keep handle -> wait/try_wait -> check status) and propagate
any non-zero exit as an io::Error.
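The spawn-keep-check pattern described above can be sketched as a small helper. This is a simplified, hedged illustration: it uses a blocking `std::thread::sleep` for brevity where the real async code would use `tokio::time::sleep`, and `wait_and_check` is a hypothetical function name.

```rust
use std::io;
use std::process::Child;

// Keep the child handle through the startup delay, then surface an early
// exit as an io::Error instead of dropping the handle and returning Ok(()).
fn wait_and_check(mut child: Child, wait_secs: u64) -> io::Result<Child> {
    std::thread::sleep(std::time::Duration::from_secs(wait_secs));
    // try_wait() returns Some(status) only if the process already exited.
    if let Some(status) = child.try_wait()? {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("validator exited during startup: {}", status),
        ));
    }
    Ok(child)
}
```

A validator that dies during `wait_time` (bad flag, occupied port, missing binary) now propagates an error to the test instead of being silently swallowed.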
Force-pushed: fbf48e5 to 46b80e9 (Compare)
Force-pushed: 46b80e9 to 786422b (Compare)