LTE-2775 : Observing RDKB_PROCESS_CRASHED : zebra is not running, restarting the zebra in XLE#181
LTE-2775 : Observing RDKB_PROCESS_CRASHED : zebra is not running, restarting the zebra in XLE#181aj970-crypto wants to merge 4 commits intodevelopfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses zebra process crashes (RDKB_PROCESS_CRASHED) in XLE by adding LAN status checking before triggering zebra and enhancing debug logging throughout the routing service components.
Changes:
- Added comprehensive debug logging (APPLY_PRINT macro) to track zebra lifecycle and LAN/WAN status
- Introduced zebra process validation after startup to detect failed starts
- Added LAN status monitoring in shell script to trigger zebra restart when LAN becomes ready
- Enhanced error reporting with errno details for better debugging
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| source/service_routed/service_routed.c | Added APPLY_PRINT logging macro and debug prints throughout radv/zebra lifecycle functions; added post-startup zebra process validation; enhanced error reporting |
| source/service_deviceMode/service_devicemode.c | Added APPLY_PRINT logging macro for device mode service; commented out zebra-restart sysevent; added LAN status retrieval (unused) |
| source/scripts/init/service.d/service_routed.sh | Added log function with timestamp; added zebra restart trigger when LAN status becomes "started" |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local timestamp | ||
| local log_file | ||
| timestamp=$(date +'%Y-%m-%d %H:%M:%S') | ||
| echo "$timestamp [$level] $message" | tee -a "$LOG_FILE" |
There was a problem hiding this comment.
The log_file variable is assigned but never used in this function. The actual logging uses the LOG_FILE variable from the tee command. Either remove the unused log_file variable or implement the intended daily log rotation functionality.
| echo "$timestamp [$level] $message" | tee -a "$LOG_FILE" |
| local timestamp | ||
| local log_file | ||
| timestamp=$(date +'%Y-%m-%d %H:%M:%S') | ||
| echo "$timestamp [$level] $message" | tee -a "$LOG_FILE" |
There was a problem hiding this comment.
The LOG_FILE variable is referenced on line 49 but is never defined in this script. This will cause the log function to fail or write to an unintended location. Define LOG_FILE before it's used, for example: LOG_FILE="/rdklogs/logs/service_routed.log"
| }\ | ||
| } | ||
|
|
||
| #include <time.h> |
There was a problem hiding this comment.
The time.h header is included after the macro definition that uses time() and localtime(). While this may work in some cases due to transitive includes, it's better practice to include time.h before line 98 where the APPLY_PRINT macro is defined, since the macro uses time_t, time(), and localtime() functions.
| fprintf(logfptr, "getLanIpv6Info failed"); | ||
| APPLY_PRINT("%s: getLanIpv6Info failed", __FUNCTION__); |
There was a problem hiding this comment.
The missing newline character at the end of the format string will cause incomplete log entries. The fprintf call should include "\n" at the end of the format string to ensure proper log formatting.
| fprintf(logfptr, "getLanIpv6Info failed"); | |
| APPLY_PRINT("%s: getLanIpv6Info failed", __FUNCTION__); | |
| fprintf(logfptr, "getLanIpv6Info failed\n"); | |
| APPLY_PRINT("%s: getLanIpv6Info failed\n", __FUNCTION__); |
| char lanStartVal[64] = {0}; | ||
| sysevent_get(sysevent_fd, sysevent_token, "lan-status", lanStartVal, sizeof(lanStartVal)); | ||
| APPLY_PRINT("%s: lan-status value is %s \n", __FUNCTION__, lanStartVal); | ||
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
There was a problem hiding this comment.
The zebra-restart sysevent has been commented out without any replacement logic or explanation in the code. This appears to be intentional based on the PR description about checking LAN status, but there's no corresponding logic that ensures zebra is restarted appropriately when LAN becomes ready. Consider either uncommenting this with proper LAN status checks, or adding clear documentation about why it was removed and how zebra restart is now triggered.
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); | |
| /* | |
| * Note: | |
| * The zebra-restart sysevent used to be triggered unconditionally here | |
| * when entering ROUTER mode. As part of changes to the LAN bring-up | |
| * sequence, zebra lifecycle (start/restart) is now managed by the | |
| * LAN/WAN management components once LAN is fully up. | |
| * | |
| * To avoid duplicate restarts and race conditions with those | |
| * components, devicemode no longer restarts zebra directly. | |
| * If zebra restart behavior needs to be modified, it should be done | |
| * in the component responsible for routing daemon management rather | |
| * than by re-enabling this call. | |
| */ | |
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
| log INFO "LAN status started. Starting routed service." | ||
| service_routed start | ||
| log INFO "LAN status started. Restarting zebra." | ||
| service_routed radv-restart |
There was a problem hiding this comment.
Inconsistent indentation: line 90 has an extra leading space before 'service_routed radv-restart' compared to line 88. This should use consistent indentation with the surrounding lines.
| service_routed radv-restart | |
| service_routed radv-restart |
| service_routed start | ||
| log INFO "LAN status started. Restarting zebra." | ||
| service_routed radv-restart |
There was a problem hiding this comment.
The radv-restart command is called immediately after service_routed start when lan-status becomes "started". This could lead to a race condition where zebra is restarted before it's fully initialized from the initial start command. Consider adding a small delay between the start and restart, or checking if zebra is already running before attempting to restart it.
| #define LOG_FILE "/rdklogs/logs/service_routed.txt" | ||
| #define APPLY_PRINT(fmt ...) {\ | ||
| FILE *logfp = fopen(LOG_FILE , "a+");\ |
There was a problem hiding this comment.
The APPLY_PRINT macro is defined twice in this file with different LOG_FILE values. The first definition at line 97 sets LOG_FILE to "/rdklogs/logs/service_routed.txt", while this would conflict with any other LOG_FILE definition in the same compilation unit. Consider using a unique macro name or ensuring there's no conflict with the LOG_FILE_NAME macro used elsewhere in the file.
| #define LOG_FILE "/rdklogs/logs/service_routed.txt" | |
| #define APPLY_PRINT(fmt ...) {\ | |
| FILE *logfp = fopen(LOG_FILE , "a+");\ | |
| #define SERVICE_ROUTED_LOG_FILE "/rdklogs/logs/service_routed.txt" | |
| #define APPLY_PRINT(fmt ...) {\ | |
| FILE *logfp = fopen(SERVICE_ROUTED_LOG_FILE , "a+");\ |
| APPLY_PRINT("[%s]: `%s` failed: rc=%d errno=%d (%s)\n", PROG_NAME, cmd_ops[i].cmd, rc1, errno, strerror(errno)); | ||
| } else { | ||
| fprintf(logfptr, "[%s]: `%s` succeeded\n", PROG_NAME, cmd_ops[i].cmd); | ||
| APPLY_PRINT("[%s]: `%s` succeeded\n", PROG_NAME, cmd_ops[i].cmd); |
There was a problem hiding this comment.
Inconsistent indentation detected. Lines 2776 and 2779 use tabs while the surrounding code uses spaces. This should be corrected to maintain consistent indentation style throughout the file.
| APPLY_PRINT("[%s]: `%s` failed: rc=%d errno=%d (%s)\n", PROG_NAME, cmd_ops[i].cmd, rc1, errno, strerror(errno)); | |
| } else { | |
| fprintf(logfptr, "[%s]: `%s` succeeded\n", PROG_NAME, cmd_ops[i].cmd); | |
| APPLY_PRINT("[%s]: `%s` succeeded\n", PROG_NAME, cmd_ops[i].cmd); | |
| APPLY_PRINT("[%s]: `%s` failed: rc=%d errno=%d (%s)\n", PROG_NAME, cmd_ops[i].cmd, rc1, errno, strerror(errno)); | |
| } else { | |
| fprintf(logfptr, "[%s]: `%s` succeeded\n", PROG_NAME, cmd_ops[i].cmd); | |
| APPLY_PRINT("[%s]: `%s` succeeded\n", PROG_NAME, cmd_ops[i].cmd); |
| char lanStartVal[64] = {0}; | ||
| sysevent_get(sysevent_fd, sysevent_token, "lan-status", lanStartVal, sizeof(lanStartVal)); | ||
| APPLY_PRINT("%s: lan-status value is %s \n", __FUNCTION__, lanStartVal); | ||
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
There was a problem hiding this comment.
The lanStartVal variable is retrieved but never used. If this was intended to be a condition check before restarting zebra (as suggested by the PR description about checking LAN status), the logic is incomplete. Either remove the unused variable or implement the intended LAN status check.
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); | |
| if (strcmp(lanStartVal, "started") == 0) | |
| { | |
| APPLY_PRINT("%s: LAN is started, restarting zebra\n", __FUNCTION__); | |
| sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); | |
| } | |
| else | |
| { | |
| APPLY_PRINT("%s: LAN is not started (status: %s), skipping zebra restart\n", __FUNCTION__, lanStartVal); | |
| } |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #endif | ||
| runCommandInShellBlocking("systemctl restart CcspLMLite.service"); | ||
| sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); | ||
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
There was a problem hiding this comment.
Commented-out code should be removed rather than left in the codebase. If this code needs to be removed as part of the fix, delete the line entirely. If you want to preserve the history, rely on version control (git) to track what was changed.
| // sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
| status=$(sysevent get lan-status) | ||
| if [ "$status" == "started" ]; then | ||
| service_routed start | ||
| service_routed radv-restart |
There was a problem hiding this comment.
This radv-restart call may cause zebra to be stopped and restarted immediately after being started in scenarios where the routed service was not already running (e.g., when LAN comes up before WAN). While this is redundant in that case, it appears necessary to handle the case where routed is already started (e.g., WAN came up first) and zebra needs to be restarted with updated LAN configuration. Consider adding a comment explaining this dual-purpose behavior, or refactoring to check routed-status before calling radv-restart to avoid the unnecessary restart in the first scenario.
rajkamal-cv
left a comment
There was a problem hiding this comment.
I'm good with the changes
rajkamal-cv
left a comment
There was a problem hiding this comment.
Update the test results with the latest change
If yes, has the Gerrit topic or list of all dependent PRs across components (including meta-layer changes) been shared?
Gerrit verification build passed - https://gerrit.teamccp.com/#/c/944531/ Logs are updated in Jira - https://ccp.sys.comcast.net/secure/attachment/14036403/Logs_with_code_fix.txt |
5fb42e6 to
64632d7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
64632d7 to
3730dff
Compare
| status=$(sysevent get lan-status) | ||
| if [ "$status" == "started" ]; then | ||
| service_routed start | ||
| service_routed radv-start |
There was a problem hiding this comment.
call this only on XLE device. zebra-restart is triggered from dhcpv6 sequence in XB /Other Gateway devices
3730dff to
40070e7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
40070e7 to
3202401
Compare
Signed-off-by: aj970 <akshaya_j@comcast.com>
Signed-off-by: aj970 <akshaya_j@comcast.com>
Signed-off-by: aj970 <akshaya_j@comcast.com>
Signed-off-by: aj970 <akshaya_j@comcast.com>
3202401 to
7e1c4ed
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if [ "$BOX_TYPE" == "WNXL11BWL" ]; then | ||
| service_routed radv-start | ||
| fi |
There was a problem hiding this comment.
This additional call to service_routed radv-start is redundant and will cause zebra to be started twice. The service_routed start call on line 70 already invokes serv_routed_start() (in service_routed.c line 2225), which calls radv_start() at line 2263 to start the zebra daemon. Adding another explicit radv-start call here will attempt to start zebra a second time immediately after it's already been started, which is unnecessary and could cause issues.
| service_routed start | ||
| if [ "$BOX_TYPE" == "WNXL11BWL" ]; then | ||
| service_routed radv-start | ||
| fi |
There was a problem hiding this comment.
The indentation has changed from no spaces to one space before the function call. This change appears to be unintentional and should maintain consistency with the surrounding code style. Looking at the rest of the file, there is no consistent indentation pattern for commands within case blocks, but this change doesn't add value and introduces inconsistency with the previous line format.
| service_routed start | |
| if [ "$BOX_TYPE" == "WNXL11BWL" ]; then | |
| service_routed radv-start | |
| fi | |
| service_routed start | |
| if [ "$BOX_TYPE" == "WNXL11BWL" ]; then | |
| service_routed radv-start | |
| fi |
| if [ "$BOX_TYPE" == "WNXL11BWL" ]; then | ||
| service_routed radv-start | ||
| fi |
There was a problem hiding this comment.
The PR description states that zebra startup has been moved to service_routed.sh to ensure it starts only after LAN is fully initialized. However, the implementation doesn't match this description. The service_routed start call on line 70 (which is executed for all devices, not just WNXL11BWL) already starts zebra via the serv_routed_start() function in service_routed.c. This function calls radv_start() which both checks if LAN is ready and starts the zebra daemon. Therefore, the correct fix for WNXL11BWL would simply be to remove the zebra-restart event from service_devicemode.c (which has been done correctly), without adding any additional code here. The added lines 71-73 create a redundant second zebra start attempt.
| runCommandInShellBlocking("systemctl restart CcspLMLite.service"); | ||
| sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); | ||
| #if !defined(_WNXL11BWL_PRODUCT_REQ_) | ||
| sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
There was a problem hiding this comment.
The indentation level here is inconsistent with the surrounding code. Line 419 and 420 use different indentation levels (line 419 has 12 spaces, while line 421 has 16 spaces). This should match the indentation of line 419 for consistency, which appears to be at the same logical nesting level.
| sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); | |
| sysevent_set(sysevent_fd, sysevent_token, "zebra-restart", "", 0); |
|
Attaching the latest logs captured in XB and XLE device after addressing the review comments. |
LTE-2775 : Observing RDKB_PROCESS_CRASHED : zebra is not running, restarting the zebra in XLE
Reason for change:
1. The zebra restart logic requires bridge mode to be 0 and the LAN to be fully started. Previously, both LAN startup and the zebra start were triggered from the same location in service_devicemode.c. In cases where LAN initialization was slightly delayed, zebra startup would fail because it was being triggered before the LAN was ready.
2. To address this, zebra startup has been moved to service_routed.sh, ensuring that the zebra process is started only after the LAN is fully initialized.
Test Procedure: Check zebra process when XLE is in router mode.
Risks: Low
Priority: P1