meta-facebook: gc2-es: Fix MB_SOC_CPU_TEMP_C and MB_SOC_THERMAL_MARGI…#2673
Closed
Joseph-Shih-ww wants to merge 1 commit into facebook:main
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D92955293. (Because this pull request was imported automatically, there will not be any future comments.)
…N_C sensor readings

[Task Description]
Fix MB_SOC_THERMAL_MARGIN_C sensor occasionally reading 255°C and MB_SOC_CPU_TEMP_C sensor showing incorrect positive values during SHC test. (GC20T5T7-57)

[Motivation]
During GC2 L10 MFG build's SHC test, MB_SOC_THERMAL_MARGIN_C frequently reads 255°C instead of expected negative values. These readings do not represent true sensor values and need correction.

[Root Cause]
1. CPU temperature sensor occasionally returns Tjmax+1 instead of actual Tjmax value
2. Thermal margin calculation results in 255 when CPU temp equals or exceeds Tjmax

[Design]
Fix MB_SOC_CPU_TEMP_C Sensor:
- Add post_cpu_read() function to detect and correct Tjmax+1 readings
- If CPU temp equals Tjmax+1, correct it to Tjmax value
- Add get_cpu_tjmax() helper function to read Tjmax from PECI

Fix MB_SOC_THERMAL_MARGIN_C Sensor:
- Add boundary check in post_cpu_margin_read() function
- Convert small positive values (1°C) to 0°C to prevent false alarms

[Test Result]
Verified on GC2-ES platform using sensor-util:
root@bmc-oob:~# sensor-util server --force | grep -iE "SOC_(.*?)_C"
MB_SOC_CPU_TEMP_C (0x5) : 78.000 C | (ok)
MB_SOC_THERMAL_MARGIN_C (0x14) : 0.000 C | (ok)
MB_SOC_TJMAX_C (0x15) : 78.000 C | (ok)
CPU temperature correctly capped at Tjmax (78°C) and thermal margin shows proper value.

Verified sensor logs show no abnormal behavior:
root@bmc-oob:~# log-util all --print | grep -iE "MB_SOC_(.*?)_C"
1 server 2026-02-11 01:09:06 sensord ASSERT: Upper Critical threshold - raised - FRU: 1, num: 0x5 curr_val: 75.00 C, thresh_val: 75.00 C, snr: MB_SOC_CPU_TEMP_C
1 server 2026-02-11 01:09:38 sensord ASSERT: Upper Non Recoverable threshold - raised - FRU: 1, num: 0x5 curr_val: 78.00 C, thresh_val: 78.00 C, snr: MB_SOC_CPU_TEMP_C
1 server 2026-02-11 01:09:40 sensord DEASSERT: Upper Non Recoverable threshold - settled - FRU: 1, num: 0x5 curr_val: 77.00 C, thresh_val: 78.00 C, snr: MB_SOC_CPU_TEMP_C
1 server 2026-02-11 01:09:44 sensord ASSERT: Upper Non Recoverable threshold - raised - FRU: 1, num: 0x5 curr_val: 78.00 C, thresh_val: 78.00 C, snr: MB_SOC_CPU_TEMP_C
Confirmed no abnormal logs detected.
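
For readers without access to the diff, the two corrections listed under [Design] can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from this pull request: the `_sketch` and `_stub` function names, their signatures, the float representation, and the hard-coded Tjmax of 78 (taken from the MB_SOC_TJMAX_C reading above) are all hypothetical. The real get_cpu_tjmax() reads Tjmax from PECI.

```c
#include <stdio.h>

/* Hypothetical stand-in for get_cpu_tjmax(): the real helper reads Tjmax
 * over PECI. Here it returns a fixed 78 C, matching the test log above.
 * Returns 0 on success. */
static int get_cpu_tjmax_stub(float *tjmax) {
  if (tjmax == NULL) {
    return -1;
  }
  *tjmax = 78.0f;
  return 0;
}

/* Sketch of the MB_SOC_CPU_TEMP_C correction: if the reading equals
 * Tjmax+1, report Tjmax instead. Assumes whole-degree readings, so a
 * small tolerance is enough to detect "equals Tjmax+1". */
static int post_cpu_read_sketch(float *value) {
  float tjmax;
  float diff;

  if (value == NULL || get_cpu_tjmax_stub(&tjmax) != 0) {
    return -1;
  }
  diff = *value - (tjmax + 1.0f);
  if (diff > -0.001f && diff < 0.001f) {
    *value = tjmax; /* Tjmax+1 is treated as a spurious reading */
  }
  return 0;
}

/* Sketch of the MB_SOC_THERMAL_MARGIN_C boundary check: a small positive
 * margin (up to 1 C) is reported as 0 C. Negative margins, which are the
 * expected case, pass through unchanged. */
static int post_cpu_margin_read_sketch(float *value) {
  if (value == NULL) {
    return -1;
  }
  if (*value > 0.0f && *value <= 1.0f) {
    *value = 0.0f;
  }
  return 0;
}
```

Under these assumptions, a raw CPU reading of 79 is reported as 78 and a raw margin of 1 is reported as 0, which is consistent with the 78.000 C and 0.000 C values in the sensor-util output above.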
bebcf4c to a994621
@Joseph-Shih-ww has updated the pull request. You must reimport the pull request before landing.
This pull request has been merged in 5a2415e.