meta-facebook: gc2-es: Fix MB_SOC_CPU_TEMP_C and MB_SOC_THERMAL_MARGI…#2673

Closed
Joseph-Shih-ww wants to merge 1 commit into facebook:main from Wiwynn:Jim/fbgc2-es/fix_mb_soc_cpu_temp_c_and_mb_soc_thermal_margin_c_sensor_readings

@Joseph-Shih-ww Joseph-Shih-ww commented Feb 11, 2026

[Task Description]

Fix the MB_SOC_THERMAL_MARGIN_C sensor occasionally reading 255°C and the MB_SOC_CPU_TEMP_C sensor showing incorrect positive values during the SHC test. (GC20T5T7-57)

[Motivation]

During the SHC test of the GC2 L10 MFG build, MB_SOC_THERMAL_MARGIN_C frequently reads 255°C instead of the expected negative value. These readings do not reflect the true sensor state and need to be corrected.

[Root Cause]

  1. CPU temperature sensor occasionally returns Tjmax+1 instead of actual Tjmax value
  2. Thermal margin calculation results in 255 when CPU temp equals or exceeds Tjmax

[Design]

Fix MB_SOC_CPU_TEMP_C Sensor:

  • Add post_cpu_read() function to detect and correct Tjmax+1 readings
  • If CPU temp equals Tjmax+1, correct it to Tjmax value
  • Add get_cpu_tjmax() helper function to read Tjmax from PECI
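
The correction described in the list above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the name `correct_cpu_temp` and its parameters are hypothetical, and Tjmax is passed in directly instead of being read over PECI as `get_cpu_tjmax()` is described as doing. Readings are assumed to be whole degrees Celsius stored as `float`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the PR's post_cpu_read()).
 * If the reading is Tjmax + 1, replace it with Tjmax; any other
 * reading is left untouched. A small tolerance is used instead of
 * an exact float comparison.
 * Returns 1 if the value was corrected, 0 otherwise.
 */
static int correct_cpu_temp(float tjmax, float *value)
{
    assert(value != NULL);

    float diff = *value - (tjmax + 1.0f);
    if (diff > -0.01f && diff < 0.01f) {
        *value = tjmax;
        return 1;
    }
    return 0;
}
```

With Tjmax at 78, a reading of 79 would be rewritten to 78, while readings of 77 or 80 would pass through unchanged.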

Fix MB_SOC_THERMAL_MARGIN_C Sensor:

  • Add boundary check in post_cpu_margin_read() function
  • Convert small positive values (1°C) to 0°C to prevent false alarms
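
A rough sketch of such a boundary check is shown below. The function name `clamp_thermal_margin` and the cutoff constant are assumptions made for illustration; the PR places the check inside `post_cpu_margin_read()`, whose actual code is not shown on this page.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cutoff: positive margins up to 1 degree C are treated as 0. */
#define MARGIN_POSITIVE_CUTOFF_C 1.0f

/*
 * Hypothetical helper (not the PR's post_cpu_margin_read()).
 * The thermal margin is expected to be zero or negative, so a small
 * positive value is treated as a rounding artifact and reported as 0.
 */
static void clamp_thermal_margin(float *margin)
{
    assert(margin != NULL);

    if (*margin > 0.0f && *margin <= MARGIN_POSITIVE_CUTOFF_C) {
        *margin = 0.0f;
    }
}
```

For example, a margin of 1.0 becomes 0.0, while -5.0 is left unchanged. Larger positive values are deliberately not touched in this sketch, since the description only mentions the 1°C case.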

[Test Result]

Verified on GC2-ES platform using sensor-util:

root@bmc-oob:~# sensor-util server --force | grep -iE "SOC_(.*?)_C"
MB_SOC_CPU_TEMP_C (0x5) : 78.000 C | (ok)
MB_SOC_THERMAL_MARGIN_C (0x14) : 0.000 C | (ok)
MB_SOC_TJMAX_C (0x15) : 78.000 C | (ok)

The CPU temperature is correctly capped at Tjmax (78°C), and the thermal margin reads 0°C.

Verified sensor logs show no abnormal behavior:

root@bmc-oob:~# log-util all --print | grep -iE "MB_SOC_(.*?)_C"
1 server 2026-02-11 01:09:06 sensord ASSERT: Upper Critical threshold - raised - FRU: 1, num: 0x5 curr_val: 75.00 C, thresh_val: 75.00 C, snr: MB_SOC_CPU_TEMP_C
1 server 2026-02-11 01:09:38 sensord ASSERT: Upper Non Recoverable threshold - raised - FRU: 1, num: 0x5 curr_val: 78.00 C, thresh_val: 78.00 C, snr: MB_SOC_CPU_TEMP_C
1 server 2026-02-11 01:09:40 sensord DEASSERT: Upper Non Recoverable threshold - settled - FRU: 1, num: 0x5 curr_val: 77.00 C, thresh_val: 78.00 C, snr: MB_SOC_CPU_TEMP_C
1 server 2026-02-11 01:09:44 sensord ASSERT: Upper Non Recoverable threshold - raised - FRU: 1, num: 0x5 curr_val: 78.00 C, thresh_val: 78.00 C, snr: MB_SOC_CPU_TEMP_C

@meta-cla (bot) added the CLA Signed label on Feb 11, 2026

meta-codesync Bot commented Feb 11, 2026

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D92955293. (Because this pull request was imported automatically, there will not be any future comments.)

@JimLin-ww force-pushed the Jim/fbgc2-es/fix_mb_soc_cpu_temp_c_and_mb_soc_thermal_margin_c_sensor_readings branch from bebcf4c to a994621 on February 23, 2026 05:20
@facebook-github-bot

@Joseph-Shih-ww has updated the pull request. You must reimport the pull request before landing.

@meta-codesync (bot) closed this in 5a2415e on Feb 25, 2026

meta-codesync Bot commented Feb 25, 2026

This pull request has been merged in 5a2415e.
