something bad happened to Alan's computer on Thursday, February 26th, 2026 #2425

@hawkw

london sled 17 was observed to have gone to A2 on thursday, feb 26th for some reason that we didn't really understand at the time. the computer in question is a Cosmo rev B, running Hubris release 1.0.60.

the ambient temperature in the fridge at the time was observed by Alan to be 87 F. the thermal loop ringbuf indicated that the sled had gone into the Critical control domain a couple of times but had not thermally shut down:

alan@castle:targets$ pfexec humility -a $ARCHIVE --ip fe80::aa40:25ff:fe04:1103%l[2/14811]
tp0 ringbuf thermal                                                                       
humility: connecting to fe80::aa40:25ff:fe04:1103%3                                       
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
  129971 ControlPwm
       8 AutoState(Boot)
       8 AutoState(Running)
       2 AutoState(Critical)
       2 AutoState(FanParty)
       7 PowerModeChanged
       6 FanAdded
       4 SensorReadFailed
       3 MiscReadFailed
       2 CriticalDueTo 
       1 Start
       1 ThermalMode(Auto)
       1 FanControllerInitialized
       1 SetFanWatchdogOk
 NDX LINE      GEN    COUNT PAYLOAD
  15 1352     2975        3 ControlPwm(0x0)
  16 1352     2975        1 ControlPwm(0x1)
  17 1352     2975        1 ControlPwm(0x2)
  18 1352     2975        1 ControlPwm(0x0)
  19 1352     2975        1 ControlPwm(0x3)
  20 1352     2975        3 ControlPwm(0x2)
  21 1352     2975        3 ControlPwm(0x3)
  22 1352     2975        1 ControlPwm(0x2)
  23 1352     2975        1 ControlPwm(0x3)
  24 1352     2975        1 ControlPwm(0x5)
  25 1352     2975        2 ControlPwm(0x3)
  26 1352     2975        2 ControlPwm(0x5)
  27 1352     2975        1 ControlPwm(0x4)
  28 1352     2975        2 ControlPwm(0x5)
  29 1352     2975        1 ControlPwm(0x3)
  30 1352     2975        3 ControlPwm(0x4)
  31 1352     2975        1 ControlPwm(0x5)
   0 1352     2976        2 ControlPwm(0x4)
   1 1352     2976        1 ControlPwm(0x5)
   2 1352     2976        1 ControlPwm(0x6)
   3 1352     2976        1 ControlPwm(0x4)
   4 1352     2976        4 ControlPwm(0x5)
   5 1352     2976        1 ControlPwm(0x8)
   6 1352     2976        1 ControlPwm(0x5)
   7 1352     2976        3 ControlPwm(0x6)
   8 1352     2976        1 ControlPwm(0x7)
   9 1352     2976        1 ControlPwm(0x6)
  10 1352     2976        1 ControlPwm(0x5)
  11 1352     2976        1 ControlPwm(0x7)
  12 1352     2976        1 ControlPwm(0x5)
  13 1352     2976        2 ControlPwm(0x6)
  14 1352     2976        1 ControlPwm(0x7)

Alan also reported having seen that the VDDCR_CPU0_A0 regulator had asserted its PMBus alert at some point, according to the ringbuf:

32  312        2        1 RegulatorStatus { rail: VddcrCpu0, power_good: true, faulted: true }

However, it did not appear to have lost POWER_GOOD, so that probably didn't turn the computer off?
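For anyone digging through the full ringbuf dump, here is a minimal Python sketch of how one might pull out `RegulatorStatus` entries and separate "alert asserted but POWER_GOOD still up" from "POWER_GOOD lost". The entry format is assumed from the single line quoted above; the function names and labels are hypothetical and not part of Humility or Hubris. A "lost POWER_GOOD" result would only be a hint, not proof of what sent the sled to A2.

```python
import re

# ASSUMPTION: every RegulatorStatus entry looks like the one line quoted above.
ENTRY = re.compile(
    r"RegulatorStatus \{ rail: (\w+), power_good: (true|false), faulted: (true|false) \}"
)

def regulator_events(ringbuf_text):
    """Return (rail, power_good, faulted) tuples found in a ringbuf text dump."""
    events = []
    for line in ringbuf_text.splitlines():
        m = ENTRY.search(line)
        if m:
            rail, power_good, faulted = m.groups()
            events.append((rail, power_good == "true", faulted == "true"))
    return events

def classify(power_good, faulted):
    """Hypothetical labels for triage; these names are not from the source."""
    if not power_good:
        return "lost POWER_GOOD"
    if faulted:
        return "alert only"
    return "ok"

# The one entry quoted in this issue.
SAMPLE = "32  312        2        1 RegulatorStatus { rail: VddcrCpu0, power_good: true, faulted: true }"

for rail, power_good, faulted in regulator_events(SAMPLE):
    print(rail, classify(power_good, faulted))
```

Running the same functions over the ringbuf text file mentioned at the end of this issue (if it still exists) would show whether any rail ever logged `power_good: false`.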

On the other hand, looking at humility pmbus, we saw that a bunch of regulators did not have POWER_GOOD asserted. But, that was probably just because the computer was in A2 when Alan ran humility pmbus, so we would not expect the A0 power rails to be enabled. Perhaps one of them had actually lost POWER_GOOD and sent us to A2, but there's no actual way to tell based on this.

We observed that the TPS546B24A regulator on the V0P96_NIC_VDD_A0HP rail had its PMBus fault bit set (and was claiming to be outputting -0.1 amps?). According to the humility pmbus output, the regulator was 46 °C, which is Kinda Hot! However, a different TPS546B24A (the V5_SYS_A2 regulator) was 50 °C, which is Way Hotter. So that's weird.

alan@castle:racklette-commands$ pfexec humility -a $ARCHIVE --ip fe80::aa40:25ff:fe04:1281%london_sw0tp0 pmbus -s  
humility: connecting to fe80::aa40:25ff:fe04:1281%3
DEVICE      RAIL               PG? #FLT       VIN      VOUT      IOUT    TEMP_1
tps546b24a  V3P3_SP_A2           Y    0   11.984V    3.309V    0.146A  46.750°C
tps546b24a  V5_SYS_A2            Y    0   11.953V    4.977V    0.016A  50.250°C                           
tps546b24a  V1P8_SYS_A2          Y    0   11.969V    1.791V   -0.471A  46.750°C                                     
raa229620a  VDDCR_CPU0_A0        N    0   12.200V    0.000V    0.000A  43.000°C
raa229620a  VDDCR_SOC_A0         N    0   12.190V    0.000V    0.000A  45.000°C
raa229620a  VDDCR_CPU1_A0        N    0   12.070V    0.000V    0.000A  45.000°C
raa229620a  VDDIO_SP5_A0         N    0   12.070V    0.001V    0.000A  46.000°C
isl68224    V1P1_SP5_A0          N    0   11.940V    0.002V    0.000A  44.000°C
isl68224    V1P8_SP5_A1          N    0   11.940V    0.001V    0.000A  42.000°C
isl68224    V3P3_SP5_A1          N    0   11.940V    0.002V    0.000A  45.000°C
tps546b24a  V0P96_NIC_VDD_A0HP   N    1   -0.003V    0.012V   -0.163A  46.500°C
bmr491      V12_SYS_A2           Y    0   53.875V   11.968V    0.500A  57.750°C
lm5066i     V54P5_FAN_EAST       Y    0   54.438V   54.563V         -  50.438°C
lm5066i     V54P5_FAN_CENTRAL    Y    0   54.373V   54.455V         -  48.375°C
lm5066i     V54P5_FAN_WEST       Y    0   54.460V   54.411V         -  47.625°C
adm127x     V54P5_IBC_A3       --  error: can't read VOUT_MODE: NoRegister  -- 
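The reasoning above (rails without POWER_GOOD are expected in A2, so only faulted rails or missing A2 rails are interesting) can be sketched as a small Python filter over the summary table. This is only an illustration: it assumes that a rail name ending in `_A2` means the rail should be up in A2, which is a guess from the naming and not something the source confirms, and `parse_rows` and `suspicious` are hypothetical helpers. Only a few rows from the output above are included.

```python
import re

# A few rows copied from the `humility pmbus -s` output above.
SUMMARY = """\
DEVICE      RAIL               PG? #FLT       VIN      VOUT      IOUT    TEMP_1
tps546b24a  V3P3_SP_A2           Y    0   11.984V    3.309V    0.146A  46.750°C
tps546b24a  V5_SYS_A2            Y    0   11.953V    4.977V    0.016A  50.250°C
raa229620a  VDDCR_CPU0_A0        N    0   12.200V    0.000V    0.000A  43.000°C
tps546b24a  V0P96_NIC_VDD_A0HP   N    1   -0.003V    0.012V   -0.163A  46.500°C
bmr491      V12_SYS_A2           Y    0   53.875V   11.968V    0.500A  57.750°C
adm127x     V54P5_IBC_A3       --  error: can't read VOUT_MODE: NoRegister  --
"""

# device, rail, PG flag (Y/N), fault count; the header and error rows do not match.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+([YN])\s+(\d+)\s")

def parse_rows(text):
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header line or a row that reported a read error
        device, rail, pg, faults = m.groups()
        rows.append({"device": device, "rail": rail,
                     "pg": pg == "Y", "faults": int(faults)})
    return rows

def suspicious(rows, state="A2"):
    """Rails worth a second look given the power state at capture time."""
    flagged = []
    for row in rows:
        # ASSUMPTION: a "_A2" suffix means the rail should be enabled in A2.
        expected_on = row["rail"].endswith("_" + state)
        if row["faults"] > 0 or (expected_on and not row["pg"]):
            flagged.append(row["rail"])
    return flagged

print(suspicious(parse_rows(SUMMARY)))
```

On these rows the filter flags only V0P96_NIC_VDD_A0HP, which matches what we noticed by eye; it says nothing about which rail (if any) dropped first.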

We basically don't know why this computer turned off. We don't really like that it turned off. It would be nice to know why this happened that one time and if there was something we could do about it.

there is an archive for this system in /staff/alan/image-for-dublin/extracted/repo/targets/d94ec577f3b19fba0be4174192bfa8820f969d221966e32ded8f7450729cdab4.gimlet_sp-cosmo-b-1.0.60.tar.gz on catacomb, unless it has since been deleted. allegedly, there is (or was, at one time) a text file containing all the ringbufs in /staff/alan/cosmo-off/ringbuf.txt on catacomb, again, provided that this has not also been deleted.
