london sled 17 was observed to have gone to A2 on thursday, feb 26th for some reason that we didn't really understand at the time. the computer in question is a Cosmo rev B, running Hubris release 1.0.60.
the ambient temperature in the fridge at the time was observed by Alan to be 87 °F. the thermal loop ringbuf indicated that the sled had gone into the Critical control domain a couple of times but had not performed a thermal shutdown:
alan@castle:targets$ pfexec humility -a $ARCHIVE --ip fe80::aa40:25ff:fe04:1103%l[2/14811]
tp0 ringbuf thermal
humility: connecting to fe80::aa40:25ff:fe04:1103%3
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
TOTAL VARIANT
129971 ControlPwm
8 AutoState(Boot)
8 AutoState(Running)
2 AutoState(Critical)
2 AutoState(FanParty)
7 PowerModeChanged
6 FanAdded
4 SensorReadFailed
3 MiscReadFailed
2 CriticalDueTo
1 Start
1 ThermalMode(Auto)
1 FanControllerInitialized
1 SetFanWatchdogOk
NDX LINE GEN COUNT PAYLOAD
15 1352 2975 3 ControlPwm(0x0)
16 1352 2975 1 ControlPwm(0x1)
17 1352 2975 1 ControlPwm(0x2)
18 1352 2975 1 ControlPwm(0x0)
19 1352 2975 1 ControlPwm(0x3)
20 1352 2975 3 ControlPwm(0x2)
21 1352 2975 3 ControlPwm(0x3)
22 1352 2975 1 ControlPwm(0x2)
23 1352 2975 1 ControlPwm(0x3)
24 1352 2975 1 ControlPwm(0x5)
25 1352 2975 2 ControlPwm(0x3)
26 1352 2975 2 ControlPwm(0x5)
27 1352 2975 1 ControlPwm(0x4)
28 1352 2975 2 ControlPwm(0x5)
29 1352 2975 1 ControlPwm(0x3)
30 1352 2975 3 ControlPwm(0x4)
31 1352 2975 1 ControlPwm(0x5)
0 1352 2976 2 ControlPwm(0x4)
1 1352 2976 1 ControlPwm(0x5)
2 1352 2976 1 ControlPwm(0x6)
3 1352 2976 1 ControlPwm(0x4)
4 1352 2976 4 ControlPwm(0x5)
5 1352 2976 1 ControlPwm(0x8)
6 1352 2976 1 ControlPwm(0x5)
7 1352 2976 3 ControlPwm(0x6)
8 1352 2976 1 ControlPwm(0x7)
9 1352 2976 1 ControlPwm(0x6)
10 1352 2976 1 ControlPwm(0x5)
11 1352 2976 1 ControlPwm(0x7)
12 1352 2976 1 ControlPwm(0x5)
13 1352 2976 2 ControlPwm(0x6)
14 1352 2976 1 ControlPwm(0x7)
Alan also reported having seen that the VDDCR_CPU0_A0 regulator had asserted its PMBus alert at some point, according to the ringbuf:
32 312 2 1 RegulatorStatus { rail: VddcrCpu0, power_good: true, faulted: true }
However, it did not appear to have lost POWER_GOOD, so that probably didn't turn the computer off?
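Since the visible tail of the thermal ringbuf is almost entirely ControlPwm entries, a small filter over a saved text dump can help pull out the rarer entries, such as the RegulatorStatus line above. This is only a minimal sketch: it assumes the dump uses the same NDX LINE GEN COUNT PAYLOAD layout as the excerpts in this issue, and the function name `interesting_entries` is hypothetical, not part of humility.

```python
import re

# Matches rows in the "NDX LINE GEN COUNT PAYLOAD" layout shown in this issue.
# Assumption: four integer columns followed by a free-form payload.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S.*)$")


def interesting_entries(dump_text, ignore=("ControlPwm",)):
    """Yield (ndx, line, gen, count, payload) for rows whose variant is not ignored."""
    for raw in dump_text.splitlines():
        m = ROW.match(raw)
        if not m:
            continue  # headers, variant totals, humility banner lines, etc.
        ndx, line, gen, count = (int(g) for g in m.groups()[:4])
        payload = m.group(5).strip()
        name = re.match(r"\w+", payload)
        variant = name.group(0) if name else payload
        if variant not in ignore:
            yield ndx, line, gen, count, payload


# Rows copied from the two ringbuf excerpts in this issue.
sample = """\
NDX LINE GEN COUNT PAYLOAD
13 1352 2976 2 ControlPwm(0x6)
14 1352 2976 1 ControlPwm(0x7)
32 312 2 1 RegulatorStatus { rail: VddcrCpu0, power_good: true, faulted: true }
"""

for entry in interesting_entries(sample):
    print(entry)
```

Passing a different `ignore` tuple (for example, an empty one) would show every parsed row instead.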
On the other hand, looking at humility pmbus, we saw that a bunch of regulators did not have POWER_GOOD asserted. But, that was probably just because the computer was in A2 when Alan ran humility pmbus, so we would not expect the A0 power rails to be enabled. Perhaps one of them had actually lost POWER_GOOD and sent us to A2, but there's no actual way to tell based on this.
We observed that the TPS546B24A regulator on the V0P96_NIC_VDD_A0HP rail had its PMBus fault bit set (and was claiming to be outputting -0.1 amps?). According to the humility pmbus output, the regulator was 46 °C, which is Kinda Hot! However, a different TPS546B24A (the V5_SYS_A2 regulator) was 50 °C, which is Way Hotter. So that's weird.
alan@castle:racklette-commands$ pfexec humility -a $ARCHIVE --ip fe80::aa40:25ff:fe04:1281%london_sw0tp0 pmbus -s
humility: connecting to fe80::aa40:25ff:fe04:1281%3
DEVICE RAIL PG? #FLT VIN VOUT IOUT TEMP_1
tps546b24a V3P3_SP_A2 Y 0 11.984V 3.309V 0.146A 46.750°C
tps546b24a V5_SYS_A2 Y 0 11.953V 4.977V 0.016A 50.250°C
tps546b24a V1P8_SYS_A2 Y 0 11.969V 1.791V -0.471A 46.750°C
raa229620a VDDCR_CPU0_A0 N 0 12.200V 0.000V 0.000A 43.000°C
raa229620a VDDCR_SOC_A0 N 0 12.190V 0.000V 0.000A 45.000°C
raa229620a VDDCR_CPU1_A0 N 0 12.070V 0.000V 0.000A 45.000°C
raa229620a VDDIO_SP5_A0 N 0 12.070V 0.001V 0.000A 46.000°C
isl68224 V1P1_SP5_A0 N 0 11.940V 0.002V 0.000A 44.000°C
isl68224 V1P8_SP5_A1 N 0 11.940V 0.001V 0.000A 42.000°C
isl68224 V3P3_SP5_A1 N 0 11.940V 0.002V 0.000A 45.000°C
tps546b24a V0P96_NIC_VDD_A0HP N 1 -0.003V 0.012V -0.163A 46.500°C
bmr491 V12_SYS_A2 Y 0 53.875V 11.968V 0.500A 57.750°C
lm5066i V54P5_FAN_EAST Y 0 54.438V 54.563V - 50.438°C
lm5066i V54P5_FAN_CENTRAL Y 0 54.373V 54.455V - 48.375°C
lm5066i V54P5_FAN_WEST Y 0 54.460V 54.411V - 47.625°C
adm127x V54P5_IBC_A3 -- error: can't read VOUT_MODE: NoRegister --
We basically don't know why this computer turned off. We don't really like that it turned off. It would be nice to know why this happened that one time and if there was something we could do about it.
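For triaging a `pmbus -s` summary like the one above without eyeballing every row, something along these lines might help. It is a rough sketch, not a humility feature: it assumes the whitespace-separated column layout shown in this issue, and the `triage` and `parse_value` helpers and the choice of what to flag (a nonzero #FLT count, a negative IOUT reading, an unreadable device) are illustrative assumptions only.

```python
def parse_value(field):
    """Strip the unit suffix from a field like '11.984V' or '46.750°C'.

    A bare '-' (no reading) becomes None.
    """
    if field == "-":
        return None
    return float(field.rstrip("VA°C"))


def triage(table_text):
    """Return (rail, note) pairs for rows that look worth a second look."""
    findings = []
    for raw in table_text.splitlines():
        fields = raw.split()
        if len(fields) < 3 or fields[0] == "DEVICE":
            continue  # blank lines and the header row
        rail = fields[1]
        if fields[2] == "--":
            # Error rows look like: DEVICE RAIL -- error: ... --
            message = " ".join(fields[3:]).rstrip(" -")
            findings.append((rail, "unreadable: " + message))
            continue
        if len(fields) != 8 or not fields[3].isdigit():
            continue  # not a data row in the assumed layout
        pg, faults = fields[2], int(fields[3])
        _vin, _vout, iout, _temp = (parse_value(f) for f in fields[4:8])
        if faults > 0:
            findings.append((rail, f"{faults} fault(s) reported, PG={pg}"))
        if iout is not None and iout < 0:
            findings.append((rail, f"negative IOUT reading ({iout} A)"))
    return findings


# A few rows copied from the pmbus output in this issue.
sample = """\
DEVICE RAIL PG? #FLT VIN VOUT IOUT TEMP_1
tps546b24a V5_SYS_A2 Y 0 11.953V 4.977V 0.016A 50.250°C
tps546b24a V0P96_NIC_VDD_A0HP N 1 -0.003V 0.012V -0.163A 46.500°C
lm5066i V54P5_FAN_EAST Y 0 54.438V 54.563V - 50.438°C
adm127x V54P5_IBC_A3 -- error: can't read VOUT_MODE: NoRegister --
"""

for rail, note in triage(sample):
    print(f"{rail}: {note}")
```

Rails that are merely off (PG of N with no faults) are deliberately not flagged here, since the A0 rails are expected to be disabled while the sled is in A2.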
there is an archive for this system in /staff/alan/image-for-dublin/extracted/repo/targets/d94ec577f3b19fba0be4174192bfa8820f969d221966e32ded8f7450729cdab4.gimlet_sp-cosmo-b-1.0.60.tar.gz on catacomb, unless it has since been deleted. allegedly, there is (or was, at one time) a text file containing all the ringbufs in /staff/alan/cosmo-off/ringbuf.txt on catacomb, again, provided that this has not also been deleted.