Add shared ereport types crate, more sequencer ereports#2399
Conversation
drv/cosmo-seq-server/src/main.rs
Outdated
| @@ -219,17 +221,15 @@ fn main() -> ! { | |||
|
|
|||
| if let Some(last_cause) = last_cause { | |||
| // Report the failure even if we eventually succeeded. | |||
| try_send_ereport( | |||
| &packrat, | |||
| &mut ereport_buf[..], | |||
| EreportClass::Bmr491MitigationFailure, | |||
| EreportKind::Bmr491MitigationFailure { | |||
| refdes: FixedStr::from_str(dev.component_id()), | |||
| failures, | |||
| last_cause, | |||
| succeeded, | |||
| }, | |||
| ); | |||
| let ereport = ereports::pwr::Bmr491MitigationFailure { | |||
| refdes: FixedStr::<{ REFDES_LEN }>::from_str( | |||
| dev.component_id(), | |||
There was a problem hiding this comment.
It's a bit annoying that these are generic over REFDES_LEN, since it means that the downstream task must specify it both in the max_cbor_len_for! macro and when actually constructing the FixedStrs (as the type is not inferred there, since it could be any FixedStr).
The alternative to this would be making the ereports crate depend on build_i2c, so that it can get the MAX_COMPONENT_ID_LEN value from there. But, that felt a bit gross for the lib that just defines a bunch of types to have to go build all the I2C stuff. Also, there might be some cases where a task needs to override the length in order to use a refdes that isn't coming from the I2C config. For instance, when we wire up the "look up refdes for a sensor ID" stuff in #2364 (which i started on in #2383), a sensor which is not an I2C device may have a longer refdes than any of the I2C devices in the config (unlikely but who knows!). Hmm.
|
Testing out the eliza@jeeves ~ $ pfexec humility hiffy -c FmcDemo.poke32 -a addr=0xc0000208,value=0x2
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
FmcDemo.poke32() => ()
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.get_state
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.get_state() => A0Thermtrip
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.set_state -a state=A2
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.set_state() => Changed
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.set_state -a state=A0
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.set_state() => Changed
eliza@jeeves ~ $ pilot sp -i e1000g0 exec -e 'ereports' fe80::aa40:25ff:fe04:11c5
Feb 26 18:46:53.957 INFO creating SP handle on interface e1000g0, component: faux-mgs
Feb 26 18:46:53.988 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:11c5%2]:11111, interface: e1000g0, socket: control-plane-agent, component: faux-mgs
restart ID: 7b2025d6-239d-9da0-5ddd-d6b13b6fb64b
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 2
ereports:
0x1: {
"baseboard_part_number": String("913-0000023"),
"baseboard_rev": Number(2),
"baseboard_serial_number": String("2J80WF9K\0\0\0"),
"ereport_message_version": Number(0),
"hubris_caboose": Object {
"board": String("cosmo-b"),
"commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
"version": Null,
},
"hubris_task_gen": Number(0),
"hubris_task_name": String("packrat"),
"hubris_uptime_ms": Number(0),
"lost": Null,
}
0x2: {
"baseboard_part_number": String("913-0000023"),
"baseboard_rev": Number(2),
"baseboard_serial_number": String("2J80WF9K\0\0\0"),
"dev_id": String("sp5-host-cpu"),
"ereport_message_version": Number(0),
"hubris_caboose": Object {
"board": String("cosmo-b"),
"commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
"version": Null,
},
"hubris_task_gen": Number(0),
"hubris_task_name": String("cosmo_seq"),
"hubris_uptime_ms": Number(568146),
"k": String("hw.cpu.amd.thermtrip"),
"refdes": String("P0"),
"state": Object {
"cur": String("A0PlusHP"),
"since": Number(314536),
},
"v": Number(0),
}
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.get_state
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.get_state() => A0
eliza@jeeves ~ $ pfexec humility hiffy -c FmcDemo.poke32 -a addr=0xc0000208,value=0x4
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
FmcDemo.poke32() => ()
eliza@jeeves ~ $ pilot sp -i e1000g0 exec -e 'ereports' fe80::aa40:25ff:fe04:11c5
Feb 26 18:47:39.106 INFO creating SP handle on interface e1000g0, component: faux-mgs
Feb 26 18:47:39.137 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:11c5%2]:11111, interface: e1000g0, socket: control-plane-agent, component: faux-mgs
restart ID: 7b2025d6-239d-9da0-5ddd-d6b13b6fb64b
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 3
ereports:
0x1: {
"baseboard_part_number": String("913-0000023"),
"baseboard_rev": Number(2),
"baseboard_serial_number": String("2J80WF9K\0\0\0"),
"ereport_message_version": Number(0),
"hubris_caboose": Object {
"board": String("cosmo-b"),
"commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
"version": Null,
},
"hubris_task_gen": Number(0),
"hubris_task_name": String("packrat"),
"hubris_uptime_ms": Number(0),
"lost": Null,
}
0x2: {
"baseboard_part_number": String("913-0000023"),
"baseboard_rev": Number(2),
"baseboard_serial_number": String("2J80WF9K\0\0\0"),
"dev_id": String("sp5-host-cpu"),
"ereport_message_version": Number(0),
"hubris_caboose": Object {
"board": String("cosmo-b"),
"commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
"version": Null,
},
"hubris_task_gen": Number(0),
"hubris_task_name": String("cosmo_seq"),
"hubris_uptime_ms": Number(568146),
"k": String("hw.cpu.amd.thermtrip"),
"refdes": String("P0"),
"state": Object {
"cur": String("A0PlusHP"),
"since": Number(314536),
},
"v": Number(0),
}
0x3: {
"baseboard_part_number": String("913-0000023"),
"baseboard_rev": Number(2),
"baseboard_serial_number": String("2J80WF9K\0\0\0"),
"dev_id": String("sp5-host-cpu"),
"ereport_message_version": Number(0),
"hubris_caboose": Object {
"board": String("cosmo-b"),
"commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
"version": Null,
},
"hubris_task_gen": Number(0),
"hubris_task_name": String("cosmo_seq"),
"hubris_uptime_ms": Number(714333),
"k": String("hw.cpu.amd.smerr"),
"refdes": String("P0"),
"state": Object {
"cur": String("A0"),
"since": Number(617259),
},
"v": Number(0),
}
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.get_state
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.get_state() => A2
eliza@jeeves ~ $ |
ba992d6 to
6717b05
Compare
This reverts commit 6717b05. I forgot that we actaully do need this.
|
@rmustacc I've requested your review on this mostly just for any feedback on the new ereport messages and their contents, since the new ereports are based on your suggestions from #2242. Hopefully, now that we have a shared library with most of the ereport type definitions, it should be a bit easier to review just the ereport messages, without having to wade through a whole bunch of implementation details moving Rust around. In particular, you can see the complete list of all the ereport classes that each sequencer task will emit here (for Gimlet): hubris/drv/gimlet-seq-server/src/main.rs Lines 1601 to 1606 in 092f494 and here (for Cosmo): hubris/drv/cosmo-seq-server/src/main.rs Lines 1215 to 1221 in 092f494 and you should be able to find most of these Rust types defined in this library, except for a few whose content is specific to the Cosmo or Gimlet sequencer: Please don't feel like you need to review anything beyond how we feel about the ereports, their contents, and the class hierarchy and naming --- thanks in advance for taking a look! |
labbott
left a comment
There was a problem hiding this comment.
Mostly stupid questions but I like the split of having some of the ereport details in a separate crate
| let ereport = ereports::cpu::UnsupportedCpu { | ||
| cpu: &HOST_CPU_REFDES, | ||
| cpu_type: CpuTypeBits { | ||
| coretype: [coretype0, coretype1, coretype2], | ||
| sp5rx: [sp5r1, sp5r2, sp5r3, sp5r4], | ||
| coretype_ok, | ||
| sp5rx_ok, | ||
| }, | ||
| }; |
There was a problem hiding this comment.
I was having a thought of "do we need something to compare against to avoid guess and check about what the bits should be" but based on what we've heard from manufacturing any error here is treated as "back to the inspection bin with ye" so I guess this is fine
There was a problem hiding this comment.
We have exactly what we want the bits to be at +612 for the coretype and only a single combination for the sp5rx bits at +620, so I don't think we are guessing and checking? Or at the least we do know from the AMD docs exactly what to look for.
But what we are missing is basically an ereport for CPU missing. That is, with a lot of these failures I don't think we're going to get to this point as we will ultimately also fail with CpuNotPresent so I think we may want to add a second ereport for that case ala hw.cpu.miss, perhaps something a bit more general for a missing component, or something. I'm not sure that needs to be part of this change.
| self.ereporter.try_send_ereport(&ereports::cpu::Thermtrip { | ||
| cpu: &HOST_CPU_REFDES, | ||
| state: self.ereport_current_state(), | ||
| }); |
There was a problem hiding this comment.
Is the expectation we'll get other ereports independently? If we're tripping the CPU thermtrip I'm also expecting us to get reports from the thermal task too?
There was a problem hiding this comment.
Yes, that's the eventual goal! This one is meant to just tell us that we saw the THERMTRIP pin asserted by the host CPU, which is all we really know about here. The thermal task may or may not be independently aware of a THERMTRIP assertion (we really expect the thermal task to also independently try to power down the system when the host CPU is reaching its thermal limits), so if it did something, we might not get a THERMTRIP ereport and may instead get some other thing. Eventually we will also expect the control plane to try and look at temperatures upon receiving this ereport.
| self.seq.ifr.modify(|h| h.set_a0mapo(true)); | ||
| ringbuf_entry!(Trace::A0MapoInterrupt); | ||
| action = InternalAction::Mapo; | ||
| // Great place for an ereport? |
There was a problem hiding this comment.
Guessing this is still WIP?
There was a problem hiding this comment.
Yeah, I mentioned in the PR description that per a discussion with rm, we probably want to communicate about MAPOs by monitoring the state of POWER_GOOD on various rails, and that it'll be a bit more complex than just sending a simple ereport when this IRQ (and the NIC MAPO one) fires. I didn't do that as part of this PR as I wanted to get both the simple ones and some of the refactoring in without also trying to figure all that stuff out.
| pub state: crate::pwr::CurrentState, | ||
| } | ||
|
|
||
| /// An ereport representing an AMD CPU's `SMERR_L` assertion. |
There was a problem hiding this comment.
Have you been able to test generating this? Last time I looked the AMD docs were very clear you should handle it but not how to trigger it.
There was a problem hiding this comment.
Soooo...thanks to @nathanaelhuffman, the FPGA now has an Interrupt Generation Register (IGR) with bits corresponding to those in the IFR and IER. We can poke bits in that register and cause the FPGA to raise the corresponding interrupt to the SP (setting the IFR bit and, if the IER flag enables the interrupt, raising the corresponding IRQ). This does not cause the host CPU to actually assert that pin, and does not cause whatever else the FPGA does if the actual input pin that triggers the interrupt is asserted.
If you take a look at the testing I did in #2399 (comment), you can see that I was able to poke the FPGA into simulating both the THERMTRIP and SMERR interrupts and that we got the expected ereports out of the sequencer task in both cases. Neither of those really simulate anything else that may or may not have happened if the host CPU actually asserted those pins. In this case, that isn't a big deal, since (AFAIK) those pins are basically just muxed by the FPGA on our behalf and don't cause it to go and do anything else.
There was a problem hiding this comment.
There's a part of me that still wants to see the CPU SMERR but this is really great testing and I'm glad to know it works
There was a problem hiding this comment.
I would also like that! I would also like to know what it even means if the CPU does SMERR...
There was a problem hiding this comment.
I would also like that! I would also like to know what it even means if the CPU does SMERR...
Can you clarify what you're looking to understand? We can do some research on error injection though.
There was a problem hiding this comment.
So, this is mainly just curiosity on my part, but... all I could find about SMERR_L is that the AMD Socket SP5 Processor Motherboard Design Guide (56870 Rev. 1.1) says1 that this signal should be connected to the motherboard component that's responsible for power sequencing, and that the motherboard must reset the CPU upon its assertion, but it doesn't really explain under what conditions it might be asserted besides that it is related to "system management". I'm a bit curious as to whether we know anything beyond that since it might indicate that there's additional information that we might want to report based on an understanding of what it means.
Footnotes
-
I'm paraphrasing what it actually says a bit since it's not public documentation. ↩
| let ereport = ereports::cpu::UnsupportedCpu { | ||
| cpu: &HOST_CPU_REFDES, | ||
| cpu_type: CpuTypeBits { | ||
| coretype: [coretype0, coretype1, coretype2], | ||
| sp5rx: [sp5r1, sp5r2, sp5r3, sp5r4], | ||
| coretype_ok, | ||
| sp5rx_ok, | ||
| }, | ||
| }; |
There was a problem hiding this comment.
We have exactly what we want the bits to be at +612 for the coretype and only a single combination for the sp5rx bits at +620, so I don't think we are guessing and checking? Or at the least we do know from the AMD docs exactly what to look for.
But what we are missing is basically an ereport for CPU missing. That is, with a lot of these failures I don't think we're going to get to this point as we will ultimately also fail with CpuNotPresent so I think we may want to add a second ereport for that case ala hw.cpu.miss, perhaps something a bit more general for a missing component, or something. I'm not sure that needs to be part of this change.
| pub state: crate::pwr::CurrentState, | ||
| } | ||
|
|
||
| /// An ereport representing an AMD CPU's `SMERR_L` assertion. |
There was a problem hiding this comment.
I would also like that! I would also like to know what it even means if the CPU does SMERR...
Can you clarify what you're looking to understand? We can do some research on error injection though.
| if ifr & thermtrip != 0 { | ||
| self.seq.clear_bytes(Addr::IFR, &[thermtrip]).unwrap_lite(); | ||
| self.update_state_internal(PowerState::A0Thermtrip); | ||
| self.ereporter.try_send_ereport(&ereports::cpu::Thermtrip { |
There was a problem hiding this comment.
Nit: this won't use now, because it's not written until update_state_internal is called.
There was a problem hiding this comment.
Maybe that's intentional, though, to reflect the before-thermtrip state?
There was a problem hiding this comment.
yeah, that's correct, that field is meant to indicate what state we were in when the event occurred, not any subsequent state we might be transitioning to. There's a comment on the type explaining that.
Depends on #2399. While working on #2399, I noticed that there was an emerging boilerplate pattern for tasks which report ereports, where I had declared a struct that bundles together a buffer of a size calculated using `microcbor::max_cbor_len_for!` with the `Packrat` API client and some methods for reporting ereports. For example, in the Cosmo sequencer, I wrote this: https://github.com/oxidecomputer/hubris/blob/9e18d0bd9cb69f76534ab0539f9114aeb398ac8c/drv/cosmo-seq-server/src/main.rs#L1237-L1263 And then an identical copy of the same code in the Gimlet sequencer: https://github.com/oxidecomputer/hubris/blob/9e18d0bd9cb69f76534ab0539f9114aeb398ac8c/drv/gimlet-seq-server/src/main.rs#L1620-L1647 I started to wonder whether it would be possible to factor out this boilerplate. While we could easily just have a ```rust pub struct Ereporter<const BUF_LEN: usize> { // ... } ``` and some methods in the `ereports` crate, and let the user provide the buffer size from a separate invocation of the `microcbor` macro, it occurred to me that there was a way to ... feed ... two birds with one...loaf of bread?[^1] and also solve some of the long running issues with the present ereport-recording APIs. In particular, a long-standing thorn in my side is the fact that, while we presently have a way to _calculate_ the required buffer length for a list of ereports, we **don't** currently have anything that actually **stops** you from attempting to to actually use that buffer to encode an ereport that **wasn't** involved in the length calculation. I had hoped that there would be a generic way to do this using const generics, but unfortunately, Rust's const generics remain only sort of half finished, as I discussed in detail in #2246 (comment). So, this means that it's possible to add a new ereport type, forget to add it to the list of ereports used for the length calculation, and end up with an ereport that never fits in the buffer without really noticing. So that's not great. Another less important but similarly bothersome thing is that I have generally tried to add ringbuf entries to record when an ereport is submitted, and more importantly, when it _isn't_, either because the ereport was too big for the encoding buffer due to issues like the one I described above, or due to the `packrat` ereport aggregation buffer being full. Ideally, these ringbuf entries/counters would also record _which_ ereport class was or was not reported, but doing this requires even *more* boilerplate which must be specific to the individual task that records ereports, as the list of ereports depends on the task. So, I've come up with a way to feed all of the aforementioned birds, by introducing a new `declare_ereporter!` macro in the `ereports` crate. This macro is invoked with the name of a struct, the name of a _trait_, and a list of ereports, and generates an ereporter struct, a trait implemented by all of the ereports in the provided list, and a method on the ereporter struct to record an ereport *where the ereport type implements the generated trait*. This way, we can have a way to ensure that only ereports whose lengths were used in the encoding buffer size calculation are reported using that buffer. The macro also generates ringbuf entries/counters for each ereport that's recorded using the generated ereporter type. For more details on how the new thing is used, see [the RustDoc I added for the macro][1]. [1]: https://github.com/oxidecomputer/hubris/blob/d4444b38458ae4d6fb72df7cec8fb38519b7f040/lib/ereports/src/lib.rs#L14-L104 [^1]: Attempting to use non-violent language here... :)
Presently, the compute sled sequencer tasks (
cosmo_seqandgimlet_seq) duplicate a large number of type definitions for ereportmessages. This is due to our present practice of defining a task's
ereports as one big enum of all the ereports it may send, which is
almost always unique to that task. This is unfortunate, as it means that
we must duplicate these messages when we would like them to be
consistent between multiple tasks.
Furthermore, the sequencer tasks currently only emit ereports for a
small subset of events that we care about --- mostly Vcore
regulator PMBus alerts. There are several other events that should
definitely be reported. I had previously attempted to add some new
ereports to the sequencer tasks in #2242, but at the time, neither
@rmustacc nor I were particularly satisfied with the content and class
hierarchy of those ereports, and I was unhappy with the fact that these
ereports added additional duplicated code between the two sequencer
tasks.
This PR is a second attempt to add some of the sequencer ereports I
would like, using the new API added in #2397 to define those ereport
messages in a shared crate, rather than duplicating them between the two
sequencer tasks. In particular, I've done the following:
lib/ereportscrate that contains shared ereportmessage type definitions, using the
#[ereport(...)]attribute frommicrocbor: add
#[ereport(...)]attribute #2397 to define the ereports as separate types.gimlet_seqandcosmo_seqereports to movethe message types for PMBus alerts and BMR491 mitigation failures to
the
ereportscrate, and use them in both sequencer tasks.psc_seqtask to use the shared definition ofPmbusStatusregisters. I didn't change the actual ereport messagesthere, as they're pretty different from the CPU sequencer ones.
power state, so that this can be included in some of the ereports,
as @rmustacc suggested in this comment.
THERMTRIP_LpinCPU_PRESENTpin is not assertedSMERR_Lpin (on Cosmo, as I'm not sure ifthe Gimlet sequencer task can notice this)
coretypeandsp3r{x}/sp5r{x}pins)I did not add ereports for MAPO or NIC MAPO events, as per a
conversation on Matrix with @rmustacc, I think we will want to report
MAPOs by adding better monitoring of
POWER_GOODstatus from variouspower rails. Some of that is described in issues #2372, #2394, and
perhaps also #2398.
This does not actually close #2142, as that describes sequencing
failures due to timeouts waiting for
POWER_GOODon various rails,which I will want to implement subsequently.
Closes #2242.