virtio-nic: in-repo tests and a better device state FSM#1064
virtio-nic: in-repo tests and a better device state FSM#1064
Conversation
VirtIO 1.2 states clearly that a driver MUST NOT clear bits from device_status (except, implicitly, by writing RESET aka 0 to device_status to.. do a device reset). Since this is the only way to clear a bit, bits that *are* set can be set at most once. Our old FSM accepted whatever status a guest happened to provide and stored that; a guest could intentionally or unintentionally clear NEEDS_RESET even, though it would be very much in error at that point. Instead of this very lax treatment of device_status, be more clear that FEATURES_OK can be toggled to set only once (which fixes a bug on its own), and if a guest tries clearing a status bit by any means other than reset, set NEEDS_RESET and demand a reset outright. Along with this, `viona.rs` gets a very simple driver to exercise some usage patterns we've seen in starting up virtio NIC devices. `basic_operation_multiqueue` fails without the device state FSM improvements, a reflection of the bug mentioned above; since FEATURES_OK is set before DEVICE_OK, the final status write including DEVICE_OK would cause a second attempt to apply the negotiated features. We would then try to set up all the virtqueues again, which fails because the virtqueues were just taken out of the RESET state when the driver initialized them, and we come to rest with the NIC in NEEDS_RESET. Linux (and BSDs, and illumos) clearly do not care about this, as such instances proceed to network seemingly without issue. Windows seems to have a periodic task that notices the NEEDS_RESET bit and obliges. It resets the NIC, which fails to ever come up correctly, and dutifully stops networking (#1048). Commenting out parts of prior patches (957f5c4, 1df49a4, 45af0f7) also cause these new tests to fail; they are collectively as necessary as it seemed for correct operation in the face of resets. Finally, one of the new tests verifies that a device in NEEDS_RESET cannot have the bit inappropriately cleared without reset. This tripped over a deadlock if a virtqueue is set with an inappropriate size. That is fixed in this patch as well.
|
I need to test the actual device state FSM change across a gamut of guests still, and I'm not totally sure how I want to handle communicating the test vnic to these tests (hardcoded device name probably is not it), hence the draft. |
| /// Finally, we'll actually create and destroy some vnics so not only do we need | ||
| /// `dladm`, we need a recent enough viona and everything.. |
There was a problem hiding this comment.
well, not yet, but I want to. I was hesitant to automatically deleting vnics (such as in cleanup) because if your host OS is subject to https://www.illumos.org/issues/17853 then deleting the last vnic will panic your machine. to be clear, this is very fixed in illumos and helios, my OS is just on old bits and I haven't updated yet.
| let mut driver = test_ctx.create_driver(); | ||
| driver.modern_device_init(expected_feats); | ||
| let mut driver = test_ctx.create_driver(); | ||
| driver.modern_device_init(expected_feats); | ||
| crate::lifecycle::Lifecycle::reset(test_ctx.dev.as_ref()); | ||
| let mut driver = test_ctx.create_driver(); | ||
| driver.modern_device_init(expected_feats); | ||
| let mut driver = test_ctx.create_driver(); | ||
| driver.modern_device_init(expected_feats); | ||
| let mut driver = test_ctx.create_driver(); | ||
| driver.modern_device_init(expected_feats); |
There was a problem hiding this comment.
aesthetically I just want to say I was surprised and experienced an emotion (not sure if good) when I realized all these lines are the same length (except reset())
| #[tokio::test] | ||
| async fn run_viona_tests() { | ||
| let tests = &[ | ||
| test_device_status_writes, | ||
| basic_operation_modern, | ||
| basic_operation_multiqueue, | ||
| multiqueue_to_singlequeue_to_multiqueue, | ||
| ]; | ||
|
|
||
| for test in tests { | ||
| test(); | ||
| } | ||
| } |
There was a problem hiding this comment.
this is really the bit I don't love: since these tests all use vnic_propo0 for tests, they can't run concurrently. they could run concurrently if they all created their own vnics, which should be legal. so I think I'll adjust this module to expect a host NIC in an environment variable or something and use a shared u8 to create prop_testnic<N> names for each test to be distinct.
| test_device_status_writes, | ||
| basic_operation_modern, | ||
| basic_operation_multiqueue, |
There was a problem hiding this comment.
also this is most obviously missing a basic_operation_legacy, which should be straightforward to add from here.
| // For simplicity, shrink `queue_size` small enough that it fits in | ||
| // one page. There are a few additional items for the various parts | ||
| // of virtquues in addition to just an array of 16-byte elements, so | ||
| // we use the next smaller power of two so we round up to one page | ||
| // in the end. |
There was a problem hiding this comment.
the relevant structures below, which I should note here, are struct virtq_desc, struct virtq_avail, and struct virtq_used. I'm kinda just.. not initializing the structures much at all, because just setting valid addresses and sizes up is Good Enough for my purposes today.
I think we'd just interpret the addresses here as the corresponding struct from the VirtIO spec, but I'm not totally sure and these are already useful tests.
VirtIO 1.2 states clearly that a driver MUST NOT clear bits from device_status (except, implicitly, by writing RESET aka 0 to device_status to.. do a device reset). Since this is the only way to clear a bit, bits that are set can be set at most once. Our old FSM accepted whatever status a guest happened to provide and stored that; a guest could intentionally or unintentionally clear NEEDS_RESET even, though it would be very much in error at that point.
Instead of this very lax treatment of device_status, be more clear that FEATURES_OK can be toggled to set only once (which fixes a bug on its own), and if a guest tries clearing a status bit by any means other than reset, set NEEDS_RESET and demand a reset outright.
Along with this,
viona.rsgets a very simple driver to exercise some usage patterns we've seen in starting up virtio NIC devices.basic_operation_multiqueuefails without the device state FSM improvements, a reflection of the bug mentioned above; since FEATURES_OK is set before DEVICE_OK, the final status write including DEVICE_OK would cause a second attempt to apply the negotiated features. We would then try to set up all the virtqueues again, which fails because the virtqueues were just taken out of the RESET state when the driver initialized them, and we come to rest with the NIC in NEEDS_RESET.Linux (and BSDs, and illumos) clearly do not care about this, as such instances proceed to network seemingly without issue. Windows seems to have a periodic task that notices the NEEDS_RESET bit and obliges. It resets the NIC, which fails to ever come up correctly, and dutifully stops networking (#1048).
Commenting out parts of prior patches (957f5c4, 1df49a4, 45af0f7) also cause these new tests to fail; they are collectively as necessary as it seemed for correct operation in the face of resets.
Finally, one of the new tests verifies that a device in NEEDS_RESET cannot have the bit inappropriately cleared without reset. This tripped over a deadlock if a virtqueue is set with an inappropriate size. That is fixed in this patch as well.