diff --git a/docs/PLUGIN_DOC.md b/docs/PLUGIN_DOC.md index d29bd484..31b7ff74 100644 --- a/docs/PLUGIN_DOC.md +++ b/docs/PLUGIN_DOC.md @@ -4,7 +4,7 @@ | Plugin | Collection | Analyzer Args | Collection Args | DataModel | Collector | Analyzer | | --- | --- | --- | --- | --- | --- | --- | -| AmdSmiPlugin | bad-pages
firmware --json
list --json
metric -g all
partition --json
process --json
ras --cper --folder={folder}
ras --afid --cper-file {cper_file}
static -g all --json
static -g {gpu_id} --json
topology
version --json
xgmi -l
xgmi -m | **Analyzer Args:**
- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).
- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.
- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).
- `expected_driver_version`: Optional[str] — Expected AMD driver version string.
- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).
- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.
- `expected_pldm_version`: Optional[str] — Expected PLDM version string.
- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.
- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.
- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).
- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.
- `devid_ep`: Optional[str] — Expected endpoint device ID.
- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.
- `sku_name`: Optional[str] — Expected SKU name string for GPU.
- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).
- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.
- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**
- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) | +| AmdSmiPlugin | bad-pages
firmware --json
list --json
metric -g all
partition --json
process --json
ras --cper --folder={folder}
ras --afid --cper-file {cper_file}
static -g all --json
static -g {gpu_id} --json
topology
version --json
xgmi -l
xgmi -m | **Analyzer Args:**
- `check_static_data`: bool — If True, run static data checks (e.g. driver version, partition mode).
- `expected_gpu_processes`: Optional[int] — Expected number of GPU processes.
- `expected_max_power`: Optional[int] — Expected maximum power value (e.g. watts).
- `expected_driver_version`: Optional[str] — Expected AMD driver version string.
- `expected_memory_partition_mode`: Optional[str] — Expected memory partition mode (e.g. sp3, dp).
- `expected_compute_partition_mode`: Optional[str] — Expected compute partition mode.
- `expected_firmware_versions`: Optional[dict[str, str]] — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).
- `l0_to_recovery_count_error_threshold`: Optional[int] — L0-to-recovery count above which an error is raised.
- `l0_to_recovery_count_warning_threshold`: Optional[int] — L0-to-recovery count above which a warning is raised.
- `vendorid_ep`: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).
- `vendorid_ep_vf`: Optional[str] — Expected endpoint VF vendor ID.
- `devid_ep`: Optional[str] — Expected endpoint device ID.
- `devid_ep_vf`: Optional[str] — Expected endpoint VF device ID.
- `sku_name`: Optional[str] — Expected SKU name string for GPU.
- `expected_xgmi_speed`: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).
- `analysis_range_start`: Optional[datetime.datetime] — Start of time range for time-windowed analysis.
- `analysis_range_end`: Optional[datetime.datetime] — End of time range for time-windowed analysis. | **Collection Args:**
- `cper_file_path`: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) | | BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'
wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**
- `exp_bios_version`: list[str] — Expected BIOS version(s) to match against collected value (str or list).
- `regex_match`: bool — If True, match exp_bios_version as regex; otherwise exact match. | - | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) | | CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**
- `required_cmdline`: Union[str, List] — Command-line parameters that must be present (e.g. 'pci=bfsort').
- `banned_cmdline`: Union[str, List] — Command-line parameters that must not be present.
- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-OS overrides for required_cmdline and banned_cmdline (keyed by OS identifier).
- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-platform overrides for required_cmdline and banned_cmdline (keyed by platform). | - | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) | | DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor | Measure-Object).Count"
lspci -d {vendorid_ep}: | grep -i 'VGA\|Display\|3D' | wc -l
powershell -Command "(wmic path win32_VideoController get name | findstr AMD | Measure-Object).Count"
lscpu
lshw
lspci -d {vendorid_ep}: | grep -i 'Virtual Function' | wc -l
powershell -Command "(Get-VMHostPartitionableGpu | Measure-Object).Count" | **Analyzer Args:**
- `cpu_count`: Optional[list[int]] — Expected CPU count(s); pass as int or list of ints. Analysis passes if actual is in list.
- `gpu_count`: Optional[list[int]] — Expected GPU count(s); pass as int or list of ints. Analysis passes if actual is in list.
- `vf_count`: Optional[list[int]] — Expected virtual function count(s); pass as int or list of ints. Analysis passes if actual is in list. | - | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) | @@ -1843,7 +1843,7 @@ Check sysctl matches expected sysctl details - **expected_driver_version**: `Optional[str]` — Expected AMD driver version string. - **expected_memory_partition_mode**: `Optional[str]` — Expected memory partition mode (e.g. sp3, dp). - **expected_compute_partition_mode**: `Optional[str]` — Expected compute partition mode. -- **expected_pldm_version**: `Optional[str]` — Expected PLDM version string. +- **expected_firmware_versions**: `Optional[dict[str, str]]` — Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE). - **l0_to_recovery_count_error_threshold**: `Optional[int]` — L0-to-recovery count above which an error is raised. - **l0_to_recovery_count_warning_threshold**: `Optional[int]` — L0-to-recovery count above which a warning is raised. - **vendorid_ep**: `Optional[str]` — Expected endpoint vendor ID (e.g. for PCIe). 
diff --git a/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py b/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py index 815affdc..9a9cea71 100644 --- a/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py +++ b/nodescraper/plugins/inband/amdsmi/amdsmi_analyzer.py @@ -534,18 +534,14 @@ def _format_static_mismatch_payload( "per_gpu": per_gpu_list, } - def check_pldm_version( + def check_firmware_versions( self, amdsmi_fw_data: Optional[list[Fw]], - expected_pldm_version: Optional[str], - ): - """Check expected pldm version - - Args: - amdsmi_fw_data (Optional[list[Fw]]): data model - expected_pldm_version (Optional[str]): expected pldm version - """ - PLDM_STRING = "PLDM_BUNDLE" + expected_firmware_versions: dict[str, str], + ) -> None: + """Check that each GPU reports the expected version for each ``fw_id``.""" + if not expected_firmware_versions: + return if amdsmi_fw_data is None or len(amdsmi_fw_data) == 0: self._log_event( category=EventCategory.PLATFORM, @@ -554,30 +550,37 @@ def check_pldm_version( data={"amdsmi_fw_data": amdsmi_fw_data}, ) return - mismatched_gpus: list[int] = [] - pldm_missing_gpus: list[int] = [] + mismatches: list[dict[str, object]] = [] + missing: list[dict[str, object]] = [] for fw_data in amdsmi_fw_data: gpu = fw_data.gpu if isinstance(fw_data.fw_list, str): - pldm_missing_gpus.append(gpu) + for fw_id in expected_firmware_versions: + missing.append({"gpu": gpu, "fw_id": fw_id}) continue - for fw_info in fw_data.fw_list: - if PLDM_STRING == fw_info.fw_id and expected_pldm_version != fw_info.fw_version: - mismatched_gpus.append(gpu) - if PLDM_STRING == fw_info.fw_id: - break - else: - pldm_missing_gpus.append(gpu) + actual_by_id = {item.fw_id: item.fw_version for item in fw_data.fw_list} + for fw_id, expected_ver in expected_firmware_versions.items(): + if fw_id not in actual_by_id: + missing.append({"gpu": gpu, "fw_id": fw_id}) + elif actual_by_id[fw_id] != expected_ver: + mismatches.append( + { + "gpu": gpu, + "fw_id": fw_id, + 
"expected": expected_ver, + "actual": actual_by_id[fw_id], + } + ) - if mismatched_gpus or pldm_missing_gpus: + if mismatches or missing: self._log_event( category=EventCategory.FW, - description="PLDM Version Mismatch", + description="Firmware version mismatch", priority=EventPriority.ERROR, data={ - "mismatched_gpus": mismatched_gpus, - "pldm_missing_gpus": pldm_missing_gpus, - "expected_pldm_version": expected_pldm_version, + "expected_firmware_versions": expected_firmware_versions, + "mismatches": mismatches, + "missing": missing, }, ) @@ -661,8 +664,9 @@ def check_expected_xgmi_link_speed( if expected_xgmi_speed is None or len(expected_xgmi_speed) == 0: self._log_event( category=EventCategory.IO, - description="Expected XGMI speed not configured, skipping XGMI link speed check", - priority=EventPriority.WARNING, + description=("Expected XGMI link speed not set; skipping XGMI link speed analysis"), + priority=EventPriority.INFO, + console_log=True, ) return @@ -778,8 +782,8 @@ def analyze_data( args.expected_compute_partition_mode, ) - if args.expected_pldm_version: - self.check_pldm_version(data.firmware, args.expected_pldm_version) + if args.expected_firmware_versions: + self.check_firmware_versions(data.firmware, args.expected_firmware_versions) if data.cper_data: self.analyzer_cpers( diff --git a/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py b/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py index 860c0e0f..d4f22c46 100644 --- a/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py +++ b/nodescraper/plugins/inband/amdsmi/amdsmi_collector.py @@ -475,7 +475,8 @@ def _get_amdsmi_data( return None try: - return AmdSmiDataModel( + fw_ids = args.analysis_firmware_ids if args and args.analysis_firmware_ids else None + base = AmdSmiDataModel( version=version, gpu_list=gpu_list, process=processes, @@ -489,7 +490,10 @@ def _get_amdsmi_data( xgmi_link=xgmi_link or [], cper_data=cper_data, cper_afids=cper_afids, + analysis_firmware_ids=fw_ids, + 
analysis_ref=None, ) + return base.model_copy(update={"analysis_ref": base.build_analysis_ref()}) except ValidationError as err: self.logger.warning("Validation err: %s", err) self._log_event( @@ -763,7 +767,9 @@ def get_firmware(self) -> Optional[list[Fw]]: normalized: list[FwListItem] = [] for e in fw_list_raw: if isinstance(e, dict): - fid = e.get("fw_name") + fid = e.get("fw_id") + if fid is None: + fid = e.get("fw_name") ver = e.get("fw_version") normalized.append( FwListItem( diff --git a/nodescraper/plugins/inband/amdsmi/amdsmidata.py b/nodescraper/plugins/inband/amdsmi/amdsmidata.py index 04ff545f..940047ba 100644 --- a/nodescraper/plugins/inband/amdsmi/amdsmidata.py +++ b/nodescraper/plugins/inband/amdsmi/amdsmidata.py @@ -927,6 +927,24 @@ class Topo(BaseModel): links: list[TopoLink] +class AmdSmiAnalysisRef(BaseModel): + """Collector-filled summary for reference config""" + + gpu_processes_max: Optional[int] = None + max_power_w: Optional[int] = None + amdgpu_drv_version: Optional[str] = None + mem_part_mode: Optional[str] = None + compute_part_mode: Optional[str] = None + firmware_versions: Optional[dict[str, str]] = None + pldm_version: Optional[str] = None + ep_vendor_id: Optional[str] = None + ep_subvendor_id: Optional[str] = None + ep_device_id: Optional[str] = None + ep_subsystem_id: Optional[str] = None + ep_market_name: Optional[str] = None + xgmi_rates: Optional[list[float]] = None + + class AmdSmiDataModel(DataModel): """Data model for amd-smi data. 
@@ -957,6 +975,13 @@ class AmdSmiDataModel(DataModel): cper_data: Optional[list[FileModel]] = Field(default_factory=list) cper_afids: dict[str, int] = Field(default_factory=dict) + analysis_firmware_ids: Optional[list[str]] = Field( + default=None, + description="fw_id values used when snapshotting firmware_versions into analysis_ref.", + ) + + analysis_ref: Optional[AmdSmiAnalysisRef] = None + def get_list(self, gpu: int) -> Optional[AmdSmiListItem]: """Get the gpu list item for the given gpu id.""" if self.gpu_list is None: @@ -1001,3 +1026,154 @@ def get_bad_pages(self, gpu: int) -> Optional[BadPages]: if item.gpu == gpu: return item return None + + def _sorted_static_gpus(self) -> list[AmdSmiStatic]: + return sorted(self.static or [], key=lambda s: s.gpu) + + @property + def ref_gpu_processes_max(self) -> Optional[int]: + """Max process-list length across GPUs (for analysis reference snapshot).""" + proc = self.process + if not proc: + return None + counts: list[int] = [] + for p in proc: + if not p.process_list: + continue + if isinstance(p.process_list[0].process_info, str): + continue + counts.append(len(p.process_list)) + return max(counts) if counts else None + + @property + def ref_max_power_w(self) -> Optional[int]: + """First available max power limit (W) from static data, lowest GPU index first.""" + for gpu in self._sorted_static_gpus(): + lim = gpu.limit + if lim is None or lim.max_power is None or lim.max_power.value is None: + continue + try: + return int(float(lim.max_power.value)) + except (TypeError, ValueError): + continue + return None + + @property + def ref_amdgpu_drv_version(self) -> Optional[str]: + """Driver version from the lowest-index GPU with static data.""" + for gpu in self._sorted_static_gpus(): + if gpu.driver and gpu.driver.version: + return gpu.driver.version + return None + + @property + def ref_mem_part_mode(self) -> Optional[str]: + if self.partition is None: + return None + mps = self.partition.memory_partition + if not mps: 
+ return None + return sorted(mps, key=lambda p: p.gpu_id)[0].partition_type + + @property + def ref_compute_part_mode(self) -> Optional[str]: + if self.partition is None: + return None + cps = self.partition.compute_partition + if not cps: + return None + return sorted(cps, key=lambda p: p.gpu_id)[0].partition_type + + @property + def ref_firmware_versions(self) -> Optional[dict[str, str]]: + ids = ( + list(self.analysis_firmware_ids) + if self.analysis_firmware_ids is not None + else list(_DEFAULT_ANALYSIS_FW_IDS) + ) + return _first_observed_fw_versions(self.firmware, ids) or None + + @property + def ref_pldm_version(self) -> Optional[str]: + fw = self.ref_firmware_versions + return fw.get("PLDM_BUNDLE") if fw else None + + @property + def ref_ep_vendor_id(self) -> Optional[str]: + ss = self._sorted_static_gpus() + return ss[0].asic.vendor_id if ss else None + + @property + def ref_ep_subvendor_id(self) -> Optional[str]: + ss = self._sorted_static_gpus() + return ss[0].asic.subvendor_id if ss else None + + @property + def ref_ep_device_id(self) -> Optional[str]: + ss = self._sorted_static_gpus() + return ss[0].asic.device_id if ss else None + + @property + def ref_ep_subsystem_id(self) -> Optional[str]: + ss = self._sorted_static_gpus() + return ss[0].asic.subsystem_id if ss else None + + @property + def ref_ep_market_name(self) -> Optional[str]: + ss = self._sorted_static_gpus() + return ss[0].asic.market_name if ss else None + + @property + def ref_xgmi_rates(self) -> Optional[list[float]]: + xm = self.xgmi_metric + if not xm: + return None + rates: set[float] = set() + for m in xm: + br = m.link_metrics.bit_rate + if br is None or br.value is None: + continue + try: + rates.add(float(br.value)) + except (TypeError, ValueError): + continue + return sorted(rates) if rates else None + + def build_analysis_ref(self) -> AmdSmiAnalysisRef: + """Build ``AmdSmiAnalysisRef`` from current field values""" + return AmdSmiAnalysisRef( + 
gpu_processes_max=self.ref_gpu_processes_max, + max_power_w=self.ref_max_power_w, + amdgpu_drv_version=self.ref_amdgpu_drv_version, + mem_part_mode=self.ref_mem_part_mode, + compute_part_mode=self.ref_compute_part_mode, + firmware_versions=self.ref_firmware_versions, + pldm_version=self.ref_pldm_version, + ep_vendor_id=self.ref_ep_vendor_id, + ep_subvendor_id=self.ref_ep_subvendor_id, + ep_device_id=self.ref_ep_device_id, + ep_subsystem_id=self.ref_ep_subsystem_id, + ep_market_name=self.ref_ep_market_name, + xgmi_rates=self.ref_xgmi_rates, + ) + + +_DEFAULT_ANALYSIS_FW_IDS: tuple[str, ...] = ("PLDM_BUNDLE",) + + +def _first_observed_fw_versions(firmware: Optional[list[Fw]], fw_ids: list[str]) -> dict[str, str]: + """For each ``fw_id``, take the version from the lowest GPU index that reports it.""" + out: dict[str, str] = {} + if not firmware or not fw_ids: + return out + need = set(fw_ids) + for fw in sorted(firmware, key=lambda f: f.gpu): + if isinstance(fw.fw_list, str): + continue + for item in fw.fw_list: + if item.fw_id in need and item.fw_id not in out: + out[item.fw_id] = item.fw_version + need.discard(item.fw_id) + if not need: + break + return out diff --git a/nodescraper/plugins/inband/amdsmi/analyzer_args.py b/nodescraper/plugins/inband/amdsmi/analyzer_args.py index de9d0312..3a5d2cfb 100644 --- a/nodescraper/plugins/inband/amdsmi/analyzer_args.py +++ b/nodescraper/plugins/inband/amdsmi/analyzer_args.py @@ -29,6 +29,7 @@ from pydantic import Field from nodescraper.models import AnalyzerArgs +from nodescraper.plugins.inband.amdsmi.amdsmidata import AmdSmiDataModel class AmdSmiAnalyzerArgs(AnalyzerArgs): @@ -51,8 +52,9 @@ class AmdSmiAnalyzerArgs(AnalyzerArgs): expected_compute_partition_mode: Optional[str] = Field( default=None, description="Expected compute partition mode." ) - expected_pldm_version: Optional[str] = Field( - default=None, description="Expected PLDM version string." 
+ expected_firmware_versions: Optional[dict[str, str]] = Field( + default=None, + description="Expected firmware versions keyed by amd-smi fw_id (e.g. PLDM_BUNDLE).", ) l0_to_recovery_count_error_threshold: Optional[int] = Field( default=3, @@ -80,3 +82,36 @@ class AmdSmiAnalyzerArgs(AnalyzerArgs): analysis_range_end: Optional[datetime] = Field( default=None, description="End of time range for time-windowed analysis." ) + + @classmethod + def build_from_model(cls, datamodel: AmdSmiDataModel) -> "AmdSmiAnalyzerArgs": + """Build analyzer args from data model (reference snapshot set by collector). + + Args: + datamodel (AmdSmiDataModel): data model for plugin + + Returns: + AmdSmiAnalyzerArgs: instance of analyzer args class + """ + r = datamodel.analysis_ref + if r is None: + return cls() + fw_expect: dict[str, str] = {} + if r.firmware_versions: + fw_expect.update(r.firmware_versions) + if r.pldm_version is not None and "PLDM_BUNDLE" not in fw_expect: + fw_expect["PLDM_BUNDLE"] = r.pldm_version + return cls( + expected_gpu_processes=r.gpu_processes_max, + expected_max_power=r.max_power_w, + expected_driver_version=r.amdgpu_drv_version, + expected_memory_partition_mode=r.mem_part_mode, + expected_compute_partition_mode=r.compute_part_mode, + expected_firmware_versions=dict(fw_expect) if fw_expect else None, + vendorid_ep=r.ep_vendor_id, + vendorid_ep_vf=r.ep_subvendor_id, + devid_ep=r.ep_device_id, + devid_ep_vf=r.ep_subsystem_id, + sku_name=r.ep_market_name, + expected_xgmi_speed=r.xgmi_rates, + ) diff --git a/nodescraper/plugins/inband/amdsmi/collector_args.py b/nodescraper/plugins/inband/amdsmi/collector_args.py index 1a12d8d5..4fedc39b 100644 --- a/nodescraper/plugins/inband/amdsmi/collector_args.py +++ b/nodescraper/plugins/inband/amdsmi/collector_args.py @@ -33,6 +33,10 @@ class AmdSmiCollectorArgs(CollectorArgs): """Collector arguments for AmdSmiPlugin""" + analysis_firmware_ids: Optional[list[str]] = Field( + default=None, + description=("amd-smi fw_id values 
to record in analysis_ref.firmware_versions "), + ) cper_file_path: Optional[str] = Field( default=None, description="Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file).", diff --git a/test/unit/plugin/test_amdsmi_analyzer.py b/test/unit/plugin/test_amdsmi_analyzer.py index 6bc40330..f3966c97 100644 --- a/test/unit/plugin/test_amdsmi_analyzer.py +++ b/test/unit/plugin/test_amdsmi_analyzer.py @@ -461,8 +461,8 @@ def test_check_static_data_mismatch(mock_analyzer): assert len(analyzer.result.events) >= 1 -def test_check_pldm_version_success(mock_analyzer): - """Test check_pldm_version passes when PLDM version matches.""" +def test_check_firmware_versions_pldm_success(mock_analyzer): + """Test check_firmware_versions passes when PLDM version matches.""" analyzer = mock_analyzer firmware_data = [ @@ -474,13 +474,13 @@ def test_check_pldm_version_success(mock_analyzer): ), ] - analyzer.check_pldm_version(firmware_data, "1.2.3") + analyzer.check_firmware_versions(firmware_data, {"PLDM_BUNDLE": "1.2.3"}) assert len(analyzer.result.events) == 0 -def test_check_pldm_version_mismatch(mock_analyzer): - """Test check_pldm_version logs error when PLDM version doesn't match.""" +def test_check_firmware_versions_pldm_mismatch(mock_analyzer): + """Test check_firmware_versions logs error when PLDM version doesn't match.""" analyzer = mock_analyzer firmware_data = [ @@ -492,14 +492,14 @@ def test_check_pldm_version_mismatch(mock_analyzer): ), ] - analyzer.check_pldm_version(firmware_data, "1.2.4") + analyzer.check_firmware_versions(firmware_data, {"PLDM_BUNDLE": "1.2.4"}) assert len(analyzer.result.events) == 1 assert analyzer.result.events[0].priority == EventPriority.ERROR -def test_check_pldm_version_missing(mock_analyzer): - """Test check_pldm_version handles missing PLDM firmware.""" +def test_check_firmware_versions_pldm_missing(mock_analyzer): + """Test check_firmware_versions handles missing PLDM firmware.""" analyzer = mock_analyzer firmware_data = 
[ @@ -511,12 +511,51 @@ def test_check_pldm_version_missing(mock_analyzer): ), ] - analyzer.check_pldm_version(firmware_data, "1.2.3") + analyzer.check_firmware_versions(firmware_data, {"PLDM_BUNDLE": "1.2.3"}) assert len(analyzer.result.events) == 1 assert analyzer.result.events[0].priority == EventPriority.ERROR +def test_check_firmware_versions_multiple_fw_ids_success(mock_analyzer): + """Test check_firmware_versions passes when all fw_ids match on each GPU.""" + analyzer = mock_analyzer + firmware_data = [ + Fw( + gpu=0, + fw_list=[ + FwListItem(fw_id="PLDM_BUNDLE", fw_version="1.2.3"), + FwListItem(fw_id="OTHER_FW", fw_version="9.0"), + ], + ), + ] + analyzer.check_firmware_versions( + firmware_data, + {"PLDM_BUNDLE": "1.2.3", "OTHER_FW": "9.0"}, + ) + assert len(analyzer.result.events) == 0 + + +def test_check_firmware_versions_one_id_mismatch(mock_analyzer): + """Test check_firmware_versions errors when any fw_id version differs.""" + analyzer = mock_analyzer + firmware_data = [ + Fw( + gpu=0, + fw_list=[ + FwListItem(fw_id="PLDM_BUNDLE", fw_version="1.2.3"), + FwListItem(fw_id="OTHER_FW", fw_version="8.0"), + ], + ), + ] + analyzer.check_firmware_versions( + firmware_data, + {"PLDM_BUNDLE": "1.2.3", "OTHER_FW": "9.0"}, + ) + assert len(analyzer.result.events) == 1 + assert analyzer.result.events[0].priority == EventPriority.ERROR + + def test_check_expected_memory_partition_mode_success(mock_analyzer): """Test check_expected_memory_partition_mode passes when partition modes match.""" analyzer = mock_analyzer
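
Review note: the comparison performed by the new `check_firmware_versions` can be sketched outside the plugin as shown below. This is a minimal, standalone approximation for reasoning about the new tests, not the analyzer's actual code. `FwItem`, `GpuFw`, and `compare_firmware` are hypothetical stand-ins for `FwListItem`, `Fw`, and the method body, and the `"N/A"` string is only an assumed example of a non-list `fw_list` value. Event logging is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass
class FwItem:
    """Hypothetical stand-in for FwListItem."""

    fw_id: str
    fw_version: str


@dataclass
class GpuFw:
    """Hypothetical stand-in for Fw; fw_list may be a plain string when parsing failed."""

    gpu: int
    fw_list: Union[list[FwItem], str]


def compare_firmware(
    firmware: list[GpuFw], expected: dict[str, str]
) -> tuple[list[dict[str, object]], list[dict[str, object]]]:
    """Return (mismatches, missing) for every expected fw_id on every GPU."""
    mismatches: list[dict[str, object]] = []
    missing: list[dict[str, object]] = []
    for fw in firmware:
        # A string fw_list carries no per-firmware entries, so every
        # expected fw_id counts as missing for that GPU.
        if isinstance(fw.fw_list, str):
            missing.extend({"gpu": fw.gpu, "fw_id": fw_id} for fw_id in expected)
            continue
        actual_by_id = {item.fw_id: item.fw_version for item in fw.fw_list}
        for fw_id, expected_ver in expected.items():
            if fw_id not in actual_by_id:
                missing.append({"gpu": fw.gpu, "fw_id": fw_id})
            elif actual_by_id[fw_id] != expected_ver:
                mismatches.append(
                    {
                        "gpu": fw.gpu,
                        "fw_id": fw_id,
                        "expected": expected_ver,
                        "actual": actual_by_id[fw_id],
                    }
                )
    return mismatches, missing


# Usage: GPU 0 reports a stale OTHER_FW, GPU 1 reports no parsable list.
fleet = [
    GpuFw(gpu=0, fw_list=[FwItem("PLDM_BUNDLE", "1.2.3"), FwItem("OTHER_FW", "8.0")]),
    GpuFw(gpu=1, fw_list="N/A"),
]
mismatches, missing = compare_firmware(fleet, {"PLDM_BUNDLE": "1.2.3", "OTHER_FW": "9.0"})
```

In the diff, mismatches and missing entries are reported together in a single ERROR event. That is why `test_check_firmware_versions_one_id_mismatch` expects exactly one event even though two `fw_id` values are checked.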