[Test] Add automated root cause analysis for cluster creation failure. #7247

Merged
gmarciani merged 2 commits into aws:develop from gmarciani:wip/mgiacomo/3150/integ-diagnosis-0225-1
Mar 2, 2026

Conversation

@gmarciani
Contributor

@gmarciani gmarciani commented Feb 25, 2026

Description of changes

Add automated root cause analysis for cluster creation failure.
If cluster creation fails, the test now prints the last 10 errors from the most critical logs and attempts to identify a root cause from them.
For the time being, the only root cause extracted is InsufficientInstanceCapacity (ICE).
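
For readers who want a feel for the approach, here is a minimal sketch of the two steps described above: collecting the last errors from a log and matching them against a known root cause. It is not the code from this PR. The names `last_errors`, `find_root_causes`, `ERROR_PATTERN`, and `ICE_PATTERN` are hypothetical, and the rule that an "error" is any line containing `ERROR` is an assumption made for illustration.

```python
import re
from collections import deque

# Assumption: a line counts as an error if it contains the token "ERROR".
ERROR_PATTERN = re.compile(r"ERROR")

# Assumption: ICE is recognized from the EC2 error text seen in clustermgtd logs,
# capturing the instance type that precedes the word "capacity".
ICE_PATTERN = re.compile(r"InsufficientInstanceCapacity.*?sufficient (\S+) capacity")


def last_errors(log_text, limit=10):
    """Return the last `limit` lines of `log_text` that match ERROR_PATTERN."""
    matches = deque(maxlen=limit)  # older matches are dropped automatically
    for line in log_text.splitlines():
        if ERROR_PATTERN.search(line):
            matches.append(line)
    return list(matches)


def find_root_causes(error_lines):
    """Return the distinct root causes recognized in `error_lines`, in order."""
    causes = []
    for line in error_lines:
        match = ICE_PATTERN.search(line)
        if match:
            cause = f"InsufficientInstanceCapacity on {match.group(1)}"
            if cause not in causes:
                causes.append(cause)
    return causes


# Usage with an abbreviated version of a clustermgtd line.
log = (
    "2026-02-26 20:25:10,001 - INFO - Launching instances\n"
    "2026-02-26 20:25:14,163 - ERROR - Failed RunInstances request: "
    "InsufficientInstanceCapacity - We currently do not have sufficient "
    "p4d.24xlarge capacity in the Availability Zone you requested (us-east-1a).\n"
)
print(find_root_causes(last_errors(log)))
# ['InsufficientInstanceCapacity on p4d.24xlarge']
```

Keeping the root cause patterns separate from the error extraction would make it straightforward to recognize further root causes later by adding patterns.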

Here is the sample diagnosis report printed by a test failure:

=== ROOT CAUSE ANALYSIS ===

--- SUMMARY ---
['InsufficientInstanceCapacity on p4d.24xlarge']

--- /var/log/chef-client.log ---
[2026-02-26T19:59:41+00:00] ERROR: shard_seed: Failed to get dmi property serial_number: is dmidecode installed?
... other 10 errors ...

--- /var/log/parallelcluster/clustermgtd ---
2026-02-26 20:25:14,163 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Failed RunInstances request (3fd67297-4cbc-4d20-94b3-370e060495a6): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d.
... other 10 errors ...

--- /var/log/cfn-init.log ---
DEBUG:root:CustomActionsConfig(stack_name='integ-tests-u8lflx6fp7yct6bp-integ-mg', cluster_name='integ-tests-u8lflx6fp7yct6bp-integ-mg', region_name='us-east-1', node_type='HeadNode', queue_name='', pool_name='', resource_name='', instance_id='i-0c94f0e998cdd9f63', instance_type='c5.18xlarge', ip_address='192.168.3.21', hostname='ip-192-168-3-21.ec2.internal', availability_zone='us-east-1a', scheduler='slurm', event_name='OnNodeStart', legacy_event=<LegacyEventName.ON_NODE_START: 'preinstall'>, can_execute=False, dry_run=False, script_sequence=[], script_sequences_per_event={<LegacyEventName.ON_NODE_START: 'preinstall'>: [], <LegacyEventName.ON_NODE_CONFIGURED: 'postinstall'>: [], 
... other 10 errors ...

--- /var/log/cloud-init-output.log ---
2026-02-26 19:59:05,767 - util.py[WARNING]: Running module selinux (<module 'cloudinit.config.cc_selinux' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_selinux.py'>) failed
... other 10 errors ...

--- /var/log/parallelcluster/slurmctld.log ---
No error found
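
A report with the layout shown above could be assembled along these lines. This is a hedged sketch rather than the PR's implementation: `format_report` and its parameters are hypothetical names, and the input is assumed to be a list of root causes plus a mapping from log path to the error lines already collected for that log.

```python
def format_report(summary, errors_by_log):
    """Render a diagnosis report from a summary list and per-log error lines."""
    lines = ["=== ROOT CAUSE ANALYSIS ===", "", "--- SUMMARY ---", str(summary), ""]
    for path, errors in errors_by_log.items():
        lines.append(f"--- {path} ---")
        # An empty list means nothing matched in that log.
        lines.extend(errors if errors else ["No error found"])
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


print(
    format_report(
        ["InsufficientInstanceCapacity on p4d.24xlarge"],
        {
            "/var/log/parallelcluster/clustermgtd": ["... ERROR - Failed RunInstances request ..."],
            "/var/log/parallelcluster/slurmctld.log": [],
        },
    )
)
```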

Tests

Ran the integ test test_efa, which is known to hit ICE. The test report shows the expected diagnosis (see above).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani added the skip-changelog-update (disables the check that enforces changelog updates in PRs), 3.x, and Test labels Feb 25, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/integ-diagnosis-0225-1 branch 4 times, most recently from e92b237 to b42c566 February 26, 2026 17:19
@gmarciani gmarciani marked this pull request as ready for review February 26, 2026 19:45
@gmarciani gmarciani requested review from a team as code owners February 26, 2026 19:45
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/integ-diagnosis-0225-1 branch 3 times, most recently from 7258930 to c91d7ff February 26, 2026 21:46
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/integ-diagnosis-0225-1 branch from c91d7ff to 5a53b4e February 26, 2026 21:47
@gmarciani gmarciani changed the title from "[DRAFT] Add diagnosis to integ test failures." to "[Test] Add automated root cause analysis for cluster creation failure." Feb 27, 2026
@gmarciani gmarciani merged commit 27cdde4 into aws:develop Mar 2, 2026
24 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3150/integ-diagnosis-0225-1 branch March 2, 2026 15:01