[Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature #7246

Open
himani2411 wants to merge 17 commits into aws:develop from himani2411:xuanqi--expedited-requeue-mode-integ

@himani2411 himani2411 commented Feb 24, 2026

Description of changes

  • cherry-pick https://github.com/aws/aws-parallelcluster/pull/7211/changes
    • Add E2E test for Slurm 25.11 expedited requeue (--requeue=expedite). The test simulates ICE on a compute node, submits a mix of expedited and normal jobs targeting that node, recovers from ICE, and verifies that the requeued expedited job runs first by comparing start time epochs from job output files.
    • Helper functions _submit_jobs_and_simulate_ice and _recover_from_ice_and_wait_for_jobs are extracted to reduce duplication in the ICE simulation cycle.
  • Revert the changes made for the Slurm bug in expedited-requeue mode.
  • Add the --exclusive flag to the submitted jobs so that the two jobs are not assigned to the node or started at the same time once the ICE failure is no longer being simulated.
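
For reviewers, the ordering check described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: it assumes each job writes a line of the form `START_TIME=<epoch>` to its output file, and the marker name and both helper functions are hypothetical.

```python
import re

# Hypothetical output format: each job prints a line such as "START_TIME=1767312000".
START_TIME_PATTERN = re.compile(r"^START_TIME=(\d+)$", re.MULTILINE)


def parse_start_epochs(output_text):
    """Return every start epoch recorded in a job output file, in order of appearance."""
    return [int(value) for value in START_TIME_PATTERN.findall(output_text)]


def expedited_ran_first(expedited_output, normal_output):
    """Return True if the expedited job started strictly before the normal job.

    The last recorded start of the expedited job is used, so that a start
    logged before a requeue (if any) does not count.
    """
    expedited = parse_start_epochs(expedited_output)
    normal = parse_start_epochs(normal_output)
    if not expedited or not normal:
        raise ValueError("missing START_TIME line in job output")
    return expedited[-1] < normal[0]
```

Comparing epochs with a strict `<` is only meaningful if the two jobs cannot start in the same second, which is one reason running them with --exclusive on the same node helps.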

Cookbook Changes -> aws/aws-parallelcluster-cookbook#3117

Tests

  • Integ test was successful

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@himani2411 himani2411 requested review from a team as code owners February 24, 2026 18:28
@himani2411 himani2411 added the skip-changelog-update (Disables the check that enforces changelog updates in PRs) and 3.x labels Feb 24, 2026
@himani2411 himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch 2 times, most recently from 4b1752a to b77ee9b Compare February 24, 2026 20:47
@himani2411 himani2411 changed the title from "Xuanqi expedited requeue mode integ" to "[Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature" Feb 24, 2026
@himani2411 himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch 9 times, most recently from 82495eb to daf7428 Compare March 2, 2026 17:49
hehe7318 and others added 14 commits March 2, 2026 18:56
Extend test_fast_capacity_failover to validate the new --requeue=expedite
option introduced in Slurm 25.11.2. This feature allows batch jobs to
automatically requeue on node failure with highest priority.
- Change job commands from simple 'sleep 30' to output hostname and
  timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from
  immediately running on another CR before job1 requeues
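
A rough sketch of how such a submission command could be assembled, for illustration only. The helper name, the `START_TIME=`/`HOSTNAME=` markers, and the feature name passed to --prefer are assumptions; the sbatch options themselves (`-N`, `--requeue=expedite`, `--exclusive`, `--prefer`, `--wrap`) are the ones named in this PR or standard sbatch flags.

```python
import shlex

# Job body that records where and when the job ran (marker names are hypothetical).
JOB_BODY = 'echo "HOSTNAME=$(hostname)"; echo "START_TIME=$(date +%s)"; sleep 30'


def build_sbatch_command(wrapped_command, nodes=1, expedite=False, exclusive=False, prefer=None):
    """Build a shell-safe sbatch command line for a wrapped job."""
    args = ["sbatch", "-N", str(nodes)]
    if expedite:
        # Requeue with highest priority on node failure (Slurm 25.11 feature under test).
        args.append("--requeue=expedite")
    if exclusive:
        # Keep two jobs from sharing the node and starting together.
        args.append("--exclusive")
    if prefer:
        # Soft preference for nodes with this feature, e.g. a compute resource name.
        args.append(f"--prefer={prefer}")
    args.append(f"--wrap={wrapped_command}")
    return shlex.join(args)
```

For example, `build_sbatch_command(JOB_BODY, nodes=2, prefer="cr1")` would correspond to the job2 described above, with `cr1` standing in for the real compute resource feature.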
…er and use recoverable ICE simulation

Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone
test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet).

Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json:
write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones
(t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited
requeue job starts before a normal job submitted earlier.
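
The toggle between ICE and recovery can be sketched as below. This is an illustration under assumptions, not the test's implementation: the nested queue / compute resource / `LaunchTemplateConfigs` layout, the names `queue1` and `cr1`, and the helper `set_ice` are all hypothetical; only the 'ICE-' prefix and the two instance types come from the commit message.

```python
import copy

ICE_PREFIX = "ICE-"

# Assumed shape of create_fleet_overrides.json; queue and CR names are placeholders.
EXAMPLE_OVERRIDES = {
    "queue1": {
        "cr1": {
            "LaunchTemplateConfigs": [
                {"Overrides": [{"InstanceType": "t3.medium"}, {"InstanceType": "c5.xlarge"}]}
            ]
        }
    }
}


def set_ice(overrides, simulate):
    """Return a copy with the ICE- prefix added to (or stripped from) every InstanceType."""
    result = copy.deepcopy(overrides)

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "InstanceType" and isinstance(value, str):
                    # Normalize first so that repeated calls do not stack prefixes.
                    bare = value[len(ICE_PREFIX):] if value.startswith(ICE_PREFIX) else value
                    node[key] = ICE_PREFIX + bare if simulate else bare
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(result)
    return result
```

Calling `set_ice(EXAMPLE_OVERRIDES, True)` yields invalid instance types that should make the fleet request fail, and `set_ice(..., False)` restores the real ones; the result would then be serialized and written to the overrides file on the head node.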
…ns 1st on the node we are targeting

* adding a wait fix of 5 secs
@himani2411 himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch from daf7428 to 04a558d Compare March 2, 2026 23:56
@himani2411 himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch from 04a558d to 0fdd987 Compare March 3, 2026 00:25
