
perf: split Databricks E2E into 3 parallel CPU partitions, tune timeouts#2527

Open
BrendanWalsh wants to merge 1 commit into master from brwals/split-databricks-e2e

Conversation

Collaborator

BrendanWalsh commented Mar 27, 2026

Summary

Splits the single databricks-cpu E2E job (~43 minutes) into 3 parallel partitions (~15-20 minutes each), increases GPU parallelism, and tunes Databricks job timeouts for reliability.

Changes

  • DatabricksCPUTests.scala: Split DatabricksCPUTests into DatabricksCPUTests1, DatabricksCPUTests2, DatabricksCPUTests3 using partition indices
  • DatabricksGPUTests.scala: Increase GPU cluster workers from 2→3 and concurrency from 1→3 (GPU notebooks are independent and can run in parallel), add 30-minute GPU-specific timeout override
  • DatabricksUtilities.scala:
    • Add cpuNotebookPartition(numPartitions, partition) method with deterministic absolute-path-based sorting
    • Add timeoutSeconds parameter to databricksSubmitRun and submitRunOnCluster
    • Tune default timeouts (CPU: 20min, GPU: 35min)
  • pipeline.yaml: Replace single databricks-cpu matrix entry with databricks-cpu-1, databricks-cpu-2, databricks-cpu-3

Testing

  • Partition function uses deterministic index-based splitting (absolute path sort)
  • All 3 CPU partitions validated in container CI builds (all pass)
  • GPU timeout parameter threading is backward-compatible (default values preserved)

Split from #2506 for independent review.
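For reviewers, the round-robin split described in the summary can be sketched as follows. This is a sketch inferred from the PR description, not the exact diff; the explicit `cpuNotebooks` parameter is added here only to keep the example self-contained (the real helper reads the suite's notebook list directly):

```scala
import java.io.File

// Deterministically assign notebooks to one of numPartitions buckets.
// Sorting by absolute path gives a stable order across machines and runs,
// then index modulo numPartitions picks this partition's share.
def cpuNotebookPartition(cpuNotebooks: Seq[File],
                         numPartitions: Int,
                         partition: Int): Seq[File] =
  cpuNotebooks
    .sortBy(_.getAbsolutePath)
    .zipWithIndex
    .collect { case (nb, i) if i % numPartitions == partition => nb }
```

With three partitions, every notebook lands in exactly one of the three suites, so the union of the partitions always equals the full CPU notebook set.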

Copilot AI review requested due to automatic review settings on March 27, 2026 05:27
@github-actions

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.


github-actions bot commented Mar 27, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA bbbbf7a.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes SynapseML’s Databricks E2E CI by splitting the long-running CPU notebook suite into multiple parallel partitions and by threading/tuning notebook timeouts to improve reliability and reduce wall-clock duration in the Azure DevOps pipeline.

Changes:

  • Split CPU Databricks notebook E2E tests into three ScalaTest suites and three ADO matrix entries to run in parallel.
  • Add timeout parameter plumbing to Databricks notebook submission/execution and adjust default timeouts/retry windows.
  • Update GPU E2E suite to use an explicit (longer) timeout.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pipeline.yaml Replaces the single databricks-cpu matrix entry with three parallel CPU partitions.
core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksUtilities.scala Adds CPU partitioning helper, threads timeout parameters through notebook runs, and tunes retry/timeout defaults.
core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksGPUTests.scala Passes explicit GPU timeout through the helper (and changes concurrency/cluster sizing).
core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksCPUTests.scala Splits the CPU E2E suite into three separate test classes, each running a partition.
Comments suppressed due to low confidence (2)

core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksUtilities.scala:461

  • databricksTestHelper now accepts timeoutMs, but the suite-level testTimeoutInSeconds is set independently (to TimeoutInMillis / 1000). This makes it easy for callers to pass a larger timeoutMs without also overriding testTimeoutInSeconds, causing ScalaTest to abort tests early. Consider asserting timeoutMs/1000 <= testTimeoutInSeconds (or documenting/enforcing that subclasses must override the suite timeout when they override timeoutMs).
  // E2E notebooks run on remote Databricks clusters and take longer than unit tests.
  // Override the 10-min TestBase timeout to match the per-notebook timeout.
  override val testTimeoutInSeconds: Int = TimeoutInMillis / 1000

  import DatabricksUtilities._

  def databricksTestHelper(clusterId: String,
                           libraries: String,
                           notebooks: Seq[File],
                           maxConcurrency: Int,
                           retries: List[Int] = List(1000 * 15),
                           timeoutMs: Int = TimeoutInMillis): Unit = {
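One way to enforce the invariant suggested above is a `require` at the top of `databricksTestHelper`. This is a sketch using the names from the snippet, not code from the diff:

```scala
// Fail fast if a caller raises the per-notebook timeout without also raising
// the suite-level ScalaTest timeout, instead of aborting mid-run.
require(timeoutMs / 1000 <= testTimeoutInSeconds,
  s"timeoutMs (${timeoutMs / 1000}s) exceeds testTimeoutInSeconds (${testTimeoutInSeconds}s); " +
    "subclasses overriding timeoutMs must also override testTimeoutInSeconds")
```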

core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksUtilities.scala:461

  • databricksTestHelper creates a new Executors.newFixedThreadPool(maxConcurrency) per suite but never shuts it down. With the CPU suite now split into 3 classes (plus GPU), this can leave multiple thread pools running and potentially prevent the test JVM from exiting cleanly. Consider shutting down the executor in a finally block (and optionally awaitTermination) once all Await.result calls complete.
  def databricksTestHelper(clusterId: String,
                           libraries: String,
                           notebooks: Seq[File],
                           maxConcurrency: Int,
                           retries: List[Int] = List(1000 * 15),
                           timeoutMs: Int = TimeoutInMillis): Unit = {
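The suggested cleanup could look roughly like this (a sketch; the comment placeholder stands in for the existing submission/await logic):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.ExecutionContext

val executor = Executors.newFixedThreadPool(maxConcurrency)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
try {
  // ... submit notebook runs and Await.result on each future ...
} finally {
  executor.shutdown()                              // stop accepting new tasks
  executor.awaitTermination(1, TimeUnit.MINUTES)   // bound the shutdown wait
}
```

With the suite now instantiated four times per pipeline run (three CPU partitions plus GPU), releasing each pool lets the test JVM exit promptly once its partition finishes.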


class DatabricksCPUTests extends DatabricksTestHelper {
// Split CPU E2E tests into 3 partitions that run as separate ADO matrix entries.
// Each partition creates its own cluster and runs ~15 notebooks in parallel.

Copilot AI Mar 27, 2026


The header comment says each partition "runs ~15 notebooks in parallel", but databricksTestHelper(..., maxConcurrency = NumWorkers) limits parallelism to NumWorkers (currently 5). If the intent is "~15 notebooks per partition", consider rewording the comment to avoid implying 15-way concurrency.

Suggested change
// Each partition creates its own cluster and runs ~15 notebooks in parallel.
// Each partition creates its own cluster and contains ~15 notebooks, running up to the configured worker concurrency.

Comment on lines +14 to +16
val clusterId: String = createClusterInPool(GPUClusterName, AdbGpuRuntime, 3, GpuPoolId)

databricksTestHelper(clusterId, GPULibraries, GPUNotebooks, 3, List(), gpuTimeoutMs)

Copilot AI Mar 27, 2026


This change increases GPU cluster size (numWorkers 2→3) and notebook concurrency (1→3). The PR description only mentions timeout threading/tuning, so this is an unmentioned behavioral/cost change that could affect reliability if GPU notebooks contend for resources. Consider either updating the PR description to call this out (with rationale) or keeping GPU concurrency/workers unchanged while only threading the timeout parameter.

Comment on lines +127 to +131
private val SortedCPUNotebooks = CPUNotebooks.sortBy(_.getName)
val NumCPUPartitions = 3
def cpuNotebookPartition(partIndex: Int): Seq[File] = {
SortedCPUNotebooks.zipWithIndex.filter(_._2 % NumCPUPartitions == partIndex).map(_._1)
}

Copilot AI Mar 27, 2026


cpuNotebookPartition is intended to be deterministic, but sorting by _.getName can be unstable if there are notebooks with the same filename in different directories, and it also relies on the traversal order from recursiveListFiles. Consider sorting by a unique, stable key (e.g., relative path from DocsDir or canonical/absolute path) to keep partitions consistent across machines/runs.

This issue also appears in the following locations of the same file:

  • line 449
  • line 455
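A sketch of the suggested fix, swapping the bare filename for a unique, stable key (assumed as a drop-in replacement for the `sortBy(_.getName)` line in the snippet above):

```scala
// Canonical path is unique per file, so two notebooks with the same name in
// different directories can never tie, and the order no longer depends on
// the traversal order returned by recursiveListFiles.
private val SortedCPUNotebooks = CPUNotebooks.sortBy(_.getCanonicalPath)
```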

BrendanWalsh force-pushed the brwals/split-databricks-e2e branch from e9f9deb to 76a9b2f on March 27, 2026 19:09
Split the monolithic DatabricksCPUTests (~43 min) into 3 partitions
(DatabricksCPUTests1/2/3) that run as separate ADO matrix entries.
Each partition creates its own Databricks cluster and runs ~15 notebooks,
with all 3 starting simultaneously on different ADO agents.

Changes:
- DatabricksCPUTests.scala: Split into 3 test classes using partition indices
- DatabricksUtilities.scala: Add cpuNotebookPartition() with deterministic
  absolute-path-based sorting, add timeoutSeconds parameter threading
- DatabricksGPUTests.scala: Increase GPU workers 2->3 and concurrency 1->3
  (GPU notebooks are independent; parallel execution reduces E2E wall clock)
  Add 30-minute GPU-specific timeout override
- pipeline.yaml: Replace single databricks-cpu with cpu-1/cpu-2/cpu-3 matrix
BrendanWalsh force-pushed the brwals/split-databricks-e2e branch from 76a9b2f to bbbbf7a on March 27, 2026 19:18
@BrendanWalsh
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


codecov-commenter commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.61%. Comparing base (895752c) to head (bbbbf7a).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2527   +/-   ##
=======================================
  Coverage   84.61%   84.61%           
=======================================
  Files         335      335           
  Lines       17708    17708           
  Branches     1612     1612           
=======================================
  Hits        14984    14984           
  Misses       2724     2724           

☔ View full report in Codecov by Sentry.
