perf: split Databricks E2E into 3 parallel CPU partitions, tune timeouts #2527
BrendanWalsh wants to merge 1 commit into master
Conversation
Hey @BrendanWalsh 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Snapshot Warnings: Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.
Scanned Files: None
Pull request overview
This PR optimizes SynapseML’s Databricks E2E CI by splitting the long-running CPU notebook suite into multiple parallel partitions and by threading/tuning notebook timeouts to improve reliability and reduce wall-clock duration in the Azure DevOps pipeline.
Changes:
- Split CPU Databricks notebook E2E tests into three ScalaTest suites and three ADO matrix entries to run in parallel.
- Add timeout parameter plumbing to Databricks notebook submission/execution and adjust default timeouts/retry windows.
- Update GPU E2E suite to use an explicit (longer) timeout.
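The timeout plumbing described above can be sketched roughly as follows. This is an illustrative sketch only: the names `TimeoutPlumbing`, `runNotebook`, and `DefaultTimeoutMs` are hypothetical stand-ins, not SynapseML's actual API.

```scala
// Illustrative only: names below are hypothetical, not SynapseML's API.
object TimeoutPlumbing {
  val DefaultTimeoutMs: Int = 40 * 60 * 1000 // assumed 40-minute default

  // Stand-in for polling a remote Databricks run until it finishes or the
  // deadline passes; here it just reports whether time remains.
  private def pollUntilDone(deadlineMs: Long): Boolean =
    System.currentTimeMillis() < deadlineMs

  // Callers that need more time (e.g. a GPU suite) pass a larger timeoutMs;
  // every downstream layer derives its deadline from the same value.
  def runNotebook(path: String, timeoutMs: Int = DefaultTimeoutMs): Boolean =
    pollUntilDone(System.currentTimeMillis() + timeoutMs)
}
```

The key idea is that a single explicit parameter with a default flows from the suite entry point down to the polling loop, so per-suite overrides (such as a longer GPU timeout) take effect everywhere.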
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pipeline.yaml | Replaces the single databricks-cpu matrix entry with three parallel CPU partitions. |
| core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksUtilities.scala | Adds CPU partitioning helper, threads timeout parameters through notebook runs, and tunes retry/timeout defaults. |
| core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksGPUTests.scala | Passes explicit GPU timeout through the helper (and changes concurrency/cluster sizing). |
| core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksCPUTests.scala | Splits the CPU E2E suite into three separate test classes, each running a partition. |
Comments suppressed due to low confidence (2)
core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksUtilities.scala:461
`databricksTestHelper` now accepts `timeoutMs`, but the suite-level `testTimeoutInSeconds` is set independently (to `TimeoutInMillis / 1000`). This makes it easy for callers to pass a larger `timeoutMs` without also overriding `testTimeoutInSeconds`, causing ScalaTest to abort tests early. Consider asserting `timeoutMs / 1000 <= testTimeoutInSeconds` (or documenting/enforcing that subclasses must override the suite timeout when they override `timeoutMs`).
```scala
// E2E notebooks run on remote Databricks clusters and take longer than unit tests.
// Override the 10-min TestBase timeout to match the per-notebook timeout.
override val testTimeoutInSeconds: Int = TimeoutInMillis / 1000

import DatabricksUtilities._

def databricksTestHelper(clusterId: String,
                         libraries: String,
                         notebooks: Seq[File],
                         maxConcurrency: Int,
                         retries: List[Int] = List(1000 * 15),
                         timeoutMs: Int = TimeoutInMillis): Unit = {
```
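The guard the reviewer suggests could look roughly like this. `timeoutMs` and `testTimeoutInSeconds` mirror the names in the diff, but the `require` call itself is a hypothetical sketch, not code from the PR:

```scala
// Hypothetical guard, per the review comment: fail fast if a caller passes a
// timeoutMs that the ScalaTest suite timeout would cut short.
def assertTimeoutsConsistent(timeoutMs: Int, testTimeoutInSeconds: Int): Unit =
  require(
    timeoutMs / 1000 <= testTimeoutInSeconds,
    s"timeoutMs ($timeoutMs ms) exceeds suite timeout ($testTimeoutInSeconds s); " +
      "subclasses overriding timeoutMs must also override testTimeoutInSeconds")
```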
core/src/test/scala/com/microsoft/azure/synapse/ml/nbtest/DatabricksUtilities.scala:461
`databricksTestHelper` creates a new `Executors.newFixedThreadPool(maxConcurrency)` per suite but never shuts it down. With the CPU suite now split into 3 classes (plus GPU), this can leave multiple thread pools running and potentially prevent the test JVM from exiting cleanly. Consider shutting down the executor in a `finally` block (and optionally `awaitTermination`) once all `Await.result` calls complete.
```scala
def databricksTestHelper(clusterId: String,
                         libraries: String,
                         notebooks: Seq[File],
                         maxConcurrency: Int,
                         retries: List[Int] = List(1000 * 15),
                         timeoutMs: Int = TimeoutInMillis): Unit = {
```
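One way to realize the suggested fix, sketched under the assumption that all futures are awaited inside the helper body; the wrapper name `withFixedPool` is made up for illustration:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.ExecutionContext

// Hypothetical wrapper: create the fixed pool, run the notebook-submission
// body, then shut the pool down in a finally block so the test JVM can exit
// once every Await.result has completed.
def withFixedPool[T](maxConcurrency: Int)(body: ExecutionContext => T): T = {
  val pool = Executors.newFixedThreadPool(maxConcurrency)
  try body(ExecutionContext.fromExecutorService(pool))
  finally {
    pool.shutdown()
    pool.awaitTermination(60, TimeUnit.SECONDS) // optional graceful drain
  }
}
```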
```scala
class DatabricksCPUTests extends DatabricksTestHelper {
  // Split CPU E2E tests into 3 partitions that run as separate ADO matrix entries.
  // Each partition creates its own cluster and runs ~15 notebooks in parallel.
```
The header comment says each partition "runs ~15 notebooks in parallel", but databricksTestHelper(..., maxConcurrency = NumWorkers) limits parallelism to NumWorkers (currently 5). If the intent is "~15 notebooks per partition", consider rewording the comment to avoid implying 15-way concurrency.
```diff
- // Each partition creates its own cluster and runs ~15 notebooks in parallel.
+ // Each partition creates its own cluster and contains ~15 notebooks, running up to the configured worker concurrency.
```
```scala
val clusterId: String = createClusterInPool(GPUClusterName, AdbGpuRuntime, 3, GpuPoolId)

databricksTestHelper(clusterId, GPULibraries, GPUNotebooks, 3, List(), gpuTimeoutMs)
```
This change increases GPU cluster size (numWorkers 2→3) and notebook concurrency (1→3). The PR description only mentions timeout threading/tuning, so this is an unmentioned behavioral/cost change that could affect reliability if GPU notebooks contend for resources. Consider either updating the PR description to call this out (with rationale) or keeping GPU concurrency/workers unchanged while only threading the timeout parameter.
```scala
private val SortedCPUNotebooks = CPUNotebooks.sortBy(_.getName)

val NumCPUPartitions = 3

def cpuNotebookPartition(partIndex: Int): Seq[File] = {
  SortedCPUNotebooks.zipWithIndex.filter(_._2 % NumCPUPartitions == partIndex).map(_._1)
}
```
`cpuNotebookPartition` is intended to be deterministic, but sorting by `_.getName` can be unstable if there are notebooks with the same filename in different directories, and it also relies on the traversal order from `recursiveListFiles`. Consider sorting by a unique, stable key (e.g., relative path from `DocsDir` or canonical/absolute path) to keep partitions consistent across machines/runs.
This issue also appears in the following locations of the same file:
- line 449
- line 455
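A stable-key variant along the lines the review suggests could be sketched as below, using the absolute path as the sort key before round-robin assignment. The parameter names mirror the diff, but the function itself is illustrative, not the PR's code:

```scala
import java.io.File

// Sort on a unique, stable key (absolute path) before round-robin
// assignment, so every machine produces the same partitions.
def cpuNotebookPartitionStable(notebooks: Seq[File],
                               numPartitions: Int,
                               partIndex: Int): Seq[File] =
  notebooks
    .sortBy(_.getAbsolutePath) // unlike getName, unique across directories
    .zipWithIndex
    .collect { case (f, i) if i % numPartitions == partIndex => f }
```

Because every notebook lands in exactly one partition, the partitions are disjoint and together cover the full set regardless of input order.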
Force-pushed e9f9deb to 76a9b2f
Split the monolithic DatabricksCPUTests (~43 min) into 3 partitions (DatabricksCPUTests1/2/3) that run as separate ADO matrix entries. Each partition creates its own Databricks cluster and runs ~15 notebooks, with all 3 starting simultaneously on different ADO agents.

Changes:
- DatabricksCPUTests.scala: Split into 3 test classes using partition indices
- DatabricksUtilities.scala: Add cpuNotebookPartition() with deterministic absolute-path-based sorting; add timeoutSeconds parameter threading
- DatabricksGPUTests.scala: Increase GPU workers 2->3 and concurrency 1->3 (GPU notebooks are independent; parallel execution reduces E2E wall clock); add 30-minute GPU-specific timeout override
- pipeline.yaml: Replace single databricks-cpu with cpu-1/cpu-2/cpu-3 matrix
Force-pushed 76a9b2f to bbbbf7a
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## master #2527 +/- ##
=======================================
Coverage 84.61% 84.61%
=======================================
Files 335 335
Lines 17708 17708
Branches 1612 1612
=======================================
Hits 14984 14984
Misses 2724 2724
Summary

Splits the single `databricks-cpu` E2E job (~43 minutes) into 3 parallel partitions (~15-20 minutes each), increases GPU parallelism, and tunes Databricks job timeouts for reliability.

Changes

- Split `DatabricksCPUTests` into `DatabricksCPUTests1`, `DatabricksCPUTests2`, and `DatabricksCPUTests3` using partition indices
- Add a `cpuNotebookPartition(numPartitions, partition)` method with deterministic absolute-path-based sorting
- Add a `timeoutSeconds` parameter to `databricksSubmitRun` and `submitRunOnCluster`
- Replace the single `databricks-cpu` matrix entry with `databricks-cpu-1`, `databricks-cpu-2`, `databricks-cpu-3`

Testing
Split from #2506 for independent review.