ci: containerize CI pipeline with pre-built Docker image #2529
BrendanWalsh wants to merge 1 commit into master
Conversation
Hey @BrendanWalsh 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Snapshot Warnings: Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.
Scanned Files: None
Pull request overview
Containerizes most Azure DevOps CI jobs to run inside a pre-built Docker image, aiming to reduce end-to-end CI time by pre-baking toolchains, caches, and datasets.
Changes:
- Add a new CI Docker image definition and a pipeline job to build/retag it using a content-hash tag.
- Move major CI jobs (style/tests/publish) to run with `container: ci`, add disk cleanup, and scope compilation for unit tests.
- Add dataset cache support in `build.sbt` and introduce a global ScalaTest per-test timeout in `TestBase`.
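The content-hash tagging mentioned above can be sketched as follows; the input files and the 12-character truncation here are illustrative assumptions, not the pipeline's actual implementation:

```shell
# Sketch: derive an image tag from the files that define the CI image, so
# rebuilding with unchanged inputs resolves to the already-published tag.
# The demo files and tag length are assumptions for illustration.
mkdir -p /tmp/ci-demo
printf 'FROM ubuntu:22.04\n' > /tmp/ci-demo/Dockerfile
printf 'name := "demo"\n' > /tmp/ci-demo/build.sbt
TAG=$(cat /tmp/ci-demo/Dockerfile /tmp/ci-demo/build.sbt | sha256sum | cut -c1-12)
echo "synapseml/ci:${TAG}"
```

If the registry already holds an image with that tag, the build job can skip the Docker build and simply retag the existing image.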
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tools/docker/ci/Dockerfile | Defines the CI container image (JDK8, conda env, Spark, SBT warmup, datasets). |
| templates/free_disk.yml | Adds a reusable host disk cleanup step for container jobs. |
| pipeline.yaml | Introduces ci container resource, BuildCIImage job, and migrates many jobs to containers/scoped compilation. |
| build.sbt | Uses DATASET_CACHE to avoid re-downloading test datasets in CI. |
| core/src/test/.../TestBase.scala | Wraps all tests with a global failAfter timeout. |
| project/CodegenPlugin.scala | Removes redundant LocalRootProject publishLocal from installPipPackage. |
| .gitignore | Ignores pipeline.yaml.bak. |
pipeline.yaml (outdated)

```diff
- (timeout 5m sbt setup) || (echo "retrying" && timeout 5m sbt setup) || (echo "retrying" && timeout 5m sbt setup)
- sbt codegen
- sbt publishM2
+ timeout 5m sbt setup codegen publishM2
```
This step wraps setup, codegen, and publishM2 under a single timeout 5m. setup compiles all projects (including tests) and downloads datasets, so 5 minutes is likely too short and may introduce flaky timeouts; consider either increasing the timeout substantially or only applying the 5m timeout to sbt setup (leaving codegen/publishM2 unbounded or separately timed).
Suggested change:

```diff
- timeout 5m sbt setup codegen publishM2
+ timeout 30m sbt setup
+ sbt codegen publishM2
```
```yaml
cancelTimeoutInMinutes: 0
pool:
  vmImage: $(UBUNTU_VERSION)
container: ci
strategy:
```
DatabricksE2E’s job-level condition currently overrides the default succeeded() gating (it only checks the parameter), so this job will still run even if earlier jobs fail. If the intent is consistent with the other updated jobs, include succeeded() (and optionally variables.runTests) in the condition to avoid spending time/resources on E2E when the build is already failing.
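A sketch of the suggested gating, combining the default success check with the existing parameter; the exact variable name should be taken from the pipeline (the comment mentions `variables.runTests`):

```yaml
# Hypothetical: keep the parameter check but also require earlier jobs
# to have succeeded, so E2E is skipped when the build is already failing.
condition: and(succeeded(), eq(variables.runTests, 'true'))
```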
```scala
// Global per-test timeout (10 minutes). Override in subclass if needed.
val testTimeoutInSeconds: Int = 10 * 60

override def test(testName: String, testTags: Tag*)(testFun: => Any)(implicit pos: Position): Unit = {
  super.test(testName, testTags: _*) {
    failAfter(Span(testTimeoutInSeconds, Seconds)) {
      testFun
    }
  }
}
```
The PR description mentions a 30-minute hard timeout for test suites, but the new global per-test timeout here is set to 10 minutes. If 30 minutes is the intended limit, update testTimeoutInSeconds accordingly (or make it configurable via an env var/system property); otherwise the PR description should be updated to avoid confusion.
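One way to make the limit configurable, sketched under the assumption that a JVM system property is an acceptable override mechanism (the property name `test.timeout.seconds` is hypothetical):

```scala
// Hypothetical sketch: read the per-test timeout from a system property,
// falling back to the 10-minute default used in this PR.
val testTimeoutInSeconds: Int = sys.props.get("test.timeout.seconds")
  .flatMap(s => scala.util.Try(s.toInt).toOption)
  .getOrElse(10 * 60)
```

CI could then pass `-Dtest.timeout.seconds=1800` for suites that genuinely need the 30-minute limit mentioned in the PR description.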
tools/docker/ci/Dockerfile (outdated)

```dockerfile
    && wget -q http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2.24_amd64.deb -O /tmp/libssl1.1.deb \
    && dpkg -i /tmp/libssl1.1.deb \
    && rm /tmp/libssl1.1.deb
```
The image installs libssl1.1 by downloading a .deb over plain HTTP and installing it with dpkg without any signature/checksum verification. This is vulnerable to tampering and also bypasses apt’s normal integrity checks; please switch to HTTPS and verify the artifact (e.g., checksum/GPG), or use a trusted apt repository/package source for this dependency.
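The verification pattern the comment asks for can be sketched in shell; the demo file and checksum below are synthetic stand-ins for the real .deb and its published SHA256:

```shell
# Sketch: refuse to install an artifact whose SHA256 does not match an
# expected value. The file here is a synthetic stand-in for the .deb;
# in CI, EXPECTED_SHA256 would be a pinned constant, not computed here.
printf 'demo artifact' > /tmp/libssl-demo.deb
EXPECTED_SHA256=$(sha256sum /tmp/libssl-demo.deb | cut -d' ' -f1)
echo "${EXPECTED_SHA256}  /tmp/libssl-demo.deb" | sha256sum -c - && echo VERIFIED
```

In the Dockerfile this would become a `sha256sum -c` step between the `wget` and the `dpkg -i`, with the pinned checksum stored in the repo.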
tools/docker/ci/Dockerfile (outdated)

```dockerfile
COPY build.sbt /tmp/sbt-warmup/build.sbt
COPY sonatype.sbt /tmp/sbt-warmup/sonatype.sbt
RUN cd /tmp/sbt-warmup \
    && sbt --batch -Dsbt.supershell=false "update; Test/update" 2>/dev/null || true \
```
sbt ... 2>/dev/null || true suppresses all warmup errors and discards stderr, which can silently produce an image with an incomplete/empty dependency cache and make CI failures harder to diagnose. Consider keeping logs (at least on failure) and failing the image build if the warmup command errors unexpectedly.
Suggested change:

```diff
- && sbt --batch -Dsbt.supershell=false "update; Test/update" 2>/dev/null || true \
+ && sbt --batch -Dsbt.supershell=false "update; Test/update" \
```
tools/docker/ci/Dockerfile (outdated)

```dockerfile
RUN cd /tmp/sbt-warmup \
    && sbt --batch -Dsbt.supershell=false "update; Test/update" 2>/dev/null || true \
    && rm -rf /tmp/sbt-warmup /tmp/.sbt \
    && chmod -R 777 $COURSIER_CACHE
```
chmod -R 777 $COURSIER_CACHE makes the dependency cache world-writable. Even in CI, it’s safer to scope write permissions to the agent user/group (e.g., chown to the container user/UID used by ADO) to reduce the chance of cache poisoning between steps/jobs.
Suggested change:

```diff
- && chmod -R 777 $COURSIER_CACHE
+ && chown -R 1001:1001 "$COURSIER_CACHE" \
+ && chmod -R 775 "$COURSIER_CACHE"
```
```yaml
pool:
  vmImage: $(UBUNTU_VERSION)
steps:
  #- template: templates/ivy_cache.yml
  - template: templates/update_cli.yml
  - template: templates/conda.yml
  - template: templates/kv.yml
```
FabricE2E is still running on the host VM without container: ci and still includes the old update_cli.yml/conda.yml setup. That conflicts with the PR goal/description of containerizing the entire CI pipeline and may leave this job as a remaining slow/non-deterministic outlier; either migrate this job to the ci container (and use free_disk.yml like the others) or update the PR description to explicitly exclude Fabric E2E from containerization.
tools/docker/ci/Dockerfile (outdated)

```dockerfile
RUN wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
    && bash /tmp/miniconda.sh -b -p /opt/conda \
    && rm /tmp/miniconda.sh \
    && /opt/conda/bin/conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main \
    && /opt/conda/bin/conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
```
Using Miniconda3-latest makes the CI image build non-reproducible (the contents can change over time without any repo change), which can introduce hard-to-debug CI breakages. Consider pinning to a specific Miniconda installer version (or checksum-verifying the installer) so the image build is deterministic.
tools/docker/ci/Dockerfile (outdated)

```dockerfile
    nvidia-cuda-runtime-cu12 nvidia-cudnn-cu12 nvidia-cufft-cu12 \
    nvidia-curand-cu12 nvidia-cusolver-cu12 nvidia-cusparse-cu12 \
    nvidia-nccl-cu12 nvidia-nvjitlink-cu12 nvidia-nvtx-cu12 2>/dev/null || true \
    && chmod -R 777 /opt/conda/envs
```
chmod -R 777 /opt/conda/envs makes the entire conda env tree world-writable inside the image. This broad permission can enable unintended modification of executables/libraries during CI runs; prefer a narrower permission model (e.g., chown to the agent UID/GID, or group-writable with a dedicated group) to reduce tampering risk.
Suggested change:

```diff
- && chmod -R 777 /opt/conda/envs
+ && chmod -R 755 /opt/conda/envs
```
Force-pushed from 6f18643 to bbb23c6.
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #2529      +/-   ##
==========================================
+ Coverage   84.61%   84.86%   +0.24%
==========================================
  Files         335      335
  Lines       17708    17708
  Branches     1612     1612
==========================================
+ Hits        14984    15028     +44
+ Misses       2724     2680     -44
```

☔ View full report in Codecov by Sentry.
Force-pushed from 025d3e6 to 7c1b41d.
Add a BuildCIImage job that produces a content-hash-cached Docker image (Ubuntu 22.04, JDK 8, conda, SBT, Spark, pre-cached datasets) published to mmlsparkmcr.azurecr.io/synapseml/ci via Workload Identity Federation. All test/build jobs now run inside this container, eliminating per-job conda setup, CLI updates, and dataset downloads.

Key changes:
- tools/docker/ci/Dockerfile: multi-stage image definition with pinned Miniconda, SHA256-verified libssl1.1, tightened file permissions (755)
- pipeline.yaml: container resource, BuildCIImage job, scoped compilation via PROJECT variable, SBT_OPTS tuning, free_disk.yml template, succeeded() gating on the DatabricksE2E condition
- build.sbt: ForkJoinPool deadlock fix for parallel test execution
- TestBase.scala: 10-minute per-test hard timeout for all test suites (overridable in subclasses; e.g. DatabricksTestHelper uses a longer one)
- CodegenPlugin.scala: remove redundant root publishLocal in installPipPackage
- templates/free_disk.yml: disk cleanup helper

Note: FabricE2E is intentionally excluded from containerization; it requires Fabric SDK / sempy packages not in the CI container image.

Performance: ~59% wall-clock reduction (66m -> 27m), 33.5% total compute savings (1228m -> 817m agent-minutes) across 59 test jobs.
Force-pushed from 7c1b41d to aec4efd.
Summary
Containerizes the entire SynapseML CI pipeline using a pre-built Docker image, reducing wall-clock time from ~66 minutes to ~27 minutes (59% reduction) and total compute from ~1228 to ~817 agent-minutes (33.5% reduction).
Changes
New files:
- tools/docker/ci/Dockerfile: CI image published to mmlsparkmcr.azurecr.io/synapseml/ci. Uses HTTPS + SHA256 verification for libssl1.1 and tightened permissions (755 instead of 777).
- templates/free_disk.yml: host disk cleanup helper.

Modified files:
- Add the `BuildCIImage` job with content-hash-based caching (~30s on cache hit).
- Apply `container: ci` to all test/build/publish jobs (except FabricE2E; see note below).
- Add a `PROJECT` variable to the UnitTests matrix (each job compiles only its module).
- Tune `SBT_OPTS` (`-Xmx4G`, ForkJoinPool thread config).
- Add `succeeded()` gating to the DatabricksE2E condition.
- Raise the `sbt setup` timeout from 5m to 30m.

FabricE2E exclusion: The FabricE2E job intentionally runs on the host VM (not containerized) because it requires the Fabric SDK and sempy packages that are not in the CI container image.
Performance (measured on ADO Pipeline 17563)
Dependencies
This PR is independent but works best with these companion PRs:
All were split from the original #2506 for independent review.
Testing
This is the core containerization PR, split from #2506.