perf: performance foundations for parallel deployment execution#7600
perf: performance foundations for parallel deployment execution#7600
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a set of performance-focused improvements to azd aimed at reducing end-to-end latency during parallel azd up runs (4–8 services), primarily by improving HTTP connection reuse, reducing per-call ARM client construction overhead, tuning ARM poll intervals, and adding targeted retries/telemetry for common transient deployment failures.
Changes:
- Add a tuned HTTP transport and wire it into DI to improve connection pooling during parallel ARM calls.
- Cache ARM SDK clients per subscription, tune ARM poll frequencies, and add OTel spans across deployment-critical paths for profiling.
- Improve resiliency/latency in parallel deployments (SCM readiness probe + zip deploy retry, ACR credential exponential backoff) and fix Docker path resolution semantics.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/specs/perf-foundations/spec.md | Adds the detailed performance spec/rationale and test mapping for the changes. |
| cli/azd/resources/alpha_features.yaml | Registers the update alpha feature flag. |
| cli/azd/pkg/project/framework_service_docker.go | Adjusts .NET Dockerfile path resolution logic for dotnet publish vs Dockerfile builds. |
| cli/azd/pkg/project/framework_service_docker_test.go | Updates Docker build argument expectations to match new absolute path resolution behavior. |
| cli/azd/pkg/project/container_helper.go | Adds container OTel spans, switches ACR credential retries to exponential backoff, and introduces resolveDockerPaths. |
| cli/azd/pkg/project/container_helper_test.go | Updates mocks to tolerate tracing-wrapped contexts. |
| cli/azd/pkg/project/container_helper_coverage3_test.go | Adds unit tests for resolveDockerPaths. |
| cli/azd/pkg/httputil/util.go | Ensures raw responses close bodies and introduces TunedTransport(). |
| cli/azd/pkg/httputil/util_test.go | Adds tests validating transport tuning and default transport immutability. |
| cli/azd/pkg/azsdk/zip_deploy_client.go | Adds SCM readiness probe helper (IsScmReady). |
| cli/azd/pkg/azapi/webapp.go | Adds zip deploy retry logic with SCM readiness waiting, plus an App Service app setting helper and tracing. |
| cli/azd/pkg/azapi/webapp_test.go | Adds tests for transient build-failure classification. |
| cli/azd/pkg/azapi/standard_deployments.go | Adds ARM client caching, tuned poll intervals, and OTel spans for standard deployments. |
| cli/azd/pkg/azapi/stack_deployments.go | Adds deployment stacks client caching, tuned poll intervals, and OTel spans. |
| cli/azd/pkg/azapi/resource_service.go | Caches ARM resources/resource groups clients per subscription. |
| cli/azd/cmd/deps.go | Wires the tuned transport into the HTTP client used by DI. |
| cli/azd/.vscode/cspell-azd-dictionary.txt | Adds new words used by the change set to cspell dictionary. |
| .gitignore | Ignores new session/coverage artifacts (cover-* etc.) and Playwright MCP directory. |
a14130c to
186d31a
Compare
📋 Prioritization NoteThanks for the contribution! The linked issue isn't in the current milestone yet. |
c09faed to
5d98903
Compare
wbreza
left a comment
There was a problem hiding this comment.
Performance Foundations Review
Solid performance engineering with thorough documentation. The 11 implemented changes are well-reasoned and correctly handle concurrency, resource lifecycle, and error propagation. Each optimization targets a specific measured bottleneck. The spec document is exemplary.
Summary
| Priority | Count |
|---|---|
| Medium | 4 |
| Low | 4 |
| Total | 8 |
🟡 Medium (4 findings)
- Missing test coverage for new critical logic —
waitForScmReady,IsScmReady, the zip deploy retry loop, andUpdateAppServiceAppSettinglack unit tests despite being concurrency-sensitive code. - Context cancellation suppressed in
waitForScmReady—IsScmReadyerrors (includingcontext.Canceled) are logged and continued instead of propagated. Theselectat the top of the next iteration catchesctx.Done(), but the path is indirect. - Missing Go doc comments on new functions —
TunedTransport(),IsScmReady(),UpdateAppServiceAppSetting(),isBuildFailure(),waitForScmReady(),resolveDockerPaths(), andresolveServiceDir()need comprehensive documentation per repo standards. - Docker path resolution logic still partially duplicated —
useDotNetPublishForDockerBuild()reimplements the same resolution logic instead of callingresolveDockerPaths().
🟢 Low (4 findings)
- PR description out of sync — Description lists 13 changes (21 files, +991/-79) but diff contains 11 changes (19 files, +918/-76). Changes §12 (Container App Adaptive Poll) and §13 (ACR Login Caching) are described but not present.
waitForScmReadyusestime.Afterin loop — Creates a new timer each iteration (well-known Go anti-pattern). Usetime.NewTickerwithdefer ticker.Stop().isBuildFailureuses fragile string matching — Detection depends on exact Azure/Kudu error message wording; if messages change, retries stop silently.- Missing path validation in
resolveDockerPaths— User-supplieddocker.path/docker.contextfromazure.yamlare joined without..traversal validation per repo path safety standards.
✅ What Looks Good
- Thorough spec document — Every change has written rationale, evidence, and test mapping. Exemplary engineering documentation.
- Consistent caching pattern —
sync.MapLoad/LoadOrStore applied uniformly across all ARM client caches with clear “benign race” comments. - Clean resource leak fix —
defer response.Body.Close()inReadRawResponseis correct and verified safe against all callers. - Well-designed retry semantics — Zip deploy retry with SCM readiness probe is a targeted fix with
io.ReadSeekerrewind and OTel attempt tracking. - Smart poll frequency separation — 2s for deploys, 5s for validate/WhatIf respects ARM rate limits while reducing tail latency.
- Prior review items resolved — All 3 copilot-bot findings properly addressed.
Overall Assessment: Approve with suggestions — no functional bugs found. Findings are about completeness (tests, docs, description sync).
All 7 functions already have doc comments:
Both are present in the current diff:
|
6db2a9a to
d210dcc
Compare
wbreza
left a comment
There was a problem hiding this comment.
Code Review — Performance Foundations
Reviewed across 3 models (Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4) covering 7 focus areas: Security, Error Handling, Performance, Code Quality, Testing, API Design, and Concurrency.
Solid performance engineering with thorough documentation. The spec document is exemplary and each optimization targets a real bottleneck. However, several findings need attention before merge — particularly around the ReadRawResponse body close interaction, ARM polling rate limits, and test coverage gaps.
Summary
| Priority | Count |
|---|---|
| High | 6 |
| Medium | 8 |
| Total | 14 |
🔴 High Findings
- SCM readiness failures swallowed —
waitForScmReadyerrors masked as non-fatal. Auth/TLS failures become invisible retry noise. - ReadRawResponse body close breaks poll handler —
deployPollingHandler.Poll()returns a response with closed body to azcore poller. - 2s ARM polling risks quota saturation — 8 parallel deploys at 2s = ~4 reads/sec = full ARM budget. Additional reads trigger 429s.
- 30s IdleConnTimeout regresses unchanged callers — call sites still using SDK default 30s polling lose idle connections between polls.
- Docker test assertions are self-referential — expected values computed by the same production helpers being tested.
- Context cancellation tests provide false coverage —
IsScmReadynever invoked (0 calls verified); error handling path untested.
🟡 Medium Findings
- ReadRawResponse changes exported body ownership contract
- DeployAppServiceZip at ~90 lines / 4 nesting levels
- Client-caching boilerplate duplicated 5×
- isBuildFailure uses brittle string matching
- slowPollFrequency=5s contradicts its own comment
- SCM tests add 20s wall time via real ticker
- Transport mutation guard covers 1/5 fields
- resolveDockerPaths tests miss
..traversal cases
✅ What Looks Good
- Thorough spec document with evidence and test mapping
- sync.Map caching pattern — consistent, documented, verified safe
- SCM retry semantics — io.ReadSeeker rewind + OTel tracking
- No security regressions — verified by all 3 models
- Concurrency patterns all correct — named returns, singleflight, sync.Map
Overall: Request Changes — the 6 High findings (especially #2 and #3) require resolution before merge.
d7ac3ce to
53fb44e
Compare
204dc29 to
dd71f9a
Compare
629609d to
82a4d35
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d/fig-completion) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ompletion parsing) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore unconditional defer res.Body.Close() in federated_token_client.go - Fix spec.md IdleConnTimeout description to match actual 90s value - Use errors.AsType[net.Error] instead of errors.As per AGENTS.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
82a4d35 to
5770b6b
Compare
Azure Dev CLI Install InstructionsInstall scriptsMacOS/Linux
bash: pwsh: WindowsPowerShell install MSI install Standalone Binary
MSI
Documentationlearn.microsoft.com documentationtitle: Azure Developer CLI reference
|
Related Issues
Fixes #7601
Related to #7526, #372, #6291
Overview
Targeted performance optimizations to reduce deployment latency in
azd, particularly when multiple services are provisioned and deployed concurrently duringazd up. Each change addresses a specific bottleneck observed during parallel runs with 4-8 services.Changes
1. HTTP Connection Pooling (spec §1)
Go's default transport limits to 2 idle connections per host. With 8 parallel deploys hitting
management.azure.com, connections are constantly torn down and re-established (50-150ms per TLS handshake).TunedTransport()raises per-host limits to 50 and wires it into the DI container.Files:
pkg/httputil/util.go,cmd/deps.go2. ARM Client Caching (spec §2)
ARM SDK clients were re-created on every API call, rebuilding the HTTP pipeline each time. Now cached in
sync.Mapfields keyed by subscription ID. Azure SDK ARM clients are stateless and goroutine-safe, so caching is safe.Files:
pkg/azapi/resource_service.go,pkg/azapi/standard_deployments.go,pkg/azapi/stack_deployments.go3. Adaptive Poll Frequency (spec §3)
PollUntilDone(ctx, nil)defaults to 30s intervals. Deploy operations now poll every 2s (reducing tail latency by up to 28s), while WhatIf/Validate use 5s to conserve ARM read quota.Files:
pkg/azapi/standard_deployments.go,pkg/azapi/stack_deployments.go4. Zip Deploy Retry with SCM Readiness Probe (spec §4)
During concurrent
azd up, ARM applying site config can restart the SCM (Kudu) container, causing transient build failures. Now detects these viaisBuildFailure(), waits for SCM readiness via/api/deploymentsprobe (90s timeout), and retries up to 2 additional times with zip rewind.Files:
pkg/azapi/webapp.go,pkg/azsdk/zip_deploy_client.go5. ACR Credential Exponential Backoff (spec §5)
Changed from constant 20s retry delay to exponential backoff starting at 2s (2→4→8→16→32s). Most transient 404s resolve on first retry; worst case is ~62s vs old 60s.
File:
pkg/project/container_helper.go6. Docker Path Resolution Fix (spec §6)
User-specified
docker.path/docker.contextresolve relative to project root; defaults resolve relative to service directory. ExtractedresolveDockerPaths()to apply the correct base path consistently acrossBuild(),runRemoteBuild(), andpackBuild().Files:
pkg/project/container_helper.go,pkg/project/framework_service_docker.go7. ReadRawResponse Body Close Fix (spec §7)
Added missing
defer response.Body.Close()— unclosed body prevents TCP connections from returning to the pool.File:
pkg/httputil/util.go8. UpdateAppServiceAppSetting Helper (spec §8)
New helper to read-modify-write a single app setting without replacing the entire set. Handles nil-properties edge case.
File:
pkg/azapi/webapp.go9. Alpha Feature Registration (spec §9)
Registered
updatealpha feature to gate theazd updatecommand behind the feature flag system.File:
resources/alpha_features.yaml10. OTel Tracing Spans (spec §10)
Added 11 OTel tracing spans across all deployment-critical paths for end-to-end profiling: 6 for ARM standard deployments, 2 for deployment stacks, 3 for container operations. Enables trace-viewer analysis of where wall-clock time is spent during parallel
azd up.Files:
pkg/azapi/standard_deployments.go,pkg/azapi/stack_deployments.go,pkg/project/container_helper.go11. Supporting Changes (spec §11)
.gitignoreentries for coverage artifacts; cspell dictionary additions for new terminology.12. Container App Adaptive Poll Frequency (spec §12)
Container App revision updates (
BeginCreateOrUpdate,BeginUpdate) were polling at the SDK default of 30s. For templates with multiple Container Apps (e.g., 5-service shop template), each service wastes up to 28s in tail latency waiting for the next poll after the operation completes server-side. AddedcontainerAppPollFrequency = 2sand applied to all 3PollUntilDonecall sites:DeployYaml,updateContainerApp, andUpdateContainerAppJob. Estimated savings: ~140s for 5-service Container App templates.File:
pkg/containerapps/container_app.go13. ACR Login Caching per Registry (spec §13)
When deploying multiple services to the same ACR,
containerRegistryService.Login()was performing a full credential exchange (AAD→ACR token via/oauth2/exchange+docker login) for each service — redundant work when the registry is the same. Added async.Maplogin cache keyed bysubscriptionId:loginServerto skip already-authenticated registries. Estimated savings: ~12s for 5-service templates sharing one ACR.File:
pkg/azapi/container_registry.goTest Coverage
pkg/httputil/util_test.gopkg/azapi/webapp_test.gopkg/project/framework_service_docker_test.gopkg/project/container_helper_coverage3_test.gopkg/project/container_helper_test.goInfrastructure changes (client caching, poll frequency, Container App poll frequency, ACR login caching, SCM retry, OTel spans) are validated through integration and playback tests.
Stats
21 files changed, +991, -79