Skip to content

re-implement compose logic#13641

Open
ndeloof wants to merge 3 commits intodocker:mainfrom
ndeloof:reconciliation
Open

re-implement compose logic#13641
ndeloof wants to merge 3 commits intodocker:mainfrom
ndeloof:reconciliation

Conversation

@ndeloof
Copy link
Contributor

@ndeloof ndeloof commented Mar 17, 2026

re-implement compose logic as a reconciliation between observed and desired state, producing a plan with pending operations.

  • build observed state
  • compute reconciliation plan by comparing observed vs desired state
  • execute plan

This allow to decorelate reconciliation (aka "convergence") logic from docker API, and write simpler and efficient tests to cover various scenarios

This PR was created with AI assistance

What I did

told Claude about the architecture I had in mind, and expectations regarding tests structure

@ndeloof ndeloof force-pushed the reconciliation branch 2 times, most recently from 6129bb0 to 53c93f1 Compare March 17, 2026 08:51
@codecov
Copy link

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 89.57107% with 124 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/compose/plan_executor.go 75.00% 74 Missing and 11 partials ⚠️
pkg/compose/reconcile.go 96.60% 13 Missing and 11 partials ⚠️
pkg/compose/create.go 82.19% 8 Missing and 5 partials ⚠️
pkg/compose/observed_state.go 96.82% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@ndeloof ndeloof marked this pull request as ready for review March 17, 2026 09:10
@ndeloof ndeloof requested a review from a team as a code owner March 17, 2026 09:10
@ndeloof ndeloof requested review from Copilot and glours March 17, 2026 09:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR re-implements the Docker Compose "up/create" convergence logic as a reconciliation between observed and desired state, producing an explicit plan with pending operations. The old convergence struct is removed in favor of a pure Reconcile function that computes a ReconciliationPlan, and a separate ExecutePlan method that executes it.

Changes:

  • Introduces ObservedState, Reconcile(), and ReconciliationPlan types to separate state inspection, diff computation, and execution
  • Replaces the convergence struct with executionState for resolving service references during plan execution
  • Adds comprehensive unit tests for the reconciliation logic, with cross-references to existing e2e tests

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/compose/reconcile.go New pure reconciliation logic computing operation plans from observed vs desired state
pkg/compose/plan_executor.go New DAG-based concurrent plan executor and display utilities
pkg/compose/observed_state.go New types and Docker API queries for capturing current project state
pkg/compose/create.go Rewired create() to use InspectState → Reconcile → ExecutePlan pipeline
pkg/compose/convergence.go Removed old convergence struct, mustRecreate, recreateContainer, etc.
pkg/compose/run.go Updated to use executionState instead of old convergence
pkg/compose/reconcile_test.go Comprehensive unit tests for reconciliation logic
pkg/compose/observed_state_test.go Tests for observed state types and plan utilities
pkg/e2e/*.go Added TODO comments cross-referencing new unit tests
pkg/compose/publish.go Minor fmt.Fprintf cleanup
pkg/compose/progress.go Fixed misleading comment
cmd/compose/compose.go Unrelated: use consts.ComposeProfiles for default profile env var
AI_POLICY.md New AI usage policy document

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

- build observed state
- compute reconciliation plan by comparing observed vs desired state
- execute plan

this decorelates reconciliation (aka "convergence") logic from docker API,
and write simpler and efficient tests to cover various scenarios

Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
@ndeloof ndeloof force-pushed the reconciliation branch 3 times, most recently from 428b020 to f1743e1 Compare March 23, 2026 13:40
@ndeloof ndeloof requested a review from Copilot March 23, 2026 15:13
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Compose “up”/create behavior into a pure reconciliation phase (observed vs desired state) that produces an execution plan (DAG of operations), then executes that plan, with extensive new unit tests for reconcile decision logic.

Changes:

  • Add ObservedState + Reconcile(...) to compute a deterministic reconciliation plan (networks/volumes/containers/orphans/dependency edges).
  • Add a DAG-based ExecutePlan(...) executor to run operations concurrently while respecting dependencies.
  • Add large reconcile_test.go suite and annotate some e2e tests with TODOs pointing to the new unit coverage.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pkg/compose/reconcile.go New reconciliation planner (ops, dependency edges, recreate/scale logic).
pkg/compose/plan_executor.go New plan executor + execution-state used for resolving service references during execution.
pkg/compose/observed_state.go New “observed state” snapshot builder from Docker engine (containers/networks/volumes/orphans).
pkg/compose/observed_state_test.go Unit tests for observed-state types and plan helpers (Roots, String, ContainerTouched).
pkg/compose/reconcile_test.go Large new unit test suite for reconcile scenarios (scale, recreate, networks/volumes/orphans, dependency edges).
pkg/compose/create.go Switch create flow to InspectState → Reconcile → ExecutePlan; add external network validation; emit events for untouched containers (when plan non-empty).
pkg/compose/convergence.go Remove old convergence/apply logic (scale/recreate/start logic) and related helpers.
pkg/compose/run.go Replace convergence-based service-reference resolution with executionState.
pkg/compose/publish.go Minor strings.Builder formatting change (fmt.Fprintf instead of WriteString(fmt.Sprintf(...))).
pkg/compose/progress.go Comment tweak.
cmd/compose/compose.go Default --profile values now come from env (COMPOSE_PROFILES) via compose-go consts.
pkg/e2e/volumes_test.go Add TODO linking e2e volume recreate coverage to new reconcile unit tests.
pkg/e2e/up_test.go Add TODO linking e2e scale/no-recreate coverage to new reconcile unit tests.
pkg/e2e/scale_test.go Add TODOs linking e2e scale behaviors to reconcile unit tests.
pkg/e2e/recreate_no_deps_test.go Add TODO linking force-recreate/no-deps coverage to reconcile unit tests.
pkg/e2e/orphans_test.go Add TODO linking orphan removal coverage to reconcile unit tests.
pkg/e2e/networks_test.go Add TODOs linking network recreate/change detection coverage to reconcile unit tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Precompute which containers are obsolete to avoid repeated hashing in the sort comparator.
obsolete := make(map[string]bool, len(containers))
for _, ctr := range containers {
recreate, _, _ := needsRecreate(service, ctr, networkIDs, volumeNames, policy)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The obsolete precomputation ignores the error from needsRecreate (recreate, _, _ := ...). If ServiceHash/hash checks fail, the sort will silently treat the container as non-obsolete and proceed, which can mask real errors and produce inconsistent plans. Suggest handling the error (e.g., return it from reconcileServiceContainers) instead of discarding it.

Suggested change
recreate, _, _ := needsRecreate(service, ctr, networkIDs, volumeNames, policy)
recreate, _, err := needsRecreate(service, ctr, networkIDs, volumeNames, policy)
if err != nil {
return err
}

Copilot uses AI. Check for mistakes.
Comment on lines +523 to +557
// topologicalSort returns operations in dependency order using Kahn's algorithm.
func topologicalSort(plan *ReconciliationPlan) []*Operation {
inDegree := make(map[string]int, len(plan.Operations))
for _, op := range plan.Operations {
inDegree[op.ID] = len(op.DependsOn)
}

// Start with nodes that have no dependencies
var queue []string
for _, op := range plan.Operations {
if inDegree[op.ID] == 0 {
queue = append(queue, op.ID)
}
}
sort.Strings(queue) // deterministic ordering

var sorted []*Operation
for len(queue) > 0 {
id := queue[0]
queue = queue[1:]
sorted = append(sorted, plan.Operations[id])

var next []string
for _, depID := range plan.Dependents[id] {
inDegree[depID]--
if inDegree[depID] == 0 {
next = append(next, depID)
}
}
sort.Strings(next)
queue = append(queue, next...)
}

return sorted
}
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topologicalSort (and ExecutePlan in general) doesn't detect cycles or missing dependencies. With a cycle, topologicalSort will return a partial list and ExecutePlan can deadlock waiting for ops whose dependencies can never be satisfied. Suggested fix: validate the plan before execution (e.g., ensure every DependsOn target exists and that a topo-sort returns len(sorted)==len(Operations)), and return an explicit error if a cycle is detected.

Copilot uses AI. Check for mistakes.
Comment on lines +130 to +135
if plan.IsEmpty() {
return nil
}

s.emitUntouchedContainerEvents(project, observed, plan)

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create() returns early when plan.IsEmpty(), which means no progress events are emitted for already-running, up-to-date containers. Previously (via convergence) running containers would still emit runningEvent(...). If callers rely on those events (e.g., UX/progress output consistency), consider calling emitUntouchedContainerEvents even when the plan is empty, or intentionally document/adjust the behavior change.

Suggested change
if plan.IsEmpty() {
return nil
}
s.emitUntouchedContainerEvents(project, observed, plan)
s.emitUntouchedContainerEvents(project, observed, plan)
if plan.IsEmpty() {
return nil
}

Copilot uses AI. Check for mistakes.
Comment on lines 112 to 115
// Validate external networks exist before reconciling
if err := s.validateExternalNetworks(ctx, project, options.Services); err != nil {
return err
}
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create() now validates external networks before reconciling, but there’s no equivalent validation for external volumes. Previously, ensureProjectVolumes called ensureVolume(...) for all project volumes, which returned a clear error when an external volume was missing. With the new plan-based flow, external volumes are skipped by reconcileVolumes, so a missing external volume will only surface later during container create (and potentially with a less specific error). Consider adding a validateExternalVolumes step similar to validateExternalNetworks, limited to volumes referenced by the targeted services.

Copilot uses AI. Check for mistakes.
Comment on lines +437 to +450
// Stop the container (needed before disconnect)
stopID := fmt.Sprintf("stop-container:%s", ctrName)
if _, exists := plan.Operations[stopID]; !exists {
plan.Operations[stopID] = &Operation{
ID: stopID,
Type: OpStopContainer,
ServiceName: ctr.Labels[api.ServiceLabel],
Resource: ctrName,
ContainerOp: &ContainerOperation{
ContainerName: ctrName,
Existing: &ctr,
Timeout: opts.Timeout,
},
Reason: fmt.Sprintf("network %q is being recreated", n.Name),
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop ops created during network recreation don’t populate ContainerOp.Service (and therefore executePlanStopContainer passes service=nil to stopContainer). This skips service-level pre_stop hooks and other service-aware stop behavior, and also risks missing timeout propagation. Consider looking up the service config from project.Services[ctr.Labels[api.ServiceLabel]] and setting ContainerOp.Service (and Timeout) so the stop behaves consistently with other stop paths.

Copilot uses AI. Check for mistakes.
Comment on lines +607 to +637
// Stop the container
stopID := fmt.Sprintf("stop-container:%s", ctrName)
if _, exists := plan.Operations[stopID]; !exists {
plan.Operations[stopID] = &Operation{
ID: stopID,
Type: OpStopContainer,
ServiceName: ctr.Labels[api.ServiceLabel],
Resource: ctrName,
ContainerOp: &ContainerOperation{
ContainerName: ctrName,
Existing: &ctr,
},
Reason: fmt.Sprintf("volume %q is being recreated", v.Name),
}
}

// Remove the container (volumes require container removal, not just stop)
removeID := fmt.Sprintf("remove-container:%s", ctrName)
if _, exists := plan.Operations[removeID]; !exists {
plan.Operations[removeID] = &Operation{
ID: removeID,
Type: OpRemoveContainer,
ServiceName: ctr.Labels[api.ServiceLabel],
Resource: ctrName,
ContainerOp: &ContainerOperation{
ContainerName: ctrName,
Existing: &ctr,
},
DependsOn: []string{stopID},
Reason: fmt.Sprintf("volume %q is being recreated", v.Name),
}
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcileVolumes generates stop/remove container ops for containers using a recreated volume, but it can’t propagate ReconcileOptions.Timeout or service-level stop behavior because it doesn’t take opts and it doesn’t set ContainerOp.Service. This changes semantics vs other stop/remove paths (no pre_stop hooks; daemon default timeout). Suggested fix: pass opts into reconcileVolumes and set ContainerOp.Service (lookup by service label) and Timeout on the generated container ops.

Copilot uses AI. Check for mistakes.
readyOpsByService := indexReadyOps(plan)

for _, op := range plan.Operations {
if (op.Type == OpCreateContainer || op.Type == OpStartContainer) && op.ContainerOp != nil {
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildDependencyEdges applies addOperationDependencies to all OpCreateContainer ops, including the temp container create used in a recreate chain (ContainerOp.Existing != nil). When a network is being recreated, this makes the temp create depend on create-network:<name>, while create-network depends on remove-network -> disconnect-network -> stop-container, and that stop-container is also reused by the recreate chain (often depending on the temp create). This can introduce a dependency cycle and/or force the temp container to be created on a network that is about to be removed (network remove can fail because the temp container wasn't part of findContainersOnNetwork). Suggested fix: only add network/volume/service dependency edges to “final” create/start ops (e.g., OpCreateContainer where Existing == nil, and OpStartContainer), or otherwise explicitly exclude temp recreate-create ops from addOperationDependencies and handle the recreate+network-recreate interaction as a special case.

Suggested change
if (op.Type == OpCreateContainer || op.Type == OpStartContainer) && op.ContainerOp != nil {
if op.ContainerOp == nil {
continue
}
// Only add dependency edges for "final" create operations (fresh containers)
// and for start operations. Temporary recreate creates (Existing != nil)
// should not get network/volume/service dependency edges, to avoid cycles
// and incorrect ordering with network recreation.
if op.Type == OpCreateContainer && op.ContainerOp.Existing != nil {
continue
}
if op.Type == OpCreateContainer || op.Type == OpStartContainer {

Copilot uses AI. Check for mistakes.
Comment on lines +1017 to +1019
if len(op.DependsOn) == 1 {
if dep := plan.Operations[op.DependsOn[0]]; dep != nil && dep.Type == OpRenameContainer {
readyOpsByService[op.ServiceName] = append(readyOpsByService[op.ServiceName], id)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indexReadyOps only treats a service as “ready” when its OpStartContainer has exactly one dependency and that dependency is a rename op. Start ops can legitimately have additional deps (e.g., reconnects when a container is connected to multiple recreated networks, or other future edges), which would make the service disappear from readyOpsByService and break depends_on ordering. Suggest checking for the presence of any OpRenameContainer in DependsOn (or another explicit marker on the op) rather than requiring len(DependsOn) == 1.

Suggested change
if len(op.DependsOn) == 1 {
if dep := plan.Operations[op.DependsOn[0]]; dep != nil && dep.Type == OpRenameContainer {
readyOpsByService[op.ServiceName] = append(readyOpsByService[op.ServiceName], id)
for _, depID := range op.DependsOn {
if dep := plan.Operations[depID]; dep != nil && dep.Type == OpRenameContainer {
readyOpsByService[op.ServiceName] = append(readyOpsByService[op.ServiceName], id)
break

Copilot uses AI. Check for mistakes.
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants