
Testcontainers migration#1336

Merged: Slach merged 50 commits into master from testcontainers_migration on Mar 28, 2026

Conversation

Slach (Collaborator) commented Mar 24, 2026

Migrate the test suites to Testcontainers to increase parallelism and flexibility.

Slach and others added 8 commits March 23, 2026 12:21
Signed-off-by: slach <bloodjazman@gmail.com>
…mpose YAMLs, cleanup references

- Rename docker-compose/ -> docker/, keep only scripts (custom_entrypoint.sh, dynamic_settings.sh)
- Delete docker-compose.yml, clickhouse-service.yml, kafka-service.yml, zookeeper-service.yml
- Rename docker_compose_project_dir -> docker_dir, _compose_dir -> _docker_dir in cluster.py
- Remove unused docker_compose/docker_compose_file params from Cluster.__init__
- Add port 7171 conflict detection and logging in _do_down()
- Make --debug flag in run.sh conditional on TESTFLOWS_DEBUG env var
- Update README.md and argparser.py help text

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: slach <bloodjazman@gmail.com>
- Replace fixed 7171:7171 port binding with dynamic host port mapping
- Add Cluster.get_mapped_port() for querying mapped ports at runtime
- api.py uses dynamic backup_api_port from context instead of hardcoded 7171
- Always clean up containers in Cluster.down() (remove local mode skip)
- run.sh: auto-discover suites from regression.py, run in parallel via xargs
- RUN_PARALLEL=1 by default, each suite gets its own Cluster (~11 containers)
- Suite results collected from log files, summary printed at end

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…fixes

- Each regression.py process creates its own configs/backup_<PID>/ dir
- Storage path prefix set to testflows_<PID> for s3/gcs/azblob/ftp/sftp/cos
- Cluster accepts backup_config_dir to mount per-process config into container
- Per-process config dir cleaned up in finally block
- Fixes cloud_storage and api test failures when running with RUN_PARALLEL>1

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…run.sh

- Wire TestContainers into pool factory when USE_TESTCONTAINERS=1
- Add TestMain to clean up containers after test run
- Add cleanupStaleTestContainers() to remove leftover tc_ resources
  (containers, networks, volumes) from interrupted runs
- Create Docker named volumes before using them in container binds
- Add "azure" network alias for Azurite container (ClickHouse configs
  reference http://azure:10000)
- Support extra network aliases in startContainer()
- Update run.sh: USE_TESTCONTAINERS=1 is the new default, skips all
  docker compose up/down logic; legacy compose mode still available

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: slach <bloodjazman@gmail.com>
@Slach Slach added this to the 2.7.0 milestone Mar 24, 2026
Slach and others added 16 commits March 24, 2026 16:55
Each test now creates its own containers in NewTestEnvironment and
destroys them in Cleanup. Concurrency is controlled by go test -parallel.
Removes go-commons-pool dependency and simplifies TestMain.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The old run.sh used CLICKHOUSE_VERSION == 2* to select the advanced
compose file, which included dynamic_settings.sh (storage policies).
CH 20.3+ needs hot_and_cold policy for TestHardlinksExistsFiles.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ough

Simplify build.yaml testflows step to call ./test/testflows/run.sh.
run.sh now handles tfs report generation, coverage formatting, and
permission fixes. Adds RUN_PARALLEL=3 and DEBUG/NO_COLORS env vars.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Avoids slow inline pull that gets mixed into the SAS token output.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	test/testflows/.gitignore
Binary is already built and downloaded as artifact in CI.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ilure

log.Fatal kills the entire test process, including all parallel tests.
When a ClickHouse container restarts, port bindings temporarily disappear.
Now returns error to let connectWithWait retry. Also increased retries
from 10 to 30 with 1s sleep to tolerate container restarts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Shows container status, health, exit code, OOMKilled flag, and last
50 lines of logs when a container fails to become healthy. Helps
diagnose why ClickHouse or other containers fail to start in CI.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…to 300s

ClickHouse 26.1 needs 2+ minutes to initialize S3/Azure object storage
disks. With StartPeriod=2s Docker marks the container unhealthy before
ClickHouse finishes startup. Increase StartPeriod so health failures
during init don't count as retries, and wait up to 5 minutes total.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

coveralls commented Mar 26, 2026

Pull Request Test Coverage Report for Build 23679338147

Details

  • 11 of 13 (84.62%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on testcontainers_migration at 67.396%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/backup/restore.go | 11 | 13 | 84.62% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 23675471871: | 67.4% |
| Covered Lines: | 11952 |
| Relevant Lines: | 17734 |

💛 - Coveralls

Slach and others added 4 commits March 26, 2026 11:56
…kip redundant pulls

- Start all independent support services (sshd, ftp, minio, gcs, azure, zookeeper,
  mysql, pgsql) in parallel goroutines instead of sequentially
- Wait for all health checks in parallel
- Pre-pull all Docker images once in TestMain before tests start, so parallel tests
  don't race to pull the same images
- Skip Docker pull if image already exists locally (ImageInspect check)
- Add sync.Mutex to protect concurrent map writes during parallel startup
- Enable TEST_LOG_LEVEL=debug in CI for better diagnostics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Creating containers per-test added ~40-80 minutes of overhead for 41 tests.
Now pre-creates RUN_PARALLEL environments in TestMain and reuses them via
a buffered channel pool. Tests acquire env from pool, clean shared state
(disk_s3, backups, rsync, restic, kopia) in Cleanup, and return to pool.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
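
A buffered channel used as an environment pool, as this commit describes, can be sketched like so. The `env` and `envPool` types are placeholders invented for the example; the real environments hold containers and the real cleanup resets shared state before release.

```go
package main

import "fmt"

// env is a placeholder for a pre-created set of containers.
type env struct{ id int }

// envPool hands out pre-created environments; the buffered channel
// blocks Acquire when every environment is in use.
type envPool struct{ ch chan *env }

// newEnvPool pre-creates size environments and parks them in the channel.
func newEnvPool(size int) *envPool {
	p := &envPool{ch: make(chan *env, size)}
	for i := 0; i < size; i++ {
		p.ch <- &env{id: i}
	}
	return p
}

// Acquire blocks until an environment is free.
func (p *envPool) Acquire() *env { return <-p.ch }

// Release returns an environment to the pool for the next test.
func (p *envPool) Release(e *env) { p.ch <- e }

func main() {
	pool := newEnvPool(2) // RUN_PARALLEL would set the size
	e := pool.Acquire()
	fmt.Println("running test on env", e.id)
	// ... clean shared state here before returning the env ...
	pool.Release(e)
	fmt.Println("idle envs:", len(pool.ch))
}
```

The channel capacity doubles as the concurrency limit: a test that cannot get an environment simply waits until another test releases one.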
RUN_TESTS='*' in CI was treated as a specific filter, bypassing the
parallel xargs branch. Now '*' falls through to the parallel suite
discovery path. Also guard source .env for CI where file doesn't exist.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
restore uses ON CLUSTER for CREATE TABLE, so DROP DATABASE without
ON CLUSTER leaves pending DDL tasks in ZooKeeper that can recreate
tables after the database is dropped. This fixes TestSkipEmptyTables
flakiness where empty_table reappeared after being skipped.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
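
As an illustration of the fix, a test helper could build the drop statement with an optional `ON CLUSTER` clause. `dropDatabaseSQL` is a hypothetical helper and the `{cluster}` macro is an assumed cluster name; only the need for `ON CLUSTER` comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// dropDatabaseSQL builds a DROP DATABASE statement. When cluster is non-empty
// the statement is distributed with ON CLUSTER, matching how the tables were
// created during restore.
func dropDatabaseSQL(database, cluster string) string {
	var b strings.Builder
	b.WriteString("DROP DATABASE IF EXISTS `" + database + "`")
	if cluster != "" {
		b.WriteString(" ON CLUSTER '" + cluster + "'")
	}
	b.WriteString(" SYNC")
	return b.String()
}

func main() {
	fmt.Println(dropDatabaseSQL("test_db", "{cluster}"))
	fmt.Println(dropDatabaseSQL("test_db", ""))
}
```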
Slach and others added 15 commits March 26, 2026 18:45
RBAC restore does SYSTEM SHUTDOWN internally. Without an explicit
container restart, the immediate reconnect hits an unready ClickHouse.
Replace commented-out compose restart with tc.RestartContainer.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
After SYSTEM SHUTDOWN, ClickHouse briefly accepts TCP connections while
shutting down. Connect+Ping succeeds but the next query gets EOF.
Add 5s delay for shutdown to complete and verify with SELECT 1 after
reconnect to ensure ClickHouse is truly ready, not just accepting TCP.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…s on test failure

- Increase reconnect timeout from 180s to 300s (CH 23.3 with S3/Azure disks needs ~3.5 min to restart)
- Use per-query 5s timeout for SELECT 1 instead of outer closeCtx which may be nearly expired
- Increase retry count from 60 to 120
- Dump all container state + last 50 log lines when a test fails (DumpAllContainerLogs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ClickHouse may still be loading RBAC objects after restart, causing
EOF on first query. Add retry loop with reconnect for SHOW queries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Named Docker volumes have significant overhead for file-heavy operations.
Replace with host bind-mount directories in /tmp for native filesystem speed.
This fixes TestGCS timeout (67 min -> should be ~40 min like on master).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ClickHouse creates files as root inside the container, so the host
Go process cannot delete them. Clean shared dirs via docker exec
before stopping containers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On shared environments parallel tests add IO pressure to minio,
causing cached list to occasionally be slower than uncached.
Retry cached measurement up to 3 times before failing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace CH 26.1 with 26.3 in CI matrix (26.1 has known BlobKillerThread bug)
- Update default CLICKHOUSE_VERSION to 26.3 in run.sh scripts
- Increase go test timeout from 90m to 120m (TestGCS needs ~50 min)
- Add fail-fast: false to CI matrix to avoid cascading cancellations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…vior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…leanup

- TestNamedCollections: drop database before named collection (CH 26.3
  forbids DROP NAMED COLLECTION while tables reference it)
- checkObjectStorageIsEmpty: call SYSTEM WAIT BLOBS CLEANUP before
  checking minio (CH 26.2+ async BlobKillerThread leaves disk_s3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g grep

Results section only checked for "Failing" in log files, missing suites
that crashed with exit code 1 (e.g. missing docker image). Now tracks
exit code via .rc files and prints stdout on failure for CI visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…use exit codes only

- Start all N test environments concurrently instead of sequentially (~19s vs ~77s for 4 envs)
- Stop all environments and their containers in parallel on teardown
- Remove grep "Failing" fallback from testflows/run.sh, rely solely on exit codes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix transient DNS/network failures (exit code 6) when downloading yq, restic,
and kopia inside containers during CI. Add --retry 5 --retry-delay 5
--retry-connrefused to all curl commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Slach Slach closed this Mar 27, 2026
@Slach Slach reopened this Mar 27, 2026
Slach and others added 7 commits March 27, 2026 22:49
…rtPeriod

- Revert shared volumes from host bind-mount directories back to Docker
  named volumes (matching working commit cdb05d3). Bind mounts + rm -rf
  /var/lib/clickhouse was destroying ClickHouse data.
- Fix CUR_DIR fallback: go test already sets cwd to test/integration,
  so don't append test/integration again.
- Restore ClickHouse healthcheck StartPeriod to 120s (was incorrectly
  reduced to 10s).
- Keep parallelized env startup/shutdown and container stop improvements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…estcontainers_migration

# Conflicts:
#	go.mod
#	go.sum
…eLocalDownloadRestore

Race condition: async download/restore API returns immediately, but fixed
sleep 2/sleep 8 was insufficient — restore could start before download's
pid file was cleaned up via defer. Now polls /backup/status by operation_id
until completion, then waits 1s for defer cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The query checking s3 parts used `name='table_s3'` (part name column)
instead of `table='table_s3'` (table name column), making the assertion
always pass regardless of whether data was actually restored. Also reset
the variable before reuse to prevent stale values from prior query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: slach <bloodjazman@gmail.com>
SYSTEM WAIT BLOBS CLEANUP is only available in CH 26.3+, not 26.2.
checkObjectStorageIsEmpty is called before runMainIntegrationScenario
which means env.ch is nil. Connect/disconnect around the query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
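
Gating a statement on the server version, as needed here for `SYSTEM WAIT BLOBS CLEANUP`, can be sketched with a small comparison helper. `versionAtLeast` is hypothetical, and treating a non-numeric tag such as "head" as the newest version is an assumption of this example, not something the commit states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// versionAtLeast reports whether a dotted version such as "26.3.1.1" is at
// least major.minor. Non-numeric versions (for example "head") count as newest.
func versionAtLeast(version string, major, minor int) bool {
	parts := strings.SplitN(version, ".", 3)
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return true
	}
	gotMinor := 0
	if len(parts) > 1 {
		if gotMinor, err = strconv.Atoi(parts[1]); err != nil {
			return true
		}
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

func main() {
	for _, v := range []string{"25.8", "26.2", "26.3", "head"} {
		// Only issue the statement where the commit says it is available (26.3+).
		fmt.Println(v, versionAtLeast(v, 26, 3))
	}
}
```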
…in testAPIDeleteLocalDownloadRestore

The /backup/status?operationid= endpoint returns a single JSON object
(via sendJSONEachRow), not a JSON array. Changed jq from .[0].status
to .status. Also narrowed error assertion to match "status":"error"
instead of bare "error" which false-matched bash -xe trace output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
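
The difference between `.[0].status` and `.status` is the difference between decoding an array and decoding a single object. The sketch below shows the object form in Go; `operationStatus` models only the one field discussed in the commit, and the sample payloads are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operationStatus holds the one field this sketch reads from the response.
type operationStatus struct {
	Status string `json:"status"`
}

// parseStatus decodes a single JSON object, the jq equivalent of `.status`.
func parseStatus(body []byte) (string, error) {
	var s operationStatus
	if err := json.Unmarshal(body, &s); err != nil {
		return "", err
	}
	return s.Status, nil
}

func main() {
	// Hypothetical response bodies.
	status, err := parseStatus([]byte(`{"status":"success"}`))
	fmt.Println(status, err)

	// Treating the body as an object fails loudly if it is actually an array.
	_, err = parseStatus([]byte(`[{"status":"success"}]`))
	fmt.Println(err != nil)
}
```

Comparing the parsed field against "error" also avoids the false match the commit mentions, where the bare word appeared in shell trace output.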
@Slach Slach merged commit db4ccf5 into master Mar 28, 2026
27 checks passed
2 participants