Testcontainers migration by Slach · Pull Request #1336 · Altinity/clickhouse-backup

Slach · 2026-03-24T11:54:00Z

to increase parallelism and flexibility

Signed-off-by: slach <bloodjazman@gmail.com>

…mpose YAMLs, cleanup references

- Rename docker-compose/ -> docker/, keep only scripts (custom_entrypoint.sh, dynamic_settings.sh)
- Delete docker-compose.yml, clickhouse-service.yml, kafka-service.yml, zookeeper-service.yml
- Rename docker_compose_project_dir -> docker_dir, _compose_dir -> _docker_dir in cluster.py
- Remove unused docker_compose/docker_compose_file params from Cluster.__init__
- Add port 7171 conflict detection and logging in _do_down()
- Make --debug flag in run.sh conditional on TESTFLOWS_DEBUG env var
- Update README.md and argparser.py help text

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: slach <bloodjazman@gmail.com>

- Replace fixed 7171:7171 port binding with dynamic host port mapping
- Add Cluster.get_mapped_port() for querying mapped ports at runtime
- api.py uses dynamic backup_api_port from context instead of hardcoded 7171
- Always clean up containers in Cluster.down() (remove local mode skip)
- run.sh: auto-discover suites from regression.py, run in parallel via xargs
- RUN_PARALLEL=1 by default, each suite gets its own Cluster (~11 containers)
- Suite results collected from log files, summary printed at end

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…fixes

- Each regression.py process creates its own configs/backup_<PID>/ dir
- Storage path prefix set to testflows_<PID> for s3/gcs/azblob/ftp/sftp/cos
- Cluster accepts backup_config_dir to mount per-process config into container
- Per-process config dir cleaned up in finally block
- Fixes cloud_storage and api test failures when running with RUN_PARALLEL>1

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…run.sh

- Wire TestContainers into pool factory when USE_TESTCONTAINERS=1
- Add TestMain to clean up containers after test run
- Add cleanupStaleTestContainers() to remove leftover tc_ resources (containers, networks, volumes) from interrupted runs
- Create Docker named volumes before using them in container binds
- Add "azure" network alias for Azurite container (ClickHouse configs reference http://azure:10000)
- Support extra network aliases in startContainer()
- Update run.sh: USE_TESTCONTAINERS=1 is the new default, skips all docker compose up/down logic; legacy compose mode still available

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: slach <bloodjazman@gmail.com>

…race conditions

Each test now creates its own containers in NewTestEnvironment and destroys them in Cleanup. Concurrency is controlled by go test -parallel. Removes go-commons-pool dependency and simplifies TestMain. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

The old run.sh used CLICKHOUSE_VERSION == 2* to select the advanced compose file, which included dynamic_settings.sh (storage policies). CH 20.3+ needs hot_and_cold policy for TestHardlinksExistsFiles. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ough Simplify build.yaml testflows step to call ./test/testflows/run.sh. run.sh now handles tfs report generation, coverage formatting, and permission fixes. Adds RUN_PARALLEL=3 and DEBUG/NO_COLORS env vars. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Avoids slow inline pull that gets mixed into the SAS token output. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # test/testflows/.gitignore

Binary is already built and downloaded as artifact in CI. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ilure log.Fatal kills the entire test process, including all parallel tests. When a ClickHouse container restarts, port bindings temporarily disappear. Now returns error to let connectWithWait retry. Also increased retries from 10 to 30 with 1s sleep to tolerate container restarts. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
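The change from log.Fatal to a returned error can be illustrated with a minimal sketch. The real connectWithWait signature is not shown on this page, so `retryConnect`, `errNoPort`, and the fake `connect` closure below are hypothetical stand-ins; the demo uses a 10ms delay instead of the 1s sleep from the commit so it finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoPort is an illustrative sentinel for a temporarily missing port binding.
var errNoPort = errors.New("port binding not found")

// retryConnect calls connect up to attempts times and returns the last error
// instead of terminating the process, so other parallel tests keep running.
func retryConnect(attempts int, delay time.Duration, connect func() error) error {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		if lastErr = connect(); lastErr == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay) // give the restarting container time to publish its ports
		}
	}
	return fmt.Errorf("connect failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	// Hypothetical connect: fails twice while the container restarts, then succeeds.
	connect := func() error {
		calls++
		if calls < 3 {
			return errNoPort
		}
		return nil
	}
	err := retryConnect(30, 10*time.Millisecond, connect)
	fmt.Println(err, calls) // prints "<nil> 3"
}
```

Because the error is wrapped with %w, a caller can still inspect the underlying cause with errors.Is.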

Shows container status, health, exit code, OOMKilled flag, and last 50 lines of logs when a container fails to become healthy. Helps diagnose why ClickHouse or other containers fail to start in CI. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…to 300s ClickHouse 26.1 needs 2+ minutes to initialize S3/Azure object storage disks. With StartPeriod=2s Docker marks the container unhealthy before ClickHouse finishes startup. Increase StartPeriod so health failures during init don't count as retries, and wait up to 5 minutes total. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

coveralls · 2026-03-26T06:55:17Z

Pull Request Test Coverage Report for Build 23679338147

Details

11 of 13 (84.62%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall first build on testcontainers_migration at 67.396%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/backup/restore.go	11	13	84.62%

Totals
Change from base Build 23675471871:	67.4%
Covered Lines:	11952
Relevant Lines:	17734

💛 - Coveralls

…kip redundant pulls

- Start all independent support services (sshd, ftp, minio, gcs, azure, zookeeper, mysql, pgsql) in parallel goroutines instead of sequentially
- Wait for all health checks in parallel
- Pre-pull all Docker images once in TestMain before tests start, so parallel tests don't race to pull the same images
- Skip Docker pull if image already exists locally (ImageInspect check)
- Add sync.Mutex to protect concurrent map writes during parallel startup
- Enable TEST_LOG_LEVEL=debug in CI for better diagnostics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
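The parallel startup with a mutex-guarded map might look roughly like the sketch below. `startAll` and the `start` callback are hypothetical names, and the "tc_" + name IDs are placeholders rather than real container IDs; the actual test helper in the PR is not shown here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// startAll runs start for every service in its own goroutine and stores the
// returned ID in a shared map. The mutex guards the map and the error slot,
// since concurrent map writes are a runtime fault in Go.
func startAll(services []string, start func(name string) (string, error)) (map[string]string, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		ids      = make(map[string]string, len(services))
		firstErr error
	)
	for _, name := range services {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			id, err := start(name)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("start %s: %w", name, err)
				}
				return
			}
			ids[name] = id
		}(name)
	}
	wg.Wait()
	return ids, firstErr
}

func main() {
	services := []string{"sshd", "ftp", "minio", "gcs", "azure", "zookeeper", "mysql", "pgsql"}
	ids, err := startAll(services, func(name string) (string, error) {
		return "tc_" + name, nil // stand-in for a real container start
	})
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(ids))
	for n := range ids {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println(len(names), names)
}
```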

Creating containers per-test added ~40-80 minutes of overhead for 41 tests. Now pre-creates RUN_PARALLEL environments in TestMain and reuses them via a buffered channel pool. Tests acquire env from pool, clean shared state (disk_s3, backups, rsync, restic, kopia) in Cleanup, and return to pool. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
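A buffered channel works as a simple pool because receiving blocks while every environment is in use. The sketch below assumes hypothetical `testEnv` and `envPool` types; the real environment type and its cleanup steps live in the integration tests and are not reproduced here.

```go
package main

import "fmt"

// testEnv stands in for one pre-created set of containers.
type testEnv struct{ id int }

// envPool hands out pre-created environments through a buffered channel.
type envPool struct{ ch chan *testEnv }

// newEnvPool creates size environments up front, as TestMain would.
func newEnvPool(size int) *envPool {
	p := &envPool{ch: make(chan *testEnv, size)}
	for i := 0; i < size; i++ {
		p.ch <- &testEnv{id: i}
	}
	return p
}

// acquire blocks until an environment is free.
func (p *envPool) acquire() *testEnv { return <-p.ch }

// release cleans shared state, then puts the environment back for the next test.
func (p *envPool) release(env *testEnv, cleanup func(*testEnv)) {
	cleanup(env)
	p.ch <- env
}

func main() {
	pool := newEnvPool(2) // size would come from RUN_PARALLEL
	env := pool.acquire()
	fmt.Println("free after acquire:", len(pool.ch)) // prints "free after acquire: 1"
	pool.release(env, func(e *testEnv) {
		// a real cleanup would remove backups and other shared state here
	})
	fmt.Println("free after release:", len(pool.ch)) // prints "free after release: 2"
}
```

In a test, acquire would typically be paired with t.Cleanup so the environment returns to the pool even when the test fails.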

RUN_TESTS='*' in CI was treated as a specific filter, bypassing the parallel xargs branch. Now '*' falls through to the parallel suite discovery path. Also guard source .env for CI where file doesn't exist. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

restore uses ON CLUSTER for CREATE TABLE, so DROP DATABASE without ON CLUSTER leaves pending DDL tasks in ZooKeeper that can recreate tables after the database is dropped. This fixes TestSkipEmptyTables flakiness where empty_table reappeared after being skipped. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

RBAC restore does SYSTEM SHUTDOWN internally. Without an explicit container restart, the immediate reconnect hits an unready ClickHouse. Replace commented-out compose restart with tc.RestartContainer. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

After SYSTEM SHUTDOWN, ClickHouse briefly accepts TCP connections while shutting down. Connect+Ping succeeds but the next query gets EOF. Add 5s delay for shutdown to complete and verify with SELECT 1 after reconnect to ensure ClickHouse is truly ready, not just accepting TCP. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…s on test failure

- Increase reconnect timeout from 180s to 300s (CH 23.3 with S3/Azure disks needs ~3.5 min to restart)
- Use per-query 5s timeout for SELECT 1 instead of outer closeCtx which may be nearly expired
- Increase retry count from 60 to 120
- Dump all container state + last 50 log lines when a test fails (DumpAllContainerLogs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
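The per-query timeout idea can be sketched as follows. Each probe gets a fresh context with its own deadline rather than inheriting an outer context that may be about to expire. `waitReady`, `flakyProbe`, and `errEOF` are hypothetical names; the probe stands in for running SELECT 1 against ClickHouse.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errEOF is an illustrative error for a server that accepts TCP but drops the query.
var errEOF = errors.New("EOF")

// waitReady retries probe until it succeeds or retries run out.
// Every attempt receives its own short-lived context.
func waitReady(retries int, perQuery, pause time.Duration, probe func(context.Context) error) error {
	var err error
	for i := 0; i < retries; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), perQuery)
		err = probe(ctx)
		cancel()
		if err == nil {
			return nil
		}
		time.Sleep(pause)
	}
	return fmt.Errorf("not ready after %d probes: %w", retries, err)
}

// flakyProbe returns a probe that fails for the first failures calls,
// imitating a server that is still restarting.
func flakyProbe(failures int) func(context.Context) error {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errEOF
		}
		return ctx.Err() // nil while the per-query deadline has not passed
	}
}

func main() {
	// 120 retries and a 5s per-query timeout mirror the numbers in the commit;
	// the 10ms pause is shortened for the demo.
	err := waitReady(120, 5*time.Second, 10*time.Millisecond, flakyProbe(3))
	fmt.Println(err) // prints "<nil>"
}
```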

ClickHouse may still be loading RBAC objects after restart, causing EOF on first query. Add retry loop with reconnect for SHOW queries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Named Docker volumes have significant overhead for file-heavy operations. Replace with host bind-mount directories in /tmp for native filesystem speed. This fixes TestGCS timeout (67 min -> should be ~40 min like on master). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ClickHouse creates files as root inside the container, so the host Go process cannot delete them. Clean shared dirs via docker exec before stopping containers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

On shared environments parallel tests add IO pressure to minio, causing cached list to occasionally be slower than uncached. Retry cached measurement up to 3 times before failing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
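One way to express "retry the cached measurement up to 3 times" is shown below. This is only a sketch: `cachedFasterWithin` and `replay` are hypothetical helpers, and the durations are made-up sample values, not measurements from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// cachedFasterWithin takes up to attempts cached measurements and reports
// whether any of them beats the uncached duration.
func cachedFasterWithin(attempts int, uncached time.Duration, measureCached func() time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if measureCached() < uncached {
			return true
		}
	}
	return false
}

// replay returns a measurement function that yields the given samples in
// order and repeats the last one once they are exhausted.
func replay(samples ...time.Duration) func() time.Duration {
	i := 0
	return func() time.Duration {
		d := samples[i]
		if i < len(samples)-1 {
			i++
		}
		return d
	}
}

func main() {
	uncached := 100 * time.Millisecond
	// First cached run is slowed by IO pressure, second one is fast.
	ok := cachedFasterWithin(3, uncached, replay(120*time.Millisecond, 40*time.Millisecond))
	fmt.Println(ok) // prints "true"
}
```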

- Replace CH 26.1 with 26.3 in CI matrix (26.1 has known BlobKillerThread bug)
- Update default CLICKHOUSE_VERSION to 26.3 in run.sh scripts
- Increase go test timeout from 90m to 120m (TestGCS needs ~50 min)
- Add fail-fast: false to CI matrix to avoid cascading cancellations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…vior Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…leanup

- TestNamedCollections: drop database before named collection (CH 26.3 forbids DROP NAMED COLLECTION while tables reference it)
- checkObjectStorageIsEmpty: call SYSTEM WAIT BLOBS CLEANUP before checking minio (CH 26.2+ async BlobKillerThread leaves disk_s3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…g grep Results section only checked for "Failing" in log files, missing suites that crashed with exit code 1 (e.g. missing docker image). Now tracks exit code via .rc files and prints stdout on failure for CI visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…use exit codes only

- Start all N test environments concurrently instead of sequentially (~19s vs ~77s for 4 envs)
- Stop all environments and their containers in parallel on teardown
- Remove grep "Failing" fallback from testflows/run.sh, rely solely on exit codes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix transient DNS/network failures (exit code 6) when downloading yq, restic, and kopia inside containers during CI. Add --retry 5 --retry-delay 5 --retry-connrefused to all curl commands. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rtPeriod

- Revert shared volumes from host bind-mount directories back to Docker named volumes (matching working commit cdb05d3). Bind mounts + rm -rf /var/lib/clickhouse was destroying ClickHouse data.
- Fix CUR_DIR fallback: go test already sets cwd to test/integration, so don't append test/integration again.
- Restore ClickHouse healthcheck StartPeriod to 120s (was incorrectly reduced to 10s).
- Keep parallelized env startup/shutdown and container stop improvements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…estcontainers_migration # Conflicts: # go.mod # go.sum

…eLocalDownloadRestore Race condition: async download/restore API returns immediately, but fixed sleep 2/sleep 8 was insufficient — restore could start before download's pid file was cleaned up via defer. Now polls /backup/status by operation_id until completion, then waits 1s for defer cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
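Replacing a fixed sleep with polling could be sketched like this. The test in the PR polls /backup/status from a shell script; the Go version below only illustrates the control flow. `pollUntilDone`, `scripted`, and `errTimeout` are hypothetical, and the status strings "in progress" and "success" are assumptions about the API's values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the operation never leaves the in-progress state.
var errTimeout = errors.New("operation still in progress")

// pollUntilDone asks status repeatedly until it reports something other than
// "in progress" (an assumed status value) or the attempts run out.
func pollUntilDone(attempts int, interval time.Duration, status func() (string, error)) (string, error) {
	for i := 0; i < attempts; i++ {
		s, err := status()
		if err != nil {
			return "", err
		}
		if s != "in progress" {
			return s, nil
		}
		time.Sleep(interval)
	}
	return "", errTimeout
}

// scripted returns a status function that replays the given states in order
// and repeats the last one, standing in for an HTTP call to the status endpoint.
func scripted(states ...string) func() (string, error) {
	i := 0
	return func() (string, error) {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s, nil
	}
}

func main() {
	s, err := pollUntilDone(10, 10*time.Millisecond, scripted("in progress", "in progress", "success"))
	fmt.Println(s, err) // prints "success <nil>"
}
```

After the poll reports completion, the commit still waits 1s so the deferred pid file cleanup can finish before the next operation starts.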

The query checking s3 parts used `name='table_s3'` (part name column) instead of `table='table_s3'` (table name column), making the assertion always pass regardless of whether data was actually restored. Also reset the variable before reuse to prevent stale values from prior query. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: slach <bloodjazman@gmail.com>

SYSTEM WAIT BLOBS CLEANUP is only available in CH 26.3+, not 26.2. checkObjectStorageIsEmpty is called before runMainIntegrationScenario which means env.ch is nil. Connect/disconnect around the query. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…in testAPIDeleteLocalDownloadRestore The /backup/status?operationid= endpoint returns a single JSON object (via sendJSONEachRow), not a JSON array. Changed jq from .[0].status to .status. Also narrowed error assertion to match "status":"error" instead of bare "error" which false-matched bash -xe trace output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
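Both fixes in this commit have a direct analogue in Go, shown here as a hedged sketch rather than the test's actual shell code. `parseStatus` decodes a single JSON object (the counterpart of jq `.status`), and `hasErrorStatus` matches only the `"status":"error"` field. The struct and function names are hypothetical, and the sample payloads contain only the field discussed above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// opStatus models only the field the test inspects.
type opStatus struct {
	Status string `json:"status"`
}

// parseStatus decodes one JSON object. Passing a JSON array returns an error,
// which is the mismatch the original .[0].status expression ran into.
func parseStatus(body string) (string, error) {
	var st opStatus
	if err := json.Unmarshal([]byte(body), &st); err != nil {
		return "", err
	}
	return st.Status, nil
}

// hasErrorStatus matches the JSON field rather than any occurrence of "error".
func hasErrorStatus(output string) bool {
	return strings.Contains(output, `"status":"error"`)
}

func main() {
	s, err := parseStatus(`{"status":"success"}`)
	fmt.Println(s, err) // prints "success <nil>"

	// Shell trace output can mention "error" even when the operation succeeded.
	trace := "+ grep -v error\n" + `{"status":"success"}`
	fmt.Println(strings.Contains(trace, "error"), hasErrorStatus(trace)) // prints "true false"
}
```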

Slach and others added 8 commits March 23, 2026 12:21

[wip] testcontainers migration, testflows broken

2188452

Signed-off-by: slach <bloodjazman@gmail.com>

Merge branch 'master' into testcontainers_migration

e1c77ed

update go.mod

ff54dfa

Signed-off-by: slach <bloodjazman@gmail.com>

fix TestAlibabaOverS3, add _logs_ to .gitignore

8680566

Signed-off-by: slach <bloodjazman@gmail.com>

Slach added this to the 2.7.0 milestone Mar 24, 2026

Slach and others added 16 commits March 24, 2026 16:55

remove docker compose from build.yaml

31bb589

add pullImageIfNeeded

61e29bc

add CGO_ENABLED in integration test build to check tests is not have …

1b4f0c1

…race conditions

Merge branch 'master' into testcontainers_migration

8cc5a26

fix actions versions

14f5c63

pull azure-cli image before docker run in TestAzure

248e5ac

Avoids slow inline pull that gets mixed into the SAS token output. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

add timing to testflows run.sh: per-suite and total elapsed time

98c073f

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'master' into testcontainers_migration

32dec57

# Conflicts: # test/testflows/.gitignore

skip make build in testflows run.sh when running in GitHub Actions

679e1d3

Binary is already built and downloaded as artifact in CI. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

set ClickHouse healthcheck StartPeriod to 10s

7986fc3

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Slach and others added 4 commits March 26, 2026 11:56

Slach and others added 15 commits March 26, 2026 18:45

fix TestRBAC: retry SHOW queries after container restart

45d22e4

ClickHouse may still be loading RBAC objects after restart, causing EOF on first query. Add retry loop with reconnect for SHOW queries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix permission denied on shared dir cleanup

974b1eb

ClickHouse creates files as root inside the container, so the host Go process cannot delete them. Clean shared dirs via docker exec before stopping containers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

reduce container stop timeout to 1s matching master docker rm -f beha…

bf61b89

…vior Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

trigger CI rebuild after cancelling hung GHA runs

64a3903

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

retrigger CI

08835c0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Slach closed this Mar 27, 2026

Slach reopened this Mar 27, 2026

Slach and others added 7 commits March 27, 2026 22:49

Merge branch 'master' of github.com:Altinity/clickhouse-backup into t…

f8d6000

…estcontainers_migration # Conflicts: # go.mod # go.sum

fix go.sum conflicts

a1a28a9

Signed-off-by: slach <bloodjazman@gmail.com>

Slach merged commit db4ccf5 into master Mar 28, 2026
27 checks passed

Slach mentioned this pull request Mar 28, 2026

weird list_duration in logs #1337

Closed

Slach merged 50 commits into master from testcontainers_migration
