
207 add support for iceberg tables #210

Open
nwheeler81 wants to merge 3 commits into `main` from `207-add-support-for-iceberg-tables`

Conversation

@nwheeler81 (Contributor)

Issue #, if available:

#206 #180 #207 #170

Description of changes:

CQLReplicator Release Notes

What's New

Apache Iceberg Integration

Replaced the Parquet-based head/tail snapshot storage with Apache Iceberg tables registered in the AWS Glue Data Catalog. Each tile now has its own Iceberg table with built-in snapshot versioning, eliminating the manual head/tail swap logic and simplifying change detection via time-travel reads.

  • Per-tile Iceberg tables: {catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots
  • Automatic snapshot versioning — no more head/tail directory management
  • Time-travel reads for delta computation between discovery cycles
  • Backward compatibility: existing Parquet snapshots are automatically migrated to Iceberg on first run
  • Glue Data Catalog integration for table metadata management
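
To make the naming scheme above concrete, here is a minimal sketch of building the per-tile table identifier and the shape of a time-travel read. The function and catalog names are illustrative assumptions, not the project's actual helpers.

```scala
// Illustrative sketch: per-tile Iceberg snapshot table identifier,
// following the {catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots
// pattern described above. Names are assumptions for illustration.
def tileSnapshotTable(catalog: String, keyspace: String, table: String, tile: Int): String =
  s"$catalog.${keyspace}_db.${table}_tile_${tile}_pk_snapshots"

// A delta between discovery cycles can then be computed with Iceberg
// time travel, e.g. in Spark SQL:
//   SELECT * FROM glue_catalog.ks_db.orders_tile_3_pk_snapshots
//   VERSION AS OF <previousSnapshotId>
```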

AWS Glue 5.0 Support

Upgraded from AWS Glue 4.0 to Glue 5.0, bringing Spark 3.5.4, Java 17, and Iceberg 1.7.1.

  • Glue job version bumped from 4.0 to 5.0
  • Spark Cassandra Connector updated from 3.4.1 to 3.5.1
  • Cleanup block migrated from AWS SDK v1 to SDK v2 GlueClient
  • All existing Iceberg, Spark, and S3 operations verified compatible

Multiple Writetime Columns

The `--writetime-column` parameter now accepts multiple comma-separated column names. Because Cassandra's `WRITETIME()` function only accepts a single column, CQLReplicator calls it once per column and takes the maximum of the results. The Iceberg table schema remains unchanged (a single `ts` BIGINT column), and the replication change detection logic is unaffected.
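
A minimal sketch of that approach, with illustrative names rather than the project's actual API:

```scala
// Since WRITETIME() takes exactly one column, project one writetime
// expression per requested column...
def writetimeProjection(columns: Seq[String]): String =
  columns.map(c => s"writetime($c)").mkString(", ")

// ...then take the max of the values that are present. writetime()
// returns null for unwritten columns, modeled here as None.
def maxWritetime(values: Seq[Option[Long]]): Option[Long] =
  values.flatten match {
    case Nil => None
    case ts  => Some(ts.max)
  }
```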

Configurable Log Level

Added --logging-level flag to control Spark's internal logging verbosity. Defaults to ERROR to reduce CloudWatch noise.

cqlreplicator --cmd run --logging-level WARN ...

Valid values: ALL, TRACE, DEBUG, INFO, WARN, ERROR (default), FATAL, OFF. CQLReplicator's own messages via GlueLogger are always visible regardless of this setting.
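
One way the flag could be resolved to a level that Spark accepts is sketched below; the function name and argument plumbing are assumptions, not the actual implementation.

```scala
// Levels Spark's setLogLevel accepts; anything else falls back to ERROR.
val ValidLevels = Set("ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF")

def resolveLogLevel(arg: Option[String]): String =
  arg.map(_.trim.toUpperCase).filter(ValidLevels.contains).getOrElse("ERROR")

// Applied to Spark as: spark.sparkContext.setLogLevel(resolveLogLevel(...))
```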

Glue Data Catalog IAM Permission Validation

The init script now validates Glue Data Catalog permissions (glue:CreateDatabase, glue:GetDatabase, glue:CreateTable, glue:GetTable, glue:UpdateTable, glue:DeleteTable, glue:GetTables) alongside existing S3 and Keyspaces checks.
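
The core of such a check can be sketched as a set difference between the required Glue actions and those known to be granted; this is illustrative only and may differ from what the init script actually does.

```scala
// Glue Data Catalog actions the init script validates, per the list above.
val RequiredGlueActions: Set[String] = Set(
  "glue:CreateDatabase", "glue:GetDatabase", "glue:CreateTable",
  "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables"
)

// Any required action not granted should be reported before the job runs.
def missingGlueActions(granted: Set[String]): Set[String] =
  RequiredGlueActions.diff(granted)
```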

Bug Fixes

  • Fixed AWS SDK v1 Glue client import conflict in --cr cleanup block — replaced local imports with fully qualified class names
  • Fixed ClassCastException for tinyint/smallint columns — Iceberg/Parquet widens these to Integer; now uses Number.byteValue()/shortValue()
  • Fixed markReplicationComplete overwriting offload_status — now only writes load_status and dt_load
  • Fixed recordDiscoverySnapshot resetting load_status on curr→prev rotation — now preserves existing load_status so replication correctly distinguishes delta vs initial load
  • Fixed delta replication path selection — properly checks load_status to handle the race condition where discovery runs faster than replication
  • Fixed null-safe writetime comparison for update detection — Cassandra writetime() returns null for unwritten columns; changed from =!= to not(eqNullSafe) to correctly detect null→value transitions
  • Replaced all `println` calls with `logger.error`, or passed the messages to the `RuntimeException` parent class
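
The tinyint/smallint fix noted above can be sketched as follows: values read back from Iceberg/Parquet arrive widened to `java.lang.Integer`, so narrowing must go through `Number` rather than a direct `Byte`/`Short` cast (which throws `ClassCastException`). The helper names are illustrative, not the project's actual code.

```scala
// Narrow an Integer-widened value back to Cassandra's tinyint/smallint
// via Number, avoiding the ClassCastException a direct cast would throw.
def asByte(v: Any): Byte = v.asInstanceOf[Number].byteValue()
def asShort(v: Any): Short = v.asInstanceOf[Number].shortValue()
```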

Breaking Changes

  • Requires AWS Glue 5.0 — existing Glue 4.0 jobs must be recreated via --cmd init
  • Requires Glue Data Catalog VPC endpoint (com.amazonaws.{region}.glue) with PrivateDnsEnabled=true
  • The --datalake-formats iceberg and Iceberg catalog --conf parameters must be set in the Glue job definition (handled automatically by --cmd init)
  • Existing Parquet snapshots are automatically migrated to Iceberg on first discovery run — no manual intervention needed
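
For reference, the Iceberg catalog `--conf` parameters that `--cmd init` sets typically follow the standard Iceberg-on-Glue Spark configuration below; the catalog name (`glue_catalog`) and warehouse path are illustrative assumptions, not values confirmed by this PR.

```
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<bucket>/<prefix>/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```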

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Nikolai Kolesnikov added 2 commits February 26, 2026 17:35
…lesCount()` in `CQLReplicator.scala` from legacy Parquet/S3 to the per-tile Iceberg table architecture. Reuses existing Iceberg helper functions. Mirror changes to applicable glue-test keyspaces variants.

Labels

  • bug: Something isn't working
  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
