added 2 commits
February 26, 2026 17:35
…berg tables registered in the AWS Glue Data Catalog. Each tile now has its own Iceberg table with built-in snapshot versioning, eliminating the manual head/tail swap logic and simplifying change detection via time-travel reads.
… accept multiple comma-separated column names. The core insight is that Cassandra's `WRITETIME()` function only accepts a single column, so we must call it separately for each column and compute the max. The Iceberg table schema remains unchanged (single `ts` BIGINT column), and the replication change detection logic is unaffected.
Feb 27, 2026
…lesCount()` in `CQLReplicator.scala` from legacy Parquet/S3 to the per-tile Iceberg table architecture. Reuses existing Iceberg helper functions. Mirror changes to applicable glue-test keyspaces variants.
Issue #, if available:
#206 #180 #207 #170
Description of changes:
CQLReplicator Release Notes
What's New
Apache Iceberg Integration
Replaced the Parquet-based head/tail snapshot storage with Apache Iceberg tables registered in the AWS Glue Data Catalog. Each tile now has its own Iceberg table with built-in snapshot versioning, eliminating the manual head/tail swap logic and simplifying change detection via time-travel reads.
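As a rough illustration of how a per-tile table might be addressed and read at an earlier snapshot, the sketch below fills in the naming pattern `{catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots` and builds a query using Iceberg's Spark SQL `VERSION AS OF` time-travel clause. The object and function names (`TileSnapshots`, `tableName`, `timeTravelQuery`) and the sample catalog, keyspace, and table values are hypothetical and are not taken from `CQLReplicator.scala`.

```scala
object TileSnapshots {
  // Hypothetical helper: fills in the naming pattern from the release notes,
  // {catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots
  def tableName(catalog: String, keyspace: String, table: String, tile: Int): String =
    s"$catalog.${keyspace}_db.${table}_tile_${tile}_pk_snapshots"

  // Builds a time-travel read using Iceberg's Spark SQL syntax,
  // which selects the table contents as of a given snapshot ID.
  def timeTravelQuery(tableName: String, snapshotId: Long): String =
    s"SELECT * FROM $tableName VERSION AS OF $snapshotId"
}
```

In a Glue job, the resulting string would presumably be passed to `spark.sql(...)`, and reads of two snapshots could then be compared to find changed keys. For example, `TileSnapshots.tableName("glue_catalog", "ks", "orders", 3)` yields `glue_catalog.ks_db.orders_tile_3_pk_snapshots`.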
Per-tile table name: `{catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots`
AWS Glue 5.0 Support
Upgraded from AWS Glue 4.0 to Glue 5.0, bringing Spark 3.5.4, Java 17, and Iceberg 1.7.1.
Multiple writetime columns
The `--writetime-column` parameter now accepts multiple comma-separated column names. Cassandra's `WRITETIME()` function accepts only a single column, so it is called separately for each column and the maximum of the results is used. The Iceberg table schema remains unchanged (a single `ts` BIGINT column), and the replication change detection logic is unaffected.
Configurable Log Level
Added a `--logging-level` flag to control Spark's internal logging verbosity. It defaults to `ERROR` to reduce CloudWatch noise.
Valid values: `ALL`, `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR` (default), `FATAL`, `OFF`. CQLReplicator's own messages via GlueLogger are always visible regardless of this setting.
Glue Data Catalog IAM Permission Validation
The init script now validates Glue Data Catalog permissions (`glue:CreateDatabase`, `glue:GetDatabase`, `glue:CreateTable`, `glue:GetTable`, `glue:UpdateTable`, `glue:DeleteTable`, `glue:GetTables`) alongside the existing S3 and Keyspaces checks.
Bug Fixes
- `--cr` cleanup block: replaced local imports with fully qualified class names.
- `ClassCastException` for `tinyint`/`smallint` columns: Iceberg/Parquet widens these to `Integer`; the code now uses `Number.byteValue()`/`shortValue()`.
- `markReplicationComplete` overwriting `offload_status`: it now writes only `load_status` and `dt_load`.
- `recordDiscoverySnapshot` resetting `load_status` on curr→prev rotation: it now preserves the existing `load_status` so replication correctly distinguishes delta from initial load.
- … `load_status` to handle the race condition where discovery runs faster than replication.
- `writetime()` returns null for unwritten columns: the comparison changed from `=!=` to `not(eqNullSafe)` so that null→value transitions are detected.
- … `println` calls with proper `logger.error`, or passed messages to the `RuntimeException` parent class.
Breaking Changes
- … `--cmd init`
- … `com.amazonaws.{region}.glue`) with `PrivateDnsEnabled=true`
- `--datalake-formats iceberg` and the Iceberg catalog `--conf` parameters must be set in the Glue job definition (handled automatically by `--cmd init`).
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
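For reviewers, the writetime handling described under "Multiple writetime columns" and in the null-writetime bug fix can be sketched in plain Scala, with `Option[Long]` standing in for a nullable writetime. This is only an illustration under stated assumptions: the object `WritetimeSketch` and all of its function names are hypothetical, and the actual job expresses the same logic with Spark column expressions rather than `Option`.

```scala
object WritetimeSketch {
  // Split the comma-separated --writetime-column value into column names.
  def parseColumns(arg: String): Seq[String] =
    arg.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // WRITETIME() takes a single column, so emit one call per column.
  def writetimeSelect(columns: Seq[String]): String =
    columns.map(c => s"WRITETIME($c)").mkString(", ")

  // Collapse the per-column writetimes into the single `ts` value:
  // the max of the non-null entries, or None if every column is unwritten.
  def maxWritetime(values: Seq[Option[Long]]): Option[Long] =
    values.flatten.reduceOption(_ max _)

  // SQL-style inequality (the behavior of Spark's =!=): the result is
  // null (None here) whenever either side is null, and a filter drops
  // rows whose predicate evaluates to null.
  def sqlNotEqual(prev: Option[Long], curr: Option[Long]): Option[Boolean] =
    for (p <- prev; c <- curr) yield p != c

  // Null-safe inequality, mirroring not(prev.eqNullSafe(curr)) in Spark:
  // null vs. value counts as different, null vs. null as equal.
  def nullSafeNotEqual(prev: Option[Long], curr: Option[Long]): Boolean =
    prev != curr
}
```

With a previous writetime of `None` and a current one of `Some(5L)`, `sqlNotEqual` returns `None`, so a filter on it would skip the row, while `nullSafeNotEqual` returns `true`. That is the null→value transition the fix is meant to catch.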