
207 add support for iceberg tables #210

Open
nwheeler81 wants to merge 3 commits into `main` from `207-add-support-for-iceberg-tables`

Conversation

@nwheeler81 (Contributor)

Issue #, if available:

#206 #180 #207 #170

Description of changes:

CQLReplicator Release Notes

What's New

Apache Iceberg Integration

Replaced the Parquet-based head/tail snapshot storage with Apache Iceberg tables registered in the AWS Glue Data Catalog. Each tile now has its own Iceberg table with built-in snapshot versioning, eliminating the manual head/tail swap logic and simplifying change detection via time-travel reads.

  • Per-tile Iceberg tables: {catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots
  • Automatic snapshot versioning — no more head/tail directory management
  • Time-travel reads for delta computation between discovery cycles
  • Backward compatibility: existing Parquet snapshots are automatically migrated to Iceberg on first run
  • Glue Data Catalog integration for table metadata management
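
To make the naming scheme above concrete, here is a minimal sketch of building the per-tile table identifier and the shape of a time-travel read. The function and catalog names are illustrative assumptions, not the project's actual helpers.

```scala
// Illustrative sketch: per-tile Iceberg snapshot table identifier,
// following the {catalog}.{keyspace}_db.{table}_tile_{N}_pk_snapshots
// pattern described above. Names are assumptions for illustration.
def tileSnapshotTable(catalog: String, keyspace: String, table: String, tile: Int): String =
  s"$catalog.${keyspace}_db.${table}_tile_${tile}_pk_snapshots"

// A delta between discovery cycles can then be computed with Iceberg
// time travel, e.g. in Spark SQL:
//   SELECT * FROM glue_catalog.ks_db.orders_tile_3_pk_snapshots
//   VERSION AS OF <previousSnapshotId>
```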

AWS Glue 5.0 Support

Upgraded from AWS Glue 4.0 to Glue 5.0, bringing Spark 3.5.4, Java 17, and Iceberg 1.7.1.

  • Glue job version bumped from 4.0 to 5.0
  • Spark Cassandra Connector updated from 3.4.1 to 3.5.1
  • Cleanup block migrated from AWS SDK v1 to SDK v2 GlueClient
  • All existing Iceberg, Spark, and S3 operations verified compatible

Multiple Writetime Columns

The `--writetime-column` parameter now accepts multiple comma-separated column names. Because Cassandra's `WRITETIME()` function only accepts a single column, CQLReplicator calls it once per column and takes the maximum of the results. The Iceberg table schema remains unchanged (a single `ts` BIGINT column), and the replication change detection logic is unaffected.
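
A minimal sketch of that approach, with illustrative names rather than the project's actual API:

```scala
// Since WRITETIME() takes exactly one column, project one writetime
// expression per requested column...
def writetimeProjection(columns: Seq[String]): String =
  columns.map(c => s"writetime($c)").mkString(", ")

// ...then take the max of the values that are present. writetime()
// returns null for unwritten columns, modeled here as None.
def maxWritetime(values: Seq[Option[Long]]): Option[Long] =
  values.flatten match {
    case Nil => None
    case ts  => Some(ts.max)
  }
```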

Configurable Log Level

Added --logging-level flag to control Spark's internal logging verbosity. Defaults to ERROR to reduce CloudWatch noise.

cqlreplicator --cmd run --logging-level WARN ...

Valid values: ALL, TRACE, DEBUG, INFO, WARN, ERROR (default), FATAL, OFF. CQLReplicator's own messages via GlueLogger are always visible regardless of this setting.
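
One way the flag could be resolved to a level that Spark accepts is sketched below; the function name and argument plumbing are assumptions, not the actual implementation.

```scala
// Levels Spark's setLogLevel accepts; anything else falls back to ERROR.
val ValidLevels = Set("ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF")

def resolveLogLevel(arg: Option[String]): String =
  arg.map(_.trim.toUpperCase).filter(ValidLevels.contains).getOrElse("ERROR")

// Applied to Spark as: spark.sparkContext.setLogLevel(resolveLogLevel(...))
```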

Glue Data Catalog IAM Permission Validation

The init script now validates Glue Data Catalog permissions (glue:CreateDatabase, glue:GetDatabase, glue:CreateTable, glue:GetTable, glue:UpdateTable, glue:DeleteTable, glue:GetTables) alongside existing S3 and Keyspaces checks.
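
The core of such a check can be sketched as a set difference between the required Glue actions and those known to be granted; this is illustrative only and may differ from what the init script actually does.

```scala
// Glue Data Catalog actions the init script validates, per the list above.
val RequiredGlueActions: Set[String] = Set(
  "glue:CreateDatabase", "glue:GetDatabase", "glue:CreateTable",
  "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables"
)

// Any required action not granted should be reported before the job runs.
def missingGlueActions(granted: Set[String]): Set[String] =
  RequiredGlueActions.diff(granted)
```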

Bug Fixes

  • Fixed AWS SDK v1 Glue client import conflict in --cr cleanup block — replaced local imports with fully qualified class names
  • Fixed ClassCastException for tinyint/smallint columns — Iceberg/Parquet widens these to Integer; now uses Number.byteValue()/shortValue()
  • Fixed markReplicationComplete overwriting offload_status — now only writes load_status and dt_load
  • Fixed recordDiscoverySnapshot resetting load_status on curr→prev rotation — now preserves existing load_status so replication correctly distinguishes delta vs initial load
  • Fixed delta replication path selection — properly checks load_status to handle the race condition where discovery runs faster than replication
  • Fixed null-safe writetime comparison for update detection — Cassandra writetime() returns null for unwritten columns; changed from =!= to not(eqNullSafe) to correctly detect null→value transitions
  • Replaced all `println` calls with `logger.error`, or passed the messages to the `RuntimeException` parent class
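
The tinyint/smallint fix noted above can be sketched as follows: values read back from Iceberg/Parquet arrive widened to `java.lang.Integer`, so narrowing must go through `Number` rather than a direct `Byte`/`Short` cast (which throws `ClassCastException`). The helper names are illustrative, not the project's actual code.

```scala
// Narrow an Integer-widened value back to Cassandra's tinyint/smallint
// via Number, avoiding the ClassCastException a direct cast would throw.
def asByte(v: Any): Byte = v.asInstanceOf[Number].byteValue()
def asShort(v: Any): Short = v.asInstanceOf[Number].shortValue()
```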

Breaking Changes

  • Requires AWS Glue 5.0 — existing Glue 4.0 jobs must be recreated via --cmd init
  • Requires Glue Data Catalog VPC endpoint (com.amazonaws.{region}.glue) with PrivateDnsEnabled=true
  • The --datalake-formats iceberg and Iceberg catalog --conf parameters must be set in the Glue job definition (handled automatically by --cmd init)
  • Existing Parquet snapshots are automatically migrated to Iceberg on first discovery run — no manual intervention needed
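
For reference, the Iceberg catalog `--conf` parameters that `--cmd init` sets typically follow the standard Iceberg-on-Glue Spark configuration below; the catalog name (`glue_catalog`) and warehouse path are illustrative assumptions, not values confirmed by this PR.

```
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<bucket>/<prefix>/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```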

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Nikolai Kolesnikov added 2 commits February 26, 2026 17:35
…lesCount()` in `CQLReplicator.scala` from legacy Parquet/S3 to the per-tile Iceberg table architecture. Reuses existing Iceberg helper functions. Mirror changes to applicable glue-test keyspaces variants.

Labels

  • bug: Something isn't working
  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
