-
Notifications
You must be signed in to change notification settings - Fork 23
CASSANALYTICS-128: Add flag to allow bulk write to indexed tables #181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -69,7 +69,8 @@ public TableSchema(StructType dfSchema, | |
| TTLOption ttlOption, | ||
| TimestampOption timestampOption, | ||
| String lowestCassandraVersion, | ||
| boolean quoteIdentifiers) | ||
| boolean quoteIdentifiers, | ||
| boolean skipSecondaryIndexCheck) | ||
| { | ||
| this.writeMode = writeMode; | ||
| this.ttlOption = ttlOption; | ||
|
|
@@ -79,7 +80,21 @@ public TableSchema(StructType dfSchema, | |
| this.quoteIdentifiers = quoteIdentifiers; | ||
|
|
||
| validateDataFrameCompatibility(dfSchema, tableInfo); | ||
| validateNoSecondaryIndexes(tableInfo); | ||
| // If a table has indexes on it, some external process (application, DB, etc.) is responsible for rebuilding | ||
| // indexes on the table after the bulk write completes; cassandra does this as part of the SSTable import | ||
| // process today. 2i and SAI have different ergonomics here regarding if stale data is served during index build; | ||
| // ultimately we want the bulk writer to also write native SAI index files alongside sstables but until | ||
| // then, this is allowable and fine for users who Know What They're Doing. | ||
| if (!skipSecondaryIndexCheck) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you please add a test if possible that check this behavior, basically the if block
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this not explicitly tested with the following? As Or am I misunderstanding the request? |
||
| { | ||
| validateNoSecondaryIndexes(tableInfo); | ||
| } | ||
| else if (tableInfo.hasSecondaryIndex()) | ||
| { | ||
| LOGGER.warn("Bulk writing to tables with SecondaryIndexes will have an asynchronous index rebuild " | ||
| + "take place automatically after writing. Reads against the index during this time " | ||
| + "window will produce inconsistent or stale results until index rebuild is complete."); | ||
| } | ||
| validateUserAddedColumns(lowestCassandraVersion, quoteIdentifiers, ttlOption, timestampOption); | ||
|
|
||
| this.createStatement = getCreateStatement(tableInfo); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -132,4 +132,12 @@ public enum WriterOptions implements WriterOption | |
| * - a failure otherwise | ||
| */ | ||
| JOB_TIMEOUT_SECONDS, | ||
| /** | ||
| * Option to bypass the secondary index validation check during bulk write job setup. | ||
| * By default, bulk writes to tables with secondary indexes are rejected. | ||
| * Setting this option to {@code true} allows bulk writes to proceed on tables that have secondary indexes, | ||
| * with the understanding that the secondary indexes will NOT be updated by the bulk write and must be | ||
| * rebuilt separately after the job completes. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does C* detect and rebuild them, or does the user need to run a command to trigger the rebuild?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. C* automatically rebuilds them on import. Part of the SSTableImporter logic paths. |
||
| */ | ||
| SKIP_SECONDARY_INDEX_CHECK, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How about defining it in the spark conf, instead of write options? The rationale is that the toggle is to for advanced use case and not directly related to the write behavior. There is another existing spark conf, Admittedly, there are some existing inconsistency of where to have conf and where to have write options in the project.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would putting this in My intuition right now is that this is something that probably should have been enabled by default all this time w/a configurable guardrail to turn it off, so while I'm sympathetic to the idea of taking a small step from "don't allow it" to "allow it but make it hard to use", the risk of this being easily accessible for users is that they'll bulk write to a table that will then have a long running index building operation happen in the background. Which, other than "load on node" and "application might read a partial index if you don't have automation that clearly delineates when a bulk insert and reindex finish from application accessing it", doesn't represent a structural or correctness risk to the data.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Josh explained to me offline too. I think it makes sense to have it as part of writer option. Especially in the scenario when the spark job has multiple write sessions, writer option allows finer control to toggle the flag on specific tables / write sessions. |
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default value be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe default value shouldn't be true, because user need to understand that rebuild need to happen after the job.
@jmckenzie-dev can we print a note/warning to the user mentioning rebuild needs to happen after the job, so they are aware of penalty of using this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call; added: