[oneshot] SS-49 Add validation for NOT NULL columns to COPY FROM query planning #34862
Conversation
Note: this PR is "stacked" on top of #34849, so it technically includes the changes from there as well.
ggevay
left a comment
I wrote some comments.
Also, it seems that the data we read from the file can itself have nulls, in which case we still crash. For example, adding the following to copy-from-s3-minio.td crashes (`hello,,world` means 3 fields, of which the middle one is null):
```
> CREATE TABLE t5 (a text, b text not null, c text)

$ s3-file-upload bucket=copytos3 key=csv/with_null.csv
hello,,world

$ s3-set-presigned-url bucket=copytos3 key=csv/with_null.csv var-name=nulls_in_csv_url

> COPY INTO t5 (a,b,c) FROM '${nulls_in_csv_url}' (FORMAT CSV);
```
So, it seems we should do some validation once we have the actual data at hand. (In that case we might not even need validation in planning or sequencing at all, since the later validation could also cover nulls that come from default values.)
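A row-level check along these lines could be sketched as follows. This is only an illustration: `ColumnSpec`, the `Option<&str>` datum representation, and the error type are invented stand-ins, not Materialize's actual types.

```rust
// Sketch: reject any row that puts a null datum into a NOT NULL column.
// `ColumnSpec` is a hypothetical stand-in for the real column metadata.
struct ColumnSpec {
    name: String,
    nullable: bool,
}

fn validate_row(datums: &[Option<&str>], columns: &[ColumnSpec]) -> Result<(), String> {
    for (datum, col) in datums.iter().zip(columns) {
        if datum.is_none() && !col.nullable {
            return Err(format!(
                "null value in column \"{}\" violates not-null constraint",
                col.name
            ));
        }
    }
    Ok(())
}
```

Running this over the `hello,,world` example above, with column `b` declared NOT NULL, would surface a descriptive error instead of a panic at Row-encoding time.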
ggevay
left a comment
LGTM from the Adapter side, but maybe wait for an S&S team review as well.
ublubu
left a comment
This PR looks like two independent bugfixes.
I've looked at the "check the nullness in each row" fix and left one comment. That bugfix seems pretty straightforward; I don't know if you want to pull it out and merge it separately.
The other bugfix will take some thought. I'll look through it next.
```rust
match maybe_row {
    // Happy path, add the Row to our batch!
    Ok(row) => {
        // Happy path, add the Row to our batch !
```
How did that space get in there?
hmmm, it appears I accidentally pressed the space bar with my cursor there, will fix lol
```rust
_ => {
    let err =
        StorageErrorXKind::invalid_record_batch("invalid row for collection");
    // Clean up our in-progress batch so we don't leak data.
    let batch = batch_builder
        .finish(upper)
        .await
        .expect("failed to cleanup batch");
    batch.delete().await;
    proto_batch_handle
        .give(&proto_batch_cap, Err(err).context("stage batches"));
    return;
}
```
This looks a lot like the `Err(err) => ...` branch.
What if, instead of just `match maybe_row {`, we did a:
`maybe_row.and_then(|row| { /* validate(...) -> "invalid row for collection" Err */ })`
These are really two different tactics for the same bug: first, checking at the query-sequencing stage whether the "schema" of the copy target is valid; second (the S&S-specific one), checking each row for validity (something we cannot see from the schema in the CSV case) to ensure no nulls are inserted into NOT NULL columns. @ggevay had a look at the sequencer changes, so no need to dig too deep there.
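The sequencing-stage tactic could be sketched like this. Everything here is hedged: `ColumnDef` and the shape of the check are invented for illustration and do not reflect the adapter's real plan types.

```rust
// Sketch: at sequencing time, reject COPY FROM when a NOT NULL destination
// column has no default and is not in the statement's column list, since
// such a column could only ever receive a null.
struct ColumnDef {
    name: &'static str,
    nullable: bool,
    has_default: bool,
}

fn check_copy_from(table: &[ColumnDef], copied: &[&str]) -> Result<(), String> {
    for col in table {
        let targeted = copied.contains(&col.name);
        if !col.nullable && !col.has_default && !targeted {
            return Err(format!(
                "column \"{}\" is NOT NULL, has no default, and is not referenced by COPY FROM",
                col.name
            ));
        }
    }
    Ok(())
}
```

This covers the schema-level half of the bug; the per-row check above is still needed for nulls that appear in the data itself.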
Currently, if a `NOT NULL` column in the destination table does not have a default and is not specified to be copied into by the `COPY FROM` statement, Materialize will panic at the time of Row encoding, which is very undesirable behavior. By checking in the adapter sequencing logic, we can reject the `COPY FROM` statement earlier with a descriptive error message.

Motivation
https://github.com/MaterializeInc/database-issues/issues/9886
Checklist
`$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.