Conversation
…inlog (mysql version <8.*)
```go
return fmt.Errorf("could not convert mysql value for %s: %w", field.Name, err)
}

mapped, err := c.mapQValue(qv, field, enumMap, binlogMetadataSupported)
```
so, I thought it would be cleaner to map values inside the onRow callback rather than trying to manipulate config.Query with something like:

```sql
select enumField + 0 from table;
```

does it make sense?
Let's explore both options. With the current one:
- The metadata query and the main query need to be consistent with each other, so it would need to be {begin tx; select 1 from WatermarkTable (metadata lock); fetch enum maps; run query; commit}
- Would prefer to avoid hand-parsing enum statements if possible. In the CDC path we're using github.com/pingcap/tidb/pkg/parser; can it be used for this?
With the enumField+0 one:
- snapshot_flow cloneTable() already constructs the list of columns in case there are exclusions and has the column type already available there
I find the +0 simpler, wdyt?
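For illustration, the per-column rewrite discussed here could look roughly like this (a minimal sketch in Go; `FieldDescription` and the `IsEnum` flag are assumptions for the example, not PeerDB's actual schema types):

```go
package main

import (
	"fmt"
	"strings"
)

// FieldDescription is a hypothetical stand-in for the column info that
// snapshot_flow's cloneTable() already has available.
type FieldDescription struct {
	Name   string
	IsEnum bool
}

// selectList builds the projection list, appending "+0" to enum columns so
// MySQL returns the 1-based enum index instead of the string label, matching
// what CDC emits on servers without binlog_row_metadata=FULL.
func selectList(fields []FieldDescription) string {
	parts := make([]string, 0, len(fields))
	for _, f := range fields {
		if f.IsEnum {
			parts = append(parts, fmt.Sprintf("`%s`+0 AS `%s`", f.Name, f.Name))
		} else {
			parts = append(parts, "`"+f.Name+"`")
		}
	}
	return strings.Join(parts, ", ")
}

func main() {
	fields := []FieldDescription{{Name: "id"}, {Name: "status", IsEnum: true}}
	fmt.Println("SELECT " + selectList(fields) + " FROM t")
}
```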
Yeah, I saw that we already have the list of columns at the snapshot_flow level.
So, yes, for snapshot_flow it would be quite easy to pass this list of columns to QRep workflow.
I was more concerned about arbitrary qrep queries that might come in if users run QRep directly. But I guess it's fine in this case, because we wouldn't have these consistency issues at all without cdc.
Yep. I think in that case +0 should work fine.
Just one question though. Would it be ok to pass some sort of query builder struct from snapshot into qrep? So, instead of config.Query = "select * from ... where ...", we could pass something like:

```go
type Query struct {
	Table   string // or just use WatermarkTable from config
	Columns []FieldDescription
	Where   string
}
```
since we query the table schema anyway inside cloneTable(), i think it's fine to always pass this list of columns into qrep (of course respecting the exclude logic).
I feel like manipulating the current Query string inside connector code is a bit error-prone.
Right, the `*` vs `col1, col2+0, col3` discrimination logic would be per-connector. Introducing a structured query object makes a lot of sense here. The PG partitioned tables improvement in #3998 would also need to go that route.
+1 on using structured query :)
@dtunikov if you are planning to introduce the Query struct in this PR, i will adopt it in a follow-up.
Currently QRepConfig is also used by QRepFlowWorkflow which can introduce arbitrary queries (this should be deprecated eventually but is currently available in OSS). Anyways, without overcomplicating the scope, using a shared function that parses the string into a structured query makes the most sense here.
In terms of the struct, Table/Columns/Where sounds good to me as well (and we can evolve it later if needed). This should cover the case where, if len(columns) == 0, we construct the query with *.
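A minimal sketch of that struct and a shared SQL generator, including the `*` fallback for an empty column list (the `SQL` method is hypothetical; real connectors would render dialect-specific SQL):

```go
package main

import (
	"fmt"
	"strings"
)

// Query is the structured query proposed in the discussion; connectors would
// convert it into SQL instead of string-manipulating config.Query.
type Query struct {
	Table   string
	Columns []string
	Where   string
}

// SQL renders a generic SELECT; when no columns are given, it falls back to *.
func (q Query) SQL() string {
	cols := "*"
	if len(q.Columns) > 0 {
		cols = strings.Join(q.Columns, ", ")
	}
	s := fmt.Sprintf("SELECT %s FROM %s", cols, q.Table)
	if q.Where != "" {
		s += " WHERE " + q.Where
	}
	return s
}

func main() {
	fmt.Println(Query{Table: "t"}.SQL())
	fmt.Println(Query{Table: "t", Columns: []string{"id", "status+0"}, Where: "id > 5"}.SQL())
}
```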
If we leave the standalone QRep as is, couldn't a struct field be added on the side without needing parsing? We build snapshot queries ourselves in snapshot_flow.
ah gotcha, as in introduce a separate StructuredQuery field used by snapshotting, and let standalone QRep continue to use the existing Query field (since StructuredQuery would be nil). Just keep both. Sounds good to me.
I just noticed the newly introduced BaseQuery in #3983, given we haven't deployed this yet, would it make sense to have a StructuredQuery and then generate the base query from it dynamically which should be trivial now? (thinking that having a Query, BaseQuery, and StructuredQuery is a bit confusing)
Merged #4050 which should allow to just handle all of this in the connector
```go
RequireEnvCanceled(s.t, env)
}

func (s ClickHouseSuite) Test_MySQL_Enum_Consistency() {
```
I decided to keep Test_MySQL_Enum as it is, because it tests both enums and sets.
This PR only addresses enums for now.
Also Test_MySQL_Enum checks complete similarity between src and dst tables.
Here we just check that dst.table.rows[0].enumValue == dst.table.rows[1].enumValue (qrep enum value == cdc enum value)
Pull request overview
This PR fixes MySQL enum value inconsistencies between initial snapshot (QRep) and CDC for older MySQL versions that lack binlog_row_metadata=full, by mapping snapshot enum strings to their corresponding 1-based enum indexes (matching CDC output).
Changes:
- Added enum option discovery from information_schema.columns and enum string→index mapping during MySQL QRep pulls.
- Added unit/integration tests for enum parsing/mapping and enum column option retrieval.
- Added an E2E ClickHouse↔MySQL test to assert snapshot vs CDC enum consistency.
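In essence, the enum option discovery parses the COLUMN_TYPE value from information_schema.columns into a label → 1-based index map. A rough sketch (the actual implementation would also need to handle quoting and escaped characters in labels, which this ignores):

```go
package main

import (
	"fmt"
	"strings"
)

// parseEnumOptions turns a COLUMN_TYPE string such as
// "enum('small','medium','large')" into a label -> 1-based index map,
// matching MySQL's enum numbering. Labels containing commas or escaped
// quotes would need real parsing; this sketch ignores that.
func parseEnumOptions(columnType string) map[string]int {
	inner := strings.TrimSuffix(strings.TrimPrefix(columnType, "enum("), ")")
	options := make(map[string]int)
	for i, opt := range strings.Split(inner, ",") {
		options[strings.Trim(opt, "'")] = i + 1
	}
	return options
}

func main() {
	fmt.Println(parseEnumOptions("enum('small','medium','large')"))
}
```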
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| flow/e2e/clickhouse_mysql_test.go | Adds an E2E test that validates snapshot/CDC enum value consistency. |
| flow/connectors/mysql/qrep_test.go | Adds tests for enum parsing, mapping behavior, and enum option lookup. |
| flow/connectors/mysql/qrep.go | Implements enum option parsing/lookup and applies enum mapping in QRep streaming. |
| flow/connectors/mysql/mysql.go | Adds a helper to detect whether binlog row metadata is supported based on server version/flavor. |
| flow/connectors/mysql/cdc.go | Refactors CDC binlog-metadata support check to use the new helper and closes a result set. |
❌ 2 Tests Failed
…s-mysql-without-metadata
ilidemi left a comment:
Couple of integration points:
- Let's introduce QKindIntEnum that maps to an integer on the destination
- Add an internal version to only enable it for new mirrors/pipes
To simplify handling of partitioned tables in Postgres and the enums special case in MySQL 5.7, just move query building into the connector layer. The proto change is assumed to be safe as there was no release since the last change. Also brings back the CH null partition test, as the MySQL path didn't have any coverage without it; h/t @jgao54 for [fixing](#4037) it. Intended to unblock #3998 and #4025.
This PR addresses issue #3950.
It converts enum strings to integer values during QRep for older MySQL versions without binlog metadata support, aligning QRep behaviour with CDC behaviour.
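The conversion can be pictured as follows (a hedged sketch, not the PR's actual mapQValue; on servers with binlog metadata support, CDC resolves labels itself and no remapping is needed):

```go
package main

import "fmt"

// mapEnumValue aligns a snapshot enum value with what CDC emits. On servers
// without binlog_row_metadata=FULL, CDC only sees the 1-based index, so the
// snapshot's string label is converted via the per-column enum map.
func mapEnumValue(v any, enumMap map[string]int, binlogMetadataSupported bool) (any, error) {
	if binlogMetadataSupported {
		return v, nil // CDC carries the label too; keep the string
	}
	label, ok := v.(string)
	if !ok {
		return v, nil // non-enum values pass through unchanged
	}
	idx, ok := enumMap[label]
	if !ok {
		return nil, fmt.Errorf("unknown enum label %q", label)
	}
	return idx, nil
}

func main() {
	v, _ := mapEnumValue("medium", map[string]int{"small": 1, "medium": 2}, false)
	fmt.Println(v)
}
```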