fix: Disallows dropping duplicate keys when using full outer join #1320
fix: Disallows dropping duplicate keys when using full outer join #1320renato2099 wants to merge 4 commits intoapache:mainfrom
Conversation
kosiew
left a comment
There was a problem hiding this comment.
Thanks for working on this.
Documentation gaps:
The doc string for join in python/datafusion/dataframe.py states:
drop_duplicate_keys: When True, the columns from the right DataFrame
that have identical names in the ``on`` fields to the left DataFrame
will be dropped.It does not mention the full join exception. Users reading this may assume the parameter works the same for all join types.
Similarly, docs/source/user-guide/common-operations/joins.rst should also document for the full join and drop_duplicate_keys behaviour.
|
Hi @kosiew , thanks for taking a look at the PR! I have added some documentation notes. Let me know if this is sufficient, otherwise I can add more explanations. |
|
I am thinking that we could have a follow up on this path to be more ergonomic though + a more future-proof API (non-breaking path). Basically, we could introduce an enum-like parameter alongside the boolean, deprecating the latter later on, we could have something like: Then
but we could do that in a follow up PR, wdyt @kosiew ? |
|
Closing this PR since the consensus has landed on using the coalesce approach instead. Thank you for the PR and helpful discussions! |
Which issue does this PR close?
Closes #1305
Rationale for this change
Datafusion-python should follow datafusion implementation and disallow dropping keys when doing a full outer join as both keys are not equivalent thus they can't just be dropped. Users can then decide on how to proceed based on their use case.
What changes are included in this PR?
fix + unit test
Are there any user-facing changes?
yes, disallowing dropping keys when doing a full outer join as that is not semantically correct.