Enabling binary operations with list-like Python objects.#2054
Enabling binary operations with list-like Python objects.#2054itholic wants to merge 6 commits intodatabricks:masterfrom
Conversation
| else: | ||
| raise TypeError("date subtraction can only be applied to date series.") | ||
| return column_op(Column.__rsub__)(self, other) | ||
| return column_op(lambda left, right: right - left)(self, other) |
There was a problem hiding this comment.
FYI: Column.__rsub__ doesn't support pyspark.sql.column.Column for second parameter.
>>> kdf = ks.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
>>> sdf = kdf.to_spark()
>>> col1 = sdf.A
>>> col2 = sdf.B
>>> Column.__rsub__(col1, col2)
Traceback (most recent call last):
...
TypeError: Column is not iterableThere was a problem hiding this comment.
It does support:
>>> Column.__rsub__(df.id, 1)
Column<'(1 - id)'>It doesn't work in your case above because the instance is Spark column. In practice, that wouldn't happen because it will only be called when the first operand doesn't know how to handle Spark column e.g.) 1 - df.id.
There was a problem hiding this comment.
Does it cause any exception?
There was a problem hiding this comment.
If we use column_op(Column.__rsub__)(self, other) as it is, it raises TypeError: Column is not iterable for the case below.
>>> kser = ks.Series([1, 2, 3, 4])
>>> [10, 20, 30, 40] - kser
Traceback (most recent call last):
...
TypeError: Column is not iterableThere was a problem hiding this comment.
Not that this case must be handled in lines 490-492. We can move back to Column.__rsub__.
| # other = tuple with the different length | ||
| other = (np.nan, 1, 3, 4, np.nan) | ||
| with self.assertRaisesRegex( | ||
| ValueError, "operands could not be broadcast together with shapes" |
There was a problem hiding this comment.
The error message looks weird. Is it matched with pandas'?
There was a problem hiding this comment.
The original error message from pandas looks like :
ValueError: operands could not be broadcast together with shapes (4,) (8,)@ueshin , maybe we don't include the (4,) (8,) part since it requires to compute length of both objects which can be expensive ??
Codecov Report
@@ Coverage Diff @@
## master #2054 +/- ##
==========================================
- Coverage 94.71% 93.26% -1.45%
==========================================
Files 54 54
Lines 11503 11735 +232
==========================================
+ Hits 10895 10945 +50
- Misses 608 790 +182
Continue to review full report at Codecov.
|
| return left.isNull() | right.isNull() | comp(left, right) | ||
|
|
||
|
|
||
| def check_same_length(left: "IndexOpsMixin", right: Union[list, tuple]): |
There was a problem hiding this comment.
Nice utility! The function name might be misleading considering its return type. Would it be possible to annotate the return type or rename the function?
| if LooseVersion(pd.__version__) < LooseVersion("1.2.0"): | ||
| right = pd.Index(right, name=pindex_ops.name) |
There was a problem hiding this comment.
What happens with pandas<1.2?
Seems like it's working with pandas >= 1.0 in the test?
Actually it works:
>>> pd.__version__
'1.0.5'
>>> pd.Index([1,2,3]) + [4,5,6]
Int64Index([5, 7, 9], dtype='int64')
>>> [4,5,6] + pd.Index([1,2,3])
Int64Index([5, 7, 9], dtype='int64')There was a problem hiding this comment.
Ohh,,, seems like It doesn't work for only rmod in pandas < 1.2.
>>> [4, 5, 6] % pd.Index([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Int64Index' object has no attribute 'rmod'Let me address for only this case.
Thanks!
| raise ValueError( | ||
| "operands could not be broadcast together with shapes ({},) ({},)".format( | ||
| len_pindex_ops, len_right | ||
| ) |
There was a problem hiding this comment.
We can show the length of left if it's less than the length of right, but if it's greater, the actual length is unknown.
| def __add__(self, other) -> Union["Series", "Index"]: | ||
| if isinstance(other, (list, tuple)): | ||
| pindex_ops, other = check_same_length(self, other) | ||
| return ks.from_pandas(pindex_ops + other) # type: ignore |
There was a problem hiding this comment.
Shall we avoid using # type: ignore as possible? We can use cast instead.
| if isinstance(other, (list, tuple)): | ||
| other = ks.Index(other, name=self.name) # type: ignore |
| else: | ||
| raise TypeError("date subtraction can only be applied to date series.") | ||
| return column_op(Column.__rsub__)(self, other) | ||
| return column_op(lambda left, right: right - left)(self, other) |
There was a problem hiding this comment.
Not that this case must be handled in lines 490-492. We can move back to Column.__rsub__.
| if isinstance(other, (list, tuple)): | ||
| other = ks.Index(other, name=self.name) # type: ignore |
So far, Koalas doesn't support list-like Python objects for Series binary operations.
This PR enables it.
ref #2022 (comment)