Enabling binary operations with list-like Python objects. by itholic · Pull Request #2054 · databricks/koalas

itholic · 2021-02-15T07:50:49Z

So far, Koalas doesn't support list-like Python objects for Series binary operations.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64

>>> kser + [10, 20, 30, 40, 50, 60]
Traceback (most recent call last):
...

This PR enables it.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64
>>> kser + [10, 20, 30, 40, 50, 60]
0    11
1    22
2    33
3    44
4    55
5    66
Name: x, dtype: int64
>>> kser - [10, 20, 30, 40, 50, 60]
0    -9
1   -18
2   -27
3   -36
4   -45
5   -54
Name: x, dtype: int64
>>> kser * [10, 20, 30, 40, 50, 60]
0     10
1     40
2     90
3    160
4    250
5    360
Name: x, dtype: int64
>>> kser / [10, 20, 30, 40, 50, 60]
0    0.1
1    0.1
2    0.1
3    0.1
4    0.1
5    0.1
Name: x, dtype: float64

ref #2022 (comment)

itholic · 2021-02-15T13:46:56Z

            else:
                raise TypeError("date subtraction can only be applied to date series.")
-        return column_op(Column.__rsub__)(self, other)
+        return column_op(lambda left, right: right - left)(self, other)


FYI: Column.__rsub__ doesn't support a pyspark.sql.column.Column as its second parameter.

>>> kdf = ks.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
>>> sdf = kdf.to_spark()
>>> col1 = sdf.A
>>> col2 = sdf.B
>>> Column.__rsub__(col1, col2)
Traceback (most recent call last):
...
TypeError: Column is not iterable

It does support:

>>> Column.__rsub__(df.id, 1)
Column<'(1 - id)'>

It doesn't work in your case above because the other operand is also a Spark column. In practice that wouldn't happen, because __rsub__ is only called when the first operand doesn't know how to handle a Spark column, e.g. 1 - df.id.

Does it cause any exception?

If we use column_op(Column.__rsub__)(self, other) as it is, it raises TypeError: Column is not iterable for the case below.

>>> kser = ks.Series([1, 2, 3, 4])
>>> [10, 20, 30, 40] - kser
Traceback (most recent call last):
...
TypeError: Column is not iterable

Note that this case must be handled in lines 490-492. We can move back to Column.__rsub__.
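To illustrate the dispatch rule discussed in this thread, here is a minimal pure-Python sketch. `Expr` and `_text` are hypothetical stand-ins invented for this example, not PySpark's `Column`; `Expr` only records the expression string it would build. The sketch assumes nothing about Spark internals. It shows that Python reaches `__rsub__` only when the left operand (here a plain `int`) cannot handle the subtraction, and that `lambda left, right: right - left` produces the reversed expression when both operands are expression objects.

```python
class Expr:
    """Hypothetical stand-in for a column expression; it only records text."""

    def __init__(self, text):
        self.text = text

    def __sub__(self, other):
        # Handles: self - other
        return Expr("({} - {})".format(self.text, _text(other)))

    def __rsub__(self, other):
        # Handles: other - self, reached only after type(other).__sub__
        # returned NotImplemented for this pair of operands.
        return Expr("({} - {})".format(_text(other), self.text))


def _text(value):
    """Render either an Expr or a plain Python value as text."""
    return value.text if isinstance(value, Expr) else repr(value)


col_id = Expr("id")
col_b = Expr("B")

# int.__sub__ cannot handle an Expr, so Python falls back to Expr.__rsub__.
reflected = 1 - col_id

# With two Expr operands, the reversed lambda goes through Expr.__sub__.
reversed_op = (lambda left, right: right - left)(col_id, col_b)

print(reflected.text)    # (1 - id)
print(reversed_op.text)  # (B - id)
```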

HyukjinKwon · 2021-02-17T04:35:34Z

+        # other = tuple with the different length
+        other = (np.nan, 1, 3, 4, np.nan)
+        with self.assertRaisesRegex(
+            ValueError, "operands could not be broadcast together with shapes"


The error message looks weird. Is it matched with pandas'?

The original error message from pandas looks like:

ValueError: operands could not be broadcast together with shapes (4,) (8,)

@ueshin, maybe we shouldn't include the (4,) (8,) part, since it requires computing the length of both objects, which can be expensive?

codecov-io · 2021-02-18T07:57:43Z

Codecov Report

Merging #2054 (0fd3666) into master (87f5b18) will decrease coverage by 1.44%.
The diff coverage is 91.17%.

@@            Coverage Diff             @@
##           master    #2054      +/-   ##
==========================================
- Coverage   94.71%   93.26%   -1.45%     
==========================================
  Files          54       54              
  Lines       11503    11735     +232     
==========================================
+ Hits        10895    10945      +50     
- Misses        608      790     +182

Impacted Files	Coverage Δ
databricks/koalas/utils.py	`93.66% <75.00%> (-1.71%)`	⬇️
databricks/koalas/base.py	`97.35% <96.00%> (+0.06%)`	⬆️
databricks/koalas/indexes/base.py	`97.43% <100.00%> (ø)`
databricks/koalas/usage_logging/__init__.py	`26.66% <0.00%> (-65.84%)`	⬇️
databricks/koalas/usage_logging/usage_logger.py	`47.82% <0.00%> (-52.18%)`	⬇️
databricks/koalas/__init__.py	`80.00% <0.00%> (-12.00%)`	⬇️
databricks/conftest.py	`91.30% <0.00%> (-8.70%)`	⬇️
databricks/koalas/accessors.py	`86.43% <0.00%> (-7.04%)`	⬇️
databricks/koalas/spark/accessors.py	`88.67% <0.00%> (-6.29%)`	⬇️
databricks/koalas/typedef/typehints.py	`91.06% <0.00%> (-2.75%)`	⬇️
... and 13 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 87f5b18...0fd3666.

xinrong-meng · 2021-02-18T18:04:59Z

    return left.isNull() | right.isNull() | comp(left, right)
+
+
+def check_same_length(left: "IndexOpsMixin", right: Union[list, tuple]):


Nice utility! The function name might be misleading considering its return type. Would it be possible to annotate the return type or rename the function?

ueshin

Also, could you try to reduce the amount of test code by using a loop or parameterizing, if there is no difference except for the operators?
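One way to read the suggestion above is a single test that loops over the operators. The following is only a sketch under stated assumptions: `elementwise` is a hypothetical pure-Python helper standing in for a Series-with-list operation, and the test class name is made up for this example. The expected values are the ones shown in the PR description.

```python
import operator
import unittest


def elementwise(op, left, right):
    """Hypothetical stand-in for applying a binary operator to a Series and a list."""
    if len(left) != len(right):
        raise ValueError("operands could not be broadcast together with shapes")
    return [op(a, b) for a, b in zip(left, right)]


class BinaryOpsWithListTest(unittest.TestCase):
    left = [1, 2, 3, 4, 5, 6]
    right = [10, 20, 30, 40, 50, 60]

    def test_arithmetic_ops(self):
        cases = [
            (operator.add, [11, 22, 33, 44, 55, 66]),
            (operator.sub, [-9, -18, -27, -36, -45, -54]),
            (operator.mul, [10, 40, 90, 160, 250, 360]),
            (operator.truediv, [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
        ]
        for op, expected in cases:
            # subTest reports each operator separately if one of them fails.
            with self.subTest(op=op.__name__):
                self.assertEqual(elementwise(op, self.left, self.right), expected)

    def test_length_mismatch(self):
        for op in (operator.add, operator.sub, operator.mul, operator.truediv):
            with self.subTest(op=op.__name__):
                with self.assertRaisesRegex(ValueError, "could not be broadcast"):
                    elementwise(op, self.left, (1, 2))
```

With `subTest`, a failure names the operator that broke, so collapsing the per-operator tests into one loop does not hide which case failed.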

ueshin · 2021-02-18T20:42:33Z

+            if LooseVersion(pd.__version__) < LooseVersion("1.2.0"):
+                right = pd.Index(right, name=pindex_ops.name)


What happens with pandas<1.2?
Seems like it's working with pandas >= 1.0 in the test?

Actually it works:

>>> pd.__version__
'1.0.5'
>>> pd.Index([1,2,3]) + [4,5,6]
Int64Index([5, 7, 9], dtype='int64')
>>> [4,5,6] + pd.Index([1,2,3])
Int64Index([5, 7, 9], dtype='int64')

Oh, it seems that only rmod doesn't work in pandas < 1.2.

>>> [4, 5, 6] % pd.Index([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Int64Index' object has no attribute 'rmod'

Let me address only this case.

Thanks!

ueshin · 2021-02-18T20:45:46Z

+            raise ValueError(
+                "operands could not be broadcast together with shapes ({},) ({},)".format(
+                    len_pindex_ops, len_right
+                )


We can show the length of left if it's less than the length of right, but if it's greater, the actual length is unknown.
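The point about the unknown length can be sketched in pure Python. This is an illustration under assumptions, not the code in this PR: `compare_length` and `raise_if_length_differs` are hypothetical names, and a plain iterator stands in for the distributed left-hand side. The idea is to read at most `len(right) + 1` items from the left: that is enough to decide whether the lengths match, and it yields the exact left length only when the left is not longer than the right.

```python
from itertools import islice


def compare_length(left_rows, right):
    """Return (lengths_match, left_length or None if left is longer).

    Reads at most len(right) + 1 items from the iterator `left_rows`.
    """
    limit = len(right) + 1
    seen = sum(1 for _ in islice(left_rows, limit))
    if seen == limit:
        # The left side is longer than the right; its true length
        # was never computed, so it is reported as unknown.
        return False, None
    return seen == len(right), seen


def raise_if_length_differs(left_rows, right):
    """Raise ValueError on mismatch, adding shapes only when both are known."""
    same, left_len = compare_length(left_rows, right)
    if same:
        return
    message = "operands could not be broadcast together with shapes"
    if left_len is not None:
        message += " ({},) ({},)".format(left_len, len(right))
    raise ValueError(message)
```

For example, `compare_length(iter(range(3)), [1, 2, 3, 4, 5])` returns `(False, 3)`, while a much longer left side such as `iter(range(1000))` compared with a two-element list returns `(False, None)` after reading only three items.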

ueshin · 2021-02-18T20:54:51Z

    def __add__(self, other) -> Union["Series", "Index"]:
+        if isinstance(other, (list, tuple)):
+            pindex_ops, other = check_same_length(self, other)
+            return ks.from_pandas(pindex_ops + other)  # type: ignore


Shall we avoid using # type: ignore as much as possible? We can use cast instead.
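As a rough illustration of the `cast` suggestion, the sketch below uses hypothetical stand-in classes (`Series`, `Index`) and a hypothetical converter `from_values`; none of these are the Koalas implementations. It assumes only that the converter is annotated with a `Union` return type, which is the situation where a checker such as mypy would otherwise complain about the narrower return annotation.

```python
from typing import List, Union, cast


class Series:
    """Hypothetical stand-in, not koalas.Series."""

    def __init__(self, values: List[int]) -> None:
        self.values = values


class Index:
    """Hypothetical stand-in, not koalas.Index."""

    def __init__(self, values: List[int]) -> None:
        self.values = values


def from_values(values: List[int], as_index: bool) -> Union[Series, Index]:
    """Hypothetical converter whose declared return type is a Union."""
    return Index(values) if as_index else Series(values)


def add_list(left: Series, other: List[int]) -> Series:
    summed = [a + b for a, b in zip(left.values, other)]
    # cast() tells the type checker which Union member is expected here.
    # It performs no runtime check or conversion.
    return cast(Series, from_values(summed, as_index=False))
```

Unlike `# type: ignore`, which silences every error on the line, `cast` states the expected type explicitly and leaves other checks on that line active.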

ueshin · 2021-02-18T20:55:22Z

+        if isinstance(other, (list, tuple)):
+            other = ks.Index(other, name=self.name)  # type: ignore


not needed?

ueshin · 2021-02-18T21:01:54Z

            else:
                raise TypeError("date subtraction can only be applied to date series.")
-        return column_op(Column.__rsub__)(self, other)
+        return column_op(lambda left, right: right - left)(self, other)


Note that this case must be handled in lines 490-492. We can move back to Column.__rsub__.

ueshin · 2021-02-18T21:03:06Z

+        if isinstance(other, (list, tuple)):
+            other = ks.Index(other, name=self.name)  # type: ignore


not needed?

xinrong-meng · 2021-08-05T21:55:00Z

https://issues.apache.org/jira/browse/SPARK-36437

Add tests

2e20719

itholic force-pushed the series_op branch from 8c16164 to 2e20719 Compare February 15, 2021 12:49

itholic marked this pull request as draft February 15, 2021 13:22

Fix rsub

922ba6a

itholic commented Feb 15, 2021

itholic marked this pull request as ready for review February 15, 2021 14:45

itholic requested review from HyukjinKwon, ueshin and xinrong-meng February 17, 2021 02:44

HyukjinKwon reviewed Feb 17, 2021

Comment thread databricks/koalas/series.py Outdated

itholic added 2 commits February 18, 2021 12:01

Add Index

02c3334

Fix test for mod and rmod

8df4ff9

itholic added 2 commits February 18, 2021 18:56

Use pandas

5d2d1c5

Fix test

0fd3666

xinrong-meng reviewed Feb 18, 2021

ueshin reviewed Feb 18, 2021
