Skip to content

Enabling binary operations with list-like Python objects.#2054

Open
itholic wants to merge 6 commits intodatabricks:masterfrom
itholic:series_op
Open

Enabling binary operations with list-like Python objects.#2054
itholic wants to merge 6 commits intodatabricks:masterfrom
itholic:series_op

Conversation

@itholic
Copy link
Copy Markdown
Contributor

@itholic itholic commented Feb 15, 2021

So far, Koalas doesn't support list-like Python objects for Series binary operations.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64

>>> kser + [10, 20, 30, 40, 50, 60]
Traceback (most recent call last):
...

This PR enables it.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64
>>> kser + [10, 20, 30, 40, 50, 60]
0    11
1    22
2    33
3    44
4    55
5    66
Name: x, dtype: int64
>>> kser - [10, 20, 30, 40, 50, 60]
0    -9
1   -18
2   -27
3   -36
4   -45
5   -54
Name: x, dtype: int64
>>> kser * [10, 20, 30, 40, 50, 60]
0     10
1     40
2     90
3    160
4    250
5    360
Name: x, dtype: int64
>>> kser / [10, 20, 30, 40, 50, 60]
0    0.1
1    0.1
2    0.1
3    0.1
4    0.1
5    0.1
Name: x, dtype: float64

ref #2022 (comment)

@itholic itholic marked this pull request as draft February 15, 2021 13:22
Comment thread databricks/koalas/base.py
else:
raise TypeError("date subtraction can only be applied to date series.")
return column_op(Column.__rsub__)(self, other)
return column_op(lambda left, right: right - left)(self, other)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Column.__rsub__ doesn't support pyspark.sql.column.Column for second parameter.

>>> kdf = ks.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
>>> sdf = kdf.to_spark()
>>> col1 = sdf.A
>>> col2 = sdf.B
>>> Column.__rsub__(col1, col2)
Traceback (most recent call last):
...
TypeError: Column is not iterable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does support:

>>> Column.__rsub__(df.id, 1)
Column<'(1 - id)'>

It doesn't work in your case above because the instance is Spark column. In practice, that wouldn't happen because it will only be called when the first operand doesn't know how to handle Spark column e.g.) 1 - df.id.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it cause any exception?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use column_op(Column.__rsub__)(self, other) as it is, it raises TypeError: Column is not iterable for the case below.

>>> kser = ks.Series([1, 2, 3, 4])
>>> [10, 20, 30, 40] - kser
Traceback (most recent call last):
...
TypeError: Column is not iterable

Copy link
Copy Markdown
Contributor

@ueshin ueshin Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this case must be handled in lines 490-492. We can move back to Column.__rsub__.

@itholic itholic marked this pull request as ready for review February 15, 2021 14:45
# other = tuple with the different length
other = (np.nan, 1, 3, 4, np.nan)
with self.assertRaisesRegex(
ValueError, "operands could not be broadcast together with shapes"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message looks weird. Is it matched with pandas'?

Copy link
Copy Markdown
Contributor Author

@itholic itholic Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original error message from pandas looks like :

ValueError: operands could not be broadcast together with shapes (4,) (8,)

@ueshin , maybe we don't include the (4,) (8,) part since it requires to compute length of both objects which can be expensive ??

Comment thread databricks/koalas/series.py Outdated
@codecov-io
Copy link
Copy Markdown

codecov-io commented Feb 18, 2021

Codecov Report

Merging #2054 (0fd3666) into master (87f5b18) will decrease coverage by 1.44%.
The diff coverage is 91.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2054      +/-   ##
==========================================
- Coverage   94.71%   93.26%   -1.45%     
==========================================
  Files          54       54              
  Lines       11503    11735     +232     
==========================================
+ Hits        10895    10945      +50     
- Misses        608      790     +182     
Impacted Files Coverage Δ
databricks/koalas/utils.py 93.66% <75.00%> (-1.71%) ⬇️
databricks/koalas/base.py 97.35% <96.00%> (+0.06%) ⬆️
databricks/koalas/indexes/base.py 97.43% <100.00%> (ø)
databricks/koalas/usage_logging/__init__.py 26.66% <0.00%> (-65.84%) ⬇️
databricks/koalas/usage_logging/usage_logger.py 47.82% <0.00%> (-52.18%) ⬇️
databricks/koalas/__init__.py 80.00% <0.00%> (-12.00%) ⬇️
databricks/conftest.py 91.30% <0.00%> (-8.70%) ⬇️
databricks/koalas/accessors.py 86.43% <0.00%> (-7.04%) ⬇️
databricks/koalas/spark/accessors.py 88.67% <0.00%> (-6.29%) ⬇️
databricks/koalas/typedef/typehints.py 91.06% <0.00%> (-2.75%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87f5b18...0fd3666. Read the comment docs.

return left.isNull() | right.isNull() | comp(left, right)


def check_same_length(left: "IndexOpsMixin", right: Union[list, tuple]):
Copy link
Copy Markdown
Contributor

@xinrong-meng xinrong-meng Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice utility! The function name might be misleading considering its return type. Would it be possible to annotate the return type or rename the function?

Copy link
Copy Markdown
Contributor

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could you try to reduce the amount of test codes by using loop or parameterizing if there is no difference except for the operators?

Comment on lines +832 to +833
if LooseVersion(pd.__version__) < LooseVersion("1.2.0"):
right = pd.Index(right, name=pindex_ops.name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with pandas<1.2?
Seems like it's working with pandas >= 1.0 in the test?

Actually it works:

>>> pd.__version__
'1.0.5'
>>> pd.Index([1,2,3]) + [4,5,6]
Int64Index([5, 7, 9], dtype='int64')
>>> [4,5,6] + pd.Index([1,2,3])
Int64Index([5, 7, 9], dtype='int64')

Copy link
Copy Markdown
Contributor Author

@itholic itholic Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh,,, seems like It doesn't work for only rmod in pandas < 1.2.

>>> [4, 5, 6] % pd.Index([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Int64Index' object has no attribute 'rmod'

Let me address for only this case.

Thanks!

Comment on lines +838 to +841
raise ValueError(
"operands could not be broadcast together with shapes ({},) ({},)".format(
len_pindex_ops, len_right
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can show the length of left if it's less than the length of right, but if it's greater, the actual length is unknown.

Comment thread databricks/koalas/base.py
def __add__(self, other) -> Union["Series", "Index"]:
if isinstance(other, (list, tuple)):
pindex_ops, other = check_same_length(self, other)
return ks.from_pandas(pindex_ops + other) # type: ignore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we avoid using # type: ignore as possible? We can use cast instead.

Comment thread databricks/koalas/base.py
Comment on lines +475 to +476
if isinstance(other, (list, tuple)):
other = ks.Index(other, name=self.name) # type: ignore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed?

Comment thread databricks/koalas/base.py
else:
raise TypeError("date subtraction can only be applied to date series.")
return column_op(Column.__rsub__)(self, other)
return column_op(lambda left, right: right - left)(self, other)
Copy link
Copy Markdown
Contributor

@ueshin ueshin Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this case must be handled in lines 490-492. We can move back to Column.__rsub__.

Comment thread databricks/koalas/base.py
Comment on lines +599 to +600
if isinstance(other, (list, tuple)):
other = ks.Index(other, name=self.name) # type: ignore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed?

@xinrong-meng
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants