@rok rok commented Sep 20, 2025

NOTE: This PR is currently being split into multiple smaller ones. DO NOT MERGE.

This proposes adding type annotations to pyarrow by adopting pyarrow-stubs into pyarrow. To do so, we copy the pyarrow-stubs stubfiles into arrow/python/pyarrow-stubs/, restructure them somewhat, and add more annotations. We remove docstrings from the annotations and provide a script that inserts docstrings into the stubfiles at wheel-build-time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for all project files and introduce a CI check to make sure all mypy, pyright and ty annotation checks pass (see python/pyproject.toml for any exceptions).

This PR:

  1. adds pyarrow-stubs into arrow/python/pyarrow-stubs/
  2. fixes pyarrow-stubs to pass ty, mypy and pyright checks
  3. adds ty, mypy and pyright checks to CI (crudely)
  4. adds a tool (update_stub_docstrings.py) to insert docstrings into stubfiles at wheel-build-time
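
As a rough illustration of item 4, a docstring-insertion pass over a stubfile could look like the sketch below. This is not the actual update_stub_docstrings.py from this PR: the function name `add_docstrings` and the `docs` mapping are assumptions made for the example, and it uses only the standard-library `ast` module (Python 3.9+ for `ast.unparse`).

```python
import ast


def add_docstrings(stub_source: str, docs: dict) -> str:
    """Return stub_source with docstrings from `docs` added (hypothetical helper).

    `docs` maps a bare function or class name to its docstring text.
    """
    tree = ast.parse(stub_source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        doc = docs.get(node.name)
        # Skip names we have no docstring for, and nodes that already have one.
        if doc is None or ast.get_docstring(node) is not None:
            continue
        doc_node = ast.Expr(value=ast.Constant(value=doc))
        body = node.body
        is_lone_ellipsis = (
            len(body) == 1
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and body[0].value.value is Ellipsis
        )
        # A stub body of `...` is replaced; any other body keeps its statements.
        node.body = [doc_node] if is_lone_ellipsis else [doc_node, *body]
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


stub = "def scalar(value) -> Expression: ...\n"
print(add_docstrings(stub, {"scalar": "Wrap a value in an Expression."}))
```

Two limitations of this sketch are worth keeping in mind: `ast.unparse` drops comments (including `# type: ignore`), and matching on the bare name gives a module-level function and a method of the same name the same docstring. A real tool would likely need qualified names and a formatting-preserving rewrite.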

@github-actions github-actions bot added the awaiting committer review label Sep 20, 2025
@rok rok changed the title from "[Python] Add type annotations to PyArrow" to "GH-32609: [Python] Add type annotations to PyArrow" Sep 20, 2025
@apache apache deleted a comment from github-actions bot Sep 20, 2025
@apache apache deleted a comment from github-actions bot Sep 20, 2025
@rok rok requested review from pitrou and raulcd September 22, 2025 10:30
@rok rok force-pushed the pyarrow-stubs-2 branch 5 times, most recently from b564265 to 127e741 September 22, 2025 23:56

@dangotbanned dangotbanned left a comment

Hey @rok, I come bearing unsolicited suggestions 😉

A lot of this is from 2 recent PRs that have had me battling the current stubs more

def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...


def scalar(value: bool | float | str) -> Expression: ...

Based on

    @staticmethod
    def _scalar(value):
        cdef:
            Scalar scalar
        if isinstance(value, Scalar):
            scalar = value
        else:
            scalar = lib.scalar(value)
        return Expression.wrap(CMakeScalarExpression(scalar.unwrap()))

The Expression version (pc.scalar) should accept the same types as pa.scalar, right?

Ran into it the other day here where I needed to add a cast

@rok (Member, Author):

I'm not sure what you are suggesting. Do you mean:

diff --git i/python/pyarrow-stubs/compute.pyi w/python/pyarrow-stubs/compute.pyi
index df660e0c0c..f005c5f552 100644
--- i/python/pyarrow-stubs/compute.pyi
+++ w/python/pyarrow-stubs/compute.pyi
@@ -84,7 +84,7 @@ _R = TypeVar("_R")
 def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...


-def scalar(value: bool | float | str) -> Expression: ...
+def scalar(value: Any) -> Expression: ...

Hmm, yeah I guess Any is what you have there so that could work.

But I think it would be more helpful to use something like this to start:
https://github.com/rok/arrow/blob/6a310149ed305d7e2606066f5d0915e9c23310f4/python/pyarrow-stubs/_stubs_typing.pyi#L50

PyScalar: TypeAlias = (bool | int | float | Decimal | str | bytes |
                       dt.date | dt.datetime | dt.time | dt.timedelta)

Then the snippet from (#47609 (comment)) seems to imply pa.Scalar is valid as well.
So maybe this would document it more clearly?

def scalar(value: PyScalar | lib.Scalar[Any] | None) -> Expression: ...
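
To make the proposed signature concrete, here is a minimal runnable sketch. `Scalar` and `Expression` below are hypothetical stand-ins for `pyarrow.lib.Scalar` and `pyarrow.compute.Expression`, so the example runs without pyarrow. The branching mirrors the Cython `_scalar` snippet quoted earlier in this thread: an existing `Scalar` is passed through unchanged and anything else is wrapped first.

```python
from __future__ import annotations

import datetime as dt
from decimal import Decimal
from typing import Any, Generic, TypeVar, Union

_T = TypeVar("_T")


class Scalar(Generic[_T]):
    """Stand-in for pyarrow.lib.Scalar (assumption, not the real class)."""

    def __init__(self, value: _T) -> None:
        self.value = value


class Expression:
    """Stand-in for pyarrow.compute.Expression (assumption, not the real class)."""

    def __init__(self, scalar: Scalar[Any]) -> None:
        self.scalar = scalar


# Same members as the PyScalar alias quoted from _stubs_typing.pyi.
PyScalar = Union[
    bool, int, float, Decimal, str, bytes,
    dt.date, dt.datetime, dt.time, dt.timedelta,
]


def scalar(value: PyScalar | Scalar[Any] | None) -> Expression:
    # Pass an existing Scalar through; wrap plain Python values first.
    wrapped = value if isinstance(value, Scalar) else Scalar(value)
    return Expression(wrapped)


existing = Scalar(42)
print(scalar(existing).scalar is existing)
print(scalar(dt.date(2025, 9, 20)).scalar.value)
```

With an annotation of this shape, a caller passing either a plain Python value or an already-built scalar should type-check without a cast, which is the situation described above.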

def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...

#45919 (reply in thread)

I wonder if the overloads can be generated instead of written out and maintained manually.

Took me a while to discover this without it being in the stubs 😅

Suggested change
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]: ...

I know this isn't accurate for Function itself, but it's the type returned by FunctionRegistry.get_function

If you wanted to be a bit fancier, maybe add some Generics into the mix?
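
One possible reading of the Generics idea, sketched with stand-in classes. None of these are the real pyarrow classes, and `KernelT` is a name assumed for the example. Parameterising `Function` by its kernel type lets each subclass narrow the element type of `kernels` instead of returning the four-way union.

```python
from __future__ import annotations

from typing import Generic, TypeVar


# Stand-ins for the kernel classes named in the suggested change.
class ScalarKernel: ...
class VectorKernel: ...
class ScalarAggregateKernel: ...
class HashAggregateKernel: ...


KernelT = TypeVar(
    "KernelT",
    ScalarKernel, VectorKernel, ScalarAggregateKernel, HashAggregateKernel,
)


class Function(Generic[KernelT]):
    def __init__(self, name: str, kernels: list[KernelT]) -> None:
        self._name = name
        self._kernels = list(kernels)

    @property
    def name(self) -> str:
        return self._name

    @property
    def num_kernels(self) -> int:
        return len(self._kernels)

    @property
    def kernels(self) -> list[KernelT]:
        return list(self._kernels)


class ScalarFunction(Function[ScalarKernel]):
    """A type checker sees `kernels` here as list[ScalarKernel]."""


class VectorFunction(Function[VectorKernel]):
    """A type checker sees `kernels` here as list[VectorKernel]."""


fn = ScalarFunction("min_element_wise", [ScalarKernel(), ScalarKernel()])
print(fn.name, fn.num_kernels)
```

Whether this fits depends on what `FunctionRegistry.get_function` can promise statically. If it can only return the base `Function`, the union return type from the suggested change may still be needed there.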

@rok

look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).

That would probably be more useful than the route I was going for here.

In python there's only the repr to work with, but there is quite a lot of information encoded in it

>>> import pyarrow.compute as pc
>>> pc.get_function("array_take").kernels[:10]
[VectorKernel<(primitive, integer) -> computed>,
 VectorKernel<(binary-like, integer) -> computed>,
 VectorKernel<(large-binary-like, integer) -> computed>,
 VectorKernel<(fixed-size-binary-like, integer) -> computed>,
 VectorKernel<(null, integer) -> computed>,
 VectorKernel<(Type::DICTIONARY, integer) -> computed>,
 VectorKernel<(Type::EXTENSION, integer) -> computed>,
 VectorKernel<(Type::LIST, integer) -> computed>,
 VectorKernel<(Type::LARGE_LIST, integer) -> computed>,
 VectorKernel<(Type::LIST_VIEW, integer) -> computed>]
>>> pc.get_function("min_element_wise").kernels[:10]
[ScalarKernel<varargs[uint8*] -> uint8>,
 ScalarKernel<varargs[uint16*] -> uint16>,
 ScalarKernel<varargs[uint32*] -> uint32>,
 ScalarKernel<varargs[uint64*] -> uint64>,
 ScalarKernel<varargs[int8*] -> int8>,
 ScalarKernel<varargs[int16*] -> int16>,
 ScalarKernel<varargs[int32*] -> int32>,
 ScalarKernel<varargs[int64*] -> int64>,
 ScalarKernel<varargs[float*] -> float>,
 ScalarKernel<varargs[double*] -> double>]
>>> pc.get_function("approximate_median").kernels
[ScalarAggregateKernel<(any) -> double>]
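
To show how much can be recovered from the repr alone, the sketch below parses strings of the shape printed above into structured fields. It assumes the repr format is exactly what this output shows, which is not a documented or stable API, and `KernelSignature` and `parse_kernel_repr` are names made up for the example.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class KernelSignature(NamedTuple):
    kind: str                # e.g. "VectorKernel"
    inputs: tuple[str, ...]  # input type matchers, as written in the repr
    output: str              # output type as written in the repr
    varargs: bool            # True for the "varargs[...]" form


# Shape assumed from the reprs above: Kind<inputs -> output>
_KERNEL_RE = re.compile(r"^(?P<kind>\w+)<(?P<inputs>.*) -> (?P<output>.+)>$")


def parse_kernel_repr(text: str) -> KernelSignature:
    match = _KERNEL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised kernel repr: {text!r}")
    raw = match["inputs"]
    varargs = raw.startswith("varargs[") and raw.endswith("]")
    # Drop the "varargs[...]" or "(...)" wrapper, then split on commas.
    inner = raw[len("varargs["):-1] if varargs else raw.strip("()")
    inputs = tuple(part.strip() for part in inner.split(",") if part.strip())
    return KernelSignature(match["kind"], inputs, match["output"], varargs)


for line in (
    "VectorKernel<(primitive, integer) -> computed>",
    "ScalarKernel<varargs[uint8*] -> uint8>",
    "ScalarAggregateKernel<(any) -> double>",
):
    print(parse_kernel_repr(line))
```

This only demonstrates the Python-side route. Extracting signatures at C++ registration time, as suggested in this thread, would avoid depending on a string format.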

rok commented Sep 30, 2025

Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!

rok commented Oct 2, 2025

Just a mental note: @pitrou suggested to look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).

@github-actions github-actions bot added the awaiting changes and awaiting change review labels and removed the awaiting committer review, awaiting changes and awaiting change review labels Oct 6, 2025
@rok rok force-pushed the pyarrow-stubs-2 branch from 596fd29 to 6a31014 October 6, 2025 17:09
@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Oct 6, 2025
@rok rok requested a review from raulcd November 10, 2025 19:34
@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 10, 2025
@rok rok force-pushed the pyarrow-stubs-2 branch 11 times, most recently from 04916ca to c9cf7e1 November 13, 2025 12:52
dangotbanned commented Nov 18, 2025

#47609 (comment)

I'll try to do another pass sometime next week and ask for a review on the discussions thread, but meanwhile feel free to do a pass if you can spare the time :D

Sorry for the delay on getting back to you @rok!

I've been putting in a lot of time working with pyarrow in (https://github.com/narwhals-dev/narwhals/pulls?q=sort%3Aupdated-desc+is%3Apr+%22%28expr-ir%29%2F%22+in%3Atitle) and collecting more issues along the way 🙂

Some fairly high-level things that might be worth checking:

I'll try to dive into some more specific cases soon in a review - so this is just homework for you if you wanted it for now 😉

rok commented Nov 18, 2025

Some fairly high-level things that might be worth checking:

I went through the list. ~4 were missing.

  • Does every pyarrow.compute function that is annotated with Expression actually support them?

You might be right, this remains my homework to check.

rok commented Nov 18, 2025

Removed scatter and inverse_permutation annotations as their options objects are not yet wrapped in Python. Opened #48167 to track progress.

rok commented Dec 22, 2025

I've split out the CI part into a separate PR and will proceed to split it down further to enable review.

rok commented Dec 22, 2025

I've added a second PR (#48622) to follow #48618. I'll wait for those to merge before splitting this further.

rebase and some minor work

dan's homework

minor post rebase change

lint

annotation fix

fix some annotations

fix path on macos

add type checking guidelines for developers

package stubs into wheels and test for presence

Add typechecking for Windows

Add typechecking for macos

Moving typechecks under 'Execute Docker Build' step

test for pyarrow

Review feedback

some fixes

remove some newlines

fixes

more fix

fix

cleanup

test

minor fix

fix CI

fix mypy

minor fixes

fix ty checks

WIP

WIP pyright for test_{pandas,scalars,schema,substrait}.py

pyright for test_{sparse_tensor,substrait,tensor,types,udf,without_numpy}.py and util.py

pyright for test_compute.py

pyright for test_dataset.py
pyright test_types.py

pyright work

yet further pyright work

further pyright work

Make pyright stricter

workaround for shadowed types module

bump python in pyright

Update python/pyproject.toml

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

misc

Update python/pyarrow-stubs/pyarrow/compute.pyi

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

ty

reduce mypy errors

experiment

fix pyright config

fixing missing-imports

Some changes

Update python/pyarrow-stubs/_compute.pyi

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

Apply suggestions from code review

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

minor

adding some ignores to pass more checks

Add CI check

Add utility for adding docstrings into annotations

Minor changes to pyarrow so some typechecks pass

Add pyarrow-stubs minus their docstrings
@rok rok marked this pull request as draft February 9, 2026 19:36
@rok rok force-pushed the pyarrow-stubs-2 branch from 1ce9c09 to 6e8b983 February 9, 2026 19:36

rok commented Feb 9, 2026

Rebased this on main to make it easier to split further into smaller PRs.

github-actions bot commented Feb 9, 2026

❌ GitHub issue #32609 could not be retrieved.

rok commented Feb 9, 2026

You can track progress on subissues of #32609 or PRs here.

3 participants