[SPARK-56113][PS] Improve pandas 3 string restoration in pandas-on-Spark #54926

Open
ueshin wants to merge 2 commits into apache:master from ueshin:issues/SPARK-56113/string_restoration

ueshin (Member) commented Mar 20, 2026

What changes were proposed in this pull request?

This PR updates string restoration in python/pyspark/pandas/data_type_ops/string_ops.py so string columns are restored with the pandas dtype carried in the internal field when converting back to pandas in pandas 3 environments.

This improves pandas 3 compatibility for string round-trips and also fixes downstream cases where restored string-related metadata could differ from pandas behavior.

Why are the changes needed?

pandas 3 is stricter about string dtype restoration and missing-value handling.

In pandas-on-Spark, converting string data back to pandas should preserve the intended pandas dtype instead of falling back to less precise restoration behavior. Without that, pandas 3 comparisons can fail even when the underlying values match.
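
To illustrate the comparison failure described above, here is a minimal sketch in plain pandas. It is not the code changed by this PR: `restore_string_series` and `same` are hypothetical helpers, and the `"string"` dtype simply stands in for whatever pandas dtype the internal field carries.

```python
import pandas as pd
from pandas.testing import assert_series_equal


def restore_string_series(values: pd.Series, target_dtype) -> pd.Series:
    """Hypothetical helper: cast restored values to the recorded pandas dtype."""
    return values.astype(target_dtype)


def same(left: pd.Series, right: pd.Series) -> bool:
    """Return True if pandas considers the two Series equal, dtype included."""
    try:
        assert_series_equal(left, right)
        return True
    except AssertionError:
        return False


# The dtype we assume was recorded for the column (an illustrative choice).
expected = pd.Series(["a", None, "c"], dtype="string")

# Values restored without dtype information fall back to object dtype.
raw = pd.Series(["a", None, "c"], dtype=object)

# Same underlying values, but the dtype mismatch makes the comparison fail.
before = same(raw, expected)

# Restoring with the recorded dtype makes the round-trip compare equal.
after = same(restore_string_series(raw, expected.dtype), expected)

print(before, after)
```

The comparison only passes once the restored Series carries the recorded dtype, which is the behavior this change targets for string columns.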

Does this PR introduce any user-facing change?

Yes. String columns converted back to pandas will behave more like pandas 3.

How was this patch tested?

Added related tests; the other existing tests should continue to pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

