Skip to content

fix: Pandas 3 compatibility - robust dtype checks and test fixes#885

Open
ankitlade12 wants to merge 28 commits intofeature-engine:mnt-pandas3from
ankitlade12:mnt-pandas3
Open

fix: Pandas 3 compatibility - robust dtype checks and test fixes#885
ankitlade12 wants to merge 28 commits intofeature-engine:mnt-pandas3from
ankitlade12:mnt-pandas3

Conversation

@ankitlade12
Copy link
Contributor

  • Fix UnboundLocalError in _variable_type_checks.py by initializing is_cat/is_dt
  • Add robust dtype checking using both is_object_dtype and is_string_dtype
  • Update find_variables.py with same robust logic for consistency
  • Fix warning count assertions in encoder tests (Pandas 3 adds extra deprecation warnings)
  • Fix floating point precision assertion in recursive feature elimination test
  • Apply ruff formatting and fix linting errors
  • All 1900 tests passing

- Fix UnboundLocalError in _variable_type_checks.py by initializing is_cat/is_dt
- Add robust dtype checking using both is_object_dtype and is_string_dtype
- Update find_variables.py with same robust logic for consistency
- Fix warning count assertions in encoder tests (Pandas 3 adds extra deprecation warnings)
- Fix floating point precision assertion in recursive feature elimination test
- Apply ruff formatting and fix linting errors
- All 1900 tests passing
@codecov
Copy link

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 98.27586% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (mnt-pandas3@de4d663). Learn more about missing BASE report.

Files with missing lines Patch % Lines
feature_engine/creation/math_features.py 94.44% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             mnt-pandas3     #885   +/-   ##
==============================================
  Coverage               ?   98.20%           
==============================================
  Files                  ?      113           
  Lines                  ?     4853           
  Branches               ?      770           
==============================================
  Hits                   ?     4766           
  Misses                 ?       55           
  Partials               ?       32           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankitlade12

Thank you so much for fixing the pandas compatibility issue. Much appreciated :)

Can it be that you applied flake8 to the entire code base and then committed all modified files? We need to remove all files unrelated to the pandas fix changes.

For some reason flake8 does different things on different computers. I generally suggest applying flake8 to modified files only, to avoid this kind of behavior.

Could you update the PR leaving only the changed files that solve the pandas issue?

Thank you very much!

@ankitlade12
Copy link
Contributor Author

Thank you so much for fixing the pandas compatibility issue. Much appreciated :)

Can it be that you applied flake8 to the entire code base and then committed all modified files? We need to remove all files unrelated to the pandas fix changes.

For some reason flake8 does different things on different computers. I generally suggest applying flake8 to modified files only, to avoid this kind of behavior.

Could you update the PR leaving only the changed files that solve the pandas issue?

Thank you very much!

@ankitlade12
Copy link
Contributor Author

Hi @solegalli,

Thank you for the feedback! You're right—I accidentally applied styling fixes to the entire codebase. I've now cleaned up the PR and reverted all unrelated changes.

The PR now only includes the 18 files essential for the Pandas 3 compatibility mission. These cover:

  • Core logic fixes in dataframe_checks.py, similarity_encoder.py, and variable handling.
  • Silencing a new Pandas4Warning in the forecasting transformers (lag_features.py and window_features.py).
  • Necessary infrastructure updates in tox.ini and .circleci/config.yml to support the new testing environments.
  • Restored compatibility logic in test_mean_encoder.py, test_ordinal_encoder.py, and test_woe_encoder.py to match Pandas 3's warning and dtype behavior.

All relevant tests are passing locally. Please let me know if there's anything else you'd like me to adjust!



def test_string_dtype_with_pd_na():
# Test StringDtype with pd.NA to hit "<NA>" branch in transform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this solving a pandas 3 release fail? or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was added for coverage purposes, not to fix a Pandas 3 failure.

The transform() method has code that explicitly handles "<NA>" strings (which occur when pd.NA is cast to string). The existing tests only covered np.nan, so this test ensures we also exercise the pd.NA path.

If you'd prefer, I can remove it and accept lower coverage on that branch. Let me know!

X = encoder.fit_transform(df)
assert (X.isna().sum() == 0).all(axis=None)
assert "nan" in encoder.encoder_dict_["var_A"]
assert "<NA>" in encoder.encoder_dict_["var_A"]
Copy link
Collaborator

@solegalli solegalli Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need to assert that the resulting dataframe is the one we expect.

I am a bit confused, does this resolve a pandas 3 release backward compatibility issue? Or it's enhancing the functionality of stringsimilarity? If the latter, we need to pass all related files to a new PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right to call this out. Let me clarify:

  • Pandas 3 fix: The change from astype(object).fillna().astype(str) to astype(str).mask(isna(), "") is the Pandas 3 compatibility fix (handles Categorical dtype and is faster).

  • Coverage tests: The new tests (test_string_dtype_with_pd_na, test_string_dtype_with_literal_nan_strings) and the "nan"/"<NA>" handling are for robustness/coverage.

I'll add the proper dataframe assertions to verify the expected output and keep everything in the same PR for simplicity.

import numpy as np
import pandas as pd
import pytest
from unittest.mock import patch
Copy link
Collaborator

@solegalli solegalli Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I want unittest to be a dependency. Could we find a different solution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unittest.mock is part of Python's standard library, not an external package, so it doesn't add any dependencies. However, if you prefer using pytest's native approach, I can switch to using a conftest.py autouse fixture with monkeypatch.

The alternative would look like:

@pytest.fixture(autouse=True, scope="session")
def mock_california_housing(monkeypatch):
    monkeypatch.setattr(
        "sklearn.datasets.fetch_california_housing",
        mock_fetch_california_housing
    )

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankitlade12

This is an incredible amount of work! Thank you so much! I really appreciate it.

I went through all the changes and had a few questions about some of them. A few changes seem to address things beyond the pandas 3 release; I think those would be better handled in a separate PR.

For tests where results differ by pandas version, it would be great to explicitly assert the expected behavior for < 3 versus >= 3, so we can clean that up later.

For now, we should keep compatibility with older versions. After this is merged, I’ll add tests for older pandas versions on CircleCI, and we’ll need those to pass as well.

…on for init params - make tests conditional on pandas version - restore encoder_dict_ assertion
@ankitlade12 ankitlade12 requested a review from solegalli February 4, 2026 16:15
@ankitlade12
Copy link
Contributor Author

Hi @solegalli,

I have updated all the changes as requested across the PRs. Once the pandas-related compatibility issues are fully sorted out, we can proceed to merge the two branches (PR #880 and PR #879).

Thanks for the thorough review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants