Balanced feature subsampling #851

Merged
bejaeger merged 30 commits into main from ben/balanced-feature-subsampling on Apr 14, 2026

Collaborator

@bejaeger bejaeger commented Apr 1, 2026

  • Refactors preprocessing pipeline creation; everything is now handled in TabPFNEnsemblePreprocessor.
  • Adds new feature subsampling methods.
  • Random feature subsampling remains the default.
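
For readers unfamiliar with the idea, here is a minimal sketch of how random and balanced feature subsampling could differ. It is not the code from this PR: the function names are hypothetical, and it assumes "balanced" means that every feature is used roughly equally often across ensemble members.

```python
import random
from collections import Counter


def random_subsets(n_features, max_features, n_estimators, seed=0):
    """Draw an independent uniform subsample of feature indices per estimator."""
    rng = random.Random(seed)
    k = min(n_features, max_features)
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_estimators)]


def balanced_subsets(n_features, max_features, n_estimators, seed=0):
    """Assumed meaning of "balanced": usage counts differ by at most one."""
    rng = random.Random(seed)
    k = min(n_features, max_features)
    usage = [0] * n_features
    subsets = []
    for _ in range(n_estimators):
        # Prefer the least-used features so far; break ties at random.
        order = sorted(range(n_features), key=lambda j: (usage[j], rng.random()))
        chosen = sorted(order[:k])
        for j in chosen:
            usage[j] += 1
        subsets.append(chosen)
    return subsets


def usage_counts(subsets):
    """Count how often each feature index appears across all subsets."""
    return Counter(j for subset in subsets for j in subset)
```

With 10 features, 4 features per estimator and 5 estimators, the balanced variant uses every feature exactly twice, whereas independent random draws can leave some features unused.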

@bejaeger bejaeger requested a review from a team as a code owner April 1, 2026 13:49
@bejaeger bejaeger requested review from oscarkey and removed request for a team April 1, 2026 13:49
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces balanced feature subsampling and refactors the preprocessing pipeline to be aware of feature budgets. Key changes include the addition of a num_added_features method across preprocessing steps to accurately calculate post-transformation feature counts and the migration of pipeline creation to the TabPFNEnsemblePreprocessor initialization. Feedback focuses on performance bottlenecks, specifically the O(N^2) complexity of feature budget calculations, memory inefficiencies caused by data slicing in the main process, and redundant object copies or instantiations that could impact high-dimensional data processing and fine-tuning speed.
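
To make the feature-budget idea concrete, here is a rough sketch of how a per-step `num_added_features` method could be used to find how many input features fit under a limit. Only the method name comes from the review summary. The step classes, the signature, and the helper functions are hypothetical. The binary search assumes the output width never decreases as the input width grows, which avoids re-scanning every candidate count.

```python
from dataclasses import dataclass


@dataclass
class AddConstantFeatures:
    """Hypothetical step that appends a fixed number of columns."""

    n_extra: int

    def num_added_features(self, n_in: int) -> int:
        return self.n_extra


@dataclass
class AppendTransformedCopy:
    """Hypothetical step that appends one transformed copy of every column."""

    def num_added_features(self, n_in: int) -> int:
        return n_in


def features_after_pipeline(steps, n_in):
    """Feature count after applying every step in order."""
    n = n_in
    for step in steps:
        n += step.num_added_features(n)
    return n


def max_input_features(steps, budget, n_available):
    """Largest input width <= n_available whose output width fits the budget.

    Returns 0 when no positive input width fits.
    """
    lo, hi = 0, n_available
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if features_after_pipeline(steps, mid) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For the pipeline `[AppendTransformedCopy(), AddConstantFeatures(3)]`, an input of n features becomes 2n + 3 features, so a budget of 500 admits at most 248 input features.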

Comment thread src/tabpfn/preprocessing/ensemble.py Outdated
Comment thread src/tabpfn/preprocessing/transform.py Outdated
Comment thread src/tabpfn/finetuning/data_util.py
Comment thread src/tabpfn/preprocessing/ensemble.py Outdated
Comment thread src/tabpfn/preprocessing/ensemble.py Outdated
Comment thread src/tabpfn/preprocessing/transform.py Outdated
@bejaeger bejaeger removed the request for review from oscarkey April 1, 2026 17:30
@bejaeger bejaeger requested a review from LeoGrin April 9, 2026 17:02
Collaborator

@LeoGrin LeoGrin left a comment

Looks great, thanks a lot!

Nit: maybe we could add an end-to-end test checking that our estimator works with balanced feature subsampling on and with more features than the max features per estimator?

Comment thread src/tabpfn/preprocessing/ensemble.py Outdated
Comment thread changelog/851.added.md Outdated
Comment thread src/tabpfn/preprocessing/ensemble.py Outdated
Comment thread tests/test_preprocessing/test_ensemble.py
Comment thread src/tabpfn/preprocessing/steps/reshape_feature_distribution_step.py Outdated
Comment thread src/tabpfn/preprocessing/ensemble.py Outdated
Comment thread src/tabpfn/preprocessing/transform.py Outdated
Comment thread src/tabpfn/preprocessing/ensemble.py
@bejaeger bejaeger enabled auto-merge April 14, 2026 07:05
@bejaeger bejaeger added this pull request to the merge queue Apr 14, 2026
Merged via the queue into main with commit 83eefb6 Apr 14, 2026
12 checks passed