diff --git a/capabilities/fine-tuning.mdx b/capabilities/fine-tuning.mdx index 989b89f..0d18081 100644 --- a/capabilities/fine-tuning.mdx +++ b/capabilities/fine-tuning.mdx @@ -1,84 +1,222 @@ --- -title: "Fine Tuning" -description: "Optimize TabPFN models to your own data with fine-tuning." +title: "Fine-Tuning" +description: "Adapt TabPFN's pretrained foundation model to your data with gradient-based fine-tuning." --- -Fine-tuning enables you to **optimize TabPFN’s pretrained foundation models** to your own datasets. It works by **updating the pretrained transformer parameters** by training with a user-provided dataset using gradient descent. This process retains TabPFN’s learned priors while aligning it more closely with the target data distribution. +Fine-tuning updates TabPFN's pretrained transformer parameters using gradient descent on your dataset. This retains TabPFN's learned priors while aligning the model more closely with your target data distribution. You can fine-tune both: -- [`TabPFNClassifier`](https://github.com/PriorLabs/tabpfn/blob/main/src/tabpfn/classifier.py) – for classification tasks -- [`TabPFNRegressor`](https://github.com/PriorLabs/tabpfn/blob/main/src/tabpfn/regressor.py) – for regression tasks +- [`FinetunedTabPFNClassifier`](https://github.com/PriorLabs/tabpfn/blob/main/src/tabpfn/finetuning/finetuned_classifier.py) — for classification tasks +- [`FinetunedTabPFNRegressor`](https://github.com/PriorLabs/tabpfn/blob/main/src/tabpfn/finetuning/finetuned_regressor.py) — for regression tasks + +## When to Fine-Tune + +Fine-tuning is not always necessary. TabPFN's in-context learning already adapts to your data at inference time. Fine-tuning adds value in specific scenarios: + +### Good Candidates for Fine-Tuning + + + + Your data represents a distribution not well-covered by TabPFN's pretraining priors — e.g., molecular properties, specialized sensor data, or domain-specific financial instruments. 
+ + + You have a stable schema that you'll predict on repeatedly. Fine-tuning amortizes the upfront cost across many future predictions. + + + With more data, fine-tuning can learn meaningful adaptations without overfitting. + + + You have a family of related datasets (e.g., multiple experiments, regional variants) and want to fine-tune a single model across them. + + + +### When Fine-Tuning is Less Likely to Help + +- On very small datasets (< 1000 rows), the risk of overfitting outweighs adaptation benefits. Try [feature engineering](/tips-and-tricks#feature-engineering) or [AutoTabPFN ensembles](/extensions/post-hoc-ensembles) instead. +- If baseline TabPFN is already within a few percent of your target metric, the simpler approaches in [Tips & Tricks](/tips-and-tricks) often close the gap with less effort. +- On datasets with gradual temporal distribution shifts and many features, fine-tuning can be less stable. Make sure your train/validation split respects the time ordering. + +### Decision Flowchart + + + + Evaluate default `TabPFNClassifier` or `TabPFNRegressor` on your task. + + + Apply [feature engineering, metric tuning, and preprocessing tuning](/tips-and-tricks) — these are faster to iterate on. + + + If you need more, try [AutoTabPFN ensembles](/extensions/post-hoc-ensembles) or [hyperparameter optimization](/extensions/hpo). + + + If performance has plateaued and you have sufficient data (1000+ rows), fine-tuning can push past the ceiling by adapting the model's internal representations. + + -Fine-tuning can help especially when: +## Getting Started -- Your data represents an edge case or niche distribution not well-covered by TabPFN’s priors. -- You want to specialize the model for a single domain (e.g., healthcare, finance, IoT sensors) +Fine-tuning shares the same interface as `TabPFNClassifier` and `TabPFNRegressor`. -**Recommended setup** +### 1. Prepare Your Dataset -Fine-tuning requires **GPU acceleration** for efficient training. 
+Load and split your data into train and test sets. Use a proper validation strategy: for time-dependent data, use temporal splits rather than random splits. -## Getting Started +```python +from sklearn.model_selection import train_test_split + +X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42 +) +``` + +### 2. Configure and Train + + + +```python Classifier +from tabpfn.finetuning import FinetunedTabPFNClassifier + +finetuned_clf = FinetunedTabPFNClassifier( + device="cuda", + epochs=30, + learning_rate=1e-5, +) + +finetuned_clf.fit(X_train, y_train) +``` + +```python Regressor +from tabpfn.finetuning import FinetunedTabPFNRegressor + +finetuned_reg = FinetunedTabPFNRegressor( + device="cuda", + epochs=30, + learning_rate=1e-5, +) + +finetuned_reg.fit(X_train, y_train) +``` + + + +By default, fine-tuning splits off 10% of the training data for validation and uses early stopping (patience of 8 epochs). You can also provide your own validation set, which is useful for temporal data or other cases where a random split isn't appropriate: + +```python +finetuned_clf.fit(X_train, y_train, X_val=X_val, y_val=y_val) +``` + +### 3. Predict + + + +```python Classifier +y_pred = finetuned_clf.predict(X_test) +y_pred_proba = finetuned_clf.predict_proba(X_test) +``` + +```python Regressor +y_pred = finetuned_reg.predict(X_test) +``` + + + +## Hyperparameters + +### Core Parameters + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `epochs` | `30` | Number of fine-tuning epochs. More epochs allow deeper adaptation but risk overfitting. | +| `learning_rate` | `1e-5` | Step size for gradient updates. Lower values are safer but slower to converge. | +| `device` | `"cuda"` | GPU is strongly recommended. Fine-tuning on CPU is very slow. | + +### Tuning Guidelines + +**Learning rate:** +- Start with `1e-5` (the default). This is conservative and preserves pretrained knowledge. 
+- For larger datasets (10k+ rows), you can try `3e-5` to `1e-4` for faster convergence. +- If you see training loss spike or diverge, reduce the learning rate. + +**Epochs:** +- `10–30` epochs is a good starting range for most datasets. +- For high-accuracy tasks where you're fine-tuning carefully, use more epochs (50–100) with a lower learning rate to allow gradual adaptation without destroying pretrained representations. +- Monitor validation loss to detect overfitting — stop if validation performance degrades. + + + Fine-tuning requires GPU acceleration. While it will run on CPU, training times will be impractical for most use cases. + + +## Multi-GPU Fine-Tuning + +Fine-tuning supports multi-GPU training via PyTorch DDP (Distributed Data Parallel). This is auto-detected when launched with `torchrun`: -The fine-tuning process is similar for classifiers and regressors and shares the same interface as the standard `TabPFNClassifier` and `TabPFNRegressor` classes. +```bash +torchrun --nproc-per-node=4 your_finetuning_script.py +``` -1. **Prepare your dataset:** Load and split your data into a train and test set. -2. **Configure your model:** Initialize a [`FinetunedTabPFNClassifier`](https://github.com/PriorLabs/tabpfn/blob/main/src/tabpfn/finetuning/finetuned_classifier.py) or [`FinetunedTabPFNRegressor`](https://github.com/PriorLabs/tabpfn/blob/main/src/tabpfn/finetuning/finetuned_regressor.py) with your desired finetuning hyperparameters. +No code changes are needed. The DDP setup is handled internally based on the `LOCAL_RANK` environment variable that `torchrun` sets. Note that `.fit()` should only be called once per `torchrun` session. - +## How It Works - ```python Classifier - finetuned_clf = FinetunedTabPFNClassifier( - device="cuda", - epochs=30, - learning_rate=1e-5, - ) - ``` +TabPFN performs in-context learning: during inference, it processes both training data and test samples in a single forward pass, using attention to identify relevant patterns. 
Fine-tuning adapts the transformer's weights so that the attention mechanism more accurately reflects the similarity structure of your specific data. - - ```python Regressor - finetuned_reg = FinetunedTabPFNRegressor( - device="cuda", - epochs=30, - learning_rate=1e-5, - ) - ``` +Concretely, after fine-tuning: +- The query representations of test samples and key representations of training samples produce dot products that better reflect their target similarity. +- This allows the fine-tuned model to more appropriately weight relevant in-context samples when making predictions. - +The fine-tuning process decouples the preprocessing pipeline to generate transformed tensors that mirror the preprocessing configurations used during inference, ensuring the model optimizes on the exact same data variations it encounters when making predictions. +## Best Practices -3. **Run fit on your train set:** This will run the finetuning training loop for the specified number of epochs. + + + Before fine-tuning, establish a baseline with the default `TabPFNClassifier` or `TabPFNRegressor`. Fine-tuning should measurably improve on this baseline — if it doesn't, the simpler model is preferable. + - + + Split a held-out validation set and monitor performance across epochs. For time-series or temporal data, use a temporal split rather than random cross-validation. + - ```python Classifier - finetuned_clf.fit(X_train, y_train) - ``` + + Begin with the defaults (`epochs=30`, `learning_rate=1e-5`). Only increase aggressiveness if you see clear room for improvement without signs of overfitting. + - - ```python Regressor - finetuned_reg.fit(X_train, y_train) - ``` + + Fine-tuning and [feature engineering](/tips-and-tricks#feature-engineering) are complementary. Good features make fine-tuning more effective by giving the model better signal to adapt to. + - + + With fewer than ~1000 rows, fine-tuning can overfit quickly. 
Use fewer epochs, a lower learning rate, or consider whether [AutoTabPFN ensembles](/extensions/post-hoc-ensembles) might be more appropriate. + + +## Enterprise Fine-Tuning -5. **Make predictions with the finetuned model:** +For organizations with proprietary datasets, Prior Labs offers an enterprise fine-tuning program that includes: - +- Fine-tuning on your organization's data corpus for a customized, high-performance model +- Support for fine-tuning across collections of related datasets +- Optimized training infrastructure - ```python Classifier - y_pred_proba = finetuned_clf.predict_proba(X_test) - ``` + + Learn more about fine-tuning TabPFN for your organization. + - - ```python Regressor - y_pred = finetuned_reg.predict(X_test) - ``` +## Related - + + + Quick wins to try before fine-tuning. + + + Automated ensembling as an alternative to fine-tuning. + + + Automated search over TabPFN's hyperparameter space. + + See more examples and fine-tuning utilities in our TabPFN GitHub repository. - \ No newline at end of file + diff --git a/docs.json b/docs.json index 68cb2aa..b6ea8b3 100644 --- a/docs.json +++ b/docs.json @@ -81,6 +81,7 @@ "quickstart", "models", "best-practices", + "tips-and-tricks", "faq" ] }, diff --git a/tips-and-tricks.mdx b/tips-and-tricks.mdx new file mode 100644 index 0000000..33062eb --- /dev/null +++ b/tips-and-tricks.mdx @@ -0,0 +1,236 @@ +--- +title: "Tips & Tricks" +description: "Practical strategies to improve TabPFN performance beyond the default configuration." +--- + +TabPFN works well out of the box, handles many things natively that traditional ML pipelines require and we recommend to feed in the data as raw as possible as additional processing can hurt performance. +We suggest avoiding additional scaling with `StandardScaler` / `MinMaxScaler`, imputation of missing values, or one-hot encoding of categoricals. + + +Beyond the default settings, there are several strategies you can use to potentially push performance further. 
This guide covers feature engineering, feature selection, preprocessing configuration, and common pitfalls to avoid. + +## Feature Engineering + +Feature engineering is one of the most impactful ways to improve TabPFN's performance. The goal is to encode domain knowledge that TabPFN cannot learn from raw columns alone. + +### Domain-Specific Features + +Create features that capture known relationships in your data: + +- **Ratios:** `price / area`, `revenue / headcount` +- **Interactions:** `weight / height**2` (BMI), `voltage * current` (power) +- **Group aggregations:** mean, count, or standard deviation of a numeric column grouped by a categorical (e.g., average spend per customer segment) + +### Datetime Features + +TabPFN cannot interpret raw datetime objects. Extract structured features instead: + +```python +import numpy as np + +df["year"] = df["date"].dt.year +df["month"] = df["date"].dt.month +df["dayofweek"] = df["date"].dt.dayofweek +df["hour"] = df["date"].dt.hour + +# Cyclical encoding for periodic features +df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12) +df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12) +``` + +For datasets with a time dimension, also consider adding a **running index feature** (sequential 0, 1, 2, ...) to help TabPFN detect trends. + + + The TabPFN API automatically detects and embeds date features. This manual extraction is primarily needed when using the local package. + + +### Text and String Features + +The best approach depends on cardinality and semantic content: + +- **Low cardinality:** Feed directly to TabPFN, which auto-encodes strings as categoricals +- **Medium/High cardinality:** Use `CountVectorizer` or `TfidfVectorizer` with dimensionality reduction (PCA or TruncatedSVD) +- **Semantic content:** Use the TabPFN API, which automatically handles semantic text encoding. + +## Feature Selection + +When your dataset has many features (especially beyond 500), feature selection can improve both performance and speed. 
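One simple strategy is greedy backward elimination: drop each feature in turn and keep the drop if cross-validated performance does not degrade. Below is a minimal sketch using scikit-learn's `LogisticRegression` as a fast stand-in scorer — any estimator with `fit`/`predict` works the same way, so you could substitute `TabPFNClassifier` on a GPU:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

def greedy_backward_selection(X, y, estimator):
    """Drop one feature at a time; keep each drop that does not hurt CV score."""
    kept = list(range(X.shape[1]))
    best = cross_val_score(estimator, X[:, kept], y, cv=3).mean()
    improved = True
    while improved and len(kept) > 1:
        improved = False
        for feature in list(kept):
            trial = [f for f in kept if f != feature]
            score = cross_val_score(estimator, X[:, trial], y, cv=3).mean()
            if score >= best:  # feature is redundant or noisy -> drop it
                best, kept, improved = score, trial, True
                break
    return kept, best

kept, score = greedy_backward_selection(X, y, LogisticRegression(max_iter=1000))
print(f"Kept {len(kept)}/{X.shape[1]} features, CV accuracy {score:.3f}")
```

Because each accepted drop removes exactly one feature, the loop runs at most `n_features` passes; the cost is dominated by the repeated cross-validation calls, which is why this approach suits smaller datasets.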
### Why It Helps + +TabPFN uses transformer attention over all features. Irrelevant or noisy features dilute the model's attention budget and can reduce predictive power, especially as feature count grows. + +### Approaches + +**Greedy feature selection** — remove features one at a time and keep each removal that does not hurt performance. This works particularly well on smaller datasets, where the repeated evaluations are cheap. + +**Mutual information filtering** — rank features by mutual information with the target and keep the top-k: + +```python +from sklearn.feature_selection import mutual_info_classif, SelectKBest + +selector = SelectKBest(mutual_info_classif, k=50) +X_train_selected = selector.fit_transform(X_train, y_train) +X_test_selected = selector.transform(X_test) +``` + +**PCA / TruncatedSVD** — reduce dimensionality while retaining variance: + +```python +from sklearn.decomposition import PCA + +pca = PCA(n_components=50) +X_train_reduced = pca.fit_transform(X_train) +X_test_reduced = pca.transform(X_test) +``` + +## Tuning Preprocessing Transforms + +TabPFN's internal preprocessing pipeline is one of the most powerful tuning levers. Each estimator in the ensemble cycles through a list of preprocessing configurations, creating diversity. + +### PREPROCESS_TRANSFORMS + +Control how features are transformed before being fed to the transformer. 
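The cycling behavior matters when you supply a custom list: estimator `i` in the ensemble receives configuration `i % len(configs)`. Here is a pure-Python sketch of that assignment, using plain dicts with the field names documented in this section — this illustrates the round-robin logic only, not TabPFN's internal configuration objects:

```python
# Hypothetical list of preprocessing configurations; field names follow the
# PREPROCESS_TRANSFORMS options table, but these dicts are illustrative only.
preprocess_transforms = [
    {"name": "quantile_uni", "categorical_name": "ordinal_shuffled", "append_original": "auto"},
    {"name": "safepower", "categorical_name": "onehot", "append_original": False},
    {"name": "none", "categorical_name": "numeric", "append_original": False},
]

# Each of the (default) 8 estimators cycles through the list in order.
n_estimators = 8
assignment = [preprocess_transforms[i % len(preprocess_transforms)] for i in range(n_estimators)]

for i, cfg in enumerate(assignment):
    print(f"estimator {i}: {cfg['name']}")
```

With three configs and eight estimators the list wraps around, so some estimators share preprocessing; supplying eight distinct configs gives every estimator its own transform and maximizes ensemble diversity.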
+ +#### Configuration Options + +| Field | Default | Options | +|-------|---------|---------| +| `name` | (required) | `"quantile_uni"`, `"squashing_scaler_default"`, `"safepower"`, `"quantile_uni_coarse"`, `"kdi"`, `"robust"`, `"none"` | +| `categorical_name` | `"none"` | `"none"`, `"numeric"`, `"onehot"`, `"ordinal"`, `"ordinal_shuffled"`, `"ordinal_very_common_categories_shuffled"` | +| `append_original` | `False` | `True`, `False`, `"auto"` | +| `max_features_per_estimator` | `500` | int — subsamples features if above this limit | +| `global_transformer_name` | `None` | `None`, `"svd"`, `"svd_quarter_components"` | + + + For optimal diversity, use as many different preprocessing transforms as you have estimators (default 8). Each estimator cycles through the list. + + +### Target Transforms (Regression) + +For regression tasks, you can control how the target variable `y` is transformed. This is especially useful for skewed targets: + +```python +from tabpfn import TabPFNRegressor + +model = TabPFNRegressor( + inference_config={ + "REGRESSION_Y_PREPROCESS_TRANSFORMS": ( + "none", + "safepower", + "quantile_norm", + "quantile_uni", + "1_plus_log" + ), + }, +) +``` + +| Transform | When to Use | +|-----------|-------------| +| `"none"` | Symmetric, well-behaved targets | +| `"safepower"` | Skewed targets (handles negatives) | +| `"quantile_norm"` | Heavily skewed or multi-modal targets | +| `"quantile_uni"` | Alternative to `quantile_norm` | +| `"1_plus_log"` | Non-negative targets with large range | + +Adding more transforms to the tuple increases ensemble diversity, which helps when the target distribution is non-trivial. 
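To see why these transforms help, consider what a log-style transform (in the spirit of `1_plus_log`; the internal implementation may differ) does to a heavily right-skewed target: it compresses the long tail into a far more symmetric distribution, and the exact inverse recovers the original scale at prediction time. An illustrative numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)  # heavily right-skewed target

def skewness(a):
    """Sample skewness: third central moment over variance^(3/2)."""
    centered = a - a.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

y_transformed = np.log1p(y)           # forward transform: log(1 + y)
y_restored = np.expm1(y_transformed)  # exact inverse, applied to predictions

print(f"skew before: {skewness(y):.2f}, after: {skewness(y_transformed):.2f}")
```

The transformed target's skewness drops by an order of magnitude, which makes the regression head's job much easier on long-tailed targets such as prices or counts.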
+ +### Other Inference Settings + +```python +model = TabPFNClassifier( + inference_config={ + "POLYNOMIAL_FEATURES": "no", # "no", int, or "all" for O(n^2) interactions + "FINGERPRINT_FEATURE": True, # hash-based row identifier + "OUTLIER_REMOVAL_STD": "auto", # "auto" (12.0), None, or float + "SUBSAMPLE_SAMPLES": None, # None, int, float, or list + }, +) +``` + +- **`POLYNOMIAL_FEATURES`**: Generates interaction features. Can help when interactions matter but increases feature count quadratically. +- **`FINGERPRINT_FEATURE`**: Adds a hash-based row identifier. Useful by default; try disabling if you have very few features. +- **`OUTLIER_REMOVAL_STD`**: Removes extreme outliers before fitting. Lower values are more aggressive. +- **`SUBSAMPLE_SAMPLES`**: Subsample training rows for faster iteration during experimentation. + +## Tuning Model Parameters + +### softmax_temperature + +Controls prediction sharpness (classification only): + +- **Lower values** (e.g., `0.7`): sharper, more confident predictions — useful when accuracy is already high +- **Higher values** (e.g., `1.2`): softer, more calibrated predictions — useful when probability calibration matters + +```python +model = TabPFNClassifier(softmax_temperature=0.8) +``` + + + If you use `tuning_config={"calibrate_temperature": True}`, the temperature is tuned automatically and overrides this value. + + +### Metric Tuning + +For metrics that are sensitive to decision thresholds (F1, balanced accuracy, precision, recall), use the built-in [metric tuning](/capabilities/metric-tuning): + +```python +model = TabPFNClassifier( + eval_metric="f1", + tuning_config={ + "calibrate_temperature": True, + "tune_decision_thresholds": True, + }, +) +``` + +### Handling Imbalanced Data + +- Set `balance_probabilities=True` as a quick heuristic for imbalanced datasets +- For more control, use `eval_metric="balanced_accuracy"` with threshold tuning + + + `balance_probabilities` does not always help. 
In some cases it can balance predictions at the cost of overall predictive power. Test both settings. + + +## Escalation Path + +When the default TabPFN does not meet your needs, try these approaches in roughly this order: + + + + Add domain features, extract datetime components, encode text meaningfully. This is usually the highest-impact change. + + + If you have many features (100+), try filtering to the most informative ones. + + + Use `eval_metric` and `tuning_config` to optimize for your specific evaluation metric. + + + Experiment with different `PREPROCESS_TRANSFORMS` and target transforms. + + + Use the [HPO extension](/extensions/hpo) for automated search over the TabPFN hyperparameter space. + + + Use the [AutoTabPFN extension](/extensions/post-hoc-ensembles) for an automatically tuned ensemble of TabPFN models. Typically gives a few percent boost. + + + [Fine-tune](/capabilities/fine-tuning) the pretrained model on your data when you have a specialized domain or distribution shift. + + + +## Related + + + + Adapt TabPFN's pretrained weights to your domain. + + + Automated ensembling for maximum accuracy. + + + Bayesian optimization over TabPFN's hyperparameter space. + +