← Articles

Can a Foundation Model Replace XGBoost for Credit Risk Modeling?

Are foundation models for tabular data better than gradient boosting for credit risk? The results are clear. Full code and reproducible notebook: Trees vs TabPFN on Colab.

// The AI revolution that skipped tables

We spent years organizing the messy world into clean spreadsheets, just to watch the AI revolution work for the messy world instead.

I build credit risk models. My data lives and dies in a table. And I can tell you firsthand: an enormous share of the work isn’t modeling. It’s turning wildly heterogeneous information into structured features that traditional ML can digest.

So when a foundation model for tables finally showed up, I had to test it where it matters to me: on credit data.

// TabPFN: A Foundation Model for Tabular Data

It was about time. When you train XGBoost (current king of tables), you teach the algorithm to recognize patterns for that specific dataset. The model has never seen another dataset and it starts from zero.

Here comes TabPFN. A Transformer pre-trained on MILLIONS of different synthetic tabular datasets who asks the question: given these training rows AND their labels, what are the best labels for these new rows. Your target variable becomes what LLMs treat as the next token, a simple prediction.

TabPFN removes the need for preprocessing and hyperparameter optimization.
No more preprocessing text, NULLs or hyperparameter optimization.

At inference, TabPFN treats your data as context. The same way an LLM treats the prompt. It processes the whole thing and outputs its predictions. No preprocessing, no feature creation.

// Testing TabPFN on Credit Risk Data

For this experiment, I used Kaggle Credit Risk dataset.

Things I did:

TabPFN just got the data. No help from me:

clf = TabPFNClassifier()
clf.fit(X_train, y_train)

Compare that to the hundreds of lines of preprocessing code for tree-based models, and trees are already the least demanding algorithms when it comes to data preparation.

// Results: TabPFN vs XGBoost vs LightGBM

Each model was trained on the full training set and the holdout was 2,000 samples.

Five-fold cross-validation:

ModelROC AUCStd
TabPFN-2.50.8728± 0.0064
LightGBM0.8466± 0.0083
XGBoost0.8377± 0.0058

Holdout confirmation:

ModelROC AUCPR AUC
TabPFN-2.50.87260.7738
LightGBM0.84950.7398
XGBoost0.84690.7293

TabPFN won every metric. A 2-3 point AUC gap is significant in credit risk, where models are compared at the third decimal.

ROC curves on holdout set showing TabPFN-2.5 (AUC=0.8726) outperforming LightGBM (0.8495) and XGBoost (0.8469)
ROC curves on the holdout set. TabPFN-2.5 consistently dominates across all operating points.

The cost: TabPFN took 5-12 seconds per fold versus 0.07-0.30 seconds for trees. A 40-100x latency penalty.

// Why Calibration Matters More Than AUC in Credit Risk

Calibration lets the models output actual probabilities (if they’re well calibrated). In Credit Risk, you MUST care because if your model tells you a 15% probability of default you actually want to OBSERVE a 15% of default rate.

AUC is a beauty contest, it tells you the model can rank borrowers. But ranking isn’t enough. Pricing, loss provisioning, capital reserves, regulatory stress testing, all of these depend on the predicted probability of default being accurate, not just ordinally correct.

Calibration plot showing TabPFN-2.5 tracking closer to the perfect calibration diagonal than XGBoost and LightGBM
Calibration curves (quantile bins). TabPFN tracks the diagonal of a perfectly calibrated model more closely.

TabPFN delivers calibrated probabilities out of the box. No isotonic regression. No Platt scaling. No post-hoc fix. The probabilities are right because the model is doing Bayesian inference, averaging over possible explanations rather than committing to a single tree structure.

Observed default rate versus predicted probability of default by risk decile, showing TabPFN predictions align more closely with observed rates
Observed default rate vs. predicted PD by risk decile. TabPFN’s predictions align more tightly with actual outcomes across the risk spectrum.

Look at the predictions from the tree-based models. Constantly higher than the observed default rate making the models much more conservative and wildly uncalibrated.

// Where TabPFN Falls Short: Scale, and Latency

Before you replace all of your models, think about this:

And above all, this was ONE dataset. You should test TabPFN on your data, with your features, against your tuned pipeline.

// What TabPFN Means for Credit Risk Modeling

A foundation model just delivered strong credit risk performance with no domain knowledge, no feature engineering, and no hyperparameter tuning.

It’s not production-ready yet. Sample size limits and latency concerns are real.

Domain expertise is staying for now: translating model outputs into business decisions, satisfying model risk requirements, knowing which features leak; that is staying.