DSAIEngineering

[P12] Understanding tabular foundation models: time series forecasting with TabPFN

Mohit Saharan — Tue, 28 Apr 2026 13:20:33 GMT

1. Introduction

This post continues my series on tabular foundation models. So far, I have covered the basic vocabulary of tabular foundation models in P3, the posterior predictive distribution in P4, the architecture in P5, pre-training in P6, the TabPFN repository in P7, the hands-on demo’s classification and regression examples in P8, TabPFN Client in P9, TabPFN embeddings in P10, and TabPFN’s predictive behavior in P11.

For a new reader, the minimum background is this: TabPFN is a pretrained tabular foundation model. Unlike XGBoost or Random Forest, its ordinary .fit() call does not update model weights to learn a new model from scratch for the current dataset. Instead, .fit() prepares the labelled rows as context, and TabPFN uses that context to predict new rows. This is why I have repeatedly described TabPFN as a context-conditioned predictor rather than just another sklearn-like estimator.

Today I cover the time series forecasting section of the official TabPFN hands-on demo notebook. You can find my local version of the notebook here.

In P3, I left time series forecasting as “Later.” This post fills that gap. The interesting question is: if TabPFN is a tabular foundation model, how can it forecast a sequence? At first, this is not obvious because ordinary tabular models usually treat rows as examples in a table, while time series forecasting depends on the order of observations, seasonality, and future horizons. The answer is not that TabPFN suddenly becomes an ARIMA model, a recurrent neural network, or a native time-series transformer. The main idea in TabPFN-TS is to convert forecasting into a tabular regression problem.

This is where TabPFN-TS comes in. The notebook cites the work of Hoo et al., whose current arXiv paper is listed as From Tables to Time: Extending TabPFN-v2 to Time Series Forecasting. That work is the reference behind the TabPFN-TS workflow used in the notebook. The TabPFN-TS repository summarizes the workflow as:

Transform a time series into a table.
Extract temporal features and add them to the table.
Perform regression on the table using TabPFNv2.
Use the regression output as the time series forecast.

The post is organized as follows:

Introduction: why time series forecasting belongs in this TabPFN series.
Conceptual background: the forecasting vocabulary and the tabular-regression formulation.
Hands-on demo: loading the Chronos data, adding features, predicting, and reading the forecast plot.
Summary and conclusion: what this example shows and how to evaluate forecasts in practice.

2. Conceptual Background

Before going to the hands-on demo, I want to set up the concepts that make the example meaningful. This section does five things:

It defines the basic vocabulary of time series forecasting.
It explains how a sequence can be represented as a supervised tabular regression problem.
It separates what is standard supervised ML from what is specific to TabPFN-TS.
It explains why temporal features matter.
It connects point forecasts and quantile forecasts back to the predictive-distribution language from earlier posts.

2.1 Working Vocabulary

The key terms for this post are:

Time series: observations indexed by time, such as monthly tourism demand, hourly electricity load, daily sales, or sensor readings.
Forecast horizon: the future window we want to predict. In the notebook, prediction_length = 24, so the model predicts 24 future monthly values.
History/context window: the observed part of the time series that is available before the forecast starts.
Point forecast: a single predicted value for each future timestamp.
Probabilistic forecast: a forecast that describes uncertainty, often through quantiles.
Quantile forecast: a prediction for a chosen quantile level, such as the 0.1 or 0.9 quantile.
Covariates/features: extra columns known at prediction time, such as calendar features, holidays, weather, promotions, or a running time index.
Zero-shot forecasting: applying a pretrained model to a new forecasting problem without training a task-specific forecasting model from scratch.

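In ordinary supervised regression, we usually have a table:

\[
X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\]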

Each row \(x_i\) contains features, and \(y_i\) is the target value. A regressor learns or uses a mapping from rows to targets.

Here, \(X\) is the feature matrix, \(y\) is the target vector, \(n\) is the number of rows, and \(x_i^\top\) means that the feature vector for row \(i\) is written as a row vector. The superscript \(\top\) denotes transpose.

In time series forecasting, the data initially looks different. For one item, we observe:

\[
y_1, y_2, \ldots, y_T
\]

and want to predict:

\[
y_{T+1}, y_{T+2}, \ldots, y_{T+H}
\]

Here, \(T\) is the last observed time index, and \(H\) is the forecast horizon.

The key move in TabPFN-TS is to make the second problem look like the first problem.

2.2 Forecasting as Tabular Regression

Suppose we have multiple time series indexed by item \(i\). For item \(i\), let \(y_{i,t}\) be the observed value at time \(t\), and let \(T_i\) be the last observed time index available before forecasting starts. The forecasting task is to estimate future values:

\[
y_{i,T_i+1}, y_{i,T_i+2}, \ldots, y_{i,T_i+H}
\]

In the single-series notation above, the last observed index was \(T\). With multiple series, I write this as \(T_i\) because each item \(i\) may have its own last observed timestamp. Here, \(H\) is the forecast horizon, and \(h\) is the number of steps ahead from the end of the observed history for item \(i\). To use a tabular model, we build a feature vector for each item-time pair:

\[
x_{i,t} = g(t, c_{i,t})
\]

Here, \(g(\cdot)\) is the feature-construction function, and \(c_{i,t}\) stands for any covariates known for item \(i\) at time \(t\). The function turns time information and any known covariates into ordinary tabular columns. The training table contains rows where the target is known:

\[
\mathcal{D}_{\text{train}} = \{(x_{i,t}, y_{i,t}) : i \in \mathcal{I},\ 1 \le t \le T_i\}
\]

Here, \(\mathcal{I}\) is the set of item IDs included in the forecasting task. The future table contains rows where the target is unknown:

\[
X_{\text{future}} = \{x_{i,T_i+h} : i \in \mathcal{I},\ h = 1, \ldots, H\}
\]

Now the forecasting problem has become a tabular regression problem:

\[
\hat{y}_{i,T_i+h} = f(x_{i,T_i+h}), \qquad h = 1, \ldots, H
\]

Here, \(f\) is a prediction function, and \(\hat{y}_{i,T_i+h}\) is the predicted value for item \(i\) at forecast step \(h\).
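To make the construction concrete, here is a minimal sketch on toy data. The helper name `build_tables`, the dictionary `series_by_item`, and the column names are my own illustrative choices, not part of TabPFN-TS, and the feature function is reduced to the bare item ID and time index.

```python
import numpy as np
import pandas as pd


def build_tables(series_by_item, horizon):
    """Hypothetical helper: one row per (item, time) pair.

    Training rows carry the observed target; future rows carry NaN.
    """
    train_rows, future_rows = [], []
    for item_id, values in series_by_item.items():
        last_index = len(values)  # plays the role of T_i for this item
        for t, y in enumerate(values, start=1):
            train_rows.append({"item_id": item_id, "t": t, "target": y})
        for h in range(1, horizon + 1):
            future_rows.append(
                {"item_id": item_id, "t": last_index + h, "target": np.nan}
            )
    return pd.DataFrame(train_rows), pd.DataFrame(future_rows)


# Two toy items with different history lengths (T_A = 4, T_B = 3).
series_by_item = {"A": [10.0, 12.0, 11.0, 13.0], "B": [5.0, 6.0, 7.0]}
train_table, future_table = build_tables(series_by_item, horizon=2)

print(train_table)
print(future_table)
```

Each item gets its own future rows starting right after its own last observed index, which is why the notation uses \(T_i\) rather than a single \(T\).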

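For a classical supervised model, \(f\) would usually be a model fitted specifically to the current training table. For example, a supervised regressor might choose:

\[
\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{i \in \mathcal{I}} \sum_{t=1}^{T_i} \ell\big(f(x_{i,t}), y_{i,t}\big)
\]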

Here, \(\mathcal{F}\) is the model class, \(\ell\) is the regression loss, and \(\hat{f}\) is the fitted task-specific model. Random Forest, XGBoost, LightGBM, and CatBoost all differ in how they define and optimize \(\mathcal{F}\), but in this workflow they are still learning a fresh model from the current transformed table.

For TabPFN, the meaning is different. TabPFN is already pretrained, and the current training rows become context. Conceptually, the prediction is closer to:

\[
p\big(y_{i,T_i+h} \mid x_{i,T_i+h}, \mathcal{D}_{\text{train}}\big)
\]

Here, \(p(\cdot)\) denotes a predictive distribution over the future target value.

This is the same posterior-predictive-distribution viewpoint I discussed in P4 and reused in P11. The difference is that the row \(x_{i,T_i+h}\) now represents a future timestamp, not a generic tabular row.

This framing also explains why the time-series package can support point forecasts and probabilistic forecasts. If TabPFN produces a predictive distribution for the target at a future row, then the output can be summarized as a mean, median, or quantiles. This is a useful conceptual lens, not a guarantee that the output is perfectly calibrated for every dataset.

2.3 Standard Supervised ML vs What Is New Here

The conversion from a time series to a tabular regression problem is not unique to TabPFN. A practitioner could build the same kind of table and fit XGBoost, LightGBM, Random Forest, CatBoost, or a linear model on the generated rows. In that sense, the feature-engineering idea is a standard supervised-ML move.
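As a rough illustration of the standard supervised route, the sketch below fits an ordinary least-squares model on time-derived features of a synthetic seasonal series. Everything here is an assumption made for the example: the synthetic data, the period `P = 12`, the helper `make_features`, and the use of plain least squares as a stand-in for the tree models named above.

```python
import numpy as np

P = 12  # assumed seasonal period: monthly data with a yearly cycle
rng = np.random.default_rng(0)

# Synthetic series: linear trend + yearly seasonality + noise.
t_all = np.arange(72)
y_all = (
    0.5 * t_all
    + 10.0 * np.sin(2.0 * np.pi * t_all / P)
    + rng.normal(0.0, 1.0, size=t_all.size)
)


def make_features(t):
    """Hypothetical feature map: intercept, running index, sine, cosine."""
    t = np.asarray(t, dtype=float)
    return np.column_stack(
        [
            np.ones_like(t),
            t,
            np.sin(2.0 * np.pi * t / P),
            np.cos(2.0 * np.pi * t / P),
        ]
    )


# First 60 steps are the known-target rows; last 12 are the future rows.
t_train, t_future = t_all[:60], t_all[60:]
X_train, X_future = make_features(t_train), make_features(t_future)

# Fit a task-specific model from scratch on the training table.
coef, *_ = np.linalg.lstsq(X_train, y_all[:60], rcond=None)
y_hat = X_future @ coef

mae = np.mean(np.abs(y_all[60:] - y_hat))
print(f"MAE on the held-out 12 steps: {mae:.2f}")
```

The table-building step is the same one TabPFN-TS relies on. Only the last step differs: here the coefficients are estimated from the current table, whereas TabPFN would use the same rows as context for a pretrained model.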

What is new in the TabPFN-TS workflow is the model used after the transformation. Instead of training and tuning a new forecasting model from scratch, TabPFN-TS uses a pretrained tabular foundation model as the regression engine. The training rows act as context, the future rows act as queries, and the model returns point and quantile predictions through the time-series wrapper.

So the split is:

Standard supervised ML part: turn timestamps into rows, create temporal features, define known-target training rows and unknown-target future rows.
TabPFN-specific part: use a pretrained, context-conditioned tabular model instead of fitting a task-specific model from scratch.
TabPFN-TS convenience: return both point forecasts and quantile forecasts through one forecasting interface.

There is an important caveat. TabPFN-TS relies on temporal featurization; TabPFN is not modeling sequence order natively in the same way as a dedicated sequence model. The sequence structure becomes available to the model through columns such as running index, calendar features, seasonal features, and known covariates.

2.4 Why Temporal Features Matter

If we only create rows without useful time-derived features, a tabular model has no direct way to know that January 1980 and January 1981 are related, or that December and January are adjacent months. This is why temporal feature engineering is central to the TabPFN-TS workflow.

The notebook uses three feature groups:

selected_features = [
    RunningIndexFeature(),
    CalendarFeature(),
    AutoSeasonalFeature(),
]

The running index gives each timestamp an ordered numeric position within each item. If item \(i\) has \(n_i\) observed rows, the running index over the observed history is:

\[
0, 1, 2, \ldots, n_i - 1
\]

This \(n_i\) counts rows in the observed history; it is separate from \(T_i\), which denotes the last observed time index used in the forecasting equations. The running index helps the model see trend-like behavior. Calendar features encode timestamp information such as year, month, day of week, and similar components. Seasonal features encode repeated patterns. A standard way to encode cyclic seasonality is:

\[
\sin\!\left(\frac{2\pi t}{P}\right), \qquad \cos\!\left(\frac{2\pi t}{P}\right)
\]

where \(P\) is the period. For monthly data with annual seasonality, \(P=12\). Using both sine and cosine is useful because it represents the cycle on a circle. This avoids treating the end of a period and the beginning of the next period as far apart.
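A small numerical check makes the circle argument tangible. This is only a sketch: the helper `cyclic_encode` is hypothetical, and I use the calendar month number as the time index, which is not necessarily how the library's seasonal features index time.

```python
import numpy as np


def cyclic_encode(t, period=12):
    """Map a time index to a point (sin, cos) on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(t, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


jan, jul, dec = cyclic_encode(1), cyclic_encode(7), cyclic_encode(12)

# As raw integers, December (12) and January (1) are 11 units apart.
raw_gap_dec_jan = abs(12 - 1)

# On the circle, they are neighbours, while January and July are opposite.
dist_dec_jan = np.linalg.norm(dec - jan)
dist_jan_jul = np.linalg.norm(jan - jul)

print(f"raw gap Dec-Jan:      {raw_gap_dec_jan}")
print(f"circle dist Dec-Jan:  {dist_dec_jan:.3f}")
print(f"circle dist Jan-Jul:  {dist_jan_jul:.3f}")
```

With a raw month number, a tabular model would see December and January as the two most distant months. With the sine/cosine pair, they are as close as any other pair of adjacent months.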

In the notebook output, the transformed table contains columns such as:

target, running_index, year, second_of_minute_sin, second_of_minute_cos, ...,
sin_#0, cos_#0, sin_#1, cos_#1, sin_#2, cos_#2

The target column is known in the training rows and missing in the future rows. The time-derived features are known for both training and future rows. That is exactly what forecasting needs: at prediction time, we do not know the future target, but we do know the future timestamps.

2.5 Point Forecasts, Quantiles, and Coverage

For item \(i\) and forecast step \(h\), the future row is \(x_{i,T_i+h}\), and the random future target is \(Y_{i,T_i+h}\). A point forecast gives one value:

\[
\hat{y}_{i,T_i+h}
\]

A probabilistic forecast gives more information. The conditional cumulative distribution function is:

\[
F_{i,h}(y) = \mathbb{P}\big(Y_{i,T_i+h} \le y \mid x_{i,T_i+h}, \mathcal{D}_{\text{train}}\big)
\]

Here, \(\mathbb{P}\) denotes probability, and \(F_{i,h}(y)\) is the probability that the future target is less than or equal to the candidate value \(y\), given the future row and the training context.

The \(\alpha\)-quantile is:

\[
q_{i,h}(\alpha) = \inf\{y : F_{i,h}(y) \ge \alpha\}
\]

Here, \(\alpha\) is a quantile level between 0 and 1, and \(\inf\) means the infimum: the smallest value, or limiting lower bound, where the cumulative probability reaches at least \(\alpha\).
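The infimum definition can be tried out on a finite sample, where the cumulative distribution function is a step function. The helper `empirical_quantile` below is a hypothetical illustration of the definition, not how TabPFN-TS computes its quantile columns.

```python
import numpy as np


def empirical_quantile(samples, alpha):
    """Smallest sample value y whose empirical CDF satisfies F(y) >= alpha."""
    ys = np.sort(np.asarray(samples, dtype=float))
    # Empirical CDF evaluated at each sorted sample: 1/n, 2/n, ..., 1.
    cdf = np.arange(1, ys.size + 1) / ys.size
    # First position where the CDF reaches at least alpha.
    idx = np.searchsorted(cdf, alpha, side="left")
    return ys[idx]


samples = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

q10 = empirical_quantile(samples, 0.1)
q50 = empirical_quantile(samples, 0.5)
q90 = empirical_quantile(samples, 0.9)

print(f"0.1 quantile: {q10}")
print(f"0.5 quantile: {q50}")
print(f"0.9 quantile: {q90}")
```

For a finite sample the infimum is always attained at one of the observed values, so "smallest value" and "limiting lower bound" coincide here.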

For example, the interval:

\[
\big[\,q_{i,h}(0.1),\ q_{i,h}(0.9)\,\big]
\]

is an 80% central prediction interval. In the demo, TabPFN-TS returns the point forecast and quantile columns from 0.1 to 0.9.

As in yesterday’s post, quantile intervals should not be treated as automatic guarantees. They need to be checked on held-out data. Let the held-out future points be indexed by \((i_j,h_j)\) for \(j=1,\ldots,m\), where \(m\) is the number of held-out item-horizon pairs being evaluated. The empirical 80% coverage is:

\[
\widehat{\text{coverage}}_{0.8} = \frac{1}{m} \sum_{j=1}^{m} \mathbf{1}\big\{ q_{i_j,h_j}(0.1) \le y_{i_j,T_{i_j}+h_j} \le q_{i_j,h_j}(0.9) \big\}
\]

Here, \(\mathbf{1}\{\cdot\}\) is the indicator function: it equals 1 when the condition is true and 0 otherwise.

For context, when quantile models are trained directly, a common loss is the pinball loss. For quantile level \(\alpha\), true value \(y\), and quantile prediction \(q\), it is:

\[
L_\alpha(y, q) = \max\big(\alpha\,(y - q),\ (\alpha - 1)\,(y - q)\big)
\]

This loss penalizes under-prediction and over-prediction asymmetrically, which is exactly what is needed for quantile estimation.
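The asymmetry is easy to see numerically. The function below is a minimal sketch of the standard pinball loss in NumPy; the name `pinball_loss` and the toy numbers are my own.

```python
import numpy as np


def pinball_loss(y_true, q_pred, alpha):
    """Mean pinball (quantile) loss for quantile level alpha."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(q_pred, dtype=float)
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))


alpha = 0.9
y_true = [10.0]

# Quantile prediction 2 units below the true value (under-prediction).
loss_under = pinball_loss(y_true, [8.0], alpha)

# Quantile prediction 2 units above the true value (over-prediction).
loss_over = pinball_loss(y_true, [12.0], alpha)

print(f"under-prediction loss: {loss_under:.2f}")
print(f"over-prediction loss:  {loss_over:.2f}")
```

For a high quantile level such as 0.9, predicting too low is penalized much more heavily than predicting too high by the same amount, which pushes the optimal prediction toward the upper part of the distribution.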

The coverage calculation answers a different question from the pinball loss. If the empirical coverage value is close to 0.8, the interval is roughly calibrated on that held-out sample. If it is much lower, the forecast intervals are overconfident. If it is much higher, the intervals may be too wide to be useful.

3. Hands-on Demo

The conceptual background gave us the main objects: a time-indexed sequence, a transformed tabular representation, a future table with unknown targets, and point/quantile forecasts. Now I use the notebook to walk through the time-series example.

The mental model for the demo is:

Training rows: past timestamps with known target values.
Future rows: future timestamps with target = NaN.
Features: running index, calendar features, seasonal features, and any known covariate columns.
Output: point forecast plus quantile columns for each future row.

The full notebook contains the setup code and imports. Below, I show the parts that matter for understanding the workflow.

3.1 Loading the Time Series Data

The demo uses a dataset from the Chronos datasets collection on Hugging Face. To keep the example small, it uses only two time series from monash_tourism_monthly.

dataset_metadata = {
    "monash_tourism_monthly": {"prediction_length": 24},
    "m4_hourly": {"prediction_length": 48},
}

dataset_choice = "monash_tourism_monthly"
num_time_series_subset = 2

The notebook then loads the dataset, converts it into a TimeSeriesDataFrame, keeps only two item IDs, and creates a train/test split. The last 24 months are held out as the future window.

from datasets import load_dataset
from tabpfn_time_series import TimeSeriesDataFrame
from tabpfn_time_series.data_preparation import generate_test_X, to_gluonts_univariate

prediction_length = dataset_metadata[dataset_choice]["prediction_length"]
dataset = load_dataset("autogluon/chronos_datasets", dataset_choice)

tsdf = TimeSeriesDataFrame(to_gluonts_univariate(dataset["train"]))
tsdf = tsdf[
    tsdf.index.get_level_values("item_id").isin(tsdf.item_ids[:num_time_series_subset])
]

train_tsdf, test_tsdf_ground_truth = tsdf.train_test_split(
    prediction_length=prediction_length
)
test_tsdf = generate_test_X(train_tsdf, prediction_length)

The first important object is train_tsdf: the observed history. The second is test_tsdf_ground_truth: the future values that we hide from the model but keep for evaluation. The third is test_tsdf: the future table that contains the timestamps where predictions are needed. The function generate_test_X creates those future timestamp rows for the forecast horizon, with unknown targets.
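To show what "future timestamp rows with unknown targets" means in practice, here is a rough pandas sketch. It is not the implementation of generate_test_X. The helper `make_future_rows`, the flat column layout, and the month-end frequency are assumptions chosen to match the monthly tourism example.

```python
import numpy as np
import pandas as pd


def make_future_rows(train_df, prediction_length):
    """Hypothetical helper: append-ready future rows per item, target = NaN."""
    frames = []
    for item_id, group in train_df.groupby("item_id"):
        last_ts = group["timestamp"].max()
        # Month-end timestamps that follow the last observed one.
        future_ts = pd.date_range(
            start=last_ts + pd.offsets.MonthEnd(1),
            periods=prediction_length,
            freq=pd.offsets.MonthEnd(),
        )
        frames.append(
            pd.DataFrame(
                {"item_id": item_id, "timestamp": future_ts, "target": np.nan}
            )
        )
    return pd.concat(frames, ignore_index=True)


# Toy history for one item ending in July 1992.
train_df = pd.DataFrame(
    {
        "item_id": [0, 0, 0],
        "timestamp": pd.to_datetime(["1992-05-31", "1992-06-30", "1992-07-31"]),
        "target": [1.0, 2.0, 3.0],
    }
)

future_df = make_future_rows(train_df, prediction_length=2)
print(future_df)
```

The future timestamps are fully determined by the history and the horizon, so every time-derived feature can be computed for these rows even though the target is missing.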

The following plot shows the two tourism series and the train/test split.

Time series train/test split.

Both series show strong yearly seasonality. The vertical dashed red line marks the point where the training history ends and the held-out future window starts. Since the forecast horizon is 24 months, the model is asked to forecast two full seasonal cycles.

3.2 Adding Time Features

The next step is the most important conceptual step in the demo. The raw time series is transformed into a tabular regression problem by adding features.

from tabpfn_time_series import FeatureTransformer
from tabpfn_time_series.features import (
    AutoSeasonalFeature,
    CalendarFeature,
    RunningIndexFeature,
)

selected_features = [
    RunningIndexFeature(),
    CalendarFeature(),
    AutoSeasonalFeature(),
]

feature_transformer = FeatureTransformer(selected_features)

train_tsdf, test_tsdf = feature_transformer.transform(train_tsdf, test_tsdf)

After this transformation, the training table has a known target column and many feature columns. The future table has the same feature columns, but the target column is missing:

item_id  timestamp    target     running_index    year    ...    sin_#0    cos_#0
0        1979-01-31   1149.8700  0                1979    ...    0.0000    1.0000
0        1979-02-28   1053.8002  1                1979    ...    0.5000    0.8660
...
0        1992-08-31   NaN        163              1992    ...   -0.5000   -0.8660

This is the point where forecasting becomes tabular. The rows with known targets form the context. The rows with unknown targets form the query set.

3.3 Predicting with TabPFN-TS

In my run, I used local mode, which runs TabPFN on my local GPU, instead of client mode, which uses GPUs hosted in Prior Labs’ cloud:

from tabpfn_time_series import TabPFNMode, TabPFNTimeSeriesPredictor

predictor = TabPFNTimeSeriesPredictor(
    tabpfn_mode=TabPFNMode.LOCAL,
)

pred = predictor.predict(train_tsdf, test_tsdf)

The output pred is again indexed by item_id and timestamp. It contains a point forecast in the target column and quantile forecasts in columns such as 0.1, 0.2, ..., 0.9.

                         target          0.1          0.2  ...          0.8          0.9
item_id timestamp
0       1992-08-31  6632.519531  6147.241211  6321.268066  ...  6938.606445  7118.754395
        1992-09-30  4159.460938  3881.989502  3977.088379  ...  4355.097656  4479.076172
        1992-10-31  3012.987549  2780.682861  2859.992432  ...  3172.838623  3264.242920

This output format is useful because it gives both a central forecast and uncertainty bands without first setting up a separate conformal wrapper or separately trained quantile model.

3.4 Visualizing the Forecast

The notebook visualizes the history, the held-out future values, the TabPFN-TS point forecast, and the 0.1 to 0.9 quantile band.

from tabpfn_time_series.plot import plot_pred_and_actual_ts

plot_pred_and_actual_ts(
    train=train_tsdf,
    test=test_tsdf_ground_truth,
    pred=pred,
)

Time series forecast.

The blue curve is the observed history. The purple curve is the held-out future, which is available only because this is a demo. The red curve is the TabPFN-TS forecast. The shaded red region is the 0.1 to 0.9 quantile interval.

The forecast captures the most obvious structure in both series: strong annual seasonality, sharp yearly peaks, and a recurring drop after the peak. This is exactly where the feature transformation matters. The model is not seeing a raw sequence alone; it is seeing a tabular representation that exposes time position and seasonal phase.

The forecast is not perfect. For example, the sharpness and height of some future peaks are difficult to match exactly. That is expected because monthly tourism demand is not deterministic. The useful question is not whether every point lands exactly on the future curve. The useful question is whether the model captures the seasonal structure, gives sensible point forecasts, and expresses uncertainty that is reasonable for the held-out window.

3.5 Evaluating Forecasts in Practice

The demo is useful as a first look, but a real forecasting workflow would need numerical evaluation. At minimum, a practitioner should compute point forecast errors and quantile coverage on the held-out window.

import numpy as np

# Align held-out actuals with predictions on the shared (item_id, timestamp) index.
eval_df = test_tsdf_ground_truth[["target"]].rename(
    columns={"target": "actual"}
).join(
    pred.rename(columns={"target": "forecast"})
)

# Handle quantile column labels stored either as floats or as strings.
q10_col = 0.1 if 0.1 in eval_df.columns else "0.1"
q90_col = 0.9 if 0.9 in eval_df.columns else "0.9"

# Point forecast errors.
mae = (eval_df["actual"] - eval_df["forecast"]).abs().mean()
rmse = np.sqrt(((eval_df["actual"] - eval_df["forecast"]) ** 2).mean())

# Fraction of held-out values inside the 0.1 to 0.9 quantile interval.
coverage_80 = (
    (eval_df["actual"] >= eval_df[q10_col])
    & (eval_df["actual"] <= eval_df[q90_col])
).mean()

print(f"MAE: {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"80% interval coverage: {coverage_80:.3f}")

For a more complete evaluation, it is also useful to break the errors down by item ID and forecast horizon. In forecasting, average error can hide important behavior. A model may be good for short horizons but weak for longer horizons, or good for one item but poor for another.

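# Sort by time within each item so that cumcount() yields the forecast step.
eval_df = eval_df.reset_index().sort_values(["item_id", "timestamp"])
eval_df["horizon"] = eval_df.groupby("item_id").cumcount() + 1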
eval_df["absolute_error"] = (eval_df["actual"] - eval_df["forecast"]).abs()

display(
    eval_df.groupby("horizon")["absolute_error"]
    .mean()
    .rename("MAE by horizon")
)
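The same breakdown can be done per item. The sketch below uses a small hand-made table so that it runs on its own; the numbers are invented for illustration, and the column names `q10` and `q90` are placeholders for the quantile columns in the real output.

```python
import pandas as pd

# Toy evaluation table: two items, two forecast steps each (made-up values).
toy_eval = pd.DataFrame(
    {
        "item_id": [0, 0, 1, 1],
        "actual": [10.0, 12.0, 100.0, 90.0],
        "forecast": [11.0, 12.0, 95.0, 86.0],
        "q10": [9.0, 11.0, 90.0, 82.0],
        "q90": [12.0, 13.0, 99.0, 91.0],
    }
)

toy_eval["absolute_error"] = (toy_eval["actual"] - toy_eval["forecast"]).abs()
# True when the actual value lies inside the 0.1 to 0.9 quantile interval.
toy_eval["covered"] = toy_eval["actual"].between(toy_eval["q10"], toy_eval["q90"])

per_item = toy_eval.groupby("item_id").agg(
    mae=("absolute_error", "mean"),
    coverage_80=("covered", "mean"),
)
print(per_item)
```

In this toy table the pooled numbers would hide that the second item has both larger errors and lower interval coverage than the first.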

4. Summary and Conclusion

In this post, I used the time series forecasting section of the official TabPFN hands-on demo to understand how TabPFN can be applied outside ordinary static tabular prediction.

The conceptual section made the key step explicit: TabPFN-TS frames univariate time series forecasting as tabular regression. This transformation is not unique to TabPFN; supervised ML models such as XGBoost, LightGBM, Random Forest, and CatBoost can also use time-derived tabular features. What changes with TabPFN-TS is the regression engine: a pretrained, context-conditioned tabular foundation model is used instead of fitting a new task-specific model from scratch.

Operationally, the observed history becomes rows with known targets. Future timestamps become rows with missing targets. Running-index, calendar, and seasonal features give the tabular model information about trend, time position, and cyclic structure.

The hands-on demo then showed this idea in code. We loaded two monthly tourism time series from the Chronos datasets collection, held out the last 24 months, added temporal features, predicted with TabPFNTimeSeriesPredictor, and visualized both point forecasts and 0.1 to 0.9 quantile intervals.

The main takeaway is that TabPFN is not being used as a native sequence model here. The bridge is feature engineering. Once time is represented as tabular features, TabPFNv2 can be used as a zero-shot tabular regressor for future timestamps.

This makes the workflow conceptually simple and practically interesting. It also creates clear evaluation questions: how accurate are the point forecasts, how well calibrated are the quantile intervals, and how does the model behave across different horizons, seasonalities, and item IDs?

With this post, I have covered another remaining section of the official TabPFN hands-on demo. In the upcoming posts, I will continue exploring the parts of the TabPFN ecosystem that can translate into useful workflows for real tabular and time-dependent data problems. As I continue this series, I welcome feedback and requests from readers: what did you find most useful in this post, and which aspects of tabular foundation models should I explore next?

[P11] Understanding tabular foundation models: predictive behavior of TabPFN

Mohit Saharan — Mon, 27 Apr 2026 20:14:33 GMT

For a new reader, the minimum background is this: TabPFN is a pretrained tabular foundation model. Unlike XGBoost or Random Forest, its ordinary .fit() call does not update model weights to learn a fresh model from scratch. Instead, .fit() prepares the labelled rows as context for the current task, and TabPFN uses that context to predict new rows. For more theory, P3, P4, P5, and P6 are the better starting points; this post gives only the background needed for today’s topic.

Today I cover the predictive behavior section of the official TabPFN hands-on demo notebook. You can find my version of the notebook here in my GitHub repository.

Model scores such as ROC AUC, RMSE, and \(R^2\) tell us how models perform on average. In this post, I use the TabPFN demo to ask a more diagnostic question: how do TabPFN, Random Forest, and XGBoost behave across the input space? We will inspect probability surfaces, regression curves, and quantile intervals to see which behaviors are standard supervised-ML diagnostics and which reflect TabPFN's pretrained, context-conditioned workflow.

The post is organized as follows:

Conceptual background: the terms and equations needed to understand the examples.
- Working vocabulary.
- Supervised learning vs TabPFN, mathematically.
- Mathematical objects we will inspect.
- Classical diagnostics vs TabPFN’s workflow.
Hands-on demo: three examples from the notebook.
- Classification decision boundaries.
- Regression curve fitting.
- Regression uncertainty with quantiles.
Summary and conclusion: the main takeaways and what comes next.
- What the conceptual background prepared us to inspect.
- What the examples demonstrated.
- What changes when the same diagnostics are applied to TabPFN.

1. Conceptual Background

Before going to the hands-on demo, I want to set up the concepts that make the examples meaningful. This section does three things:

It defines the vocabulary used in the post.
It connects TabPFN’s prediction step to the posterior predictive distribution from P4.
It defines the mathematical objects we will inspect in the demo: classification probability surfaces, regression mean functions, quantiles, and interval coverage.

1.1 Working Vocabulary

The key terms for this post are:

Tabular foundation model: a pretrained model designed to work across many tabular prediction tasks.
In-context learning: the model uses labelled rows as context for the current task instead of updating pretrained weights in the usual task-specific training loop.
Predictive behavior: the shape and reliability of the model’s predictions, not just the final score.
Decision boundary: in classification, the region where the model switches from one class to another.
Quantiles: values below which specified fractions of a predictive distribution fall, useful for uncertainty.

The familiar supervised ML workflow is:

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
preds = model.predict(X_test)

Here, .fit() learns task-specific trees from the dataset.

TabPFN uses a similar interface:

from tabpfn import TabPFNClassifier

model = TabPFNClassifier()
model.fit(X_train, y_train)
preds = model.predict(X_test)

But the meaning is different. TabPFN is already pretrained. In the standard prediction workflow, .fit() validates the data and prepares preprocessing, caching, and task context. Conceptually, the training rows and labels become context for the current task:

\[
D = (X_\text{train}, y_\text{train})
\]

The test rows are queries:

\[
x_\text{new} \in X_\text{test}
\]

For one query row \(x_\text{new}\), TabPFN conceptually predicts the posterior predictive distribution I discussed in P4:

\[
p(y_\text{new} \mid x_\text{new}, D)
\]

Here, \(p(\cdot)\) means a predictive probability distribution: class probabilities for classification, and a distribution over possible target values for regression.

For this post, the important point is practical. Classical supervised models can also provide more than hard class labels: classifiers such as Random Forest, XGBoost, and CatBoost can return class probabilities, and specialized methods can return uncertainty estimates or quantiles. So the diagnostic workflow itself is not unique to TabPFN. What is different here is that TabPFN produces its predictions by conditioning a pretrained tabular model on the current dataset, and TabPFN regression can expose distributional summaries such as quantiles directly through the prediction API.

1.2 Supervised learning vs TabPFN, mathematically

The code later in the post uses familiar sklearn-style calls such as .fit() and .predict(). To avoid treating TabPFN as just another tree model, this subsection makes the difference explicit. First, I describe the usual supervised-learning abstraction. Then I connect TabPFN back to the posterior predictive distribution from P4.

Let the labelled dataset be:

\[
D = \{(x_i, y_i)\}_{i=1}^{n}
\]

Here, \(D\) is the current training dataset, \(x_i\) is the feature vector for row \(i\), \(y_i\) is its target value or class label, and \(n\) is the number of labelled rows. I use lowercase \(x\) and \(y\) for concrete feature and target values. Later, uppercase \(X\) and \(Y\) refer to random variables.

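As a useful abstraction, ordinary supervised learning usually chooses a function class \(\mathcal{F}\) and fits a task-specific model by minimizing an empirical loss:

\[
\hat{f} = \arg\min_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \lambda\, \Omega(f)
\]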

Here, \(f\) is a candidate prediction function, \(\ell\) is a loss function, \(\Omega(f)\) is a regularization term, \(\lambda \geq 0\) controls the strength of regularization, and \(\hat{f}\) is the model learned specifically for this dataset. The notation \(\arg\min\) means “choose the function that makes this objective as small as possible.” Tree-based models such as Random Forest, XGBoost, and CatBoost differ in how they define \(\mathcal{F}\), how they optimize the model, and how they regularize it, but the basic idea is still dataset-specific fitting.

TabPFN is conceptually different. Following the terminology I used in P4, we can insert a latent task variable \(\phi\) into the posterior predictive distribution. Here, \(\phi\) represents the underlying supervised machine learning task: the feature-target relationship, the noise pattern, and other task-level assumptions that determine how data is generated.

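For a new row \(x_\text{new}\), the posterior predictive distribution can be written as:

\[
p(y_\text{new} \mid x_\text{new}, D) = \int p(y_\text{new} \mid x_\text{new}, \phi)\, p(\phi \mid D)\, d\phi
\]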

The first term does not include \(D\) because, after conditioning on the latent task \(\phi\), the task itself is assumed to contain the information needed to describe how \(x_\text{new}\) maps to \(y_\text{new}\). The integral means that the prediction averages over possible tasks \(\phi\), weighted by how plausible each task is after seeing \(D\). If the set of possible tasks were discrete, this would look like a weighted sum instead of an integral.

Using Bayes’ rule:

\[
p(\phi \mid D) = \frac{p(D \mid \phi)\, p(\phi)}{p(D)}
\]

The prior \(p(\phi)\) represents assumptions about what kinds of tabular tasks are likely. The likelihood \(p(D|\phi)\) says how likely the current dataset is under task \(\phi\). The posterior \(p(\phi|D)\) says which tasks remain plausible after seeing the current dataset. The denominator \(p(D)\) is a normalizing constant.

This gives the same intuition as P4: we are averaging predictions across possible latent tasks, weighted by how plausible those tasks are after seeing the context dataset.
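The discrete case mentioned above can be worked through in a few lines. This is a deliberately tiny toy, not a model of TabPFN: the "tasks" are three candidate coin biases, there are no input features, and all numbers are assumptions chosen for the illustration.

```python
import numpy as np

# Three candidate tasks phi: each is a possible probability of heads.
phis = np.array([0.2, 0.5, 0.8])
prior = np.array([1.0, 1.0, 1.0]) / 3.0  # uniform prior p(phi)

# Context dataset D: 3 heads and 1 tail.
heads, tails = 3, 1

# Likelihood p(D | phi) for each candidate task.
likelihood = phis**heads * (1.0 - phis) ** tails

# Bayes' rule: posterior p(phi | D), normalized by p(D).
evidence = np.sum(prior * likelihood)
posterior = prior * likelihood / evidence

# Posterior predictive: weighted sum over tasks instead of an integral.
predictive_heads = np.sum(phis * posterior)

print("posterior over tasks:", np.round(posterior, 3))
print(f"p(next = heads | D): {predictive_heads:.3f}")
```

The prediction is not the output of any single task. It is an average over all three, with the tasks that explain the context best receiving the most weight.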

TabPFN does not explicitly compute this integral at prediction time. Instead, it is pretrained on many synthetic tasks so that the neural network amortizes this inference. In this context, amortization means that much of the work of learning how to infer tasks has already happened during pretraining; at prediction time, the model uses \(D\) as context and \(x_\text{new}\) as a query to directly output predictions that approximate this posterior predictive behavior.

This is a useful theoretical lens, but it should not be read as a guarantee that TabPFN is perfectly Bayesian for every real dataset. It is an amortized approximation learned from the task distribution used during pretraining. The closer a real task is to the kinds of tasks represented by that prior, the more useful we should expect this behavior to be.

This is the mathematical reason why the same sklearn-looking code can mean different things:

model.fit(X_train, y_train)

For XGBoost, this learns task-specific trees. For TabPFN, this prepares preprocessing, cache state, and task context for a pretrained model.

1.3 Mathematical objects we will inspect

The introduction named three diagnostic views: probability surfaces, regression curves, and quantile intervals. This subsection defines the mathematical objects behind those views so that the hands-on demo is easier to interpret.

I will use three handles to make predictive behavior visible:

A classification probability surface.
A regression mean function.
Regression quantiles and interval coverage.

The next few equations define these objects before we use them in the notebook examples.

For binary classification, let \(X\) be the random feature vector and \(Y\) be the random class label. The notation \(\mathbb{P}\) means probability. For a specific input value \(x\), define:

\[
\eta(x) = \mathbb{P}(Y = 1 \mid X = x, D)
\]

Here, \(\eta(x)\) is the model’s estimated probability of the positive class given the training dataset \(D\). The decision boundary at threshold \(\tau\) is:

\[
\{x : \eta(x) = \tau\}
\]

For the usual threshold \(\tau=0.5\), the boundary is where a 0.5-threshold decision rule switches between classes. Looking at \(\eta(x)\), not only the predicted class, tells us how the model’s positive-class probability changes across the feature space.

For regression, we shift from class probabilities to a distribution over possible numeric target values. Its conditional cumulative distribution function is:

\[ F_x(y) = \mathbb{P}(Y \le y \mid X = x, D) \]

Here, \(F_x(y)\) is the probability that the target \(Y\) is less than or equal to the candidate value \(y\), given input \(x\) and context \(D\). From this distribution, we can define the predictive mean, where \(\mathbb{E}\) means expectation:

\[ \mu(x) = \mathbb{E}[Y \mid X = x, D] \]

and the \(\alpha\)-quantile, where \(\alpha\) is a probability level between 0 and 1:

\[ Q_\alpha(x) = \inf\{y : F_x(y) \ge \alpha\} \]

The notation \(\inf\) means the infimum: the leftmost value, or limiting lower bound, where the cumulative probability reaches at least \(\alpha\).
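
The infimum definition is easy to check on a small discrete distribution. The sketch below is my own illustration with made-up target values and probabilities; it is not how TabPFN computes quantiles internally.

```python
# Quantile of a discrete predictive distribution via the infimum definition:
# Q_alpha is the smallest y whose cumulative probability reaches at least alpha.

def quantile(values, probs, alpha):
    cumulative = 0.0
    for y, p in sorted(zip(values, probs)):
        cumulative += p
        if cumulative >= alpha - 1e-12:   # small tolerance for float round-off
            return y
    return max(values)


values = [1.0, 2.0, 3.0, 4.0]     # hypothetical target values
probs = [0.1, 0.4, 0.3, 0.2]      # their predictive probabilities (sum to 1)

q10 = quantile(values, probs, 0.1)
q50 = quantile(values, probs, 0.5)
q90 = quantile(values, probs, 0.9)
print(q10, q50, q90)
```

The cumulative probabilities are 0.1, 0.5, 0.8, and 1.0, so the 0.9-quantile is the first value where the running total reaches 0.9, which is the last support point.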

This is useful because the notebook asks TabPFN for quantiles. Once we have quantiles, we can form prediction intervals. For example, an 80% central prediction interval can be written as:

\[ [Q_{0.1}(x),\; Q_{0.9}(x)] \]

This interval is useful only if it is calibrated. For a well-calibrated central 80% interval, on a held-out dataset \(\{(x_j, y_j)\}_{j=1}^m\), where \(m\) is the number of held-out rows, we would expect:

\[ \frac{1}{m} \sum_{j=1}^{m} \mathbf{1}\{Q_{0.1}(x_j) \le y_j \le Q_{0.9}(x_j)\} \approx 0.8 \]

This equation is the mathematical version of the interval coverage check used later in the post.

The indicator \(\mathbf{1}\{\cdot\}\) equals 1 when the condition inside the braces is true and 0 otherwise. So the average counts the fraction of held-out targets that fall inside the predicted interval.

1.4 Classical diagnostics vs TabPFN’s workflow

The diagnostics in this post are familiar supervised ML tools. What changes is not the diagnostic itself, but the source of the predictions being diagnosed: TabPFN is pretrained and conditions on the current dataset as context.

Probability surfaces and decision boundaries: standard supervised ML can visualize any probabilistic classifier with predict_proba in 2D. TabPFN adds a probability surface produced by conditioning a pretrained model on the current context dataset.
Smooth regression curves: splines, Gaussian processes, neural networks, and tuned boosting workflows can produce smooth predictions. TabPFN may produce a smooth-looking mean function from a small context dataset without building a custom smooth model.
Quantile predictions and intervals: quantile regression, conformal prediction, Bayesian models, and ensembles can provide intervals. TabPFN regression can expose distributional summaries such as quantiles directly through the prediction API.
Meaning of .fit(): classical models usually fit task-specific parameters from the current dataset. Ordinary TabPFN .fit() prepares preprocessing, cache state, and context for a pretrained model whose weights are not updated.

So the point is not that TabPFN invented these diagnostics. The point is to apply familiar diagnostics to TabPFN and ask whether its pretrained, context-conditioned workflow gives useful behavior with less task-specific tuning.

2. Hands-on Demo: Inspecting Predictive Behavior

The conceptual background gave us the objects to inspect: a probability surface, a mean function, and quantile intervals. Now I use the notebook to inspect those objects directly for TabPFN, Random Forest, and XGBoost.

The notebook section has three examples:

Classification decision boundaries.
Regression curve fitting.
Regression uncertainty with quantiles.

The full notebook contains the helper functions and plotting code. Below, I show the parts that matter for understanding the workflow.

2.1 Classification decision boundaries

The first example creates a binary classification dataset made of concentric circles. This is useful for studying predictive behavior because the correct class transition is nonlinear and easy to inspect visually.

The important plotting choice is response_method="predict_proba". I do not only want to know which class the model predicts; I want to see the probability surface.

Mathematically, the plot visualizes an estimate of:

\[ \eta(x) = \mathbb{P}(Y = 1 \mid X = x, D) \]

over a grid of \(x\)-values. The color transition region corresponds to the decision boundary \(\mathcal{B}_{0.5}\).

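from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import DecisionBoundaryDisplay
from tabpfn import TabPFNClassifier
from xgboost import XGBClassifier

# generate_circle_data is a helper defined in the notebook.
X_train, y_train = generate_circle_data(
    num_points_per_circle=[50, 100, 200],
    radii=[1, 2, 4],
    noise_factor=0.1,
)

# Fit all three models on the first two features so the surface can be drawn in 2D.
rf = RandomForestClassifier().fit(X_train[:, :2], y_train)
xgb = XGBClassifier().fit(X_train[:, :2], y_train)
tabpfn = TabPFNClassifier().fit(X_train[:, :2], y_train)

# Plot class probabilities rather than hard class labels.
DecisionBoundaryDisplay.from_estimator(
    tabpfn,
    X_train[:, :2],
    response_method="predict_proba",
    grid_resolution=50,
)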

The notebook repeats this plotting workflow for Random Forest, XGBoost, and TabPFN. I show the TabPFN call here because the important detail is the use of predict_proba to visualize the probability surface.

Decision boundary comparison.

Random Forest and XGBoost learn the circular pattern, but their probability surfaces contain more block-like regions. This is consistent with how tree models partition the feature space. A tree often produces a piecewise-constant function of the form:

\[ f(x) = \sum_{m} c_m \, \mathbf{1}\{x \in R_m\} \]

Here, \(R_m\) is one region of the input space, such as a leaf region in a tree, and \(c_m\) is the prediction assigned to that region. Averaging many trees can smooth this behavior, but the partitioning can still be visible in simple 2D plots.
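
A one-dimensional sketch makes the step structure visible. The regions and constants below are invented for illustration and are not taken from a fitted Random Forest or XGBoost model.

```python
# A 1D piecewise-constant predictor: f(x) = sum_m c_m * 1{x in R_m}.
# Regions and constants are made up; they do not come from a fitted tree.

def piecewise_constant(x, regions):
    """regions: list of ((low, high), c_m) with half-open intervals [low, high)."""
    for (low, high), c_m in regions:
        if low <= x < high:
            return c_m
    raise ValueError("x falls outside every region")


tree_a = [((0.0, 1.0), 0.1), ((1.0, 2.0), 0.9), ((2.0, 4.0), 0.2)]
tree_b = [((0.0, 1.5), 0.2), ((1.5, 2.5), 0.8), ((2.5, 4.0), 0.1)]

def averaged(x):
    """Averaging two trees gives more, smaller steps, but it is still stepped."""
    return (piecewise_constant(x, tree_a) + piecewise_constant(x, tree_b)) / 2


grid = [0.5, 1.2, 1.7, 2.2, 3.0]
print([piecewise_constant(x, tree_a) for x in grid])
print([round(averaged(x), 2) for x in grid])
```

The averaged function takes more distinct values than either tree alone, which is the smoothing effect of an ensemble, yet every value still changes only at a region edge.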

TabPFN’s smoother radial probability surface reflects a different inductive bias: the boundary comes from a pretrained model conditioning on the current rows as context rather than from task-specific tree partitions learned only from this dataset.

This 2D decision-boundary plot is useful because the toy dataset has two features. In real high-dimensional tabular projects, the same diagnostic idea usually shows up through calibration curves, residual plots by feature bins, partial dependence or ICE plots, SHAP dependence plots, and segment-level error analysis.

The lesson is not “TabPFN is always better.” The lesson is that the same probability-surface diagnostic can reveal different inductive biases across models.

2.2 Regression curve fitting

The classification example inspected the probability surface \(\eta(x)\). The second example shifts to regression and inspects the learned mean function \(\mu(x)\). It uses a simple one-dimensional regression problem whose noiseless target, as the helper name generate_sinx_plus_x indicates, is a sine wave added to a linear trend in \(x\).

The notebook samples 40 training points, fits Random Forest, XGBoost, and TabPFN, and predicts on a dense grid.

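import numpy as np
from sklearn.ensemble import RandomForestRegressor
from tabpfn import TabPFNRegressor
from xgboost import XGBRegressor

# generate_sinx_plus_x is a helper defined in the notebook.
X_train, y_train = generate_sinx_plus_x(N=40)
# Dense grid for drawing each model's prediction curve.
X_test = np.linspace(0, 20, 200).reshape(-1, 1)

rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)
xgb = XGBRegressor(random_state=42).fit(X_train, y_train)

tabpfn = TabPFNRegressor()
tabpfn.fit(X_train, y_train)

y_pred_tabpfn = tabpfn.predict(X_test)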

Sin curve fitting comparison

The tree-based models follow the data, but their predictions are more step-like. This is expected because trees partition the feature space into regions. TabPFN produces a smoother curve that follows the sinusoidal trend.

For real regression projects, this kind of inspection maps directly to predicted-vs-actual plots, residual plots, and residuals by feature bins.

Smooth regression is not unique to TabPFN. A spline model, Gaussian process, neural network, or tuned boosting workflow may also produce smooth predictions. The TabPFN-specific point is that the estimated predictive mean \(\mu(x)\) is shaped by the pretrained model’s learned prior and the small context dataset, without building a custom smooth model for this toy function.

2.3 Regression uncertainty with quantiles

The regression curve example focused on the mean function \(\mu(x)\). The third example moves from mean prediction to uncertainty. The notebook creates a line with heteroscedastic noise, meaning the noise grows with \(x\). In mathematical form, the toy data is approximately:

\[ y = 0.8x + \sigma(x)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1) \]

Here, \(x\) is the input, \(0.8x\) is the noiseless line, \(\sigma(x)\) is the input-dependent noise scale, and \(\epsilon\) is standard normal random noise.

The notebook also leaves a gap in the training data. This lets us inspect whether TabPFN expresses higher uncertainty where the data is noisier or sparse.
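
For readers who want to reproduce this kind of dataset without the notebook helper, here is a minimal sketch. The noise scale sigma(x) = 0.1 + 0.2x, the input range, and the gap location are my assumptions; the notebook may use different choices.

```python
import random

def make_heteroscedastic_line(n=200, gap=(4.0, 6.0), seed=0):
    """Sample y = 0.8 * x + sigma(x) * eps with a gap in x.

    sigma(x) = 0.1 + 0.2 * x and the gap location are assumptions made for
    this sketch; the notebook's helper may use different choices.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    while len(xs) < n:
        x = rng.uniform(0.0, 10.0)
        if gap[0] <= x <= gap[1]:
            continue                      # leave a hole in the training inputs
        sigma = 0.1 + 0.2 * x             # noise scale grows with x
        xs.append(x)
        ys.append(0.8 * x + sigma * rng.gauss(0.0, 1.0))
    return xs, ys


xs, ys = make_heteroscedastic_line()
left = [abs(y - 0.8 * x) for x, y in zip(xs, ys) if x < 4.0]
right = [abs(y - 0.8 * x) for x, y in zip(xs, ys) if x > 6.0]
print(f"mean |noise| for x < 4: {sum(left) / len(left):.2f}")
print(f"mean |noise| for x > 6: {sum(right) / len(right):.2f}")
```

The two printed values give a quick check that the noise really is larger on the right side of the gap, which is the pattern the quantile bands should pick up.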

The key call is:

reg = TabPFNRegressor()
reg.fit(x, y_noisy)
# "full" requests several summaries of the predictive distribution at once
preds = reg.predict(x_test, output_type="full")

With output_type="full", TabPFN returns several summaries of the predictive distribution, including mean, median, mode, and quantiles. If quantiles are not specified, the default quantiles are:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9].

Regression uncertainty with quantiles

The left panel shows the generated data. The right panel shows TabPFN’s predictive quantile bands. The bands are narrow where the data is dense and low-noise. They become wider in noisier regions and around the gap where the model has less direct context.

This behavior is consistent with useful distributional predictions, but the intervals still need to be validated with held-out coverage checks.

The TabPFN-specific point here is convenience and integration: TabPFN regression can expose distributional summaries directly through the model output interface. Traditional supervised ML can also provide uncertainty, but usually through a separate method such as quantile regression, conformal prediction, Bayesian modeling, or ensembling.

Quantiles are model outputs, not guarantees. For a held-out dataset, I would request the two interval endpoints explicitly and compute empirical coverage:

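# Request only the two endpoints of the central 80% interval
q10_pred, q90_pred = reg.predict(
    X_holdout,
    output_type="quantiles",
    quantiles=[0.1, 0.9],
)

# Fraction of held-out targets that fall inside [q10, q90]
coverage_80 = np.mean((y_holdout >= q10_pred) & (y_holdout <= q90_pred))
print(f"80% interval coverage: {coverage_80:.3f}")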

Here, X_holdout and y_holdout are rows and targets that were not used in reg.fit(...). If the value is close to 0.8, allowing for sampling noise, the interval is roughly calibrated on that held-out sample. If it is much lower, the model is overconfident. If it is much higher, the intervals may be too wide to be operationally useful. This is a diagnostic, not a proof of calibration for every future segment.
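
How close is close enough depends on the held-out sample size. One rough way to judge it is the binomial standard error of the coverage estimate at the nominal level. The sketch below assumes independent held-out rows and uses made-up numbers in place of real TabPFN outputs.

```python
import math

def coverage_with_error(y_true, lower, upper, nominal=0.8):
    """Empirical coverage plus a rough binomial standard error at the nominal level."""
    m = len(y_true)
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    coverage = hits / m
    std_err = math.sqrt(nominal * (1 - nominal) / m)
    return coverage, std_err


# Made-up held-out targets and interval endpoints, for illustration only.
y_true = [1.0, 2.5, 3.1, 4.8, 5.2, 6.9, 7.4, 8.0, 9.6, 10.3]
lower = [0.5, 1.5, 2.0, 3.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0]
upper = [1.5, 3.0, 4.0, 5.0, 6.5, 8.0, 8.5, 9.0, 9.5, 11.0]

coverage, std_err = coverage_with_error(y_true, lower, upper)
print(f"coverage = {coverage:.2f}, rough noise band around 0.80 = +/- {2 * std_err:.2f}")
```

With only ten rows the noise band is about a quarter wide, so a coverage of 0.6 or 1.0 would not be strong evidence of miscalibration. With a thousand rows the same band shrinks to roughly 0.03.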

3. Summary and Conclusion

In this post, I used predictive behavior as a diagnostic lens for comparing TabPFN with familiar supervised ML baselines.

The conceptual section connected the notebook examples to three mathematical objects: the classification probability surface \(\eta(x)\), the regression mean function \(\mu(x)\), and the regression predictive distribution with quantiles \(Q_\alpha(x)\). This made the later plots easier to interpret because each visual had a mathematical object behind it.

The hands-on examples then showed those objects in code. The classification example compared probability surfaces, the regression curve-fitting example compared learned mean functions, and the uncertainty example used TabPFN regression quantiles to inspect how predictive intervals behave in noisy or sparse regions.

The main takeaway is that the diagnostics themselves are not new to tabular foundation models. What is different in TabPFN is the workflow: a pretrained tabular model conditions on the current dataset as context, and in regression it can expose distributional summaries such as quantiles directly through its prediction API. That makes it worth asking, example by example, whether TabPFN gives useful behavior with less task-specific tuning.

With this post, I have moved one step further through the official TabPFN hands-on demo. In the upcoming posts, I will continue exploring the remaining parts of the TabPFN ecosystem and focus on ideas that can be applied in real tabular data workflows. Stay tuned.

Search & Ranking Systems: A Practical Guide for Data Scientists

Mohit Saharan — Sat, 06 Dec 2025 16:28:01 GMT

Search and ranking power discovery in almost every digital product we use: ecommerce, content, jobs, maps, support, and more.

This post is a technical, hands‑on introduction to the main building blocks of search and ranking systems. We’ll look at:

Query understanding
Semantic search
Spell correction
Ranking / Learning-to-Rank
Personalization
Transformer architectures
Two-tower architectures
LLM-based ontologies
Evaluating search results at scale

The goal is that by the end, you’ll know what each concept means and how it fits into a production stack.

1. Big Picture: How Modern Search Systems Work

Let’s start with the mental model.

When a user types a query (“sushi”) into a search box, a typical production system looks like this:

Each block maps to the buzzwords:

Query understanding: clean and interpret the text (“susshi” → “sushi”, detect intent, extract entities).
Semantic search & Two‑Tower architectures: retrieve candidates by meaning, not just exact keywords.
Learning‑to‑Rank: order candidates using ML based on relevance signals.
Personalization: adjust ranking per user.
Transformers & NLP: power semantic understanding, embeddings, and generative components.
LLM‑based ontologies: structure the catalog & concepts to make search smarter.
Evaluation at scale: measure all of this with offline metrics and online A/B tests.

We’ll go through these components with code and simple architecture diagrams.

2. Learning‑to‑Rank (LTR): The Core of Ranking

2.1 Problem setup

In many supervised ML setups you predict one label per row.

In Learning‑to‑Rank, you care about ordering documents for a query:

Input: query q and documents d₁, …, dₙ
Features: x_i = f(q, d_i, user, context)
Output: scores s_i that induce a ranking

Data is grouped by query:

 Query q1:  d1, d2, d3  → label: [2, 0, 1] (relevance levels)
 Query q2:  d4, d5      → label: [0, 1]
 ...

2.2 Three main LTR paradigms

Pointwise
- Treat each (q, d) as an independent sample.
- Example: regression on a relevance score (0–3) or classification (clicked / not).
Pairwise
- Learn from pairs of documents for the same query.
- For each pair (d+, d−): model learns score(d+) > score(d−).
Listwise
- Optimize over the whole ranked list (e.g. approximating NDCG).
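
To make the pairwise view concrete, here is a small sketch that turns the grouped labels from Section 2.1 into preference pairs. The function name and data layout are my own choices for illustration; ranking libraries build these pairs internally.

```python
from itertools import combinations

def preference_pairs(labels_by_query):
    """Turn graded labels per query into (query, better_doc, worse_doc) pairs."""
    pairs = []
    for query, doc_labels in labels_by_query.items():
        for (doc_a, rel_a), (doc_b, rel_b) in combinations(doc_labels.items(), 2):
            if rel_a > rel_b:
                pairs.append((query, doc_a, doc_b))
            elif rel_b > rel_a:
                pairs.append((query, doc_b, doc_a))
            # Ties carry no preference, so they produce no pair.
    return pairs


labels_by_query = {
    "q1": {"d1": 2, "d2": 0, "d3": 1},
    "q2": {"d4": 0, "d5": 1},
}
for pair in preference_pairs(labels_by_query):
    print(pair)
```

Pairs are only formed within a query. Comparing d1 from q1 with d5 from q2 would be meaningless, which is why the data has to stay grouped.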

In practice, gradient-boosted trees (LambdaMART, XGBoost ranker, LightGBM ranker) are very common in industry because they perform well, are relatively fast, and handle heterogeneous features nicely.

2.3 Minimal LTR example with XGBoost

Below is a toy example using XGBRanker to rank results for different queries.

 import numpy as np
 from xgboost import XGBRanker
 
 # Toy data: 3 queries, each with some candidate items
 # Features could be: [BM25_score, semantic_similarity, popularity]
 X = np.array([
     [2.1, 0.4, 10],  # q1-d1
     [1.2, 0.7, 30],  # q1-d2
     [0.5, 0.2,  5],  # q1-d3
     [0.1, 0.9, 50],  # q2-d4
     [0.3, 0.3, 20],  # q2-d5
     [1.0, 0.1,  1],  # q3-d6
     [0.9, 0.6, 15],  # q3-d7
 ])
 
 # Relevance labels (higher = more relevant)
 y = np.array([
     2,  # q1-d1
     3,  # q1-d2
     0,  # q1-d3
     3,  # q2-d4
     1,  # q2-d5
     0,  # q3-d6
     2,  # q3-d7
 ])
 
 # Group size: number of documents per query
 group = [3, 2, 2]  # q1 has 3 docs, q2 has 2, q3 has 2
 
 model = XGBRanker(
     objective="rank:pairwise",
     n_estimators=100,
     learning_rate=0.1,
     max_depth=4,
     subsample=0.8,
     colsample_bytree=0.8
 )
 
 model.fit(X, y, group=group)
 
 # Predict and see the ranking for query 1 documents
 scores_q1 = model.predict(X[0:3])
 ranking_q1 = np.argsort(-scores_q1)  # descending
 print("Scores q1:", scores_q1)
 print("Ranking q1 (indices):", ranking_q1)

In a real system:

X is built from many features: lexical, semantic, user, item, context.
y often comes from logs (clicks, purchases) with some preprocessing.
group is derived from query IDs.

This is the backbone ranking block in the earlier pipeline diagram.
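
On the last point in the list above, deriving group from query IDs takes only a few lines once the rows are sorted by query. This sketch assumes the rows for each query are contiguous, which is also what the group sizes in the XGBRanker example rely on.

```python
from itertools import groupby

def group_sizes(query_ids):
    """Count consecutive rows per query; rows must already be sorted by query ID."""
    return [len(list(rows)) for _, rows in groupby(query_ids)]


# One query ID per feature row, in the same order as the rows of X.
query_ids = ["q1", "q1", "q1", "q2", "q2", "q3", "q3"]
print(group_sizes(query_ids))  # [3, 2, 2]
```

If the rows are not sorted, the same query ID appears in several runs and the group sizes silently split one query into many, so sort first.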

3. Query Understanding

Query understanding is about turning raw user text into something the system can reason about.

Typical sub-tasks:

Normalization
- Lowercasing, Unicode normalization, removing punctuation.
- Handling accents (“papá” → “papa”), transliteration.
Tokenization
- Split into tokens, handle multi‑word entities (“ice cream”).
Spell correction / typo handling
- “suhshi” → “sushi”.
- “iphon 15” → “iphone 15”.
Intent classification
- Is the user searching for a product, a category, a help article?
- Example labels: {"product_search", "faq_search", "navigation", ...}
Entity extraction
- Extract “sushi”, “Amsterdam”, “vegan”, etc.
- Map to catalog entities: category IDs, locations, cuisines…
Query rewriting / expansion
- Add synonyms, canonicalize terms (“veggie” → “vegetarian”).
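
The normalization and canonicalization steps above can be sketched with the standard library alone. The synonym table is a hypothetical stand-in for a real domain dictionary.

```python
import re
import unicodedata

SYNONYMS = {"veggie": "vegetarian"}   # hypothetical canonicalization table

def normalize_query(text):
    """Lowercase, strip accents and punctuation, then canonicalize tokens."""
    text = unicodedata.normalize("NFKD", text.lower())
    # After NFKD, accents are separate combining characters that can be dropped.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes whitespace
    tokens = text.split()
    return [SYNONYMS.get(tok, tok) for tok in tokens]


print(normalize_query("Papá, VEGGIE sushi!!"))
```

Stripping accents is a trade-off: it helps recall for users who type without them, but it can merge words that are distinct in some languages, so it is worth checking per market.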

3.1 Minimal intent classification example

You can treat query intent classification as a standard text classification problem:

 from sklearn.pipeline import Pipeline
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.linear_model import LogisticRegression
 
 queries = [
     "iphone 15 pro max",
     "refund policy",
     "restaurants near me",
     "track my order",
     "vegan sushi",
 ]
 
 labels = [
     "product_search",
     "faq_search",
     "local_search",
     "faq_search",
     "product_search",
 ]
 
 # Word unigram and bigram TF-IDF features feeding a linear classifier
 pipe = Pipeline([
     ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
     ("clf", LogisticRegression(max_iter=1000)),
 ])
 
 pipe.fit(queries, labels)
 
 print(pipe.predict(["cancellation policy"]))
 print(pipe.predict(["best burgers in berlin"]))

In production, you’d likely use:

Better tokenization (e.g. spaCy, HuggingFace tokenizers)
Possibly a transformer encoder (see next sections)
More sophisticated labels and training data

4. Semantic Search & Two‑Tower Architectures

4.1 Lexical vs semantic search

Traditional search (BM25, TF‑IDF) is lexical:

Relevance is based on overlap of terms between query and document.
“cheap phone” won’t match “affordable smartphone” very well.

Semantic search uses vector representations (embeddings) of text:

Encode queries and documents into vectors in ℝᵈ.
Similar meanings → close in vector space (via cosine / dot product).
Retrieval: find top‑K documents with highest similarity.

4.2 Two‑Tower (Dual-Encoder) architecture

For scalable semantic retrieval, a common pattern is the Two‑Tower architecture:

The query encoder and item encoder share architecture but may (or may not) share weights.
You pre-compute and index item embeddings in an ANN index (FAISS, ScaNN, etc).
At query time, compute q_vec, then retrieve top‑K items by similarity.
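
The offline and online split can be sketched in a few lines. The vectors below are made up, and the brute-force search is only a stand-in for an ANN index; a real system would get the vectors from the two encoders.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def top_k(query_vec, item_index, k=2):
    """Brute-force cosine search; a stand-in for an ANN index such as FAISS."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, item_vec)), item_id)
        for item_id, item_vec in item_index.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Offline: item-tower outputs, normalized once and stored (vectors are made up).
item_index = {
    "cheap_phone": normalize([0.9, 0.1, 0.0]),
    "vegan_restaurant": normalize([0.0, 0.2, 0.9]),
    "used_cars": normalize([0.1, 0.9, 0.1]),
}

# Online: the query tower would produce this vector for the incoming query.
q_vec = [0.8, 0.2, 0.1]
for score, item_id in top_k(q_vec, item_index):
    print(f"{item_id}: {score:.3f}")
```

Because the item side never sees the query, its vectors can be computed once and reused for every request. That independence is what makes the pattern scale.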

4.3 Training objective (contrastive learning)

Given training triples (q, d⁺, d⁻) where d⁺ is relevant and d⁻ is not, you can train with a contrastive loss.

Pseudo‑code:

 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 
 class DualEncoder(nn.Module):
     def __init__(self, text_encoder):
         super().__init__()
         self.query_encoder = text_encoder()
         self.doc_encoder = text_encoder()
 
     def encode_query(self, queries):
         return self.query_encoder(queries)  # [batch, dim]
 
     def encode_doc(self, docs):
         return self.doc_encoder(docs)      # [batch, dim]
 
 def contrastive_loss(q_vecs, d_pos_vecs, d_neg_vecs, temperature=0.05):
     # q_vecs, d_pos_vecs: [batch, dim]; d_neg_vecs: [batch, num_neg, dim]
     # Construct logits: [batch, 1 + num_neg]
     pos_scores = (q_vecs * d_pos_vecs).sum(dim=-1, keepdim=True)  # [B, 1]
     neg_scores = (q_vecs.unsqueeze(1) * d_neg_vecs).sum(dim=-1)   # [B, num_neg]
 
     logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature
     # Index 0 is the positive; keep labels on the same device as the logits
     labels = torch.zeros(q_vecs.size(0), dtype=torch.long, device=logits.device)
 
     return F.cross_entropy(logits, labels)

In practice you’d use a real transformer encoder (e.g., a BERT-like model), proper batching, and possibly in-batch negatives, where each query’s positive document also serves as a negative for the other queries in the batch.
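
The in-batch negatives idea can be sketched without a deep learning library. This is a plain-Python illustration of the loss value only, with made-up two-dimensional embeddings; a real training loop would compute the same quantity on tensors so that gradients flow back into the encoders.

```python
import math

def in_batch_contrastive_loss(q_vecs, d_vecs, temperature=0.05):
    """Mean cross-entropy where d_vecs[i] is the positive for q_vecs[i]
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(q_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in d_vecs]
        max_logit = max(logits)                      # for numerical stability
        log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_norm - logits[i])          # -log softmax at the positive
    return sum(losses) / len(losses)


# Made-up unit-length embeddings: aligned pairs vs. shuffled pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[1.0, 0.0], [0.0, 1.0]]
docs_shuffled = [[0.0, 1.0], [1.0, 0.0]]

print(in_batch_contrastive_loss(queries, docs_aligned))
print(in_batch_contrastive_loss(queries, docs_shuffled))
```

The loss is near zero when each query is closest to its own document and large when the pairs are swapped, which is the behavior the training objective rewards.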

4.4 Simple semantic search example with pre-trained model

If you don’t want to train from scratch, you can use existing sentence embedding models:

 from sentence_transformers import SentenceTransformer, util
 
 model = SentenceTransformer("all-MiniLM-L6-v2")
 
 documents = [
     "Cheap smartphone with good battery life",
     "Italian restaurant with vegan options",
     "Used car marketplace",
 ]
 doc_emb = model.encode(documents, convert_to_tensor=True)
 
 query = "affordable phone with long battery"
 q_emb = model.encode(query, convert_to_tensor=True)
 
 cos_scores = util.cos_sim(q_emb, doc_emb)[0]
 top_k = cos_scores.topk(k=3)
 for score, idx in zip(top_k.values, top_k.indices):
     print(float(score), documents[int(idx)])

This is a semantic retrieval layer that can feed candidates into your LTR model.

5. Transformer Architecture: Why It Shows Up Everywhere

Transformers underpin much of modern NLP, including semantic search, query understanding, and LLMs.

5.1 Core ideas

Input sequence is tokenized into tokens t₁, …, tₙ.
Each token mapped to an embedding, plus positional encodings.
Multiple layers of self‑attention + feed‑forward networks.
Self‑attention lets each token attend to all others in the sequence.
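
To show what "each token attends to all others" means mechanically, here is a deliberately stripped-down sketch of single-head scaled dot-product attention. It assumes identity query, key, and value projections and made-up token embeddings; real transformers learn those projections and stack many heads and layers.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections.

    Real transformers learn separate projection matrices and use many heads;
    they are omitted here to keep the mechanics visible.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)                    # how much to attend to each token
        mixed = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        outputs.append(mixed)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # made-up token embeddings
for row in self_attention(tokens):
    print([round(v, 3) for v in row])
```

Each output row is a weighted mix of all input tokens, so the representation of a token depends on its context. That is what "contextual embedding" refers to.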

Simplified encoder block:

Encoder‑only models (BERT, RoBERTa):

Great for classification, retrieval, sentence embeddings.
Often used as the backbone in query/item encoders.

Decoder‑only / LLMs (GPT‑like):

Great for generative tasks: query rewriting, summarization, plan generation, ontology induction (later section).

You don’t need to derive the attention equations from scratch to work effectively with them in search; you need to know:

They produce contextual embeddings (token/sequence representations).
You can fine-tune them for:
- Query intent classification
- Semantic retrieval (dual encoder)
- Text-to-structure tasks (ontology building, entity extraction).

6. Spell Correction

Users type fast and on mobile; typos are guaranteed.

6.1 Classic view: edit distance + language model

Step 1: Candidate generation

Generate strings within edit distance ≤ 1 or 2 from the input.
Filter to those seen in your corpus/catalog (e.g., product names).

Step 2: Candidate scoring

Use frequency and language models:
- score(candidate) = P(noisy_query | candidate) * P(candidate)
- Choose the candidate with highest score.
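
The two steps can be sketched in the style of Peter Norvig's well-known spelling corrector. This version simplifies the scoring: it treats every candidate within one edit as equally likely under the error model and ranks by frequency only. The counts are hypothetical.

```python
import string

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick the most frequent known word within one edit; P(candidate) ~ count."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# Hypothetical term frequencies from a query log or catalog.
counts = {"sushi": 120, "sashimi": 40, "sandwich": 75}
print(correct("suhshi", counts))
print(correct("sandwhich", counts))
```

Generating edits from the input and filtering against the vocabulary avoids scanning the whole vocabulary for every query, which matters once the catalog is large.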

A simple implementation uses Levenshtein distance as a heuristic:

 def levenshtein(a, b):
     dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
     for i in range(len(a) + 1):
         dp[i][0] = i
     for j in range(len(b) + 1):
         dp[0][j] = j
     for i in range(1, len(a) + 1):
         for j in range(1, len(b) + 1):
             cost = 0 if a[i-1] == b[j-1] else 1
             dp[i][j] = min(
                 dp[i-1][j] + 1,      # deletion
                 dp[i][j-1] + 1,      # insertion
                 dp[i-1][j-1] + cost  # substitution
             )
     return dp[-1][-1]
 
 def best_correction(query, vocab):
     best = query
     best_dist = float("inf")
     for v in vocab:
         d = levenshtein(query, v)
         if d < best_dist:
             best_dist = d
             best = v
     return best
 
 vocab = ["sushi", "suspect", "sandwich"]
 print(best_correction("suhshi", vocab))  # → "sushi"

In production you’ll:

Use more efficient algorithms (trigram indexes, BK‑trees).
Include language model / semantic signals (e.g. user typed “suhshi restaurant” → strongly prefer “sushi”).
Often treat it as a ranking problem (again): generate candidates, then rank with ML.

6.2 Neural spell correction

Modern systems often use sequence‑to‑sequence transformers trained on (noisy, clean) pairs:

Input: “suhshi near me”
Output: “sushi near me”

These can be more robust to complex typos and spacing issues, and they can correct multi-token sequences in one pass.

7. Personalization

Search relevance is not one‑size‑fits‑all:

Some users care about price.
Others care about delivery time, ratings, brand, etc.

7.1 Feature-based personalization

The simplest approach: add user and context features into your LTR model:

User features:
- Long‑term engagement by category
- Average price of past purchases
- “Healthiness”, “vegan”, “premium” preferences
Context features:
- Time of day, day of week
- Device, location
- Session-level features

The LTR model learns how these features interact with item features.
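
As a sketch of what such a feature row might look like, the function below combines user, item, and context signals and adds two explicit user-item interaction features. All field names are hypothetical; use whatever your logs and catalog provide.

```python
def build_features(user, item, context):
    """Combine user, item, and context signals into one LTR feature row.

    All field names here are hypothetical; use whatever your logs provide.
    """
    return {
        "item_price": item["price"],
        "item_rating": item["rating"],
        # Interaction features make the match between user and item explicit.
        "price_gap": abs(item["price"] - user["avg_purchase_price"]),
        "category_affinity": user["category_share"].get(item["category"], 0.0),
        "is_mobile": 1.0 if context["device"] == "mobile" else 0.0,
        "is_evening": 1.0 if context["hour"] >= 18 else 0.0,
    }


user = {"avg_purchase_price": 12.0, "category_share": {"japanese": 0.6, "italian": 0.1}}
item = {"price": 15.0, "rating": 4.5, "category": "japanese"}
context = {"device": "mobile", "hour": 20}

print(build_features(user, item, context))
```

Tree-based rankers can discover interactions on their own, but hand-built ones such as price_gap often help when training data per user is thin.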

7.2 Two‑Tower for personalized retrieval

You can extend the Two‑Tower idea:

User embedding is learned from user ID, interaction history, etc.
Item embedding from item features.
Train with contrastive loss: user should be close to interacted items, far from non‑interacted items.
Retrieval at serving time: recommend items with closest vectors to user embedding.

This is often paired with:

Retrieval tower (user–item dual encoder) → candidate set.
Ranking model (feature-rich LTR) → final personalized ordering.

8. LLM‑based Ontologies

8.1 What is an ontology?

An ontology is a structured representation of concepts and their relationships:

Entities: categories, items, attributes.
Relationships: is_a, part_of, compatible_with, etc.

Example snippet:

Attributes for “Sushi” might include: cuisine=Japanese, serves_raw_fish=True, typically_contains_rice=True.

Ontologies power:

Query understanding (“sushi” ∈ Japanese food)
Faceted search (filter by cuisine, price range, dietary restrictions)
Recommendation diversity (ensure we cover multiple categories)
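
A tiny ontology can be represented with plain dictionaries. The entities and is_a edges below are invented for illustration; the sushi attributes reuse the example above.

```python
# A tiny ontology as plain dictionaries. The entities and relations are
# invented for illustration and do not come from a real catalog.
IS_A = {
    "sushi": "japanese_food",
    "ramen": "japanese_food",
    "japanese_food": "asian_food",
    "asian_food": "food",
}
ATTRIBUTES = {
    "sushi": {"cuisine": "Japanese", "serves_raw_fish": True, "typically_contains_rice": True},
}

def ancestors(entity):
    """Follow is_a edges upward and return every broader concept."""
    chain = []
    while entity in IS_A:
        entity = IS_A[entity]
        chain.append(entity)
    return chain

def is_a(entity, concept):
    return concept in ancestors(entity)


print(ancestors("sushi"))
print(is_a("sushi", "japanese_food"), is_a("ramen", "sushi"))
```

Walking up the is_a chain is what lets a query for a broad concept match items tagged only with a narrow one.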

8.2 How LLMs help

Traditionally, ontologies were built manually or with rule-based NLP. That doesn’t scale.

LLMs can:

Generate category trees
- Given a list of item names/descriptions, propose a hierarchical category structure.
Map items to categories
- “Assign this product to one of these categories [A, B, C].”
Extract structured attributes
- From textual descriptions, emit JSON with fields like cuisine, price_range, is_vegan_friendly.
Discover synonyms and related concepts
- For semantic expansion: “veggie”, “plant-based”, “vegetarian”.

Pseudo-code for attribute extraction with an LLM (conceptual):

 import json
 
 def extract_attributes(description, llm_client):
     # llm_client is a placeholder for whichever LLM SDK you use
     prompt = f"""
     You are an information extraction system.
     Read the following restaurant description and output JSON with fields:
     cuisine (string), price_level (one of: cheap, medium, expensive),
     vegetarian_friendly (true/false).
 
     Description: {description}
     JSON:
     """
     response = llm_client.generate(prompt)
     return json.loads(response)

Once you have an ontology:

Store it in a graph or relational DB.
Use it in:
- Query rewriting (“veggie sushi” → add constraint vegetarian_friendly=True).
- Ranking features (match between query intent and item attributes).
- Diversification (ensure results cover multiple relevant categories).
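
The query rewriting use can be sketched as a lookup from query terms to structured filters. The mapping is hypothetical, and moving matched terms out of the free text is one possible design; you could also keep them in the text and add the constraint on top.

```python
# Hypothetical mapping from query terms to structured filters.
TERM_TO_CONSTRAINT = {
    "veggie": ("vegetarian_friendly", True),
    "vegetarian": ("vegetarian_friendly", True),
    "cheap": ("price_level", "cheap"),
}

def rewrite_query(query):
    """Split a query into free-text terms and attribute constraints."""
    terms, constraints = [], {}
    for token in query.lower().split():
        if token in TERM_TO_CONSTRAINT:
            field, value = TERM_TO_CONSTRAINT[token]
            constraints[field] = value
        else:
            terms.append(token)
    return " ".join(terms), constraints

def matches(item, constraints):
    """True when the item satisfies every extracted constraint."""
    return all(item.get(field) == value for field, value in constraints.items())


text, constraints = rewrite_query("cheap veggie sushi")
print(text, constraints)
```

The constraints can act as hard filters or as ranking features. Treating them as features is the safer default when attribute coverage in the catalog is incomplete.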

9. Evaluating Search Results at Scale

You can’t improve what you don’t measure. Search evaluation happens on two axes:

Offline metrics: using logged or labeled data.
Online metrics: A/B tests on real traffic.

9.1 Offline metrics: NDCG, MRR, Recall@K

Given query q, documents d₁..dₙ with relevance labels rel_i and system ranking:

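DCG@K (Discounted Cumulative Gain): DCG@K = Σ_{i=1..K} (2^(rel_i) − 1) / log₂(i + 1)
IDCG@K = DCG of ideal ranking (sort by true relevance).
NDCG@K: NDCG@K = DCG@K / IDCG@K, taken as 0 when IDCG@K = 0.
MRR@K (Mean Reciprocal Rank): the average over queries of 1 / rank_q, where rank_q is the position of the first relevant result for query q, counting 0 when no relevant result appears in the top K.
Recall@K: fraction of relevant items retrieved in top‑K.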

Simple NDCG@K implementation:

 import numpy as np
 
 def dcg_at_k(rels, k):
     rels = np.asarray(rels)[:k]
     gains = 2 ** rels - 1
     discounts = np.log2(np.arange(2, len(rels) + 2))
     return np.sum(gains / discounts)
 
 def ndcg_at_k(rels, k):
     ideal = sorted(rels, reverse=True)
     idcg = dcg_at_k(ideal, k)
     if idcg == 0:
         return 0.0
     return dcg_at_k(rels, k) / idcg
 
 # Example: model ranking with relevance labels
 rels = [3, 0, 2, 1]  # relevance of items at positions 1..4
 print(ndcg_at_k(rels, k=3))

In a real pipeline:

Log per‑query predictions and labels.
Aggregate NDCG@K, MRR@K, Recall@K across queries.
Compare new models offline before going to A/B testing.
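
MRR@K and Recall@K follow the same pattern as the NDCG code and can be aggregated across queries in a few lines. The sketch assumes each list holds the relevance labels of all candidates for that query in ranked order, so the total number of relevant items can be read off the list; the data is made up.

```python
def reciprocal_rank_at_k(rels, k):
    """1 / rank of the first relevant item in the top k, or 0.0 if there is none."""
    for rank, rel in enumerate(rels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def recall_at_k(rels, k):
    """Share of all relevant items that appear in the top k."""
    total_relevant = sum(1 for rel in rels if rel > 0)
    if total_relevant == 0:
        return 0.0
    return sum(1 for rel in rels[:k] if rel > 0) / total_relevant


# Relevance labels in ranked order for three made-up queries.
ranked_rels = {
    "q1": [3, 0, 2, 1],
    "q2": [0, 0, 1, 0],
    "q3": [0, 2, 0, 0],
}
k = 2
mrr = sum(reciprocal_rank_at_k(r, k) for r in ranked_rels.values()) / len(ranked_rels)
recall = sum(recall_at_k(r, k) for r in ranked_rels.values()) / len(ranked_rels)
print(f"MRR@{k} = {mrr:.3f}, Recall@{k} = {recall:.3f}")
```

Averaging per query, rather than pooling all documents, keeps queries with long result lists from dominating the metric.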

9.2 Online evaluation: A/B tests

Offline metrics have limitations (logging bias, missing labels, etc.). Ultimately, you care about business and user metrics:

Click-through rate (CTR)
Conversion rate (purchases, bookings)
Revenue per session
Time to first relevant result
User satisfaction proxies (bounce rate, long dwell time, etc.)

Standard approach:

Randomly bucket users into control vs treatment.
Control uses baseline search; treatment uses new ranking or retrieval.
Run for enough time / users to get statistical power.
Check:
- Primary success metrics (e.g., +X% CTR).
- Guardrail metrics (latency, errors, crash rate, etc.).
Decide whether to ship, iterate, or roll back.
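
For the primary metric check, a two-proportion z-test is one common starting point. The sketch below uses made-up counts and treats every impression as independent, which real traffic often violates because the same user contributes many impressions; user-level aggregation or a bootstrap is a safer choice in production.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (lift, z, two-sided p-value) for the CTR of B versus A."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return p_b - p_a, z, p_value


# Made-up counts: control (A) vs treatment (B).
lift, z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"absolute CTR lift = {lift:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Decide the sample size and the success threshold before the test starts. Checking the p-value repeatedly and stopping at the first significant result inflates the false positive rate.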

9.3 Evaluation at scale

At scale, you need:

Aggregation pipelines: compute metrics daily/weekly on billions of log events.
Monitoring dashboards: track relevance and business KPIs over time.
Alerting: detect regressions (e.g., NDCG drop due to bug in feature pipeline).

The modeling work (LTR, semantic search, personalization) is only useful if you can reliably measure and monitor impact.

10. How NLP Ties Everything Together

NLP is the glue between the components:

Text embeddings (transformers) → semantic search & LTR features.
Classification → intent detection, spam detection, query routing.
Sequence labeling → entity extraction (locations, product types, attributes).
Generation → query rewriting, summarizing item descriptions, ontology induction.

You don’t need to reinvent NLP; you can:

Start from pre‑trained models (e.g., BERT variants, sentence transformers).
Fine‑tune for:
- Query intent classification.
- Dual‑encoder retrieval.
- Sequence labeling for entities.
- Text‑to‑JSON extraction for ontologies.

11. Putting It All Together: A Minimal Search & Ranking Stack

Here’s a summarized architecture combining everything:

This is the ecosystem where:

Learning‑to‑Rank is the central model.
Transformers & semantic search power understanding and retrieval.
Two‑Tower architectures make ANN retrieval scalable.
NLP underpins query understanding and text features.
Personalization injects user/context into the ranking.
LLM‑based ontologies structure the catalog and enrich features.
Evaluation at scale ensures you’re improving, not just changing.

12. Where to Go Next

If you want to see all of this wired together in actual code, I have created a small end‑to‑end project you can run locally: github.com/msaharan/DSAIEngineering/search_and_ranking/search_and_ranking_demo (It links to a specific commit so the code matches the version of the post. Later versions of the code will be available on the main branch.)

It’s a self‑contained, CPU‑friendly mini stack that implements most of the ideas from this post:

Flow: normalize/understand query → retrieve lexical + semantic candidates → personalize + featurize → train/eval LTR → apply business rules → display results.
Retrievers: TF‑IDF lexical; optional SentenceTransformer semantic; optional dual‑encoder + ANN stub.
Ranking: XGBRanker when available, else RandomForest; offline metrics (NDCG/MRR).
Personalization: simple cuisine/price affinities and user–item bias.
Rules: vegan boost + cuisine diversity; lightweight ontology‑style enrichment for dietary/category/price hints.
Data: tiny CSVs in data/ so you can inspect and change everything.

12.1 Run the demo

From the repo root:

 cd search_and_ranking/search_and_ranking_demo
 docker build -t search-ranking-demo .
 docker run --rm -it search-ranking-demo        # semantic + LTR pipeline
 # Lexical-only:
 # docker run --rm -it search-ranking-demo python run_demo.py
 # Dual-encoder + ANN:
 # docker run --rm -it search-ranking-demo python run_demo.py --semantic --dual

The script will, end‑to‑end:

Query understanding & intent classification
Train a TF‑IDF + logistic regression intent classifier on data/query_intents.csv, and add simple query normalization + synonym expansion.
Lexical + semantic retrieval
Build a TF‑IDF lexical retriever, optionally add a SentenceTransformer semantic retriever, and (optionally) a tiny dual‑encoder + ANN stub for semantic candidate generation.
Personalization
Construct simple user profiles (cuisine and price affinities + user–item bias) from data/query_doc_labels.csv, then inject those features into ranking.
Learning‑to‑Rank
Train a ranking model (XGBRanker if available, else RandomForest) on grouped query–item relevance labels and report offline metrics (NDCG, MRR) on held‑out queries.
Business rules & ontology‑style enrichment
Apply lightweight rules such as vegan boosting and cuisine diversity, plus simple ontology‑style enrichment (dietary hints, categories, price range) derived from catalog metadata.
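
To make the business-rules step concrete, here is a minimal sketch of a vegan boost followed by a greedy cuisine-diversity pass. It is not the demo's actual code: the function name `apply_rules`, the item fields (`id`, `score`, `cuisine`, `vegan`), and the default values for `vegan_boost` and `max_per_cuisine` are all assumptions made for illustration.

```python
def apply_rules(ranked, query, vegan_boost=0.2, max_per_cuisine=2):
    """Re-rank model output with two simple rules (hypothetical sketch).

    ranked: list of dicts with 'id', 'score', 'cuisine', and 'vegan' keys.
    """
    # Rule 1: boost vegan items only when the query asks for vegan food.
    wants_vegan = "vegan" in query.lower().split()
    boosted = [
        {**item, "score": item["score"] + (vegan_boost if wants_vegan and item["vegan"] else 0.0)}
        for item in ranked
    ]
    boosted.sort(key=lambda item: item["score"], reverse=True)

    # Rule 2: allow at most `max_per_cuisine` items per cuisine near the top;
    # the overflow is appended afterwards in its original score order.
    kept, deferred, seen = [], [], {}
    for item in boosted:
        count = seen.get(item["cuisine"], 0)
        if count < max_per_cuisine:
            kept.append(item)
            seen[item["cuisine"]] = count + 1
        else:
            deferred.append(item)
    return kept + deferred
```

Keeping rules as a post-processing step like this means you can change or disable them without retraining the ranking model.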

12.2 Suggested experiments

Once you have it running, here are concrete exercises that map back to sections of this post:

Lexical vs semantic search (Sections 3–4)
- Run lexical‑only, then --semantic.
- Compare which documents surface for ambiguous or “fuzzy” queries.
- Tweak TF‑IDF and embedding models and see how NDCG/MRR change.
Play with query understanding (Section 3)
- Add new intents and examples to query_intents.csv.
- Extend the synonym/expansion logic for your own domain.
- Observe how better query understanding changes candidate sets and ranking.
Modify personalization (Section 7)
- Change how user cuisine/price affinities are computed.
- Add new user‑level features (e.g., “likes cheap & fast” vs “likes premium & slow”).
- Watch how different user profiles get different rankings for the same query.
Extend the LTR feature set (Section 2)
- Edit the ranking feature construction to add new signals (e.g., distance, freshness, popularity buckets).
- Re‑train the model and inspect which features matter most.
Experiment with ontology‑style features (Section 8)
- Enrich catalog.csv with more structured attributes (dietary tags, categories, price bands).
- Use them in query understanding (e.g., detect “vegan”, “cheap”) and as ranking features.
- If you have access to an LLM, try auto‑generating these attributes for new items.
Change evaluation settings (Section 9)
- Adjust how train/validation splits are done.
- Compute NDCG/MRR at different K values and inspect failure cases manually.
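
For the last exercise, it helps to have the metrics in a form you can call with any K. The following is a small sketch, not the demo's implementation: it uses the linear-gain variant of DCG (the relevance label itself rather than 2^rel − 1), and the function names are my own.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mrr(rankings):
    """Mean reciprocal rank of the first relevant item across queries."""
    total = 0.0
    for relevances in rankings:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0
```

Each list holds the relevance labels of one query's results in ranked order, so `ndcg_at_k(labels, 3)` and `ndcg_at_k(labels, 10)` can be compared directly on the same held-out queries.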

Run the demo end-to-end once and you’ll have a working template you can adapt to your own search and recommendation projects.

KServe Explained: A Practical Guide to Serving ML & GenAI on Kubernetes

Mohit Saharan — Mon, 24 Nov 2025 16:48:07 GMT

Source: https://kserve.github.io

KServe is an open‑source system for serving machine‑learning and generative AI models on Kubernetes.

If you’re a junior–mid‑level data scientist or ML engineer, you’ve probably already hit the “OK, my model works in a notebook… now how do other people use it?” wall. KServe is one of the main answers to that question in Kubernetes‑based environments.

This article walks through what KServe is, how the architecture in the image fits together, and what it looks like to use it in real life.

1. The problem KServe solves

Once a model is trained, teams usually need to:

Turn it into an API for apps, dashboards, or other services
Scale it up and down as traffic changes
Run it on different hardware (CPU, GPU, etc.)
Handle versioning, canary rollouts, and rollbacks
Collect metrics and logs for debugging and monitoring

You can hand‑roll all this: build a Flask/FastAPI service, write Dockerfiles, create Kubernetes Deployments/Services/Ingresses, wire up autoscaling, etc. But it’s repetitive and easy to get wrong, especially when you have many models.

KServe’s goal is to give you a standard, Kubernetes‑native way to deploy and manage models so you focus on model logic, not platform plumbing.

2. Reading the diagram: the stack from bottom to top

The image shows a layered architecture. Let’s walk through it bottom‑up.

2.1 Hardware: GPU / CPU / APU

At the very bottom you see:

NVIDIA – GPU
Intel – CPU
AMD – APU

These are your compute resources. KServe does not replace them; it just helps you use them efficiently:

You can ask for CPUs only for lightweight models.
You can request GPUs when you serve LLMs or heavy deep‑learning models.
Kubernetes schedules pods onto nodes that have the requested resources.

As an ML engineer, this means you express what your model needs; you don’t hard‑code where it runs.

2.2 Kubernetes layer

Next is the Kubernetes block (EKS / AKS / GKE / on‑prem, etc.).

Kubernetes provides the basics:

Containers and pods
Networking between services
Autoscaling at the pod level
ConfigMaps, Secrets, and other configuration

KServe is just another controller running inside your cluster. If you’re comfortable with kubectl and basic Kubernetes concepts (pods, deployments, services), you’re in good shape.

2.3 Knative + Istio layer

Above Kubernetes, the diagram shows Knative + Istio.

KServe relies on these (or similar components) for:

Serverless behavior – scale to zero when idle, scale up when requests arrive
Traffic management – split traffic between two model versions (e.g., 90% old, 10% new)
Ingress and routing – getting HTTP requests into your cluster and to the right model
mTLS and observability (with Istio or another service mesh)

You don’t have to deeply understand Knative or Istio to use KServe day‑to‑day, but it helps to know they are the networking + serverless “engine” underneath.

2.4 KServe: Predictive & Generative Model Inference

This is the big blue layer in the image: “Predictive & Generative Model Inference.”
This is what we usually mean when we say “KServe.”

Here KServe provides:

A standard way to define model endpoints using Kubernetes CRDs
Built‑in support for common ML frameworks, e.g.:
- Hugging Face (transformers and LLMs)
- PyTorch
- TensorFlow
- scikit‑learn
- XGBoost / LightGBM
Features for the full inference lifecycle (as labeled above the layer):
- pre/post process – custom data transformations
- predict – standard inference
- generate – generative tasks (LLMs, image generation, etc.)
- inference graph – chaining multiple models/pipelines
- explain – model explanations
- monitor – metrics and logging hooks

Under the hood, KServe uses different model servers and runtimes such as:

Triton Inference Server
TensorFlow Serving
TorchServe
Various runtimes that align with Open Inference or OpenAI‑style APIs

You configure which runtime to use in your model spec; KServe wires it into the platform.

2.5 Model storage

On the right, the image shows “Model Storage” – a “bucket” icon.

Your models typically live in:

S3, GCS, Azure Blob, or MinIO
A shared filesystem or volume

Your KServe configuration points at a URI like s3://my-bucket/models/resnet/.

KServe then downloads and loads the model into the right inference server. This keeps model artifacts decoupled from compute so you can update or redeploy without baking models into images every time.

3. KServe’s key concepts (what you actually touch)

From a data science / ML engineering perspective, these are the building blocks you’ll work with.

3.1 `InferenceService`

The main concept is the InferenceService, a custom Kubernetes resource that represents one logical model endpoint.

Instead of defining a Deployment + Service + Ingress + Autoscaler, you define one InferenceService YAML, and KServe generates everything else.

A simplified example for a PyTorch model:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-analyzer
spec:
  predictor:
    pytorch:
      storageUri: s3://ml-models/sentiment-analyzer/
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          nvidia.com/gpu: 1

What this tells KServe:

Use the PyTorch built‑in runtime
Load the model from the given S3 path
Request 1 CPU and 2Gi of memory per pod, with a limit of 1 GPU
Create an HTTP/gRPC endpoint for inference

KServe then handles:

Creating a Knative service
Deploying pods
Autoscaling
Exposing a stable endpoint
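
If you generate many such specs, building them in code can be less error-prone than copying YAML. Below is a minimal sketch that assembles the same manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The helper name `build_inference_service` and its parameters are hypothetical; only the manifest fields come from the example above.

```python
import json


def build_inference_service(name, storage_uri, cpu="1", memory="2Gi", gpus=0):
    """Return an InferenceService manifest as a plain dict (illustrative helper)."""
    resources = {"requests": {"cpu": cpu, "memory": memory}}
    if gpus:
        # GPUs are declared as a limit, as in the YAML example above.
        resources["limits"] = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "pytorch": {"storageUri": storage_uri, "resources": resources}
            }
        },
    }


manifest = build_inference_service(
    "sentiment-analyzer", "s3://ml-models/sentiment-analyzer/", gpus=1
)
print(json.dumps(manifest, indent=2))
```

Writing the printed JSON to a file gives you something you can review in a pull request before applying it to the cluster.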

3.2 Predictors, Transformers, Explainers

Inside an InferenceService you can define three main components:

Predictor – required
- The actual model server (TensorFlow Serving, TorchServe, Triton, or a custom container).
- Handles the core predict or generate logic.
Transformer – optional
- A separate container that runs before and/or after the predictor.
- Use it for things like:
  - Tokenization and embedding lookups
  - Image decoding and normalization
  - Business‑specific response formatting
- It takes in the HTTP request, massages it, calls the predictor, then post‑processes the response.
Explainer – optional
- Tied to explainability frameworks (e.g. Alibi Explain).
- Exposes a separate /explain style endpoint for feature attributions, counterfactuals, etc.

As a DS/ML engineer, this separation lets you keep model code and data‑wrangling logic nicely separated and reusable.
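
The request flow through a transformer can be sketched with three plain functions. This does not use the KServe SDK: `preprocess`, `predict`, `postprocess`, and `handle` are stand-ins I made up, the toy vocabulary replaces real tokenization, and `predict` replaces the network call to the predictor container.

```python
def preprocess(request):
    """Transformer, step 1: turn raw text into the features the predictor expects."""
    vocabulary = {"good": 1.0, "great": 1.0, "bad": -1.0, "awful": -1.0}  # toy lexicon
    instances = []
    for text in request["instances"]:
        tokens = text.lower().split()
        sentiment = sum(vocabulary.get(token, 0.0) for token in tokens)
        instances.append([sentiment, float(len(tokens))])
    return {"instances": instances}


def predict(payload):
    """Stand-in for the predictor; a real transformer would call it over HTTP or gRPC."""
    return {"predictions": [features[0] for features in payload["instances"]]}


def postprocess(response):
    """Transformer, step 2: turn raw scores into a business-friendly response."""
    return {
        "predictions": [
            {"label": "positive" if score > 0 else "negative", "score": score}
            for score in response["predictions"]
        ]
    }


def handle(request):
    """Full path of one request: preprocess, predict, postprocess."""
    return postprocess(predict(preprocess(request)))


print(handle({"instances": ["Great food", "awful service"]}))
```

Because only `predict` touches the model, you can swap the model or the pre-processing independently, which is the point of the separation.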

3.3 Inference graphs

Sometimes you need more than “one request in, one model out”:

Route requests to different model variants based on input
Run a pre‑model (e.g., routing or classification) that decides which expert model to call
Chain an embedding model → vector search → ranking model

KServe’s inference graph feature lets you define a small DAG (directed acyclic graph) of steps: each node can be a model, transformer, or external call. The graph itself is declared as config, which keeps your orchestration logic separate from model code.

For junior/mid‑level folks, you can think of it as “Kubeflow Pipelines, but for online inference instead of batch workflows.”
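
To illustrate the idea of a routing graph, here is a toy runner in plain Python. It is a conceptual sketch only and does not reflect KServe's actual graph configuration format: the node names, the `step`/`next` keys, and the ASCII-based language check are all invented for the example.

```python
def run_graph(graph, start, payload):
    """Walk the graph from `start`, feeding each node's output to the next node."""
    node_name = start
    trace = []
    while node_name is not None:
        node = graph[node_name]
        payload = node["step"](payload)      # run the model or transformer at this node
        trace.append(node_name)
        node_name = node["next"](payload)    # decide where the request goes next
    return payload, trace


# A pre-model (router) that decides which expert model handles the request.
graph = {
    "router": {
        "step": lambda p: {**p, "lang": "en" if p["text"].isascii() else "other"},
        "next": lambda p: "english_model" if p["lang"] == "en" else "multilingual_model",
    },
    "english_model": {
        "step": lambda p: {**p, "served_by": "english_model"},
        "next": lambda p: None,
    },
    "multilingual_model": {
        "step": lambda p: {**p, "served_by": "multilingual_model"},
        "next": lambda p: None,
    },
}

result, trace = run_graph(graph, "router", {"text": "hello"})
print(trace, result["served_by"])
```

In KServe the equivalent structure lives in configuration rather than in Python, so the models themselves stay unaware of the routing.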

3.4 Monitoring & scaling

KServe surfaces metrics such as:

Request counts
Latencies
Error rates

These hook into Prometheus/Grafana or whatever observability stack you use. On the scaling side, when KServe runs on Knative (its serverless mode), it supports:

Scale to zero (no pods when idle)
Autoscaling based on requests per second, concurrency, etc.
Smooth rollouts and rollbacks using revisions and traffic splitting

You can, for example, send 5% of traffic to a new model version and watch metrics before promoting it.
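
A percentage-based split behaves like weighted random routing per request. The simulation below is a rough sketch of that behavior, not of how Knative implements it; the function names and the revision labels `v1`/`v2` are assumptions.

```python
import random


def pick_revision(weights, rng):
    """Choose a revision for one request; weights maps revision name to percent."""
    revisions = list(weights)
    return rng.choices(revisions, weights=[weights[r] for r in revisions], k=1)[0]


def simulate(weights, n_requests, seed=0):
    """Count how many of n_requests land on each revision."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    counts = {revision: 0 for revision in weights}
    for _ in range(n_requests):
        counts[pick_revision(weights, rng)] += 1
    return counts


counts = simulate({"v1": 95, "v2": 5}, n_requests=10_000)
print(counts)
```

With only about 5% of requests reaching the new version, metrics for it accumulate slowly, so give a canary enough traffic and time before judging its error rate or latency.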

4. A typical workflow with KServe

Here’s what life usually looks like when you bring a model from training to production using KServe.

Train and export the model
- Example: train a binary text classifier in PyTorch and export a model.pt plus a small TorchServe handler.
Store the model artifact
- Upload to s3://ml-bucket/sentiment/v1/.
Prepare an InferenceService spec

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-v1
spec:
  predictor:
    pytorch:
      storageUri: s3://ml-bucket/sentiment/v1/
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"

Apply it to the cluster

kubectl apply -f sentiment-v1.yaml

Wait until it’s ready

kubectl get inferenceservices

Once the READY column shows True, KServe has set up the underlying service.
Call the endpoint
From a client (curl, Python, your app), send HTTP POST requests with the expected JSON shape. KServe routes them through the gateway → Knative → your predictor container.
Update & canary
When you train sentiment-v2, point a new InferenceService (or new revision) at s3://ml-bucket/sentiment/v2/, then gradually shift traffic from v1 to v2.
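
For the "Call the endpoint" step, the sketch below builds a request in the shape of KServe's V1 inference protocol: a POST to `/v1/models/<name>:predict` with an `instances` list. The base URL, the `Host` header value, and the helper name are placeholders for illustration, and the exact contents of `instances` depend on what your model handler expects. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances, host_header=None):
    """Build (but do not send) a V1-protocol predict request."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Some ingress setups route on the Host header; the value is cluster-specific.
        headers["Host"] = host_header
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_predict_request(
    "http://localhost:8080",                          # placeholder gateway address
    "sentiment-v1",
    ["great food"],
    host_header="sentiment-v1.default.example.com",   # placeholder host
)
print(request.get_method(), request.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read()))
```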

5. Why KServe is attractive for DS/ML engineers

For junior to mid‑level professionals, KServe hits a nice balance:

Pros

You don’t need to build REST servers for each model; you reuse battle‑tested runtimes.
Deployment is declarative – one YAML per model.
It handles autoscaling, networking, and versioning for you.
Works with a wide range of frameworks and model types (tabular, vision, NLP, LLMs).
Fits cleanly into MLOps workflows with CI/CD, GitOps, etc.

Trade‑offs

You still need a functioning Kubernetes cluster and basic knowledge of it.
Knative/Istio add complexity; debugging networking issues can be non‑trivial.
Serverless features like scale‑to‑zero introduce cold‑start latency for some workloads.

If your company already uses Kubernetes, leaning into KServe typically reduces the amount of custom serving infrastructure you have to maintain.

6. How to start learning KServe

A practical learning path:

Get a small K8s cluster
- Use Kind or Minikube locally, or a small managed cluster.
Install KServe
- Follow the official quickstart for your environment.
Deploy a sample model
- Use built‑in examples (e.g., sklearn iris, or a simple Hugging Face model).
- Call the endpoint from Python and inspect responses.
Add a transformer
- Put simple pre‑processing (e.g., text cleaning) into a transformer container and see how the request flow changes.
Experiment with versions & traffic splitting
- Deploy v1 and v2 of a model and gradually shift traffic.

By the time you’ve done those steps, the architecture in the image will feel much less abstract—you’ll see exactly how your InferenceService spec flows through KServe, Knative, and Kubernetes down to the hardware.

7. Recap

KServe is the model‑inference layer on top of Kubernetes, built with Knative and often Istio.
It standardizes how you deploy, scale, and manage models (both predictive and generative).
You describe your model endpoint with an InferenceService: predictors, transformers, explainers, resources, and storage URI.
Under the hood, KServe wires together model servers, traffic management, autoscaling, and observability.

For data scientists and ML engineers moving from experimentation into production, learning KServe gives you a powerful mental model—and a practical toolkit—for serving real models in real systems.