Pimp quantile regression with strong coverage guarantees
Suppose that we are given a historical dataset containing samples of the form $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i$ and $y_i$ are the $i$-th realizations of the (predictor) variable $X$ and of the (predicted) variable $Y$, respectively.
As a running example, let us consider the following dataset:
Our goal #1 is to estimate the trend of variable $Y$ against $X$ while evaluating the uncertainty of our prediction. In other words, point-wise predictions like those of classic regression do not satisfy us, and we aim at computing intervals where the realizations of $Y$ land with high (say, $1-\alpha$) coverage probability. More formally, we want to build a set-valued coverage function $C(\cdot)$ such that, for any new pair of realizations $(x_{n+1}, y_{n+1})$, it holds that:

$$\mathbb{P}\big(y_{n+1} \in C(x_{n+1})\big) \;\ge\; 1 - \alpha. \tag{1}$$
Our goal #2 is that the length of the prediction interval $C(x)$ should reflect the local dispersion of $Y$ given that $X = x$, for any possible value of $x$. This is especially desirable when the observation noise is heteroscedastic, i.e., its variance depends on the point it is applied to. This happens to be our case, since the realizations $y_i$ are more dispersed as the corresponding $x_i$ increase.
In this post we show how to achieve goals #1 and #2 via conformal quantile regression (CQR), first proposed by Y. Romano, E. Patterson and E. Candès in [1], which applies conformal prediction (CP) to quantile regression (QR) via a powerful and surprisingly simple-to-implement procedure.
First, let us investigate how QR and “vanilla” CP alone would address our problem, and how each fails to jointly achieve our goals.
1. Via quantile regression (QR)
As explained in a previous post [link], the $\beta$-th quantile regressor $\hat{q}_{\beta}$ estimates the conditional $\beta$-th quantile of the predicted variable $Y$ given the predictor variable $X$.
Then, a natural attempt to build a prediction interval with $1-\alpha$ coverage is the following.
- Train a low-quantile regressor $\hat{q}_{\alpha/2}$ on all historical samples
- Train a high-quantile regressor $\hat{q}_{1-\alpha/2}$ on all historical samples
- Approximate the coverage function as $C(x) = \big[\hat{q}_{\alpha/2}(x),\; \hat{q}_{1-\alpha/2}(x)\big] \tag{2}$
If we apply this procedure to our initial dataset with miscoverage level $\alpha$ (hence, the coverage probability is $1-\alpha$) using linear QR with polynomial features of order 2, we obtain the following result.
Pros: Goal #2 is achieved: $C(x)$ is an interval whose length depends on the value of $x$, which is desirable in the presence of heteroscedastic noise, as in our case.
Cons: Goal #1 is not achieved. In fact, the coverage guarantee holds only asymptotically, as the number of training samples grows to infinity, and under some technical assumptions [1].
2. Via vanilla conformal prediction (CP)
We discussed CP in a previous post [link].
The “vanilla” CP procedure applied to our problem goes as follows:
- Randomly split the historical data points into a training set $\mathcal{D}_{\mathrm{train}}$ and a calibration set $\mathcal{D}_{\mathrm{cal}}$
- Train a least-squares regressor $\hat{f}$ on the training set
- Evaluate the performance of $\hat{f}$ on the calibration set by computing the scores: $s_i = |y_i - \hat{f}(x_i)|$, for all $i \in \mathcal{D}_{\mathrm{cal}}$
- Compute the empirical $(1-\alpha)$-th quantile of the scores: $\hat{s}$, the $\lceil (1-\alpha)(|\mathcal{D}_{\mathrm{cal}}|+1) \rceil$-th smallest score
- Output the coverage function: $C(x) = \big[\hat{f}(x) - \hat{s},\; \hat{f}(x) + \hat{s}\big]$
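A minimal sketch of these steps, on the same kind of synthetic heteroscedastic data as before (dataset and split sizes are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic heteroscedastic dataset.
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=500)
y = x + (0.1 + 0.3 * x) * rng.standard_normal(500)
X = x.reshape(-1, 1)

alpha = 0.1

# Random split into training and calibration sets.
idx = rng.permutation(len(x))
train, cal = idx[:250], idx[250:]

# Least-squares regressor with degree-2 polynomial features.
f = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
f.fit(X[train], y[train])

# Absolute-residual scores on the calibration set.
scores = np.abs(y[cal] - f.predict(X[cal]))

# Empirical (1 - alpha)-quantile with the finite-sample correction.
n_cal = len(cal)
s_hat = np.quantile(scores, np.ceil((1 - alpha) * (n_cal + 1)) / n_cal)

# Constant-width interval C(x) = [f(x) - s_hat, f(x) + s_hat].
x_new = np.array([[1.0], [4.0]])
lo, hi = f.predict(x_new) - s_hat, f.predict(x_new) + s_hat
print(np.c_[lo, hi])  # same width 2*s_hat at every x
```

The interval width $2\hat{s}$ is the same everywhere, which is exactly the limitation discussed next.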
If applied to our initial dataset with a linear least-squares regressor using polynomial features of order 2, vanilla CP produces the following result:
Pros: Goal #1 is achieved: in fact, under some technical assumptions (see Theorem 2 of our previous post [link]), it holds that, for any new pair of realizations $(x_{n+1}, y_{n+1})$, $\mathbb{P}\big(y_{n+1} \in C(x_{n+1})\big) \ge 1 - \alpha$.
Cons: Goal #2 is not achieved: the length of the prediction interval is constant across all values of $x$, which is somewhat disappointing, especially in the case of heteroscedastic noise, as in our case.
3. Conformalized quantile regression (CQR)
The question arises naturally: Can we take the best of both worlds, QR and CP, and jointly achieve goals #1 and #2?
The answer is yes, via Conformalized Quantile Regression (CQR) [1]. CQR is a direct application of the split CP paradigm to QR, which tweaks the empirical coverage interval defined in (2) to ensure that its coverage probability is (approximately) $1-\alpha$ in the finite-sample regime.
CQR Procedure.
- Randomly split the historical data points into a training set $\mathcal{D}_{\mathrm{train}}$ and a calibration set $\mathcal{D}_{\mathrm{cal}}$
- Train a lower quantile regressor $\hat{q}_{\alpha/2}$ on the training samples
- Train an upper quantile regressor $\hat{q}_{1-\alpha/2}$ on the training samples
- Compute on the calibration points the scores: $s_i = \max\big\{\hat{q}_{\alpha/2}(x_i) - y_i,\; y_i - \hat{q}_{1-\alpha/2}(x_i)\big\}$, for all $i \in \mathcal{D}_{\mathrm{cal}}$
- Compute the empirical $(1-\alpha)$-th quantile of the scores: $\hat{s}$, the $\lceil (1-\alpha)(|\mathcal{D}_{\mathrm{cal}}|+1) \rceil$-th smallest score
- Output the coverage interval: $C(x) = \big[\hat{q}_{\alpha/2}(x) - \hat{s},\; \hat{q}_{1-\alpha/2}(x) + \hat{s}\big]$
To build more intuition, it is instructive to consider the following three cases:
- Case a): the interval built via QR alone is already conformal. In this case, a $1-\alpha$ portion of the calibration samples land within $\big[\hat{q}_{\alpha/2}(x_i), \hat{q}_{1-\alpha/2}(x_i)\big]$ (hence, their score is negative) while the remaining $\alpha$ portion fall outside (hence, their score is positive). Thus, the empirical $(1-\alpha)$-th quantile of the scores will be zero, i.e., $\hat{s} = 0$, and $C(x)$ coincides with the QR interval.
- Case b): the QR interval under-covers the calibration samples, i.e., a portion of calibration samples smaller than $1-\alpha$ fall inside it. Then, a portion of scores smaller than $1-\alpha$ is negative, hence $\hat{s}$ is positive, hence the resulting interval $C(x)$ is larger than the non-conformal QR interval.
- Case c): the QR interval over-covers the calibration samples, i.e., a portion of calibration samples higher than $1-\alpha$ fall inside it. Then, following a similar reasoning, $\hat{s}$ is negative and the non-conformal QR interval is shrunk to finally obtain $C(x)$.
If applied to our initial dataset, CQR provides the following result.
Note that, in this case, the QR interval under-covers the calibration samples. In fact, the calibration score distribution is as follows:
Goal #2 is achieved by CQR, since CQR merely displaces the quantile curves fitted by QR by a constant offset $\hat{s}$. We conclude by formally proving that the CQR procedure achieves goal #1 as well, i.e., the guarantee on the coverage probability.
Theorem 1. Let $(x_{n+1}, y_{n+1})$ be a new sample with score $s_{n+1}$. Suppose that the scores $s_1, \dots, s_{n+1}$ are exchangeable random variables. Then, the prediction interval $C(x_{n+1})$ satisfies:

$$1 - \alpha \;\le\; \mathbb{P}\big(y_{n+1} \in C(x_{n+1})\big) \;\le\; 1 - \alpha + \frac{1}{|\mathcal{D}_{\mathrm{cal}}| + 1},$$

where the second inequality holds if the scores are almost surely distinct.
Proof: The event $y_{n+1} \in C(x_{n+1})$ holds whenever

$$\hat{q}_{\alpha/2}(x_{n+1}) - \hat{s} \;\le\; y_{n+1} \;\le\; \hat{q}_{1-\alpha/2}(x_{n+1}) + \hat{s}.$$

By rearranging the terms, we obtain the equivalent expression:

$$\max\big\{\hat{q}_{\alpha/2}(x_{n+1}) - y_{n+1},\; y_{n+1} - \hat{q}_{1-\alpha/2}(x_{n+1})\big\} \;\le\; \hat{s}.$$

It follows that $y_{n+1} \in C(x_{n+1})$ if and only if $s_{n+1} \le \hat{s}$. The thesis then follows from Lemma 1 and Theorem 2 of our previous post [link].
3.1. Discussion: Marginal vs Conditional coverage
It is important to understand that the coverage probability in (1), which is required to be approximately $1-\alpha$, is marginalized over the training and calibration samples and, most importantly, over all possible new pairs $(x_{n+1}, y_{n+1})$. Intuitively, this means that if we draw a large number of different calibration samples and of corresponding new points $(x_{n+1}, y_{n+1})$, then a $1-\alpha$ portion of the new points will lie within the respective coverage interval.
On the other hand, the coverage guarantee does not hold conditionally, i.e., in a point-wise fashion: in general, $\mathbb{P}\big(y_{n+1} \in C(x_{n+1}) \,\big|\, x_{n+1} = x\big) \ne 1 - \alpha$. In other words, if we fix $x$ and draw a large number of realizations of $Y$ given $X = x$, the portion of them falling inside $C(x)$ will generally differ from $1-\alpha$.
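The marginal guarantee can be checked with a quick Monte Carlo sketch of the calibration step alone (a toy setup with synthetic scores, unrelated to our dataset): averaged over many redraws of the calibration set and of the new point, coverage concentrates around $1-\alpha$, even though nothing is promised at any fixed $x$.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n_cal, n_trials = 0.1, 200, 2000

hits = 0
for _ in range(n_trials):
    # Fresh calibration scores and a fresh test score in each trial
    # (toy model: scores are absolute values of standard normals).
    cal_scores = np.abs(rng.standard_normal(n_cal))
    s_hat = np.quantile(cal_scores,
                        np.ceil((1 - alpha) * (n_cal + 1)) / n_cal)
    hits += np.abs(rng.standard_normal()) <= s_hat

coverage = hits / n_trials
print(coverage)  # concentrates around 1 - alpha = 0.9
```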
This is illustrated below, where we compare the output of CQR with the actual (point-wise, i.e., conditional) confidence interval of the distribution that the observations were drawn from.
References
[1] Romano, Y., Patterson, E., Candès, E. (2019). Conformalized quantile regression. Advances in Neural Information Processing Systems, 32.
[2] Manokhin, V. Awesome conformal prediction, GitHub repository.