Quantile regression

An expressive and robust alternative to least squares

For regression problems, least squares regression (LSR) arguably gets the lion's share of data scientists' attention. The reasons are several: LSR is taught in virtually every introductory statistics course, it is intuitive, and it is readily available in most software libraries.

LSR estimates the mean of the predicted variable {Y} as a function of the value of the observed predictor variable {X}. However, when the dispersion of {Y} around its (conditional) mean is considerable, i) LSR generally fails to produce actionable results, since the mean is not informative about the likelihood of extreme values, which are often the most critical to handle. In this case, a method that can estimate the whole distribution of {Y} conditioned on {X} would be preferable.

Moreover, ii) LSR is not robust to outliers: the presence of a few “corrupted” points can largely degrade the quality of the regression.

Quantile regression (QR) is a method that addresses both concerns i) and ii) above:

i) QR can estimate any quantile of the distribution of the predicted variable {Y} conditioned on the predictor {X};

ii) QR is more robust to the presence of outliers than classic LSR.

Next, we first introduce the notion of (conditional) quantile before delving into QR, via an instructive parallel with LSR.

1. Quantile of a random variable

Let us consider a random variable {Y} taking on values in {\mathcal Y\subset \mathbb R}. Informally speaking, the {\tau}-th quantile of {Y} is the value that {Y} does not exceed with probability {\tau}. For instance, if a person’s height is the {0.95}-th quantile, then that person is taller than {95\%} of the population.

This concept can be directly formalized if the cumulative distribution function (CDF) {F_Y(y):=\Pr(Y\le y)} is invertible. This holds if {Y} is a continuous random variable that can take on any value within an interval. In this case, the {\tau}-th quantile {\mathcal Q^{\tau}[Y]} is such that {\Pr(Y\le \mathcal Q^{\tau}[Y])=F_Y(\mathcal{Q}^{\tau}[Y])=\tau}, i.e.,

\displaystyle \mathcal Q^{\tau}[Y] := F_Y^{-1}(\tau), \quad \mathrm{if } \ F_Y^{-1} \ \mathrm{exists}. \ \ \ \ \ (1)

However, for discrete variables, the CDF {F_Y} is discontinuous and piecewise constant, hence {F_Y} is not invertible and (1) is not applicable. To understand this, imagine that {Y} is a binary variable that takes on values 0 and 1, each with probability {0.5}. Then,

\displaystyle F_Y(y) = \left\{ \begin{array}{ll} 0 \ \ \quad \mathrm{if} \ y<0 \\ 0.5 \quad \mathrm{if} \ y \in [0,1) \\ 1 \ \ \quad \mathrm{if} \ y\ge 1 \end{array} \right. \ \ \ \ \ (2)

Suppose now that we want to compute the {0.95}-th quantile of {Y}. Clearly, there is no value {y} at which {F_Y(y)=0.95}, hence {F_Y} cannot be inverted at {\tau=0.95}. It is then clear that a more general definition of quantile is needed.

To cater for non-invertible CDFs, expression (1) is then generalized as the minimum value at which the cumulative distribution function reaches or exceeds {\tau}.

Definition 1 The {\tau}-th quantile {\mathcal{Q}^{\tau}[Y]} of random variable {Y} is defined as:

\displaystyle \mathcal{Q}^{\tau}[Y] := \inf \left\{y: \, F_Y(y)\ge \tau \right\}. \ \ \ \ \ (3)

Thus, in the example above, the {0.95}-th quantile of the binary variable is indeed 1.
Note also that, when {F_Y} is invertible, definition (3) boils down to (1).

Estimation. In most practical cases, we do not have access to the CDF {F_Y}. Instead, we can rely on a historical dataset containing {n} independent realizations {y_1,\dots,y_n} of the random variable {Y}. In this case, the {\tau}-th quantile {\mathcal Q^{\tau}[Y]} can be estimated as the {\lceil \tau n \rceil}-th smallest realization. If we suppose that the realizations are sorted in increasing order ({y_1\le y_2\le \dots \le y_n}), the sample quantile equals:

\displaystyle \widetilde{Q}^{\tau}[Y] = y_{\lceil \tau n \rceil}. \ \ \ \ \ (4)
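
As a concrete illustration, formula (4) amounts to a couple of lines of numpy; the sketch below is purely illustrative (the data and the chosen {\tau} are arbitrary).

```python
import numpy as np

def sample_quantile(y, tau):
    """Sample tau-th quantile as the ceil(tau*n)-th smallest realization, cf. (4)."""
    y_sorted = np.sort(y)                 # y_1 <= y_2 <= ... <= y_n
    n = len(y_sorted)
    k = int(np.ceil(tau * n))             # rank ceil(tau*n), 1-based
    return y_sorted[k - 1]                # convert to 0-based indexing

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)               # realizations of Y ~ N(0, 1)
print(sample_quantile(y, 0.95))           # close to the true 0.95-quantile, about 1.645
```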

Conditional quantile. Extending the definition of quantile to conditioned random variables is straightforward. Suppose that {Y} is correlated with another variable {X}, called the predictor, which is observed and takes on values within {\mathcal X\subset \mathbb R^d} with {d\ge 1}. We call {F_{Y|X=x}} the CDF of the variable {Y} conditioned on the observed value {x} of {X}. Then, the conditional {\tau}-th quantile {\mathcal Q^{\tau}[Y|X=x]} is defined similarly to (3).

Definition 2 The {\tau}-th conditional quantile {\mathcal{Q}^{\tau}[Y|X]} of random variable {Y} given variable {X} is defined as:

\displaystyle \mathcal{Q}^{\tau}[Y|X=x] := \inf \left\{y: \, F_{Y|X=x}(y)\ge \tau \right\}, \quad \forall x\in \mathcal X. \ \ \ \ \ (5)

2. Regression: Quantile vs. Least squares

The goal of regression is to estimate a predefined target statistic {\mathcal S[Y|X]} of the predicted variable {Y} upon observing a realization of the predictor variable {X}.

  • If such statistic is the conditional mean, i.e., {\mathcal S[Y|X]:=\mathbb E[Y|X]}, then we obtain classic least squares regression (LSR);
  • If such statistic is the {\tau}-th conditional quantile, i.e., {\mathcal S[Y|X]:=\mathcal Q^{\tau}[Y|X]}, then we obtain quantile regression (QR).

Assume now that we have observed {n} independent joint realizations of variables {X} and {Y}, denoted as {\{(x^i,y^i)\}_{i=1,\dots,n}}.

A naive approach to quantile regression would prescribe, for each {x\in \mathcal X}, to aggregate the observation pairs {(x^i,y^i)} where the predictor variable takes on the same value {x^i=x} and to apply the formula for the sample quantile (4) to the corresponding values {y^i}. However, this method fails miserably if {X} is a continuous variable, whose realizations {x^i} are almost surely distinct; in this case, one would end up computing sample quantiles over… one sample!

A better (but still unsatisfactory) approach would be to cluster the observations into bins where the values {x^i} are “close”, and compute (4) for each bin. Yet, this method does not scale as the dimension {d} of variable {X} grows, since the number of bins required grows exponentially with {d}.
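
For concreteness, here is a minimal numpy sketch of this binning heuristic for a one-dimensional predictor; the bin count, data and names are illustrative assumptions.

```python
import numpy as np

def binned_conditional_quantile(x, y, tau, n_bins=10):
    """Estimate Q^tau[Y|X=x] by binning x and taking a sample quantile of y in each bin."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_id = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    quantiles = np.array([np.quantile(y[bin_id == b], tau) for b in range(n_bins)])
    return centers, quantiles

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 5_000)
y = 2 * x + rng.normal(scale=0.5 + x, size=5_000)   # heteroscedastic noise
centers, q95 = binned_conditional_quantile(x, y, tau=0.95)
print(np.round(q95, 2))                             # rough estimate of Q^0.95[Y|X] per bin
```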

To introduce QR, let us take a step back and analyze general (hence, not necessarily quantile) regression problems.

Let {h:\mathcal X\rightarrow \mathcal Y} be a “guess” for the desired statistic {\mathcal S[Y|X]}. In other words, if it has been observed that {X=x}, then {h(x)} guesses the statistic {\mathcal S[Y|X=x]}. Then, the regression procedure is as follows.

  1. Design a loss function {\ell:\mathcal Y\times \mathcal Y\rightarrow \mathbb R} such that the guess function {h} that minimizes the expected loss is precisely the target statistic {\mathcal S[Y|X]}:
    \displaystyle \mathcal S[Y|X] = \mathrm{argmin}_{h: \mathcal X\rightarrow \mathcal Y} \, \mathbb E_{(X,Y)\sim p_{X,Y}} \ell(Y,h(X)). \ \ \ \ \ (6)
    Recall that {\mathcal S[Y|X]} is to be meant as a function that, for each possible value {x\in\mathcal X}, produces the statistic {\mathcal S[Y|X=x]} of predicted variable {Y} given that the predictor {X} takes on value {x}.
  2. Define a parametrized class of functions {h_{\theta}:\mathcal X\rightarrow \mathcal Y} where {\theta\in \mathbb R^m} is the vector of parameters.
    • If {h_{\theta}(x):=\sum_{j=1}^d \theta_j x_j}, where {x_j} is the {j}-th component of {x}, then the regression is linear;
    • In case of deep learning, {\theta} is the collection of weights and biases of a neural network.
  3. Minimize the empirical loss:
    \displaystyle \theta^*=\mathrm{argmin}_{\theta} \sum_{i=1}^n \ell\left(y^i,h_{\theta}(x^i)\right) + \mathcal R(\theta) \ \ \ \ \ (7)
    where {\mathcal R} is an optional regularizer that penalizes extreme values of {\theta} (e.g., {\mathcal{R}(\theta)=\sum_j \theta_j^2}).
    Finally, {h_{\theta^*}} is our regressor; a minimal code sketch of the whole procedure is given below.
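
For concreteness, the three steps above can be condensed into a generic sketch: the helper below (function names, the quadratic regularizer and the derivative-free solver are illustrative choices, not a prescription) fits a linear model class by minimizing the empirical loss (7) with scipy.

```python
import numpy as np
from scipy.optimize import minimize

def fit(X, y, loss, reg=0.0):
    """Generic empirical loss minimization (7) for a linear model h_theta(x) = X @ theta."""
    def objective(theta):
        return np.sum(loss(y, X @ theta)) + reg * np.sum(theta ** 2)
    theta0 = np.zeros(X.shape[1])
    # Nelder-Mead is derivative-free, so it also handles non-smooth losses such as the QR loss.
    return minimize(objective, theta0, method="Nelder-Mead").x

# Step 1 chooses the loss; steps 2-3 are handled by `fit`.
squared_loss = lambda y, pred: (y - pred) ** 2        # leads to LSR, cf. Section 2.1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=200)
print(fit(X, y, squared_loss))                        # roughly [1, -2]
```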

Next we discuss how to go about steps 1 and 3 in the case of LSR and QR.

2.1. Loss function design

To design the appropriate loss function for QR, it is instructive to draw a parallel with its more popular LSR counterpart.

Least squares regression (LSR) draws its name precisely from its loss function, which is the square of the residual {y - h(x)}:

\displaystyle \ell^{\mathrm{LSR}}(y,h(x)) := (y-h(x))^2, \quad \forall\, x\in \mathcal X,y\in\mathcal Y. \ \ \ \ \ (8)

The loss function {\ell^{\mathrm{LSR}}}, to be minimized, encourages the guess {h(x)} to stay as close as possible to the true value {y}. It is then intuitive that, out of all possible functions {h} mapping {\mathcal X} to {\mathcal Y}, the expected LSR loss is minimized by the conditional mean {\mathbb E[Y|X]}, as prescribed by expression (6).

Theorem 3 The conditional mean {\mathbb E[Y|X]}, meant as a function that maps each {x\in \mathcal X} to {\mathbb E[Y|X=x]}, minimizes the expected loss:

\displaystyle \mathbb E[Y|X] = \mathrm{argmin}_{h: \mathcal X\rightarrow \mathcal Y} \, \mathbb E_{(X,Y)\sim p_{X,Y}} \ell^{\mathrm{LSR}}(Y,h(X)) \ \ \ \ \ (9)

where the expectation is with respect to the joint distribution {p_{X,Y}} of {X} and {Y}.

Proof: We first prove that, for any fixed {x\in \mathcal X} and for all {h(x)\in \mathbb R},

\displaystyle \mathbb E_{Y\sim p_{Y|X=x}} \ell^{\mathrm{LSR}}\left(Y,\mathbb E[Y|X=x]\right) \le \mathbb E_{Y\sim p_{Y|X=x}} \ell^{\mathrm{LSR}}\left(Y, h(x)\right). \ \ \ \ \ (10)

Hereafter, for simplicity of notation we will omit the dependency on {x}. Then, we obtain:

\displaystyle \mathbb E\left[\ell^{\mathrm{LSR}}\left(Y,h\right)\right]=\mathbb E\left[(Y-h)^2\right] = \mathbb E\left[ \left( Y - \mathbb E[Y] + \mathbb E[Y] - h \right)^2 \right] \ \ \ \ \ (11)

\displaystyle \quad = \mathbb E\left[ \left( Y - \mathbb E[Y] \right)^2 \right] + \left( \mathbb E[Y] - h \right)^2 + 2 \left(\mathbb E[Y] - h \right) \mathbb E \left[ Y - \mathbb E[Y]\right] \ \ \ \ \ (12)

The first term does not depend on {h}; the second term is nonnegative since {(\mathbb E[Y] - h)^2\ge 0} and vanishes at {h=\mathbb E[Y]}; the third term is null since {\mathbb E \left[ Y - \mathbb E[Y]\right]=0}. It follows that (10) holds. The thesis follows from taking the expectation of both sides of (10) with respect to the predictor {X}. \Box
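
A quick Monte Carlo sanity check of Theorem 3 (the skewed distribution and grid are purely illustrative): among a grid of candidate guesses {h}, the empirical squared loss is minimized near the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)      # a skewed Y with true mean 2
grid = np.linspace(0, 5, 501)                     # candidate guesses h
expected_loss = [np.mean((y - h) ** 2) for h in grid]
print(grid[np.argmin(expected_loss)], y.mean())   # both close to 2
```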

Quantile regression (QR). Similarly to the LSR case, the loss function for the {\tau}-th QR shall be designed so that its expectation is minimized by the {\tau}-th quantile {\mathcal Q^{\tau}[Y|X]}. It turns out that, as we will prove next, the desired loss function reads:

\displaystyle \ell^{\tau\mathrm{QR}}(y,h(x)) := \left\{ \begin{array}{ll} (1 - \tau) |y-h(x)|, \quad \mathrm{if} \ y-h(x)\le 0 \\ \tau |y-h(x)|, \quad \mathrm{if} \ y-h(x)>0. \end{array} \right. \ \ \ \ \ (13)

Note that, in contrast to LSR, residuals are weighted differently depending on their sign. Moreover, the loss function increases only linearly with the magnitude of the residuals. Interestingly, when our target statistic is the conditional median (i.e., {\tau=0.5}), the loss is simply proportional to the absolute residual (it equals half of it).
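
In code, the loss (13), commonly known as the pinball (or check) loss, is essentially a one-liner; a minimal numpy sketch:

```python
import numpy as np

def pinball_loss(y, pred, tau):
    """Quantile-regression loss (13): residuals weighted tau if positive, (1-tau) if negative."""
    residual = y - pred
    return np.where(residual > 0, tau * residual, (1 - tau) * (-residual))

print(pinball_loss(np.array([1.0, 1.0]), np.array([0.0, 2.0]), tau=0.9))  # [0.9, 0.1]
```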

Theorem 4 The conditional quantile {\mathcal Q^{\tau}[Y|X]}, meant as a function that maps each {x\in \mathcal X} to {\mathcal Q^{\tau}[Y|X=x]}, minimizes the expected loss:

\displaystyle \mathcal Q^{\tau}[Y|X] = \mathrm{argmin}_{h: \mathcal X\rightarrow \mathcal Y} \, \mathbb E_{(X,Y)\sim p_{X,Y}} \ell^{\tau \mathrm{QR}}(Y,h(X)). \ \ \ \ \ (14)

Proof: We prove the thesis under the assumption that the conditional CDF {F_{Y|X=x}} is invertible. Similarly to Theorem 3, we first prove that, for any fixed {x\in \mathcal X} and for all {h(x)\in \mathbb R},

\displaystyle \mathbb E_{Y\sim p_{Y|X=x}} \ell^{\tau\mathrm{QR}}\left(Y,\mathcal Q^{\tau}[Y|X=x]\right) \le \mathbb E_{Y\sim p_{Y|X=x}} \ell^{\tau \mathrm{QR}}\left(Y, h(x)\right). \ \ \ \ \ (15)

By omitting the dependency on {x} for simplicity of notation and unfolding the expectation, proving (15) amounts to showing that:

\displaystyle \mathcal Q^{\tau}[Y] = \mathrm{argmin}_h \, (\tau-1) \int_{-\infty}^{h} (y - h) p_Y(y) dy + \tau \int_{h}^{\infty} (y-h) p_Y(y) dy \ \ \ \ \ (16)

To prove (16) we take the derivative of the function to be minimized with respect to {h} and set it to 0. The Leibniz integration rule, which we report below, comes in handy:

\displaystyle \frac{d}{du} \int_{a(u)}^{b(u)} f(u,t) dt\Big|_{u=\bar{u}} = - f(\bar u,a(\bar{u})) \frac{d}{du} a(u)\Big|_{u=\bar{u}} +

\displaystyle \qquad + f(\bar u,b(\bar{u})) \frac{d}{du} b(u)\Big|_{u=\bar{u}} + \int_{a(\bar u)}^{b(\bar u)} \frac{\partial}{\partial u} f(u,t) \Big|_{u=\bar u} dt \ \ \ \ \ (17)

By applying the Leibniz rule to the terms in (16) we obtain:

\displaystyle \frac{d}{dh} \left[ (\tau-1) \int_{-\infty}^{h} (y - h) p_Y(y) dy + \tau \int_{h}^{\infty} (y-h) p_Y(y) dy \right]

\displaystyle = (1-\tau) F_Y(h) - \tau (1-F_Y(h))=0 \ \ \ \ \ (18)

\displaystyle \Rightarrow \ F_Y(h)=\tau \ \ \ \ \ (19)

Thus, the optimal guess is {h=F_Y^{-1}(\tau)=\mathcal Q^{\tau}[Y]}, which proves (16) and (15). The thesis follows from taking the expectation of both sides of (15) with respect to {X}. \Box

To gain some intuition about the result of Theorem 4, it is useful to consider the simple case where the predictor {X} is absent and {Y} is a uniform random variable between 0 and 1. In this case, equation (14) boils down to:

\displaystyle \min_{h\in[0,1]} (1-\tau) \int_0^h |y-h| dy + \tau \int_h^1 |y-h| dy \ \ \ \ \ (20)

\displaystyle = \min_{h\in[0,1]} \frac{1}{2}h^2 (1-\tau) + \frac{1}{2}(1-h)^2 \tau \ \ \ \ \ (21)

By taking the derivative with respect to {h} and setting it to 0 we obtain, as expected, that the optimal {h} is indeed {\tau}:

\displaystyle h (1-\tau) - \tau (1-h) = 0 \ \Rightarrow \ h=\tau. \ \ \ \ \ (22)
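
This can also be verified numerically: the short sketch below evaluates the empirical {\tau}-QR loss of a Uniform(0,1) sample over a grid of guesses {h} and recovers {h\approx\tau} (the value {\tau=0.7} and the sample size are arbitrary illustrative choices).

```python
import numpy as np

tau = 0.7
rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 200_000)                 # Y ~ Uniform(0, 1)
grid = np.linspace(0, 1, 1001)                 # candidate guesses h

def empirical_qr_loss(h):
    r = y - h
    return np.mean(np.where(r > 0, tau * r, (1 - tau) * (-r)))   # loss (13) averaged over samples

losses = [empirical_qr_loss(h) for h in grid]
print(grid[np.argmin(losses)])                 # close to tau = 0.7
```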

2.2. Empirical loss minimization

For simplicity, we here confine ourselves to linear regression, where the guessing function takes the form:

\displaystyle h_{\theta}(x):= \sum_{j=1}^d \theta_j x_j. \ \ \ \ \ (23)

Also, we discard the regularizer, and we aim at minimizing the empirical sum of losses:

\displaystyle \theta^* = \mathrm{argmin}_\theta \sum_{i=1}^n \ell\left(y^i,h_{\theta}(x^i)\right). \ \ \ \ \ (24)

Finally, our regressor will be {h_{\theta^*}}.

Least squares regression (LSR). If we set the loss {\ell:=\ell^{\mathrm{LSR}}} in (24), then we can rewrite the problem as:

\displaystyle \min_\theta \, \sum_{i=1}^n \Big(y^i - \sum_{j=1}^d \theta_j x_j^i \Big)^2 \ \ \ \ \ (25)

\displaystyle = \min_\theta \, (\mathbf{y} - \mathbf{X}\theta)^T (\mathbf{y} - \mathbf{X}\theta) \ \ \ \ \ (26)

where {\mathbf{y}} is the column vector of observations {[y^i]_i} of the predicted {Y} and {\mathbf{X}} is the matrix of observations of the predictor {X}, whose {j}-th column (multiplying {\theta_j}) contains the {n} observations {[x_j^i]_i} of the {j}-th predictor variable. By computing the gradient of (26) with respect to {\theta} and setting it to 0 we obtain the classic linear LSR formula:

\displaystyle \nabla_{\theta} (\mathbf{y} - \mathbf{X}\theta)^T (\mathbf{y} - \mathbf{X}\theta) = -2 \mathbf{X}^T (\mathbf{y}-\mathbf{X}\theta^*) = 0 \ \ \ \ \ (27)

\displaystyle \Rightarrow \ \theta^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \ \ \ \ \ (28)
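
Formula (28) translates directly into a few lines of numpy; the sketch below uses synthetic, illustrative data and solves the normal equations rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y, cf. (28)
print(theta_star)                                # close to [1, -2, 0.5]
```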

When non-linear regression is used, in general there exists no closed-form formula for {\theta^*}, which has to be computed or approximated via numerical methods.

Quantile regression (QR). Even for linear QR, no closed-form formula for the optimal {\theta} exists. One can rather compute it via linear programming, as we show next.

We first rewrite the residual as {y^i-\sum_{j=1}^d \theta_j x_j^i=v^+_i - v^-_i}, where {v^+_i} is its positive part, i.e., {v^+_i=\max(y^i-\sum_{j=1}^d \theta_j x_j^i,0)}, and {v^-_i} is its negative part, i.e., {v^-_i=\max(\sum_{j=1}^d \theta_j x_j^i-y^i,0)}. Then, we can re-express the empirical loss minimization problem (24), with {\ell:=\ell^{\tau \mathrm{QR}}}, as the following linear program:

\displaystyle \min_{\theta,v^+,v^-} \, \sum_{i=1}^n (1-\tau)v^-_i + \tau v^+_i \ \ \ \ \ (29)

\displaystyle \mathrm{s.t.} \ y^i-\sum_{j=1}^d \theta_j x_j^i=v^+_i - v^-_i, \quad \forall\, i=1,\dots,n \ \ \ \ \ (30)

\displaystyle \quad v^+_i,v^-_i\ge 0, \quad \forall \, i=1,\dots,n. \ \ \ \ \ (31)
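
The linear program (29)-(31) can be assembled and solved, for instance, with scipy's linprog. The sketch below stacks the decision variables as (theta, v+, v-); the data, names and the added intercept column are illustrative choices, not a prescription of how reference [5] formulates it.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau):
    """Solve the linear program (29)-(31) for linear quantile regression."""
    n, d = X.shape
    # Decision variables: z = [theta (d), v_plus (n), v_minus (n)]
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])   # objective (29)
    # Equality constraints (30): X theta + v_plus - v_minus = y
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    b_eq = y
    # Bounds (31): theta free, v_plus >= 0, v_minus >= 0
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:d]

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])   # intercept + 2 features
y = X @ np.array([0.0, 1.0, -1.0]) + rng.normal(size=300)
print(quantile_regression_lp(X, y, tau=0.9))   # intercept near 1.28, the 0.9-quantile of the noise
```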

3. Robustness to outliers

Real data are often messy and may contain extreme “corrupted” values, commonly called outliers. It turns out that both quantile estimation and regression are robust to outliers. We provide some intuitions below.

Estimation. Let us start with the simpler case where the predictor variable {X} is absent, and the {\tau}-th quantile is estimated as the {\lceil \tau n \rceil}-th smallest realization of variable {Y}, as in (4). In this case, even if we corrupt the dataset with arbitrarily large outliers, the sample quantile will still lie within the original, uncorrupted set of values, as long as fewer than {\lceil \tau n \rceil} points are pushed towards {-\infty} and fewer than {n-\lceil \tau n \rceil+1} points are pushed towards {+\infty}. More formally, we say that the breakdown point of {\widetilde{Q}^{\tau}} is {\min(\lceil \tau n \rceil, n-\lceil \tau n \rceil+1)/n}, which tends to {\min(\tau,1-\tau)} as {n} tends to infinity.

In contrast, the sample mean {\frac{1}{n} \sum_{i=1}^n y^i} is not robust to outliers. In fact, by corrupting a single data point, one can let the sample mean take on an arbitrary value. Equivalently, we say that the breakdown point of the sample mean is {1/n}, which tends to 0 as {n} grows.

Importantly, if the distribution of {Y} is symmetric around its mean, then the {0.5}-th quantile, also called the median, coincides with the mean. This suggests that, in this case, the sample median {y_{\lceil n/2 \rceil}} is a more robust estimator of the mean than the sample mean itself.
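
As a quick illustration of the two breakdown points (numbers purely illustrative): corrupting a single observation drags the sample mean arbitrarily far, while the sample median barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, size=1_000)          # clean data, mean = median = 10
y_corrupted = y.copy()
y_corrupted[0] = 1e9                          # a single arbitrarily large outlier

print(np.mean(y), np.mean(y_corrupted))       # mean explodes to ~1e6
print(np.median(y), np.median(y_corrupted))   # median is essentially unchanged
```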

Regression. Once again, it is instructive to draw a parallel between QR and LSR. In the loss function of LSR (8), residuals are squared: if an outlier has a residual 4 times larger than a non-outlier, then its associated loss is 16 times bigger. On the other hand, the loss in QR (13) is simply proportional to the residual magnitude, hence the outlier incurs a loss only 4 times higher than the non-outlier.

As a result, in the quest for minimizing its loss, LSR has a natural tendency to “over-react” to the presence of outliers by bringing {h_{\theta}} closer to them than QR does. In the figure below we demonstrate this on numerical examples, produced via scikit-learn [4], where we compare LSR to its alter ego: median regression, i.e., the {0.5}-th QR.
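
A comparison of this kind can be reproduced along the following lines with scikit-learn [4]; the data below are an illustrative example with a few outliers placed at large values of {x}, not the exact setup of the figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor  # QuantileRegressor: sklearn >= 1.0

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(size=200)
outliers = np.argsort(X.ravel())[-10:]            # corrupt the 10 points with the largest x
y[outliers] += 100.0

lsr = LinearRegression().fit(X, y)
median_reg = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X, y)   # 0.5-th QR, no regularization

print("LSR slope:   ", lsr.coef_[0])              # pulled upward by the outliers
print("Median slope:", median_reg.coef_[0])       # stays close to the true slope of 2
```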

References

[1] Koenker, R. (2005). Quantile Regression (Vol. 38). Cambridge University Press.

[2] Hao, L., Naiman, D. Q. (2007). Quantile Regression (No. 149). Sage.

[3] Scikit-learn Quantile Regressor examples

[4] Scikit-learn Quantile Regressor function

[5] Stackexchange: Formulating Quantile Regression as a Linear Program

