Introduction to Quantile Regression in R

Fernanda Alves Martins
Javier Martinez Arribas

29 March, 2023

Introduction

When making predictions for an outcome, it can be helpful to determine the level of confidence or a range of values surrounding the expected outcome where the actual value may fall.

For instance, when predicting a stock price, it is not only the average outcome that matters: the best- and worst-case scenarios are also essential for managing risk and avoiding losses.

Although most machine learning techniques don’t offer a straightforward way to obtain such ranges, in this introductory course we’ll explore how quantile regression provides them.

This approach enables us to gain crucial statistical insights into our data, specifically the quantiles.

Introduction

Quantile regression was introduced by Koenker and Bassett (1978) and fits specified percentiles of the response, such as the 90th percentile, and can potentially describe the entire conditional distribution of the response.

Quantile regression does not assume a particular parametric distribution for the response, nor does it assume a constant variance for the response, unlike least squares regression.

The quantile level is the probability (or the proportion of the population) that is associated with a quantile.

By fitting a series of regression models for a grid of quantile levels τ in the interval (0, 1), you can describe the entire conditional distribution of the response.
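To build some intuition for what is being fitted, here is a minimal base-R sketch (the `check_loss` function and the simulated `y` are our own illustration, not part of the course material): quantile regression minimizes the so-called check (pinball) loss, and for a model with only an intercept the minimizer is simply the τ-th sample quantile.

```r
# Check (pinball) loss: rho_tau(u) = u * (tau - I(u < 0))
check_loss <- function(u, tau) u * (tau - (u < 0))

set.seed(1)
y <- rnorm(1000)
tau <- 0.9

# Grid search for the constant m minimizing the summed check loss
grid <- seq(min(y), max(y), length.out = 2000)
loss <- sapply(grid, function(m) sum(check_loss(y - m, tau)))
m_hat <- grid[which.min(loss)]

m_hat                     # essentially the empirical 90th percentile
quantile(y, tau, type = 1)
```

Replacing the constant with a linear function of covariates gives quantile regression; a different τ in the loss picks out a different part of the conditional distribution.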

Benefits of Quantile Regression

  • Handles skewed distributions: Traditional regression methods assume that the data is normally distributed. Quantile regression, on the other hand, can handle skewed distributions and provide more accurate predictions.

  • Robustness to outliers: Quantile regression is also robust to outliers since it minimizes the sum of absolute deviations instead of the sum of squared deviations.

  • Flexibility: Quantile regression allows modeling different quantiles of the response variable, which can be useful for different applications. For example, quantile regression can predict the lowest or highest values of the response variable.
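The robustness point can be illustrated with a small simulation (a sketch with made-up data; the crude `optim`-based fit is our stand-in for the median case of quantile regression, not the `quantreg` workflow used later):

```r
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.2)
y[1:5] <- y[1:5] + 50          # contaminate 5% of the responses

# Least squares is pulled away from the bulk of the data by the outliers
coef(lm(y ~ x))

# Median regression (tau = 0.5) minimizes absolute deviations instead;
# fitted here crudely by direct optimization
lad <- optim(c(0, 0), function(b) sum(abs(y - b[1] - b[2] * x)))
lad$par                        # should stay close to the true (2, 3)
```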

Benefits of Quantile Regression

  • Interpretability: Quantile regression provides estimates of the conditional quantiles of the response variable, which can be interpreted as the effect of each predictor on different parts of the distribution of the response variable.

  • Useful for risk management: In finance and other fields where risk management is critical, quantile regression can be used to model the lower quantiles of the response variable, which can help in estimating the risk of negative events.
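A lower conditional quantile of returns is exactly the idea behind Value-at-Risk. A toy base-R sketch with simulated, purely hypothetical daily returns:

```r
set.seed(7)
returns <- rnorm(250, mean = 0.0005, sd = 0.01)  # hypothetical daily returns

# 95% Value-at-Risk: the return level breached on roughly 5% of days,
# i.e. the 5th percentile of the return distribution
var_95 <- quantile(returns, 0.05)
var_95
```

With covariates (market conditions, volatility measures, …), estimating this 5th percentile conditionally is a quantile regression at τ = 0.05.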

Comparison with Linear Regression

Examples: Age vs. Body Mass Index

library(gamlss.data)
library(MASS)
library(tidyverse)
library(quantreg)
library(ggplot2)

data("dbbmi")
ggplot(dbbmi, aes(x = bmi)) +
  geom_histogram(fill = "blue")

Examples: Age vs. Body Mass Index

rq.bmi<- rq(bmi ~ age, tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = dbbmi)
rq.bmi
Call:
rq(formula = bmi ~ age, tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = dbbmi)

Coefficients:
             tau= 0.10  tau= 0.25  tau= 0.50  tau= 0.75  tau= 0.90
(Intercept) 13.6134535 14.5189625 15.6364205 16.8357953 17.7737257
age          0.1875969  0.2047773  0.2315554  0.2735556  0.3305528

Degrees of freedom: 7294 total; 7292 residual
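The fitted model can also be used to predict conditional BMI quantiles at new ages: `predict()` on an `rq` fit with several taus returns one column per quantile level. (The model is refitted here so the chunk is self-contained; the ages 5, 10 and 15 are arbitrary illustration values.)

```r
library(quantreg)
library(gamlss.data)

data("dbbmi")
rq.bmi <- rq(bmi ~ age, tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = dbbmi)

# One row per age, one column per quantile level
predict(rq.bmi, newdata = data.frame(age = c(5, 10, 15)))
```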

Examples: Age vs. Body Mass Index

summary(rq.bmi25<- rq(bmi ~ age, tau = 0.25, data = dbbmi))

Call: rq(formula = bmi ~ age, tau = 0.25, data = dbbmi)

tau: [1] 0.25

Coefficients:
            Value     Std. Error t value   Pr(>|t|) 
(Intercept)  14.51896   0.04507  322.14690   0.00000
age           0.20478   0.00453   45.21552   0.00000

Examples: Age vs. Body Mass Index

summary(rq.bmi75<- rq(bmi ~ age, tau = 0.75, data = dbbmi))

Call: rq(formula = bmi ~ age, tau = 0.75, data = dbbmi)

tau: [1] 0.75

Coefficients:
            Value     Std. Error t value   Pr(>|t|) 
(Intercept)  16.83580   0.05303  317.48151   0.00000
age           0.27356   0.00641   42.65068   0.00000

Examples: Age vs. Body Mass Index

anova(rq.bmi25, rq.bmi75)
Quantile Regression Analysis of Deviance Table

Model: bmi ~ age
Joint Test of Equality of Slopes: tau in {  0.25 0.75  }

  Df Resid Df F value    Pr(>F)    
1  1    14587  111.37 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Evaluation: Age vs. Body Mass Index

plot(rq.bmi)

Evaluation: Age vs. Body Mass Index

qs <- c(0.025,0.25,0.50,0.75,0.975)

ggplot(dbbmi, aes(age, bmi)) +
  geom_point(size=1, colour="grey70") +
  geom_quantile(quantiles=qs, formula=y ~ poly(x, 3), colour="red") +
  geom_smooth(method='lm', formula=y ~ poly(x,3), colour="blue", 
              se=FALSE, linetype="11") +
  theme_classic()

Example: Birth weight vs. Mother weight

qs <- c(0.025,0.25,0.50,0.75,0.975)

ggplot(birthwt, aes(lwt, bwt)) +
  geom_point(size=1, colour="grey70") +
  geom_quantile(quantiles=qs, formula=y ~ poly(x, 3), colour="red") +
  geom_smooth(method='lm', formula=y ~ poly(x,3), colour="blue", 
              se=FALSE, linetype="11") +
  theme_classic()
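A quantile regression model for this data could then include the smoking indicator as a second predictor (a sketch; the variable names `bwt`, `lwt` and `smoke` come from `MASS::birthwt`):

```r
library(MASS)
library(quantreg)

# Birth weight quantiles as a function of mother's weight and smoking status
rq.bwt <- rq(bwt ~ lwt + smoke, tau = c(0.1, 0.5, 0.9), data = birthwt)
coef(rq.bwt)
```

Comparing the `smoke` coefficient across taus shows whether smoking shifts the whole birth-weight distribution or mainly its lower tail.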

Example: Foodexp. vs. Income

Engel’s Law states that as a household’s (or a nation’s) income rises, the percentage of income spent on food decreases and the percentage spent on other goods and services increases.
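Before fitting quantile curves, the law itself can be checked directly on the classic engel data (a quick sketch; the `share` variable is our own construction):

```r
library(quantreg)   # ships the engel data set

data(engel)
share <- engel$foodexp / engel$income  # fraction of income spent on food
cor(engel$income, share)               # negative, as Engel's Law predicts
```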

Example: Foodexp. vs. Income

data(engel)
qs <- c(0.025,0.25,0.50,0.75,0.975)
ggplot(engel, aes(income, foodexp)) +
  geom_point(size=1, colour="grey70") +
  geom_quantile(quantiles=qs, formula=y ~ poly(x, 3), colour="red") +
  geom_smooth(method='lm', formula=y ~ poly(x,3), colour="blue", 
              se=FALSE, linetype="11")