
Data Analytics Homework 03

This is my personal answer to Homework 03 of Data Analytics.

Question 6

In a multiple regression model $y = X\beta + \epsilon$, it is critical to know whether $(X^TX)^{-1}$ exists. The diagonal elements of $(X^TX)^{-1}$ in correlation form, i.e., when $X$ is normalized, are often called Variance Inflation Factors (VIFs), and they are an important multicollinearity diagnostic.

VIF for the $j^{th}$ regression coefficient is expressed as

\[\text{VIF}_j = \frac {1} {1 - R_j^2}\]

where $R_j^2$ is the coefficient of multiple determination obtained from regressing $x_j$ on the other regressor variables $\{x_1, x_2, \dots, x_{j-1}, x_{j+1}, \dots, x_p\}$.
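For instance, if $R_j^2 = 0.9$, i.e., 90% of the variation in $x_j$ is explained by the other regressors, then

\[\text{VIF}_j = \frac{1}{1 - 0.9} = 10,\]

meaning the variance of $\hat{\beta}_j$ is inflated tenfold compared with the case of orthogonal regressors.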

Calculate all the VIFs in the autompg dataset and discuss your observation.

Solution

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

auto_mpg_data_1 = pd.read_csv('./auto-mpg.data.csv', header=None,
                              names=['mpg', 'cylinders', 'displacement', 'horsepower',
                                     'weight', 'acceleration', 'model year', 'origin', 'car name'],
                              sep=r'\s+')

# Extract the X matrix.
# In this case, we only use cylinders, displacement, horsepower, weight,
# acceleration, model year, and origin -- 7 regressors in total.
# .copy() avoids pandas' SettingWithCopyWarning when we modify columns below.
observation_matrix = auto_mpg_data_1[['cylinders', 'displacement', 'horsepower',
                                      'weight', 'acceleration', 'model year', 'origin']].copy()

# Convert all dtypes to float64.
# horsepower holds a non-numeric placeholder for missing values, so coerce it
# to numeric first: unparseable entries become NaN.
observation_matrix['horsepower'] = pd.to_numeric(observation_matrix['horsepower'], errors='coerce')
observation_matrix = observation_matrix.astype('float64')

# Fill NaN values (only horsepower has any) with the horsepower column mean
observation_matrix_filled = observation_matrix.fillna(observation_matrix['horsepower'].mean())
# Verify that no NaN remains
observation_matrix_filled.isnull().values.any()
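Before imputing, it helps to see what the "special value" actually is (a quick sketch using only the objects defined above; in the UCI auto-mpg data the missing horsepower entries are marked with a '?' placeholder):

# Rows whose horsepower could not be parsed as a number
mask = pd.to_numeric(auto_mpg_data_1['horsepower'], errors='coerce').isna()
print(auto_mpg_data_1.loc[mask, 'horsepower'])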

After processing our data, we can run the regressions.

# Use sklearn to regress each regressor on the remaining ones
# data: observation_matrix_filled
from sklearn.linear_model import LinearRegression


def between_regressor_regression(regressor_no):
    # Prepare the auxiliary regression: the chosen column becomes y,
    # the other six columns form x
    y = observation_matrix_filled.iloc[:, regressor_no]
    x = observation_matrix_filled.iloc[:, [i for i in range(7) if i != regressor_no]]

    # Fit the model and compute R^2 on the same data
    model = LinearRegression()
    model.fit(x, y)
    R_score = model.score(x, y)
    coef = model.coef_

    return (R_score, coef)


for item in range(7):
    R_score, coef = between_regressor_regression(item)
    print(f'VIF for regressor {observation_matrix_filled.columns[item]} is {1 / (1 - R_score)}')

Our result is:

VIF for regressor cylinders is 10.695947463386112
VIF for regressor displacement is 21.800494214461786
VIF for regressor horsepower is 9.018033271181528
VIF for regressor weight is 10.50604929747329
VIF for regressor acceleration is 2.5062796569955825
VIF for regressor model year is 1.2391699980298057
VIF for regressor origin is 1.7346650225093887
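
As a cross-check (a sketch assuming statsmodels is available; it is not used elsewhere in this post), the same VIFs can be obtained from statsmodels' built-in helper. A constant column is added so that each auxiliary regression has an intercept, matching sklearn's LinearRegression default:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# variance_inflation_factor regresses column i of the design matrix on the
# remaining columns; add_constant supplies the intercept term
X = sm.add_constant(observation_matrix_filled)
for i, name in enumerate(observation_matrix_filled.columns, start=1):
    print(f'VIF for regressor {name} is {variance_inflation_factor(X.values, i)}')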

We can see that the VIFs for cylinders, displacement, horsepower, and weight are all quite high, while those for acceleration, model year, and origin are relatively low.
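The high VIFs suggest that these four columns are strongly mutually correlated, which a quick look at the correlation matrix makes explicit (a sketch using the objects defined above):

# Pairwise correlations among the four high-VIF regressors; their strong
# mutual correlation is what inflates the VIFs
high_vif_cols = ['cylinders', 'displacement', 'horsepower', 'weight']
print(observation_matrix_filled[high_vif_cols].corr())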

A variance inflation factor (VIF) is a measure of the amount of multicollinearity in regression analysis.

When $R_i^2$ is equal to 0, and therefore when the VIF (or its reciprocal, the tolerance) is equal to 1, the $i$-th independent variable is not correlated with the remaining ones, meaning that multicollinearity does not exist.

In other words, when a variable is explained only to a small degree by the other variables, it can hardly be linearly related to them. Conversely, when a variable is largely explained by the other variables, it must be strongly correlated with them; $R^2$ is then large, which in turn makes the VIF large.
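Written out, the tolerance mentioned above is just the complement of $R_i^2$, and the VIF is its reciprocal:

\[\text{Tolerance}_i = 1 - R_i^2, \qquad \text{VIF}_i = \frac{1}{\text{Tolerance}_i}\]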

In general terms,

  • VIF equal to 1 = variables are not correlated.
  • VIF between 1 and 5 = variables are moderately correlated.
  • VIF greater than 5 = variables are highly correlated.
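
Applying this rule of thumb to our results (a small helper sketch; the VIF values are copied from the output earlier in this post):

def vif_category(vif):
    # Thresholds follow the rule of thumb listed above
    if vif <= 1.0:
        return 'not correlated'
    if vif <= 5.0:
        return 'moderately correlated'
    return 'highly correlated'

# VIFs computed earlier in this post (rounded)
vifs = {'cylinders': 10.70, 'displacement': 21.80, 'horsepower': 9.02,
        'weight': 10.51, 'acceleration': 2.51, 'model year': 1.24, 'origin': 1.73}
for name, vif in vifs.items():
    print(f'{name}: {vif_category(vif)}')

By this criterion, cylinders, displacement, horsepower, and weight are all highly correlated with the remaining regressors.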