Sparse matrices — Part 2

We learned what sparse matrices are and why we need them in the first part of this series. In this article, we're going to cover the main packages available for implementing multiple linear regression in Python, their limitations, and their support for sparse matrices.
Statsmodels OLS package
This package gives us statistics about our model, such as p-values, R-squared, AIC, and BIC. It is analogous to the lm() function in R, which provides a detailed analysis of all the estimates and parameters involved in the regression.

Though we can use this package to predict and to calculate other goodness-of-fit measures, it's not ideal for large datasets.
In my research project, I had a dataset of 1.4M rows and around 6,600 dummy variables, and I was trying to run a fixed-effects regression (part 3 of this series will cover this). When I ran the regression with this package, I kept getting memory-limit-exceeded errors. As a solution, my research advisors suggested converting the input matrix to a sparse format, since over 90% of the data is zero anyway (recall that it's a matrix of dummy variables). I tried that, and I could then process the dataset in seconds rather than waiting for hours.
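To illustrate why the conversion helps so much, here is a small sketch (with a toy matrix, not my actual dataset) of converting a mostly-zero dummy matrix to SciPy's CSR format and comparing memory footprints:

```python
import numpy as np
from scipy import sparse

# Toy stand-in for a dummy-variable matrix: ~2% of entries are 1
rng = np.random.default_rng(1)
dense = (rng.random((1000, 500)) < 0.02).astype(np.uint8)

# CSR stores only the nonzero values plus their index arrays
X_sparse = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The sparser the matrix, the larger the saving; with 90%+ zeros, both memory use and many computations shrink dramatically.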
Sparse matrices and Statsmodels are not friends!
One of the biggest disadvantages of the Statsmodels package is that it doesn't accept a SciPy sparse matrix or the pandas Sparse[uint8] dtype as input. I discovered this while rerunning the regression with the newly converted sparse matrix, as discussed above.
This is where the Scikit-learn package comes into the picture.
Scikit-learn package
This package was designed to run multiple linear regression on large datasets and make predictions using an out-of-sample approach or cross-validation. It is one of the most widely used packages among programmers, and it offers many pre-built attributes and methods to retrieve the intercept, coefficients, and so on.
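As a quick sketch (again on toy data), fitting a model and reading off those attributes looks like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: 100 observations, 2 predictors (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)

print(model.intercept_)   # fitted intercept
print(model.coef_)        # fitted coefficients, one per predictor
print(model.score(X, y))  # R-squared on the training data
```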

We can see that there are different ways to retrieve the attributes of a linear regression, but there is nothing like the summary() we used in the Statsmodels package.
Coming back to sparse matrices, this package also accepts a sparse matrix as input and returns results in a few seconds. The drawback is that it doesn't give a detailed summary of statistical parameters the way Statsmodels does.
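A minimal sketch of the sparse workflow: the fit() call below takes a SciPy CSR matrix directly, so the dense copy never needs to be materialized for the regression itself (the toy matrix here stands in for a real dummy-variable design).

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LinearRegression

# Sparse dummy-style design matrix: ~5% nonzero entries
rng = np.random.default_rng(3)
dense = (rng.random((2000, 100)) < 0.05).astype(float)
X = sparse.csr_matrix(dense)
y = dense @ rng.normal(size=100) + rng.normal(scale=0.1, size=2000)

# fit() accepts the SciPy sparse matrix directly -- no densifying needed
model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R-squared, computed on the sparse input
```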
To calculate p-values, we have to write our own function on top of the model fitted by Scikit-learn.
Please refer to the following link for example code: https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression
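As a hedged sketch of that idea (the helper name ols_pvalues is my own, and the formulas are the standard classical OLS ones, for dense inputs with a fitted intercept):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

def ols_pvalues(model, X, y):
    """Two-sided OLS p-values for a fitted LinearRegression.

    Hypothetical helper: assumes dense X and fit_intercept=True.
    """
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta = np.concatenate([[model.intercept_], model.coef_])
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (n - k - 1)         # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)      # covariance of the estimates
    t_stat = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stat), df=n - k - 1)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=200)
model = LinearRegression().fit(X, y)
print(ols_pvalues(model, X, y))  # intercept first, then one p-value per feature
```

This reproduces the t-test p-values Statsmodels would report, but it has to be maintained by hand, which is exactly the inconvenience discussed above.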
This is why many programmers and statisticians believe that R has better support for statistics-related problems than Python. In R, we can simply use the lm() function or the lme4 package to replicate the task we discussed above on a dataset of the same size.
In this article, we talked about two different packages for performing multiple linear regression, their advantages and disadvantages, and which one to use when dealing with sparse matrices.
In the next article, we'll dive deep into fixed-effects and random-effects regression and learn how to interpret them.