
HousePricing dataset - methods comparison

The Caret package is one of the most popular for tackling machine learning problems in R. Using this Kaggle competition data set, several of the methods available in Caret are compared in terms of RMSE to see how they perform.

First, the data was cleaned up and categorical variable values were harmonized between the training and test sets. A multicollinearity evaluation was done to identify co-dependent predictors (ones that are redundant and don't provide useful information for model training). Data imputation was then performed with the missForest package in R, a powerful (and time-consuming) algorithm that iteratively fits random forests on the observed values to impute each missing value from the other predictor variables. In Caret, the built-in imputation method uses the K-nearest neighbor algorithm, which is less computationally expensive and more practical for larger data sets. The imputed data (training and test) was then saved for later use.
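A minimal sketch of this imputation step is below, assuming the cleaned, combined predictors live in a data frame called `full_data` (a hypothetical name); it shows both the missForest call and Caret's lighter-weight knnImpute alternative.

```r
# Sketch of the imputation step; 'full_data' is a hypothetical data frame
# holding the cleaned, combined training/test predictors.
library(missForest)
library(caret)

set.seed(42)

# missForest: iterative random-forest imputation across all predictors
mf <- missForest(full_data)
full_imputed <- mf$ximp      # imputed data frame
mf$OOBerror                  # out-of-bag estimate of the imputation error

# Lighter-weight alternative built into Caret: K-nearest-neighbor imputation
# (knnImpute also centers and scales the numeric columns as a side effect)
pp <- preProcess(full_data, method = "knnImpute")
full_knn <- predict(pp, full_data)

# Save the imputed data for the modelling script
saveRDS(full_imputed, "imputed_data.rds")
```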

In the second .R file, three regression algorithms are compared. Before training the models, the numeric columns of the training and test data were mean-centered and scaled. The target, Sale Price, was log-transformed to remedy the skewness of its distribution, and the training data was split into 80% for training and 20% for validation. A 5-fold cross-validation was performed for each method.
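A sketch of this preprocessing, assuming the imputed training data sits in a data frame `train_data` with the target column `SalePrice` (names are assumptions based on the Kaggle data):

```r
# Sketch of the preprocessing and split; 'train_data' and 'SalePrice' are
# assumed names for the imputed training set and its target column.
library(caret)

set.seed(42)

# Log-transform the target to reduce the skew of the price distribution
train_data$SalePrice <- log(train_data$SalePrice)

# 80/20 split into training and validation sets
idx <- createDataPartition(train_data$SalePrice, p = 0.8, list = FALSE)
tr  <- train_data[idx, ]
val <- train_data[-idx, ]

# Mean-center and scale the predictors using statistics from the training split
pred_cols <- setdiff(names(tr), "SalePrice")
pp <- preProcess(tr[, pred_cols], method = c("center", "scale"))
tr[, pred_cols]  <- predict(pp, tr[, pred_cols])
val[, pred_cols] <- predict(pp, val[, pred_cols])

# 5-fold cross-validation reused by every train() call below
ctrl <- trainControl(method = "cv", number = 5)
```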

**Gradient Boosted Method (GBM) gridsearch**

This method allowed a grid search over the depth of the decision trees (interaction.depth), the number of estimators (n.trees) and the minimum number of observations in a terminal node (the actual count, not the total weight). A validation loss (RMSE) of 0.1419481 was obtained (only 275 samples), and the public leaderboard score was LB = 0.13510 (only the GBM prediction was submitted to Kaggle).
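A hedged sketch of what this grid search might look like with Caret's `train`, reusing `tr`, `val` and `ctrl` from the preprocessing sketch above; the grid values are illustrative, not the ones used in the repo:

```r
# Sketch of the GBM grid search; grid values are illustrative only.
library(caret)
library(gbm)

gbm_grid <- expand.grid(
  n.trees           = c(500, 1000, 2000),  # number of boosting iterations
  interaction.depth = c(3, 5, 7),          # depth of each tree
  shrinkage         = c(0.01, 0.05),       # learning rate
  n.minobsinnode    = c(5, 10)             # min observations in a terminal node
)

gbm_fit <- train(SalePrice ~ ., data = tr,
                 method    = "gbm",
                 trControl = ctrl,
                 tuneGrid  = gbm_grid,
                 verbose   = FALSE)

# RMSE on the held-out 20% (target is on the log scale)
RMSE(predict(gbm_fit, newdata = val), val$SalePrice)
```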

[Figures (lr_overfit): overfitting vs. reduced overfitting at smaller shrinkage]

**Univariate Feature Selection**

Predictors that show a statistically significant univariate relationship with the target are then used for modeling. The validation loss was val_loss = 0.1621139, the worst among the three methods.
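One way to express such a univariate filter in Caret is the selection-by-filter (`sbf`) interface with `lmSBF`, which keeps predictors whose univariate linear-model p-value is significant; whether the repo uses `sbf` or a hand-rolled filter is an assumption.

```r
# Sketch of a univariate filter via Caret's selection-by-filter interface;
# lmSBF scores each predictor with a univariate linear-model p-value.
library(caret)

filter_ctrl <- sbfControl(functions = lmSBF, method = "cv", number = 5)

sbf_fit <- sbf(x = tr[, setdiff(names(tr), "SalePrice")],
               y = tr$SalePrice,
               sbfControl = filter_ctrl)

sbf_fit$optVariables   # predictors that passed the significance filter
```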

**Regularized Random Forest gridsearch**

This algorithm is the most time-consuming of the three. The parameters available for grid search are the number of randomly selected predictor variables at each split (mtry) and the number of random thresholds for each predictor variable (numRandomCuts). The validation loss was 0.154461.
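The `mtry`/`numRandomCuts` pair matches the tuning parameters exposed by Caret's `extraTrees` method, so a sketch under that assumption (grid values illustrative, reusing `tr`, `val` and `ctrl` from above) could look like:

```r
# Sketch assuming Caret's "extraTrees" method, which tunes exactly
# mtry and numRandomCuts; grid values are illustrative only.
library(caret)

et_grid <- expand.grid(
  mtry          = c(10, 20, 40),  # predictors sampled at each split
  numRandomCuts = c(1, 3, 5)      # random thresholds tried per predictor
)

et_fit <- train(SalePrice ~ ., data = tr,
                method    = "extraTrees",
                trControl = ctrl,
                tuneGrid  = et_grid)

RMSE(predict(et_fit, newdata = val), val$SalePrice)
```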

The Genetic Algorithm feature selection in the Caret package was tried, but it kept freezing RStudio by using up too much memory; the same happened with the Regularized Random Forest algorithm. Parallelization of the computation should be explored in R to resolve these issues. In addition, algorithms like GBM and XGBoost don't yield the same results when run standalone as when evaluated through the corresponding models in Caret.
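As a starting point for the parallelization idea, Caret's resampling loops can run on a foreach backend such as doParallel; the worker count below is an arbitrary choice.

```r
# Sketch of running Caret's resampling in parallel with doParallel;
# the number of workers (4) is an arbitrary choice, and each worker
# holds its own copy of the data, so memory use still needs watching.
library(doParallel)
library(caret)

cl <- makePSOCKcluster(4)   # leave some cores free for the OS / RStudio
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# ... train() calls issued here run their resampling folds on the workers ...

stopCluster(cl)
registerDoSEQ()             # drop back to sequential execution
```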

Reference
http://topepo.github.io/caret/index.html
