
HousePricing dataset - methods comparison

The Caret package is one of the most popular for tackling machine learning problems in R. Using this Kaggle competition data set, several of the methods available in Caret are compared in terms of RMSE to see how they perform.

First, the data was cleaned up and categorical variable values were harmonized between the training and test sets. A multicollinearity evaluation was done to identify co-dependent predictors (ones that are redundant and don't provide useful information for model training). Data imputation was then performed with the missForest package in R, a powerful (and time-consuming) algorithm that iteratively fits random forests on the observed values to impute each missing value from the other predictor variables. In Caret, the built-in imputation method uses the K-nearest neighbor algorithm, which is less computationally expensive and more practical for larger data sets. The imputed data (training and test) was then saved for later use.
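A minimal sketch of this imputation step is below, assuming the cleaned, combined predictors live in a data frame called `full_data` (a hypothetical name); it shows both the missForest call and Caret's lighter-weight knnImpute alternative.

```r
# Sketch of the imputation step; 'full_data' is a hypothetical data frame
# holding the cleaned, combined training/test predictors.
library(missForest)
library(caret)

set.seed(42)

# missForest: iterative random-forest imputation across all predictors
mf <- missForest(full_data)
full_imputed <- mf$ximp      # imputed data frame
mf$OOBerror                  # out-of-bag estimate of the imputation error

# Lighter-weight alternative built into Caret: K-nearest-neighbor imputation
# (knnImpute also centers and scales the numeric columns as a side effect)
pp <- preProcess(full_data, method = "knnImpute")
full_knn <- predict(pp, full_data)

# Save the imputed data for the modelling script
saveRDS(full_imputed, "imputed_data.rds")
```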

In the second .R file, three regression algorithms are compared. Before training the models, the numeric columns of the training and test data were mean-centered and scaled. The target, Sale Price, was log-transformed to remedy the skewness of its distribution, and the training data was split into 80% for training and 20% for validation. A 5-fold cross-validation was performed for each method.
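A sketch of this preprocessing, assuming the imputed training data sits in a data frame `train_data` with the target column `SalePrice` (names are assumptions based on the Kaggle data):

```r
# Sketch of the preprocessing and split; 'train_data' and 'SalePrice' are
# assumed names for the imputed training set and its target column.
library(caret)

set.seed(42)

# Log-transform the target to reduce the skew of the price distribution
train_data$SalePrice <- log(train_data$SalePrice)

# 80/20 split into training and validation sets
idx <- createDataPartition(train_data$SalePrice, p = 0.8, list = FALSE)
tr  <- train_data[idx, ]
val <- train_data[-idx, ]

# Mean-center and scale the predictors using statistics from the training split
pred_cols <- setdiff(names(tr), "SalePrice")
pp <- preProcess(tr[, pred_cols], method = c("center", "scale"))
tr[, pred_cols]  <- predict(pp, tr[, pred_cols])
val[, pred_cols] <- predict(pp, val[, pred_cols])

# 5-fold cross-validation reused by every train() call below
ctrl <- trainControl(method = "cv", number = 5)
```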

**Gradient Boosted Method (GBM) gridsearch**

This method allowed a grid search over the depth of the decision trees (interaction.depth), the number of estimators (n.trees) and the minimum number of observations in a terminal node (the actual count, not the total weight). A validation loss (RMSE) of 0.1419481 was obtained (only 275 samples), and the public leaderboard score was LB = 0.13510 (only the GBM prediction was submitted to Kaggle).
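A hedged sketch of what this grid search might look like with Caret's `train`, reusing `tr`, `val` and `ctrl` from the preprocessing sketch above; the grid values are illustrative, not the ones used in the repo:

```r
# Sketch of the GBM grid search; grid values are illustrative only.
library(caret)
library(gbm)

gbm_grid <- expand.grid(
  n.trees           = c(500, 1000, 2000),  # number of boosting iterations
  interaction.depth = c(3, 5, 7),          # depth of each tree
  shrinkage         = c(0.01, 0.05),       # learning rate
  n.minobsinnode    = c(5, 10)             # min observations in a terminal node
)

gbm_fit <- train(SalePrice ~ ., data = tr,
                 method    = "gbm",
                 trControl = ctrl,
                 tuneGrid  = gbm_grid,
                 verbose   = FALSE)

# RMSE on the held-out 20% (target is on the log scale)
RMSE(predict(gbm_fit, newdata = val), val$SalePrice)
```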

[Figures (lr_overfit): overfitting vs. reduced overfitting at smaller shrinkage]

**Univariate Feature Selection**

Predictors that show a statistically significant univariate relationship with the target are then used for modeling. The validation loss was val_loss = 0.1621139, the worst among the three methods.
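One way to express such a univariate filter in Caret is the selection-by-filter (`sbf`) interface with `lmSBF`, which keeps predictors whose univariate linear-model p-value is significant; whether the repo uses `sbf` or a hand-rolled filter is an assumption.

```r
# Sketch of a univariate filter via Caret's selection-by-filter interface;
# lmSBF scores each predictor with a univariate linear-model p-value.
library(caret)

filter_ctrl <- sbfControl(functions = lmSBF, method = "cv", number = 5)

sbf_fit <- sbf(x = tr[, setdiff(names(tr), "SalePrice")],
               y = tr$SalePrice,
               sbfControl = filter_ctrl)

sbf_fit$optVariables   # predictors that passed the significance filter
```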

**Regularized Random Forest gridsearch**

This algorithm is the most time-consuming of the three. The parameters available for grid search are the number of randomly selected predictor variables at each split (mtry) and the number of random thresholds for each predictor variable (numRandomCuts). The validation loss was 0.154461.
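The `mtry`/`numRandomCuts` pair matches the tuning parameters exposed by Caret's `extraTrees` method, so a sketch under that assumption (grid values illustrative, reusing `tr`, `val` and `ctrl` from above) could look like:

```r
# Sketch assuming Caret's "extraTrees" method, which tunes exactly
# mtry and numRandomCuts; grid values are illustrative only.
library(caret)

et_grid <- expand.grid(
  mtry          = c(10, 20, 40),  # predictors sampled at each split
  numRandomCuts = c(1, 3, 5)      # random thresholds tried per predictor
)

et_fit <- train(SalePrice ~ ., data = tr,
                method    = "extraTrees",
                trControl = ctrl,
                tuneGrid  = et_grid)

RMSE(predict(et_fit, newdata = val), val$SalePrice)
```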

The Genetic Algorithm feature selection in the Caret package was tried, but it kept freezing RStudio by using up too much memory; the same happened with the Regularized Random Forest algorithm. Parallelization of the computation should be explored in R to resolve these issues. In addition, algorithms like GBM and XGBoost don't yield the same results when run standalone as when evaluated through the corresponding models in Caret.
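As a starting point for the parallelization idea, Caret's resampling loops can run on a foreach backend such as doParallel; the worker count below is an arbitrary choice.

```r
# Sketch of running Caret's resampling in parallel with doParallel;
# the number of workers (4) is an arbitrary choice, and each worker
# holds its own copy of the data, so memory use still needs watching.
library(doParallel)
library(caret)

cl <- makePSOCKcluster(4)   # leave some cores free for the OS / RStudio
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# ... train() calls issued here run their resampling folds on the workers ...

stopCluster(cl)
registerDoSEQ()             # drop back to sequential execution
```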

Reference
http://topepo.github.io/caret/index.html
