Predicting the sale price of houses in the Ames Housing Dataset.



This project was an assignment I completed as part of my General Assembly Data Science intensive. It used the Ames Housing Dataset, a widely available data set that is often used for teaching. The project had three main components:

• Estimate the sale price of properties based on their "fixed" characteristics, such as neighbourhood, lot size, number of stories, etc.
• Estimate the value of possible changes and renovations to properties from the variation in sale price not explained by the fixed characteristics.
• Determine the features in the housing data that best predict "abnormal" sales (foreclosures, etc.).

To tackle these problems I first cleaned the data rigorously, correcting rows with missing values and removing outlier rows that would otherwise have distorted model training.
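
As a rough illustration of this kind of cleaning step (not the notebook's actual code), the sketch below fills numeric gaps with the column median and drops rows that fall outside an interquartile-range fence. The toy values, the `clean` helper and the IQR rule are all assumptions made for the example; the column names follow the Kaggle version of the data set.

```python
import pandas as pd

# Toy stand-in for the Ames data; the values are illustrative only.
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, None, 60.0, 84.0, 70.0],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 5642],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 160000],
})


def clean(frame: pd.DataFrame, outlier_col: str, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: median-fill numeric gaps, then drop IQR outliers."""
    out = frame.copy()
    numeric_cols = out.select_dtypes("number").columns
    # Replace each missing numeric value with its column's median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # Keep only rows within k * IQR of the quartiles for the chosen column.
    q1, q3 = out[outlier_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out[outlier_col].between(q1 - k * iqr, q3 + k * iqr)
    return out[keep].reset_index(drop=True)


cleaned = clean(df, outlier_col="GrLivArea")
print(cleaned)
```

A median fill and a fixed IQR fence are deliberately simple choices; an anomaly-detection model is another way to flag outlier rows.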

From there I used the “fixed” features to estimate the sale price of each house in the data set; the resulting model was around 85% accurate.
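
A minimal sketch of a fixed-feature price model is shown below. It uses synthetic data and a plain scikit-learn linear regression scored with R² on a hold-out split; the feature names, the simulated price formula and the choice of model and metric are assumptions for illustration, not the ones used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic "fixed" characteristics (illustrative names and ranges).
fixed = pd.DataFrame({
    "Neighborhood": rng.choice(["NAmes", "OldTown", "NridgHt"], size=n),
    "LotArea": rng.uniform(5000, 15000, size=n),
    "Stories": rng.integers(1, 3, size=n),
})

# Simulated sale price: an assumed linear relationship plus noise.
premium = fixed["Neighborhood"].map({"NAmes": 0, "OldTown": -20000, "NridgHt": 60000})
price = (
    100000
    + 8 * fixed["LotArea"]
    + 25000 * fixed["Stories"]
    + premium
    + rng.normal(0, 15000, size=n)
)

# One-hot encode the categorical column so the regression can use it.
X = pd.get_dummies(fixed, columns=["Neighborhood"], drop_first=True, dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"Hold-out R^2: {r2:.2f}")
```

Scoring on rows the model has not seen gives a fairer picture of accuracy than scoring on the training data.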

Once I had an estimated value for each house based only on the “fixed” features, I took the difference between the actual sale price and that estimate and analysed which “non-fixed” features influenced the difference the most.
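
The residual idea can be sketched as follows: subtract the fixed-feature estimate from the actual price, regress that difference on standardised renovatable features, and rank the features by coefficient size. The data here is simulated, and the two feature names and their effects are assumptions chosen to make the example readable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500

# Hypothetical renovatable ("non-fixed") features.
renovatable = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "CentralAir": rng.integers(0, 2, size=n).astype(float),
})

# Stand-in for the fixed-feature model's price estimate.
estimated = rng.uniform(100000, 300000, size=n)

# Simulated actual price: the estimate plus assumed renovation effects and noise.
actual = (
    estimated
    + 6000 * (renovatable["OverallQual"] - 5.5)
    + 2000 * renovatable["CentralAir"]
    + rng.normal(0, 5000, size=n)
)

# The residual is the part of the price the fixed features did not explain.
residual = actual - estimated

# Standardise so coefficient sizes are comparable across features.
scaled = (renovatable - renovatable.mean()) / renovatable.std()
reg = LinearRegression().fit(scaled, residual)

ranking = pd.Series(reg.coef_, index=scaled.columns).abs().sort_values(ascending=False)
print(ranking)
```

Each coefficient reads as the change in unexplained price per one standard deviation of the feature, which is what makes the ranking meaningful.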

For the final part of the project I built a model to classify sales in the data set as normal or abnormal using the entire feature set, then identified the features that influenced that model the most.
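
One way to approach this step, sketched under assumptions, is to fit a tree-based classifier and read off its feature importances. The example below uses simulated data in which cheaper sales are more likely to be abnormal; the features, the labelling rule and the choice of a random forest with balanced class weights are illustrative and not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 600

# Simulated feature set (illustrative names and distributions).
X = pd.DataFrame({
    "SalePrice": rng.normal(180000, 40000, size=n),
    "YearBuilt": rng.integers(1900, 2010, size=n).astype(float),
    "LotArea": rng.uniform(5000, 15000, size=n),
})

# Assumption for the toy data: the lower the price, the likelier an abnormal sale.
p_abnormal = 1 / (1 + np.exp((X["SalePrice"] - 140000) / 10000))
y = (rng.uniform(size=n) < p_abnormal).astype(int)  # 1 = abnormal, 0 = normal

# Abnormal sales are the minority class, so weight classes inversely to frequency.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
).fit(X, y)

importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Because abnormal sales are rare, class weighting (or a resampling strategy) keeps the model from simply predicting "normal" for every row.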

Technology Used

The project was completed with Python in a Python Notebook. Here is a selection of the libraries used:

• scikit-learn for building classification models, RFECV and metrics.
• PyCaret for initial machine learning model comparisons, anomaly detection and CatBoost Regression.
• Pandas for data frame management.
• NumPy / SciPy for crunching numbers.
• Matplotlib / Seaborn for plotting graphs.


Below are some graphics generated while completing this project.


I’ve collected a few resources for further reading:

• A Kaggle Notebook by Rafael Alencar on the topic of different resampling strategies.
• PyCaret documentation regarding comparing regression models.
• The Ames Housing Dataset on Kaggle.

© Copyright 2020 Tom Gilmore. All Rights Reserved.