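import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
# read the data
dwell = pd.read_csv("dwellings_ml.csv")
# delete yrbuilt and parcel
dwell = dwell.drop(['yrbuilt','parcel'], axis=1)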
This week we predicted, with 92.3% accuracy, whether a house was built before 1980, using the RandomForestClassifier machine learning model from the scikit-learn library. The data set available to train the model had 48 features and 22,913 rows, and the most important variables for predicting the target were livearea, numbaths, and stories.
The following code creates a scatter plot that helped me analyze the effects and patterns that the “total units” and “number of bedrooms” variables have in relation to the “before 1980” variable.
fig = px.scatter(
    dwell,
    x='numbdrm',
    y='totunits',
    color='before1980'
)
fig.update_layout(
    xaxis_title="Number of bedrooms",
    yaxis_title="Total units",
    title="Scatter Plot of number of bedrooms vs total units"
)
fig.show()
Findings:
- All the houses with more than 6 bedrooms were built before 1980, no matter how many units they have.
- All the houses that have 4 or more units were built before 1980, no matter the number of bedrooms.
- Apparently, houses built before 1980 tended to have more units and bedrooms than houses built since.
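The first two findings can also be checked numerically rather than read off the plot. The sketch below is a minimal illustration on a small made-up frame: the rows in `toy` and the helper `share_before_1980` are hypothetical, and only the column names (numbdrm, totunits, before1980) come from the scatter plot code.

```python
import pandas as pd

# Hypothetical toy rows that reuse the column names from the scatter plot.
toy = pd.DataFrame({
    "numbdrm":    [2, 3, 7, 8, 4, 3],
    "totunits":   [1, 1, 2, 1, 4, 5],
    "before1980": [0, 1, 1, 1, 1, 1],
})

def share_before_1980(frame, mask):
    """Return the fraction of rows selected by `mask` that were built before 1980."""
    return frame.loc[mask, "before1980"].mean()

# A value of 1.0 means every selected house was built before 1980.
many_bedrooms = share_before_1980(toy, toy["numbdrm"] > 6)
many_units = share_before_1980(toy, toy["totunits"] >= 4)
print(many_bedrooms, many_units)
```

Running the same two checks on dwell instead of toy would confirm or refute the findings on the full data set.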
The next code creates two box plots, one for houses built before 1980 and the other for houses built in 1980 or after. This is very helpful for analyzing the distribution of the “number of bathrooms” variable and its relationship with “before1980”.
fig = px.box(
    dwell,
    x='before1980',
    y='numbaths',
    color='before1980',
)
fig.update_layout(
    xaxis_title="Before 1980",
    yaxis_title="Number of Bathrooms",
    title="Boxplot of Number of Bathrooms vs Before 1980"
)
fig.show()
Findings:
- Apparently, houses built before 1980 tended to have fewer bathrooms than houses built since.
- The upper quartile for the houses built before 1980 is equal to the lower quartile for the houses built in 1980 or after.
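The quartile comparison can be computed directly as a cross-check on the box plots. This is a minimal sketch on made-up rows (the values in `toy` are hypothetical; only the column names come from the report). Plotting libraries may use a slightly different quartile convention than pandas, so the numbers can differ a little from what the plot draws.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the report.
toy = pd.DataFrame({
    "before1980": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "numbaths":   [1, 1, 2, 2, 3, 2, 2, 3, 3, 4],
})

# 25th, 50th and 75th percentiles of numbaths for each group,
# reshaped so that each quantile becomes a column.
quartiles = (
    toy.groupby("before1980")["numbaths"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)

# Upper quartile of the older houses vs. lower quartile of the newer ones.
older_q3 = quartiles.loc[1, 0.75]
newer_q1 = quartiles.loc[0, 0.25]
print(older_q3, newer_q1)
```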
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
The following code splits the data into four groups: train X, train y, test X, and test y. This prepares the data for fitting the RandomForestClassifier, which I selected because it normally achieves high accuracy in its predictions.
X = dwell.drop(["before1980"], axis=1)
y = dwell["before1980"]
train_data, test_data, train_targets, test_targets = train_test_split(X, y, test_size=.30)
# RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(train_data, train_targets)
targets_predicted = classifier.predict(test_data)
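Neither train_test_split nor RandomForestClassifier is given a seed in the code above, so the reported numbers will shift slightly from run to run. A minimal sketch of a reproducible variant follows. The synthetic data from make_classification stands in for dwell, and random_state=42, stratify=y, n_estimators=50 and the helper name `fit_and_score` are my assumptions, not settings used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dwellings data (assumption: not the real CSV).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def fit_and_score(seed):
    """Split, fit and score with every source of randomness pinned to `seed`."""
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(train_X, train_y)
    # score() returns the mean accuracy on the held-out data.
    return model.score(test_X, test_y)

# The same seed gives the same accuracy on repeated runs.
first_run = fit_and_score(42)
second_run = fit_and_score(42)
print(first_run, second_run)
```

Pinning the seed does not make the model better; it only makes the accuracy and the confusion matrix come from the same, repeatable run.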
Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.
The following code creates a chart of the classifier's features ranked by importance.
# Create a series containing feature importances from the model and feature names from the training data
feature_importances = pd.Series(classifier.feature_importances_, index=train_data.columns).sort_values(ascending=False)
# Plot a simple bar chart
feature_importances.plot.bar(title='Classifier Features by Importance')
Findings:
- livearea was the strongest predictor of whether a house was built before 1980, with an importance of around 10%.
- It was followed by the number of bathrooms and stories, with around 8%.
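A note on reading these percentages: scikit-learn's impurity-based feature_importances_ are normalized to sum to 1, so "around 10%" means roughly a tenth of the total importance, not a tenth of the accuracy. The sketch below illustrates this on synthetic data. The feature names f0 to f5 and all parameter values are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names f0..f5.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are shares of the total, so they add up to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
print("Sum of importances:", importances.sum())
```

With 48 features, an even split would give each feature about 2%, so shares of 8% to 10% stand out clearly.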
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
The next code chunk uses two scikit-learn tools to evaluate the classification model: the accuracy score and the confusion matrix.
print("Accuracy:", accuracy_score(test_targets, targets_predicted))
# Create the confusion matrix
cm = confusion_matrix(test_targets, targets_predicted)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
Accuracy: 0.9236252545824847
Findings:
- The model accuracy was 92.3%.
- According to the confusion matrix, there were 279 false positives and 251 false negatives. The accuracy was lower for the houses that were built in 1980 or after.
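To interpret these metrics, it helps to see how each one follows from the four cells of the confusion matrix. The sketch below is illustrative only: the function name `summarize` and the example counts are hypothetical. The report states only the false positives and false negatives, so its true positive and true negative counts are not reproduced here.

```python
def summarize(tp, fp, fn, tn):
    """Derive common metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # Of the houses predicted "before 1980", the share that really were.
        "precision": tp / (tp + fp),
        # Of the houses really built before 1980, the share the model found.
        "recall": tp / (tp + fn),
    }

# Hypothetical counts chosen only to keep the arithmetic easy to follow.
metrics = summarize(tp=80, fp=10, fn=20, tn=90)
print(metrics)
```

For a binary target coded 0/1, scikit-learn's confusion_matrix puts true labels in rows and predicted labels in columns, so `tn, fp, fn, tp = cm.ravel()` would supply the four counts from the matrix computed above.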