Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Chapter 2 - End-to-end Machine Learning Project

Setup

Load packages

import os
from pathlib import Path
import tarfile
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

Visuals

Code to save and format figures as high-res PNGs

IMAGES_PATH = Path() / "images" / "end_to_end_project"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)


# matplotlib settings
plt.rc("font", size=14)
plt.rc("axes", labelsize=14, titlesize=14)
plt.rc("legend", fontsize=14)
plt.rc("xtick", labelsize=10)
plt.rc("ytick", labelsize=10)

Seeds

Set seed for reproducibility

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED) # set seed for reproducibility

Additionally, you must set the environment variable PYTHONHASHSEED to "0" before Python starts; setting it after the interpreter is running has no effect on hash randomization.
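For example, you can launch Python with the variable already set and verify it from inside the process (a minimal sketch; the script name is illustrative):

# In the shell, before Python starts:
#   PYTHONHASHSEED=0 python train.py
import os

assert os.environ.get("PYTHONHASHSEED") == "0", \
    "PYTHONHASHSEED must be set before launching Python"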

Data

Extract and load housing data

def extract_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)  # create the dir if it does not exist
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
    # extract outside the if-block, so a tarball downloaded on a previous run still gets extracted
    with tarfile.open(tarball_path) as housing_tarball:
        housing_tarball.extractall(path="datasets")


def load_housing_data():
    housing_csv_path = Path("datasets/housing/housing.csv")
    if not housing_csv_path.is_file():
        extract_housing_data()
    return pd.read_csv(housing_csv_path)


housing = load_housing_data()

View housing data structure

housing.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
housing.describe()
          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
housing.hist(bins=50, figsize=(12,8))
plt.show()
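To keep this figure, call the save_fig helper defined in the Visuals section before plt.show() (the figure name here is illustrative):

housing.hist(bins=50, figsize=(12, 8))
save_fig("attribute_histogram_plots")  # writes images/end_to_end_project/attribute_histogram_plots.png
plt.show()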

Create test set

Random Sample

def shuffle_and_split_data(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)  # number of rows in the test set, rounded down
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


# Split into training and test sets with sizes 80% and 20%, respectively
train_set, test_set = shuffle_and_split_data(housing, 0.2)

Verify training data size

len(train_set) / len(housing)
0.8

Verify test data size

len(test_set) / len(housing)
0.2

Alternatively, you can use scikit-learn's train_test_split to create the train and test sets, supplying the random seed as an argument:

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=RANDOM_SEED)

Verify training data size

len(train_set) / len(housing)
0.8

Verify test data size

len(test_set) / len(housing)
0.2

To keep the test set consistent across data refreshes, you need to persist which instances belong to it, either by storing the test-set indices or by deriving test-set membership from a hash of each instance's identifier.
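The hash-based approach, sketched below, sends an instance to the test set whenever the hash of its identifier falls in the lowest 20% of the 32-bit hash space, which keeps the test set stable as new data arrives. The helper names are illustrative; since the housing data has no identifier column, the row index stands in for one (which only works if new rows are always appended and none are ever deleted):

from zlib import crc32

def is_id_in_test_set(identifier, test_ratio):
    # keep the instance in the test set if its CRC-32 hash falls in the
    # lowest test_ratio fraction of the 32-bit hash space
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()  # adds an "index" column to use as the identifier
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")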

Stratified Sample

Sometimes you will want to perform a train/test split using a stratified sample. Here’s a stratified train/test split using income buckets as our strata.

housing["median_income"].describe()
count    20640.000000
mean         3.870671
std          1.899822
min          0.499900
25%          2.563400
50%          3.534800
75%          4.743250
max         15.000100
Name: median_income, dtype: float64

Create income bucket column in dataframe

housing["income_bucket"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
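pd.cut assigns each district to the half-open interval containing its income: bucket 1 covers (0.0, 1.5], bucket 2 covers (1.5, 3.0], and so on, with bucket 5 covering everything above 6.0. For example, the first district's median_income of 8.3252 exceeds 6.0, so it lands in bucket 5:

housing["income_bucket"].head()  # first five buckets: 5, 5, 5, 4, 3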

housing["income_bucket"].value_counts().sort_index().plot.bar(
    rot=0, # rotates x-axis labels
    grid=True # add gridlines
)
plt.xlabel("Income Bucket")
plt.ylabel("Number of districts")
plt.show()

Use scikit-learn's StratifiedShuffleSplit to perform a stratified train/test split 10 times:

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=RANDOM_SEED)
stratified_splits = []
for train_index, test_index in splitter.split(housing, housing["income_bucket"]):
    stratified_train_set = housing.iloc[train_index]
    stratified_test_set = housing.iloc[test_index]
    stratified_splits.append([stratified_train_set, stratified_test_set])

If you wish to use a single stratified train/test split, you can simply use train_test_split() with the stratify argument:

strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_bucket"],
    random_state=RANDOM_SEED,
)

Verify that stratification produced a test set whose income-bucket proportions track the full dataset more closely than a random split:

def income_bucket_proportions(data):
    return data["income_bucket"].value_counts() / len(data)

train_set, test_set = train_test_split(
    housing,
    test_size=0.2,
    random_state=RANDOM_SEED
)
proportion_comparison = pd.DataFrame({
    "Overall %": income_bucket_proportions(housing),
    "Stratified %": income_bucket_proportions(strat_test_set),
    "Random %": income_bucket_proportions(test_set),
})
proportion_comparison.index.name = "Income Bucket"
proportion_comparison["Stratified Error %"] = proportion_comparison["Stratified %"] / proportion_comparison["Overall %"] - 1
proportion_comparison["Random Error %"] = proportion_comparison["Random %"] / proportion_comparison["Overall %"] - 1

(proportion_comparison * 100).round(2)
               Overall %  Stratified %  Random %  Stratified Error %  Random Error %
Income Bucket
3                  35.06         35.05     34.52               -0.01           -1.53
2                  31.88         31.88     30.74               -0.02           -3.59
4                  17.63         17.64     18.41                0.03            4.42
5                  11.44         11.43     12.09               -0.08            5.63
1                   3.98          4.00      4.24                0.36            6.45
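The income_bucket column exists only to support the stratified split, so once the splits look good you will probably want to drop it so it does not leak into later steps (a minimal sketch):

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_bucket", axis=1, inplace=True)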

EDA

Data Visualization

Creating a scatter plot of the latitudes and longitudes tells us where these districts are located in relation to one another.

housing.plot("longitude", "latitude", "scatter")
plt.show()

We observe that our data spans California. Applying an alpha to the plot will better show the districts’ density.

housing.plot("longitude", "latitude", "scatter", alpha=0.2)
plt.show()

Additionally, we can plot these points and add layers for population (point size) and median house value (point color).

housing.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    grid=True,
    s=housing["population"] / 100, # size of points
    c="median_house_value", # color of points
    cmap="viridis", # colormap to use for color layer
    colorbar=True,
    alpha=0.5,
    legend=True,
    figsize=(10,7)
)
plt.show()

Since we are using location data, we can overlay these points on a map image.

filename = "california.png"
if not (IMAGES_PATH / filename).is_file():
    img_url_root = "https://github.com/ageron/handson-ml3/raw/main/"
    img_url = img_url_root + "images/end_to_end_project/" + filename
    urllib.request.urlretrieve(img_url, IMAGES_PATH / filename)
housing.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    grid=False,
    s=housing["population"] / 100, # size of points
    c="median_house_value", # color of points
    cmap="viridis", # colormap to use for color layer
    colorbar=True,
    alpha=0.5,
    legend=True,
    figsize=(10,7)
)
axis_limits = -124.55, -113.95, 32.45, 42.05
plt.axis(axis_limits)
california_img = plt.imread(IMAGES_PATH / filename)
plt.imshow(california_img, extent=axis_limits)
plt.show()

Correlation

We can return a correlation matrix and look at correlations for our target variable.

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64

pandas can also create a matrix of scatter plots for every pair of variables of interest; since the correlation coefficient only captures linear relationships, these plots can reveal patterns it misses.

We observe strong correlations among the size-related variables (total_rooms, total_bedrooms, population, and households), and only weak correlations between housing_median_age and the other variables.

scatter_columns = [
    "housing_median_age", "total_rooms", "total_bedrooms", "population",
    "households", "median_income", "median_house_value",
]  # numeric columns only: scatter_matrix silently drops non-numeric columns such as ocean_proximity
pd.plotting.scatter_matrix(housing[scatter_columns], figsize=(12, 8))
plt.show()

Feature Engineering

We can combine existing columns into new ratio features that capture relationships the raw counts miss.

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

When we re-run the correlation matrix, we observe that the derived columns can correlate better with the target variable than the raw columns they came from:

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
rooms_per_house       0.151948
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
people_per_house     -0.023737
population           -0.024650
longitude            -0.045967
latitude             -0.144160
bedrooms_ratio       -0.255880
Name: median_house_value, dtype: float64

Data Prep

Functions are the preferred way to clean data for ML. Writing the data-cleaning process as functions allows you to reproduce your results easily and apply the same cleaning across different projects.

Clean training data and labels

First, we want to remove the target variable from our training set and store it in its own object.

TARGET_VARIABLE = "median_house_value"
housing = strat_train_set.drop(TARGET_VARIABLE, axis=1)  # features only; drop() returns a copy
housing_labels = strat_train_set[TARGET_VARIABLE].copy()  # labels must come from the same split

Handle Missing Data

To handle missing data, there are three options:

- Remove the rows with missing values (pd.DataFrame.dropna())
- Remove attributes with missing values (pd.DataFrame.drop())
- Impute the missing values

Imputation is generally preferred, so as to avoid losing information. We can use scikit-learn for imputation.

# available strategies: 'mean', 'median', 'most_frequent', 'constant' (using provided 'fill_value'), or Callable
imputer = SimpleImputer(strategy="median")
housing_numeric_columns = housing.select_dtypes(include=[np.number])
imputer.fit(housing_numeric_columns)
SimpleImputer(strategy='median')

The imputer calculates the specified statistic for each column and stores the results in the statistics_ attribute.

imputer.statistics_
array([-118.51  ,   34.26  ,   29.    , 2125.    ,  434.    , 1167.    ,
        408.    ,    3.5385])
housing_numeric_columns.median().values
array([-118.51  ,   34.26  ,   29.    , 2125.    ,  434.    , 1167.    ,
        408.    ,    3.5385])

To apply the “fitted” imputer to the data, use the transform method.

X = imputer.transform(housing_numeric_columns)
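transform returns a plain NumPy array. If you prefer to keep working with a DataFrame, you can wrap the result, reusing the original column names and index (a minimal sketch):

housing_tr = pd.DataFrame(X, columns=housing_numeric_columns.columns,
                          index=housing_numeric_columns.index)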

Handle Text and Categorical Data