End-to-End Machine Learning with Python and SageMaker

In Part 1 of this article we scraped the data we needed. In this Part 2 we move on to the next step in the Machine Learning Cycle: Exploratory Data Analysis (EDA).

The full code on Github and the scraped dataset for this Part 2 can be found here.

Let’s dive straight into the exploratory data analysis.

What to Expect from Exploratory Data Analysis

There is some confusion out there about whether or not to split the data into training and test sets before doing EDA. I would say the problem arises when you go into too much detail during your EDA: your brain will already pick up patterns in the data, and you risk becoming biased in the subsequent stages.

On the other hand, as you will see with our scraped dataset, we need EDA to validate the quality and consistency of our data. The data might even be of such bad quality that we need to drop it and go collect more.

So EDA is good for just getting a peek at your dataset. But do not go into too much detail and start adjusting things, because those adjustments (or feature engineering) will be done in the next phase of the Machine Learning Cycle.
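If you prefer to hold out a test set before any detailed EDA, here is a minimal sketch using scikit-learn's train_test_split; the 20% test size and the random seed are illustrative choices, not values used later in this article.

#A minimal sketch of holding out a test set before any deep EDA (illustrative values)
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('land_price_data.csv', index_col=0)

#Keep 20% aside and only explore the training portion in detail
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=42)
print(train_set.shape, test_set.shape)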

That said, let us load our dataset and start peeking into it.

Performing the EDA on our scraped dataset

We first need to import our libraries: pandas to create the dataframe, NumPy for some numerical operations, and matplotlib and seaborn for the visualisations.

a.) Importing the libraries and the data

Let us start with the imports.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

After importing the required libraries, the next thing is to bring in our dataset using pandas.

dataset = pd.read_csv('land_price_data.csv',index_col = 0)

#View first 10 observations
print(dataset.head(10))

Here is the output you get if you use the same dataset as me; if you scrape the data from scratch, you will have different values.

      Price    Location      Area
0      8000      Douala   10000.0
1     55000       Yassa     300.0
2     55000       Yassa     200.0
3     55000      Japoma     735.0
4     55000      Japoma    1500.0
5      3000        Mfou  174000.0
6  16000000  Biyem-Assi     270.0
7     60000       Kribi    5000.0
8    150000       Kribi    1000.0
9      8000      Douala   10000.0

From here, we can already see that there is some inconsistency in how the data was entered on the classified ads website.

Row 6 above (Biyem-Assi) clearly shows this: the correct price per square metre would have been 16 000 000 / 270 ≈ 59 259.

So some values in the “Price” column were entered as a price per square metre (as for “Yassa”, “Japoma” and “Kribi”), while others were entered as the total price, i.e. price per m² × total area (as for the “Biyem-Assi” entry in row 6).

Also, “Douala” is a city and not a neighbourhood. These city-level entries will have to be removed in the feature engineering stage, since the model only has to predict prices for neighbourhoods and not cities.
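As a preview of that clean-up, here is a minimal sketch of how the city-level rows could be dropped, assuming “Douala” and “Yaoundé” are the city names to exclude; the definitive list will only be fixed during feature engineering.

#Sketch: drop city-level rows, assuming 'Douala' and 'Yaoundé' are the cities to exclude
cities = ['Douala', 'Yaoundé']
neighbourhoods_only = dataset[~dataset['Location'].isin(cities)].copy()
print(neighbourhoods_only['Location'].nunique(), 'locations left')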

b.) Inspecting Missing Values

Let us take a look at the missing values and verify whether they are “missing values with information” or “missing values without information”.

The first thing is to go through all the columns, identify those with missing values, and compute the share of missing values per column.

#Identify the columns with missing values and the share of missing values per column
features_with_na = [feature for feature in dataset.columns if dataset[feature].isnull().sum() > 0]

for feature in features_with_na:
    print(feature, np.round(dataset[feature].isnull().mean(),2),'% missing values' )

We get this output when we run the above code.

Area 0.05 % missing values

We see that only the “Area” column has missing values, and about 5% of its values are missing.
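As a quick sanity check, the same information can also be read off directly with pandas; this is just an alternative view, not a new step in the analysis.

#Alternative view: count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'missing_count': dataset.isnull().sum(),
    'missing_pct': (dataset.isnull().mean() * 100).round(2)
})
print(missing_summary)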

Let us visualise the distribution of the missing values across the dataset.

#Plotting a heatmap to visualise the missing values
sns.heatmap(dataset.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Here is the output graph we get.

Now let us plot two bars (one for the observations where Area is missing, the other for those where it is present) showing the median Price for each group. This is to confirm whether the missing values carry information or not.

#Compare the median Price for rows where Area is missing vs rows where it is present
data = dataset.copy()

#1 where Area is missing, 0 where it is present
data['Area'] = np.where(data['Area'].isnull(), 1, 0)

#Median Price per group
data.groupby('Area')['Price'].median().plot.bar()
plt.xlabel('Area (1 = missing, 0 = present)')
plt.ylabel('Median Price')
plt.show()

Notice that we made a copy of the dataset with dataset.copy(). We do not want to affect the original dataset, so we do our EDA on a copy.

Here is the output we get from the above code.

Notice that there is no difference in height between the two bars, which suggests that the missing values are random and can be deleted without adding any bias.
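For reference, here is a minimal sketch of how those rows could be dropped later on; it is done on a copy here, since the actual cleaning belongs to the feature engineering phase.

#Sketch: drop the rows with missing Area, working on a copy of the dataset
data = dataset.copy()
data = data.dropna(subset=['Area'])
print(data.shape)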

c.) Inspecting the Dependent Variable – Price

Now let us look at “Price” to see the distribution.

Let us start with its summary descriptive statistics.

#View the distribution of Price
print(dataset['Price'].describe())

Here is the output of the code.

count    4.490000e+03
mean     4.961029e+07
std      8.482607e+08
min      0.000000e+00
25%      1.000000e+04
50%      3.500000e+04
75%      5.000000e+05
max      4.500000e+10
Name: Price, dtype: float64

We see that the mean is 49 610 290, while the median is 35 000. The difference is huge, so we will have a lot of work to do with the outliers.
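To put a number on that gap, we can quickly check the sample skewness; values far above zero indicate a long right tail.

#Quantify the skew hinted at by the mean/median gap
print('Skewness of Price:', round(dataset['Price'].skew(), 2))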

Now let us look at the distribution using seaborn.

#Plot to see the distribution of Price
sns.displot(dataset['Price'].dropna(),kde=False,color='darkred',bins=40)

We see from the distribution plot above that just a few outliers are causing the trouble.

Now let us take a look at the data above the 3rd quartile (the 75th percentile).

#View data above the 75th percentile (use .copy() so we can safely add columns later)
data = dataset.copy()
data_outlier = data[data['Price'] > data['Price'].quantile(0.75)].copy()
print(data_outlier.head(10))

Below is the output.

        Price      Location    Area
6    16000000    Biyem-Assi   270.0
16   60000000        Logpom  1000.0
17   57240000    Omnisports   477.0
19   21875000         Yassa   875.0
20   26000000        Logpom   200.0
21  200000000  Bonamoussadi   300.0
22   12000000          PK12   300.0
23    3000000         Lendi   500.0
24   34000000        Makepe   340.0
25   24000000        Logpom   200.0

Let us divide “Price” by “Area” and see if the result will be close to the median.

# Create a new column by dividing Price by Area.
data_outlier['Price per Area'] = round(data_outlier['Price']/ data_outlier['Area'])

#Inspect the first 10 observations
print(data_outlier.head(10))

Here is the output of the code.

        Price      Location    Area  Price per Area
6    16000000    Biyem-Assi   270.0         59259.0
16   60000000        Logpom  1000.0         60000.0
17   57240000    Omnisports   477.0        120000.0
19   21875000         Yassa   875.0         25000.0
20   26000000        Logpom   200.0        130000.0
21  200000000  Bonamoussadi   300.0        666667.0
22   12000000          PK12   300.0         40000.0
23    3000000         Lendi   500.0          6000.0
24   34000000        Makepe   340.0        100000.0
25   24000000        Logpom   200.0        120000.0

The price per area looks great!
It shows that most of these values were wrongly entered, as we suspected. Apart from the Bonamoussadi entry (666 667), the prices per square metre in these rows range from 6 000 to 130 000, which are all reasonable.
We will do this correction during feature engineering.
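As a preview, here is a minimal sketch of what that correction might look like, using np.where to replace suspiciously large prices with Price / Area; the 1 000 000 cut-off is purely illustrative and has not been validated against the data yet.

#Sketch of the correction: treat very large prices as totals and divide by Area
#The 1 000 000 threshold is only an illustrative cut-off
data = dataset.copy()
threshold = 1000000
data['Price'] = np.where(
    (data['Price'] > threshold) & (data['Area'] > 0),
    round(data['Price'] / data['Area']),
    data['Price']
)
print(data['Price'].describe())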

Let’s just take a look at the distribution without the outliers.

data_without_outlier = data[data['Price'] < data['Price'].quantile(0.75)]
sns.displot(data_without_outlier['Price'].dropna(),kde=False,color='darkred',bins=40)

Here is the output of the code above, showing the distribution for the data below the outlier cut-off value.

We see that without the outliers the distribution looks much better. We will come back to this later in the feature engineering phase, but for now it is good to have an idea of the distribution of our dataset and the things we could possibly do later to improve accuracy.
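One common option we could explore later is to look at Price on a logarithmic scale, which often tames this kind of right-skewed distribution. Here is a minimal, purely exploratory sketch; no transformation is applied to the original dataset.

#Exploratory look at Price on a log scale (no change to the original dataset)
sns.displot(np.log1p(dataset['Price'].dropna()), kde=False, color='darkred', bins=40)
plt.xlabel('log(1 + Price)')
plt.show()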

d.) Inspecting the Independent Variable – Area

Let us take a look at the Area variable to see its distribution.

#Summary descriptive stats
print(dataset['Area'].describe())

Here is the output for the above code.

count      4270.000000
mean      10036.810304
std       42361.933909
min           0.000000
25%         500.000000
50%        1000.000000
75%        5393.750000
max      500000.000000
Name: Area, dtype: float64

We notice that the mean is 10 036, while the median is 1 000. Again there are lots of outliers which we will need to take care of later.
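As a preview of one possible treatment, here is a minimal sketch that caps Area at the IQR-based upper fence; the 1.5 multiplier is the usual convention, not something tuned for this dataset.

#Sketch: cap Area at the IQR-based upper fence (1.5 is the usual convention)
data = dataset.copy()
q1, q3 = data['Area'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
data['Area'] = data['Area'].clip(upper=upper_fence)
print(data['Area'].describe())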

Let us look at the distribution.

#Plotting a histogram to view the dispersion of Area
data = dataset.copy()
data['Area'].hist(bins=25)
plt.xlabel('Area')
plt.ylabel('Count')
plt.show()

The code produces the following output.

Again, it confirms that a few points are making the distribution very positively skewed. We will also deal with them later. Next, let us look at the relationship between Area and Price: we need to confirm whether Price increases with Area or whether the relationship is fairly uniform.

#Find relationship between Area and Price
data = dataset.copy()
plt.scatter(data['Area'],data['Price'])
plt.xlabel('Area')
plt.ylabel('Price')

plt.show()

Here is the output of the code above.

It shows a somewhat uniform relationship once we take care of the outliers in “Price”. It is good to know that most of the prices were entered as a price per square metre and not as a total (price per m² × area).
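As a quick numeric companion to the scatter plot, we can also compute the linear correlation between Area and Price; pandas ignores the rows where Area is missing.

#Linear correlation between Area and Price
print(dataset[['Area', 'Price']].corr())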

e.) Inspecting the Independent Variable – Location

Let us visualise the Location variable and count how many categories it contains.

#Look at the number of categories in the "Location" variable
print(f"The number of categories in the Location variable is : {dataset['Location'].nunique()} locations  \n ")

location_list = dataset['Location'].unique()
print(location_list)
Here is the output.

The number of categories in the Location variable is : 161 locations

['Douala' 'Yassa' 'Japoma' 'Mfou' 'Biyem-Assi' 'Kribi' 'Bastos' 'Logpom'
 'Omnisports' 'Bonamoussadi' 'PK12' 'Lendi' 'Makepe' 'PK11' 'Zone Bassa'
 'Yaoundé' 'Kotto' 'Mendong' 'Logbessou' 'Nyom2' 'Odza' 'Mbankolo' 'Nyom'
 'Mimboman' 'Nkolfoulou' 'Nkolbisson' 'Mbangue' 'PK26' 'Soa' 'PK21'
 'Ndog-Bong' 'Village' 'Bonedale' 'Bassa' 'Manjo' 'PK27' 'Limbé' 'Ndokoti'
 'Bonaberi' 'Akwa' 'Centre ville' 'Tiko' 'Ahala' 'Emana' 'Oyom Abang'
 'Logbaba' 'Nkoabang' 'Santa Barbara' 'Tsinga' 'Zamengoue' 'PK33' 'PK19'
 'Awae' 'Olembe' 'Edéa' 'Denver' 'Cite des Palmiers' 'Bonanjo' 'PK16'
 'Nyalla' 'Messassi' 'PK15' 'Beedi' 'Foumbot' 'Dibombari' 'Bepanda'
 'Quartier Golf' 'Bali' 'Nlongkak' 'Mbalgong' 'Elig-Essono' 'Damase'
 'Ange Raphael' 'Obobogo' 'Efoulan' 'Essos' 'Mbanga' 'Mvog Ada' 'Nyom1'
 'Mvan' 'Nkondengui' 'Bonapriso' 'Bafoussam' 'Eleveur' 'Buea' 'Cité Sic'
 'PK18' 'Ekié' 'Ndogpassi2' 'PK20' 'Ngoumou' 'Ekounou' 'Mbalmayo' 'Pk24'
 'Sangmélima' 'Tropicana' 'Deido' 'Biteng' 'Ndogbati' 'Messamendongo'
 'Ngousso' 'Nsimeyong' 'Akwa Nord' 'PK14' 'Ngodi' 'Soboum' 'Nsam'
 'Ebolowa' 'Bamenda' 'Dakar' 'Etoudi' 'Borne 10' 'Essomba' 'Etoug Ebe'
 'PK13' 'Nkoulouloum' 'Santa Babara' 'Mvolye' 'New Bell' 'Nkomo' 'PK31'
 'Quartier Fouda' 'Ndogpassi' 'Nkolmesseng' 'PK10' 'PK30' 'Mbouda'
 'Newtown Aeroport' 'Nkongsamba' 'Ekoumdoum' 'Yabassi' 'Batouri' 'Mvog Bi'
 'PK22' 'Nkolndongo' 'Loum' 'Messa' 'Olézoa' 'Eséka' 'Akoeman' 'Mvog Mbi'
 'Ndogpassi3' 'BEAC' 'Obili' 'Mokolo' 'Bilongue' 'Bertoua' 'Kumba'
 'Mont Febe' 'Ntui' 'Mballa 2' 'Mvog Atangana Mballa' 'PK25' 'Melen'
 'Etoa Meki' 'Mbengwi' 'Obala' 'Maroua' 'Kondengui' 'Febe' 'PK32']

There are too many categories, and this would make the data very sparse when we do one-hot encoding later. We might decide to build a separate model per city, so that each model only sees the neighbourhoods of its own city; we do not want the dataset to become too wide after one-hot encoding.
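As one alternative worth keeping in mind, here is a minimal sketch that groups the rare locations under a single “Other” label before one-hot encoding; the threshold of 10 listings is only an illustrative choice.

#Sketch: group locations with fewer than 10 listings under 'Other' (illustrative threshold)
data = dataset.copy()
location_counts = data['Location'].value_counts()
rare_locations = location_counts[location_counts < 10].index
data['Location'] = np.where(data['Location'].isin(rare_locations), 'Other', data['Location'])
print(data['Location'].nunique(), 'categories after grouping')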

f.) Inspecting other relationships

Let us create a “Price per Area” variable and see how it relates to Price.

#Create Price per Area and see how it relates to Price
data = dataset.copy()
data['Price per Area'] = data['Price'] / data['Area']

plt.scatter(data['Price per Area'], data['Price'])
plt.xlabel('Price per Area')
plt.ylabel('Price')
plt.show()

Again, once we take care of the outliers in the “Price” column, this will look fine.

We see again that some prices were entered as a price per m² while others are the overall price (price per m² × area). This is the inconsistency in data entry from the website, and we will need to deal with it later; otherwise the variance will cause our model to produce very poor results.

Great! We have now generated enough insight from the data to move on to the next phase, Feature Engineering. In the next stage we will apply what we have observed here and actually start preparing the data for the machine learning model.
