In Part 1 of this article we successfully scraped the data we needed, and in this Part 2 we move on to the next step in the Machine Learning cycle: Exploratory Data Analysis (EDA).
The full code on GitHub and the scraped dataset for this Part 2 can be found here.
Let’s dive straight into the exploratory data analysis.
What to Expect from Exploratory Data Analysis
There is some confusion out there about whether to split the test and training dataset before doing EDA or not. I would say the problem comes when you go into too much detail during your EDA. Your brain will already capture some patterns in the data and will be biased in the subsequent stages.
On the other hand, as you will see with our scraped dataset, we need this EDA to validate the quality and consistency of our data. The data might even be of such bad quality that we need to drop it and go get another.
So EDA is good for getting a quick peep at your dataset. But do not go into too much detail and start adjusting things, because those adjustments (or feature engineering) will be done in the next phase of the Machine Learning cycle.
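If you do want to set the test data aside first, a minimal sketch (using scikit-learn’s train_test_split, with a 20% split size that is purely my own choice, and assuming the dataset has already been loaded) could look like this:

from sklearn.model_selection import train_test_split

#Hold back a test set before any detailed EDA, so later decisions are not influenced by it
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=42)

#All further exploration would then be done on train_set only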
That said, let us load our dataset and start peeping into it.
Performing the EDA on our scraped dataset
We need to import our libraries. Pandas will be used to create the dataframe, numpy to help with some operations. Seaborn and matplotlib are for visualisations.
a.) Importing the libraries and the data
We will start by importing pandas and numpy for managing the dataframe; for visualisation, we will use matplotlib and seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
After importing the required libraries, the next thing is to bring in our data set using pandas.
dataset = pd.read_csv('land_price_data.csv', index_col=0)

#View first 10 observations
print(dataset.head(10))
Here is the output you would get if you use the same dataset as I did. If you scrape the data from scratch, you will have different values.
      Price    Location      Area
0      8000      Douala   10000.0
1     55000       Yassa     300.0
2     55000       Yassa     200.0
3     55000      Japoma     735.0
4     55000      Japoma    1500.0
5      3000        Mfou  174000.0
6  16000000  Biyem-Assi     270.0
7     60000       Kribi    5000.0
8    150000       Kribi    1000.0
9      8000      Douala   10000.0
From here, we can already see that there is some inconsistency in how the data was entered on the classified ads website.
Row 6 above (Biyem-Assi) clearly shows that the correct value for Price would have been 16 000 000 / 270 ≈ 59 259.
So some entries in the “Price” column were recorded as a price per square metre (like “Yassa”, “Japoma” and “Kribi”), while others were recorded as the total, i.e. price per m² × total area (like “Biyem-Assi” in row 6).
Also, “Douala” is a city and not a neighbourhood. These cities will have to be removed in the feature engineering stage, since the model only has to predict prices for neighbourhoods and not cities.
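As a hedged preview of that clean-up (the list of city names below is only an illustrative assumption, not the final list):

#Hypothetical list of city-level entries to drop in the feature engineering phase
city_names = ['Douala', 'Yaoundé', 'Limbé', 'Edéa', 'Kribi']

#Keep only the rows whose Location is a neighbourhood rather than a city
neighbourhoods_only = dataset[~dataset['Location'].isin(city_names)]
print(neighbourhoods_only.shape)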
b.) Inspecting missing Values
Let us take a look at the missing values and also verify whether they are “missing values with information” or “missing values without information”.
The first thing is to go through all the columns, identify those with missing values and report the proportion of missing values per column.
#Check for missing values in the dataset
features_with_na = [feature for feature in dataset.columns if dataset[feature].isnull().sum() > 0]

#Print the fraction of missing values for each such column
for feature in features_with_na:
    print(feature, np.round(dataset[feature].isnull().mean(), 2), '% missing values')
We get this output when we run the above code.
Area 0.05 % missing values
We see that only the “Area” column has missing values, and about 5% of its values are missing.
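If you want the exact number of missing observations rather than the proportion, a quick check is:

#Count the missing values in the Area column
print(dataset['Area'].isnull().sum())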
Let us visualise to see the distribution of the missing values per column.
#Plotting a heatmap to visualise the missing values
sns.heatmap(dataset.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Here is the output graph we get.

Now let us compare the median Price for observations where Area is missing against those where it is not, using a simple bar plot with two bars. This is to confirm whether the missing values carry information or not.
#Compare the median Price for rows where Area is missing vs not missing
data = dataset.copy()

#Create an indicator: 1 where Area is missing, 0 otherwise
data['Area'] = np.where(data['Area'].isnull(), 1, 0)

#Plot the median Price for each group
data.groupby('Area')['Price'].median().plot.bar()
plt.xlabel('Area missing (1 = yes, 0 = no)')
plt.ylabel('Median Price')
plt.show()
Notice that we made a copy of the dataset with dataset.copy(). We do not want to affect the original dataset, so we make a copy to do our EDA.
Here is the output we get from the above code.

Notice that there is no difference in height between the two bars, showing that the missing values are random and can be deleted without adding any bias.
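The deletion itself will be done during feature engineering, but as a minimal sketch it would simply be:

#Drop the rows where Area is missing (to be applied in the feature engineering phase)
dataset_no_na = dataset.dropna(subset=['Area'])
print(dataset_no_na.shape)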
c.) Inspecting the Dependent Variable – Price
Now let us look at the distribution of “Price”.
We start with its summary descriptive statistics.
#View the distribution of Price
print(dataset['Price'].describe())
Here is the output of the code.
count    4.490000e+03
mean     4.961029e+07
std      8.482607e+08
min      0.000000e+00
25%      1.000000e+04
50%      3.500000e+04
75%      5.000000e+05
max      4.500000e+10
Name: Price, dtype: float64
We see that the mean is 49 610 290, while the median is 35 000. The difference is huge, so we will have a lot of work to do with the outliers.
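One quick way to quantify how strongly the outliers pull the distribution to the right is the skewness:

#A large positive value confirms a long right tail caused by a few extreme prices
print(dataset['Price'].skew())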
Now let us look at the distribution using seaborn.
#Plot to see the distribution of Price
sns.displot(dataset['Price'].dropna(), kde=False, color='darkred', bins=40)

We see from the distribution plot above that just a few outliers are causing the trouble.
Now let us take a look at the data above the 3rd quartile (75th percentile).
#View data above the 75th percentile
data = dataset.copy()
data_outlier = data[data['Price'] > data['Price'].quantile(0.75)].copy()  #copy to avoid a SettingWithCopyWarning later
print(data_outlier.head(10))
Below is the output.
        Price      Location    Area
6    16000000    Biyem-Assi   270.0
16   60000000        Logpom  1000.0
17   57240000    Omnisports   477.0
19   21875000         Yassa   875.0
20   26000000        Logpom   200.0
21  200000000  Bonamoussadi   300.0
22   12000000          PK12   300.0
23    3000000         Lendi   500.0
24   34000000        Makepe   340.0
25   24000000        Logpom   200.0
Let us divide “Price” by “Area” and see if the result will be close to the median.
# Create a new column by dividing Price by Area
data_outlier['Price per Area'] = round(data_outlier['Price'] / data_outlier['Area'])

#Inspect the first 10 observations
print(data_outlier.head(10))
Here is the output of the code.
        Price      Location    Area  Price per Area
6    16000000    Biyem-Assi   270.0         59259.0
16   60000000        Logpom  1000.0         60000.0
17   57240000    Omnisports   477.0        120000.0
19   21875000         Yassa   875.0         25000.0
20   26000000        Logpom   200.0        130000.0
21  200000000  Bonamoussadi   300.0        666667.0
22   12000000          PK12   300.0         40000.0
23    3000000         Lendi   500.0          6000.0
24   34000000        Makepe   340.0        100000.0
25   24000000        Logpom   200.0        120000.0
The price per Area looks Great!!!
It shows that most of these values were wrongly entered, like we suspected. Apart from Bonamoussadi (666 667), the price per square metre in this sample ranges from 6 000 to 130 000, which are all reasonable values.
We will do this correction during Feature Engineering.
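As a hedged sketch of that correction (the 1 000 000 cutoff below is only an assumption I am using for illustration; the real rule will be decided during feature engineering):

#Sketch: convert suspected total prices back to price per square metre
data_fix = dataset.copy()
suspected_totals = data_fix['Price'] > 1_000_000   #assumed cutoff for "total price" entries
data_fix.loc[suspected_totals, 'Price'] = (
    data_fix.loc[suspected_totals, 'Price'] / data_fix.loc[suspected_totals, 'Area']
)
print(data_fix['Price'].describe())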
Let’s just take a look at the distribution without the outliers.
data_without_outlier = data[data['Price'] < data['Price'].quantile(0.75)]
sns.displot(data_without_outlier['Price'].dropna(), kde=False, color='darkred', bins=40)
Here is the output of the code above, showing the graph for the data below the outlier cutoff value.

We see that without the outliers the distribution looks much better. We will come back to these outliers later in the feature engineering phase. For now it is good that we have an idea of the distribution of our dataset and of the things we could possibly do later to improve accuracy.
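One of the things we could try later to tame this skew is a log transform of Price; a minimal sketch (not applied to the dataset at this stage) would be:

#Visualise the distribution of log-transformed Price (log1p also handles the zero prices)
sns.displot(np.log1p(dataset['Price'].dropna()), kde=False, color='darkred', bins=40)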
d.) Inspecting the Independent Variable – Area
Let us take a look at the Area variable to see its distribution.
#Summary descriptive stats
print(dataset['Area'].describe())
Here is the output for the above code.
count      4270.000000
mean      10036.810304
std       42361.933909
min           0.000000
25%         500.000000
50%        1000.000000
75%        5393.750000
max      500000.000000
Name: Area, dtype: float64
We notice that the mean is 10 036, while the median is 1 000. Again there are lots of outliers which we will need to take care of later.
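A common way to flag such outliers later is the interquartile range (IQR) rule; as a rough sketch:

#Flag Area values above the usual IQR-based upper bound
q1, q3 = dataset['Area'].quantile(0.25), dataset['Area'].quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)
print((dataset['Area'] > upper_bound).sum(), 'observations above an upper bound of', upper_bound)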
Let us look at the distribution.
#Plotting a histogram to view the dispersion of Area
data = dataset.copy()
data['Area'].hist(bins=25)
plt.xlabel('Area')
plt.ylabel('Count')
plt.show()
The code produces the following output.

Again it confirms that a few points are making the distribution very positively skewed. We will also deal with them later. Next let us look at the relationship between Area and Price. We need to confirm whether “Price” increases with “Area” or whether the relationship is roughly uniform.
#Find the relationship between Area and Price
data = dataset.copy()
plt.scatter(data['Area'], data['Price'])
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()
Here is the output of the code above.

It shows a somewhat uniform distribution once we take care of the outliers in “Price”. It is good to know that most of the prices were entered as price per square metre and not as price per square metre × area.
e.) Inspecting the Independent Variable – Location
Let us count how many categories there are in the “Location” variable and list them.
#Look at the number of categories in the "Location" variable
print(f"The number of categories in the Location variable is : {dataset['Location'].nunique()} locations \n")
location_list = dataset['Location'].unique()
print(location_list)
The number of categories in the Location variable is : 161 locations ['Douala' 'Yassa' 'Japoma' 'Mfou' 'Biyem-Assi' 'Kribi' 'Bastos' 'Logpom' 'Omnisports' 'Bonamoussadi' 'PK12' 'Lendi' 'Makepe' 'PK11' 'Zone Bassa' 'Yaoundé' 'Kotto' 'Mendong' 'Logbessou' 'Nyom2' 'Odza' 'Mbankolo' 'Nyom' 'Mimboman' 'Nkolfoulou' 'Nkolbisson' 'Mbangue' 'PK26' 'Soa' 'PK21' 'Ndog-Bong' 'Village' 'Bonedale' 'Bassa' 'Manjo' 'PK27' 'Limbé' 'Ndokoti' 'Bonaberi' 'Akwa' 'Centre ville' 'Tiko' 'Ahala' 'Emana' 'Oyom Abang' 'Logbaba' 'Nkoabang' 'Santa Barbara' 'Tsinga' 'Zamengoue' 'PK33' 'PK19' 'Awae' 'Olembe' 'Edéa' 'Denver' 'Cite des Palmiers' 'Bonanjo' 'PK16' 'Nyalla' 'Messassi' 'PK15' 'Beedi' 'Foumbot' 'Dibombari' 'Bepanda' 'Quartier Golf' 'Bali' 'Nlongkak' 'Mbalgong' 'Elig-Essono' 'Damase' 'Ange Raphael' 'Obobogo' 'Efoulan' 'Essos' 'Mbanga' 'Mvog Ada' 'Nyom1' 'Mvan' 'Nkondengui' 'Bonapriso' 'Bafoussam' 'Eleveur' 'Buea' 'Cité Sic' 'PK18' 'Ekié' 'Ndogpassi2' 'PK20' 'Ngoumou' 'Ekounou' 'Mbalmayo' 'Pk24' 'Sangmélima' 'Tropicana' 'Deido' 'Biteng' 'Ndogbati' 'Messamendongo' 'Ngousso' 'Nsimeyong' 'Akwa Nord' 'PK14' 'Ngodi' 'Soboum' 'Nsam' 'Ebolowa' 'Bamenda' 'Dakar' 'Etoudi' 'Borne 10' 'Essomba' 'Etoug Ebe' 'PK13' 'Nkoulouloum' 'Santa Babara' 'Mvolye' 'New Bell' 'Nkomo' 'PK31' 'Quartier Fouda' 'Ndogpassi' 'Nkolmesseng' 'PK10' 'PK30' 'Mbouda' 'Newtown Aeroport' 'Nkongsamba' 'Ekoumdoum' 'Yabassi' 'Batouri' 'Mvog Bi' 'PK22' 'Nkolndongo' 'Loum' 'Messa' 'Olézoa' 'Eséka' 'Akoeman' 'Mvog Mbi' 'Ndogpassi3' 'BEAC' 'Obili' 'Mokolo' 'Bilongue' 'Bertoua' 'Kumba' 'Mont Febe' 'Ntui' 'Mballa 2' 'Mvog Atangana Mballa' 'PK25' 'Melen' 'Etoa Meki' 'Mbengwi' 'Obala' 'Maroua' 'Kondengui' 'Febe' 'PK32']
There are too many categories, and this would make the data very sparse when we do one-hot encoding later. We might decide to build a separate model per city, so that each city has a list of its own neighbourhoods only. We do not want the dataset to become too wide after one-hot encoding.
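To get a feel for how sparse the encoding would be, we can count how many ads each location has; and as a hedged sketch of one possible remedy, rare locations could be grouped under a single label (the threshold of 10 ads is just an assumption):

#How many ads per location
print(dataset['Location'].value_counts().head(20))

#Sketch: group locations with fewer than 10 ads into a single 'Other' category
counts = dataset['Location'].value_counts()
rare_locations = list(counts[counts < 10].index)
data_grouped = dataset.copy()
data_grouped['Location'] = data_grouped['Location'].replace(rare_locations, 'Other')
print(data_grouped['Location'].nunique(), 'categories after grouping')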
f.) Inspecting other relationships
Let us create a “Price per Area” variable and see how it relates to Price.
#Create Price per Area and see its relationship with Price
data = dataset.copy()
data['Price per Area'] = data['Price'] / data['Area']
plt.scatter(data['Price per Area'], data['Price'])
plt.xlabel('Price per Area')
plt.ylabel('Price')
plt.show()
Again, once we take care of the outliers in the “Price” column, this will be fine.

We see again that some prices were entered as price per m² while others are the overall price (price per m² × area). This is the inconsistency in data entry on the website, and we will need to deal with it later. Otherwise the variance will cause our model to produce very poor results.
Great!!! We have now generated enough insight from the data to move on to the next phase, Feature Engineering. In the next stage we will apply what we have observed here and actually start preparing the data for the machine learning model.