End-to-End Machine Learning with Python and SageMaker

Currently, I am preparing to sit for the AWS Machine Learning Specialty exam, and I know that for such exams the best way to prepare is to go quickly through the concepts (inhaling) and then start building projects (exhaling).

So I have decided to build a machine learning app using Python and AWS SageMaker that will predict land prices, taking the specified neighbourhood and the size of land required as inputs to the model.

The goal is to use this project to walk through the machine learning lifecycle together, and to introduce you to the powerhouse of machine learning on AWS: SageMaker. This first article does not use SageMaker yet. We scrape the data outside AWS so that we minimise the cost of doing so inside SageMaker.

Wait!!! What is SageMaker?

It is a managed service provided by AWS that helps data scientists and ML engineers build, train and deploy ML models in the cloud. Since it is a managed service, it takes away much of the hassle of getting your model deployed, with security and performance benefits built in.

It covers a wide range of functionality across the machine learning cycle: labeling, data preparation, feature engineering, anomaly detection, AutoML, training, hyperparameter tuning, hosting and monitoring the deployed model.

You can either use the built-in models in SageMaker or bring your own code and run it on SageMaker directly.

Enough of SageMaker for now; let us focus and go through the machine learning process from the start.

Tools and Disclaimers

  • We will be using Python’s requests, BeautifulSoup, pandas, NumPy, scikit-learn, Jupyter notebooks and SageMaker notebook instances.
  • I assume some basic knowledge of Python and Jupyter notebooks, and some basic notion of web scraping with BeautifulSoup and requests.
  • If you can’t wait to start partying, you can get the raw code from GitHub by clicking here. Feel free to use any resource I provide here.
  • Note that the code is not production-level code. It is just meant to walk you through the ML project process and show you how we can leverage AWS SageMaker to make our lives better.
  • Unfortunately, to avoid charges, I had to delete everything I created; otherwise I would have shared a live view of the app here as well. Just hang on: we will rebuild the app together and, of course, delete it together to avoid incurring charges.

My wish is that at the end of this project you will have inhaled some of these concepts and will later try them out on your own problems, relevant to your locality.

So what are all the parts of this Project?

As seen below, the complete project is broken into four parts (four articles), and this is the first of the four:

  1. Scraping the data (using Python, Requests and BeautifulSoup)
  2. Exploratory Data Analysis and storing to AWS S3 (using SageMaker)
  3. Feature Engineering and selection (using SageMaker)
  4. Building, training and deploying the model (using SageMaker).

So let us move on with the Machine Learning Process before diving into our Jupyter Notebooks to scrape the data at the end of this article.

The Machine Learning Lifecycle. From Business Goal definition to Performance Monitoring.

As seen in the image above, an ML project always starts with the Business Goal, and it remains an iterative process even after deployment.

So let us look at the steps we will follow in this article.

Steps we will follow in this article

  1. Define the business goal
  2. Frame the Machine Learning Problem
  3. Collect the Data through web scraping

In subsequent articles we will continue with the rest of the steps, from Data Processing onwards. So let us assume we are a company that wants to solve the problem our customers have when they try to get information about land prices.

1. Defining the Business Goal

Currently, if someone wants to buy a piece of land in a certain neighbourhood, it is difficult to get a clear price. They need to call several people on the phone to get their idea of prices, then consolidate all their findings before concluding on the price they think is reasonable. It is a painful and time-consuming process. At times they need to go back to the same people several times to get information to compare prices across several neighbourhoods before deciding which one to settle for.

Therefore, building this Land Price App will help put all this information in one place and save the customer the hassle of going through all of that. The app will even help give updated prices for those neighbourhoods.

The company therefore decides that this app should be built and hosted on AWS, which will free up their time to focus on other aspects of the business while relying on AWS for security, scalability and overall performance. Now you and your ML team need to translate this business problem into an ML problem.

2. Framing the Machine Learning Problem

Based on the business problem above, you know you need to predict prices (the output). From every indication, the machine learning task in question here is supervised learning. Remember that in supervised learning we provide the model with labelled data (say, data with neighbourhoods, areas and the actual prices), while in unsupervised learning the data we provide has no labels and the model just has to find patterns in the data.

The ML model will learn from this fully labelled data, so that when the customer subsequently inputs just the neighbourhood and the size of land desired, the trained model will be able to predict the price per square metre for the corresponding neighbourhood.

It is also a regression problem, since Price is a continuous variable.

That is great so far, since we are clear on the machine learning task. The next thing is to go shopping for a dataset that contains at least “Price”, “Size of Land (or Area)” and “Neighbourhood (or Location)”.

Fortunately, we have a classified ads website where people post land for sale in our country. Let us go check whether it has the attributes we listed above.


Image of the website to be scraped. Notice that there are more than 3000 pages.
Also, I saw that each of these pages has about 11 listings.

You see that we have a listing of lands for sale per neighbourhood.

Let us get into one of the listings and see if we can find the Price, Location and Area (size of the land) in a form we can easily scrape.

The photo of one of the listings. We see the Area at the bottom left (800 m2), the Price is 55 000 FCFA and the Location is “Logbessou”, as seen in the top right of this image.

That is great! We are sure that we can scrape this website and we will get the data we need to train our regression model.

But there is a problem, which you will see if you take a closer look at the two images above, especially if you have some domain knowledge about expected prices in my country.

Some of the prices are very high, meaning that while some sellers enter Price as the price per square metre, others enter the total, that is, price per square metre × total area. Even if you are not from my country, you would still notice this during the Exploratory Data Analysis phase. So domain knowledge matters, the dataset is messy, and we need statistics to help us, especially in the next phase of Exploratory Data Analysis.
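
To make the ambiguity concrete, here is a minimal sketch of the arithmetic. The function name candidate_unit_prices is hypothetical and the second listing is made up for illustration; only the 800 m2 and 55 000 FCFA figures come from the listing shown earlier.

```python
def candidate_unit_prices(listed_price, area_m2):
    """Return the price per m2 under each of the two possible readings."""
    return {
        # Reading 1: the seller typed the price per square metre.
        "listed_is_per_m2": listed_price,
        # Reading 2: the seller typed the total price for the whole plot.
        "listed_is_total": listed_price / area_m2,
    }


# Figures from the listing above: 800 m2 at a listed price of 55 000 FCFA.
print(candidate_unit_prices(55_000, 800))

# A made-up listing where the seller entered the total instead.
print(candidate_unit_prices(44_000_000, 800))
```

Read as a total, 55 000 FCFA would imply under 70 FCFA per square metre, while 44 000 000 FCFA read as a total gives 55 000 FCFA per square metre. Comparing the two readings against typical local prices is one possible way to guess which convention a seller used.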

This is not a Kaggle dataset, guys. Real-world data is messy and dirty. Let us leave that cleaning for the next article and go grab our data using some of our web scraping skills.

3. Collecting the Data

Here we will use BeautifulSoup and requests to get the data through web scraping. Remember that we do not want to strain this website’s resources, so please scrape with consideration.

In the GitHub link here, you can find all the code in the Land_Scraper notebook.

Now roll up your sleeves and let’s start coding.

a.) Import the libraries
# Importing Libraries required to scrape the data
import requests
from bs4 import BeautifulSoup
import pandas as pd

Requests will query the URL to extract the data from the given website, while BeautifulSoup will parse (or transform) the data so that we can load the attributes we need into a dictionary. Together, these tools will do your ETL (Extract, Transform, Load).
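
As a rough offline illustration of the transform and load steps, the sketch below parses a small hand-written HTML snippet instead of a downloaded page. The markup and class names here are hypothetical and much simpler than the real website.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded listing page.
html = """
<html><head><title>Land for sale</title></head>
<body>
  <div class="listing">
    <span class="price">55,000 FCFA</span>
    <span class="location">Logbessou</span>
    <span class="area">800 m2</span>
  </div>
</body></html>
"""

# Transform: parse the raw HTML into a searchable tree.
soup = BeautifulSoup(html, "html.parser")
listing = soup.find("div", class_="listing")

# Load: keep only the attributes we care about in a dictionary.
record = {
    "Price": listing.find("span", class_="price").get_text(strip=True),
    "Location": listing.find("span", class_="location").get_text(strip=True),
    "Area": listing.find("span", class_="area").get_text(strip=True),
}
print(record)
```

In the real scraper the HTML string would come from requests rather than from a literal, but the parsing step works the same way.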

b.) Writing functions for the ETL process

Let us create our three ETL functions: get_urls(page_number), extract_page(url) and transform_page(soup).

The get_urls(page_number) function provides a list of URLs (one per listing) for the page specified. As seen earlier, we have more than 3 000 pages and about 11 listings per page. The function builds a list of all the URLs found on the selected page.

# Use requests and BeautifulSoup to collect the URLs of the listings on one results page
def get_urls(page_number):
    base_url = 'https://www.jumia.cm'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # Pass headers by keyword: the second positional argument of requests.get is params
    request = requests.get(f'https://www.jumia.cm/en/land-plots?page={page_number}&xhr=ugmii', headers=headers)
    soup = BeautifulSoup(request.text, 'html.parser')
    print(f"Getting the Urls for page {page_number}")
    # Each listing sits in an <article> tag whose first link holds a relative URL
    page_urls = []
    for partial_url in soup.find_all('article'):
        page_urls.append(base_url + partial_url.find('a')['href'])
    return page_urls

Next, extract_page(url) is used to extract the data from a listing when its URL is provided. It takes each of the URLs above, downloads the page and parses it using BeautifulSoup.

# Download one listing page and parse it with BeautifulSoup
def extract_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # Pass headers by keyword so they are sent as HTTP headers, not as query parameters
    request = requests.get(url, headers=headers)
    soup = BeautifulSoup(request.text, 'html.parser')
    return soup

Finally, the transform_page(soup) function completes the process by loading the data we need into a dictionary.

# Pull the attributes we need out of a parsed listing page and store them in a dictionary
def transform_page(soup):
    main_div = soup.find('div', class_='twocolumn')
    price = main_div.find('span', {'class': 'price'}).get_text(strip=True).replace('FCFA',"")
    location = main_div.select('dl > dd')[1].text.strip()
    try:
        area = main_div.find_all('h3')[1].get_text(strip=True).replace('Area', "").replace(' m2',"")
    except IndexError:
        # Fall back to an empty string when the second <h3> is missing
        area = ''

    items = {
        'Price': price,
        'Location': location,
        'Area': area
    }

    print(f"Scraping the page '{soup.find('title').text}'...")
    return items

c.) Scraping the data and storing in a dictionary

We will create an empty list, url_list, to which we will add the URLs from all the pages we decide to scrape. If we scrape, say, 10 pages and each page has about 11 listings, we expect url_list to end up with about 110 URLs.

In the code below I scraped the first 100 pages as a demo. You will need to scrape more pages.

Please be gentle with the website. Do not scrape too much.
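
One simple way to be gentle is to pause between requests. The helper below is a minimal sketch of that idea: the name fetch_politely and the default delay are my own assumptions, not part of the original notebook, and the demonstration uses a stand-in fetch function so that it runs without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Offline demonstration with a stand-in fetch function (no network calls).
pages = fetch_politely(["page-1", "page-2", "page-3"], fetch=str.upper, delay_seconds=0.01)
print(pages)
```

In the scraper, a function such as extract_page could be passed as fetch, with a delay of a second or more between listings.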

After we have the list of URLs, the next thing is to extract, transform and load the attributes we need from each listing into a list of dictionaries, which will become our data frame.

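# Collect the listing URLs from page 1 up to the number of pages required (here the first 100 pages, as a demo)
url_list = []
for page_number in range(1, 101):
    # get_urls is expected to return the list of listing URLs found on that page
    url_list.extend(get_urls(page_number))

# Extract and transform the data from every listing URL collected above
land_data_list = []
for url in url_list:
    page = extract_page(url)
    # transform_page is expected to return one dictionary per listing
    land_data_list.append(transform_page(page))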

We can see that the listings have been extracted. Notice that these listings change constantly as the website is updated, so your output will not be the same as mine.

d.) Saving the scraped data as a CSV file using pandas

We will create a dataframe using pandas, do some cleaning to remove the currency from the Price column and the “m2” from the Area column, and store the scraped data as a CSV file.

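# Creating a pandas dataframe
df = pd.DataFrame(land_data_list)

# Formatting the Area and Price columns from text to numeric.
# Assign the result back instead of using inplace=True on a single column,
# which newer pandas versions warn about or silently ignore.
df['Area'] = df['Area'].replace({' m2': '', ',': ''}, regex=True)
df['Area'] = pd.to_numeric(df['Area'], errors='coerce')  # values that cannot be parsed become NaN

df['Price'] = df['Price'].replace({'FCFA': '', ',': ''}, regex=True)
df['Price'] = pd.to_numeric(df['Price'])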

“Price” and “Area” are now looking good. Now all we need to do is convert the dataframe to a CSV file and save it, so that we can use it in Part 2 of this project.

df.to_csv('land_price_data.csv',index = False)

Everything looks good.
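
If you want to see what the cleaning step does to individual values, here is a small self-contained sketch on made-up rows (these are not real listings, and the exact raw strings on the website may differ).

```python
import pandas as pd

# Made-up rows that mimic raw scraped strings.
sample = pd.DataFrame({
    "Price": ["55,000FCFA", "12,000,000FCFA"],
    "Area": ["800 m2", "n/a"],
})

# Strip the unit text and thousands separators, then convert to numbers.
sample["Area"] = sample["Area"].replace({" m2": "", ",": ""}, regex=True)
# errors="coerce" turns anything that is still not a number into NaN.
sample["Area"] = pd.to_numeric(sample["Area"], errors="coerce")

sample["Price"] = sample["Price"].replace({"FCFA": "", ",": ""}, regex=True)
sample["Price"] = pd.to_numeric(sample["Price"])

print(sample)
```

The missing area ends up as NaN rather than raising an error, which leaves the decision about how to handle it to the Exploratory Data Analysis phase.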

Congratulations. We have succeeded in scraping the data we need to solve our business problem.

Let us take that good news to our managers, while we prepare to continue with Exploratory Data Analysis in the next article.

Wish you Good Data Luck!!!
