
Web scraping is the process of extracting data from websites. It's a powerful technique for gathering information at scale, and Python is a popular choice for web scraping tasks due to its simplicity, versatility and extensive libraries.

In this tutorial, we'll delve into the world of web scraping with Python using the powerful Beautiful Soup library. We'll start by exploring the applications of web scraping. Then, we'll show you how to write your own scraper using sample code. To solidify your understanding, we'll put our skills to the test by scraping data from a real website!

To follow along, a basic understanding of Python and HTML will be helpful.

In this tutorial, you'll learn:
  • How web scraping works
  • How to extract data from local HTML files
  • How to scrape a live website

One important note before we begin:

Web scraping is a powerful tool, but it's important to use it responsibly. Many websites have terms of service that outline their policies on data scraping. It's crucial to check these terms before scraping any website. Some sites may explicitly prohibit scraping, while others may have limitations on scraping frequency or the amount of data extracted. Ignoring these guidelines could result in your IP address being blocked or even legal repercussions.
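
For a quick programmatic check, Python's standard-library urllib.robotparser can tell you whether a site's robots.txt allows fetching a given URL. A minimal sketch (the example.com addresses below are placeholders):

from urllib.robotparser import RobotFileParser

# Placeholder URLs; substitute the site you actually intend to scrape
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()   # download and parse the robots.txt file
print(rp.can_fetch('*', 'https://example.com/some/page'))   # True if allowed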

How Does Web Scraping Work?

Web scraping automates data extraction from websites. It follows a three-step process:

  1. Fetching the Page: Like a web browser, the scraper requests and retrieves the complete HTML document, including the website's structure.
  2. Parsing the HTML: Websites use HTML to define content organization. Web scrapers parse this code to locate the desired data.
  3. Extracting the Data: Using elements within the HTML (tags and classes), the scraper pinpoints the specific information you need. These elements act like labels, guiding the scraper to the relevant data sections.
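
To make these three steps concrete, here is a minimal sketch of the whole pipeline. It assumes the third-party requests library for step 1 (we'll cover installing Beautiful Soup itself in the setup section below):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com').text   # 1. fetch the page's HTML
bsoup = BeautifulSoup(html, 'lxml')               # 2. parse it into a navigable tree
print(bsoup.find('h1').text)                      # 3. extract a specific element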

While fetching and parsing are simple, identifying the data within HTML can be more complex. That's why we'll start by working with local HTML files. Once you're comfortable finding data there, we'll move on to scraping a real website!

How to Extract Data from Local HTML Files

As we mentioned earlier, this tutorial assumes a basic understanding of common HTML tags. However, if you're new to HTML, we recommend learning the fundamental structure and common tags before proceeding.

Why Start with Local Files?

Local HTML files offer a safe and controlled environment to practice your web scraping skills. You can experiment without external factors like network requests or dynamic content. This allows you to build a solid foundation before tackling live websites.

Learning by Doing: Code Alongside Us

For optimal learning, we highly recommend coding along with the tutorial. This hands-on approach reinforces concepts and helps you develop practical web scraping skills.

Let's get started.

Setting Up Your Environment

Before we begin, we need to install some software and libraries.

  1. The first step is to install Python. You can download the latest version of Python from the official website.

  2. The second step is to install an IDE or a text editor for writing Python code. You can use any IDE or text editor that you like.

  3. In the third step, we'll install the essential library for web scraping in Python – Beautiful Soup. We'll be using Beautiful Soup throughout this tutorial to efficiently navigate the structure of HTML documents and retrieve the specific information we desire. To install Beautiful Soup, run the following command:

pip install beautifulsoup4

Optional Recommendation

To avoid dependency conflicts, you can use a Python virtual environment.
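
For example, on macOS or Linux (the Windows activation command differs, as noted in the comment):

python -m venv .venv              # create a virtual environment in the .venv folder
source .venv/bin/activate         # activate it (Windows: .venv\Scripts\activate)
pip install beautifulsoup4        # installs into this environment only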

  4. Extracting data from HTML files often requires us to break down their structure and access specific elements. Here's where parsing tools come in. They act like interpreters, reading the HTML code and transforming it into a format that's easier to navigate and manipulate. While Beautiful Soup can use Python's built-in html.parser, we'll be using lxml in this tutorial. lxml is a fast XML/HTML parser for Python, built on the C libraries libxml2 and libxslt. It efficiently reads and organizes the structure of HTML files, making it easier to extract the data we need. This translates to faster processing and broader compatibility, especially when dealing with complex HTML structures.

Here is the command to install lxml:

pip install lxml

Now we are ready to start scraping.


  1. Create a Project Folder

Create a new folder to store your project files. This helps keep your code and HTML file organized. Give your folder a clear and descriptive name, like "web_scraping_tutorial" or something similar.

  2. Obtain Your HTML File

There are two options for the HTML file you'll be scraping data from:

  • Write Your Own HTML: If you're comfortable with HTML, you can create a simple HTML file with the data structure you want to practice scraping. This gives you complete control over the content and allows you to tailor the experience to your specific needs.
  • Download Our Provided HTML File: We've provided a sample HTML file named "sample.html". You can download this file and place it within your project folder.
  3. Create a Python File

Within your project folder, create a new Python file. You can name it something descriptive like "web_scraper.py". This file will contain the Python code to interact with Beautiful Soup and lxml for web scraping.

Your project folder should look something like this:

web_scraping_tutorial
    |---- sample.html
    |---- web_scraper.py
  4. Open Your Python IDE

Launch your preferred Python IDE or text editor and open the newly created Python file for editing.

We've set up our files. Now, let's write the Python code to extract data from the HTML using Beautiful Soup and lxml. We'll break down each line, explaining how it interacts with the HTML structure to grab the information we want.

Since we're going to use the Beautiful Soup library, we have to import it first. The following line imports the BeautifulSoup class from the bs4 module:

from bs4 import BeautifulSoup

Before we can start scraping data, we need to access the contents of the HTML file. The following code achieves this:

with open('sample.html', 'r') as sample_html:
    content = sample_html.read()

Line 1 opens sample.html in read mode ('r'); the with statement guarantees the file is closed automatically when the block ends. In line 2, the code reads the entire content of the opened HTML file and stores it in the variable content. It's now a string containing the raw HTML code.

Now that we have the HTML content stored as a string in the content variable, it's time to parse it using Beautiful Soup. Here's the code:

bsoup = BeautifulSoup(content, 'lxml')

What Happens During Parsing?

Beautiful Soup takes the raw HTML code (stored in content) and parses it into a tree-like structure; the second argument, 'lxml', tells Beautiful Soup which parser to use under the hood. This structure represents the hierarchical organization of elements within the HTML document. Imagine it like an organized map, where each element (headings, paragraphs, etc.) has its designated location.
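
You can see this tree for yourself: the prettify() method re-renders the parsed document with one tag per line, indented to match its depth in the tree:

print(bsoup.prettify())   # prints the parse tree with indentation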

Extracting Specific Information

Now that we have the parsed HTML structure stored in the bsoup object, it's time to target and extract the data we're interested in. Beautiful Soup provides various methods to navigate and locate elements within the parsed document. Here are some common approaches:

  • Finding by Tag: Use the bsoup.find(tag_name) method to search for the first occurrence of a specific HTML tag (e.g., bsoup.find('h2') to find the first h2 heading). Here is an example along with its output:
from bs4 import BeautifulSoup

with open('sample.html', 'r') as sample_html:
    content = sample_html.read()

bsoup = BeautifulSoup(content, 'lxml')

first_h2 = bsoup.find('h2')   # returns the first matching element, or None
print(first_h2.text)

# output: Top News Stories
  • Finding All by Tag: Use the bsoup.find_all(tag_name) method to locate all elements with a particular tag (e.g., bsoup.find_all('p') to find all paragraphs).

paragraphs = bsoup.find_all('p')
for p in paragraphs:
    print(p.text)

output:

Get the inside scoop on...
A research team has developed ...
By John Doe - August ..
...
  • Finding by Attributes: You can refine your search by specifying HTML element attributes. For example, bsoup.find('div', class_='class-name') finds the first div element with a class attribute of "class-name", as in the sketch below.
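
Here is a short sketch of attribute-based searches; the "article" class name is hypothetical, so swap in a class that actually appears in your HTML file:

# Hypothetical class name; replace it with one from your own HTML
first_article = bsoup.find('div', class_='article')
if first_article is not None:
    print(first_article.text)

# Attribute filters work with find_all too
for div in bsoup.find_all('div', class_='article'):
    print(div.get('id'))   # reads the id attribute, or None if missing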

Local HTML files offer a learning ground, but real websites are different. They often generate HTML dynamically using JavaScript. This means the code you see might not be the complete picture for scraping.

To scrape real websites, you need the final rendered HTML. Here's where browser developer tools come in. These tools, accessible by right-clicking a page and choosing "Inspect", allow you to examine the underlying code and resources loaded by a web page.

Finding Your Target Data

The key element is the Elements tab in developer tools. It displays the website's HTML structure like a tree. While it might be more complex than local files, you can still navigate it. Look for relevant tags, inspect element attributes, and use the search functionality to find the data you want to extract.
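
Putting it all together, here is a minimal sketch of scraping a live page. The URL, tag, and class name are placeholders to replace with what you actually find in the Elements tab, and it assumes the third-party requests library (pip install requests):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/news'          # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()               # stop early on HTTP errors

bsoup = BeautifulSoup(response.text, 'lxml')
for headline in bsoup.find_all('h2', class_='headline'):   # placeholder selector
    print(headline.text.strip())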

Conclusion

This foundational tutorial has empowered you with the core principles of web scraping using Beautiful Soup and lxml. You've gained hands-on experience in setting up your development environment, parsing HTML structures, and extracting targeted data elements. As you venture into real-world web scraping, leverage browser developer tools to inspect the underlying code of live websites. Remember to adhere to ethical scraping practices by respecting robots.txt guidelines and website terms of service.