Web Scraping With Python And Beautiful Soup

What Is Beautiful Soup?


Beautiful Soup is a Python module for extracting information from an HTML page, and it is much better suited to this task than regular expressions. Beautiful Soup transforms a complex HTML document into a tree of Python objects, such as tags, navigable strings, and comments, which makes it a popular choice for web scraping. The module's import name is bs4 (for Beautiful Soup, version 4). To install it, run pip install beautifulsoup4 from the command line. Although beautifulsoup4 is the name used for installation, you import the module with import bs4.


Setting Up Everything


Installing BeautifulSoup

We use the pip3 command to install the necessary modules.

First, we install the lxml module, a fast parser that Beautiful Soup can use.

C:\Users\example> pip3 install lxml


Next, Beautiful Soup itself is installed with the following command.

C:\Users\example> pip3 install bs4


Installing Requests

The requests module lets you easily download files from the Web without having to worry about complicated issues such as network errors, connection problems, and data compression. The requests module doesn't come with Python, so you'll have to install it first.

From the command line, run:

C:\Users\example> pip3 install requests



The requests module was written because Python's urllib2 module is too complicated to use.

Importing the Libraries

The Python Requests library allows you to make use of HTTP within your Python programs in a human-readable way, and the Beautiful Soup module is designed to get web scraping done quickly. 

We will import both Requests and Beautiful Soup with the import statement. For Beautiful Soup, we'll import it from bs4, the package in which Beautiful Soup is found. With both the Requests and Beautiful Soup modules imported, we can move on to first collecting a page and then parsing it.
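The two import lines look like this (BeautifulSoup is imported from the bs4 package, as described above):

```python
import requests
from bs4 import BeautifulSoup
```
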

Creating a BeautifulSoup Object from HTML 

The BeautifulSoup() constructor needs to be called with a string containing the HTML it will parse; the second argument specifies the parser. You can use other parsers as well, such as 'html.parser' or 'html5lib', depending on your needs, but the Beautiful Soup documentation recommends the lxml parser for speed, so make sure you do it this way.

This code uses requests.get() to download the main page of the cppsecrets.com website and then passes the text attribute of the response to the BeautifulSoup() constructor. The BeautifulSoup object it returns is stored in a variable named soup. Enter the following into the interactive shell or your script while your computer is connected to the Internet.
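A minimal sketch of that step might look like the following. The URL is the site used in this article; as a precaution, the sketch falls back to a small inline HTML snippet if no network connection is available, so the parsing step can still be demonstrated.

```python
import bs4
import requests

# A small inline page to fall back on if we are offline.
html = '<html><head><title>Sample</title></head><body><p>Hello</p></body></html>'

try:
    # Download the main page of the site used in this article.
    res = requests.get('https://cppsecrets.com/', timeout=10)
    res.raise_for_status()      # raise an exception if the download failed
    html = res.text
except requests.RequestException:
    pass                        # offline: keep the inline fallback page

# Parse the HTML with the lxml parser and store the result in soup.
soup = bs4.BeautifulSoup(html, 'lxml')
print(type(soup))               # <class 'bs4.BeautifulSoup'>
```
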

Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.

Navigating a Web Page with Beautiful Soup

soup is an object that we can interact with by tag, so you can navigate a web page simply by referring to tag names. Here are some simple ways to navigate through a web page using Beautiful Soup.

Here is an example of how you would do that.
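For instance, with a small page like the one below (the HTML here is an illustrative sample, not taken from cppsecrets.com), you can reach tags directly as attributes of soup:

```python
import bs4

html = """
<html><body>
  <h1>Sample Page</h1>
  <p class="intro">Welcome to the page.</p>
  <a href="/articles">Articles</a>
</body></html>
"""
soup = bs4.BeautifulSoup(html, 'lxml')

print(soup.h1)           # the first <h1> tag: <h1>Sample Page</h1>
print(soup.h1.text)      # just its text: Sample Page
print(soup.p['class'])   # attribute access on a tag: ['intro']
print(soup.find('a'))    # find() returns the first matching tag
```
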



Finding an element through its tag is the easy case. Beautiful Soup can also find a list of elements through the find_all() method it provides. The find_all() method returns a list of items that we can iterate over to find the particular item we are looking for.

Here is an example in which we find all the <a> tags and iterate over them to grab the links.
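A sketch of that loop, again using a small inline page as a stand-in for a downloaded one:

```python
import bs4

html = """
<html><body>
  <a href="https://example.com/one">One</a>
  <a href="https://example.com/two">Two</a>
  <a href="https://example.com/three">Three</a>
</body></html>
"""
soup = bs4.BeautifulSoup(html, 'lxml')

# find_all() returns a list of every matching tag.
for link in soup.find_all('a'):
    print(link.get('href'))   # the value of each tag's href attribute
```
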

In this article, we used Python and Beautiful Soup to scrape data from a website. We gathered information from a web page and worked on filtering that data using the Python bs4 module.

You can continue by collecting more data and storing it in a CSV file. You can also use what you have learned to scrape data from other websites.

