Python Beautifulsoup Web Scraping



BeautifulSoup is a Python library used for parsing documents (mostly HTML or XML files). Using Requests to obtain the HTML of a page and then parsing whatever information you are looking for with BeautifulSoup from the raw HTML is the quasi-standard web scraping "stack" commonly used by Python programmers for easy-ish tasks. If you find data on the web and there is no direct way to download it, web scraping with Python is a skill you can use to extract the data into a useful form that can then be imported and used in various ways. One practical application of web scraping is gathering the resumes of candidates with a specific skill.

Web scraping python beautifulsoup tutorial with example

Data on the web is largely unstructured, and web scraping helps you collect it and store it in a structured form. There are many ways of scraping websites and online services. One is to use the website's API; for example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook. Another is to access the HTML of the webpage and extract useful data from it. This technique is called web scraping, web harvesting, or web data extraction.

Steps involved in web scraping python beautifulsoup :-

  1. Send a request to the URL of the webpage you want to access.
  2. The server responds to the request by returning the HTML content of the webpage.
  3. Once you have the HTML content, the remaining task is parsing the data.
  4. To do that, navigate and search the parse tree that BeautifulSoup creates.

Installing the required third-party libraries:-

The easiest way to install a library in Python is with pip, which is used to install and manage packages in Python.
pip install requests
pip install html5lib
pip install bs4
Then access the HTML content of the webpage:-
import requests
URL = "http://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)

  1. First, import the requests library and specify the URL of the webpage you want to scrape.
  2. Send an HTTP request to the URL and save the response from the server in a response object called r.
  3. Print r.content to get the raw HTML content of the webpage.

Parse HTML content:-

import requests
from bs4 import BeautifulSoup
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
BeautifulSoup is built on top of HTML parsing libraries such as html.parser and lxml, and you specify which parser library to use when creating the soup object:
soup = BeautifulSoup(r.content, 'html5lib')
This creates a BeautifulSoup object by passing two arguments:
r.content: the raw HTML content.
html5lib: the parser library we want to use.

Libraries used for web scraping python beautifulsoup :-

We will use the following libraries:

  1. Selenium: a web testing library used to automate browser activities.
  2. BeautifulSoup: a Python package for parsing HTML and XML documents; it creates parse trees that make it easy to extract the data.
  3. Pandas: a library for data manipulation and analysis, used here to store the extracted data in the desired format.

Web scraping is the process of gathering information from the Internet: it extracts data and presents it in a format you can easily make sense of. There is a huge amount of information on the web, and new information is constantly being added. If you need large amounts of data from websites that are regularly updated with new content, gathering it manually takes a lot of time and repetition: clicking, scrolling, and searching. Automated web scraping speeds up the data collection process: you write your code once and it fetches the information you want many times, from many pages. Python's Beautiful Soup and Requests libraries are both powerful tools for the job. If you like to learn with hands-on examples, a basic understanding of Python and HTML is all you need.

HTML tags:-

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>first scraping</h1>
<p>Hello World</p>
</body>
</html>
1. <!DOCTYPE html>: starts the document with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The script and meta declarations of the HTML document sit between <head> and </head>.
4. The visible part of the HTML document is between the <body> and </body> tags.
5. Headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.
Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag, and the value must be unique within the HTML document. The class attribute is used to define tags with the same class. We can make use of these ids and classes to locate the data we want, as in the sketch below.
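A minimal sketch of locating data by id and class with BeautifulSoup (the id and class names here are made-up examples, not from any real page):

from bs4 import BeautifulSoup

html = '<div id="main"><p class="quote">Hello World</p></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(id='main'))                # the tag with a unique id
print(soup.find_all('p', class_='quote'))  # all tags with a given class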

The rules for scraping:-

Check the website's Terms and Conditions before you scrape it, and carefully read the statements about the legal use of data; scraped data usually should not be used for commercial purposes. Do not flood the website with requests from your program, as this may break the site. The layout may also change from time to time, so make sure to revisit the site and rewrite your code as needed.

Scraping the Flipkart Website:-

Find the URL that you want to scrape
We are going to scrape the Flipkart website to extract the Price, Name, and Rating of laptops.
The URL for this page is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.
Inspecting the Page
The data is usually nested in tags, so we inspect the page to see in which tag the data we want to scrape is nested.
To inspect the page, just right-click on the element and click on "Inspect".

You will then see a "Browser Inspector Box" open.
Find the data you want to extract
Then extract the Price, Name, and Rating, which are nested in "div" tags.

Web scraping python beautifulsoup Example:-

First, import the libraries:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
For configuration:-
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
products = []
prices = []
ratings = []
driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2")
The code is as follows:-
content = driver.page_source
soup = BeautifulSoup(content, 'html5lib')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)

Run the code and extract the data

To run the code, use the below command:
python web-s.py
Store the data in the required format:-
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

APIs: An Alternative to Web Scraping:-

The Web has grown out of many sources and combines a ton of different technologies, styles, and personalities. APIs (application programming interfaces) allow you to access data in a predefined manner. With an API you can avoid parsing HTML and instead access the data directly in a structured format such as JSON or XML; HTML, by contrast, is a way to visually present content to users. Gathering data through an API is more stable than gathering it through web scraping, because APIs are made to be consumed by programs rather than by human eyes.
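As a minimal sketch of the difference (the endpoint URL is a placeholder, not a real API):

import requests

# Ask an API for structured data instead of scraping HTML
response = requests.get('https://api.example.com/jobs?location=Australia')
data = response.json()  # already structured; no HTML parsing needed
print(data)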
Scraping the Monster Job Site:-
We will build a web scraper that fetches Software Developer job listings from the Monster job aggregator site.
The scraper will parse the HTML to pick out the pieces of information we want and filter the content for specific words.
Inspect Your Data Source:-
Click through the site and interact with it just like any normal user would.
In this example, you could search for Software Developer jobs in Australia using the site's native search interface:

Query parameters generally consist of three things:-

  1. Start: The query parameters begin after a question mark (?).
  2. Information: The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value).
  3. Separator: A URL can have multiple query parameters, separated from each other by an ampersand (&); the sketch after this list shows how to pull them out of a URL.
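A short sketch of decoding query parameters with the standard library, using the Monster search URL from this example:

from urllib.parse import urlparse, parse_qs

# Split the URL and decode its query string into key-value pairs
url = 'https://www.monster.com/jobs/search/?q=software-developer&where=Australia'
params = parse_qs(urlparse(url).query)
print(params)  # {'q': ['software-developer'], 'where': ['Australia']}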


Hidden Websites:-
Some information is hidden behind a login, and you need to be authenticated before you can see it on the page.
An HTTP request from a Python script is different from accessing the page from a browser, but the requests library offers some advanced techniques for getting at content behind a login.
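One such technique is a session-based login with Requests; a minimal sketch (the URL and form field names are placeholders, not a real site's):

import requests

# Log in once, then reuse the authenticated session for later requests
session = requests.Session()
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
page = session.get('https://example.com/members-only')
print(page.status_code)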
Dynamic Websites:-
Static websites are easy to work with because the server sends you an HTML page that already contains all the information as a response.
You can then parse the HTML response with Beautiful Soup and begin to pick out the relevant data.
With a dynamic website, however, the server might not send back HTML at all; instead, you may receive JavaScript code as a response.

Parse HTML Code with Beautiful Soup:-

pip3 install beautifulsoup4
After installing it, import the library and create a BeautifulSoup object:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.monster.com/jobs/search/?q=software-developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
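With the soup object in hand, you can start picking out elements. A hedged sketch (the 'ResultsContainer' id and the tag/class names are assumptions about the page structure, not guaranteed to match the live site):

# Find the container holding the search results, then each job title
results = soup.find(id='ResultsContainer')
for job in results.find_all('section', class_='card-content'):
    title = job.find('h2')
    if title:
        print(title.text.strip())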

Find the URL you want to scrape:-

Suppose you want to scrape the web to find speeches by famous politicians, then scrape the text of each speech and analyze it for how often certain topics or phrases come up.
Before you start scraping a site, check the rules of the website first.
The rules can be found in the robots.txt file, which is located by adding a /robots.txt path to the main domain of the site.
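You can also check robots.txt programmatically with the standard library; a minimal sketch, using the Monster domain from the earlier example:

from urllib.robotparser import RobotFileParser

# Ask the site's robots.txt whether a path may be fetched
rp = RobotFileParser('https://www.monster.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.monster.com/jobs/search/'))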

Identify the structure of the site's HTML:-

After finding a site to scrape, use Chrome's developer tools to inspect the site's HTML structure.
This is important because most of the time you will want to scrape data from specific HTML elements, or from elements with specific classes or IDs.
Using the inspect tool you can identify which elements you need to target.

Install Beautiful Soup and Requests:-

There are other packages and frameworks, like Scrapy, but Beautiful Soup is all you need to parse the HTML.
Along with Beautiful Soup we need to install the Requests library, which will fetch the URL's content.
The Beautiful Soup documentation has a lot of examples to help get you started as well.
$ pip install requests
$ pip install beautifulsoup4

Web Scraping Code:-
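A minimal sketch of the scraping code, assuming a page of your choosing (the URL below is a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it
page = requests.get('https://example.com/speeches')
soup = BeautifulSoup(page.content, 'html.parser')

# Find all of the <p> elements in the HTML and print their text
for p in soup.find_all('p'):
    print(p.text)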

Results:-

This finds all of the <p> elements in the HTML.
The .text attribute selects only the text from inside each <p> element.

The raw output is messy, so filtering the results with Beautiful Soup's text handling gives a cleaner return.
There are other ways to search, filter, and isolate the results you want from the HTML.
You can also be more specific and find elements with a specific class:
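# class_ has a trailing underscore because 'class' is reserved in Python
cool_paragraphs = soup.find_all('div', class_='cool_paragraph')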

This would find all the <div> elements with the class "cool_paragraph".

Sometimes we need to extract information from websites. We can extract data from a website by using its available APIs. But there are websites where APIs are not available.

Here, Web scraping comes into play!

Python is widely used in web scraping for the ease it provides in writing the core logic. Whether you are a data scientist, developer, engineer, or someone who works with large amounts of data, web scraping with Python is of great help.

Python Web Scraping Using Beautifulsoup

Without a direct way to download the data, you are left with web scraping in Python, which can extract massive quantities of data without any hassle and within a short period of time.

In this tutorial, we shall be looking into scraping using some very powerful Python-based libraries like BeautifulSoup and Selenium.

BeautifulSoup and urllib

BeautifulSoup is a Python library for pulling data out of HTML and XML files. But it does not fetch data from a webpage by itself, so here we will use the urllib library to fetch the webpage.

First we need to install the Python web scraping BeautifulSoup4 plugin in our system using the following commands:

$ sudo pip install BeautifulSoup4

$ pip install lxml

OR

$ sudo apt-get install python3-bs4

$ sudo apt-get install python-lxml

Here I am going to extract the homepage of the website https://www.botreetechnologies.com.

from urllib.request import urlopen
from bs4 import BeautifulSoup

We import the packages we are going to use in our program. Now we will fetch our webpage using the following:

response = urlopen('https://www.botreetechnologies.com/case-studies')

Beautiful Soup does not work directly on the response object we just fetched, so we need to parse it as HTML/XML data:

data = BeautifulSoup(response.read(),'lxml')

Here we parsed the HTML content of our webpage using the lxml parser.

As you can see on the web page, there are many case studies available, and I want to read all of them.

There is a title for each case study at the top, followed by some details related to that case. I want to extract all that information.

We can extract an element based on tag, class, id, XPath, etc.

You can get the class of an element by simply right-clicking on that element and selecting Inspect Element.

case_studies = data.find('div', { 'class' : 'content-section' })

If there are multiple elements of this class on the page, find() will return only the first one. So if you want to get all the elements having this class, use the findAll() method:

case_studies = data.findAll('div', { 'class' : 'content-section' })

Now we have each div with the class 'content-section' containing its child elements. We will get the <h2> tag for the 'TITLE' and the <ul> tag for its children, the <li> elements:

for case_stud in case_studies:
    title = case_stud.find('h2').find('a').text
    case_stud_details = case_stud.find('ul').findAll('li')

Now we have the list of all children of the <ul> element.

To get the first element from the list, simply write:

case_stud_details[0]

We can extract any attribute of an element; for example, we can get the text of an element by using:

case_stud_details[2].text

But here I want to click on the 'TITLE' of each case study and open its details page to get all the information.

Since we want to interact with the website to get the dynamic content, we need to imitate normal user interaction. Such behaviour cannot be achieved using BeautifulSoup or urllib; hence we need a webdriver to do this.

A webdriver basically creates a new browser window which we can control programmatically. It also lets us capture user events like clicks and scrolls.

Selenium is one such webdriver.

Selenium Webdriver

Selenium webdriver accepts commands, sends them to a browser, and retrieves the results.

You can install selenium on your system using the following simple command:

$ sudo pip install selenium

In order to use it, we need to import selenium in our Python script:

from selenium import webdriver

I am using the Firefox webdriver in this tutorial. Now we are ready to fetch our webpage, and we can do this by using the following:

self.url = 'https://www.botreetechnologies.com/'

self.browser = webdriver.Firefox()

Now we need to click on ‘CASE-STUDIES’ to open that page.

We can click on a selenium element by using the following piece of code:

self.browser.find_element_by_xpath("//div[contains(@id,'navbar')]/ul[2]/li[1]").click()

Now we are taken to the case-studies page, where all the case studies are listed with some information.

Here, I want to click on each case study and open its details page to extract all available information.


So I created a list of links for all the case studies and loaded them one after the other.

To load the previous page you can use the following piece of code:

self.browser.execute_script('window.history.go(-1)')

The final script for using Selenium looks as follows.
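A minimal sketch stitched together from the snippets above (the class name and method layout are assumptions for illustration):

from selenium import webdriver
from bs4 import BeautifulSoup

class CaseStudyScraper:
    def __init__(self):
        self.url = 'https://www.botreetechnologies.com/'
        self.browser = webdriver.Firefox()

    def scrape(self):
        # Open the homepage and navigate to the case studies page
        self.browser.get(self.url)
        self.browser.find_element_by_xpath(
            "//div[contains(@id,'navbar')]/ul[2]/li[1]").click()

        # Parse the rendered page and print each case study title
        soup = BeautifulSoup(self.browser.page_source, 'lxml')
        for case_stud in soup.findAll('div', {'class': 'content-section'}):
            print(case_stud.find('h2').find('a').text)

        # Go back to the previous page
        self.browser.execute_script('window.history.go(-1)')

scraper = CaseStudyScraper()
scraper.scrape()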

And we are done. Now you can extract static webpages or interact with webpages using the above script.

Conclusion: Web Scraping Python is an essential Skill to have

Today, more than ever, companies are working with huge amounts of data. Learning how to scrape data in Python web scraping projects will take you a long way. In this tutorial, you learned Python web scraping with Beautiful Soup.

Along with that, Python web scraping with selenium is also a useful skill. Companies need data engineers who can extract data and deliver it to them for gathering useful insights. You have a high chance of success in data extraction if you are working on Python web scraping projects.

If you want to hire Python developers for web scraping, then contact BoTree Technologies. We have a team of engineers who are experts in web scraping. Give us a call today.

Consulting is free – let us help you grow!