Beautiful Soup Web Scraping Example



Python offers a lot of powerful and easy-to-use tools for scraping websites. One of Python's most useful modules for scraping websites is Beautiful Soup.


Welcome to part 3 of the web scraping with Beautiful Soup 4 tutorial mini-series. In this tutorial we're going to talk more about scraping what you want, specifically with a table example, as well as scraping XML documents, and we begin with the same starting code as before. Disclaimer: this article assumes you have gone through the basic concepts of web scraping; its sole purpose is to list and demonstrate examples, created for educational purposes only. Whatever the target site, the workflow is the same: download the page, create a Beautiful Soup object and define the parser, then implement your own logic to pull out the data you care about. A good exercise for taking things a step further is to scrape data from some other websites and to make sense of the extracted information by visualizing it, for example with the Bokeh plotting library.
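A minimal sketch of the "create a Beautiful Soup object and define the parser" step (the HTML string here is made up purely for illustration):

from bs4 import BeautifulSoup

# A made-up snippet of HTML, used only to show the object-creation step.
html = '<html><body><p>Hello, soup!</p></body></html>'

# Create the Beautiful Soup object and name the parser explicitly.
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.text)   # Hello, soup!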

We can automate examples like the one above in Python with the Beautiful Soup module. A note on the dos and don'ts of web scraping: scraping is legal in one context and illegal in another. For example, it is generally fine when the data extracted consists of directories and telephone listings for personal use, so check the terms of the site you are scraping. Once you have the HTML, Beautiful Soup's find() and find_all() functions are what you will use most often to parse the scraped content and pull useful data out of it.
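A quick sketch of the difference between the two, again on a tiny made-up document:

from bs4 import BeautifulSoup

# A tiny made-up document, used only to contrast find() and find_all().
html = '<ul><li>apple</li><li>banana</li><li>cherry</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li').text)                      # first match only: 'apple'
print([li.text for li in soup.find_all('li')])   # every match: ['apple', 'banana', 'cherry']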

In this example we'll build a small Beautiful Soup program known as a 'web scraper'. It will get data from a Yahoo Finance page about stock options. It's alright if you don't know anything about stock options; the most important thing is that the website has a table of information that we'd like to use in our program. Below is a listing for Apple Computer stock options.

First we need to get the HTML source for the page. Beautiful Soup won't download the content for us; we can do that with Python's urllib module, one of the libraries that comes standard with Python.

Fetching the Yahoo Finance Page



# urllib.request is part of the standard library (in Python 2 this was urllib2).
from urllib.request import urlopen

optionsUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
optionsPage = urlopen(optionsUrl)


This code retrieves the Yahoo Finance HTML and returns a file-like object.
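If you are curious what that raw response looks like before Beautiful Soup touches it, you can read it like any other file. A quick sketch, reusing urlopen and optionsUrl from above (it re-opens the URL so that optionsPage itself is left untouched for the parsing step below):

# Re-open the URL just to peek at the raw response. urlopen() returns
# bytes in Python 3, so decode it before treating it as text.
rawHtml = urlopen(optionsUrl).read().decode('utf-8', errors='replace')
print(rawHtml[:300])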

If you go to the page we opened with Python and use your browser's 'view source' command, you'll see that it's a large, complicated HTML file. It will be Python's job to simplify it and extract the useful data using the BeautifulSoup module. BeautifulSoup is an external module, so you'll have to install it if you haven't already (for example, with pip install beautifulsoup4).


Beautiful Soup Example: Loading a Page

The following code will load the page into BeautifulSoup:

from bs4 import BeautifulSoup

# Parse the downloaded page; naming the parser explicitly avoids a warning.
soup = BeautifulSoup(optionsPage, 'html.parser')

Beautiful Soup Example: Searching


Now we can start trying to extract information from the page source (HTML). We can see that the options have pretty unique-looking names in the 'Symbol' column, something like AAPL130328C00350000. The symbols might be slightly different by the time you read this, but we can solve that problem by using BeautifulSoup to search the document for this unique string.

Let's search the soup variable for this particular option (you may have to substitute a different symbol, just get one from the webpage):

>>> soup.findAll(text='AAPL130328C00350000')
[u'AAPL130328C00350000']

This result isn't very useful yet. It's just a Unicode string (that's what the 'u' means) containing what we searched for. However, BeautifulSoup returns things in a tree format, so we can find the context in which this text occurs by asking for its parent node, like so:

>>> soup.findAll(text='AAPL130328C00350000')[0].parent
<a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a>


We don't see all the information from the table. Let's try the next level higher.

>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent
<td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td>

And again.

>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent.parent
<tr><td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td><td><a href='/q?s=AAPL130328C00350000'>AAPL130328C00350000</a></td><td align='right'><b>1.25</b></td><td align='right'><span id='yfs_c63_AAPL130328C00350000'><b style='color:#000000;'>0.00</b></span></td><td align='right'>0.90</td><td align='right'>1.05</td><td align='right'>10</td><td align='right'>10</td></tr>


Bingo. It's still a little messy, but you can see all of the data that we need is there. If you ignore all the stuff in brackets, you can see that this is just the data from one row.

optionsTable = [
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
]

This code is a little dense, so let's take it apart piece by piece. The code is a list comprehension within a list comprehension. Let's look at the inner one first:

for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})

This uses BeautifulSoup's findAll function to get all of the HTML elements with a td tag, a class of yfnc_h and a nowrap of nowrap. We chose this because it's a unique element in every table entry.

If we had just gotten td's with the class yfnc_h we would have gotten seven elements per table entry. Another thing to note is that we have to wrap the attributes in a dictionary because class is one of Python's reserved words. From the table above it would return this:

<td nowrap='nowrap'><a href='/q/op?s=AAPL&amp;k=110.000000'><strong>110.00</strong></a></td>
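As an aside, newer versions of Beautiful Soup also accept a class_ keyword argument (note the trailing underscore), which avoids the attrs dictionary entirely; a sketch using the same soup object:

# Equivalent search without the attrs dictionary; class_ works around the
# fact that 'class' is a reserved word, and other attributes such as
# nowrap can be passed as ordinary keyword arguments.
cells = soup.findAll('td', class_='yfnc_h', nowrap='nowrap')
print(len(cells))   # one matching cell per table row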


We need to get one level higher and then get the text from all of the child nodes of this node's parent. That's what this code does:
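[x.text for x in y.parent.contents]

For the row we inspected above, this yields the text of each cell in order, something like [u'110.00', u'AAPL130328C00350000', u'1.25', ...].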

This works, but you should be careful if this is code you plan to reuse frequently. If Yahoo changes the way they format their HTML, it could stop working. If you plan to run code like this in an automated way, it would be best to wrap it in a try/except block and validate the output.
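Here is a minimal sketch of that defensive pattern, reusing the optionsTable comprehension from above; the eight-column check is just an assumption based on the row we inspected earlier, so adjust the validation to whatever your own code expects:

try:
    optionsTable = [
        [x.text for x in y.parent.contents]
        for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
    ]
    # Validate the output: expect at least one row, each with the eight
    # columns we saw in the table row above.
    if not optionsTable or any(len(row) != 8 for row in optionsTable):
        raise ValueError('unexpected table layout')
except (AttributeError, ValueError) as err:
    # Yahoo may have changed their HTML; fail loudly rather than pass
    # bad data along to the rest of the program.
    print('Scraping failed:', err)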

This is only a simple Beautiful Soup example, but it gives you an idea of what you can do with HTML and XML parsing in Python. You can find the Beautiful Soup documentation here; in it you'll find a lot more tools for searching and validating HTML documents.


In this tutorial, we will show you how to perform web scraping in Python using Beautiful Soup 4 to get data out of HTML, XML, and other markup languages. We will try to scrape web pages from various different websites (including IMDB), covering Beautiful Soup 4 and Python's basic tools for efficiently and clearly navigating, searching, and parsing an HTML page. We have tried to cover almost all of the functionality of Beautiful Soup 4 in this tutorial, and you can combine the pieces introduced here into one bigger program that captures multiple meaningful items of data from a website and feeds them into some other sub-program as input.

This tutorial is designed to guide you in scraping a web page. The basic requirement behind all of this is to get meaningful data out of a huge, unorganized set of data. The target audience of this tutorial includes:

  • Anyone who wants to know how to scrape web pages in Python using BeautifulSoup 4.

  • Any data science developer or enthusiast who wants to feed this scraped (meaningful) data into Python data science libraries to make better decisions.


There are no mandatory requirements for this tutorial. However, if you have any or all of the following (supercool) prior knowledge, that will be an added advantage −

  • Knowledge of any web-related technologies (HTML/CSS/Document Object Model, etc.).

  • The Python language (Beautiful Soup is a Python package).

  • Prior experience with scraping in any other language.

  • Basic understanding of HTML tree structure.