Video Scraping With Beautiful Soup and Python

Last updated on August 14th, 2017 at 01:21 pm EST

In our quest to build a working third-party Kodi video addon, we have already learnt a few things, such as creating a menu and playing a video. The most important part of a video addon is the logic that searches a website and retrieves the video link, which you can then play from within Kodi (this is called scraping). In this post we are going to see how you can scrape a website for video links with Beautiful Soup and Python. We will be scraping YouTube, because it’s easy and will give you confidence as well. Do note that we are only processing the first page of search results. Let us go ahead with our first video scraping tutorial with Beautiful Soup.

Disclaimer

This article is for Educational Purposes only. Please check the laws for web scraping for your country and the website you are scraping. We are not responsible for companies suing you or law enforcement, intelligence or secret services knocking at your door.

Source

You can find the source code for the Python Script here.

Let’s Dive In

In this example we will be scraping YouTube, based on a search term that we provide. You need to know basic HTML tags. We will be using Beautiful Soup, a Python library for getting the data we want out of HTML and XML files or sources. As this is our first video scraping example, we decided to choose an easy one.

We need to import two Python libraries in our code.

import requests
from bs4 import BeautifulSoup

If you haven’t installed these libraries you can find steps on how to do that here.

Step 1 – Open Youtube.com in your Browser

We go to youtube.com in our browser. We prefer Firefox as it makes what we do in Step 3 easier, but you can also use Chrome.

Video Scraping with Beautiful Soup 001

The Python code to open youtube.com and get its HTML is

sb_get = requests.get("https://www.youtube.com")

To see the HTML, we need to use

sb_get.content

In this case, YouTube doesn’t know which browser the request is coming from (by default, requests identifies itself as python-requests rather than as a browser), so we may be blocked. Hence we define a variable with the details of Firefox and pass it along with the above command; such extra request details are generally called headers. You can choose the User-Agent header of any browser you want.

mozhdr = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'}

sb_get = requests.get("https://www.youtube.com", headers = mozhdr)
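If you want to check which headers a request would carry before anything is sent to YouTube, requests can build and prepare a request without sending it. The following is a minimal sketch that reuses the mozhdr dictionary from above; the variable name prepared is our own choice for illustration.

```python
import requests

# Same header dictionary as above; any browser's User-Agent string would do.
mozhdr = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'}

# Build the request but do not send it, so nothing goes over the network.
prepared = requests.Request("GET", "https://www.youtube.com", headers=mozhdr).prepare()

# The prepared request carries the User-Agent we supplied.
print(prepared.headers["User-Agent"])
```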

 

Step 2 – Enter Search Term

As we are Game of Thrones fans, we enter Game of Thrones as the search term and then click the search icon. A new page opens up and we get a list of results. The important thing to note here is how the URL changes from

https://www.youtube.com

to

https://www.youtube.com/results?search_query=game+of+thrones

Video Scraping with Beautiful Soup 002

Let us now search for Breaking Bad (another one of our favorite TV shows). The URL changes to

https://www.youtube.com/results?search_query=breaking+bad

So, we know that each space is replaced by a + and the search term is appended to the URL below. Spaces are replaced with a + sign because URLs cannot contain spaces.

https://www.youtube.com/results?search_query=

Equipped with all this knowledge, we can define three variables in our Python code.

scrape_url = "https://www.youtube.com"
search_url = "/results?search_query="
search_hardcode = "game+of+thrones"

We can now combine the three variables above to get our search URL.

sb_url = scrape_url + search_url + search_hardcode
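The search term above is hard-coded with the + signs already in place. If the term comes from user input instead, the standard library can do the replacement for us. This is a small sketch, not part of the original script; build_search_url is a hypothetical helper name.

```python
from urllib.parse import quote_plus

scrape_url = "https://www.youtube.com"
search_url = "/results?search_query="


def build_search_url(search_term):
    """Return the search results URL for a free-text search term."""
    # quote_plus turns spaces into + and percent-encodes characters such as &.
    return scrape_url + search_url + quote_plus(search_term)


print(build_search_url("game of thrones"))
# https://www.youtube.com/results?search_query=game+of+thrones
```

Besides spaces, quote_plus also escapes characters that have a special meaning in a URL, so a search term containing & or ? does not break the query string.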

Now, to get the HTML of the search results page, we use the command below.

sb_get = requests.get(sb_url, headers = mozhdr)

sb_get now holds the response from youtube.com (sb_get.status_code is 200 if all is good), and we can find the HTML in

sb_get.content
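Rather than inspecting the status code by hand, the request and the check can be wrapped in one function. The sketch below is one possible way to do it; fetch_html and the 10 second timeout are our own assumptions, not something the original script uses.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the raw HTML of a page, or raise if the request failed."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # .content is the body as bytes, which Beautiful Soup accepts directly.
    return response.content
```

With this helper, sb_get.content in the steps that follow could be replaced by the return value of fetch_html(sb_url, headers=mozhdr).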

Step 3 – Identify Video link from HTML Source

The HTML source has a lot of stuff in it, but all we need is the link to the video. This is where HTML basics come into the picture. All links on a page are enclosed in an <a> tag. We suggest you learn more about the <a> tag from the link below.

HTML a tag Explained

We go into Firefox while we are on the search results page, and enable Inspector.

Tools – Web Developer – Inspector

Video Scraping with Beautiful Soup 003

Now, when we hover the mouse cursor over the link to a video, we get the details below.

Video Scraping with Beautiful Soup 004

Notice that the <a> tag has all the details we need

  • the link to the video (in href)
  • the Title of the video (in title)

Step 4 – Filter out <a> Tags with Videos

Our page has a lot of <a> tags, but we only need the ones that point to video content. So, we need to figure out a way to identify these <a> tags and pick them out. We need to look for something common to all of them. It is easy for YouTube, but for some websites you need to filter on the <div> tag within which the <a> tag is enclosed. Again, this varies from site to site.
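For a site where the links can only be told apart by their enclosing <div>, the filtering could look roughly like the sketch below. The markup and the class names nav and video-item are invented for illustration and do not come from any real website.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names below are invented for illustration.
sample_html = """
<div class="nav"><a href="/about">About</a></div>
<div class="video-item"><a href="/watch?v=abc123" title="First video">First video</a></div>
<div class="video-item"><a href="/watch?v=def456" title="Second video">Second video</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

video_links = []
for div in soup.find_all("div", class_="video-item"):
    # Only look for <a> tags inside the matching <div> containers.
    video_links.extend(div.find_all("a"))

for link in video_links:
    print(link.get("href"), link.get("title"))
```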

When we move our mouse over the video links, we get the details of the <a> tag. All the video links we need share the <a> tag details marked in green below.

Video Scraping 005

Video Scraping 006

The details marked with the green rectangle are actually the class of the <a> tag.

Video scraping 007

This is what we are going to use to pull out all the <a> tags we need from the HTML source.

Step 5 – Beautiful Soup and find_all

To use Beautiful Soup functions, we first need to ensure that the HTML we have is in a format recognized by Beautiful Soup. The command below takes care of it.

soupeddata = BeautifulSoup(sb_get.content, "html.parser")

The variable soupeddata has HTML content in a format which is recognized by Beautiful Soup.

We now need to find all <a> tags with a specific class, as those are the <a> tags of interest to us.

yt_links = soupeddata.find_all("a", class_ = "yt-uix-tile-link")

We use the find_all function to get all <a> tags whose class includes yt-uix-tile-link. All these <a> tags are stored in a variable called yt_links, which behaves like a Python list.
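To try find_all without contacting YouTube, you can feed Beautiful Soup a small piece of HTML directly. In the sketch below, the first <a> tag is a shortened version of the one shown in Step 6, while the second tag and its guide-item class are invented so that there is something to filter out.

```python
from bs4 import BeautifulSoup

# First tag: shortened from the real example in Step 6.
# Second tag: invented, so that one link does not match.
sample_html = """
<a href="/watch?v=pE2wcBeyNdk" class="yt-uix-tile-link yt-ui-ellipsis spf-link" title="Game of Thrones: The Loot Train Attack (HBO)">Game of Thrones: The Loot Train Attack (HBO)</a>
<a href="/feed/trending" class="guide-item">Trending</a>
"""

soupeddata = BeautifulSoup(sample_html, "html.parser")

# class_ matches a tag if any one of its classes equals the given name.
yt_links = soupeddata.find_all("a", class_="yt-uix-tile-link")

print(len(yt_links))
print(yt_links[0].get("href"))
```

Note that the first tag carries several classes, and matching on just one of them is enough for find_all to return it.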

Step 6 – Grab Links and Title From <a> Tag

We now have a Python list of <a> tags which has all the information we need. We still have some information to filter out as we only need the URL and title. So, we need to get href and title attributes of <a> tag. Since yt_links is a list, we use a Python for loop to process the list and grab the href and title attributes.

for x in yt_links:
    yt_href = x.get("href")
    yt_title = x.get("title")
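As written, the loop overwrites yt_href and yt_title on every pass, so only the last video survives once it finishes. One way to keep every result is to collect them in a list, as in this sketch. The sample markup is made up (the second link and its video ID are invented and deliberately lack a title), and the videos list is our own addition.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the second link deliberately has no title attribute.
sample_html = """
<a class="yt-uix-tile-link" href="/watch?v=pE2wcBeyNdk" title="Game of Thrones: The Loot Train Attack (HBO)">Video</a>
<a class="yt-uix-tile-link" href="/watch?v=xyz">No title here</a>
"""

yt_links = BeautifulSoup(sample_html, "html.parser").find_all("a", class_="yt-uix-tile-link")

videos = []
for x in yt_links:
    yt_href = x.get("href")    # .get returns None if the attribute is missing
    yt_title = x.get("title")
    # Skip tags that lack either attribute instead of storing None values.
    if yt_href and yt_title:
        videos.append({"href": yt_href, "title": yt_title})

print(videos)
```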

Let’s pick one of the <a> tags we filtered out.

<a href="/watch?v=pE2wcBeyNdk" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link " data-sessionlink="itct=CIwBENwwGAEiEwiIhOnNt8rVAhXZxFUKHe9uDzMo9CRSD2dhbWUgb2YgdGhyb25lcw" title="Game of Thrones: The Loot Train Attack (HBO)" aria-describedby="description-id-568845" rel="spf-prefetch" dir="ltr">Game of Thrones: The Loot Train Attack (HBO)</a>

In this case

yt_href will be /watch?v=pE2wcBeyNdk

and

yt_title will be Game of Thrones: The Loot Train Attack (HBO)

Our video url is still not complete, but all we need to do is add

https://www.youtube.com

before it, which we have in variable scrape_url

yt_final = scrape_url + yt_href

This gives us the complete link in the variable yt_final.
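Plain string concatenation works here because every href starts with a slash. As an alternative sketch, the standard library's urljoin produces the same result and also copes with an href that is already a complete URL. Pulling out the video ID with parse_qs is an extra step that the original script does not perform.

```python
from urllib.parse import urljoin, urlparse, parse_qs

scrape_url = "https://www.youtube.com"
yt_href = "/watch?v=pE2wcBeyNdk"

# Resolve the relative link against the site's base URL.
yt_final = urljoin(scrape_url, yt_href)
print(yt_final)
# https://www.youtube.com/watch?v=pE2wcBeyNdk

# Optional: read the video ID from the v query parameter.
video_id = parse_qs(urlparse(yt_final).query)["v"][0]
print(video_id)
# pE2wcBeyNdk
```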

Step 7 – Done

That’s it peeps, we now have the YouTube link and the title of the video. If you execute the code in IDLE and print the variables yt_final and yt_title, you should get output similar to the one below.

Video Scraping 009

Want to convert what we learnt into a Kodi addon? Click the link below for a detailed guide.

Python – How to Create Kodi Video Addon – Youtube Search

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://docs.python-requests.org/en/master/
