Web Scraping Sports Data: Scrape sports.yahoo.com
By: Manthan Koolwal (Author Bio Below)
Scraped sports data is useful for performance analytics. You can find this data on websites like NBA, FIFA, NFL, Yahoo Sports, etc. The data can also be used for creating your own sports app.
Using web scraping you can show near-real-time data in your app or web app. In this post we will learn to scrape sports.yahoo.com for FIFA 2022 data.
We will use Python, a popular choice for web scraping. By the end of this article, you will be able to scrape live FIFA data. Scraping sports data is straightforward, and we’ll go through it step by step.
Why use Python to Scrape Sports Data
Python is a versatile language and is used extensively for web scraping. Moreover, it has dedicated libraries for scraping the web.
It also has a large community, so you can usually find help when you run into trouble. If you are new to web scraping with Python, I recommend going through this comprehensive guide to web scraping with it.
Requirements for scraping Yahoo Sports
We assume for the duration of this post that you already have Python 3.x installed on your computer. Along with that, you need to install two more libraries which will be used further in this tutorial for web scraping.
- Requests will help us to make an HTTP connection with Yahoo Sports.
- BeautifulSoup will help us to create an HTML tree for smooth data extraction.
Setup
First, create a folder and then install the libraries mentioned above. Enter these commands in a terminal such as the Anaconda Prompt.
- mkdir sports
- pip install requests
- pip install beautifulsoup4
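Before writing the scraper, it can help to confirm that both libraries were installed into the Python environment you are actually using. A minimal check, assuming nothing beyond the two packages listed above:

```python
# Quick sanity check: both imports should succeed after the pip installs.
import bs4
import requests

# Each package exposes its version string as __version__.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

If either import fails with ModuleNotFoundError, pip most likely installed the package into a different environment than the one running the script.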
Inside this folder, create a Python file where the code for the scraper will live. We are going to scrape live game data from the target website.
How to Scrape Yahoo Sports
First, we should make a normal GET request to the target URL and check whether it returns status 200 or not.
- import requests
- from bs4 import BeautifulSoup
- l=list()
- o={}
- target_url="https://sports.yahoo.com/soccer/world-cup/scoreboard/"
- resp=requests.get(target_url)
- print(resp.status_code)
Let me explain what we have done here. We have declared a target URL and then we have made an HTTP GET request to the target URL.
If it prints 200, the request has worked. Otherwise, you can pass a User-Agent header to make the request look like it comes from a real browser. Now, we can use BS4 to extract useful data. The code continues:
- soup=BeautifulSoup(resp.text, 'html.parser')
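If the plain request returns something other than 200, one option is to send a browser-like User-Agent header along with it. The following is a minimal sketch, not the only way to do it: `HEADERS` and `fetch_html` are names introduced here for illustration, and the User-Agent string is just an example value.

```python
import requests

# Assumption: an example desktop-browser User-Agent; any current browser string can be used.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}


def fetch_html(url, timeout=10):
    """Return the page HTML, or raise if the server answers with a 4xx/5xx status."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return resp.text
```

Calling `fetch_html(target_url)` would then replace the bare `requests.get(target_url)` call, and the returned string can be passed to BeautifulSoup in place of `resp.text`. The timeout keeps the script from hanging if the server never responds.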
Let’s find the DOM location of live data.
All the live data is located under an a (anchor) tag with the class gamecard-in_progress. Let’s declare a variable where we can hold all this data in one place.
- allData = soup.find("a",{"class":"gamecard-in_progress"})
- teams= allData.find_all("li",{"class":"team"})
The allData variable stores the complete tree under the gamecard-in_progress tag. teams is a list that holds the information for the two teams shown inside the box.
Now, inspect the page to find where the names and scores are located.
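One thing to keep in mind is that `soup.find` returns None when no match is in progress, so calling `find_all` on the result would raise an AttributeError. Here is a small sketch of a guard for that case. The sample markup is hypothetical and only mimics the structure described above, and `find_live_teams` is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the described structure; the real page differs in detail.
SAMPLE_HTML = """
<a class="gamecard-in_progress" href="#">
  <ul>
    <li class="team"><span data-tst="first-name">Home FC</span></li>
    <li class="team"><span data-tst="first-name">Away FC</span></li>
  </ul>
</a>
"""


def find_live_teams(html):
    """Return the team <li> tags of the first live game card, or [] if none is live."""
    soup = BeautifulSoup(html, "html.parser")
    card = soup.find("a", {"class": "gamecard-in_progress"})
    if card is None:  # no game in progress right now
        return []
    return card.find_all("li", {"class": "team"})


print(len(find_live_teams(SAMPLE_HTML)))
print(len(find_live_teams("<html><body>No games</body></html>")))
```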
As you can see the name is stored under the span tag with an attribute of data-tst which has a value of first-name. Let’s see where the score is stored.
The score is stored under the div tag with the class “Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)”. We have all the information to extract all the data we need for live score updates.
- o["Team-A"]=teams[0].find("span",{"data-tst":"first-name"}).text
- o["Score-A"]=teams[0].find("div",{"class":"Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"}).text
- o["Team-B"]=teams[1].find("span",{"data-tst":"first-name"}).text
- o["Score-B"]=teams[1].find("div",{"class":"Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"}).text
- l.append(o)
- print(l)
After printing, you should get the LIVE updates from any FIFA game.
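The steps above read only the first live game. When several matches are in progress at once, the same lookups could be run over every card with `find_all`. This is a sketch under stated assumptions: the sample HTML, team names, and scores are made up for illustration, and `parse_live_games` is a hypothetical helper, not part of any library.

```python
from bs4 import BeautifulSoup

# Class string taken from the article; it must match the page's class attribute exactly.
SCORE_CLASS = "Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"

# Made-up sample data that follows the structure described in the article.
SAMPLE_HTML = """
<div>
  <a class="gamecard-in_progress">
    <ul>
      <li class="team"><span data-tst="first-name">Home FC</span>
        <div class="Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)">2</div></li>
      <li class="team"><span data-tst="first-name">Away FC</span>
        <div class="Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)">1</div></li>
    </ul>
  </a>
  <a class="gamecard-in_progress">
    <ul>
      <li class="team"><span data-tst="first-name">North United</span>
        <div class="Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)">0</div></li>
      <li class="team"><span data-tst="first-name">South City</span>
        <div class="Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)">0</div></li>
    </ul>
  </a>
</div>
"""


def parse_live_games(html):
    """Return one dict per live game card, using the same keys as the article."""
    soup = BeautifulSoup(html, "html.parser")
    games = []
    for card in soup.find_all("a", {"class": "gamecard-in_progress"}):
        teams = card.find_all("li", {"class": "team"})
        if len(teams) < 2:  # skip cards that do not list both teams
            continue
        game = {}
        for label, team in zip(("A", "B"), teams):
            name = team.find("span", {"data-tst": "first-name"})
            score = team.find("div", {"class": SCORE_CLASS})
            # Fall back to None when a tag is missing instead of raising.
            game[f"Team-{label}"] = name.get_text(strip=True) if name else None
            game[f"Score-{label}"] = score.get_text(strip=True) if score else None
        games.append(game)
    return games


print(parse_live_games(SAMPLE_HTML))
```

Building a fresh dict per card also avoids appending the same `o` object more than once if the code is later placed inside a loop.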
Complete Code
With just a few changes to the code, you can extract data for upcoming and past games as well. But for now, the code will look like this.
- import requests
- from bs4 import BeautifulSoup
- l=list()
- o={}
- target_url="https://sports.yahoo.com/soccer/world-cup/scoreboard/"
- resp=requests.get(target_url)
- soup=BeautifulSoup(resp.text, 'html.parser')
- allData = soup.find("a",{"class":"gamecard-in_progress"})
- teams= allData.find_all("li",{"class":"team"})
- o["Team-A"]=teams[0].find("span",{"data-tst":"first-name"}).text
- o["Score-A"]=teams[0].find("div",{"class":"Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"}).text
- o["Team-B"]=teams[1].find("span",{"data-tst":"first-name"}).text
- o["Score-B"]=teams[1].find("div",{"class":"Whs(nw) D(tbc) Va(m) Fw(b) Fz(27px)"}).text
- l.append(o)
- print(l)
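The script above takes a single snapshot. To keep scores near real time, it could be re-run on an interval and report only when something changes. The following is a rough sketch of that idea, not a production scheduler: `poll_scores` is a hypothetical helper, `scrape_once` stands for any function that returns the scraped list, and the interval and round count are arbitrary example values.

```python
import time


def poll_scores(scrape_once, interval_seconds=30, rounds=3, sleep=time.sleep):
    """Call scrape_once up to `rounds` times and yield only results that changed."""
    previous = None
    for i in range(rounds):
        current = scrape_once()
        if current != previous:  # report only when the scoreboard changed
            yield current
            previous = current
        if i < rounds - 1:  # no need to wait after the final round
            sleep(interval_seconds)


# Usage with a stand-in scraper that replays made-up results (no network access).
fake_results = iter([[{"Score-A": "0"}], [{"Score-A": "0"}], [{"Score-A": "1"}]])
updates = list(poll_scores(lambda: next(fake_results), interval_seconds=0))
print(updates)
```

A longer interval is gentler on the target site and lowers the chance of being blocked.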
Conclusion
Extracting player and score information is important if you want to create your own app or website. Extracting data from non-English websites and then delivering it in plain English can also be a boost for your app.
Python is a great language for pulling all this information with ease. It has strong community support and a long list of libraries, which makes web scraping approachable for beginners.
But this process will not hold up at scale. After some time, Yahoo Sports may block your IP, and your data pipeline will stop working. For uninterrupted scraping, consider a Web Scraping API that rotates IPs on every new request and uses headless Chrome to reduce the chance of being blocked.
This post was contributed by Manthan Koolwal. Manthan loves to create web scrapers and has been working on them for the last 10 years. He has created data pipelines for multiple MNCs in the past. Currently, he is working on Scrapingdog, a web scraping API that can scrape any website without blockage at any scale.