Web Scraping Sports Data: A BeautifulSoup Tutorial with Selenium

In a nutshell, the timeline of an analytics project goes: start with a question or idea, formulate a method to study it, scrape sports data to feed into your model, and finally analyze the results to see what they actually say. This article is a tutorial on web scraping sports data using the Python packages BeautifulSoup and Selenium. As a case study, we'll do some very basic analysis of the 2021 US Open tournament, but the real focus is how to write the Python code to get the results you want.

Originally, I learned web scraping using this book as a guide, but I've tried to distill the information into a simple example in this article. Still, I would recommend that reference for more in-depth detail.

What is Web Scraping?

Web scraping sports data is the very simple process of taking tables of data that show up on some website and writing a script (in Python, for us) to harvest that data and convert it into a useful, ingestible format on your local machine for analysis. Most websites do not have a convenient "click here to download this data" button, so extra effort is often needed to get the data we want.

To get an idea of what is going on, go to this sports-reference page and open up the HTML source code (in Chrome: press F12 or Ctrl+Shift+I, or right-click and select Inspect). An example of what you should see is below.

[Screenshot: a sports-reference results table (left) alongside its raw HTML source (right).]

When you open a website, the server sends the raw HTML (shown on the right) to your browser. The browser’s job is to translate the source HTML code (together with something we don’t need to care about called ‘CSS’) into a visual medium. That is, everything you see on the screen is delivered to your browser in HTML.

For example, the HTML code on the right contains most of the information needed to render the first row of the table on the left: the tag <td class="right" data-stat="visitor_pts">104</td> indicates that the visitor points column should have the entry "104". Thus, web scraping sports data boils down to downloading the HTML, looking for the relevant table rows and features, and extracting the data. Anything that shows up on the screen can be found in the source code; therefore, anything you see can be scraped.

Before turning to specifics and the Python code showing how to do this with the BeautifulSoup package, we want to include a word of warning. While it is generally perfectly legal to pull data from a website for your own purposes, web scraping sports data can get into some gray territory. For example, sports-reference explicitly prohibits scraping "…in a manner that adversely impacts site performance or access".

Generally speaking, don’t reproduce the data and claim it as your own and don’t use your scripts to send many, many requests to the server in a short period of time. For example, any time I request data from sports-reference inside of a loop, I include a line of code that forces my script to pause for one second in between requests. This is more friendly to the host server.
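
For instance, a polite request loop might look like the following sketch (the list of URLs is hypothetical):

```python
import time

import requests

# Hypothetical list of sports-reference pages to scrape
urls = ['https://www.sports-reference.com/...']

for url in urls:
    response = requests.get(url)
    # ... parse response.text here ...
    time.sleep(1)  # pause one second between requests, to be kind to the host server
```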

Basics of BeautifulSoup for Web Scraping Sports Data

The Python package called BeautifulSoup gives you a way to efficiently search through the 'soup' of tags in a page's HTML to find the data you want. The first step, though, is to ask a website to send its HTML over to you so that you can begin to work with it. For that, we'll use the requests library. The code to create the BeautifulSoup object is shown below.

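The original code was shown as a screenshot; a minimal reconstruction looks something like this (the schedule URL is my stand-in for whichever page you are scraping):

```python
import requests
from bs4 import BeautifulSoup

# An example sports-reference page with a results table
url = 'https://www.basketball-reference.com/leagues/NBA_2021_games.html'
response = requests.get(url)

# Create the BeautifulSoup object from the raw HTML
soup = BeautifulSoup(response.text, 'html.parser')
```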

(For many non-computer scientists, showing code you write publicly is akin to reading your diary aloud. It can be embarrassing to reveal your style. I'm on full display here.) To appreciate the power of the BeautifulSoup object, we first need to understand a little bit about the structure of tables in HTML. Tables are usually built from four tags (illustrated in the short example after this list):

  1. The <table> tag wraps around an entire table
  2. The <tr> tag indicates a row in the table
  3. The <th> tag indicates a table header cell
  4. The <td> tag indicates a data (non-header) cell
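
As a quick illustration, here is a minimal sketch of those four tags in action, reusing the "104" example from above (the team name is made up):

```python
from bs4 import BeautifulSoup

toy_html = """
<table>
  <tr><th>Visitor</th><th>Points</th></tr>
  <tr><td>Knicks</td><td>104</td></tr>
</table>
"""
toy_soup = BeautifulSoup(toy_html, 'html.parser')
print(toy_soup.find_all('td'))  # [<td>Knicks</td>, <td>104</td>]
```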

Because we typically want to pull data from the cells of a table, web scraping sports data mostly boils down to searching for the <td> tags and pulling the data from those cells. The BeautifulSoup object is designed specifically to allow easy searching of the HTML for the tags we want. For example, the method findAll() (spelled find_all() in newer versions of BeautifulSoup) lets us search for all occurrences of a specific tag and store them in a list called cells.

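Continuing with the soup object created above, this is a one-line call:

```python
# Find every table cell on the page
cells = soup.find_all('td')
print(len(cells))  # 897 on our example page
```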

On our example page, this function call finds 897 cells. Each object in the list – e.g. cells[0], cells[1], etc. – is itself another BeautifulSoup object that can be manipulated in the same way as the original soup object. BeautifulSoup helps web scraping sports data by efficiently organizing the HTML for quick searching.

In our case, if we want to scrape sports data off a page, we need to retain a little more structure. In particular, I find it much easier to search for all the table rows and then, for each row, extract the data in each cell. This way, there is less ambiguity about which column and row each <td> tag originated from. The elements of a row stick together.

The code below is typical for any of my web scraping sports data programs. First, I find all the rows in the table. Then, for each row I search for all the cells in that row in order so that I know which column the cell originated in. Finally, I pull the data out with the .get_text() method.

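A hedged reconstruction of that pattern is below; the column positions are assumptions you would confirm by inspecting the table's HTML:

```python
rows = soup.find_all('tr')

start_times = []
home_teams = []

for row in rows:
    cells = row.find_all('td')
    if len(cells) < 4:
        continue  # skip header rows and any row without enough data cells
    # Assumed column order: cell 0 holds the start time, cell 3 the home team
    start_times.append(cells[0].get_text())
    home_teams.append(cells[3].get_text())
```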

This code will go through every row in the table and grab the game's start time and home team. Then, I can do whatever I want with this data. Typically I aggregate the data in lists and build a pandas dataframe in order to save everything off to a .csv file. There are certainly other methods, but that one is simple and effective.
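
A quick sketch of that last step, continuing from the lists built above:

```python
import pandas as pd

# Collect the scraped columns into a dataframe and save it off
df = pd.DataFrame({'start_time': start_times, 'home_team': home_teams})
df.to_csv('games.csv', index=False)
```

Of course, if this were all there was to it, web scraping sports data would be a very simple topic. Enter JavaScript.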

JavaScript’s Effect on Web Scraping

Of course, everything isn't as simple as it appears. Most websites do something a little more complicated than just rendering HTML. In fact, if you want any interactivity at all on your website, chances are you're going to be using JavaScript. Together with HTML and CSS, JavaScript is one of the main workhorses that make the internet go.

JavaScript is a lot like any coding language you would run locally on your own machine, except that it is run by your browser when the page calls on it to perform a task. For a concrete example, we move to the world of golf. Suppose I wanted to know how each golfer did on each hole at last year's US Open. If we didn't know any better, we might assume we could follow this link to the ESPN page for the tournament and scrape the data we want.

The data we want is there, but I had to click the little down arrow next to Bryson DeChambeau's name to get it. In fact, that little click was enough to prevent us from getting the data in the simple manner we're used to. If you follow the method in the last section and try to just access all table elements, you won't get any player's scorecards. These scorecards are "hidden" on the page until you actually click on a player's name. Web scraping sports data requires one extra tool to get around this difficulty.

Selenium for Data Mining

The solution to all of our problems is a Python package called Selenium, which lets us automate the required clicking from inside our Python code. Selenium is a tool to automate interaction with a web browser and the JavaScript on a web page. You can install it via pip (pip install selenium) in the same way you would install any other Python package.

You also need to install a webdriver (an executable) and put it in your path so that Python can find the program. My personal browser of choice is Chrome, so I installed ChromeDriver to make everything work. The first few lines of code to open a web page with Selenium are shown below.
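
This is a minimal sketch; the URL here is a stand-in for the tournament page linked above:

```python
from selenium import webdriver

# Stand-in URL for ESPN's 2021 US Open leaderboard page
url = 'https://www.espn.com/golf/leaderboard'

# Assumes ChromeDriver is installed and on your PATH
browser = webdriver.Chrome()
browser.get(url)
```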

Notice that this isn't terribly different from what we've done before; the above code simply opens up the page for the 2021 US Open in a browser controlled by Selenium. Now we need to instruct Selenium to click on each player's name in order to open the dropdown menu and access the scorecards we want. Let's see how we might make Selenium click on Bryson DeChambeau.
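
A minimal sketch, using the Selenium 4 spelling of the lookup described below:

```python
from selenium.webdriver.common.by import By

# Equivalent to the older find_element_by_partial_link_text helper
link = browser.find_element(By.PARTIAL_LINK_TEXT, 'Bryson DeChambeau')
link.click()
```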

The Selenium lookup find_element_by_partial_link_text (written as find_element(By.PARTIAL_LINK_TEXT, ...) in Selenium 4) simply searches for a link containing the words you provide it. We ask Selenium to find a link containing the name 'Bryson DeChambeau' and to click it.

And that will do it! If you look at the Chrome instance opened by Selenium when you called browser.get(url), you'll now see that Bryson's dropdown menu is visible. At this point, we can ask for the HTML of what is currently displayed on the page, and the data we want will be there. The following code was written by examining that HTML to find the data we want.
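
Here is a hedged reconstruction; the row labels 'PAR' and 'Score' are assumptions about the scorecard's layout that you would confirm by inspecting the HTML:

```python
from bs4 import BeautifulSoup

# Grab the HTML as it exists now, after the click
soup = BeautifulSoup(browser.page_source, 'html.parser')

pars_we_want = []
scores_we_want = []

for row in soup.find_all('tr'):
    texts = [cell.get_text() for cell in row.find_all('td')]
    # Assumed layout: the par and score rows are labeled by their first cell
    if texts and texts[0] == 'PAR':
        pars_we_want = texts
    elif texts and texts[0] == 'Score':
        scores_we_want = texts
```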

This code looks for the rows in Bryson's scorecard and extracts the par and his score on each hole into the lists pars_we_want and scores_we_want.

Just like that, we have all the data we wanted. Really, that is all the technical know-how you'll need to get all the golf data you want; BeautifulSoup and Selenium together are all we need for web scraping sports data. You can filter out the word 'Score' as well as the '33', '34', and '67' (his front 9, back 9, and total round score) to get precisely the data we want.
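
For example, a quick sketch of that filtering (hole_scores is a name I've introduced):

```python
# Drop the row label and the front-nine, back-nine, and total-round entries
hole_scores = [s for s in scores_we_want if s not in ('Score', '33', '34', '67')]
```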

In order to scrape all the hole results from every round and every player, we’re going to need to implement some more code and some loops before dumping everything into a pandas dataframe. But, the tools I’ve shown above are enough. You just have to put all the pieces together in the right order.

Conclusions

Web scraping sports data is more an art than a science. This article described exactly the coding know-how you need to get the data into your Python script. After that, you just need to understand the structure of a web page in order to search for the right HTML elements in the correct locations at the right time. Using Selenium to access the specific parts of the web page we want is the last necessary step to be able to scrape any data we can find on the internet. To see another example of using Selenium for scraping sports data, check out our past article about NHL hockey shot charts.