Scraping Racing Reference for NASCAR Data
Racing Reference is the comprehensive source for up-to-date historical NASCAR race results. In addition, statistics are recorded by driver, owner, and crew chief, making for an awesome foundation of racing data analysis.
Traditional stick and ball sports have seen their data and analytics movement, with a variety of tools to get data from authoritative sources into the hands of eager sports data geeks. NASCAR and racing in general have seemingly been behind on this movement, lacking the open source foundations to unleash the potential of years of historical data in tools like Python and the PyData stack.
I am planning to eventually develop a full fledged Python package to make it easier to retrieve structured NASCAR data with a few parameters. This post covers an attempt to begin to bridge that gap by scraping modern era race results in the Monster Energy NASCAR Cup Series from Racing Reference.
Making Requests
To get started, we’ll import the following four Python packages:
- requests - makes our HTTP requests
- BeautifulSoup - aids in parsing and searching HTML
- re - the Python standard regex module; useful for finding strings and extracting data from those strings
- pandas - the swiss army knife of working with data in Python; provides us with tools to get data into a nice, tidy format for analysis
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
BASE_URL = 'http://racing-reference.info'
Racing Reference uses a RESTful url scheme for accessing various pages. The raceyear endpoint contains high level results of all the races in a given year. For instance, we can receive a page with results for all races in 1979 at this url:
http://racing-reference.info/raceyear/1979/W
‘W’, I believe, stands for Winston Cup, the long-time name of the top level of NASCAR competition. In addition to the date, site, winner, and other race details, the raceyear page contains links to the full details of each race. We’ll need these links later.
First, we’ll request each raceyear from 1979 to present (2018); the NASCAR “modern era”
years = range(1979, 2019)
cup_results = [requests.get(BASE_URL + f'/raceyear/{year}/W') for year in years]
Checking HTTP status codes of our responses to make sure all our requests were successful
set([r.status_code for r in cup_results])
{200}
Each of these yearly cup series results pages has links to race details. Race detail pages have url’s like this:
http://racing-reference.info/race/1979_Winston_Western_500/W
The yearly results pages have HTML <a> tags with relative links. Lets find them all
race_anchors = []
href_regex = re.compile('/race/.*/W')
for c in cup_results:
race_anchors.extend(BeautifulSoup(c.text, 'lxml').find_all(href=href_regex))
race_anchors[:5]
[<a href="/race/1979_Winston_Western_500/W" title="Winston Western 500">1</a>,
<a href="/race/1979_Daytona_500/W" title="Daytona 500">2</a>,
<a href="/race/1979_Carolina_500/W" title="Carolina 500">3</a>,
<a href="/race/1979_Richmond_400/W" title="Richmond 400">4</a>,
<a href="/race/1979_Atlanta_500/W" title="Atlanta 500">5</a>]
We can now use the href attribute of these <a> tags to build a full url to request the race detail pages
races = [requests.get(BASE_URL + a.attrs['href']) for a in race_anchors]
Again, checking the status codes. All 200s is what we’re after
set([r.status_code for r in races])
{200}
Extracting Race Results
To extract the race results stored as an HTML table, we can use the Pandas read_html function.
Given the text of the page, read_html will return a list of dataframes from all tables found. We can filter by using the match argument to find tables containing the provided string or regex
[df.shape for df in pd.read_html(races[0].text, match='Sponsor / Owner', header=0)]
[(83, 398), (35, 11)]
A list of two dataframes was returned. This is due to the nesting of tables in the structure of the race pages.
The last element of the returned list is what we’re after
pd.read_html(races[0].text, match='Sponsor / Owner', header=0)[-1].head()
| Fin | St | # | Driver | Sponsor / Owner | Car | Laps | Money | Status | Led | Pts | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 88 | Darrell Waltrip | Gatorade (DiGard Racing) | Chevrolet | 119 | 21150 | running | 87 | 185 |
| 1 | 2 | 1 | 21 | David Pearson | Purolator (Wood Brothers) | Mercury | 119 | 14200 | running | 9 | 175 |
| 2 | 3 | 2 | 11 | Cale Yarborough | Busch (Junior Johnson) | Oldsmobile | 119 | 12675 | running | 3 | 170 |
| 3 | 4 | 8 | 73 | Bill Schmitt | Old Milwaukee (Bill Schmitt) | Oldsmobile | 118 | 8000 | running | 0 | 160 |
| 4 | 5 | 17 | 1 | Donnie Allison | Hawaiian Tropic (Hoss Ellington) | Chevrolet | 118 | 7550 | running | 0 | 155 |
Extracting Race Details
To help with analysis, it will be useful to extract some further details about the race; laps, track length, track type, and race length in particular
r_details = re.compile(r'(\d+) laps\*? on a (\d?\.\d{3}) mile (.*) \((\d+\.\d+) miles\)')
details_match = r_details.search(races[0].text)
details_match[0]
'119 laps on a 2.620 mile road course (311.8 miles)'
In addition to matching the entire string, our regex captured the following
details_match[1], details_match[2], details_match[3], details_match[4]
('119', '2.620', 'road course', '311.8')
Furthermore, we can simply use the url to extract the year and race
races[0].url
'http://racing-reference.info/race/1979_Winston_Western_500/W'
race_id = races[0].url.split('/')[-2]
race_id
'1979_Winston_Western_500'
r_race_id = re.compile(r'(\d{4})_(.*)')
race_id_match = r_race_id.search(race_id)
race_id_match[1], race_id_match[2]
('1979', 'Winston_Western_500')
It would also be useful to extract the name of the track. We’ll again use a regex to find the url pattern of Racing Reference track pages like:
http://racing-reference.info/tracks/Riverside_International_Raceway
r_track_name = re.compile('/tracks/.*')
BeautifulSoup(races[0].text, 'lxml').find(href=r_track_name).text
'Riverside International Raceway'
Putting it all together to create a dataframe for each modern era race
race_data_frames = []
for r in races:
df = pd.read_html(r.text, match='Sponsor / Owner', header=0)[-1]
details_match = r_details.search(r.text)
df['race_length_laps'] = int(details_match[1])
df['track_length_miles'] = float(details_match[2])
df['track_type'] = details_match[3]
df['race_length_miles'] = float(details_match[4])
race_id = r.url.split('/')[-2]
race_id_match = r_race_id.search(race_id)
df['year'] = int(race_id_match[1])
df['race_name'] = race_id_match[2]
df['track_name'] = BeautifulSoup(r.text, 'lxml').find(href=r_track_name).text
race_data_frames.append(df)
race_data_frames[-1].head()
| Fin | St | # | Driver | Sponsor / Owner | Car | Laps | Status | Led | Pts | PPts | race_length_laps | track_length_miles | track_type | race_length_miles | year | race_name | track_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 78 | Martin Truex, Jr. | Bass Pro Shops / 5-hour Energy (Barney Visser) | Toyota | 160 | running | 31 | 57 | 6 | 160 | 2.5 | paved track | 400.0 | 2018 | Pocono_400 | Pocono Raceway |
| 1 | 2 | 13 | 42 | Kyle Larson | DC Solar (Chip Ganassi) | Chevrolet | 160 | running | 0 | 43 | 0 | 160 | 2.5 | paved track | 400.0 | 2018 | Pocono_400 | Pocono Raceway |
| 2 | 3 | 5 | 18 | Kyle Busch | M&M's Red White & Blue (Joe Gibbs) | Toyota | 160 | running | 13 | 51 | 0 | 160 | 2.5 | paved track | 400.0 | 2018 | Pocono_400 | Pocono Raceway |
| 3 | 4 | 2 | 4 | Kevin Harvick | Busch Beer (Stewart Haas Racing) | Ford | 160 | running | 89 | 52 | 1 | 160 | 2.5 | paved track | 400.0 | 2018 | Pocono_400 | Pocono Raceway |
| 4 | 5 | 17 | 2 | Brad Keselowski | Wurth (Roger Penske) | Ford | 160 | running | 10 | 37 | 0 | 160 | 2.5 | paved track | 400.0 | 2018 | Pocono_400 | Pocono Raceway |
By converting each dataframe in the list to dicts, we can create a single dataframe. The resulting dataframe columns will be a super set of each individual race dataframe’s columns
df = pd.DataFrame([row for r_df in race_data_frames for row in r_df.to_dict(orient='records')])
df.head()
| # | Car | Driver | Fin | Laps | Led | Money | PPts | Pts | Sponsor / Owner | St | Status | race_length_laps | race_length_miles | race_name | track_length_miles | track_name | track_type | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88 | Chevrolet | Darrell Waltrip | 1 | 119 | 87 | 21150.0 | NaN | 185.0 | Gatorade (DiGard Racing) | 4 | running | 119 | 311.8 | Winston_Western_500 | 2.62 | Riverside International Raceway | road course | 1979 |
| 1 | 21 | Mercury | David Pearson | 2 | 119 | 9 | 14200.0 | NaN | 175.0 | Purolator (Wood Brothers) | 1 | running | 119 | 311.8 | Winston_Western_500 | 2.62 | Riverside International Raceway | road course | 1979 |
| 2 | 11 | Oldsmobile | Cale Yarborough | 3 | 119 | 3 | 12675.0 | NaN | 170.0 | Busch (Junior Johnson) | 2 | running | 119 | 311.8 | Winston_Western_500 | 2.62 | Riverside International Raceway | road course | 1979 |
| 3 | 73 | Oldsmobile | Bill Schmitt | 4 | 118 | 0 | 8000.0 | NaN | 160.0 | Old Milwaukee (Bill Schmitt) | 8 | running | 119 | 311.8 | Winston_Western_500 | 2.62 | Riverside International Raceway | road course | 1979 |
| 4 | 1 | Chevrolet | Donnie Allison | 5 | 118 | 0 | 7550.0 | NaN | 155.0 | Hawaiian Tropic (Hoss Ellington) | 17 | running | 119 | 311.8 | Winston_Western_500 | 2.62 | Riverside International Raceway | road course | 1979 |
Track Type
Currently, track type is not very descriptive
df.track_type.unique()
array(['road course', 'paved track'], dtype=object)
Using both the scraped track_type and track_length_miles, we’ll classify a new track_type
def track_type(row):
if row['track_type'] == 'road course':
return 'road course'
elif row['track_length_miles'] >= 2.0:
return 'superspeedway'
elif row['track_length_miles'] >= 1.0:
return 'intermediate'
else:
return 'short track'
df['track_type'] = df.apply(track_type, axis=1)
df[['track_length_miles', 'track_type', 'track_name']].drop_duplicates()\
.sort_values('track_length_miles')\
.reset_index()\
.drop('index', axis=1)
| track_length_miles | track_type | track_name | |
|---|---|---|---|
| 0 | 0.525 | short track | Martinsville Speedway |
| 1 | 0.526 | short track | Martinsville Speedway |
| 2 | 0.533 | short track | Bristol International Raceway |
| 3 | 0.533 | short track | Bristol Motor Speedway |
| 4 | 0.533 | short track | Bristol International Speedway |
| 5 | 0.542 | short track | Richmond Fairgrounds Raceway |
| 6 | 0.596 | short track | Nashville Speedway |
| 7 | 0.625 | short track | North Wilkesboro Speedway |
| 8 | 0.750 | short track | Richmond Raceway |
| 9 | 0.750 | short track | Richmond International Raceway |
| 10 | 1.000 | intermediate | Jeff Gordon Raceway |
| 11 | 1.000 | intermediate | Dover International Speedway |
| 12 | 1.000 | intermediate | Phoenix International Raceway |
| 13 | 1.000 | intermediate | ISM Raceway |
| 14 | 1.000 | intermediate | Dover Downs International Speedway |
| 15 | 1.017 | intermediate | North Carolina Motor Speedway |
| 16 | 1.017 | intermediate | North Carolina Speedway |
| 17 | 1.058 | intermediate | New Hampshire Motor Speedway |
| 18 | 1.058 | intermediate | New Hampshire International Speedway |
| 19 | 1.366 | intermediate | Darlington Raceway |
| 20 | 1.500 | intermediate | Kentucky Speedway |
| 21 | 1.500 | intermediate | Lowe's Motor Speedway |
| 22 | 1.500 | intermediate | Kansas Speedway |
| 23 | 1.500 | intermediate | Las Vegas Motor Speedway |
| 24 | 1.500 | intermediate | Chicagoland Speedway |
| 25 | 1.500 | intermediate | Texas Motor Speedway |
| 26 | 1.500 | intermediate | Charlotte Motor Speedway |
| 27 | 1.500 | intermediate | Homestead-Miami Speedway |
| 28 | 1.522 | intermediate | Atlanta International Raceway |
| 29 | 1.522 | intermediate | Atlanta Motor Speedway |
| 30 | 1.540 | intermediate | Atlanta Motor Speedway |
| 31 | 1.949 | road course | Sears Point Raceway |
| 32 | 1.990 | road course | Infineon Raceway |
| 33 | 1.990 | road course | Sonoma Raceway |
| 34 | 1.990 | road course | Sears Point Raceway |
| 35 | 2.000 | road course | Sears Point Raceway |
| 36 | 2.000 | superspeedway | Auto Club Speedway |
| 37 | 2.000 | superspeedway | Michigan International Speedway |
| 38 | 2.000 | superspeedway | California Speedway |
| 39 | 2.000 | superspeedway | Michigan Speedway |
| 40 | 2.000 | superspeedway | Texas World Speedway |
| 41 | 2.428 | road course | Watkins Glen International |
| 42 | 2.450 | road course | Watkins Glen International |
| 43 | 2.500 | superspeedway | Pocono Raceway |
| 44 | 2.500 | superspeedway | Indianapolis Motor Speedway |
| 45 | 2.500 | superspeedway | Ontario Motor Speedway |
| 46 | 2.500 | superspeedway | Pocono International Raceway |
| 47 | 2.500 | superspeedway | Daytona International Speedway |
| 48 | 2.520 | road course | Sears Point Raceway |
| 49 | 2.520 | road course | Sears Point International Raceway |
| 50 | 2.620 | road course | Riverside International Raceway |
| 51 | 2.660 | superspeedway | Talladega Superspeedway |
| 52 | 2.660 | superspeedway | Alabama International Motor Speedway |
Wrapping Up
We now have single, tidy Pandas DataFrame with all Monster Energy NASCAR Cup Series results since 1979
df.tail()
| # | Car | Driver | Fin | Laps | Led | Money | PPts | Pts | Sponsor / Owner | St | Status | race_length_laps | race_length_miles | race_name | track_length_miles | track_name | track_type | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52614 | 99 | Chevrolet | Derrike Cope | 34 | 152 | 0 | NaN | 0.0 | 3.0 | StarCom Fiber (StarCom Racing) | 38 | running | 160 | 400.0 | Pocono_400 | 2.5 | Pocono Raceway | superspeedway | 2018 |
| 52615 | 11 | Toyota | Denny Hamlin | 35 | 146 | 0 | NaN | 0.0 | 8.0 | FedEx Office (Joe Gibbs) | 10 | crash | 160 | 400.0 | Pocono_400 | 2.5 | Pocono Raceway | superspeedway | 2018 |
| 52616 | 95 | Chevrolet | Kasey Kahne | 36 | 120 | 0 | NaN | 0.0 | 1.0 | FDNY Foundation (Leavine Family Racing) | 22 | transmission | 160 | 400.0 | Pocono_400 | 2.5 | Pocono Raceway | superspeedway | 2018 |
| 52617 | 32 | Ford | Matt DiBenedetto | 37 | 113 | 0 | NaN | 0.0 | 1.0 | Zynga Poker (Archie St. Hilaire) | 32 | brakes | 160 | 400.0 | Pocono_400 | 2.5 | Pocono Raceway | superspeedway | 2018 |
| 52618 | 43 | Chevrolet | Bubba Wallace | 38 | 108 | 4 | NaN | 0.0 | 1.0 | Weis Markets (Richard Petty Motorsports) | 19 | engine | 160 | 400.0 | Pocono_400 | 2.5 | Pocono Raceway | superspeedway | 2018 |
In future posts, I’ll revisit this data for further analysis. For now, I’ll save the DataFrame as a Python pickle file for easy ingestion
df.to_pickle('race_details.pkl')