Scraping Racing Reference for NASCAR Data

Racing Reference is a comprehensive, up-to-date source of historical NASCAR race results. Statistics are also recorded by driver, owner, and crew chief, making for an awesome foundation for racing data analysis.

Traditional stick-and-ball sports have had their data and analytics movement, with a variety of tools to get data from authoritative sources into the hands of eager sports data geeks. NASCAR, and racing in general, seem to have lagged behind this movement, lacking the open source foundations needed to unleash the potential of years of historical data in tools like Python and the PyData stack.

I plan to eventually develop a full-fledged Python package that makes it easier to retrieve structured NASCAR data with a few parameters. This post is a first step toward bridging that gap: scraping modern era race results in the Monster Energy NASCAR Cup Series from Racing Reference.

Making Requests

To get started, we’ll import the following four Python packages:

requests - makes our HTTP requests
BeautifulSoup - aids in parsing and searching HTML
re - the Python standard regex module; useful for finding strings and extracting data from those strings
pandas - the Swiss Army knife of working with data in Python; provides us with tools to get data into a nice, tidy format for analysis

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd


BASE_URL = 'http://racing-reference.info'

Racing Reference uses a RESTful URL scheme for accessing various pages. The raceyear endpoint contains high-level results of all the races in a given year. For instance, we can receive a page with results for all races in 1979 at this URL:

http://racing-reference.info/raceyear/1979/W

‘W’, I believe, stands for Winston Cup, the long-time name of the top level of NASCAR competition. In addition to the date, site, winner, and other race details, the raceyear page contains links to the full details of each race. We’ll need these links later.

First, we’ll request each raceyear page from 1979 to the present (2018), the NASCAR “modern era”:

years = range(1979, 2019)

cup_results = [requests.get(BASE_URL + f'/raceyear/{year}/W') for year in years]

Checking the HTTP status codes of our responses to make sure all our requests were successful:

{r.status_code for r in cup_results}

{200}
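Firing off forty requests back to back works, but it leans on the site, and a failed request only shows up in the status check afterwards. Here is a minimal sketch of a more defensive loop. The helpers `raceyear_url` and `fetch_all` are hypothetical names of my own, and the `delay` and `timeout` defaults are illustrative guesses, not anything Racing Reference asks for. `fetch_all` accepts any getter shaped like `requests.get`, which also makes it easy to try out without touching the network.

```python
import time

BASE_URL = 'http://racing-reference.info'


def raceyear_url(year, series='W'):
    """Build the raceyear URL for one season (same pattern as above)."""
    return f'{BASE_URL}/raceyear/{year}/{series}'


def fetch_all(urls, get, delay=1.0, timeout=30):
    """Fetch each URL in order, pausing between requests.

    `get` is any callable shaped like requests.get. This is a hypothetical
    helper, not part of requests. Raises on the first error response.
    """
    responses = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # wait between requests to go easy on the site
        response = get(url, timeout=timeout)
        response.raise_for_status()  # fail fast instead of checking later
        responses.append(response)
    return responses
```

With requests imported, the call would look like `fetch_all([raceyear_url(y) for y in range(1979, 2019)], requests.get)`.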

Each of these yearly cup series results pages has links to race details. Race detail pages have URLs like this:

http://racing-reference.info/race/1979_Winston_Western_500/W

The yearly results pages have HTML <a> tags with relative links. Let’s find them all:

race_anchors = []
href_regex = re.compile('/race/.*/W')

for c in cup_results:
    race_anchors.extend(BeautifulSoup(c.text, 'lxml').find_all(href=href_regex))

race_anchors[:5]

[<a href="/race/1979_Winston_Western_500/W" title="Winston Western 500">1</a>,
 <a href="/race/1979_Daytona_500/W" title="Daytona 500">2</a>,
 <a href="/race/1979_Carolina_500/W" title="Carolina 500">3</a>,
 <a href="/race/1979_Richmond_400/W" title="Richmond 400">4</a>,
 <a href="/race/1979_Atlanta_500/W" title="Atlanta 500">5</a>]

We can now use the href attribute of these <a> tags to build a full URL to request the race detail pages:

races = [requests.get(BASE_URL + a.attrs['href']) for a in race_anchors]
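Plain string concatenation works here because every href starts with a slash and BASE_URL has no trailing one. As a small sketch of an alternative, the standard library's `urllib.parse.urljoin` resolves a relative link against a base URL without relying on that. The `hrefs` list below is copied from the anchors shown earlier.

```python
from urllib.parse import urljoin

BASE_URL = 'http://racing-reference.info'

# Relative links copied from the anchor output above.
hrefs = [
    '/race/1979_Winston_Western_500/W',
    '/race/1979_Daytona_500/W',
]

# urljoin resolves each relative href against the site root.
race_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(race_urls[0])
```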

Again, checking the status codes. All 200s is what we’re after

{r.status_code for r in races}

{200}

Extracting Race Results

To extract the race results stored as an HTML table, we can use the pandas read_html function.

Given the text of the page, read_html will return a list of dataframes, one for each table found. We can filter with the match argument to keep only tables containing the provided string or regex:

[df.shape for df in pd.read_html(races[0].text, match='Sponsor / Owner', header=0)]

[(83, 398), (35, 11)]

A list of two dataframes was returned. This is due to the nesting of tables in the structure of the race pages.

The last element of the returned list is what we’re after

pd.read_html(races[0].text, match='Sponsor / Owner', header=0)[-1].head()

	Fin	St	#	Driver	Sponsor / Owner	Car	Laps	Money	Status	Led	Pts
0	1	4	88	Darrell Waltrip	Gatorade (DiGard Racing)	Chevrolet	119	21150	running	87	185
1	2	1	21	David Pearson	Purolator (Wood Brothers)	Mercury	119	14200	running	9	175
2	3	2	11	Cale Yarborough	Busch (Junior Johnson)	Oldsmobile	119	12675	running	3	170
3	4	8	73	Bill Schmitt	Old Milwaukee (Bill Schmitt)	Oldsmobile	118	8000	running	0	160
4	5	17	1	Donnie Allison	Hawaiian Tropic (Hoss Ellington)	Chevrolet	118	7550	running	0	155

Extracting Race Details

To help with analysis, it will be useful to extract some further details about the race: laps, track length, track type, and race length in particular.

r_details = re.compile(r'(\d+) laps\*? on a (\d?\.\d{3}) mile (.*) \((\d+\.\d+) miles\)')
details_match = r_details.search(races[0].text)
details_match[0]

'119 laps on a 2.620 mile road course (311.8 miles)'

In addition to matching the entire string, our regex captured the following

details_match[1], details_match[2], details_match[3], details_match[4]

('119', '2.620', 'road course', '311.8')
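Numbered groups are easy to mix up, and `search` returns None when a page does not contain the expected sentence. Below is a minimal sketch of the same pattern with named groups, wrapped in a function that converts the pieces to numbers. The group names and the `parse_details` helper are my own, not something from the page.

```python
import re

# Same pattern as above, with named groups (the group names are my own).
R_DETAILS = re.compile(
    r'(?P<laps>\d+) laps\*? on a (?P<track_length>\d?\.\d{3}) mile '
    r'(?P<track_type>.*) \((?P<race_length>\d+\.\d+) miles\)'
)


def parse_details(page_text):
    """Return the race details as typed values, or None if nothing matches."""
    match = R_DETAILS.search(page_text)
    if match is None:
        return None
    return {
        'race_length_laps': int(match['laps']),
        'track_length_miles': float(match['track_length']),
        'track_type': match['track_type'],
        'race_length_miles': float(match['race_length']),
    }


details = parse_details('119 laps on a 2.620 mile road course (311.8 miles)')
print(details)
```

Returning None for a missing match lets the caller decide whether to skip the race or stop, rather than failing on a subscript of None.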

Furthermore, we can simply use the url to extract the year and race

races[0].url

'http://racing-reference.info/race/1979_Winston_Western_500/W'

race_id = races[0].url.split('/')[-2]
race_id

'1979_Winston_Western_500'

r_race_id = re.compile(r'(\d{4})_(.*)')
race_id_match = r_race_id.search(race_id)
race_id_match[1], race_id_match[2]

('1979', 'Winston_Western_500')
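The same split can be written without a regex. Here is a small sketch using `urllib.parse.urlparse` and `str.partition`; the `parse_race_url` helper is a hypothetical name. It also swaps the underscores for spaces, which is purely a matter of taste and differs from the race_name values used in the rest of this post.

```python
from urllib.parse import urlparse


def parse_race_url(url):
    """Split a race detail URL into (year, race name, series code)."""
    # The path looks like /race/1979_Winston_Western_500/W
    _, race_id, series = urlparse(url).path.strip('/').split('/')
    # Split on the first underscore only: year before, race name after.
    year, _, name = race_id.partition('_')
    return int(year), name.replace('_', ' '), series


print(parse_race_url('http://racing-reference.info/race/1979_Winston_Western_500/W'))
```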

It would also be useful to extract the name of the track. We’ll again use a regex to find the URL pattern of Racing Reference track pages like:

http://racing-reference.info/tracks/Riverside_International_Raceway

r_track_name = re.compile('/tracks/.*')
BeautifulSoup(races[0].text, 'lxml').find(href=r_track_name).text

'Riverside International Raceway'

Putting it all together to create a dataframe for each modern era race

race_data_frames = []

for r in races:
    # The last matching table is the results table (see the nesting note above)
    df = pd.read_html(r.text, match='Sponsor / Owner', header=0)[-1]

    # Laps, track length, track type, and race length from the summary sentence
    details_match = r_details.search(r.text)
    df['race_length_laps'] = int(details_match[1])
    df['track_length_miles'] = float(details_match[2])
    df['track_type'] = details_match[3]
    df['race_length_miles'] = float(details_match[4])

    # Year and race name from the race detail URL
    race_id = r.url.split('/')[-2]
    race_id_match = r_race_id.search(race_id)
    df['year'] = int(race_id_match[1])
    df['race_name'] = race_id_match[2]

    # Track name from the link to the track page
    df['track_name'] = BeautifulSoup(r.text, 'lxml').find(href=r_track_name).text

    race_data_frames.append(df)

race_data_frames[-1].head()

	Fin	St	#	Driver	Sponsor / Owner	Car	Laps	Status	Led	Pts	PPts	race_length_laps	track_length_miles	track_type	race_length_miles	year	race_name	track_name
0	1	4	78	Martin Truex, Jr.	Bass Pro Shops / 5-hour Energy (Barney Visser)	Toyota	160	running	31	57	6	160	2.5	paved track	400.0	2018	Pocono_400	Pocono Raceway
1	2	13	42	Kyle Larson	DC Solar (Chip Ganassi)	Chevrolet	160	running	0	43	0	160	2.5	paved track	400.0	2018	Pocono_400	Pocono Raceway
2	3	5	18	Kyle Busch	M&M's Red White & Blue (Joe Gibbs)	Toyota	160	running	13	51	0	160	2.5	paved track	400.0	2018	Pocono_400	Pocono Raceway
3	4	2	4	Kevin Harvick	Busch Beer (Stewart Haas Racing)	Ford	160	running	89	52	1	160	2.5	paved track	400.0	2018	Pocono_400	Pocono Raceway
4	5	17	2	Brad Keselowski	Wurth (Roger Penske)	Ford	160	running	10	37	0	160	2.5	paved track	400.0	2018	Pocono_400	Pocono Raceway

By converting each dataframe in the list to dicts, we can create a single dataframe. The resulting dataframe’s columns will be a superset of each individual race dataframe’s columns:

df = pd.DataFrame([row for r_df in race_data_frames for row in r_df.to_dict(orient='records')])
df.head()

	#	Car	Driver	Fin	Laps	Led	Money	PPts	Pts	Sponsor / Owner	St	Status	race_length_laps	race_length_miles	race_name	track_length_miles	track_name	track_type	year
0	88	Chevrolet	Darrell Waltrip	1	119	87	21150.0	NaN	185.0	Gatorade (DiGard Racing)	4	running	119	311.8	Winston_Western_500	2.62	Riverside International Raceway	road course	1979
1	21	Mercury	David Pearson	2	119	9	14200.0	NaN	175.0	Purolator (Wood Brothers)	1	running	119	311.8	Winston_Western_500	2.62	Riverside International Raceway	road course	1979
2	11	Oldsmobile	Cale Yarborough	3	119	3	12675.0	NaN	170.0	Busch (Junior Johnson)	2	running	119	311.8	Winston_Western_500	2.62	Riverside International Raceway	road course	1979
3	73	Oldsmobile	Bill Schmitt	4	118	0	8000.0	NaN	160.0	Old Milwaukee (Bill Schmitt)	8	running	119	311.8	Winston_Western_500	2.62	Riverside International Raceway	road course	1979
4	1	Chevrolet	Donnie Allison	5	118	0	7550.0	NaN	155.0	Hawaiian Tropic (Hoss Ellington)	17	running	119	311.8	Winston_Western_500	2.62	Riverside International Raceway	road course	1979
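For comparison, `pd.concat` can also stack a list of dataframes directly, taking the union of their columns and filling the gaps with NaN. A minimal sketch on two toy frames with made-up values (not scraped data), standing in for race_data_frames:

```python
import pandas as pd

# Two toy frames standing in for race_data_frames (made-up values).
older = pd.DataFrame({'Driver': ['A'], 'Money': [100]})
newer = pd.DataFrame({'Driver': ['B'], 'PPts': [6]})

# concat aligns on column names; a column missing from a frame becomes NaN.
combined = pd.concat([older, newer], ignore_index=True, sort=False)
print(combined)
```

Here `ignore_index=True` renumbers the rows from zero, much like building the frame from a flat list of records does.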

Track Type

Currently, track type is not very descriptive

df.track_type.unique()

array(['road course', 'paved track'], dtype=object)

Using both the scraped track_type and track_length_miles, we’ll classify a new track_type

def track_type(row):
    if row['track_type'] == 'road course':
        return 'road course'
    elif row['track_length_miles'] >= 2.0:
        return 'superspeedway'
    elif row['track_length_miles'] >= 1.0:
        return 'intermediate'
    else:
        return 'short track'

df['track_type'] = df.apply(track_type, axis=1)
(df[['track_length_miles', 'track_type', 'track_name']]
    .drop_duplicates()
    .sort_values('track_length_miles')
    .reset_index(drop=True))

	track_length_miles	track_type	track_name
0	0.525	short track	Martinsville Speedway
1	0.526	short track	Martinsville Speedway
2	0.533	short track	Bristol International Raceway
3	0.533	short track	Bristol Motor Speedway
4	0.533	short track	Bristol International Speedway
5	0.542	short track	Richmond Fairgrounds Raceway
6	0.596	short track	Nashville Speedway
7	0.625	short track	North Wilkesboro Speedway
8	0.750	short track	Richmond Raceway
9	0.750	short track	Richmond International Raceway
10	1.000	intermediate	Jeff Gordon Raceway
11	1.000	intermediate	Dover International Speedway
12	1.000	intermediate	Phoenix International Raceway
13	1.000	intermediate	ISM Raceway
14	1.000	intermediate	Dover Downs International Speedway
15	1.017	intermediate	North Carolina Motor Speedway
16	1.017	intermediate	North Carolina Speedway
17	1.058	intermediate	New Hampshire Motor Speedway
18	1.058	intermediate	New Hampshire International Speedway
19	1.366	intermediate	Darlington Raceway
20	1.500	intermediate	Kentucky Speedway
21	1.500	intermediate	Lowe's Motor Speedway
22	1.500	intermediate	Kansas Speedway
23	1.500	intermediate	Las Vegas Motor Speedway
24	1.500	intermediate	Chicagoland Speedway
25	1.500	intermediate	Texas Motor Speedway
26	1.500	intermediate	Charlotte Motor Speedway
27	1.500	intermediate	Homestead-Miami Speedway
28	1.522	intermediate	Atlanta International Raceway
29	1.522	intermediate	Atlanta Motor Speedway
30	1.540	intermediate	Atlanta Motor Speedway
31	1.949	road course	Sears Point Raceway
32	1.990	road course	Infineon Raceway
33	1.990	road course	Sonoma Raceway
34	1.990	road course	Sears Point Raceway
35	2.000	road course	Sears Point Raceway
36	2.000	superspeedway	Auto Club Speedway
37	2.000	superspeedway	Michigan International Speedway
38	2.000	superspeedway	California Speedway
39	2.000	superspeedway	Michigan Speedway
40	2.000	superspeedway	Texas World Speedway
41	2.428	road course	Watkins Glen International
42	2.450	road course	Watkins Glen International
43	2.500	superspeedway	Pocono Raceway
44	2.500	superspeedway	Indianapolis Motor Speedway
45	2.500	superspeedway	Ontario Motor Speedway
46	2.500	superspeedway	Pocono International Raceway
47	2.500	superspeedway	Daytona International Speedway
48	2.520	road course	Sears Point Raceway
49	2.520	road course	Sears Point International Raceway
50	2.620	road course	Riverside International Raceway
51	2.660	superspeedway	Talladega Superspeedway
52	2.660	superspeedway	Alabama International Motor Speedway
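The cutoffs sit exactly at 1.0 and 2.0 miles, so tracks right on a boundary are worth a spot check. Below is a small sketch of the same rules as a function of plain values rather than a dataframe row. The name `classify_track` is my own, and the sample tracks are taken from the table above.

```python
def classify_track(track_type, length_miles):
    """Same rules as track_type above, taking plain values instead of a row."""
    if track_type == 'road course':
        return 'road course'
    if length_miles >= 2.0:
        return 'superspeedway'
    if length_miles >= 1.0:
        return 'intermediate'
    return 'short track'


# Sample tracks and lengths taken from the table above.
for name, kind, miles in [
    ('Richmond Raceway', 'paved track', 0.750),
    ('Dover International Speedway', 'paved track', 1.000),
    ('Michigan International Speedway', 'paved track', 2.000),
    ('Sonoma Raceway', 'road course', 1.990),
]:
    print(name, '->', classify_track(kind, miles))
```

A function like this can still be applied to the dataframe with `df.apply(lambda row: classify_track(row['track_type'], row['track_length_miles']), axis=1)`.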

Wrapping Up

We now have a single, tidy pandas DataFrame with all Monster Energy NASCAR Cup Series results since 1979:

df.tail()

	#	Car	Driver	Fin	Laps	Led	Money	Pts	Sponsor / Owner	St	Status	race_length_laps	race_length_miles	race_name	track_length_miles	track_name	track_type	year
52614	99	Chevrolet	Derrike Cope	34	152	0	NaN	3.0	StarCom Fiber (StarCom Racing)	38	running	160	400.0	Pocono_400	2.5	Pocono Raceway	superspeedway	2018
52615	11	Toyota	Denny Hamlin	35	146	0	NaN	8.0	FedEx Office (Joe Gibbs)	10	crash	160	400.0	Pocono_400	2.5	Pocono Raceway	superspeedway	2018
52616	95	Chevrolet	Kasey Kahne	36	120	0	NaN	1.0	FDNY Foundation (Leavine Family Racing)	22	transmission	160	400.0	Pocono_400	2.5	Pocono Raceway	superspeedway	2018
52617	32	Ford	Matt DiBenedetto	37	113	0	NaN	1.0	Zynga Poker (Archie St. Hilaire)	32	brakes	160	400.0	Pocono_400	2.5	Pocono Raceway	superspeedway	2018
52618	43	Chevrolet	Bubba Wallace	38	108	4	NaN	1.0	Weis Markets (Richard Petty Motorsports)	19	engine	160	400.0	Pocono_400	2.5	Pocono Raceway	superspeedway	2018

In future posts, I’ll revisit this data for further analysis. For now, I’ll save the DataFrame as a Python pickle file for easy ingestion

df.to_pickle('race_details.pkl')
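Loading the file back is a single call to `pd.read_pickle`. Here is a quick round-trip sketch on a toy frame with made-up values, written to a temporary directory so it leaves nothing behind:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the scraped results (made-up values).
toy = pd.DataFrame({'Driver': ['A', 'B'], 'Fin': [1, 2]})

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'race_details.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# True when the values and dtypes survive the round trip.
print(restored.equals(toy))
```

Pickle files should only be loaded from sources you trust, since unpickling can run arbitrary code.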