Racing Reference is the comprehensive source for up-to-date historical NASCAR race results. In addition, statistics are recorded by driver, owner, and crew chief, making for an awesome foundation of racing data analysis.

Traditional stick and ball sports have seen their data and analytics movement, with a variety of tools to get data from authoritative sources into the hands of eager sports data geeks. NASCAR and racing in general have seemingly been behind on this movement, lacking the open source foundations to unleash the potential of years of historical data in tools like Python and the PyData stack.

I am planning to eventually develop a full fledged Python package to make it easier to retrieve structured NASCAR data with a few parameters. This post covers an attempt to begin to bridge that gap by scraping modern era race results in the Monster Energy NASCAR Cup Series from Racing Reference.

Making Requests

To get started, we’ll import the following four Python packages:

  • requests - makes our HTTP requests
  • BeautifulSoup - aids in parsing and searching HTML
  • re - the Python standard regex module; useful for finding strings and extracting data from those strings
  • pandas - the swiss army knife of working with data in Python; provides us with tools to get data into a nice, tidy format for analysis
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd


BASE_URL = 'http://racing-reference.info'

Racing Reference uses a RESTful url scheme for accessing various pages. The raceyear endpoint contains high level results of all the races in a given year. For instance, we can receive a page with results for all races in 1979 at this url:

http://racing-reference.info/raceyear/1979/W

‘W’, I believe, stands for Winston Cup, the long-time name of the top level of NASCAR competition. In addition to the date, site, winner, and other race details, the raceyear page contains links to the full details of each race. We’ll need these links later.

First, we’ll request each raceyear from 1979 to present (2018); the NASCAR “modern era”

years = range(1979, 2019)

cup_results = [requests.get(BASE_URL + f'/raceyear/{year}/W') for year in years]

Checking HTTP status codes of our responses to make sure all our requests were successful

set([r.status_code for r in cup_results])
{200}

Each of these yearly cup series results pages has links to race details. Race detail pages have url’s like this:

http://racing-reference.info/race/1979_Winston_Western_500/W

The yearly results pages have HTML <a> tags with relative links. Lets find them all

race_anchors = []
href_regex = re.compile('/race/.*/W')

for c in cup_results:
    race_anchors.extend(BeautifulSoup(c.text, 'lxml').find_all(href=href_regex))
race_anchors[:5]
[<a href="/race/1979_Winston_Western_500/W" title="Winston Western 500">1</a>,
 <a href="/race/1979_Daytona_500/W" title="Daytona 500">2</a>,
 <a href="/race/1979_Carolina_500/W" title="Carolina 500">3</a>,
 <a href="/race/1979_Richmond_400/W" title="Richmond 400">4</a>,
 <a href="/race/1979_Atlanta_500/W" title="Atlanta 500">5</a>]

We can now use the href attribute of these <a> tags to build a full url to request the race detail pages

races = [requests.get(BASE_URL + a.attrs['href']) for a in race_anchors]

Again, checking the status codes. All 200s is what we’re after

set([r.status_code for r in races])
{200}

Extracting Race Results

To extract the race results stored as an HTML table, we can use the Pandas read_html function.

Given the text of the page, read_html will return a list of dataframes from all tables found. We can filter by using the match argument to find tables containing the provided string or regex

[df.shape for df in pd.read_html(races[0].text, match='Sponsor / Owner', header=0)]
[(83, 398), (35, 11)]

A list of two dataframes was returned. This is due to the nesting of tables in the structure of the race pages.

The last element of the returned list is what we’re after

pd.read_html(races[0].text, match='Sponsor / Owner', header=0)[-1].head()
Fin St # Driver Sponsor / Owner Car Laps Money Status Led Pts
0 1 4 88 Darrell Waltrip Gatorade (DiGard Racing) Chevrolet 119 21150 running 87 185
1 2 1 21 David Pearson Purolator (Wood Brothers) Mercury 119 14200 running 9 175
2 3 2 11 Cale Yarborough Busch (Junior Johnson) Oldsmobile 119 12675 running 3 170
3 4 8 73 Bill Schmitt Old Milwaukee (Bill Schmitt) Oldsmobile 118 8000 running 0 160
4 5 17 1 Donnie Allison Hawaiian Tropic (Hoss Ellington) Chevrolet 118 7550 running 0 155

Extracting Race Details

To help with analysis, it will be useful to extract some further details about the race; laps, track length, track type, and race length in particular

r_details = re.compile(r'(\d+) laps\*? on a (\d?\.\d{3}) mile (.*) \((\d+\.\d+) miles\)')
details_match = r_details.search(races[0].text)
details_match[0]
'119 laps on a 2.620 mile road course (311.8 miles)'

In addition to matching the entire string, our regex captured the following

details_match[1], details_match[2], details_match[3], details_match[4]
('119', '2.620', 'road course', '311.8')

Furthermore, we can simply use the url to extract the year and race

races[0].url
'http://racing-reference.info/race/1979_Winston_Western_500/W'
race_id = races[0].url.split('/')[-2]
race_id
'1979_Winston_Western_500'
r_race_id = re.compile(r'(\d{4})_(.*)')
race_id_match = r_race_id.search(race_id)
race_id_match[1], race_id_match[2]
('1979', 'Winston_Western_500')

It would also be useful to extract the name of the track. We’ll again use a regex to find the url pattern of Racing Reference track pages like:

http://racing-reference.info/tracks/Riverside_International_Raceway
r_track_name = re.compile('/tracks/.*')
BeautifulSoup(races[0].text, 'lxml').find(href=r_track_name).text
'Riverside International Raceway'

Putting it all together to create a dataframe for each modern era race

race_data_frames = []

for r in races:
    df = pd.read_html(r.text, match='Sponsor / Owner', header=0)[-1]

    details_match = r_details.search(r.text)
    df['race_length_laps'] = int(details_match[1])
    df['track_length_miles'] = float(details_match[2])
    df['track_type'] = details_match[3]
    df['race_length_miles'] = float(details_match[4])

    race_id = r.url.split('/')[-2]
    race_id_match = r_race_id.search(race_id)
    df['year'] = int(race_id_match[1])
    df['race_name'] = race_id_match[2]

    df['track_name'] = BeautifulSoup(r.text, 'lxml').find(href=r_track_name).text

    race_data_frames.append(df)
race_data_frames[-1].head()
Fin St # Driver Sponsor / Owner Car Laps Status Led Pts PPts race_length_laps track_length_miles track_type race_length_miles year race_name track_name
0 1 4 78 Martin Truex, Jr. Bass Pro Shops / 5-hour Energy (Barney Visser) Toyota 160 running 31 57 6 160 2.5 paved track 400.0 2018 Pocono_400 Pocono Raceway
1 2 13 42 Kyle Larson DC Solar (Chip Ganassi) Chevrolet 160 running 0 43 0 160 2.5 paved track 400.0 2018 Pocono_400 Pocono Raceway
2 3 5 18 Kyle Busch M&M's Red White & Blue (Joe Gibbs) Toyota 160 running 13 51 0 160 2.5 paved track 400.0 2018 Pocono_400 Pocono Raceway
3 4 2 4 Kevin Harvick Busch Beer (Stewart Haas Racing) Ford 160 running 89 52 1 160 2.5 paved track 400.0 2018 Pocono_400 Pocono Raceway
4 5 17 2 Brad Keselowski Wurth (Roger Penske) Ford 160 running 10 37 0 160 2.5 paved track 400.0 2018 Pocono_400 Pocono Raceway

By converting each dataframe in the list to dicts, we can create a single dataframe. The resulting dataframe columns will be a super set of each individual race dataframe’s columns

df = pd.DataFrame([row for r_df in race_data_frames for row in r_df.to_dict(orient='records')])
df.head()
# Car Driver Fin Laps Led Money PPts Pts Sponsor / Owner St Status race_length_laps race_length_miles race_name track_length_miles track_name track_type year
0 88 Chevrolet Darrell Waltrip 1 119 87 21150.0 NaN 185.0 Gatorade (DiGard Racing) 4 running 119 311.8 Winston_Western_500 2.62 Riverside International Raceway road course 1979
1 21 Mercury David Pearson 2 119 9 14200.0 NaN 175.0 Purolator (Wood Brothers) 1 running 119 311.8 Winston_Western_500 2.62 Riverside International Raceway road course 1979
2 11 Oldsmobile Cale Yarborough 3 119 3 12675.0 NaN 170.0 Busch (Junior Johnson) 2 running 119 311.8 Winston_Western_500 2.62 Riverside International Raceway road course 1979
3 73 Oldsmobile Bill Schmitt 4 118 0 8000.0 NaN 160.0 Old Milwaukee (Bill Schmitt) 8 running 119 311.8 Winston_Western_500 2.62 Riverside International Raceway road course 1979
4 1 Chevrolet Donnie Allison 5 118 0 7550.0 NaN 155.0 Hawaiian Tropic (Hoss Ellington) 17 running 119 311.8 Winston_Western_500 2.62 Riverside International Raceway road course 1979

Track Type

Currently, track type is not very descriptive

df.track_type.unique()
array(['road course', 'paved track'], dtype=object)

Using both the scraped track_type and track_length_miles, we’ll classify a new track_type

def track_type(row):
    if row['track_type'] == 'road course':
        return 'road course'
    elif row['track_length_miles'] >= 2.0:
        return 'superspeedway'
    elif row['track_length_miles'] >= 1.0:
        return 'intermediate'
    else:
        return 'short track'
df['track_type'] = df.apply(track_type, axis=1)
df[['track_length_miles', 'track_type', 'track_name']].drop_duplicates()\
                                                      .sort_values('track_length_miles')\
                                                      .reset_index()\
                                                      .drop('index', axis=1)
track_length_miles track_type track_name
0 0.525 short track Martinsville Speedway
1 0.526 short track Martinsville Speedway
2 0.533 short track Bristol International Raceway
3 0.533 short track Bristol Motor Speedway
4 0.533 short track Bristol International Speedway
5 0.542 short track Richmond Fairgrounds Raceway
6 0.596 short track Nashville Speedway
7 0.625 short track North Wilkesboro Speedway
8 0.750 short track Richmond Raceway
9 0.750 short track Richmond International Raceway
10 1.000 intermediate Jeff Gordon Raceway
11 1.000 intermediate Dover International Speedway
12 1.000 intermediate Phoenix International Raceway
13 1.000 intermediate ISM Raceway
14 1.000 intermediate Dover Downs International Speedway
15 1.017 intermediate North Carolina Motor Speedway
16 1.017 intermediate North Carolina Speedway
17 1.058 intermediate New Hampshire Motor Speedway
18 1.058 intermediate New Hampshire International Speedway
19 1.366 intermediate Darlington Raceway
20 1.500 intermediate Kentucky Speedway
21 1.500 intermediate Lowe's Motor Speedway
22 1.500 intermediate Kansas Speedway
23 1.500 intermediate Las Vegas Motor Speedway
24 1.500 intermediate Chicagoland Speedway
25 1.500 intermediate Texas Motor Speedway
26 1.500 intermediate Charlotte Motor Speedway
27 1.500 intermediate Homestead-Miami Speedway
28 1.522 intermediate Atlanta International Raceway
29 1.522 intermediate Atlanta Motor Speedway
30 1.540 intermediate Atlanta Motor Speedway
31 1.949 road course Sears Point Raceway
32 1.990 road course Infineon Raceway
33 1.990 road course Sonoma Raceway
34 1.990 road course Sears Point Raceway
35 2.000 road course Sears Point Raceway
36 2.000 superspeedway Auto Club Speedway
37 2.000 superspeedway Michigan International Speedway
38 2.000 superspeedway California Speedway
39 2.000 superspeedway Michigan Speedway
40 2.000 superspeedway Texas World Speedway
41 2.428 road course Watkins Glen International
42 2.450 road course Watkins Glen International
43 2.500 superspeedway Pocono Raceway
44 2.500 superspeedway Indianapolis Motor Speedway
45 2.500 superspeedway Ontario Motor Speedway
46 2.500 superspeedway Pocono International Raceway
47 2.500 superspeedway Daytona International Speedway
48 2.520 road course Sears Point Raceway
49 2.520 road course Sears Point International Raceway
50 2.620 road course Riverside International Raceway
51 2.660 superspeedway Talladega Superspeedway
52 2.660 superspeedway Alabama International Motor Speedway

Wrapping Up

We now have single, tidy Pandas DataFrame with all Monster Energy NASCAR Cup Series results since 1979

df.tail()
# Car Driver Fin Laps Led Money PPts Pts Sponsor / Owner St Status race_length_laps race_length_miles race_name track_length_miles track_name track_type year
52614 99 Chevrolet Derrike Cope 34 152 0 NaN 0.0 3.0 StarCom Fiber (StarCom Racing) 38 running 160 400.0 Pocono_400 2.5 Pocono Raceway superspeedway 2018
52615 11 Toyota Denny Hamlin 35 146 0 NaN 0.0 8.0 FedEx Office (Joe Gibbs) 10 crash 160 400.0 Pocono_400 2.5 Pocono Raceway superspeedway 2018
52616 95 Chevrolet Kasey Kahne 36 120 0 NaN 0.0 1.0 FDNY Foundation (Leavine Family Racing) 22 transmission 160 400.0 Pocono_400 2.5 Pocono Raceway superspeedway 2018
52617 32 Ford Matt DiBenedetto 37 113 0 NaN 0.0 1.0 Zynga Poker (Archie St. Hilaire) 32 brakes 160 400.0 Pocono_400 2.5 Pocono Raceway superspeedway 2018
52618 43 Chevrolet Bubba Wallace 38 108 4 NaN 0.0 1.0 Weis Markets (Richard Petty Motorsports) 19 engine 160 400.0 Pocono_400 2.5 Pocono Raceway superspeedway 2018

In future posts, I’ll revisit this data for further analysis. For now, I’ll save the DataFrame as a Python pickle file for easy ingestion

df.to_pickle('race_details.pkl')