Forecasting San Francisco's Air Quality with Gradient Boosted Trees

Image Copyright: Christopher Michel, Source: flickr

In this post, I scrape two years' worth of data from 18 air quality sensors in San Francisco, model it, find some funny but consistent time patterns, and forecast the next 24 hours of PM2.5 for each sensor with an RMSE of less than 6 µg/m3.


Scraping Two Years of PurpleAir Sensor Data

Air quality in the Bay Area has been a hot topic over the past year or so. Last year's fires had an obviously bad effect on the amount of dangerous particulate matter in the air, including one day where the sun didn't appear to rise, giving the city an apocalyptic tint like something out of Samuel Delany's Dhalgren. Recently, community and medical groups, citing health risks, have pushed government regulators to limit the air pollution from oil refineries. Around the time of the fires, I started scraping air quality data from PurpleAir to send myself and my partner email alerts when the air quality in our neighbourhood was dangerously bad. Now that we're at the start of another long and dangerous fire season, I've become interested in forecasting air quality.

To that end, I scraped data from PurpleAir's API for air quality sensors in San Francisco and used gradient boosted trees to model it. All of the code for this project is available on my GitHub. The main dataset (1.1 GB) is available for download from the Harvard Dataverse.

PurpleAir.com is a network of air quality sensors that are accessible from an API. I contacted the company for an API key and scraped sensor data for a rectangular area starting above Hunter's Point and extending out to a point west of the Golden Gate Bridge. I excluded sensors that were listed as indoors. Here I analyse sensors that have at least two years of observations preceding July 2021, because 2020 was, in many senses, unusual. This left 20 sensors, two of which appeared to be faulty (temperatures below -200ºF or PM2.5 measurements consistently greater than 2500 µg/m3) and were excluded. PurpleAir sensors sample every two minutes, so this gave me about 9.5 million observations to model.
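The scraper itself lives in the GitHub repo, but for orientation, here's a minimal sketch of the kind of bounding-box sensor query involved, written against PurpleAir's v1 REST API. The coordinates and field list are illustrative assumptions, not the exact values my scraper used, and pulling two years of history per sensor requires the per-sensor history sources rather than this one-shot query.

import requests

API_KEY = 'YOUR-READ-KEY'  # obtained by contacting PurpleAir

resp = requests.get(
    'https://api.purpleair.com/v1/sensors',
    headers={'X-API-Key': API_KEY},
    params={
        'fields': 'name,latitude,longitude,location_type',
        # rough bounding box: NW corner west of the Golden Gate Bridge,
        # SE corner below Hunter's Point
        'nwlng': -122.52, 'nwlat': 37.83,
        'selng': -122.35, 'selat': 37.70,
    },
)
resp.raise_for_status()

payload = resp.json()
# rows come back in the order of payload['fields'], with sensor_index first
sensors = [dict(zip(payload['fields'], row)) for row in payload['data']]
# keep outdoor sensors only (location_type 0 = outside, 1 = inside)
outdoor = [s for s in sensors if s['location_type'] == 0]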

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker


from IPython.display import clear_output

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.inspection import plot_partial_dependence

from multiOutputClass import multiOutputModel

from joblib import dump, load

def clearOutputAndPrint(msg):
    """Clear the current notebook cell's output and print a status message."""
    clear_output(wait = True)
    print(msg)

def recreateDatetime(X):
    """Rebuild timestamps from the year, dayOfYear, and hour feature columns."""
    return pd.to_datetime(X.year.astype(str) + ' ' + X.dayOfYear.astype(str) + ' ' + X.hour.astype(str), format = '%Y %j %H')


def partialDepSubplots(IDs, axs, features, models):
    """Plot each sensor's partial dependence on the given features in its own subplot."""
    for (thisID, ax) in zip(IDs, axs.flatten()):

        # reload the training data saved for this sensor during model fitting
        gradRegressorData = load(f'Modeling/GradBoostedReg/Data/gradRegressorData_{thisID}.joblib')

        X = gradRegressorData['X']

        plot_partial_dependence(models[str(thisID)].best_estimator_, X, features, contour_kw={'cmap':'jet'}, ax = ax);
        ax.set_title(str(thisID))
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=4.0)

clearOutputAndPrint('Reading data...')
masterDF = pd.read_csv('wrangledData/completedObs.csv')
clearOutputAndPrint('Data loaded!')

Data loaded!
masterDF.head(4)
| | created_at | entry_id | PM1 CF=ATM ug/m3 | y | PM10 CF=ATM ug/m3 | Uptime (Minutes) | RSSI (WiFi Signal Strength) | Temperature (F) | Humidity % | PM25 CF=1 ug/m3 | latitude | longitude | ID | dayOfYear | dayOfWeek | year | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-06-01 00:00:26-07:00 | 144961 | 5.44 | 11.60 | 19.97 | 831.0 | -78.0 | 61.0 | 67.0 | 11.60 | 37.7509 | -122.4203 | 19189 | 152 | 5 | 2019 | 0 |
| 1 | 2019-06-01 00:00:29-07:00 | 140550 | 4.70 | 8.78 | 10.23 | 930.0 | -77.0 | 59.0 | 73.0 | 8.78 | 37.7403 | -122.4098 | 19711 | 152 | 5 | 2019 | 0 |
| 2 | 2019-06-01 00:00:37-07:00 | 171870 | 7.16 | 10.94 | 14.31 | 954.0 | -70.0 | 61.0 | 68.0 | 10.94 | 37.7721 | -122.4424 | 22989 | 152 | 5 | 2019 | 0 |
| 3 | 2019-06-01 00:00:46-07:00 | 113172 | 4.54 | 8.18 | 10.68 | 945.0 | -85.0 | 59.0 | 70.0 | 8.18 | 37.7392 | -122.4391 | 21227 | 152 | 5 | 2019 | 0 |

The y column is the PM2.5 concentration in micrograms per cubic meter of air (µg/m3). Plotting the data for each sensor reveals a lot of high-frequency noise in the sampling, so I'll average the observations within each hour, since two-minute variation in the signal isn't interesting to me. Then I'll take a five-hour rolling average for each sensor to low-pass filter it a little.

grouped = masterDF.groupby('ID')

fig, axes = plt.subplots(nrows=3, ncols=6, figsize=(16,9), sharey=True)

for (key, ax) in zip(grouped.groups.keys(), axes.flatten()):
    ax.plot(grouped.get_group(key).y)
    ax.set_title(str(key))
    
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=4.0)

[Figure: raw two-minute PM2.5 readings for each sensor]

aggregatedDF = masterDF.groupby(
    ['latitude', 'longitude', 'year', 'dayOfYear', 'dayOfWeek','hour', 'ID']
    ).agg(
    {'y':'mean', 'Temperature (F)':'mean', 'Humidity %':'mean'}
)


aggregatedDF.reset_index(['latitude', 'longitude', 'year', 'dayOfYear', 'dayOfWeek', 'hour'], inplace = True)

# five-hour rolling mean of the hourly averages, computed separately per sensor
rollingByID = aggregatedDF.groupby('ID').y.rolling(5).mean()

for ID in aggregatedDF.index.unique():
    aggregatedDF.loc[ID, 'y'] = rollingByID[ID].values

# the first four hours of each sensor's rolling mean are NaN; drop them
aggregatedDF = aggregatedDF[~aggregatedDF.y.isna()]
    
aggregatedDF.reset_index(inplace=True)

forPlotting = aggregatedDF.copy()
forPlotting['created_at'] = recreateDatetime(forPlotting)

grouped = forPlotting.groupby('ID')

fig, axes = plt.subplots(nrows=3, ncols=6, figsize=(16,9), sharey=True)

for (key, ax) in zip(grouped.groups.keys(), axes.flatten()):
    grouped.get_group(key).plot(x = 'created_at', y = 'y', ax=ax)
    fmt_half_year = mdates.MonthLocator(interval=6)
    ax.xaxis.set_major_locator(fmt_half_year)
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    ax.set_title(str(key))
    
    
    
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=4.0)

[Figure: hourly-averaged, five-hour rolling mean PM2.5 for each sensor]

The smoothed data look much better. The huge peak of around 4000 µg/m3 for sensor 20179 is gone. We can clearly see peaks in particulate matter around the 2020 fire season.

Regressing PM2.5 on Time and Weather Covariates

The PurpleAir sensors give us the time of each observation as well as temperature and humidity measurements. Here I'm going to look at how air quality might be predicted by time and weather. I'll use early stopping for the GBT tree growth and cross-validate the learning rate across values less than 0.1, as in [1]. The covariates I'll use are:

1. Temperature (F)
2. Humidity (%)
3. Day of year (1 - 366)
4. Day of week (0 - 6, starting on Monday)
5. Year (2019 - 2021)
6. Hour of day (0 - 23)

Day of year allows me to look at seasonal changes. Day of week allows me to look at changes related to factors like commuting and transport, which we would expect to be heavier from Monday to Friday. The sketch below shows how these features come out of the raw timestamps.
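This is a sketch of how the time features can be derived from the created_at timestamps with pandas' datetime accessors; my actual wrangling code (on GitHub) does the equivalent, and the column assignments here are illustrative.

# Parse the offset-aware created_at strings as UTC, then convert back to
# Pacific time before extracting the time features.
timestamps = pd.to_datetime(masterDF['created_at'], utc=True).dt.tz_convert('US/Pacific')

masterDF['dayOfYear'] = timestamps.dt.dayofyear  # 1 - 366
masterDF['dayOfWeek'] = timestamps.dt.dayofweek  # 0 = Monday, 6 = Sunday
masterDF['year'] = timestamps.dt.year
masterDF['hour'] = timestamps.dt.hour            # 0 - 23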

GBTRegressions = {}

for thisID in aggregatedDF.ID.unique():
    
    theseData = aggregatedDF[aggregatedDF.ID == thisID]
    
    
    try:
        gradRegressorData = load(f'Modeling/GradBoostedReg/Data/gradRegressorData_{thisID}.joblib')

        X = gradRegressorData['X']
        y = gradRegressorData['y']

    except FileNotFoundError:

        # keep only rows where both weather covariates are present; the GBT
        # can't handle NaNs in its feature matrix
        hasWeather = ~(theseData['Temperature (F)'].isna() | theseData['Humidity %'].isna())

        X = theseData.loc[hasWeather, ['Temperature (F)', 'Humidity %', 'dayOfYear', 'dayOfWeek', 'year', 'hour']]
        y = theseData.loc[hasWeather, 'y']

        gradRegressorData = {
            'aggregatedDF': aggregatedDF,
            'X':X,
            'y':y
        }
    
        dump(gradRegressorData,f'Modeling/GradBoostedReg/Data/gradRegressorData_{thisID}.joblib')
    
    
    try:
        clearOutputAndPrint(f'Recovering model for {thisID}. Number {(aggregatedDF.ID.unique() == thisID).nonzero()[0][0] + 1} of {aggregatedDF.ID.unique().shape[0]}')

        weatherAndTimeGBT = load(f'Modeling/GradBoostedReg/Model/weatherAndTimeGBT_{thisID}.joblib')
    except FileNotFoundError:
        clearOutputAndPrint(f'Fitting model for {thisID}. Number {(aggregatedDF.ID.unique() == thisID).nonzero()[0][0] + 1} of {aggregatedDF.ID.unique().shape[0]}')
        grid = {
            'learning_rate': np.logspace(-2, -1, 10), 
            'max_depth': [4], 
        }


        gradBoost = GradientBoostingRegressor(
            loss = 'huber',
            n_estimators=400,
            n_iter_no_change=3
        )

        weatherAndTimeGBT = GridSearchCV(
            gradBoost, 
            grid,
            verbose = 3,
            n_jobs = 3
        )

        weatherAndTimeGBT.fit(X, y)
        dump(weatherAndTimeGBT, f'Modeling/GradBoostedReg/Model/weatherAndTimeGBT_{thisID}.joblib')
    
    GBTRegressions[str(thisID)] = weatherAndTimeGBT
    


Recovering model for 20175. Number 18 of 18

Fitted Values

fig, axs = plt.subplots(3, 6, figsize = (16,9))



for (thisID, ax) in zip(aggregatedDF.ID.unique(), axs.flatten()):
    
    gradRegressorData = load(f'Modeling/GradBoostedReg/Data/gradRegressorData_{thisID}.joblib')

    X = gradRegressorData['X']
    y = gradRegressorData['y']
    
    preds = GBTRegressions[str(thisID)].best_estimator_.predict(X)
    best_score = GBTRegressions[str(thisID)].best_score_
    X_datetimes = recreateDatetime(X)

    ax.plot(X_datetimes, y, c='b', label = 'True values', alpha = .3)
    ax.plot(X_datetimes, preds, label = 'Fitted Values', c = 'purple')
    ax.legend();
    fmt_half_year = mdates.MonthLocator(interval=6)
    ax.xaxis.set_major_locator(fmt_half_year)
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")


    
    ax.set_title(f'{thisID}, Best score = {np.round(best_score,2)}')
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=4.0)

[Figure: fitted values (purple) over the training data (light blue) for each sensor]

The plots above show the fitted values in purple and the training values in light blue. The models tend to smooth over many of the large peaks in PM2.5 over time, but the cross-validated scores tend to be small: less than 10 µg/m3 for all but two sensors.

Feature Importance

The plots below show the feature importance for each model. Day of the year is typically the most important feature, usually followed by year. Sensor 20175 is an exception: the order of importance is swapped for those two features.

fig, axs = plt.subplots(3, 6, figsize = (16,9))


for (thisID, ax) in zip(aggregatedDF.ID.unique(), axs.flatten()):
    # X still holds the last sensor's feature matrix from the loop above;
    # the columns are the same for every sensor
    importances = GBTRegressions[str(thisID)].best_estimator_.feature_importances_
    ax.bar(np.arange(len(importances)), importances)
    ax.set_xticks(np.arange(6))
    ax.set_xticklabels(X.columns, rotation = 45, ha="right")
    ax.set_title(str(thisID))

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0);

[Figure: feature importances for each sensor's model]

Partial Dependencies

Day of Year

My hunch is that the importance of day of the year to the model corresponds to changes in air quality due to fire season. The plot below shows the partial dependence of modeled PM2.5 on day of the year. Increases are apparent around day 200, shortly before August, the historic start of the fire season. These plots represent marginal effects, so the peaks are likely driven by data from 2021, which was much worse than other years.
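To unpack "marginal effects": the brute-force partial dependence that sklearn plots is the model's average prediction with one feature pinned to each value on a grid, averaging over every observed combination of the remaining features. A minimal hand-rolled version (the function name and grid here are mine, for illustration) looks like this:

# Pin one feature to a grid value, average the model's predictions over all
# rows, and repeat across the grid. This is what the plots below compute.
def manualPartialDependence(model, X, feature, gridValues):
    means = []
    for value in gridValues:
        Xpinned = X.copy()
        Xpinned[feature] = value  # every row gets the same pinned value
        means.append(model.predict(Xpinned).mean())
    return np.array(means)

# e.g. the day-of-year curve for one sensor's model:
# manualPartialDependence(GBTRegressions['19189'].best_estimator_, X, 'dayOfYear', np.arange(1, 367, 7))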

fig, axs = plt.subplots(3, 6, figsize = (16,9))

partialDepSubplots(aggregatedDF.ID.unique(), axs, [2], GBTRegressions)

[Figure: partial dependence of PM2.5 on day of year, by sensor]

Temperature

Temperature typically has a positive effect on PM2.5, although this is not the case for some sensors (e.g. 22989).

fig, axs = plt.subplots(3, 6, figsize = (16,9))

partialDepSubplots(aggregatedDF.ID.unique(), axs, [0], GBTRegressions)

[Figure: partial dependence of PM2.5 on temperature, by sensor]

Humidity

Humidity typically has an inverted-U-shaped relationship with PM2.5.

fig, axs = plt.subplots(3, 6, figsize = (16,9))

partialDepSubplots(aggregatedDF.ID.unique(), axs, [1], GBTRegressions)

[Figure: partial dependence of PM2.5 on humidity, by sensor]

Day of The Week

This is interesting. Air quality tends to get worse over the week, peaking on Friday and Saturday, which isn't what I predicted at all. Perhaps this reflects more commuting or travelling on weekends and after work on Fridays.

fig, axs = plt.subplots(3, 6, figsize = (16,9))

partialDepSubplots(aggregatedDF.ID.unique(), axs, [3], GBTRegressions)

[Figure: partial dependence of PM2.5 on day of week, by sensor]

Year

Air quality seems to be worse this year than in previous years, though we don't yet have a full set of observations for 2021.

fig, axs = plt.subplots(3, 6, figsize = (16,9))
    
partialDepSubplots(aggregatedDF.ID.unique(), axs, [4], GBTRegressions)

[Figure: partial dependence of PM2.5 on year, by sensor]

Hour of the Day

The decrease in PM2.5 around 1500 hours every day is interesting, and I'm not sure what it represents. Possibly it's a lull in work before everyone commutes home; delivery and trucking companies might try to finish their work before peak commuting hours to avoid getting stuck in traffic. An alternative is that the dip has something to do with the marine layer, which might have a periodic relationship with temperature, and thus with the time of day.

fig, axs = plt.subplots(3, 6, figsize = (16,9))

partialDepSubplots(aggregatedDF.ID.unique(), axs, [5], GBTRegressions)

[Figure: partial dependence of PM2.5 on hour of day, by sensor]

Contour plots for the effect of hour and day of the week show that this dip is consistent across days, with the exception of sensor 23167.

fig, axs = plt.subplots(3, 6, figsize = (16,9))

partialDepSubplots(aggregatedDF.ID.unique(), axs, [(5, 3)], GBTRegressions)

[Figure: contour plots of partial dependence on hour of day and day of week, by sensor]

Forecasting

Now I'll forecast air quality over a 24-hour horizon using the PM2.5 observations from the four days preceding the start of the horizon, as well as the weather covariates. To do this, I flatten the data by appending each training day's measurements and covariates into a single row vector. These rows become the features for multiple GBT regressors, each of which is fit to the data from one hour of the forecasting horizon. This results in 24 regressors per sensor, one for each hour of the horizon. It is the strategy used by Elsayed et al. in their recent paper Do We Really Need Deep Learning Models for Time Series Forecasting? [2]. The feature engineering and modeling code uses sklearn.multioutput's MultiOutputRegressor and can be found in my GitHub repo for this project. I fit the multi-output model for each sensor using early stopping and a learning rate of 0.01.
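Here's a minimal sketch of the windowing and multi-output fit for a single sensor, assuming series holds that sensor's hourly PM2.5 values as a 1-D NumPy array; my multiOutputModel class implements the same idea with the weather covariates appended to each window.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

trainLen, horizonLen = 24 * 4, 24

def makeWindows(series, trainLen, horizonLen):
    # each row of X holds four days of hourly readings; the matching row of
    # Y holds the 24 hours that immediately follow them
    X, Y = [], []
    for start in range(len(series) - trainLen - horizonLen + 1):
        X.append(series[start : start + trainLen])
        Y.append(series[start + trainLen : start + trainLen + horizonLen])
    return np.array(X), np.array(Y)

X, Y = makeWindows(series, trainLen, horizonLen)

# one GBT per forecast hour: MultiOutputRegressor clones the base estimator
# 24 times and fits each clone to one column of Y
forecaster = MultiOutputRegressor(
    GradientBoostingRegressor(learning_rate=0.01, n_iter_no_change=3)
)
forecaster.fit(X, Y)

# forecast the next 24 hours from the most recent four days
nextDay = forecaster.predict(series[-trainLen:].reshape(1, -1))  # shape (1, 24)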

horizonLen = 24
trainLen = 24*4
multiOutModels = {}

for thisID in aggregatedDF.ID.unique():
    
    mod = multiOutputModel(aggregatedDF, thisID)

    mod.flattenAndTrainTestSplit(horizonLen, trainLen, trainLen, 0)
    mod.fit(verbose = 2)
    
    multiOutModels[str(thisID)] = mod
    
Retrieving model for 20175
fig, axs = plt.subplots(4,5, figsize = (20, 20));

for thisID in aggregatedDF.ID.unique():
    thisAx = axs.flatten()[(aggregatedDF.ID.unique() == thisID).nonzero()[0][0]]

    mod = multiOutModels[str(thisID)]
    
    rmse = mod.rmse(mod.testPredictors, mod.y_test)
    
    mod.predictAndPlot(mod.testPredictors, mod.y_test, ax = thisAx) #Prediction from held-out data
    
    thisAx.set_title(f'{thisID}, RMSE = {np.round(rmse, 3)}')
    thisAx.set_ylim(0, 40);
    

[Figure: 24-hour forecasts against held-out data, with RMSE, for each sensor]

The RMSE for these predictions is good: always less than 6 µg/m3, and often closer to 1 or 2 µg/m3.

Summary

So there we have it. There are consistent relationships between air quality in SF and the time of day, the day of the year (which seems to track fire season), and the day of the week. I can forecast the next day's air quality pretty well using data from the preceding four days. I'm tempted to combine these models with a daily scraping script to create personal daily forecasts of air quality for the upcoming fire season.

Bibliography

[1] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York.

[2] Elsayed, S., Thyssens, D., Rashed, A., Schmidt-Thieme, L., & Jomaa, H. S. (2021). Do We Really Need Deep Learning Models for Time Series Forecasting? ArXiv:2101.02118 [Cs, Stat]. http://arxiv.org/abs/2101.02118