%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import glob
import geopy
from geopy.distance import geodesic  # vincenty was removed in geopy 2.0
from pandas.plotting import scatter_matrix
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from IPython.display import HTML, Image
from pylab import rcParams
import itertools
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, pacf, adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Toggle Code" style="background:white; border: 1px solid; border-radius: 4px; padding: 10px;"></form>''')
Below is a summary of my deep-dive analysis of bike-share usage and weather in San Francisco.
The analysis shows that weather does affect bike-riding behavior. I reached this conclusion through exploratory data analysis (EDA) and correlation plots, and I built time series models to test whether weather is a good predictor of daily bike-share rides. The findings follow.
People are more likely to use bike-share when the weather is good. The average daily number of bike-share rides in good weather is about three times higher than in any other condition.
bike_wx = pd.read_csv('tmp/bike_wx.csv')
days = bike_wx['date'].nunique()  # number of distinct days in the data
rain_ct = list(bike_wx[bike_wx.precip>0].groupby('wx_feel').bike_id.count()/days)
no_rain_ct= list(bike_wx[bike_wx.precip<=0].groupby('wx_feel').bike_id.count()/days)
# Defining a color palette
light_blue = 'rgb(142, 212, 229)'
pink = 'rgb(231, 84, 128)'
green = 'rgb(199, 204, 118)'
grey = 'rgb(169,169,169)'
yellow = 'rgb(248, 222, 126)'
trace1 = go.Bar(
x=['Cold & 🌧', 'Good & 🌧', 'Hot & 🌧'],
y=rain_ct, marker = dict(color=light_blue),
name='Rain')
trace2 = go.Bar(
x=['Cold & 🌤', 'Good & 🌤', 'Hot & 🌤'],
y=no_rain_ct,
name='No Rain', marker = dict(color=yellow))
data = [trace1, trace2]
layout = go.Layout(
barmode='stack', title = 'Daily average number of bike share rides by weather and precipitation', xaxis=dict(title='Weather condition'),
yaxis=dict(
title='Average number of bike share rides'),)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename = "Weather")
$^*$Cold is below 53 °F, Good is between 53 °F and 73 °F, and Hot is above 73 °F.
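The thresholds in the footnote can be turned into a categorical column with `pd.cut`. This is a minimal sketch, not the code that produced `wx_feel`: it assumes the label is derived from `avg_temp`, and the handling of the exact boundary values (53 °F counts as good, 73 °F counts as hot) is an assumption, since the footnote does not say which side they fall on.

```python
import numpy as np
import pandas as pd

# Toy temperatures in degrees Fahrenheit (illustrative values only)
temps = pd.DataFrame({"avg_temp": [48.0, 53.0, 60.5, 73.0, 80.2]})

# right=False makes the bins left-closed: [-inf, 53), [53, 73), [73, inf)
# Assumption: boundary values belong to the warmer category.
temps["wx_feel"] = pd.cut(
    temps["avg_temp"],
    bins=[-np.inf, 53, 73, np.inf],
    labels=["cold", "good", "hot"],
    right=False,
)

print(temps)
```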
The weather also impacts bike-share ride duration. On days without rain, rides are about 20.8% longer on average, and the duration increases as the weather conditions improve.
# Average ride duration in minutes per weather condition, with and without rain
rain_avg_dur = list(bike_wx[bike_wx.precip>0].groupby('wx_feel').duration_sec.mean()/60)
no_rain_avg_dur = list(bike_wx[bike_wx.precip<=0].groupby('wx_feel').duration_sec.mean()/60)
# Relative increase in mean ride duration on dry days vs. rainy days
increase = (np.mean(no_rain_avg_dur) - np.mean(rain_avg_dur)) / np.mean(rain_avg_dur)
trace1 = go.Bar(
x=['Cold & 🌧', 'Good & 🌧', 'Hot & 🌧'],
y=rain_avg_dur, marker = dict(color=light_blue),
name='Rain')
trace2 = go.Bar(
x=['Cold & 🌤', 'Good & 🌤', 'Hot & 🌤'],
y=no_rain_avg_dur,
name='No Rain', marker = dict(color=yellow))
data = [trace1, trace2]
layout = go.Layout(
barmode='stack', title = 'Average duration of a bike share ride (in minutes) by weather and precipitation', xaxis=dict(title='Weather condition'),
yaxis=dict(
title='Average duration of a bike share ride'),)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename = "Weather")
Weather also affects the mix of customer segments. The share of Customers, relative to Subscribers, rises by 10.8 percentage points as the weather goes from cold to hot.
bt = pd.read_csv('bike_sharing_weather_daily.csv')
# Good
cust_g = bt[bt.wx_feel == 'good'].count_customer.sum() / bt[bt.wx_feel == 'good'].count_shares.sum()
sub_g = bt[bt.wx_feel == 'good'].count_subscriber.sum() / bt[bt.wx_feel == 'good'].count_shares.sum()
# Cold
cust_c = bt[bt.wx_feel == 'cold'].count_customer.sum() / bt[bt.wx_feel == 'cold'].count_shares.sum()
sub_c = bt[bt.wx_feel == 'cold'].count_subscriber.sum() / bt[bt.wx_feel == 'cold'].count_shares.sum()
# Hot
cust_h = bt[bt.wx_feel == 'hot'].count_customer.sum() / bt[bt.wx_feel == 'hot'].count_shares.sum()
sub_h = bt[bt.wx_feel == 'hot'].count_subscriber.sum() / bt[bt.wx_feel == 'hot'].count_shares.sum()
trace1 = go.Bar(
x=['Cold', 'Good', 'Hot'],
y=[cust_c, cust_g, cust_h], marker = dict(color=grey),
name='Customer')
trace2 = go.Bar(
x=['Cold', 'Good', 'Hot'],
y=[sub_c, sub_g, sub_h],
name='Subscriber', marker = dict(color=pink))
data = [trace1, trace2]
layout = go.Layout(
barmode='stack', title = 'Comparing customers vs. subscribers behaviour by weather',
xaxis=dict(title='Weather condition'),
yaxis=dict(
title='% of total bike share rides'),)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename = "Bike Shares")
I computed Pearson's correlation of both the number of bike rides and the trip duration with every weather feature. Daylight and average temperature each have a correlation of 0.27 with bike-share ride duration. The number of bike rides is less strongly correlated with the same features. This agrees with my insight from above that in good weather, people tend to ride longer rather than more often.
Not surprisingly, both the number of rides and the duration are negatively correlated with precipitation, with correlations of -0.13 and -0.16, respectively. See the full analysis for more on correlation, including a scatter matrix.
attr = ['count_shares', 'avg_temp', 'daylight', 'daily_temp_diff', 'precip']
corr_matrix = bt[attr].corr()
print(corr_matrix['count_shares'].sort_values(ascending=False))
attr = ['avg_duration', 'avg_temp', 'daylight', 'daily_temp_diff', 'precip']
corr_matrix = bt[attr].corr()
print(corr_matrix['avg_duration'].sort_values(ascending=False))
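For reference, the coefficient that `DataFrame.corr()` reports by default is Pearson's r: the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand on toy values and compares it with pandas. The numbers are made up for illustration and are not taken from the bike-share data; `pearson_r` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only, not from the bike-share dataset)
toy = pd.DataFrame({
    "avg_temp": [50.0, 55.0, 60.0, 65.0, 70.0],
    "avg_duration": [11.0, 12.5, 12.0, 14.0, 13.5],
})

def pearson_r(x, y):
    """Pearson's r: sum of centered products over the root of the product of centered sums of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

manual = pearson_r(toy["avg_temp"], toy["avg_duration"])
from_pandas = toy["avg_temp"].corr(toy["avg_duration"])  # Pearson is the default method

print(manual, from_pandas)
```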
By constructing time series models, I found that the weather is a good predictor of daily bike-share rides.
I built a SARIMA model and a SARIMAX model and compared their RMSE and R2. Including a weather feature improved the prediction accuracy of the model.
I chose to predict the number of bike-share rides per day to limit the scope, but similar models could be applied to predict the daily duration. Both metrics matter for a bike-share business: the number of rides can inform capacity planning, and the duration would help forecast the company's revenue.
SARIMA (Seasonal Autoregressive Integrated Moving Average) is a simple and commonly used time series model that makes predictions based only on the series' own historical values. In contrast, SARIMAX can include one or more exogenous variables. Below is my SARIMAX model, with precipitation as the exogenous variable.
Adding the weather feature improved both performance metrics relative to the univariate model:
RMSE of SARIMAX is 673.52 (vs. 1069.76 of SARIMA)
R2 of SARIMAX is 0.8 (vs. 0.49 of SARIMA)
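As a reminder of how these two metrics are defined, here is a minimal sketch using `mean_squared_error` and `r2_score` from scikit-learn. The arrays are toy numbers chosen so the arithmetic is easy to follow; they are not the observations or forecasts behind the figures above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy observed and predicted daily ride counts (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE is the square root of the mean squared error, in the units of the target
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

# R2 is 1 - SS_res / SS_tot, the share of variance explained by the predictions
r2 = float(r2_score(y_true, y_pred))

print(f"RMSE: {rmse:.2f}, R2: {r2:.3f}")
```

A lower RMSE and a higher R2 both indicate predictions that track the observed series more closely.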
bt_ts = bt.iloc[:, [0, 1, 10]].set_index('date')
train = bt_ts['2017-06-28':'2018-04-20']
test = bt_ts['2018-04-21':]
model_2 = sm.tsa.statespace.SARIMAX(bt_ts.count_shares,
order=(0, 1, 0),
seasonal_order=(1, 1, 1, 7),
enforce_stationarity=False,
enforce_invertibility=False,
exog=bt_ts.precip)
# fit() takes no data argument; the model is estimated on the series passed to SARIMAX above
result_2 = model_2.fit(disp=False)
# Showing roughly the last 3 months for better visualization:
pred_2 = result_2.get_prediction(start='2018-04-21', dynamic=False)
pred_ci = pred_2.conf_int()
# Name the series explicitly: its default name differs across statsmodels versions
pred = pred_2.predicted_mean.rename('pred').to_frame()
pred_plus_ci = pd.merge(pred_ci, pred, how='left', on='date')
tr = bt_ts.iloc[-90:,0].to_frame()
pred_to_plot = pd.merge(tr, pred_plus_ci, how='left', on='date')
ax = pred_to_plot.count_shares.plot(label='Observed', color='dodgerblue')
pred_to_plot.pred.plot(ax=ax, label='Forecast', alpha=.7, figsize=(18, 6), color='deeppink')
ax.fill_between(pred_to_plot.index,
pred_to_plot.iloc[:, 1],
pred_to_plot.iloc[:, 2], color='darkgrey', alpha=.2)
xticks = pred_to_plot.index
ax.xaxis.set_ticks(xticks)
ax.set_xticklabels(xticks, alpha=1, rotation=45)
ax.set_xticks(ax.get_xticks()[::5])
ax.set_xlabel('Date')
ax.set_ylabel('Bike share rides')
plt.legend()
plt.show()
More details on time series decomposition, the Augmented Dickey-Fuller stationarity test, parameter selection by iterating over candidates and comparing AIC, and residual diagnostics are in the forecasting section of my full analysis.
To conclude, the weather does impact bike-share behavior. I showed this with EDA, correlation analysis, and time series forecasting. Including weather features improves model predictions.
As a next step, it would be interesting to improve the SARIMAX model by 1) tuning the hyperparameters and 2) adding more exogenous features.
Another option would be to try more advanced modeling techniques, e.g., a Random Forest regressor or neural network models.
In terms of data analysis, we could consider forecasting hourly bike rides with the corresponding hourly weather data.