DSFB II, A1 solutions, 2019

This notebook contains one possible solution to Assignment 1 on Data Science for Business II, 2019. Please note that different solutions are possible and even expected.

Lauri Neuvonen (based heavily on work by Anton Frantsev)

Motivation and the general idea

What are we trying to achieve?

The main benefit of this kind of analysis is the ability to forecast the behaviour of some variable into the future. Good predictions and simulations can really help in making better decisions, which is always valuable. In this case, we are interested in studying how Heinz's sales might develop in the near future. This forecasting, i.e. generating predictions, is done by first creating a model based on the available data and then simulating sales during the coming weeks using that model.

In Task 1, we'll create and fit an AR (autoregressive) model, i.e. a model based solely on previous sales figures, and use it to simulate future sales. However, we have more data available than just previous sales numbers, so in Task 2 we'll take those data into use and try to achieve more precise predictions.

Between generating a model and starting to simulate with it, it is a good idea to verify that your model is at least roughly correct. Bad predictions can lead to big problems! Almost always a model is just an approximation, so we cannot expect perfect performance. However, some models are better than others, and we'll want to make sure ours is at least OK by some criteria. For this purpose, we can compare the model we generated to e.g. the original data, and check that the simulations based on it do not deviate too much from the original sales series.

Learning goals

The main goals of the assignment are to introduce the use of modeling techniques (AR and regularized regression such as Lasso, Ridge and Elastic Net) through a hands-on exercise, to learn different ways of checking their performance, and finally to simulate the future with the models.

This assignment has not been designed to serve as an introduction to programming, statistics or simulation. The focus is on time series analysis techniques. The assignment by no means contains an exhaustive list of available techniques but can hopefully serve as a starting point into the subject.

Contents of the model solution

The solution progresses in the order of the assignment tasks. In addition to the solution and code examples, we try to discuss why a particular method was chosen instead of another.

In the first part, we'll also go through something not requested in the assignment: a check for the stationarity of the original time series. The assignment data happen to be stationary to begin with, so, for simplicity, this phase was dropped from the assignment. However, it is an important concept in time series analysis and thus deserves a place here too.

The Model solution

Importing the necessary libraries

One of the strengths of Python is its vast library of packages. Among the most commonly used in modeling are NumPy for numerical computing, StatsModels for statistics, Pandas for dataframes and Matplotlib for plotting.

(That being said, for time series analysis R might be a better option, as it contains more, and more established, time series analysis libraries. So why do we use Python here? 1) One programming language per course is enough, and 2) Python is gaining ground in data analysis as well, so in this sense it's good to get acquainted with it.)

In [925]:
import pandas as pd
import numpy as np
import itertools
import time
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as stats

import scipy.stats as scs

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm

We can now use Pandas to import data into a dataframe (an array with titles for columns and rows + some other handy features).

a) Descriptive statistics

To gain a better understanding about what kind of data we're working with, we'll first do some basic descriptive analysis. The point here is simply to spot potential patterns, statistical key figures and such, and determine if any preprocessing needs to be done. This step should be the first one in any data analysis.

In [926]:
data = pd.read_csv('heinz-sales.txt') # reads the data file into a Pandas dataframe
column_names = ("Sales", "Price", "TP", "CP", "DP", "MC1", "MC2", "MC3", "MC4") # creates names for columns
data.columns = column_names # sets the column names for the dataframe
data.head().style # Call the head() function to see how the dataset looks (.style shows all columns)
Out[926]:
Sales Price TP CP DP MC1 MC2 MC3 MC4
0 62.11 1.09 0 0 0 5 4 5 4
1 38.8602 0.911505 0 0 0 8 5 8 6
2 67.9877 0.976296 0 0 0 6 5 4 6
3 79.6575 0.918906 0 0 0 6 4 5 3
4 55.4555 0.937487 0 0 0 1 3 3 3

Pandas offers some nice functionalities for descriptive statistics. For example, we can call the .describe() method to get a nice summary of the data:

In [927]:
data.describe()
Out[927]:
Sales Price TP CP DP MC1 MC2 MC3 MC4
count 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000
mean 241.001060 0.615349 0.056911 0.097561 0.268293 4.764228 4.390244 4.650407 4.723577
std 244.417077 0.098282 0.232619 0.297934 0.444883 2.142778 2.067049 1.995831 2.041780
min 19.637194 0.463273 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000
25% 98.031610 0.560230 0.000000 0.000000 0.000000 3.000000 3.000000 3.000000 3.000000
50% 167.365924 0.594411 0.000000 0.000000 0.000000 5.000000 4.000000 5.000000 5.000000
75% 261.110276 0.640880 0.000000 0.000000 1.000000 6.000000 5.500000 6.000000 6.000000
max 1237.030076 1.090000 1.000000 1.000000 1.000000 11.000000 11.000000 11.000000 11.000000

To visualize the data, we can use pyplot (part of the matplotlib package). We'll draw histograms of sales and prices, a scatter plot of sales-price pairs and a correlation plot to study the correlations between the variables.

In [928]:
f = plt.figure(figsize=(15,15))
hist_sales = plt.subplot(3,1,1) # 3 subplots, this being the first one, in position (1,1)
hist_sales.hist(data["Sales"], bins = 40) # this plots the Sales data as a histogram with 40 bins
plt.xlabel("Sales")
plt.ylabel("Count")
plt.title("Histogram of Sales")

hist_price = plt.subplot(3,1,2)
hist_price.hist(data["Price"], bins = 40)
plt.xlabel("Price")
plt.ylabel("Count")
plt.title("Histogram of Prices")

ps_scatter = plt.subplot(3,1,3)
ps_scatter.scatter(data["Price"], data["Sales"]) # the 3rd plot is a scatter plot of Price and Sales
plt.xlabel("Price")
plt.ylabel("Sales")
plt.title("Relationship between price and sales")
plt.tight_layout()
For the correlations, we'll use the seaborn package, which allows us to create a somewhat fancier and more refined heat map of correlations. (A similar and good enough figure could be drawn with pyplot as well.)
In [929]:
correlations = data.corr()
cmap = sns.diverging_palette(220, 10, as_cmap=True) # color palette variable

# this mask hides the upper triangle from the map (contains duplicate info) 
# Data will not be shown in cells where mask is True. Cells with missing values are automatically masked
mask = np.zeros_like(correlations, dtype=bool) # creates a boolean matrix full of False values, with the same shape as correlations
mask[np.triu_indices_from(mask)] = True # the triu_indices_from returns the indices for the upper-triangle of argument array

# We're now ready to draw the heatmap:
sns.heatmap(correlations, mask=mask, cmap = cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}) 
Out[929]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c24e3ecf8>

As we can see from the above histograms, the sales and price both seem to follow what looks to be a log-normal distribution. We don’t want this to cause problems for our future analysis, so we are going to take the logarithms of sales and price moving forward.

At least visually, the link between price and sales doesn't seem to be very strong; there's a lot of scatter in the scatter plot. However, there's a hint of the expected "the lower the price, the higher the sales" effect.

The correlations between the variables aren’t too large. The mail campaigns seem to be rather strongly and positively correlated, but all that means is that if there were more mail campaigns at location 1 (MC1), for example, then it’s also likely that there were more mail campaigns at, say, location 4 (MC4). That’s not extraordinary since we would assume that mail campaigns would be done pretty uniformly at all locations.

EXTRA: Stationarity analysis

To check the stationarity of the variables we can do two things: visually inspect the behaviour of the mean and standard deviation (in a stationary series, they should remain roughly constant), and use a statistical test. In principle, this check should be done for all variables, not just the target one, but here we'll do it only for sales as an example. The preprocessing steps that might then be required are a whole topic of their own and outside the scope of this solution report.

Let's first plot the series; after that we can use the Augmented Dickey-Fuller (ADF) test, readily available through statsmodels and used for exactly this purpose. It is a statistical test of whether the series is driven by a stochastic trend (a unit root). The null hypothesis H0 is that the time series has a unit root, meaning it is non-stationary. The alternative hypothesis H1 is that there is no unit root, meaning that the series is stationary.

So, first the plots:

In [930]:
rolmean = data["Sales"].rolling(12).mean() # 3 month window
rolstd = data["Sales"].rolling(12).std() # same

#Plot rolling statistics:
plt.figure(figsize=(15,5))
orig = plt.plot(data["Sales"], color='blue',label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std  = plt.plot(rolstd, color='black', label = 'Rolling Std')
plt.legend(loc='right')
plt.title('Rolling Mean & Standard Deviation')
plt.show()

The plot shows that the 12-week rolling mean and standard deviation are relatively stable. There is no clear trend or seasonality. Based on this, we could assume stationarity. But let's check with the ADF test to be sure:

In [931]:
from statsmodels.tsa.stattools import adfuller
sales = np.log(data["Sales"]).values
adf_result = adfuller(sales)
print('ADF Statistic: %f' % adf_result[0])
print('p-value: %f' % adf_result[1])
print('Critical Values:')
for key, value in adf_result[4].items():
	print('\t%s: %.3f' % (key, value))
ADF Statistic: -8.087745
p-value: 0.000000
Critical Values:
	1%: -3.485
	5%: -2.886
	10%: -2.580

The p-value of the test is practically 0, so we can safely reject the null hypothesis in favour of H1, thus concluding that the series is indeed stationary. A p-value above the chosen significance level (e.g. 1% or 5%, depending on the desired level of certainty) would indicate that we cannot trust the series to be stationary.

b) Building AR(p) models with 1 and 2 lags

Now, let's move on to autoregressive modeling. The process we follow (for both lag orders) is: 1) create a dataframe with the target variable (Sales) and its lagged explanatory variables (Sales with lag 1 and Sales with lag 2), 2) fit an ordinary least squares (OLS) model to these data, and 3) plot the autocorrelation functions to study the autocorrelation characteristics of the series.
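
For reference, the AR(2) model estimated here can be written (in log sales, with normally distributed errors) as

Sales_t = const + phi_1 * Sales_{t-1} + phi_2 * Sales_{t-2} + e_t,    e_t ~ N(0, sigma^2),

and the AR(1) model simply drops the second lag term. This is the same linear form that the simulation functions further below are built around.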

In [932]:
# define embedding function (familiar from the tutorial)
# This is used to generate datasets with lagged variables

def embed(data, lag, var_name):
    embedded = data.rename(var_name)
    for i in range(1,lag+1):
        embedded = pd.concat([embedded, data.shift(i).rename(var_name+'_'+str(i))], axis=1)
    
    return embedded[lag::] 
In [933]:
lag = 2
AR_dat = embed(np.log(data["Sales"]), lag, var_name='Sales')
AR_dat.head()
Out[933]:
Sales Sales_1 Sales_2
2 4.219326 3.659970 4.128907
3 4.377737 4.219326 3.659970
4 4.015580 4.377737 4.219326
5 4.993798 4.015580 4.377737
6 4.150569 4.993798 4.015580
In [934]:
# Fitting the ols models. The StatsModels formula package (smf.) allows us to use R-type formulas to generate models
AR_mod1 = smf.ols(formula='Sales ~ Sales_1', data=AR_dat).fit() # The formula says: "predict Sales as dependent on the Sales_1 variable"
AR_mod2 = smf.ols(formula='Sales ~ Sales_1 + Sales_2', data=AR_dat).fit() # The same as above but: "...on Sales_1 and Sales_2" variables
print(AR_mod1.summary())
print(AR_mod2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.081
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     10.54
Date:                Wed, 13 Mar 2019   Prob (F-statistic):            0.00152
Time:                        11:17:49   Log-Likelihood:                -139.59
No. Observations:                 121   AIC:                             283.2
Df Residuals:                     119   BIC:                             288.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.7028      0.453      8.171      0.000       2.805       4.600
Sales_1        0.2822      0.087      3.246      0.002       0.110       0.454
==============================================================================
Omnibus:                        4.221   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.121   Jarque-Bera (JB):                3.696
Skew:                           0.411   Prob(JB):                        0.158
Kurtosis:                       3.242   Cond. No.                         34.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     5.802
Date:                Wed, 13 Mar 2019   Prob (F-statistic):            0.00395
Time:                        11:17:49   Log-Likelihood:                -139.05
No. Observations:                 121   AIC:                             284.1
Df Residuals:                     118   BIC:                             292.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0454      0.562      7.199      0.000       2.933       5.158
Sales_1        0.3095      0.091      3.406      0.001       0.130       0.489
Sales_2       -0.0938      0.091     -1.031      0.305      -0.274       0.086
==============================================================================
Omnibus:                        4.135   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.127   Jarque-Bera (JB):                3.591
Skew:                           0.401   Prob(JB):                        0.166
Kurtosis:                       3.263   Cond. No.                         59.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [935]:
# Alternative way of doing this:
AR_mod1_alt = sm.tsa.ARMA(np.log(data["Sales"]), order=(1,0)) # creates an AR model (order of MA part = 0)
AR_mod2_alt = sm.tsa.ARMA(np.log(data["Sales"]), order=(2,0)) # creates an AR model (order of MA part = 0)

As we can see, the two models are very similar. The coefficient of the first lagged variable differs only slightly between the models, and its significance level also varies a bit; in both models the first lagged variable is highly statistically significant, while the second lag of Sales is not statistically significant. The intercept is significant in both cases. This is further supported by the ACF plots of the original Sales data, which show that sales are correlated with their values one period earlier. In the plots, the significant autocorrelations are the ones that stand out from the blue zone of insignificance. The partial autocorrelation plot controls for the shorter lags, whereas the regular one doesn't; checking both gives a fuller picture of the situation.

In [936]:
acfplot = plot_acf(AR_dat["Sales"], lags=20) # assigned to variable to avoid double plots
pacfplot = plot_pacf(AR_dat["Sales"], lags =20)

c) Simulating trajectories for comparison

We will next simulate some trajectories from these AR(p) models to see how well our models actually fit the data.

In [937]:
# This is a function that plots into one figure the time series itself together with
# its autocorrelation and partial autocorrelation plots (the QQ and probability plots have been commented out below)
def tsplot(y, lags=None, figsize=(12, 10), style='bmh'):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        #mpl.rcParams['font.family'] = 'Ubuntu Mono'
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        #qq_ax = plt.subplot2grid(layout, (2, 0))
        #pp_ax = plt.subplot2grid(layout, (2, 1))
        
        y.plot(ax=ts_ax)
        ts_ax.set_title('Time Series Analysis Plots')
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)
        #sm.qqplot(y, line='s', ax=qq_ax) # commented out together with the qq_ax subplot above
        #qq_ax.set_title('QQ Plot')        
        #scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)

        plt.tight_layout()
    return 
In [938]:
# Simulating observations from AR(1) model

const1 = AR_mod1.params[0] #intercept for model 1
const2 = AR_mod2.params[0] # ... for model 2
phi1 = AR_mod1.params[1]  # coefficients for model 1
phi2 = AR_mod2.params[1:3] # ... model 2

sigma1 = np.sqrt(AR_mod1.scale) #residual standard error is obtained in Python via sqrt(scale)
sigma2 = np.sqrt(AR_mod2.scale)
y_0 = AR_dat.iloc[0]['Sales']
N = len(AR_dat["Sales"])

# This function can be used to simulate values from the AR(1) model
# It is just a linear function of a constant (the intercept), the first lag term & its coefficient and the error e.
def simulate_from_ar1(const, phi, sigma, N, initial_value = 0.0):
    y = np.zeros(N) 
    e = np.random.normal(size=N, scale=sigma)
    y[0] = initial_value
    for t in range(1,N): 
        y[t] = const + phi*y[t-1] + e[t] # linear regression!
    return y

# This function can be used to simulate values from the AR(2) model
# It is just a linear function of a constant (the intercept), the two lag terms & their coefficients and the error e.
def simulate_from_ar2(const, phi, sigma, N, initial_value = 0.0):
    y = np.zeros(N) 
    e = np.random.normal(size=N, scale=sigma)
    y[0:2] = initial_value
    for t in range(2,N): 
        y[t] = const + phi[0]*y[t-1] + phi[1]*y[t-2] + e[t] # linear regression!
    return y
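
The simulated trajectories sim1 and sim2 used in the comparisons below are not generated anywhere in this excerpt; presumably they come from the two functions above, along the following lines (a sketch: one trajectory per model, of the same length as the observed series, started from the first observed value; the exact test statistics below of course depend on the random draw):

# Hypothetical generation of the simulated trajectories compared to the data below
sim1 = simulate_from_ar1(const1, phi1, sigma1, N, initial_value = y_0)
sim2 = simulate_from_ar2(const2, phi2.values, sigma2, N, initial_value = y_0) # .values gives positional access to the two coefficients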

To see if our simulated trajectories are in line with the real data, we will compare the simulated series with the original series using Student’s t-test for two samples. We do this instead of comparing the descriptive statistics because we cannot easily judge if a difference in the descriptive statistics is significant or not.

First, however, we will check with the help of the Levene test if the variances of the two series are homogeneous. The answer affects our choice of t-test.

In [939]:
AR_sales = AR_dat["Sales"].reset_index(drop=True) # this just resets the indexing in the Sales and saves it as a new variable

Levene_results1 = scs.levene(sim1, AR_sales, center="median")
print(Levene_results1)

Levene_results2 = scs.levene(sim2, AR_sales, center="median")
print(Levene_results2)
LeveneResult(statistic=0.9729586488593093, pvalue=0.3249367747651095)
LeveneResult(statistic=0.04643589387761117, pvalue=0.8295684076959431)

The null hypothesis H0 of the Levene test is that the variances are equal, and the alternative hypothesis H1 that they are unequal. The large p-values indicate that we cannot reject the null hypothesis. So we conclude that, based on the results, we can assume equal variances between the original data series and the simulations.

Next up, the t-test:

In [940]:
ttest1 = scs.ttest_ind(sim1, AR_sales) # variances assumed equal by default
print(ttest1)
ttest2 = scs.ttest_ind(sim2, AR_sales) # variances assumed equal by default
print(ttest2)
Ttest_indResult(statistic=0.9059078509313528, pvalue=0.3658931944659489)
Ttest_indResult(statistic=0.6388323481212247, pvalue=0.5235411877761567)

From the output above we can see that both p-values are above 5%, which tells us that the differences in means between the true series and the simulated series are not statistically significant; i.e., our simulated series can be considered statistically indistinguishable from the true series of sales. Now let's visualize the series and see how they look. This is basically a sanity check of the simulations: if they looked very different from the original, we would probably have to look for errors or refine the models.

In [941]:
sims = pd.DataFrame({"AR1":sim1, "AR2":sim2})
pd.concat([AR_sales, sims], axis = 1).plot(figsize=(15,5))
Out[941]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c218036a0>

As we can see, the simulated trajectories are heavily overlapping with the real data. They also follow a similar pattern, so we can assume that the simulated trajectories and the estimated models fit the data pretty well.

Task 2 Modeling sales and promotion effectiveness

Let's move on to adding more explanatory variables and using somewhat more sophisticated tools to select an efficient model for the data. The point here is to see how adding more variables might help with modeling, and then to try to figure out which variables are actually useful. There are many approaches one might use here; the one below is just one of them.

d) Creating the lagged dataset

First, we'll create a dataset which includes all explanatory variables, plus all variables lagged by 1 and 2 periods. We do not want to include the non-lagged Sales as an explanatory variable, as a) it would be pointless to predict a value with itself, and b) it would cause the model to exclude all other variables (fitting them with zero coefficients).

In [942]:
data = pd.read_csv('heinz-sales.txt') # let's import the data again
column_names = ("Sales", "Price", "TP", "CP", "DP", "MC1", "MC2", "MC3", "MC4") # creates names for columns
data.columns = column_names # sets the column names for the dataframe

data_lag0 = data # just duplicate the dataset first
data_lag0["Sales"] = np.log(data["Sales"])
data_lag0["Price"] = np.log(data["Price"])
data_lag1 = data_lag0.shift(1).add_suffix('_lag1') # creates the lag 1 dataset
data_lag2 = data_lag0.shift(2).add_suffix('_lag2') # ...lag 2 dataset


data_lags = pd.concat([data_lag0, data_lag1, data_lag2], axis=1)[2::] # combined dataset of all lags
data_lags.head()
Out[942]:
Sales Price TP CP DP MC1 MC2 MC3 MC4 Sales_lag1 ... MC4_lag1 Sales_lag2 Price_lag2 TP_lag2 CP_lag2 DP_lag2 MC1_lag2 MC2_lag2 MC3_lag2 MC4_lag2
2 4.219326 -0.023989 0 0 0 6 5 4 6 3.659970 ... 6.0 4.128907 0.086177 0.0 0.0 0.0 5.0 4.0 5.0 4.0
3 4.377737 -0.084571 0 0 0 6 4 5 3 4.219326 ... 6.0 3.659970 -0.092659 0.0 0.0 0.0 8.0 5.0 8.0 6.0
4 4.015580 -0.064553 0 0 0 1 3 3 3 4.377737 ... 3.0 4.219326 -0.023989 0.0 0.0 0.0 6.0 5.0 4.0 6.0
5 4.993798 -0.214243 0 0 0 6 4 6 7 4.015580 ... 3.0 4.377737 -0.084571 0.0 0.0 0.0 6.0 4.0 5.0 3.0
6 4.150569 -0.270605 0 0 0 3 2 2 2 4.993798 ... 7.0 4.015580 -0.064553 0.0 0.0 0.0 1.0 3.0 3.0 3.0

5 rows × 27 columns

e) Splitting data into training and testing sets

We can use a 2:1 ratio for the training and testing set sizes. As the dataset has 121 rows, i.e. observations, we'll use the first 80 observations as the training set and the last 41 as the testing set. At the same time, we'll set our target variable (the y's).

In [943]:
X_train, y_train = data_lags.iloc[0:80, 1:], data_lags["Sales"][0:80] # lose the Sales column
X_test, y_test = data_lags.iloc[80:, 1:], data_lags["Sales"][80:] # lose the Sales column
X_test.tail()
Out[943]:
Price TP CP DP MC1 MC2 MC3 MC4 Sales_lag1 Price_lag1 ... MC4_lag1 Sales_lag2 Price_lag2 TP_lag2 CP_lag2 DP_lag2 MC1_lag2 MC2_lag2 MC3_lag2 MC4_lag2
118 -0.527387 0 0 0 5 4 6 5 3.510654 -0.541150 ... 8.0 5.388899 -0.457566 0.0 1.0 0.0 2.0 1.0 1.0 2.0
119 -0.579416 0 0 0 2 4 4 3 3.780571 -0.527387 ... 5.0 3.510654 -0.541150 0.0 0.0 0.0 6.0 4.0 5.0 8.0
120 -0.560290 0 0 0 6 4 6 7 4.926037 -0.579416 ... 3.0 3.780571 -0.527387 0.0 0.0 0.0 5.0 4.0 6.0 5.0
121 -0.520750 0 0 0 3 3 2 1 4.737477 -0.560290 ... 7.0 4.926037 -0.579416 0.0 0.0 0.0 2.0 4.0 4.0 3.0
122 -0.532620 0 0 0 5 7 6 7 4.019532 -0.520750 ... 1.0 4.737477 -0.560290 0.0 0.0 0.0 6.0 4.0 6.0 7.0

5 rows × 26 columns

In [944]:
X_train.head() # just checking the first rows of the training set to see if it's OK.
Out[944]:
Price TP CP DP MC1 MC2 MC3 MC4 Sales_lag1 Price_lag1 ... MC4_lag1 Sales_lag2 Price_lag2 TP_lag2 CP_lag2 DP_lag2 MC1_lag2 MC2_lag2 MC3_lag2 MC4_lag2
2 -0.023989 0 0 0 6 5 4 6 3.659970 -0.092659 ... 6.0 4.128907 0.086177 0.0 0.0 0.0 5.0 4.0 5.0 4.0
3 -0.084571 0 0 0 6 4 5 3 4.219326 -0.023989 ... 6.0 3.659970 -0.092659 0.0 0.0 0.0 8.0 5.0 8.0 6.0
4 -0.064553 0 0 0 1 3 3 3 4.377737 -0.084571 ... 3.0 4.219326 -0.023989 0.0 0.0 0.0 6.0 5.0 4.0 6.0
5 -0.214243 0 0 0 6 4 6 7 4.015580 -0.064553 ... 3.0 4.377737 -0.084571 0.0 0.0 0.0 6.0 4.0 5.0 3.0
6 -0.270605 0 0 0 3 2 2 2 4.993798 -0.214243 ... 7.0 4.015580 -0.064553 0.0 0.0 0.0 1.0 3.0 3.0 3.0

5 rows × 26 columns

f)

Then we'll start building the models. In this model solution, we'll try out both Lasso and Ridge. The implementation (and results) for Elastic Net would be quite similar to Lasso.

To be able to build a good model using a regularization method such as Lasso, we need to pick an alpha value (called 'lambda' in some sources), i.e. a penalty factor on the coefficients. The point of the penalty is to shrink the coefficients, and with Lasso to drop some variables entirely, so that the model we end up with doesn't overfit when we move from training to testing. Reasonably good performance on the testing set indicates reasonably trustworthy simulation results as well.

The selected method for finding a good alpha value could be described as "try and observe", so nothing too fancy. But it is also an approach that fits many very different situations, also outside time series analysis. The idea is to generate a large number of alpha values in a range where we guess the correct alpha might lie, say between 0.0000001 and 1, and use an information criterion to rate the performance of each tried alpha. In this case, we'll try two different information criteria, the Akaike and Bayesian Information Criteria (AIC and BIC). If we don't find a good alpha value in our selected range, we can try again with a different range. We'll select the alpha value that gets the lowest IC value.
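
As a reminder, the criteria computed in the loop below are (up to additive constants that do not affect the comparison)

AIC = 2*k + n*ln(MSE)
BIC = k*ln(n) + n*ln(MSE),

where k is the number of nonzero coefficients plus one for the error variance, n is the number of training observations and MSE is the training mean squared error.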

In [945]:
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import scale 

# The following script can be used to study both Lasso and Ridge regression. 
# You just have to select which model to use and play a little with the to-be tested alphas.

nobs = X_train.shape[0] # number of observations

# an array of values to be tried, on a logarithmic spread. For Ridge, use e.g. (-4, 3, num=1000)
alphas = np.logspace(-6, 0, num=1000) 

# model = Ridge(max_iter = 10000, normalize = False, copy_X = True, tol=0.0001)
model = Lasso(max_iter = 10000, normalize = False, copy_X = True, tol=0.0001)
coefs = []
bic = []
aic = []
alt_aic = []
alt_bic = []

for a in alphas:
    model.set_params(alpha=a) # sets the alpha
    model.fit(X_train.astype(float), y_train) # fits the model
    nvar = model.coef_[model.coef_!=0].shape[0] + 1 #1 added for variance, should be included in this approach
    # print(nvar, " for alpha " , a) # you can use this to verify that the model kicks out variables when alpha increases
    
    coefs.append(model.coef_) # saves the calculated coefficients
    mse = mean_squared_error(y_train, model.predict(X_train))
    aic.append(2*nvar + nobs*np.log(mse)) # saves the AIC value
    bic.append(nobs*np.log(mse) + nvar*np.log(nobs)) # saves the BIC value
    
    # These are alternative formulas for calculating the AIC and BIC
    # alt_aic.append(2*nvar + nobs*mse/np.var(y_train))
    # alt_bic.append(np.log(nobs)*nvar + nobs*mse/np.var(y_train))
    
# minimum indexes for bic and aic:
bic_i = np.argmin(bic)
aic_i = np.argmin(aic)
bic_alpha = alphas[bic_i]
aic_alpha = alphas[aic_i]

# alternative AIC and BIC
# alt_bic_i = np.argmin(alt_bic)
# alt_aic_i = np.argmin(alt_aic)
# alt_bic_alpha = alphas[alt_bic_i]
# alt_aic_alpha = alphas[alt_aic_i]

print("BIC: alpha = ", bic_alpha, "; AIC: alpha = ", aic_alpha)
# print(" Alternatives: BIC: alpha = ", alt_bic_alpha, "; AIC: alpha = ", alt_aic_alpha)

# lets plot how the different weights of variables behave when alpha changes:
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights');
BIC: alpha =  0.019692202554791736 ; AIC: alpha =  0.01101645949633657

From the plot above we can see that many variable weights approach 0 as alpha increases towards 1. Around 10^-3 to 10^-1 there are "some but not too many" variables left, which is in line with the optimal alpha values of roughly 0.0197 (BIC) and 0.0110 (AIC).

There's also a scikit-learn estimator called LassoLarsIC that is supposed to pick the alpha using AIC or BIC directly. Our initial analysis indicates that it doesn't work perfectly here, so use it with caution!

In [946]:
# Lasso AIC and BIC with LassoLarsIC. It seems that LassoLarsIC does not calculate the values correctly here, so use with caution!
from sklearn.linear_model import LassoLarsIC
model_bic = LassoLarsIC(criterion='bic', normalize=True, max_iter=10000)
model_bic.fit(X_train, y_train)
alpha_bic_ = model_bic.alpha_

model_aic = LassoLarsIC(criterion='aic', normalize=True, max_iter=10000)
model_aic.fit(X_train, y_train)
alpha_aic_ = model_aic.alpha_
print("BIC with Lars: alpha = ", alpha_bic_, "AIC with Lars: alpha = ", alpha_aic_)
BIC with Lars: alpha =  0.008394430755697369 AIC with Lars: alpha =  0.008394430755697369
In [947]:
ax = plt.gca()
ax.plot(alphas, aic)
ax.plot(alphas, bic)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('IC');

For Lasso we find a clear minimum between 0.01 and 0.02, depending on the criterion. For this case, we pick the stricter one, BIC, which gives an alpha of about 0.0197. Ridge, however, does not have a clear minimum: the criteria simply keep decreasing towards the lower end of the alpha range, meaning Ridge wants to keep essentially all of the variables. Next, let's see how the models perform:

g) Choosing the best model

Using the alphas we just found out, let's run those models again and save some results from them. Then, we'll use the models to estimate the Sales based on the testing dataset and compare results.

After comparing Lasso to Ridge, we will still compare the better one to a simple OLS model with the same variables, but naturally refitted.

In [948]:
alpha_L = 0.01987
alpha_R = 0.0001

model_L = Lasso(alpha=alpha_L, max_iter = 10000, normalize = False, tol=0.0001)
model_R = Ridge(alpha=alpha_R, max_iter = 10000, normalize = False, tol=0.0001)

model_L.fit(X_train.astype(float), y_train)
model_R.fit(X_train.astype(float), y_train)

L_coeffs = model_L.coef_
R_coeffs = model_R.coef_

#MSE
mse_L = np.round(mean_squared_error(y_test, model_L.predict(X_test)), 4)
mse_R = np.round(mean_squared_error(y_test, model_R.predict(X_test)), 4)

# IC
bic_L = nobs*np.log(mse_L) + nvar*np.log(nobs)
bic_R = nobs*np.log(mse_R) + nvar*np.log(nobs)

aic_L = 2*nvar + nobs*np.log(mse_L)
aic_R = 2*nvar + nobs*np.log(mse_R)

#Testing: 
    
ridge_vars = pd.DataFrame(R_coeffs[R_coeffs>0]).T
ridge_vars.columns = X_train.columns[R_coeffs>0]
ridge_intercept = model_R.intercept_
print("\n Ridge produces an MSE of ", mse_R, " with variables: \n", ridge_vars)

lasso_vars = pd.DataFrame(L_coeffs[L_coeffs>0]).T
lasso_vars.columns = X_train.columns[L_coeffs>0]
lasso_intercept = model_L.intercept_
print("\n Lasso produces an MSE of ", mse_L, " with variables: \n", lasso_vars)
 Ridge produces an MSE of  0.3525  with variables: 
          TP        CP        DP       MC3       MC4  Sales_lag1  Price_lag1  \
0  2.079804  0.998791  0.380709  0.011537  0.101421    0.263091     0.76614   

   MC2_lag1   TP_lag2  MC1_lag2  MC4_lag2  
0  0.015437  0.289773  0.008429  0.012455  

 Lasso produces an MSE of  0.3956  with variables: 
          TP        CP        DP       MC4  Sales_lag1  MC1_lag2
0  1.605353  0.814718  0.311112  0.072845    0.197957  0.010245

From the results on the testing set, we can see that the performance of the two models is quite close. Ridge produces a slightly lower mean squared error but, on the other hand, uses 11 variables compared to Lasso's 6. It could be said that Lasso is more efficient in terms of results versus the number of variables. From this point on, the Lasso model with the optimal alpha we discovered (alpha ≈ 0.0197) is used.

Next, for comparison, let's fit an OLS model to the same variables. This can be done in the same way as in Task 1.

In [949]:
# fit the ols model using the training set. This method requires the target variable (Sales, lag=0) to be included in the dataset
# Otherwise the OLS function doesn't find it...
ols_m = smf.ols(formula='Sales ~ TP + CP + DP + MC4 + Sales_lag1 + MC1_lag2', data=data_lags.iloc[0:80]).fit()

# Get the parameters from the fitted model
ols_coeffs = ols_m.params
ols_sigma = ols_m.scale

# calculate the mean squared error
mse_ols = np.round(mean_squared_error(y_test, ols_m.predict(X_test)), 4)
print(ols_coeffs, "\n Mean squared error = ", mse_ols)
Intercept     3.653014
TP            2.059062
CP            1.299848
DP            0.511680
MC4           0.085078
Sales_lag1    0.158427
MC1_lag2     -0.018188
dtype: float64 
 Mean squared error =  0.3276

Well, what do you know! The OLS-fitted linear model actually gives a slightly lower mean squared error on the test set than the Lasso. This is not very surprising, as Ordinary Least Squares minimizes exactly that quantity (on the training data), whereas Lasso does not.
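
(For reference, scikit-learn's Lasso minimizes the penalized objective (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1, i.e. the scaled squared error plus an L1 penalty on the coefficient sizes, whereas plain OLS minimizes the squared error alone.)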

In [950]:
# Let's compare the parameters of the Lasso model and OLS side-by-side:
lvars = lasso_vars.values.T
lvars = np.insert(lvars, 0, lasso_intercept) # this just fetches the intercept and joins it among other coefficients
l_df = pd.DataFrame(data=lvars, columns=["Lasso"], index=ols_coeffs.index)
ols_df = pd.DataFrame(ols_coeffs, columns=["OLS"])
comparison = l_df.join(ols_df)
comparison
Out[950]:
Lasso OLS
Intercept 3.939979 3.653014
TP 1.605353 2.059062
CP 0.814718 1.299848
DP 0.311112 0.511680
MC4 0.072845 0.085078
Sales_lag1 0.197957 0.158427
MC1_lag2 0.010245 -0.018188

The coefficients of the model fitted using OLS are slightly different, as expected, and they are pushed further towards the extremes: their absolute values are larger than those of their counterparts produced by the Lasso model, which have been shrunk by the penalty. Shrunken coefficients like these would often be expected to generalize better to new data, although in this particular case the refitted OLS model achieved a slightly lower test MSE.

Also, arriving at this particular model, with exactly these variables, would have been difficult without first fitting the Lasso model; basically, you would have had to guess the correct variables to include. In this case there were "only" 26 explanatory variables, but even that gives a painful number of possible combinations (2^26, roughly 67 million subsets). In a larger dataset with more lags, such a guessing approach would be practically impossible.

i) Simulating sales for the next month

To simulate the coming sales, we'll extend our dataset by 4 weeks. The following piece of code creates the data for these weeks. The values can be interpreted as our "decisions"; for example, in the data below we have decided that the price is 0.587 for the first two prediction weeks, 0.576 for the third and 0.610 for the fourth. These could of course be changed to simulate how different choices affect the predictions.

In [951]:
sim_week = np.array([123, 124, 125, 126])
sim_Sales = np.array(["NaN", "NaN", "NaN", "NaN"])
sim_Price = np.array(np.log([0.587, 0.587, 0.576, 0.610]))
sim_CP = np.array([0, 0, 0, 0])
sim_DP = [1, 1, 0, 0]
sim_TP = [0, 0, 0, 1]
sim_MC1 = [2, 3, 5, 3]
sim_MC2 = [3, 3, 4, 2]
sim_MC3 = [2, 2, 1, 1]
sim_MC4 = [5, 7, 8, 6]

sim_arr = np.column_stack((sim_Sales, sim_Price, sim_TP, sim_CP, sim_DP, sim_MC1, sim_MC2, sim_MC3, sim_MC4))
sim_dec = pd.DataFrame(data=sim_arr, index=sim_week)
sim_dec.columns = column_names # defined before

sim_df = pd.concat([data, sim_dec], axis=0).astype("float64")
In [952]:
# This is again a simulation function built for this purpose...

def simulate_from_lm(intercept, coeffs, data, sigma):
    N = data.shape[0]
    y = data["Sales"]
    e = np.random.normal(size=N, scale=sigma)
    for t in range(123,N): 
        y[t] = intercept + coeffs["TP"]*data["TP"][t] + coeffs["CP"]*data["CP"][t] + coeffs["DP"]*data["DP"][t] + coeffs["MC4"]*data["MC4"][t] + coeffs["MC1_lag2"]*data["MC1"][t-2] + coeffs["Sales_lag1"]*data["Sales"][t-1] + e[t] # linear regression!
    return y
In [953]:
# The following plots a figure of the end of the whole simulated sales time series
# Note the exponentiation of the simulated sales value - this is because earlier we used a log of it in modeling

plt.figure(figsize =(15,8))
plt.plot(np.exp(data["Sales"]))
plt.xlim(120, 126)
sims = 30
for s in range(0,sims):
    sim_y = simulate_from_lm(lasso_intercept, lasso_vars, sim_df, np.sqrt(ols_sigma))
    plt.plot([122, 123, 124, 125, 126], np.exp(sim_y[122:]))

    

We can see from the chart of our simulations that we are likely to have pretty stable sales in weeks 1-3 with a large potential surge in sales in week 4 of the next month. This is a result of our promotion strategy, and your results may look totally different.

Again, note that the selected marketing strategy was based on (visually) looking for patterns in the data and trying to capitalize on what seemed like a recurring theme. You can go back, try out different values for the promotions and price, and see how your results differ.

So, we have now predicted what the sales could look like for the coming 4 weeks. As our model includes a stochastic element, the same model produces different results on each simulation run. We could now take these simulations and calculate, for example, how often sales are higher than 500 units, or 1500 units. This information could then be used for many purposes, e.g. to plan production so that we don't overproduce or underproduce relative to the expected demand. One could even use an optimization model to suggest an optimal production strategy that takes into account the information from realized sales during previous weeks!
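
As a rough illustration of that last point, the exceedance frequencies could be estimated from repeated simulations along these lines (a sketch reusing simulate_from_lm and the objects defined above; the 500-unit threshold and the number of simulation rounds are just example choices):

# Estimate how often the simulated weekly sales exceed a threshold (here 500 units)
threshold = 500
n_sims = 1000
exceed_counts = np.zeros(4) # one counter per forecast week (123-126)

for s in range(n_sims):
    sim_y = simulate_from_lm(lasso_intercept, lasso_vars, sim_df, np.sqrt(ols_sigma)) # same call as in the plotting cell above
    exceed_counts += (np.exp(sim_y[123:]) > threshold).values # back-transform from log sales to units

print("Estimated probability of sales exceeding", threshold, "units, per forecast week:", exceed_counts / n_sims)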

THE END