Timeseries Part 3: ARIMA & SARIMA

Dec 1, 2021

The full code for this article can be found here.

In this article, I am going to look at forecasting, or making future predictions, on time-series data using Python. I will continue using the statsmodels library, along with the pmdarima library. I will cover two very widely used methods for forecasting time-series data, namely ARIMA and SARIMA.

Terms such as trend, seasonality and stationarity were covered in part 2 of this series, and you can refer to it if needed.

ARIMA

ARIMA, or AutoRegressive Integrated Moving Average, is one of the most widely used models in time-series forecasting. The model is composed of three components:

1. AR or Autoregression Model

This is a regression model that builds upon the dependent relationship between a current observation and observations over a previous period.

2. Integrated component

Differencing of observations (subtracting the observation at the previous time step from the current observation) in order to make the time series stationary.

3. MA or Moving Average Model

A model that expresses the current observation in terms of the residual errors (forecast errors) from previous time steps.
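To make the "Integrated" component concrete, here is a minimal sketch of first-order differencing in plain Python. The helper name `difference` and the toy series are hypothetical, chosen only for illustration.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series with a steady upward trend of 3 per time step.
trending = [10, 13, 16, 19, 22, 25]

# After one round of differencing the trend is gone and the values are constant.
first_diff = difference(trending)
print(first_diff)
```

Differencing once corresponds to d = 1 in the ARIMA notation. A series that still trends after one pass can be differenced again (d = 2).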

Each of these components is explicitly specified in the model as a parameter. A standard notation used is ARIMA(p,d,q) where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used.

The parameters of the ARIMA model are defined as follows:

p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced, also called the degree of differencing.
q: The number of lagged forecast errors included in the model, also called the order of the moving average.
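Written out, an ARIMA(p,d,q) model can be sketched as below. The symbols are my own notation rather than something defined earlier in this series: B is the backshift operator, y'_t is the series after differencing d times, c is a constant, the phi and theta terms are the AR and MA coefficients, and epsilon_t is the error at time t.

```latex
\begin{align}
y'_t &= (1 - B)^d \, y_t \\
y'_t &= c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
\end{align}
```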

AIC

AIC, or Akaike Information Criterion, is a metric that measures the relative quality of statistical models fitted to the same data. It rewards goodness of fit and penalises the number of parameters, so the best model according to AIC is the one that explains the most variation with the fewest parameters. Lower AIC values are better.

Let's now start with the code demo on how to estimate the parameters p, d and q for an ARIMA model and use them for training and forecasting. I am going to use the ‘airlines passengers’ data, which contains the number of passengers carried monthly by an airline from 1949 to 1960.
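As a rough illustration, AIC is computed as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximised likelihood. The log-likelihood values below are hypothetical and only show how the penalty works.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 * ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fits: the larger model fits slightly better (higher log-likelihood)
# but uses five more parameters.
aic_small = aic(log_likelihood=-700.0, n_params=3)
aic_large = aic(log_likelihood=-699.0, n_params=8)

print(aic_small, aic_large)
print("preferred:", "small" if aic_small < aic_large else "large")
```

Here the small gain in fit does not justify the extra parameters, so the smaller model has the lower AIC.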

Install the necessary libraries pmdarima and statsmodels using

!pip3 install pmdarima 
!pip3 install statsmodels

Load the data and drop the ‘na’ rows:

Set the frequency of the index to ‘MS’ (month-start), since each observation is recorded at the start of a month. This specification is required for the statsmodels library to process the data.
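A minimal sketch of these two steps with pandas follows. The column names and the inline CSV text are assumptions standing in for the real file (the values are illustrative), so adjust them to match your copy of the dataset.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file; the last row is incomplete on purpose.
# Column names "Month" and "Passengers" are assumptions.
csv_text = """Month,Passengers
1949-01-01,112
1949-02-01,118
1949-03-01,132
1949-04-01,129
1949-05-01,121
1949-06-01,135
1949-07-01,148
1949-08-01,148
1949-09-01,136
1949-10-01,119
1949-11-01,104
1949-12-01,118
1950-01-01,
"""

# Parse the Month column as dates and use it as the index.
df = pd.read_csv(io.StringIO(csv_text), index_col="Month", parse_dates=True)

# Drop rows with missing values.
df = df.dropna()

# Tell pandas (and therefore statsmodels) that the index is month-start.
df.index.freq = "MS"
print(df.index)
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.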

Plot the data

Check for Seasonality

Looking at the plot of our data, we can say that the data exhibits some seasonality, as we can observe the same patterns being repeated every year. To confirm this, we will perform ETS decomposition and check the results.

We can clearly see from the results plot above that the data is seasonal, with patterns repeating every year. Since this is monthly data, the seasonal period (or frequency) is 12. We will ignore this seasonality at first and perform ARIMA.

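Check for stationarity using the Augmented Dickey-Fuller (ADF) Test

Conclusion: Data is non-stationary, as the p-value obtained from the ADF test is quite high (above the commonly used 0.05 threshold), so we cannot reject the null hypothesis that the series is non-stationary.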

ARIMA parameters grid-search using pmdarima library

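Next, we use the auto_arima function of the pmdarima library to search for the best p, d and q parameters. By default it performs a stepwise search rather than an exhaustive grid, and ranks the candidate models by AIC.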

Here we have taken ‘seasonal = False’ as we are for now checking out the ARIMA model without seasonality. For more details on the parameters that can be passed to the auto_arima function, check out this link.

We get ARIMA(4,1,3) as our best model. We will use this to perform training and testing and check the performance of our model.

After training, get the predictions on the test data

Plot the predictions vs test data:

We can observe that although the ARIMA model has captured the mean of the time series, it does not do well at predicting the overall trend of the data.

Check the numerical error

We will use the ‘root mean square error’ as our error metric.
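For reference, here is a minimal sketch of the metric in plain Python, together with the mean and standard deviation it is usually compared against. The numbers are hypothetical and are not the results from this article.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical test values and predictions.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [90.0, 110.0, 130.0, 130.0]

error = rmse(actual, predicted)
test_mean = statistics.mean(actual)
test_std = statistics.stdev(actual)

print(f"RMSE: {error:.2f}, test mean: {test_mean:.2f}, test std: {test_std:.2f}")
```

Comparing the RMSE with the mean and standard deviation of the test data gives a sense of scale, since RMSE is expressed in the same units as the data.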

We see that the error is almost the same as the standard deviation of the test data, and is about 1/5th of the mean. An RMSE this close to the standard deviation means the model does little better than predicting the mean of the test data, which matches what we saw in the plot.

SARIMA

Now let us take seasonality into account and build a model. For this we use SARIMA, which is an ARIMA model extended with seasonal terms. With SARIMA, additional parameters P, D and Q are added to our ARIMA(p,d,q) model, where the capital letters denote the seasonal autoregressive, differencing and moving average orders respectively, along with m, the number of time steps in one seasonal cycle.

Let's look at the code:

First we use pmdarima to run the search again, but this time we set ‘seasonal=True’ and also pass in the parameter ‘m=12’, which specifies the frequency of the seasonality.

The best model returned, ARIMA(0,1,1)(2,1,0)[12], now has two sets of parameters: the values (p,d,q) for the non-seasonal part and the values (P,D,Q,m) for the seasonal part.

Train and fit the best model

We can easily see that by taking seasonality into account, the model performs quite well and follows the trend and seasonality of the test data.

Let us confirm this by finding the error:

We can see that the error has reduced by one-fifth once we took seasonality into account.

We will now use the model to make future predictions on our data. A general rule of thumb is to make forecasts for the same duration as the length of the test data. Also while making forecasts, we train the best model again on our whole data.

We can plot to see how the forecasts look

As we can see, the forecasts are very much in sync with the general trend and seasonality observed over the years.

Thanks for reading. Please feel free to post any queries, comments or suggestions.


Written by Prakhar S
