Autoregressive integrated moving average

In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied one or more times to eliminate the non-stationarity.[1]

The AR part of ARIMA indicates that the evolving variable of interest is regressed on its own lagged (i.e., prior) values. The MA part indicates that the regression error is actually a linear combination of error terms whose values occurred contemporaneously and at various times in the past. The I (for "integrated") indicates that the data values have been replaced with the difference between their values and the previous values (and this differencing process may have been performed more than once). The purpose of each of these features is to make the model fit the data as well as possible.

Non-seasonal ARIMA models are generally denoted ARIMA(p,d,q), where the parameters p, d, and q are non-negative integers: p is the order (number of time lags) of the autoregressive model, d is the degree of differencing (the number of times the data have had past values subtracted), and q is the order of the moving-average model. Seasonal ARIMA models are usually denoted ARIMA(p,d,q)(P,D,Q)m, where m refers to the number of periods in each season, and the uppercase P, D, Q refer to the autoregressive, differencing, and moving average terms for the seasonal part of the ARIMA model.[2][3]

When two out of the three terms are zeros, the model may be referred to based on the non-zero parameter, dropping "AR", "I" or "MA" from the acronym describing the model. For example, ARIMA(1,0,0) is AR(1), ARIMA(0,1,0) is I(1), and ARIMA(0,0,1) is MA(1).

ARIMA models can be estimated following the Box–Jenkins approach. 

Definition

Given a time series of data X_t, where t is an integer index and the X_t are real numbers, an ARMA(p',q) model is given by

X_t - \alpha_1 X_{t-1} - \dots - \alpha_{p'} X_{t-p'} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},

or equivalently by

\left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right) X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t

where L is the lag operator, the \alpha_i are the parameters of the autoregressive part of the model, the \theta_i are the parameters of the moving average part, and the \varepsilon_t are error terms. The error terms \varepsilon_t are generally assumed to be independent, identically distributed variables sampled from a normal distribution with zero mean.

Assume now that the polynomial \left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right) has a unit root (a factor (1 - L)) of multiplicity d. Then it can be rewritten as:

\left(1 - \sum_{i=1}^{p'} \alpha_i L^i\right) = \left(1 - \sum_{i=1}^{p'-d} \phi_i L^i\right) (1 - L)^d.

An ARIMA(p,d,q) process expresses this polynomial factorisation property with p=p'−d, and is given by:

\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t

and thus can be thought of as a particular case of an ARMA(p+d,q) process having an autoregressive polynomial with d unit roots. (For this reason, no ARIMA model with d > 0 is wide-sense stationary.)

The above can be generalized as follows.

\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d X_t = \delta + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t.

This defines an ARIMA(p,d,q) process with drift \delta / (1 - \sum_{i=1}^{p} \phi_i).
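A minimal pure-Python sketch of this process (function name and parameter values are illustrative, not a library API): first generate the stationary ARMA part with drift \delta via its recursion, then integrate it d times by cumulative summation.

```python
import random

def simulate_arima(phi, theta, d, delta, n, seed=1):
    """Simulate an ARIMA(p, d, q) process with drift delta.

    The stationary ARMA(p, q) part is generated first via its
    recursion, then integrated ("undifferenced") d times.
    """
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]  # white-noise errors
    y = []
    for t in range(n):
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[i] * eps[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        y.append(delta + ar + eps[t] + ma)
    x = y
    for _ in range(d):  # undo one difference per pass: cumulative sum
        acc, out = 0.0, []
        for v in x:
            acc += v
            out.append(acc)
        x = out
    return x

series = simulate_arima(phi=[0.5], theta=[0.3], d=1, delta=0.1, n=200)
```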

Other special forms

The explicit identification of the factorisation of the autoregression polynomial into factors as above can be extended to other cases, firstly to apply to the moving average polynomial and secondly to include other special factors. For example, having a factor (1 - L^s) in a model is one way of including non-stationary seasonality of period s into the model; this factor has the effect of re-expressing the data as changes from s periods ago. Another example is the factor (1 - \sqrt{3} L + L^2), which includes a (non-stationary) seasonality of period 12. The effect of the first type of factor is to allow each season's value to drift separately over time, whereas with the second type values for adjacent seasons move together.

Identification and specification of appropriate factors in an ARIMA model can be an important step in modelling as it can allow a reduction in the overall number of parameters to be estimated, while allowing the imposition on the model of types of behaviour that logic and experience suggest should be there.

Differencing

Differencing in statistics is a transformation applied to time-series data in order to make it stationary. A stationary time series' properties do not depend on the time at which the series is observed.

In order to difference the data, the difference between consecutive observations is computed. Mathematically, this is shown as

y_t' = y_t - y_{t-1}

Differencing removes the changes in the level of a time series, eliminating trend and seasonality and consequently stabilizing the mean of the time series.

Sometimes it may be necessary to difference the data a second time to obtain a stationary time series, which is referred to as second order differencing:

\begin{aligned} y_t^* &= y_t' - y_{t-1}' \\ &= (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) \\ &= y_t - 2y_{t-1} + y_{t-2} \end{aligned}
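As a worked example, consider a purely quadratic series y_t = t^2: one round of differencing leaves a linear trend, and a second round leaves a constant (a minimal pure-Python sketch):

```python
def difference(y):
    """First-order differencing: y'_t = y_t - y_{t-1}."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]

y = [1, 4, 9, 16, 25]       # quadratic trend: y_t = t^2
first = difference(y)       # [3, 5, 7, 9]  -- still trending
second = difference(first)  # [2, 2, 2]     -- constant, trend removed
```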

Another method of differencing data is seasonal differencing, which involves computing the difference between an observation and the corresponding observation in the previous year. This is shown as:

y_t' = y_t - y_{t-m} \quad \text{where } m = \text{number of seasons}.

The differenced data is then used for the estimation of an ARMA model.
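Seasonal differencing can be sketched the same way. In this illustrative example, quarterly data (m = 4) carry a repeating within-year pattern plus a slow trend, and seasonal differencing removes both:

```python
def seasonal_difference(y, m):
    """Seasonal differencing: y'_t = y_t - y_{t-m}."""
    return [y[t] - y[t - m] for t in range(m, len(y))]

# Quarterly data: seasonal pattern (10, 20, 30, 40) plus +1 per year.
y = [10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42]
diffs = seasonal_difference(y, 4)  # constant series of ones
```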

Examples

Some well-known special cases arise naturally or are mathematically equivalent to other popular forecasting models. For example:

  • An ARIMA(0,1,0) model (or I(1) model) is given by X_t = X_{t-1} + \varepsilon_t, which is simply a random walk.
  • An ARIMA(0,1,0) with a constant, given by X_t = c + X_{t-1} + \varepsilon_t, is a random walk with drift.
  • An ARIMA(0,0,0) model is a white noise model.
  • An ARIMA(0,1,2) model is a Damped Holt's model.
  • An ARIMA(0,1,1) model without constant is a basic exponential smoothing model.[4]
  • An ARIMA(0,2,2) model is given by X_t = 2X_{t-1} - X_{t-2} + (\alpha + \beta - 2)\varepsilon_{t-1} + (1 - \alpha)\varepsilon_{t-2} + \varepsilon_t, which is equivalent to Holt's linear method with additive errors, or double exponential smoothing.[4]
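As an illustration of the first two cases, a random walk with drift can be simulated in a few lines of pure Python (the seed and drift value are arbitrary choices for the sketch):

```python
import random

rng = random.Random(42)
c = 0.5                        # drift constant; set c = 0 for a plain random walk
x = [0.0]                      # starting level
for _ in range(100):
    eps = rng.gauss(0.0, 1.0)  # white-noise error term
    x.append(c + x[-1] + eps)  # X_t = c + X_{t-1} + eps_t
```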

Choosing the order

To determine the order of a non-seasonal ARIMA model, a useful criterion is the Akaike information criterion (AIC). It is written as

\text{AIC} = -2\log(L) + 2(p + q + k + 1),

where L is the likelihood of the data, p is the order of the autoregressive part and q is the order of the moving average part. The parameter k counts the constant term c of the model: k = 1 if c ≠ 0 and k = 0 if c = 0.

The corrected AIC for ARIMA models can be written as

\text{AICc} = \text{AIC} + \frac{2(p + q + k + 1)(p + q + k + 2)}{T - p - q - k - 2}.

The Bayesian Information Criterion can be written as

\text{BIC} = \text{AIC} + (\log(T) - 2)(p + q + k + 1).

The objective is to minimize the AIC, AICc or BIC values for a good model: the lower the value of a given criterion across a range of candidate models, the better that model suits the data. Note, however, that AIC and BIC serve two different purposes. AIC aims to select the model that best approximates the underlying process for prediction, while BIC aims to identify the true model among the candidates. The BIC approach is often criticized because real-life complex data are never fit perfectly by any candidate model; it nevertheless remains a useful selection method, since it penalizes additional parameters more heavily than AIC does.

AICc can only be used to compare ARIMA models with the same orders of differencing. For ARIMAs with different orders of differencing, RMSE can be used for model comparison.
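Given a fitted model's log-likelihood, the three criteria above can be computed directly. A minimal sketch (function names are illustrative, not from any particular library):

```python
import math

def aic(loglik, p, q, k):
    """AIC = -2 log(L) + 2 (p + q + k + 1)."""
    return -2.0 * loglik + 2.0 * (p + q + k + 1)

def aicc(loglik, p, q, k, T):
    """Corrected AIC: adds a small-sample penalty to AIC."""
    n_params = p + q + k + 1
    return aic(loglik, p, q, k) + (2.0 * n_params * (n_params + 1)) / (T - p - q - k - 2)

def bic(loglik, p, q, k, T):
    """BIC = AIC + (log(T) - 2)(p + q + k + 1)."""
    return aic(loglik, p, q, k) + (math.log(T) - 2.0) * (p + q + k + 1)
```

For example, with log-likelihood -100, p = q = 1, k = 0 and T = 100, AIC is 206 and AICc is 206.25.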

Estimation of coefficients

Forecasts using ARIMA models

The ARIMA model can be viewed as a "cascade" of two models. The first is non-stationary:

Y_t = (1 - L)^d X_t

while the second is wide-sense stationary:

\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) Y_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t.

Now forecasts can be made for the process Y_t, using a generalization of the method of autoregressive forecasting.
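For d = 1, once forecasts of the stationary series Y_t are available they can be converted back into forecasts of X_t by reversing the differencing, i.e. cumulatively summing the forecast changes onto the last observed level. A minimal sketch (the helper name is illustrative):

```python
def undifference(forecast_changes, last_level):
    """Turn forecasts of Y_t = X_t - X_{t-1} into forecasts of X_t."""
    out, level = [], last_level
    for change in forecast_changes:
        level += change
        out.append(level)
    return out

# If the stationary model forecasts constant changes of 0.5
# and the last observed value X_T is 10.0:
levels = undifference([0.5, 0.5, 0.5], 10.0)  # [10.5, 11.0, 11.5]
```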

Forecast intervals

The forecast intervals (confidence intervals for forecasts) for ARIMA models are based on assumptions that the residuals are uncorrelated and normally distributed. If either of these assumptions does not hold, then the forecast intervals may be incorrect. For this reason, researchers plot the ACF and histogram of the residuals to check the assumptions before producing forecast intervals.

95% forecast interval: \hat{y}_{T+h|T} \pm 1.96 \sqrt{v_{T+h|T}}, where v_{T+h|T} is the variance of the h-step forecast error.
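As a sketch, assuming the point forecast and its forecast-error variance v_{T+h|T} are already available (for example, for a random walk the h-step forecast variance is h times the error variance \sigma^2):

```python
import math

def forecast_interval(point, variance, z=1.96):
    """95% forecast interval: point +/- 1.96 * sqrt(variance)."""
    half = z * math.sqrt(variance)
    return (point - half, point + half)

# 3-step-ahead interval for a random walk with sigma^2 = 1,
# centered on a point forecast of 10.0:
lo, hi = forecast_interval(10.0, 3 * 1.0)
```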
