Granger Causality for nonstationary series: code for applying the Toda-Yamamoto (1995) procedure to a whole database.

In this post I will describe how to use a code that applies the Toda and Yamamoto (TY) procedure to a whole database. As always, I try to write so that you only need very basic knowledge of R to use the code. TY requires prior knowledge of the order of integration of the series. If you want a code that applies unit root tests to a whole database following the Elder and Kennedy (2001) strategy for unit roots (so the code will find out whether you need a constant or a trend), you can read this post.

TY is one of the approaches you can follow for Granger causality testing in the context of nonstationary series. I will not go through the explanation of the procedure because it is explained very clearly by Professor Dave Giles in his post Testing for Granger Causality. If you are not very familiar with the procedure, I strongly recommend reading that post.
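In a nutshell, TY estimates a regression in levels with p + m lags, where p is the lag length chosen by information criteria and m is the maximum order of integration, and then Wald-tests only the first p lags of the candidate causal variable. A minimal base-R sketch of this idea for two simulated series (the data and variable names here are my own illustration, not part of the code discussed in this post):

```r
set.seed(1)
n_obs <- 200
x <- cumsum(rnorm(n_obs))   # simulated I(1) series
y <- cumsum(rnorm(n_obs))   # simulated I(1) series, independent of x
p <- 2; m <- 1; k <- p + m  # fit p + m lags, but test only the first p

# Lag l of a series, aligned over a common k-lag window
lagm <- function(v, l) embed(v, k + 1)[, l + 1]

dat <- data.frame(y0 = lagm(y, 0),
                  sapply(1:k, function(l) lagm(y, l)),
                  sapply(1:k, function(l) lagm(x, l)))
names(dat) <- c("y0", paste0("y_l", 1:k), paste0("x_l", 1:k))

fit <- lm(y0 ~ ., data = dat)

# Wald test: the first p lags of x are jointly zero
# (the extra m lags are left unrestricted, which is the TY trick)
idx  <- which(names(coef(fit)) %in% paste0("x_l", 1:p))
b    <- coef(fit)[idx]
V    <- vcov(fit)[idx, idx]
wald <- as.numeric(t(b) %*% solve(V) %*% b)
pval <- pchisq(wald, df = p, lower.tail = FALSE)
```

A small p-value would indicate that x Granger-causes y. The code discussed in this post automates this over every group of variables in the database.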

I will instead explain how the code works. First of all, depending on the data included in the database, you may get several error messages, mainly related to the data itself and to the no-autocorrelation tests (Breusch-Godfrey) or the homoskedasticity tests (White): for example, near-singularity issues. Of course, it is also important to consider how many lags you will include in the regressions.

It is also very important to note that the code will yield warning messages for the VARs where it was not possible to correct residual autocorrelation and heteroskedasticity.

As is recurrent practice in my posts, I will try to keep the code as simple to understand as I can (or rather, as simple as I think I can). My main purpose is that anyone who wants to start learning to code can use these codes to improve their coding skills. If you are not able to understand the code at first, work with it for a while until you get familiar and confident with it. Readers with coding experience, on the other hand, can easily check for themselves what the code is doing.

How should the database be prepared?

The database should be in data.frame format. Take a look at the following diagram:

(Diagram: layout of the database expected by the code.)

How should you read this? Imagine that you created this table because you wanted to find out the causality relations in the following groups:

GDP – CPI – IIP                                   

GDP – CPI – IIP – FDI                     

GDP – CPI – IIP – EXPORTS

GFKF – CPI – IIP                                   

GFKF – CPI – IIP – FDI                     

GFKF – CPI – IIP – EXPORTS

As you can see, the series from column n+w+1 to length(x) are the rest of the variables, the ones that change in each group (FDI and EXPORTS). Also, w covers the “fixed” variables that you want in each group of variables tested with TY, and these should sit between column n+1 and column n+w of the database. In this case w includes the variables CPI and IIP.
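To make the layout concrete, here is a toy data.frame following that structure. The numbers are invented placeholders, and n = 3 and w = 2 reflect my reading of the example (time plus the two “main” variables, then CPI and IIP as fixed variables):

```r
# Toy database in the required layout (invented numbers):
# col 1 = time, cols 2..n = main variables,
# cols (n+1)..(n+w) = fixed variables,
# cols (n+w+1)..length(x) = additional variables.
db <- data.frame(
  time    = 2000:2009,
  GDP     = c(100, 103, 105, 104, 108, 112, 115, 114, 118, 121),
  GFKF    = c(20, 21, 23, 22, 24, 26, 27, 27, 29, 30),
  CPI     = c(90, 92, 95, 97, 99, 101, 104, 107, 110, 112),
  IIP     = c(50, 51, 53, 52, 55, 57, 58, 60, 61, 63),
  FDI     = c(5, 6, 5, 7, 8, 8, 9, 10, 9, 11),
  EXPORTS = c(30, 32, 31, 34, 36, 37, 39, 40, 42, 44)
)
n <- 3  # time + GDP + GFKF
w <- 2  # CPI and IIP
# The additional variables occupy columns (n + w + 1) to length(db):
names(db)[(n + w + 1):length(db)]  # "FDI" "EXPORTS"
```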

If you just want to check causality for GDP, you only need to delete the GFKF column and set n = 2, and you will get

GDP – CPI – IIP                                   

GDP – CPI – IIP – FDI                     

GDP – CPI – IIP – EXPORTS

If instead you don’t want to include FDI and EXPORTS, you can delete those columns while keeping n = 3, but you need to include one of them (if you include neither, you will get an error message). Keeping, let’s say, FDI, you will get

GDP – CPI – IIP                                   

GDP – CPI – IIP – FDI                     

GFKF – CPI – IIP                                   

GFKF – CPI – IIP – FDI                     

But then you can delete the rows corresponding to FDI, and you will get the result that you wanted.

GDP – CPI – IIP                                                      

GFKF – CPI – IIP                                   

Finally, as you can see, the first column should always be the time variable.

Explanation of the code

I called the function toda_yamamoto and it has the following arguments:

toda_yamamoto <- function(x, n = 5, w = 1, lag.max = 10, type = c("const", "trend", "both", "none"), m = 1, o = 7, corrhet = 3, season = NULL, exogen = NULL, digits = 4, probability = 0.05)

Now I will explain the arguments of the code:

  1. lag.max, type, season and exogen are arguments of the R function VARselect. Although exogen appears among the arguments, for now you cannot set exogenous variables, so you can only leave exogen = NULL. The argument lag.max is the maximum number of lags that will be considered when searching for the optimal lag of the regressions.
  2. x is the data.frame with all the variables.
  3. m is the maximum order of integration that you will consider. This has a tricky part: since its value may change depending on the group of variables involved, the easiest way to deal with it is the following. m can’t be 0, because in that case the TY procedure is not pertinent, and it rarely makes sense to consider m > 2, because economic series are mostly I(0) or I(1), and only in a few cases I(2). The plausible options are therefore m = 1 and m = 2, so you can just run the code for both cases and take the output of m = 2 for the groups that contain I(2) variables.
  4. o indicates how many lags you are willing to add to the optimal lag in order to correct autocorrelation in the VAR (if there is residual correlation). For example, suppose you set lag.max = 8. It can happen that, after the optimal lag is obtained, there is still residual correlation, and then more lags are required, sometimes surpassing the number of lags considered when obtaining the optimal lag (8).
  5. corrhet is up to how many lags you want to check in the autocorrelation and heteroskedasticity tests. So, if corrhet = 3, the code will run the tests first with 1 lag, then with 2 lags, and finally with 3 lags.
  6. digits is how many digits you want the results in the table to have.
  7. probability is the significance level for the autocorrelation and heteroskedasticity tests applied to the VAR that was cleaned of autocorrelation and heteroskedasticity.
  8. There are several arguments of the functions used in the code that you can set inside the code itself, if you have some knowledge of coding (for example, the argument type appears in breush, stability and cusumvar).
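The lag-by-lag checking described in item 5 can be sketched in base R as follows. This is my own illustration of a Breusch-Godfrey LM test run for 1 up to corrhet lags on a simple regression, not the internal implementation of the code:

```r
set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)  # no serial correlation in the errors
fit <- lm(y ~ x)
corrhet <- 3

# Breusch-Godfrey LM test: regress residuals on the original regressors
# plus `lags` lagged residuals; LM = n * R^2 ~ chi-squared(lags)
bg_pvalue <- function(fit, lags) {
  e <- resid(fit)
  n <- length(e)
  E <- sapply(1:lags, function(l) c(rep(0, l), e[1:(n - l)]))
  aux <- lm(e ~ model.matrix(fit)[, -1] + E)
  LM <- n * summary(aux)$r.squared
  pchisq(LM, df = lags, lower.tail = FALSE)
}

# One p-value per lag order, as the code does with corrhet
pv <- sapply(1:corrhet, function(l) bg_pvalue(fit, l))
```

With serially uncorrelated errors, none of the p-values should typically be small.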

The output of the code

The code checks the stability of the VAR through CUSUM tests and also the roots of the system (I really don’t know why I added this “feature”), so you will get the charts of these tests. But the main output is a table where the results are summarized.

The first w+2 columns of the output table (w+1 if you don’t want to work with the variables FDI and EXPORTS in the example) show the names of the variables included in the regressions. The next column shows the number of variables included in the VAR (perhaps irrelevant information).

The next 4 columns show the optimal lag of the VAR indicated by the Akaike information criterion, the Hannan-Quinn criterion, the Schwarz criterion and the FPE criterion, respectively. In the next column you will find the lag indicated by the best criterion (considering the sample size).
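These four criteria come straight from VARselect in the vars package. A quick illustration, assuming the vars package is installed and using the Canada dataset that ships with it:

```r
library(vars)  # provides VARselect() and VAR()

data(Canada)  # example macro dataset shipped with the vars package
sel <- VARselect(Canada, lag.max = 10, type = "const")
sel$selection  # optimal lag per criterion: AIC(n), HQ(n), SC(n), FPE(n)

# For example, take the Schwarz criterion lag and fit the augmented
# VAR with p + m lags (here m = 1), as the TY procedure requires
p <- as.numeric(sel$selection["SC(n)"])
fit <- VAR(Canada, p = p + 1, type = "const")
```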

The following column shows the lag that corrects residual autocorrelation and heteroskedasticity.

The next 3 columns show the p-values of the Breusch-Godfrey, Jarque-Bera and White tests for the VAR cleaned of residual autocorrelation and heteroskedasticity.

The next w+1 columns show the p-values of the causality tests of whether each of the w variables and the ‘additional’ one cause the ‘main’ variable.

Finally, the last w+1 columns show the p-values of the causality tests of whether the ‘main’ variable causes each of the w variables and the ‘additional’ one.

Final thoughts

Finally, you can download the code from its GitHub repository. To see how the code works, you can watch this YouTube video.

I think this explanation is enough. If you have any question or suggestion, please let me know by commenting on the post.