Anonymous:
45213
My question is related to the selection of variables in a model. Apparently, there are two approaches to deciding which variables should enter our model. One is related to estimation techniques of the model (forward selection, backward deletion, stepwise regression). The other is related more to the nature of the phenomenon that we want to estimate (or predict). In this second category, I would include the multicollinearity issue that we discussed in class. Reading an article for one of my other classes, I found that endogeneity could be another criterion for not including some variables in particular models. My question refers to this criterion.
Endogeneity also looks like what I call “an approach related to the nature of the phenomenon of interest”. I saw that we are going to cover endogeneity later in the class, at the same time as simultaneous equations. However, my question refers to the explanation for why specific variables that could be endogenous should not be in the model. Wikipedia (under endogeneity) says, “In econometrics the problem of endogeneity occurs when the independent variable is correlated with the error term in a regression model. This implies that the regression coefficient in an OLS regression is biased”. This explanation confuses me. I understand that we want our independent variables to be correlated with the dependent variable, right? However, since there is also a direct relationship between the distribution of the dependent variable and the distribution of the error term, is it not logical then to expect the independent variables to be correlated with the error term? Or perhaps what Wikipedia calls error terms are the differences between the observed dependent values and the predicted dependent values (residuals)? Thus, the independent variables should not be correlated with these residuals (endogeneity); nor should they be correlated among themselves (multicollinearity). Is this logic in the right direction? What could be a simple explanation of endogeneity as a problem to avoid when selecting variables to include in a model?
Note: I know that we are going to cover endogeneity later. However, since I have read a lot about this problem in connection with deciding which variables to include in a model, or with problematic models, I feel that I need to understand endogeneity as a criterion for selecting variables.
This is the wrong type of question to ask for the "specific" question. The question should be specific and about something we have talked about. This question is better for the "general" type.
But one topic is indeed relevant: "Thus, the independent variables should not be correlated with these residuals (endogeneity); nor should they be correlated among themselves (multicollinearity)."
But this is false. Why shouldn't the IVs be correlated? They are always correlated. Please review the examples.
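To make the point about correlated IVs concrete, here is a minimal simulation sketch. It is not one of the class examples; the coefficients, the correlation of 0.7, the sample size, and the seed are all hypothetical choices made for illustration. It suggests that OLS still recovers the true coefficients when the IVs are clearly correlated with each other.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 50_000

# Two IVs with correlation 0.7 (illustrative value, not from the course).
cov_x = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
x1, x2 = rng.multivariate_normal([0.0, 0.0], cov_x, size=n).T

# Hypothetical true model: y = 1 + 2*x1 - 1*x2 + error, error independent of the IVs.
true_coef = np.array([1.0, 2.0, -1.0])
y = true_coef[0] + true_coef[1] * x1 + true_coef[2] * x2 + rng.standard_normal(n)

# OLS fit with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]
print("sample corr(x1, x2):", round(corr_x1_x2, 3))
print("OLS estimates      :", np.round(coef, 3))
print("true coefficients  :", true_coef)
```

Correlation among the IVs mainly affects the precision of the estimates; in this sketch it does not push them away from the true values.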
As for the X's being uncorrelated with the residuals, this is automatically true in many cases (e.g., when Y and X have a multivariate normal pdf) and is not a concern.
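A rough numerical check of this, assuming a bivariate normal distribution for (X, Y) with made-up means and covariances: the error term Y - E[Y|X] implied by the normal model is essentially uncorrelated with X, and the OLS residuals from a fit with an intercept are uncorrelated with X by construction.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical joint normal distribution for (X, Y); all numbers are illustrative.
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
x, y = rng.multivariate_normal(mean, cov, size=50_000).T

# Regression line implied by the MVN model: E[Y | X] = beta0 + beta1 * X.
beta1 = cov[0, 1] / cov[0, 0]
beta0 = mean[1] - beta1 * mean[0]
error = y - (beta0 + beta1 * x)       # true error term of the model

# OLS fit with an intercept, and its residuals.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

corr_error = np.corrcoef(x, error)[0, 1]
corr_resid = np.corrcoef(x, resid)[0, 1]
print(f"corr(X, error term) = {corr_error:.4f}")   # near zero, up to sampling noise
print(f"corr(X, residuals)  = {corr_resid:.2e}")   # zero up to floating-point error
```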
I would not connect the endogeneity issue with variable selection as covered so far. It's more an issue of *a priori model selection*. Endogeneity is a property of the model you choose to use. If you choose to use a model where there is a system of simultaneous equations, and where the IV in one equation is the DV in another and vice versa, then you are likely to have an endogeneity concern.
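A small simulation sketch of such a system, under assumptions chosen purely for illustration: two hypothetical structural equations, y = beta*x + u and x = gamma*y + v, with independent standard normal errors u and v and made-up coefficients. Because x depends on y, it also depends on u, so x is correlated with the error term of the first equation and the OLS slope of y on x drifts away from beta.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
n = 100_000
beta, gamma = 0.5, 0.4          # hypothetical structural coefficients

u = rng.standard_normal(n)      # error term of the y equation
v = rng.standard_normal(n)      # error term of the x equation

# Structural system:  y = beta*x + u   and   x = gamma*y + v.
# Solving the two equations jointly gives the reduced form below.
denom = 1.0 - beta * gamma
y = (beta * v + u) / denom
x = (gamma * u + v) / denom

# OLS slope from regressing y on x (with an intercept).
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
corr_xu = np.corrcoef(x, u)[0, 1]

print(f"true beta      = {beta}")
print(f"OLS slope      = {ols_slope:.3f}")
print(f"corr(x, error) = {corr_xu:.3f}")
```

With these particular values the OLS slope converges to (beta + gamma) / (1 + gamma^2) = 0.9 / 1.16, roughly 0.78 rather than 0.5. The bias comes from the structure of the model itself, not from which variables a selection procedure happened to keep.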
90 80 50 80