Anonymous:
I. Specific Question (Model and Understanding Fit Measures)
This week we discussed four fit measures, where a fit measure indicates how well the model fits the data and we discussed this in the context of CFA. I have two questions regarding fit measures. First, do we look at how well the loadings load on the various factors in our model first and then look at the fit measures to determine if we have a good model? It would seem to me that if the loadings don’t seem to load high on any factor and load on most factors then we don’t have a very good model and looking at the fit measures wouldn’t be necessary or hopefully would only confirm this? Is this correct? Second, for the various fit measures discussed in class, it would seem to me that GFI and RMSEA are the best to use but that RMR is not that good because of the two problems indicated in class: 1) it is not clear from the covariance matrices what large or small values should be or what they mean and 2) different sizes of correlations should be given different weights. Is this correct? Further, when would someone want to use the Chi Squared because it doesn’t seem to be a good measure because of the dependency with sample size? Do people just throw this in because if they have a large sample size, then the number looks good?
Correction: "First, do we look at how well the loadings load on ..." You mean "variables load on..." The loadings don't load; that is circular language.
As far as wanting the loadings to be high, sure, we want high reliability. Your point is well taken that if the measure is not reliable, then who cares about model fit? It's just like the case of a regression analysis where the model assumptions (normality, independence, linearity, homoscedasticity) all hold but R-squared = .0001.
RMR is ok, and probably the most easily understood. Just look at the formula. It's easy to understand, right? The others are less comprehensible. But sure, the others weight according to size, and seem more popular. Pedagogically, I prefer RMR for its transparency, and for its clear message of what we are trying to do, which is to compare two matrices. But indeed it seems less popular in the literature.
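To make the "compare two matrices" idea concrete, here is a minimal Python sketch of one common definition of RMR: the square root of the average squared difference between the unique elements of the sample covariance matrix S and the model-implied matrix Sigma. The function name rmr and the small 2x2 matrices are made up for illustration, and the formula used in class may differ in detail.

```python
import math

def rmr(S, Sigma):
    """Root mean square residual between two symmetric p x p matrices.

    Assumed definition (illustrative): average the squared residuals over
    the p(p+1)/2 unique elements (lower triangle plus diagonal), then
    take the square root.
    """
    p = len(S)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            total += (S[i][j] - Sigma[i][j]) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))

# Hypothetical sample and model-implied matrices (correlation scale).
S = [[1.0, 0.5], [0.5, 1.0]]
Sigma = [[1.0, 0.4], [0.4, 1.0]]
rmr_corr = rmr(S, Sigma)

# Same data with both variables rescaled by 10 (covariances scale by 100).
S_big = [[100 * v for v in row] for row in S]
Sigma_big = [[100 * v for v in row] for row in Sigma]
rmr_cov = rmr(S_big, Sigma_big)

print(rmr_corr, rmr_cov)
```

The second value is 100 times the first even though the model fits equally well in both cases, which is the scale problem with RMR on covariance matrices raised in the question.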
I think you have Chi-Square backwards. Remember that the statistic gets bigger for large n, so a large sample makes the fit look worse, not better. Review the noncentrality parameter concept discussed in class.
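The sample-size dependence can be seen with a little arithmetic. This is a minimal sketch, assuming the usual approximation that for a model with a fixed amount of population misfit F0 the test statistic is roughly noncentral chi-square with noncentrality (n - 1) * F0, so its expected value is about df + (n - 1) * F0. The values of F0 and df below are invented for illustration.

```python
def expected_chi_square(n, df, f0):
    """Approximate expected chi-square statistic under fixed misfit f0.

    Assumes noncentrality parameter (n - 1) * f0; the mean of a
    noncentral chi-square is df + noncentrality.
    """
    return df + (n - 1) * f0

DF = 10            # hypothetical model degrees of freedom
F0 = 0.05          # hypothetical population misfit (same for every n)
CRITICAL = 18.307  # 0.05 critical value of chi-square with 10 df

results = {}
for n in (50, 200, 1000):
    stat = expected_chi_square(n, DF, F0)
    results[n] = stat
    print(n, round(stat, 2), "reject" if stat > CRITICAL else "retain")
```

With exactly the same misfit, the model is retained at n = 50 but rejected at n = 200 and n = 1000, so a large sample does not make the number look good.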
90 90 100 70
II. General Question (EFA, CFA and SEM – Putting it all together for research) Thus far, we have discussed EFA, CFA and SEM. EFA is exploring the data to determine how many factors we may have in our data. CFA is estimating the correlation between the factors or latent variables based on theory and SEM is estimating the relationship or making an argument about the data based on theory because we can’t really get at cause. But when I think of all three of these together, it would seem to me that there could be an over-use of analysis on the data, or that data-snooping could be an issue here? Is there a threat for this or are these methods just intended for one to gain a better understanding of the data?
Further, when I think of these items, I also think of deductive versus inductive reasoning (from a research methods class I took) where deductive reasoning is reasoning which uses deductive arguments to move from given statements (premises) to conclusions whereas inductive reasoning reasons from a large number of particular examples to a general rule (both definitions obtained from Wikipedia). It would seem to me that data-snooping and inductive reasoning could be very similar here because if we run several analyses on the data and come up with a pattern this might be considered inductive reasoning vs. data snooping.
For example, let’s say I have an interest in accounting research and I have gained access to a special charitable contributions dataset. I have read the literature and understand what the concepts are related to charitable contributions and have developed some thoughts on what I would expect to see in the dataset. So then I run EFA and play with the factors to determine how many factors I feel are appropriate based on looking at the factor loadings. Then I feel that three factors are important so I run CFA on the dataset with these three factors and look at the fit measures and determine my three factors appear appropriate. Then I run SEM on the dataset to determine the relationship of the factors to make an argument for how I think things work. Is this how it usually works? Because I could see where there is room here for a researcher to change their mind. For example, if they find that in EFA there are other factors or if in CFA the factors are too highly correlated or there doesn’t appear to be convergent or discriminant validity, they might try other things until something works. Would this be data-snooping or inductive reasoning? I’m thinking it could be inductive reasoning because maybe the researcher has come up with something new to add to current theory?
Some corrections:
"Thus far, we have discussed EFA, CFA and SEM. EFA is exploring the data to determine how many factors we may have in our data. "
More importantly, EFA also allows you to name the factors by inferring their meaning through the loading patterns, and through the understanding of the model equations. Review the police department case from the HW.
Then you said
"CFA is estimating the correlation between the factors or latent variables based on theory"
Again, very incomplete. First you have to relate the factors to the variables through theory. You do this by relating the factors only to specific variables, and forcing many of the loadings to be zero. That is an absolutely crucial step.
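As a minimal sketch of what forcing loadings to zero looks like, suppose (hypothetically) there are four variables x1..x4, where theory says x1 and x2 measure factor A and x3 and x4 measure factor B. The loading values, the factor correlation of 0.5, and the function name implied_cov are all assumptions for illustration. The code builds the model-implied covariance matrix Lambda * Phi * Lambda' + Theta, which is the matrix the fit measures compare against the sample matrix.

```python
def implied_cov(loadings, phi, uniquenesses):
    """Model-implied covariance: Lambda * Phi * Lambda' + diag(Theta)."""
    p, m = len(loadings), len(phi)
    sigma = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            common = sum(
                loadings[i][a] * phi[a][b] * loadings[j][b]
                for a in range(m)
                for b in range(m)
            )
            sigma[i][j] = common + (uniquenesses[i] if i == j else 0.0)
    return sigma

# Rows are x1..x4, columns are factors A and B.
# The 0.0 entries are fixed by theory, not estimated from the data.
LOADINGS = [
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.6],
    [0.0, 0.9],
]
PHI = [[1.0, 0.5], [0.5, 1.0]]  # factor variances 1, correlation 0.5
# Uniquenesses chosen so each variable has variance 1.
THETA = [1 - sum(l * l for l in row) for row in LOADINGS]

SIGMA = implied_cov(LOADINGS, PHI, THETA)
free_loadings = sum(1 for row in LOADINGS for l in row if l != 0.0)
print(free_loadings, "free loadings out of", len(LOADINGS) * len(PHI))
```

An EFA would estimate all eight loadings. Here only four are free, and that restriction is what makes the analysis confirmatory.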
Then you said
" and SEM is estimating the relationship or making an argument about the data based on theory because we can’t really get at cause."
Again very vague and incomplete. Specifically, SEM models direct relationships between the factors, which might be termed "causal", whereas the CFA model simply states that the factors are correlated.
"But when I think of all three of these together, it would seem to me that there could be an over-use of analysis on the data, or that data-snooping could be an issue here? Is there a threat for this or are these methods just intended for one to gain a better understanding of the data? "
Actually, that is the whole point of CFA and SEM, that is, *not* to data snoop. For example, consider a questionnaire that you create. *You* designed the questionnaire, and *you* designed it specifically to measure A, B, and C. *You* wrote all the questions that are supposed to measure A, the questions that are supposed to measure B, and the questions that are supposed to measure C. *You* had A, B, and C in mind when you wrote the questions. *You* know which variables should be related to which factors, and therefore how to write the model equations. You also started your study with a theory about how A, B, and C are related to each other (a path diagram), so you write the model that corresponds to that diagram. All of this *you* specified a priori, and then you will analyze the data according to those a priori specifications.
In what I just described, there is absolutely no data snooping whatsoever. In reality, things don't always work out: some variables don't fit, some errors need to be correlated, variables are dropped or added, analyses are tried and scrapped. The more of this messing around you do to try to prove your point, the more you are data snooping. Data snooping is a slippery slope, rather than a "yes/no" issue.
In your description of the accounting example, it sounds like there is more data-snooping going on.
As far as deductive versus inductive, any time you use data to shed light on the "truth", it is inductive. If you assume the "truth" and then make predictions, then that is called deductive reasoning. Probability is deductive; statistics is inductive. Both data snooping and confirmatory analyses are, from this view, inductive. But data snooping leaves you with more questions about the validity of your inductive inferences.
Statistical theory starts with the model, a statement that the data are produced by a model with deterministic and probabilistic components and unknown parameters. That part of statistical science is "deductive."
After collecting the data, you attempt to infer the values of the unknown parameters, using your observed data. That part of statistical science is "inductive."
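The two directions can be illustrated with a small simulation. This is a minimal sketch, assuming a simple linear model y = b0 + b1*x + e with normal errors; the parameter values, seed, and sample size are arbitrary choices for illustration. Generating data from known parameters is the deductive direction; recovering the parameters from the data by least squares is the inductive direction.

```python
import random

random.seed(0)  # arbitrary seed so the run is reproducible

# Deductive direction: assume the "truth" and generate data from it.
TRUE_B0, TRUE_B1, N = 2.0, 0.5, 2000
x = [random.gauss(0, 1) for _ in range(N)]
y = [TRUE_B0 + TRUE_B1 * xi + random.gauss(0, 1) for xi in x]

# Inductive direction: use only the data to infer the parameters.
mean_x = sum(x) / N
mean_y = sum(y) / N
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1_hat = sxy / sxx
b0_hat = mean_y - b1_hat * mean_x

print(round(b0_hat, 3), round(b1_hat, 3))
```

The estimates land near, but not exactly on, the values used to generate the data, which is the uncertainty that inductive inference always carries.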
90 100 90 100