L2 Research: Statistical Analysis and Asking the Right Questions

It's no secret that there are some problems with how research and knowledge-creation work in the social sciences (and medical science, too, by the way). Long story short: social science researchers have a very hard time replicating results when the same study is run again (a team of psychologists recently tried to replicate 100 published studies, read about it here).

One source of this problem is a research framework called null-hypothesis significance testing (NHST). For those of you who aren't knee-deep in doing or reading primary research, this is where the now-common phrase "significant results" comes from. Basically, NHST begins with the idea that there is no effect of whatever it is you are researching: a type of classroom instruction, a pill, whatever. If you collect your data, run your stats, and get something called a p-value that is small enough, you get to reject the idea that there was no effect of the instruction or the pill. This lets you say your study (or your instruction, or your pill) has significant results, because it is different from nothing. p-values are malleable little things, and very sensitive to sample size; large samples with very small effects will come out "significant," while smaller samples with medium to large effect sizes might come out "non-significant" (this is the concept of statistical power). The thing is, the p-value says nothing about how large the effect is; the word "significant" only means that it passed some (arbitrary) threshold. Now, people do tend to look at how big the difference is (an effect size), but the results of the significance test tend to overshadow everything else, and they lead to biases in publication (no significant effect = much lower chance of getting published).
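To put rough numbers on that, here's a quick sketch (the figures are made up, and Python with scipy is just my choice of tool, not anything from the studies I'm describing). Scores are assumed to be standardized, so the mean difference is also the effect size (Cohen's d):

```python
# Hypothetical illustration: "significant" tracks sample size as much as effect size.
# Scores assumed standardized (SD = 1), so the mean difference doubles as Cohen's d.
from scipy import stats

# tiny effect (d = 0.1), huge samples
big_n = stats.ttest_ind_from_stats(mean1=0.1, std1=1.0, nobs1=1000,
                                   mean2=0.0, std2=1.0, nobs2=1000)
# medium effect (d = 0.5), small samples
small_n = stats.ttest_ind_from_stats(mean1=0.5, std1=1.0, nobs1=15,
                                     mean2=0.0, std2=1.0, nobs2=15)

print(f"tiny effect, n = 1000 per group:  p = {big_n.pvalue:.3f}")    # ~.025, "significant"
print(f"medium effect, n = 15 per group:  p = {small_n.pvalue:.3f}")  # ~.18, "non-significant"
```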

Andrew Gelman, a statistician working in the social sciences, recently wrote a piece about how the idea of things having absolutely no effect is kind of absurd. Rather, he advocates moving from attempting to "discover" an effect to "measuring" the size of the effect, or creating models that best explain what is going on. This is right in line with a recent statement by the American Statistical Association calling for a major move away from p-values as the be-all and end-all in research (on a side note: if you do quantitative research, or know people who do, read that statement). Thinking about this for L2 research, it really strikes a chord with me; after all, even very bad L2 instruction results in some kind of learning or development. The more important question is: how much?
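Here's a minimal sketch of what that shift toward "measuring" might look like in practice, using simulated gain scores rather than anything from a real study:

```python
# A sketch of "measurement" over "discovery": report the estimated effect and its
# uncertainty, instead of only whether p crossed .05. Data are simulated, not real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(loc=0.3, scale=1.0, size=40)   # hypothetical gain scores
control = rng.normal(loc=0.0, scale=1.0, size=40)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / 40 + control.var(ddof=1) / 40)
t_crit = stats.t.ppf(0.975, df=78)                     # 95% interval, pooled df
print(f"estimated effect: {diff:.2f}  (95% CI: {diff - t_crit * se:.2f} to {diff + t_crit * se:.2f})")
```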

In L2 research, I think it is fair to say that we're still primarily in an NHST mode of thinking (though we are seeing changes and improvements). We're asking questions like "Does working memory have an effect on reading comprehension?" and "Does explicit instruction have an effect on the processing of case marking?" instead of asking how much of an effect, or to what degree. And while L2 research is not unique among social science disciplines in its preference for NHST and p-value-über-alles evaluation of research findings, I do wonder if certain theoretical concerns have contributed to the popularization and entrenchment of NHST in the field.

Specifically, in second language acquisition, a distinction has long been made between acquisition (integration of forms into the implicit linguistic system; you have 'real' language in your head) and learning (gaining declarative knowledge of linguistic forms; you can talk about how the past tense works). This distinction runs parallel to the ideas of implicit knowledge and explicit knowledge, and extends further to implicit instruction (providing lots of input, perhaps featuring a particular form) and explicit instruction (offering metalinguistic explanation of a form). There are then theoretical positions which hold that explicit knowledge has NO effect on acquisition, and in turn that explicit instruction has NO effect on acquisition (in both cases, it can still have an effect on learning). This theoretical position is arguably well served by NHST, but I think there are still shortcomings. First, the idea that explicit knowledge or instruction has absolutely zero effect on acquisition just seems extremely unlikely, even though I'd be inclined to agree that it has relatively little effect. Second, it tends to result in glossing over effect sizes. An implicit instructional treatment might have a statistically significant effect but a small effect size, or an explicit instructional treatment might not reach statistical significance but have a non-trivial effect size, yet the focus is usually on whether the two are "significantly different."

Why does this matter? Well, unfortunately, the bits of a study about statistical significance are what help it get published in the first place, and they are primarily what gets reported when a published study is picked up by teachers, the media, and other interested parties. I also think this way of using stats and asking questions keeps us from considering more informative and interesting results that could come out of our studies. Here's hoping that L2 research keeps moving away from NHST and takes more heed of folks like Gelman and the ASA.

2 comments:

  1. Wouldn't you think that experimental design plays a large role in the lack of replicability in research, rather than just the role of the p-value or NHST? Do you think the effect sizes of previous second language acquisition research are replicable? Or more replicable than the results of statistical tests of significance?

    Replies
    1. Sure, it no doubt also plays a large role. Sampling and sample size jump out as design features that play a major role in replicability. But I also think about how NHST-type thinking affects study designs. With some of the implicit/explicit instruction and knowledge work, for example, criteria are set at passing GJTs (grammaticality judgment tests) with better than 50% accuracy (i.e., chance)... so if your implicit group gets 51.07% of the GJT items right and your explicit group gets 48.6% right, and your sample is just large enough, you get to say that implicit instruction is significantly more effective than explicit instruction (there's a quick sketch of that arithmetic below). Regardless of how small or large the effects are, people are going into it looking for AN effect (defined as anything greater than chance, "discovery" in Gelman's words) rather than its size ("measurement"). Replicability gets hampered by NHST too: even if a second study finds a similar pattern, say a slightly beneficial effect for implicit instruction leading to marginally better-than-chance GJT performance, if it's not statistically significant you all of a sudden have a replicability problem, even though its findings are not *really* contradictory to the first study's.

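For the curious, here's a rough sketch of the arithmetic behind that 51.07%-beats-chance scenario, with an assumed (made-up) number of pooled judgments:

```python
# Rough sketch of the GJT example above: 51.07% accuracy tested against chance (50%),
# assuming 10,000 pooled judgments. The sample size is invented; the point is that a
# big enough n makes a one-point margin over chance come out "significant".
from scipy import stats

result = stats.binomtest(k=5107, n=10000, p=0.5, alternative="greater")
print(result.pvalue)   # comfortably under .05, so 51.07% "beats chance"
```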