It's no secret that there are some problems with how research and knowledge-creation work in the social sciences (and medical science, too, by the way). Long story short: social science researchers have a very hard time replicating results when the same study is run again (a team of psychologists recently replicated 100 published studies; read about it here).
One source of this problem is a research framework called null-hypothesis significance testing (NHST). For those of you who aren't knee-deep in doing or reading primary research, this is where the now-common phrase "significant results" comes from. Basically, NHST begins with the idea that there is no effect of whatever it is you are researching: a type of classroom instruction, a pill, whatever. If you collect your data, run your stats, and then get something called a p-value that is small enough, you get to reject the idea that there was no effect of the instruction or the pill. This lets you say your study (or your instruction, or your pill) has significant results, because it is different from nothing.
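(If you like seeing things in code, here's a minimal sketch of that workflow in Python, using scipy's two-sample t-test on made-up numbers; the group names, scores, and threshold are all just illustrative.)

```python
# A minimal sketch of the NHST workflow for a made-up two-group study
# (the group names and numbers here are purely illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical test scores: a "new instruction" group and a control group
treatment = rng.normal(loc=72, scale=10, size=30)
control = rng.normal(loc=70, scale=10, size=30)

# The null hypothesis: both groups come from populations with equal means
t_stat, p_value = stats.ttest_ind(treatment, control)

alpha = 0.05  # the conventional (and arbitrary) threshold
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null -- 'significant' results")
else:
    print(f"p = {p_value:.3f}: fail to reject the null -- 'non-significant'")
```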
p-values are malleable little things, and very sensitive to sample size; large samples with very small effects will be "significant," while smaller samples with medium to large effect sizes might end up being "non-significant" (this is a concept called power in statistics). The thing is, this p-value says nothing about how large the effect is; the word "significant" only means that it passed some (arbitrary) threshold. Now, people do tend to look at how big the difference is (an effect size), but the results of the significance test tend to overshadow other results and lead to biases in publication (no significant effect = much lower chance of getting published).
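To see how slippery this is, here's a quick simulation (my own toy example, with invented effect sizes and sample sizes): a tiny effect in a huge sample versus a medium effect in a small sample, reporting both the p-value and Cohen's d for each.

```python
# A toy illustration of the point above (invented numbers only):
# a tiny effect in a huge sample tends to come out "significant,"
# while a medium effect in a small sample often does not.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference, pooled SD (equal group sizes assumed)."""
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(a) - np.mean(b)) / pooled_sd

rng = np.random.default_rng(0)

# Tiny effect (d around 0.1), n = 2000 per group
big_a = rng.normal(0.1, 1, 2000)
big_b = rng.normal(0.0, 1, 2000)

# Medium effect (d around 0.5), n = 20 per group
small_a = rng.normal(0.5, 1, 20)
small_b = rng.normal(0.0, 1, 20)

for label, a, b in [("tiny effect, huge sample", big_a, big_b),
                    ("medium effect, small sample", small_a, small_b)]:
    t, p = stats.ttest_ind(a, b)
    print(f"{label}: p = {p:.3f}, d = {cohens_d(a, b):.2f}")
```

The point isn't the particular numbers; it's that the p-value and the effect size are answering different questions, and only one of them tells you how big the difference actually is.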
Andrew Gelman, a statistician working in the social sciences, recently wrote a piece about how the idea of things having absolutely no effect is kind of absurd. Rather, he advocates moving from attempting to "discover" an effect to "measuring" the size of the effect, or creating models that best explain what is going on. This is right in line with a recent statement by the American Statistical Association calling for a major move away from p-values as the be-all-end-all in research (on a side note: if you do quantitative research, or know people who do, read that statement). Thinking about this for L2 research, it really strikes a chord with me; after all, even very bad L2 instruction results in some kind of learning or development. The more important question is how much?
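To make that "measuring" idea concrete, here's a rough sketch of my own (not Gelman's or the ASA's method): instead of a reject/don't-reject verdict, report the estimated difference between two hypothetical instruction groups along with a 95% confidence interval.

```python
# A rough sketch of an "estimate the effect" report instead of a verdict
# (purely illustrative numbers; not anyone's real data).
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical gain scores after two kinds of instruction
group_a = rng.normal(6.0, 4.0, 40)   # instruction A
group_b = rng.normal(4.5, 4.0, 40)   # instruction B

diff = np.mean(group_a) - np.mean(group_b)
se = np.sqrt(np.var(group_a, ddof=1) / len(group_a) +
             np.var(group_b, ddof=1) / len(group_b))

# A rough 95% interval for the mean difference (normal approximation)
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"Estimated difference: {diff:.1f} points, 95% CI [{low:.1f}, {high:.1f}]")
```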
In L2 research, I think it is fair to say that we're still primarily in an NHST mode of thinking (though we are seeing changes and improvements). We're asking questions like Does working memory have an effect on reading comprehension? and Does explicit instruction have an effect on the processing of case marking? instead of asking how much of an effect or to what degree. And while L2 research is not unique among social science disciplines in its preference for NHST and p-value uber alles evaluation of research findings, I do wonder if certain theoretical concerns have contributed to the popularization and entrenchment of NHST in the field.
Specifically, in second language acquisition, a distinction has long been made between acquisition (integration of forms into the implicit linguistic system: you have 'real' language in your head) and learning (gaining declarative knowledge of linguistic forms: you can talk about how past tense works). This distinction runs parallel with the ideas of implicit knowledge and explicit knowledge, which extend further to implicit instruction (providing lots of input, perhaps featuring a particular form) and
explicit instruction (offering metalinguistic explanation of a form). There are then theoretical positions which hold that explicit knowledge has NO effect on acquisition, and in turn that explicit instruction has NO effect on acquisition (in both cases, it is able to have an effect on learning, however). This theoretical position is arguably well-served by NHST, but I think there are still shortcomings. First, the idea that explicit knowledge or instruction has absolutely zero effect on acquisition just seems extremely unlikely, even though I'd be inclined to agree that it has relatively little effect. Second, it tends to result in glossing over effect sizes. An implicit instructional treatment might have a statistically significant effect but a small effect size, or an explicit instructional treatment might not reach statistical significance but have a non-trivial effect size, yet the focus is often on whether the two are "significantly different."
Why does this matter? Well, unfortunately, the bits of a study about statistical significance are what help it get published in the first place, and they are primarily what gets passed along when a published study is picked up by teachers, media, and other interested parties. I also think this way of using stats and asking questions prevents us from considering more informative and interesting results that could come out of studies. Here's hoping that L2 research keeps moving away from NHST and takes more heed of folks like Gelman and the ASA.