Coding Linguistic Data: Down the Rabbit Hole

Language is complicated. It's an intricate system of concrete signs that link to mental abstractions. Looking directly at those mental abstractions is... difficult, to say the least (though work in neurolinguistics is getting us a bit closer). So we often look at the concrete signs of language: spoken or written words, which we can analyze by pinning down soundwaves or letters on a page and making inferences about what is going on mentally.

One tool for analyzing linguistic data is coding. Not writing computer programs, but tagging segments or sections of text (whether spoken or written, we use the word "text"). In L2 research, one thing we're particularly interested in is errors; in some sense, error analysis kick-started the whole field of second language acquisition (SLA). So take the following sentence, for example (made up, but typical of a learner of English):

  • He go to store today.
You don't need to be a linguist or even an armchair grammar guru to spot a couple of errors there: "go" should be "goes", and "store" should be preceded by an article (most likely "the" in this context, though "a" could work). So at a very basic level of coding, we could say this sentence has two errors. That's informative, but it might not be fine-grained enough to answer many questions about SLA, so we end up with more elaborate coding schemes. The first error becomes an agreement error (the verb does not agree with the subject) and the second becomes an article error (a missing or misused article).
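To make the grain-size idea concrete, here is a minimal sketch (in Python, and purely hypothetical: the field names and tag labels are my own, not a standard scheme) of how the same sentence could be coded at a coarse and a finer grain:

```python
from collections import Counter

sentence = "He go to store today."

# Coarse grain: just tally how many errors the sentence contains.
coarse_coding = {"text": sentence, "error_count": 2}

# Finer grain: tag each error with a category, the offending token,
# and a plausible target form.
fine_coding = {
    "text": sentence,
    "errors": [
        {"token": "go", "category": "agreement", "target": "goes"},
        {"token": "store", "category": "article", "target": "the store"},
    ],
}

# Once errors carry category tags, counting by category is trivial.
counts = Counter(error["category"] for error in fine_coding["errors"])
print(counts)  # Counter({'agreement': 1, 'article': 1})
```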

Coding can get harder when you're dealing with less isolated chunks of text, requiring more inferences on the researcher's part.  Let's add a little context:

  • Teacher: What did he do today?
  • Student:  He go to store today.
Now that verb error is harder to classify. On the surface, it's an agreement error, because English does not permit "He go". But given the context, it would be more appropriate to say "He went" (the question asks about a completed action). So now it could also be considered a tense error.
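One way to handle that ambiguity (a hypothetical sketch, not an established convention) is to let a coded error carry more than one candidate category, along with the context that motivates each reading:

```python
# Hypothetical record structure: a single error token with two
# competing codes, each tied to its rationale. A second coder (or a
# later analysis) can then resolve, or deliberately keep, the ambiguity.
error = {
    "token": "go",
    "utterance": "He go to store today.",
    "context": "Teacher: What did he do today?",
    "candidate_codes": [
        {"category": "agreement", "target": "goes",
         "rationale": "English does not permit 'He go'"},
        {"category": "tense", "target": "went",
         "rationale": "the question asks about a completed action"},
    ],
}
```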

Coding, and reliably pinning down increasingly fine-grained subcategories, is just as hard for other linguistic features. I'm working on a project that deals with L2 pronunciation, and I'm knee-deep in coding phonological errors in transcripts of learner speech samples. Let's consider a pronunciation-related error (another hypothetical L2 English example):

  • I like bet dug
First, we assume that the speaker meant "pet dog" by "bet dug" (this inference is easier to support when the speakers were asked to produce very controlled speech samples, like reading a sentence aloud or filling in a sentence template based on a picture). We see two apparent errors: "b" for "p", and "u" (the "uh" sound) for "o" (the "aw" sound). Do we stop there, or do we jump down the rabbit hole? The first error involves a consonant, the second a vowel. We could also categorize both errors as substitutions, a common phonological error where one sound is swapped for another (usually a sound that's easier to produce or found in the L1). But wait, there's more: do we care about the particulars of each substitution, noting the specific sound swap and counting them up? We could even look at the context: "b" is word- and syllable-initial, while "u" occurs inside a word/syllable. As you can see, this gets to be potentially labyrinthine.
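To show how quickly the record grows, here is a hypothetical sketch (my own field names, not a standard scheme) of the same consonant substitution coded at successively finer grain sizes:

```python
# Each level keeps the earlier fields and adds more detail.

# Level 1: it's a substitution.
error_v1 = {"type": "substitution"}

# Level 2: note whether a consonant or a vowel was involved.
error_v2 = {"type": "substitution", "segment_class": "consonant"}

# Level 3: record the specific swap, target vs. produced sound.
error_v3 = {
    "type": "substitution", "segment_class": "consonant",
    "target": "p", "produced": "b",
}

# Level 4: plus where the swap happens in the word/syllable.
error_v4 = {
    "type": "substitution", "segment_class": "consonant",
    "target": "p", "produced": "b",
    "position": "word- and syllable-initial",
}
```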

For me, it's tempting to keep falling down that rabbit hole while coding; my logic is something like "might as well knock it all out while I'm here." But ultimately this slows down your progress, and you might not end up needing such a fine grain size to answer your research questions. You also might not be able to code reliably when your scheme is overly elaborate. And you can always go back to your data later for a different analysis. I'm a relatively novice researcher, so I haven't had much personal experience doing that with my own data, but a recent project I worked on did involve going back to a colleague's dataset and doing more detailed phonological analyses of learner-learner interactions.

Is there a moral to this story? I don't know, I just needed a break from coding! But I'll try to leave a couple bits of advice, mostly for myself:
  1. Keep your original goals in sight. Research can (and does) evolve, but your original research questions can provide guidance.
  2. Get comfortable with the idea of going back to your data for subsequent analysis. It might be a post-hoc analysis in the same project/article, or, if you get a really novel inspiration while doing primary coding, you can return to the data later for a fresh analysis and write-up.
